public inbox for gcc-patches@gcc.gnu.org
* [PATCH v5 0/2] x86: Convert CONST_WIDE_INT/CONST_VECTOR to broadcast
@ 2021-06-26 20:02 H.J. Lu
  2021-06-26 20:02 ` [PATCH v5 1/2] " H.J. Lu
  2021-06-26 20:02 ` [PATCH v5 2/2] x86: Add vec_duplicate<mode> expander H.J. Lu
  0 siblings, 2 replies; 12+ messages in thread
From: H.J. Lu @ 2021-06-26 20:02 UTC
  To: gcc-patches
  Cc: Uros Bizjak, Jakub Jelinek, Hongtao Liu, Richard Sandiford,
	Richard Biener

Changes in the v5 patch:

1. Allow AVX with SI/DI broadcast.
2. Add a comment for broadcasting to V64QI and V32HI with AVX512F, but
without AVX512BW.

---
1. Update move expanders to convert CONST_WIDE_INT and CONST_VECTOR
operands to a vector broadcast from an integer with AVX2.
2. Add ix86_gen_scratch_sse_rtx to return a scratch SSE register which
won't increase the stack alignment requirement and which blocks
transformation by the combine pass.

A small benchmark:

https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/memset/broadcast

shows that broadcast is a little faster than loading from memory on an
Intel Core i7-8559U (121213 vs. 147215, lower is better):

$ make
gcc -g -I. -O2   -c -o test.o test.c
gcc -g   -c -o memory.o memory.S
gcc -g   -c -o broadcast.o broadcast.S
gcc -g   -c -o vec_dup_sse2.o vec_dup_sse2.S
gcc -o test test.o memory.o broadcast.o vec_dup_sse2.o
./test
memory      : 147215
broadcast   : 121213
vec_dup_sse2: 171366
$

broadcast is also smaller:

$ size memory.o broadcast.o
   text	   data	    bss	    dec	    hex	filename
    132	      0	      0	    132	     84	memory.o
    122	      0	      0	    122	     7a	broadcast.o
$

3. Update PR 87767 tests to expect integer broadcast instead of broadcast
from memory.
4. Update avx512f_cond_move.c to expect integer broadcast.

A small benchmark:

https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/vpaddd/broadcast

shows that integer broadcast is faster than embedded memory broadcast:

$ make
gcc -g -I. -O2 -march=skylake-avx512   -c -o test.o test.c
gcc -g   -c -o memory.o memory.S
gcc -g   -c -o broadcast.o broadcast.S
gcc -o test test.o memory.o broadcast.o
./test
memory      : 425538
broadcast   : 375260
$

5. Allow vec_duplicate to fail so that the backend can permit
broadcasting an integer constant to a vector only when a broadcast
instruction is available.  The memset expander can use this to avoid
vec_duplicate when loading from the constant pool is more efficient.
6. Add a vec_duplicate<mode> expander and enable vec_duplicate from a
non-standard SSE constant integer only if vector broadcast is available.
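
To illustrate the idea behind the conversion in item 1, here is a minimal
standalone sketch in plain C of how a 64-bit constant can be checked for
being a repetition of a narrower element.  It is not the code from the
patch: the helper names repeats_with_width and narrowest_broadcast_width
are hypothetical, the sketch works on uint64_t rather than on RTL, and it
skips the sign extension and the TARGET_64BIT check that the patch applies.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return true if the 64-bit value V consists of
   one WIDTH-bit chunk repeated, and store that chunk in *CHUNK.
   WIDTH must be 8, 16, 32 or 64.  */
static bool
repeats_with_width (uint64_t v, unsigned int width, uint64_t *chunk)
{
  /* Avoid an undefined shift by 64 when WIDTH covers the whole value.  */
  uint64_t mask = (width == 64
		   ? ~UINT64_C (0)
		   : (UINT64_C (1) << width) - 1);
  uint64_t first = v & mask;

  /* Compare every later chunk against the lowest one.  */
  for (unsigned int i = width; i < 64; i += width)
    if (((v >> i) & mask) != first)
      return false;

  *chunk = first;
  return true;
}

/* Hypothetical helper: return the narrowest element width in bits
   from which V can be broadcast, trying 8, 16, 32 and then 64.  */
static unsigned int
narrowest_broadcast_width (uint64_t v, uint64_t *chunk)
{
  static const unsigned int widths[] = { 8, 16, 32, 64 };

  for (size_t i = 0; i < sizeof (widths) / sizeof (widths[0]); i++)
    if (repeats_with_width (v, widths[i], chunk))
      return widths[i];

  /* Not reached: a width of 64 always matches.  */
  return 0;
}
```

Trying the narrowest width first means that a constant such as
0x0303030303030303 is treated as a byte broadcast of 3 rather than as a
64-bit broadcast, so the integer operand fits in the smallest element.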

H.J. Lu (2):
  x86: Convert CONST_WIDE_INT/CONST_VECTOR to broadcast
  x86: Add vec_duplicate<mode> expander

 gcc/config/i386/i386-expand.c                 | 214 +++++++++++++++++-
 gcc/config/i386/i386-protos.h                 |   3 +
 gcc/config/i386/i386.c                        |  13 ++
 gcc/config/i386/sse.md                        |  28 +++
 gcc/doc/md.texi                               |   2 -
 .../i386/avx512f-broadcast-pr87767-1.c        |   7 +-
 .../i386/avx512f-broadcast-pr87767-5.c        |   5 +-
 .../gcc.target/i386/avx512f_cond_move.c       |   4 +-
 .../i386/avx512vl-broadcast-pr87767-1.c       |  12 +-
 .../i386/avx512vl-broadcast-pr87767-5.c       |   9 +-
 gcc/testsuite/gcc.target/i386/pr100865-1.c    |  13 ++
 gcc/testsuite/gcc.target/i386/pr100865-10a.c  |  33 +++
 gcc/testsuite/gcc.target/i386/pr100865-10b.c  |   7 +
 gcc/testsuite/gcc.target/i386/pr100865-11a.c  |  23 ++
 gcc/testsuite/gcc.target/i386/pr100865-11b.c  |   8 +
 gcc/testsuite/gcc.target/i386/pr100865-12a.c  |  20 ++
 gcc/testsuite/gcc.target/i386/pr100865-12b.c  |   8 +
 gcc/testsuite/gcc.target/i386/pr100865-2.c    |  14 ++
 gcc/testsuite/gcc.target/i386/pr100865-3.c    |  15 ++
 gcc/testsuite/gcc.target/i386/pr100865-4a.c   |  16 ++
 gcc/testsuite/gcc.target/i386/pr100865-4b.c   |   9 +
 gcc/testsuite/gcc.target/i386/pr100865-5a.c   |  16 ++
 gcc/testsuite/gcc.target/i386/pr100865-5b.c   |   9 +
 gcc/testsuite/gcc.target/i386/pr100865-6a.c   |  16 ++
 gcc/testsuite/gcc.target/i386/pr100865-6b.c   |   9 +
 gcc/testsuite/gcc.target/i386/pr100865-6c.c   |  16 ++
 gcc/testsuite/gcc.target/i386/pr100865-7a.c   |  17 ++
 gcc/testsuite/gcc.target/i386/pr100865-7b.c   |   9 +
 gcc/testsuite/gcc.target/i386/pr100865-7c.c   |  17 ++
 gcc/testsuite/gcc.target/i386/pr100865-8a.c   |  24 ++
 gcc/testsuite/gcc.target/i386/pr100865-8b.c   |   7 +
 gcc/testsuite/gcc.target/i386/pr100865-9a.c   |  25 ++
 gcc/testsuite/gcc.target/i386/pr100865-9b.c   |   7 +
 33 files changed, 609 insertions(+), 26 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-10a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-10b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-11a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-11b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-12a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-12b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-4a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-4b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-5a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-5b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-6a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-6b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-6c.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-7a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-7b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-7c.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-8a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-8b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-9a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-9b.c

-- 
2.31.1



* [PATCH v5 1/2] x86: Convert CONST_WIDE_INT/CONST_VECTOR to broadcast
  2021-06-26 20:02 [PATCH v5 0/2] x86: Convert CONST_WIDE_INT/CONST_VECTOR to broadcast H.J. Lu
@ 2021-06-26 20:02 ` H.J. Lu
  2021-06-28  1:48   ` Hongtao Liu
  2021-06-26 20:02 ` [PATCH v5 2/2] x86: Add vec_duplicate<mode> expander H.J. Lu
  1 sibling, 1 reply; 12+ messages in thread
From: H.J. Lu @ 2021-06-26 20:02 UTC
  To: gcc-patches
  Cc: Uros Bizjak, Jakub Jelinek, Hongtao Liu, Richard Sandiford,
	Richard Biener

1. Update move expanders to convert CONST_WIDE_INT and CONST_VECTOR
operands to a vector broadcast from an integer with AVX2.
2. Add ix86_gen_scratch_sse_rtx to return a scratch SSE register which
won't increase the stack alignment requirement and which blocks
transformation by the combine pass.

A small benchmark:

https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/memset/broadcast

shows that broadcast is a little faster than loading from memory on an
Intel Core i7-8559U (121213 vs. 147215, lower is better):

$ make
gcc -g -I. -O2   -c -o test.o test.c
gcc -g   -c -o memory.o memory.S
gcc -g   -c -o broadcast.o broadcast.S
gcc -g   -c -o vec_dup_sse2.o vec_dup_sse2.S
gcc -o test test.o memory.o broadcast.o vec_dup_sse2.o
./test
memory      : 147215
broadcast   : 121213
vec_dup_sse2: 171366
$

broadcast is also smaller:

$ size memory.o broadcast.o
   text	   data	    bss	    dec	    hex	filename
    132	      0	      0	    132	     84	memory.o
    122	      0	      0	    122	     7a	broadcast.o
$

3. Update PR 87767 tests to expect integer broadcast instead of broadcast
from memory.
4. Update avx512f_cond_move.c to expect integer broadcast.

A small benchmark:

https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/vpaddd/broadcast

shows that integer broadcast is faster than embedded memory broadcast:

$ make
gcc -g -I. -O2 -march=skylake-avx512   -c -o test.o test.c
gcc -g   -c -o memory.o memory.S
gcc -g   -c -o broadcast.o broadcast.S
gcc -o test test.o memory.o broadcast.o
./test
memory      : 425538
broadcast   : 375260
$
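
As a rough illustration of the kind of source this change is aimed at,
the sketch below shows two functions whose constant operands repeat a
single integer: a vector addition with a uniform constant and a 16-byte
memset.  The function names add_uniform and fill16 are hypothetical and
do not come from the testsuite.  Whether the compiler emits a broadcast
from an integer register, a load from the constant pool, or something
else depends on the -march/-m options and optimization level, so no
particular instruction sequence is claimed here.

```c
#include <string.h>

/* GCC vector extension type; 16 bytes is the smallest mode size the
   patch considers for conversion.  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Hypothetical example: all four elements of the constant are equal,
   so the constant vector is a candidate for broadcast from one 32-bit
   integer.  X must be 16-byte aligned.  */
void
add_uniform (v4si *x)
{
  *x += (v4si) { 42, 42, 42, 42 };
}

/* Hypothetical example: the fill pattern is the byte 3 repeated over
   16 bytes, which is a candidate for a byte broadcast.  */
void
fill16 (unsigned char *dst)
{
  memset (dst, 3, 16);
}
```

Comparing the assembly from, for example, -O2 -march=x86-64 and
-O2 -march=skylake is one way to see which form a given compiler picks.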

gcc/

	PR target/100865
	* config/i386/i386-expand.c (ix86_expand_vector_init_duplicate):
	New prototype.
	(ix86_broadcast): New function.
	(ix86_convert_const_wide_int_to_broadcast): Likewise.
	(ix86_expand_move): Convert CONST_WIDE_INT to broadcast if mode
	size is 16 bytes or bigger.
	(ix86_broadcast_from_integer_constant): New function.
	(ix86_expand_vector_move): Convert CONST_WIDE_INT and CONST_VECTOR
	to broadcast if mode size is 16 bytes or bigger.
	* config/i386/i386-protos.h (ix86_gen_scratch_sse_rtx): New
	prototype.
	* config/i386/i386.c (ix86_gen_scratch_sse_rtx): New function.

gcc/testsuite/

	PR target/100865
	* gcc.target/i386/avx512f-broadcast-pr87767-1.c: Expect integer
	broadcast.
	* gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
	* gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
	* gcc.target/i386/avx512f_cond_move.c: Also pass
	-mprefer-vector-width=512 and expect integer broadcast.
	* gcc.target/i386/pr100865-1.c: New test.
	* gcc.target/i386/pr100865-2.c: Likewise.
	* gcc.target/i386/pr100865-3.c: Likewise.
	* gcc.target/i386/pr100865-4a.c: Likewise.
	* gcc.target/i386/pr100865-4b.c: Likewise.
	* gcc.target/i386/pr100865-5a.c: Likewise.
	* gcc.target/i386/pr100865-5b.c: Likewise.
	* gcc.target/i386/pr100865-6a.c: Likewise.
	* gcc.target/i386/pr100865-6b.c: Likewise.
	* gcc.target/i386/pr100865-6c.c: Likewise.
	* gcc.target/i386/pr100865-7a.c: Likewise.
	* gcc.target/i386/pr100865-7b.c: Likewise.
	* gcc.target/i386/pr100865-7c.c: Likewise.
	* gcc.target/i386/pr100865-8a.c: Likewise.
	* gcc.target/i386/pr100865-8b.c: Likewise.
	* gcc.target/i386/pr100865-9a.c: Likewise.
	* gcc.target/i386/pr100865-9b.c: Likewise.
	* gcc.target/i386/pr100865-10a.c: Likewise.
	* gcc.target/i386/pr100865-10b.c: Likewise.
	* gcc.target/i386/pr100865-11a.c: Likewise.
	* gcc.target/i386/pr100865-11b.c: Likewise.
	* gcc.target/i386/pr100865-12a.c: Likewise.
	* gcc.target/i386/pr100865-12b.c: Likewise.
---
 gcc/config/i386/i386-expand.c                 | 190 ++++++++++++++++--
 gcc/config/i386/i386-protos.h                 |   2 +
 gcc/config/i386/i386.c                        |  13 ++
 .../i386/avx512f-broadcast-pr87767-1.c        |   7 +-
 .../i386/avx512f-broadcast-pr87767-5.c        |   5 +-
 .../gcc.target/i386/avx512f_cond_move.c       |   4 +-
 .../i386/avx512vl-broadcast-pr87767-1.c       |  12 +-
 .../i386/avx512vl-broadcast-pr87767-5.c       |   9 +-
 gcc/testsuite/gcc.target/i386/pr100865-1.c    |  13 ++
 gcc/testsuite/gcc.target/i386/pr100865-10a.c  |  33 +++
 gcc/testsuite/gcc.target/i386/pr100865-10b.c  |   7 +
 gcc/testsuite/gcc.target/i386/pr100865-11a.c  |  23 +++
 gcc/testsuite/gcc.target/i386/pr100865-11b.c  |   8 +
 gcc/testsuite/gcc.target/i386/pr100865-12a.c  |  20 ++
 gcc/testsuite/gcc.target/i386/pr100865-12b.c  |   8 +
 gcc/testsuite/gcc.target/i386/pr100865-2.c    |  14 ++
 gcc/testsuite/gcc.target/i386/pr100865-3.c    |  15 ++
 gcc/testsuite/gcc.target/i386/pr100865-4a.c   |  16 ++
 gcc/testsuite/gcc.target/i386/pr100865-4b.c   |   9 +
 gcc/testsuite/gcc.target/i386/pr100865-5a.c   |  16 ++
 gcc/testsuite/gcc.target/i386/pr100865-5b.c   |   9 +
 gcc/testsuite/gcc.target/i386/pr100865-6a.c   |  16 ++
 gcc/testsuite/gcc.target/i386/pr100865-6b.c   |   9 +
 gcc/testsuite/gcc.target/i386/pr100865-6c.c   |  16 ++
 gcc/testsuite/gcc.target/i386/pr100865-7a.c   |  17 ++
 gcc/testsuite/gcc.target/i386/pr100865-7b.c   |   9 +
 gcc/testsuite/gcc.target/i386/pr100865-7c.c   |  17 ++
 gcc/testsuite/gcc.target/i386/pr100865-8a.c   |  24 +++
 gcc/testsuite/gcc.target/i386/pr100865-8b.c   |   7 +
 gcc/testsuite/gcc.target/i386/pr100865-9a.c   |  25 +++
 gcc/testsuite/gcc.target/i386/pr100865-9b.c   |   7 +
 31 files changed, 556 insertions(+), 24 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-10a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-10b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-11a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-11b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-12a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-12b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-4a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-4b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-5a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-5b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-6a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-6b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-6c.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-7a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-7b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-7c.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-8a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-8b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-9a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-9b.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index e9763eb5b3e..e9e89c82764 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -93,6 +93,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "i386-builtins.h"
 #include "i386-expand.h"
 
+static bool ix86_expand_vector_init_duplicate (bool, machine_mode, rtx,
+					       rtx);
+
 /* Split one or more double-mode RTL references into pairs of half-mode
    references.  The RTL can be REG, offsettable MEM, integer constant, or
    CONST_DOUBLE.  "operands" is a pointer to an array of double-mode RTLs to
@@ -190,6 +193,83 @@ ix86_expand_clear (rtx dest)
   emit_insn (tmp);
 }
 
+/* Return true if V can be broadcasted from an integer of WIDTH bits
+   which is returned in VAL_BROADCAST.  Otherwise, return false.  */
+
+static bool
+ix86_broadcast (HOST_WIDE_INT v, unsigned int width,
+		HOST_WIDE_INT &val_broadcast)
+{
+  wide_int val = wi::uhwi (v, HOST_BITS_PER_WIDE_INT);
+  val_broadcast = wi::extract_uhwi (val, 0, width);
+  for (unsigned int i = width; i < HOST_BITS_PER_WIDE_INT; i += width)
+    {
+      HOST_WIDE_INT each = wi::extract_uhwi (val, i, width);
+      if (val_broadcast != each)
+	return false;
+    }
+  val_broadcast = sext_hwi (val_broadcast, width);
+  return true;
+}
+
+/* Convert the CONST_WIDE_INT operand OP to broadcast in MODE.  */
+
+static rtx
+ix86_convert_const_wide_int_to_broadcast (machine_mode mode, rtx op)
+{
+  /* Don't use integer vector broadcast if we can't move from GPR to SSE
+     register directly.  */
+  if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
+    return nullptr;
+
+  /* Convert CONST_WIDE_INT to a non-standard SSE constant integer
+     broadcast only if vector broadcast is available.  */
+  if (!(TARGET_AVX2
+	|| (TARGET_AVX
+	    && (GET_MODE_INNER (mode) == SImode
+		|| GET_MODE_INNER (mode) == DImode)))
+      || !CONST_WIDE_INT_P (op)
+      || standard_sse_constant_p (op, mode))
+    return nullptr;
+
+  HOST_WIDE_INT val = CONST_WIDE_INT_ELT (op, 0);
+  HOST_WIDE_INT val_broadcast;
+  scalar_int_mode broadcast_mode;
+  if (ix86_broadcast (val, GET_MODE_BITSIZE (QImode),
+		      val_broadcast))
+    broadcast_mode = QImode;
+  else if (ix86_broadcast (val, GET_MODE_BITSIZE (HImode),
+			   val_broadcast))
+    broadcast_mode = HImode;
+  else if (ix86_broadcast (val, GET_MODE_BITSIZE (SImode),
+			   val_broadcast))
+    broadcast_mode = SImode;
+  else if (TARGET_64BIT
+	   && ix86_broadcast (val, GET_MODE_BITSIZE (DImode),
+			      val_broadcast))
+    broadcast_mode = DImode;
+  else
+    return nullptr;
+
+  /* Check if OP can be broadcasted from VAL.  */
+  for (int i = 1; i < CONST_WIDE_INT_NUNITS (op); i++)
+    if (val != CONST_WIDE_INT_ELT (op, i))
+      return nullptr;
+
+  unsigned int nunits = (GET_MODE_SIZE (mode)
+			 / GET_MODE_SIZE (broadcast_mode));
+  machine_mode vector_mode;
+  if (!mode_for_vector (broadcast_mode, nunits).exists (&vector_mode))
+    gcc_unreachable ();
+  rtx target = ix86_gen_scratch_sse_rtx (vector_mode);
+  bool ok = ix86_expand_vector_init_duplicate (false, vector_mode,
+					       target,
+					       GEN_INT (val_broadcast));
+  gcc_assert (ok);
+  target = lowpart_subreg (mode, target, vector_mode);
+  return target;
+}
+
 void
 ix86_expand_move (machine_mode mode, rtx operands[])
 {
@@ -347,20 +427,29 @@ ix86_expand_move (machine_mode mode, rtx operands[])
 	  && optimize)
 	op1 = copy_to_mode_reg (mode, op1);
 
-      if (can_create_pseudo_p ()
-	  && CONST_DOUBLE_P (op1))
+      if (can_create_pseudo_p ())
 	{
-	  /* If we are loading a floating point constant to a register,
-	     force the value to memory now, since we'll get better code
-	     out the back end.  */
+	  if (CONST_DOUBLE_P (op1))
+	    {
+	      /* If we are loading a floating point constant to a
+		 register, force the value to memory now, since we'll
+		 get better code out the back end.  */
 
-	  op1 = validize_mem (force_const_mem (mode, op1));
-	  if (!register_operand (op0, mode))
+	      op1 = validize_mem (force_const_mem (mode, op1));
+	      if (!register_operand (op0, mode))
+		{
+		  rtx temp = gen_reg_rtx (mode);
+		  emit_insn (gen_rtx_SET (temp, op1));
+		  emit_move_insn (op0, temp);
+		  return;
+		}
+	    }
+	  else if (GET_MODE_SIZE (mode) >= 16)
 	    {
-	      rtx temp = gen_reg_rtx (mode);
-	      emit_insn (gen_rtx_SET (temp, op1));
-	      emit_move_insn (op0, temp);
-	      return;
+	      rtx tmp = ix86_convert_const_wide_int_to_broadcast
+		(GET_MODE (op0), op1);
+	      if (tmp != nullptr)
+		op1 = tmp;
 	    }
 	}
     }
@@ -368,6 +457,54 @@ ix86_expand_move (machine_mode mode, rtx operands[])
   emit_insn (gen_rtx_SET (op0, op1));
 }
 
+static rtx
+ix86_broadcast_from_integer_constant (machine_mode mode, rtx op)
+{
+  int nunits = GET_MODE_NUNITS (mode);
+  if (nunits < 2)
+    return nullptr;
+
+  /* Don't use integer vector broadcast if we can't move from GPR to SSE
+     register directly.  */
+  if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
+    return nullptr;
+
+  /* Don't broadcast from a standard SSE constant integer.  */
+  if (standard_sse_constant_p (op, mode))
+    return nullptr;
+
+  /* Don't broadcast from a 64-bit integer constant in 32-bit mode.  */
+  if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT)
+    return nullptr;
+
+  rtx constant = get_pool_constant (XEXP (op, 0));
+  if (GET_CODE (constant) != CONST_VECTOR)
+    return nullptr;
+
+  /* There could be some rtx like
+     (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
+     but with "*.LC1" referring to a V2DI constant vector.  */
+  if (GET_MODE (constant) != mode)
+    {
+      constant = simplify_subreg (mode, constant, GET_MODE (constant),
+				  0);
+      if (constant == nullptr || GET_CODE (constant) != CONST_VECTOR)
+	return nullptr;
+    }
+
+  rtx first = XVECEXP (constant, 0, 0);
+
+  for (int i = 1; i < nunits; ++i)
+    {
+      rtx tmp = XVECEXP (constant, 0, i);
+      /* Vector duplicate value.  */
+      if (!rtx_equal_p (tmp, first))
+	return nullptr;
+    }
+
+  return first;
+}
+
 void
 ix86_expand_vector_move (machine_mode mode, rtx operands[])
 {
@@ -407,7 +544,36 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[])
 	  op1 = simplify_gen_subreg (mode, r, imode, SUBREG_BYTE (op1));
 	}
       else
-	op1 = validize_mem (force_const_mem (mode, op1));
+	{
+	  machine_mode mode = GET_MODE (op0);
+	  rtx tmp = ix86_convert_const_wide_int_to_broadcast
+	    (mode, op1);
+	  if (tmp == nullptr)
+	    op1 = validize_mem (force_const_mem (mode, op1));
+	  else
+	    op1 = tmp;
+	}
+    }
+
+  if (can_create_pseudo_p ()
+      && GET_MODE_SIZE (mode) >= 16
+      && GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+      && (MEM_P (op1)
+	  && SYMBOL_REF_P (XEXP (op1, 0))
+	  && CONSTANT_POOL_ADDRESS_P (XEXP (op1, 0))))
+    {
+      rtx first = ix86_broadcast_from_integer_constant (mode, op1);
+      if (first != nullptr)
+	{
+	  /* Broadcast to XMM/YMM/ZMM register from an integer
+	     constant.  */
+	  op1 = ix86_gen_scratch_sse_rtx (mode);
+	  bool ok = ix86_expand_vector_init_duplicate (false, mode,
+						       op1, first);
+	  gcc_assert (ok);
+	  emit_move_insn (op0, op1);
+	  return;
+	}
     }
 
   /* We need to check memory alignment for SSE mode since attribute
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 65fc307dc7b..71745b9a1ea 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -50,6 +50,8 @@ extern void ix86_reset_previous_fndecl (void);
 
 extern bool ix86_using_red_zone (void);
 
+extern rtx ix86_gen_scratch_sse_rtx (machine_mode);
+
 extern unsigned int ix86_regmode_natural_size (machine_mode);
 #ifdef RTX_CODE
 extern int standard_80387_constant_p (rtx);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index c71c9e666a4..1c167c9a841 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -23126,6 +23126,19 @@ ix86_optab_supported_p (int op, machine_mode mode1, machine_mode,
     }
 }
 
+/* Return a scratch register in MODE for vector load and store.  */
+
+rtx
+ix86_gen_scratch_sse_rtx (machine_mode mode)
+{
+  if (TARGET_SSE)
+    return gen_rtx_REG (mode, (TARGET_64BIT
+			       ? LAST_REX_SSE_REG
+			       : LAST_SSE_REG));
+  else
+    return gen_reg_rtx (mode);
+}
+
 /* Address space support.
 
    This is not "far pointers" in the 16-bit sense, but an easy way
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
index 0563e696316..a2664d87f29 100644
--- a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
@@ -2,8 +2,11 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mavx512f -mavx512dq" } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 } }  */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 5 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 { target { ! ia32 } } } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %zmm\[0-9\]+" 3 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %zmm\[0-9\]+" 3 { target { ! ia32 } } } } */
 
 typedef int v16si  __attribute__ ((vector_size (64)));
 typedef long long v8di  __attribute__ ((vector_size (64)));
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
index ffbe95980ca..477f9ca1282 100644
--- a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
@@ -2,8 +2,9 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mavx512f" } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-times "\[^n\n\]*\\\{1to8\\\}" 4 } }  */
-/* { dg-final { scan-assembler-times "\[^n\n\]*\\\{1to16\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %zmm\[0-9\]+" 4 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %zmm\[0-9\]+" 4 { target { ! ia32 } } } } */
 
 typedef int v16si  __attribute__ ((vector_size (64)));
 typedef long long v8di  __attribute__ ((vector_size (64)));
diff --git a/gcc/testsuite/gcc.target/i386/avx512f_cond_move.c b/gcc/testsuite/gcc.target/i386/avx512f_cond_move.c
index 99a89f51202..ca49a585232 100644
--- a/gcc/testsuite/gcc.target/i386/avx512f_cond_move.c
+++ b/gcc/testsuite/gcc.target/i386/avx512f_cond_move.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
-/* { dg-options "-O3 -mavx512f" } */
-/* { dg-final { scan-assembler-times "(?:vpblendmd|vmovdqa32)\[ \\t\]+\[^\{\n\]*%zmm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 8 } } */
+/* { dg-options "-O3 -mavx512f -mprefer-vector-width=512" } */
+/* { dg-final { scan-assembler-times "(?:vpbroadcastd|vmovdqa32)\[ \\t\]+\[^\{\n\]*%zmm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 8 } } */
 
 unsigned int x[128];
 int y[128];
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
index c06369d93fd..f8eb99f0b5f 100644
--- a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
@@ -2,9 +2,15 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 5 } }  */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 10 } }  */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 2 { target { ! ia32 } } } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 { target { ! ia32 } } } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 5 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 7 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 3 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 3 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %xmm\[0-9\]+" 3 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %ymm\[0-9\]+" 3 { target { ! ia32 } } } } */
 
 typedef int v4si  __attribute__ ((vector_size (16)));
 typedef int v8si  __attribute__ ((vector_size (32)));
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
index 4998a9b8d51..32f6ac81841 100644
--- a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
@@ -2,9 +2,12 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mavx512f -mavx512vl" } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 4 } }  */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 8 } }  */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 4 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 4 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 4 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %xmm\[0-9\]+" 4 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %ymm\[0-9\]+" 4 { target { ! ia32 } } } } */
 
 typedef int v4si  __attribute__ ((vector_size (16)));
 typedef int v8si  __attribute__ ((vector_size (32)));
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-1.c b/gcc/testsuite/gcc.target/i386/pr100865-1.c
new file mode 100644
index 00000000000..6c3097fb2a6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-1.c
@@ -0,0 +1,13 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=x86-64" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 16);
+}
+
+/* { dg-final { scan-assembler-times "movdqa\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-10a.c b/gcc/testsuite/gcc.target/i386/pr100865-10a.c
new file mode 100644
index 00000000000..7ffc19e56a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-10a.c
@@ -0,0 +1,33 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O3 -march=skylake" } */
+
+extern __int128 array[16];
+
+#define MK_CONST128_BROADCAST(A) \
+  ((((unsigned __int128) (unsigned char) A) << 120) \
+   | (((unsigned __int128) (unsigned char) A) << 112) \
+   | (((unsigned __int128) (unsigned char) A) << 104) \
+   | (((unsigned __int128) (unsigned char) A) << 96) \
+   | (((unsigned __int128) (unsigned char) A) << 88) \
+   | (((unsigned __int128) (unsigned char) A) << 80) \
+   | (((unsigned __int128) (unsigned char) A) << 72) \
+   | (((unsigned __int128) (unsigned char) A) << 64) \
+   | (((unsigned __int128) (unsigned char) A) << 56) \
+   | (((unsigned __int128) (unsigned char) A) << 48) \
+   | (((unsigned __int128) (unsigned char) A) << 40) \
+   | (((unsigned __int128) (unsigned char) A) << 32) \
+   | (((unsigned __int128) (unsigned char) A) << 24) \
+   | (((unsigned __int128) (unsigned char) A) << 16) \
+   | (((unsigned __int128) (unsigned char) A) << 8) \
+   | ((unsigned __int128) (unsigned char) A) )
+
+void
+foo (void)
+{
+  int i;
+  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
+    array[i] = MK_CONST128_BROADCAST (0x1f);
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-10b.c b/gcc/testsuite/gcc.target/i386/pr100865-10b.c
new file mode 100644
index 00000000000..edf52765c60
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-10b.c
@@ -0,0 +1,7 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O3 -march=skylake-avx512" } */
+
+#include "pr100865-10a.c"
+
+/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-11a.c b/gcc/testsuite/gcc.target/i386/pr100865-11a.c
new file mode 100644
index 00000000000..04ce1662f3c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-11a.c
@@ -0,0 +1,23 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O3 -march=skylake" } */
+
+extern __int128 array[16];
+
+#define MK_CONST128_BROADCAST(A) \
+  ((((unsigned __int128) (unsigned long long) A) << 64) \
+   | ((unsigned __int128) (unsigned long long) A) )
+
+#define MK_CONST128_BROADCAST_SIGNED(A) \
+  ((__int128) MK_CONST128_BROADCAST (A))
+
+void
+foo (void)
+{
+  int i;
+  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
+    array[i] = MK_CONST128_BROADCAST_SIGNED (-0x1ffffffffLL);
+}
+
+/* { dg-final { scan-assembler-times "movabsq" 1 } } */
+/* { dg-final { scan-assembler-times "(?:vpbroadcastq|vpunpcklqdq)\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-11b.c b/gcc/testsuite/gcc.target/i386/pr100865-11b.c
new file mode 100644
index 00000000000..12d55b9a642
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-11b.c
@@ -0,0 +1,8 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O3 -march=skylake-avx512" } */
+
+#include "pr100865-11a.c"
+
+/* { dg-final { scan-assembler-times "movabsq" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-12a.c b/gcc/testsuite/gcc.target/i386/pr100865-12a.c
new file mode 100644
index 00000000000..d4833d44475
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-12a.c
@@ -0,0 +1,20 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O3 -march=skylake" } */
+
+extern __int128 array[16];
+
+#define MK_CONST128_BROADCAST(A) \
+  ((((unsigned __int128) (unsigned long long) A) << 64) \
+   | ((unsigned __int128) (unsigned long long) A) )
+
+void
+foo (void)
+{
+  int i;
+  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
+    array[i] = MK_CONST128_BROADCAST (0x1ffffffffLL);
+}
+
+/* { dg-final { scan-assembler-times "movabsq" 1 } } */
+/* { dg-final { scan-assembler-times "(?:vpbroadcastq|vpunpcklqdq)\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-12b.c b/gcc/testsuite/gcc.target/i386/pr100865-12b.c
new file mode 100644
index 00000000000..12d55b9a642
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-12b.c
@@ -0,0 +1,8 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O3 -march=skylake-avx512" } */
+
+#include "pr100865-12a.c"
+
+/* { dg-final { scan-assembler-times "movabsq" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-2.c b/gcc/testsuite/gcc.target/i386/pr100865-2.c
new file mode 100644
index 00000000000..17efe2d72a3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 16);
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-3.c b/gcc/testsuite/gcc.target/i386/pr100865-3.c
new file mode 100644
index 00000000000..b6dbcf7809b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-3.c
@@ -0,0 +1,15 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 16);
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
+/* { dg-final { scan-assembler-not "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" } } */
+/* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-4a.c b/gcc/testsuite/gcc.target/i386/pr100865-4a.c
new file mode 100644
index 00000000000..f55883598f9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-4a.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern char array[64];
+
+void
+foo (void)
+{
+  int i;
+  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
+    array[i] = -45;
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, " 4 } } */
+/* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-4b.c b/gcc/testsuite/gcc.target/i386/pr100865-4b.c
new file mode 100644
index 00000000000..f41e6147b4c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-4b.c
@@ -0,0 +1,9 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512" } */
+
+#include "pr100865-4a.c"
+
+/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, " 4 } } */
+/* { dg-final { scan-assembler-not "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" } } */
+/* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-5a.c b/gcc/testsuite/gcc.target/i386/pr100865-5a.c
new file mode 100644
index 00000000000..4149797fe81
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-5a.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=skylake" } */
+
+extern short array[64];
+
+void
+foo (void)
+{
+  int i;
+  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
+    array[i] = -45;
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 4 } } */
+/* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-5b.c b/gcc/testsuite/gcc.target/i386/pr100865-5b.c
new file mode 100644
index 00000000000..ded41b680d3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-5b.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=skylake-avx512" } */
+
+#include "pr100865-5a.c"
+
+/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu16\[\\t \]%ymm\[0-9\]+, " 4 } } */
+/* { dg-final { scan-assembler-not "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" } } */
+/* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-6a.c b/gcc/testsuite/gcc.target/i386/pr100865-6a.c
new file mode 100644
index 00000000000..3fde549a10d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-6a.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=skylake" } */
+
+extern int array[64];
+
+void
+foo (void)
+{
+  int i;
+  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
+    array[i] = -45;
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 8 } } */
+/* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-6b.c b/gcc/testsuite/gcc.target/i386/pr100865-6b.c
new file mode 100644
index 00000000000..44e74c64e55
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-6b.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=skylake-avx512" } */
+
+#include "pr100865-6a.c"
+
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 8 } } */
+/* { dg-final { scan-assembler-not "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" } } */
+/* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-6c.c b/gcc/testsuite/gcc.target/i386/pr100865-6c.c
new file mode 100644
index 00000000000..46d31030ce8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-6c.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=skylake -mno-avx2" } */
+
+extern int array[64];
+
+void
+foo (void)
+{
+  int i;
+  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
+    array[i] = -45;
+}
+
+/* { dg-final { scan-assembler-times "vbroadcastss" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 8 } } */
+/* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-7a.c b/gcc/testsuite/gcc.target/i386/pr100865-7a.c
new file mode 100644
index 00000000000..f6f2be91120
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-7a.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=skylake" } */
+
+extern long long int array[64];
+
+void
+foo (void)
+{
+  int i;
+  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
+    array[i] = -45;
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+\[^\n\]*, %ymm\[0-9\]+" 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 16 } } */
+/* { dg-final { scan-assembler-not "vpbroadcastq" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "vmovdqa" { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-7b.c b/gcc/testsuite/gcc.target/i386/pr100865-7b.c
new file mode 100644
index 00000000000..0a68820aa32
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-7b.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=skylake-avx512" } */
+
+#include "pr100865-7a.c"
+
+/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %ymm\[0-9\]+" 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+\[^\n\]*, %ymm\[0-9\]+" 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 16 } } */
+/* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-7c.c b/gcc/testsuite/gcc.target/i386/pr100865-7c.c
new file mode 100644
index 00000000000..4d50bb7e2f6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-7c.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=skylake -mno-avx2" } */
+
+extern long long int array[64];
+
+void
+foo (void)
+{
+  int i;
+  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
+    array[i] = -45;
+}
+
+/* { dg-final { scan-assembler-times "vbroadcastsd" 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 16 } } */
+/* { dg-final { scan-assembler-not "vbroadcastsd" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "vmovdqa" { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-8a.c b/gcc/testsuite/gcc.target/i386/pr100865-8a.c
new file mode 100644
index 00000000000..96e9f13204c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-8a.c
@@ -0,0 +1,24 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O3 -march=skylake" } */
+
+extern __int128 array[16];
+
+#define MK_CONST128_BROADCAST(A) \
+  ((((unsigned __int128) (unsigned int) A) << 96) \
+   | (((unsigned __int128) (unsigned int) A) << 64) \
+   | (((unsigned __int128) (unsigned int) A) << 32) \
+   | ((unsigned __int128) (unsigned int) A) )
+
+#define MK_CONST128_BROADCAST_SIGNED(A) \
+  ((__int128) MK_CONST128_BROADCAST (A))
+
+void
+foo (void)
+{
+  int i;
+  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
+    array[i] = MK_CONST128_BROADCAST_SIGNED (-45);
+}
+
+/* { dg-final { scan-assembler-times "(?:vpbroadcastq|vpshufd)\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-8b.c b/gcc/testsuite/gcc.target/i386/pr100865-8b.c
new file mode 100644
index 00000000000..99a10ad83bd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-8b.c
@@ -0,0 +1,7 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O3 -march=skylake-avx512" } */
+
+#include "pr100865-8a.c"
+
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-9a.c b/gcc/testsuite/gcc.target/i386/pr100865-9a.c
new file mode 100644
index 00000000000..45d0e0d0e2e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-9a.c
@@ -0,0 +1,25 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O3 -march=skylake" } */
+
+extern __int128 array[16];
+
+#define MK_CONST128_BROADCAST(A) \
+  ((((unsigned __int128) (unsigned short) A) << 112) \
+   | (((unsigned __int128) (unsigned short) A) << 96) \
+   | (((unsigned __int128) (unsigned short) A) << 80) \
+   | (((unsigned __int128) (unsigned short) A) << 64) \
+   | (((unsigned __int128) (unsigned short) A) << 48) \
+   | (((unsigned __int128) (unsigned short) A) << 32) \
+   | (((unsigned __int128) (unsigned short) A) << 16) \
+   | ((unsigned __int128) (unsigned short) A) )
+
+void
+foo (void)
+{
+  int i;
+  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
+    array[i] = MK_CONST128_BROADCAST (0x1fff);
+}
+
+/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-9b.c b/gcc/testsuite/gcc.target/i386/pr100865-9b.c
new file mode 100644
index 00000000000..14696248525
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100865-9b.c
@@ -0,0 +1,7 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O3 -march=skylake-avx512" } */
+
+#include "pr100865-9a.c"
+
+/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
-- 
2.31.1



* [PATCH v5 2/2] x86: Add vec_duplicate<mode> expander
  2021-06-26 20:02 [PATCH v5 0/2] x86: Convert CONST_WIDE_INT/CONST_VECTOR to broadcast H.J. Lu
  2021-06-26 20:02 ` [PATCH v5 1/2] " H.J. Lu
@ 2021-06-26 20:02 ` H.J. Lu
  2021-06-27  8:43   ` Richard Sandiford
  1 sibling, 1 reply; 12+ messages in thread
From: H.J. Lu @ 2021-06-26 20:02 UTC (permalink / raw)
  To: gcc-patches
  Cc: Uros Bizjak, Jakub Jelinek, Hongtao Liu, Richard Sandiford,
	Richard Biener

1. Allow vec_duplicate to fail so that a backend can permit broadcasting
an integer constant to a vector only when a broadcast instruction is
available.  The memset expander can use this to avoid vec_duplicate when
loading from the constant pool is more efficient.
2. Add a vec_duplicate<mode> expander and enable vec_duplicate from a
non-standard SSE constant integer only if vector broadcast is available.

	* config/i386/i386-expand.c (ix86_expand_integer_vec_duplicate):
	New function.
	* config/i386/i386-protos.h (ix86_expand_integer_vec_duplicate):
	New prototype.
	* config/i386/sse.md (INT_BROADCAST_MODE): New mode iterator.
	(vec_duplicate<mode>): New expander.
	* doc/md.texi: Update vec_duplicate.
---
 gcc/config/i386/i386-expand.c | 24 ++++++++++++++++++++++++
 gcc/config/i386/i386-protos.h |  1 +
 gcc/config/i386/sse.md        | 28 ++++++++++++++++++++++++++++
 gcc/doc/md.texi               |  2 --
 4 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index e9e89c82764..75c160d4349 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -15742,6 +15742,30 @@ ix86_expand_vector_extract (bool mmx_ok, rtx target, rtx vec, int elt)
     }
 }
 
+/* Expand integer vec_duplicate.  Return true if successful.  */
+
+bool
+ix86_expand_integer_vec_duplicate (rtx *operands)
+{
+  /* Enable VEC_DUPLICATE from a non-standard SSE constant integer only
+     if vector broadcast is available.  */
+  machine_mode mode = GET_MODE (operands[0]);
+  if (CONST_INT_P (operands[1])
+      && (!(TARGET_AVX2
+	    || (TARGET_AVX
+		&& (GET_MODE_INNER (mode) == SImode
+		    || GET_MODE_INNER (mode) == DImode)))
+	  || standard_sse_constant_p (operands[1], mode)))
+    return false;
+
+  bool ok = ix86_expand_vector_init_duplicate (false, mode,
+					       operands[0],
+					       operands[1]);
+  gcc_assert (ok);
+
+  return true;
+}
+
 /* Generate code to copy vector bits i / 2 ... i - 1 from vector SRC
    to bits 0 ... i / 2 - 1 of vector DEST, which has the same mode.
    The upper bits of DEST are undefined, though they shouldn't cause
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 71745b9a1ea..a6cc09bb75b 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -258,6 +258,7 @@ extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
 extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_abs (rtx, rtx);
+extern bool ix86_expand_integer_vec_duplicate (rtx *);
 
 /* In i386-c.c  */
 extern void ix86_target_macros (void);
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index e4f01e64bc1..53a703fb466 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -24640,3 +24640,31 @@ (define_insn "*aes<aeswideklvariant>u8"
   "TARGET_WIDEKL"
   "aes<aeswideklvariant>\t{%0}"
   [(set_attr "type" "other")])
+
+;; Modes handled by broadcast patterns.  NB: Allow V64QI and V32HI with
+;; TARGET_AVX512F since ix86_expand_integer_vec_duplicate can expand
+;; them without TARGET_AVX512BW.  The memset vector broadcast expander
+;; relies on this to broadcast to XI with:
+;;	vmovd		%edi, %xmm15
+;;	vpbroadcastb	%xmm15, %ymm15
+;;	vinserti64x4	$0x1, %ymm15, %zmm15, %zmm15
+
+(define_mode_iterator INT_BROADCAST_MODE
+  [(V64QI "TARGET_AVX512F") (V32QI "TARGET_AVX") V16QI
+   (V32HI "TARGET_AVX512F") (V16HI "TARGET_AVX") V8HI
+   (V16SI "TARGET_AVX512F") (V8SI "TARGET_AVX") V4SI
+   (V8DI "TARGET_AVX512F && TARGET_64BIT")
+   (V4DI "TARGET_AVX && TARGET_64BIT") (V2DI "TARGET_64BIT")])
+
+;; Broadcast from an integer.  NB: Enable broadcast only if we can move
+;; from GPR to SSE register directly.
+(define_expand "vec_duplicate<mode>"
+  [(set (match_operand:INT_BROADCAST_MODE 0 "register_operand")
+	(vec_duplicate:INT_BROADCAST_MODE
+	  (match_operand:<ssescalarmode> 1 "general_operand")))]
+  "TARGET_SSE2 && TARGET_INTER_UNIT_MOVES_TO_VEC"
+{
+  if (!ix86_expand_integer_vec_duplicate (operands))
+    FAIL;
+  DONE;
+})
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 1b918144330..a892c94d163 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5077,8 +5077,6 @@ the mode appropriate for one element of @var{m}.
 This pattern only handles duplicates of non-constant inputs.  Constant
 vectors go through the @code{mov@var{m}} pattern instead.
 
-This pattern is not allowed to @code{FAIL}.
-
 @cindex @code{vec_series@var{m}} instruction pattern
 @item @samp{vec_series@var{m}}
 Initialize vector output operand 0 so that element @var{i} is equal to
-- 
2.31.1



* Re: [PATCH v5 2/2] x86: Add vec_duplicate<mode> expander
  2021-06-26 20:02 ` [PATCH v5 2/2] x86: Add vec_duplicate<mode> expander H.J. Lu
@ 2021-06-27  8:43   ` Richard Sandiford
  2021-06-27 11:29     ` H.J. Lu
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Sandiford @ 2021-06-27  8:43 UTC (permalink / raw)
  To: H.J. Lu
  Cc: gcc-patches, Uros Bizjak, Jakub Jelinek, Hongtao Liu, Richard Biener

"H.J. Lu" <hjl.tools@gmail.com> writes:
> 1. Update vec_duplicate to allow to fail so that backend can only allow
> broadcasting an integer constant to a vector when broadcast instruction
> is available.  This can be used by memset expander to avoid vec_duplicate
> when loading from constant pool is more efficient.

I don't see any changes in target-independent code though, other than
the doc update.  It's still the case that (existing) uses of
vec_duplicate_optab do not allow it to fail.

Thanks,
Richard

> 2. Add vec_duplicate<mode> expander and enable vec_duplicate from a
> non-standard SSE constant integer only if vector broadcast is available.
>
> 	* config/i386/i386-expand.c (ix86_expand_integer_vec_duplicate):
> 	New function.
> 	* config/i386/i386-protos.h (ix86_expand_integer_vec_duplicate):
> 	New prototype.
> 	* config/i386/sse.md (INT_BROADCAST_MODE): New mode iterator.
> 	(vec_duplicate<mode>): New expander.
> 	* doc/md.texi: Update vec_duplicate.
> ---
>  gcc/config/i386/i386-expand.c | 24 ++++++++++++++++++++++++
>  gcc/config/i386/i386-protos.h |  1 +
>  gcc/config/i386/sse.md        | 28 ++++++++++++++++++++++++++++
>  gcc/doc/md.texi               |  2 --
>  4 files changed, 53 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> index e9e89c82764..75c160d4349 100644
> --- a/gcc/config/i386/i386-expand.c
> +++ b/gcc/config/i386/i386-expand.c
> @@ -15742,6 +15742,30 @@ ix86_expand_vector_extract (bool mmx_ok, rtx target, rtx vec, int elt)
>      }
>  }
>  
> +/* Expand integer vec_duplicate.  Return true if successful.  */
> +
> +bool
> +ix86_expand_integer_vec_duplicate (rtx *operands)
> +{
> +  /* Enable VEC_DUPLICATE from a non-standard SSE constant integer only
> +     if vector broadcast is available.  */
> +  machine_mode mode = GET_MODE (operands[0]);
> +  if (CONST_INT_P (operands[1])
> +      && (!(TARGET_AVX2
> +	    || (TARGET_AVX
> +		&& (GET_MODE_INNER (mode) == SImode
> +		    || GET_MODE_INNER (mode) == DImode)))
> +	  || standard_sse_constant_p (operands[1], mode)))
> +    return false;
> +
> +  bool ok = ix86_expand_vector_init_duplicate (false, mode,
> +					       operands[0],
> +					       operands[1]);
> +  gcc_assert (ok);
> +
> +  return true;
> +}
> +
>  /* Generate code to copy vector bits i / 2 ... i - 1 from vector SRC
>     to bits 0 ... i / 2 - 1 of vector DEST, which has the same mode.
>     The upper bits of DEST are undefined, though they shouldn't cause
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index 71745b9a1ea..a6cc09bb75b 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -258,6 +258,7 @@ extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
>  extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
>  extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
>  extern void ix86_expand_sse2_abs (rtx, rtx);
> +extern bool ix86_expand_integer_vec_duplicate (rtx *);
>  
>  /* In i386-c.c  */
>  extern void ix86_target_macros (void);
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index e4f01e64bc1..53a703fb466 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -24640,3 +24640,31 @@ (define_insn "*aes<aeswideklvariant>u8"
>    "TARGET_WIDEKL"
>    "aes<aeswideklvariant>\t{%0}"
>    [(set_attr "type" "other")])
> +
> +;; Modes handled by broadcast patterns.  NB: Allow V64QI and V32HI with
> +;; TARGET_AVX512F since ix86_expand_integer_vec_duplicate can expand
> +;; without TARGET_AVX512BW which is used by memset vector broadcast
> +;; expander to XI with:
> +;; 	vmovd		%edi, %xmm15
> +;;	vpbroadcastb	%xmm15, %ymm15
> +;;	vinserti64x4	$0x1, %ymm15, %zmm15, %zmm15
> +
> +(define_mode_iterator INT_BROADCAST_MODE
> +  [(V64QI "TARGET_AVX512F") (V32QI "TARGET_AVX") V16QI
> +   (V32HI "TARGET_AVX512F") (V16HI "TARGET_AVX") V8HI
> +   (V16SI "TARGET_AVX512F") (V8SI "TARGET_AVX") V4SI
> +   (V8DI "TARGET_AVX512F && TARGET_64BIT")
> +   (V4DI "TARGET_AVX && TARGET_64BIT") (V2DI "TARGET_64BIT")])
> +
> +;; Broadcast from an integer.  NB: Enable broadcast only if we can move
> +;; from GPR to SSE register directly.
> +(define_expand "vec_duplicate<mode>"
> +  [(set (match_operand:INT_BROADCAST_MODE 0 "register_operand")
> +	(vec_duplicate:INT_BROADCAST_MODE
> +	  (match_operand:<ssescalarmode> 1 "general_operand")))]
> +  "TARGET_SSE2 && TARGET_INTER_UNIT_MOVES_TO_VEC"
> +{
> +  if (!ix86_expand_integer_vec_duplicate (operands))
> +    FAIL;
> +  DONE;
> +})
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 1b918144330..a892c94d163 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5077,8 +5077,6 @@ the mode appropriate for one element of @var{m}.
>  This pattern only handles duplicates of non-constant inputs.  Constant
>  vectors go through the @code{mov@var{m}} pattern instead.
>  
> -This pattern is not allowed to @code{FAIL}.
> -
>  @cindex @code{vec_series@var{m}} instruction pattern
>  @item @samp{vec_series@var{m}}
>  Initialize vector output operand 0 so that element @var{i} is equal to


* Re: [PATCH v5 2/2] x86: Add vec_duplicate<mode> expander
  2021-06-27  8:43   ` Richard Sandiford
@ 2021-06-27 11:29     ` H.J. Lu
  2021-06-27 21:00       ` Richard Sandiford
  0 siblings, 1 reply; 12+ messages in thread
From: H.J. Lu @ 2021-06-27 11:29 UTC (permalink / raw)
  To: GCC Patches, Uros Bizjak, Jakub Jelinek, Hongtao Liu,
	Richard Biener, Richard Sandiford

On Sun, Jun 27, 2021 at 1:43 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> "H.J. Lu" <hjl.tools@gmail.com> writes:
> > 1. Update vec_duplicate to allow to fail so that backend can only allow
> > broadcasting an integer constant to a vector when broadcast instruction
> > is available.  This can be used by memset expander to avoid vec_duplicate
> > when loading from constant pool is more efficient.
>
> I don't see any changes in target-independent code though, other than
> the doc update.  It's still the case that (existing) uses of
> vec_duplicate_optab do not allow it to fail.

I have a followup patch set on

https://gitlab.com/x86-gcc/gcc/-/commits/users/hjl/pieces/broadcast

to use it to expand memset with vector broadcast:

https://gitlab.com/x86-gcc/gcc/-/commit/991c87f8a83ca736ae9ed92baa3ebadca289f6e3

For SSE2, which has no vector broadcast instruction, the constant vector
broadcast expander returns FAIL and a load from the constant pool is used
instead.
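
The FAIL condition can be modelled outside the compiler.  The following
is a minimal sketch, not GCC code: struct isa_flags, enum inner_mode and
integer_vec_duplicate_ok are hypothetical names, and plain booleans stand
in for TARGET_AVX, TARGET_AVX2, CONST_INT_P and standard_sse_constant_p.
It mirrors only the early-exit test in ix86_expand_integer_vec_duplicate
from this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the ISA checks used by the patch.  */
struct isa_flags
{
  bool avx;   /* models TARGET_AVX */
  bool avx2;  /* models TARGET_AVX2 */
};

/* Hypothetical stand-in for GET_MODE_INNER (mode).  */
enum inner_mode { INNER_QI, INNER_HI, INNER_SI, INNER_DI };

/* Return true if the expander would go on to emit a broadcast, false
   if it would FAIL and leave the caller to load the vector from the
   constant pool.  */
bool
integer_vec_duplicate_ok (struct isa_flags isa, enum inner_mode inner,
                          bool is_const_int, bool is_standard_sse_const)
{
  /* The early exit only applies to constant integer operands.  */
  if (!is_const_int)
    return true;

  /* A broadcast is available with AVX2, or with AVX alone when the
     element is 32 or 64 bits wide.  */
  bool have_broadcast
    = isa.avx2 || (isa.avx && (inner == INNER_SI || inner == INNER_DI));

  /* Standard SSE constants are left to the move patterns.  */
  return have_broadcast && !is_standard_sse_const;
}

/* Self-check of the SSE2 case: constant operand, no broadcast.  */
void
check_integer_vec_duplicate_ok (void)
{
  struct isa_flags sse2_only = { false, false };
  assert (!integer_vec_duplicate_ok (sse2_only, INNER_QI, true, false));
}
```

With both flags false (plain SSE2) every constant operand takes the FAIL
path, which is the case where the constant pool load is used.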

> Thanks,
> Richard
>
> > 2. Add vec_duplicate<mode> expander and enable vec_duplicate from a
> > non-standard SSE constant integer only if vector broadcast is available.
> >
> >       * config/i386/i386-expand.c (ix86_expand_integer_vec_duplicate):
> >       New function.
> >       * config/i386/i386-protos.h (ix86_expand_integer_vec_duplicate):
> >       New prototype.
> >       * config/i386/sse.md (INT_BROADCAST_MODE): New mode iterator.
> >       (vec_duplicate<mode>): New expander.
> >       * doc/md.texi: Update vec_duplicate.
> > ---
> >  gcc/config/i386/i386-expand.c | 24 ++++++++++++++++++++++++
> >  gcc/config/i386/i386-protos.h |  1 +
> >  gcc/config/i386/sse.md        | 28 ++++++++++++++++++++++++++++
> >  gcc/doc/md.texi               |  2 --
> >  4 files changed, 53 insertions(+), 2 deletions(-)
> >
> > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> > index e9e89c82764..75c160d4349 100644
> > --- a/gcc/config/i386/i386-expand.c
> > +++ b/gcc/config/i386/i386-expand.c
> > @@ -15742,6 +15742,30 @@ ix86_expand_vector_extract (bool mmx_ok, rtx target, rtx vec, int elt)
> >      }
> >  }
> >
> > +/* Expand integer vec_duplicate.  Return true if successful.  */
> > +
> > +bool
> > +ix86_expand_integer_vec_duplicate (rtx *operands)
> > +{
> > +  /* Enable VEC_DUPLICATE from a non-standard SSE constant integer only
> > +     if vector broadcast is available.  */
> > +  machine_mode mode = GET_MODE (operands[0]);
> > +  if (CONST_INT_P (operands[1])
> > +      && (!(TARGET_AVX2
> > +         || (TARGET_AVX
> > +             && (GET_MODE_INNER (mode) == SImode
> > +                 || GET_MODE_INNER (mode) == DImode)))
> > +       || standard_sse_constant_p (operands[1], mode)))
> > +    return false;
> > +
> > +  bool ok = ix86_expand_vector_init_duplicate (false, mode,
> > +                                            operands[0],
> > +                                            operands[1]);
> > +  gcc_assert (ok);
> > +
> > +  return true;
> > +}
> > +
> >  /* Generate code to copy vector bits i / 2 ... i - 1 from vector SRC
> >     to bits 0 ... i / 2 - 1 of vector DEST, which has the same mode.
> >     The upper bits of DEST are undefined, though they shouldn't cause
> > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > index 71745b9a1ea..a6cc09bb75b 100644
> > --- a/gcc/config/i386/i386-protos.h
> > +++ b/gcc/config/i386/i386-protos.h
> > @@ -258,6 +258,7 @@ extern void ix86_expand_mul_widen_hilo (rtx, rtx, rtx, bool, bool);
> >  extern void ix86_expand_sse2_mulv4si3 (rtx, rtx, rtx);
> >  extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
> >  extern void ix86_expand_sse2_abs (rtx, rtx);
> > +extern bool ix86_expand_integer_vec_duplicate (rtx *);
> >
> >  /* In i386-c.c  */
> >  extern void ix86_target_macros (void);
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index e4f01e64bc1..53a703fb466 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -24640,3 +24640,31 @@ (define_insn "*aes<aeswideklvariant>u8"
> >    "TARGET_WIDEKL"
> >    "aes<aeswideklvariant>\t{%0}"
> >    [(set_attr "type" "other")])
> > +
> > +;; Modes handled by broadcast patterns.  NB: Allow V64QI and V32HI with
> > +;; TARGET_AVX512F since ix86_expand_integer_vec_duplicate can expand
> > +;; without TARGET_AVX512BW which is used by memset vector broadcast
> > +;; expander to XI with:
> > +;;   vmovd           %edi, %xmm15
> > +;;   vpbroadcastb    %xmm15, %ymm15
> > +;;   vinserti64x4    $0x1, %ymm15, %zmm15, %zmm15
> > +
> > +(define_mode_iterator INT_BROADCAST_MODE
> > +  [(V64QI "TARGET_AVX512F") (V32QI "TARGET_AVX") V16QI
> > +   (V32HI "TARGET_AVX512F") (V16HI "TARGET_AVX") V8HI
> > +   (V16SI "TARGET_AVX512F") (V8SI "TARGET_AVX") V4SI
> > +   (V8DI "TARGET_AVX512F && TARGET_64BIT")
> > +   (V4DI "TARGET_AVX && TARGET_64BIT") (V2DI "TARGET_64BIT")])
> > +
> > +;; Broadcast from an integer.  NB: Enable broadcast only if we can move
> > +;; from GPR to SSE register directly.
> > +(define_expand "vec_duplicate<mode>"
> > +  [(set (match_operand:INT_BROADCAST_MODE 0 "register_operand")
> > +     (vec_duplicate:INT_BROADCAST_MODE
> > +       (match_operand:<ssescalarmode> 1 "general_operand")))]
> > +  "TARGET_SSE2 && TARGET_INTER_UNIT_MOVES_TO_VEC"
> > +{
> > +  if (!ix86_expand_integer_vec_duplicate (operands))
> > +    FAIL;
> > +  DONE;
> > +})
> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > index 1b918144330..a892c94d163 100644
> > --- a/gcc/doc/md.texi
> > +++ b/gcc/doc/md.texi
> > @@ -5077,8 +5077,6 @@ the mode appropriate for one element of @var{m}.
> >  This pattern only handles duplicates of non-constant inputs.  Constant
> >  vectors go through the @code{mov@var{m}} pattern instead.
> >
> > -This pattern is not allowed to @code{FAIL}.
> > -
> >  @cindex @code{vec_series@var{m}} instruction pattern
> >  @item @samp{vec_series@var{m}}
> >  Initialize vector output operand 0 so that element @var{i} is equal to



-- 
H.J.


* Re: [PATCH v5 2/2] x86: Add vec_duplicate<mode> expander
  2021-06-27 11:29     ` H.J. Lu
@ 2021-06-27 21:00       ` Richard Sandiford
  2021-06-28 12:16         ` H.J. Lu
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Sandiford @ 2021-06-27 21:00 UTC (permalink / raw)
  To: H.J. Lu via Gcc-patches
  Cc: Uros Bizjak, Jakub Jelinek, Hongtao Liu, Richard Biener, H.J. Lu

"H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> On Sun, Jun 27, 2021 at 1:43 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> "H.J. Lu" <hjl.tools@gmail.com> writes:
>> > 1. Update vec_duplicate to allow to fail so that backend can only allow
>> > broadcasting an integer constant to a vector when broadcast instruction
>> > is available.  This can be used by memset expander to avoid vec_duplicate
>> > when loading from constant pool is more efficient.
>>
>> I don't see any changes in target-independent code though, other than
>> the doc update.  It's still the case that (existing) uses of
>> vec_duplicate_optab do not allow it to fail.
>
> I have a followup patch set on
>
> https://gitlab.com/x86-gcc/gcc/-/commits/users/hjl/pieces/broadcast
>
> to use it to expand memset with vector broadcast:
>
> https://gitlab.com/x86-gcc/gcc/-/commit/991c87f8a83ca736ae9ed92baa3ebadca289f6e3
>
> For SSE2, which doesn't have vector broadcast, the constant vector broadcast
> expander returns FAIL and a load from the constant pool is used instead.

Hmm, but as Jeff and I mentioned in the earlier replies,
vec_duplicate_optab shouldn't be used for constants.  Constants
should go via the move expanders instead.

In a previous message I suggested:

  … would it work to change:

	/* Try using vec_duplicate_optab for uniform vectors.  */
	if (!TREE_SIDE_EFFECTS (exp)
	    && VECTOR_MODE_P (mode)
	    && eltmode == GET_MODE_INNER (mode)
	    && ((icode = optab_handler (vec_duplicate_optab, mode))
		!= CODE_FOR_nothing)
	    && (elt = uniform_vector_p (exp)))

  to something like:

	/* Try using vec_duplicate_optab for uniform vectors.  */
	if (!TREE_SIDE_EFFECTS (exp)
	    && VECTOR_MODE_P (mode)
	    && eltmode == GET_MODE_INNER (mode)
	    && (elt = uniform_vector_p (exp)))
	  {
	    if (TREE_CODE (elt) == INTEGER_CST
		|| TREE_CODE (elt) == POLY_INT_CST
		|| TREE_CODE (elt) == REAL_CST
		|| TREE_CODE (elt) == FIXED_CST)
	      {
		rtx src = gen_const_vec_duplicate (mode, expand_normal (node));
		emit_move_insn (target, src);
		break;
	      }
	    …
	  }

if that code was the source of the constant operand.  If we're adding a
new use of vec_duplicate_optab then that should be similarly protected
against constant operands.
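
To make the distinction concrete at the source level, here is a minimal sketch using GCC's generic vector extensions.  The function names dup_const and dup_var are hypothetical and are not part of any patch; the comments describe the expected routing as documented in md.texi, not a guarantee about what a given target emits.

```c
/* A 16-byte vector of four ints, using the GCC vector extension.  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Uniform vector built from a constant.  Per md.texi, constant vectors
   are expected to go through the mov<m> pattern rather than
   vec_duplicate<m>.  */
v4si
dup_const (void)
{
  return (v4si) { 3, 3, 3, 3 };
}

/* Uniform vector built from a run-time value.  This is the kind of
   non-constant input that vec_duplicate<m> is documented to handle.  */
v4si
dup_var (int x)
{
  return (v4si) { x, x, x, x };
}
```

Both functions produce the same kind of value; the only difference is whether the duplicated element is known at compile time, which is the property the proposed check on TREE_CODE (elt) tests for.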

Thanks,
Richard


* Re: [PATCH v5 1/2] x86: Convert CONST_WIDE_INT/CONST_VECTOR to broadcast
  2021-06-26 20:02 ` [PATCH v5 1/2] " H.J. Lu
@ 2021-06-28  1:48   ` Hongtao Liu
  2021-06-29  0:40     ` H.J. Lu
  0 siblings, 1 reply; 12+ messages in thread
From: Hongtao Liu @ 2021-06-28  1:48 UTC (permalink / raw)
  To: H.J. Lu
  Cc: GCC Patches, Uros Bizjak, Jakub Jelinek, Richard Sandiford,
	Richard Biener

On Sun, Jun 27, 2021 at 4:02 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> 1. Update move expanders to convert the CONST_WIDE_INT and CONST_VECTOR
> operands to vector broadcast from an integer with AVX2.
> 2. Add ix86_gen_scratch_sse_rtx to return a scratch SSE register which
> won't increase stack alignment requirement and blocks transformation by
> the combine pass.
>
> A small benchmark:
>
> https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/memset/broadcast
>
> shows that broadcast is a little bit faster on Intel Core i7-8559U:
>
> $ make
> gcc -g -I. -O2   -c -o test.o test.c
> gcc -g   -c -o memory.o memory.S
> gcc -g   -c -o broadcast.o broadcast.S
> gcc -g   -c -o vec_dup_sse2.o vec_dup_sse2.S
> gcc -o test test.o memory.o broadcast.o vec_dup_sse2.o
> ./test
> memory      : 147215
> broadcast   : 121213
> vec_dup_sse2: 171366
> $
>
> broadcast is also smaller:
>
> $ size memory.o broadcast.o
>    text    data     bss     dec     hex filename
>     132       0       0     132      84 memory.o
>     122       0       0     122      7a broadcast.o
> $
>
> 3. Update PR 87767 tests to expect integer broadcast instead of broadcast
> from memory.
> 4. Update avx512f_cond_move.c to expect integer broadcast.
>
> A small benchmark:
>
> https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/vpaddd/broadcast
>
> shows that integer broadcast is faster than embedded memory broadcast:
>
> $ make
> gcc -g -I. -O2 -march=skylake-avx512   -c -o test.o test.c
> gcc -g   -c -o memory.o memory.S
> gcc -g   -c -o broadcast.o broadcast.S
> gcc -o test test.o memory.o broadcast.o
> ./test
> memory      : 425538
> broadcast   : 375260
> $
>
> gcc/
>
>         PR target/100865
>         * config/i386/i386-expand.c (ix86_expand_vector_init_duplicate):
>         New prototype.
>         (ix86_byte_broadcast): New function.
>         (ix86_convert_const_wide_int_to_broadcast): Likewise.
>         (ix86_expand_move): Convert CONST_WIDE_INT to broadcast if mode
>         size is 16 bytes or bigger.
>         (ix86_broadcast_from_integer_constant): New function.
>         (ix86_expand_vector_move): Convert CONST_WIDE_INT and CONST_VECTOR
>         to broadcast if mode size is 16 bytes or bigger.
>         * config/i386/i386-protos.h (ix86_gen_scratch_sse_rtx): New
>         prototype.
>         * config/i386/i386.c (ix86_gen_scratch_sse_rtx): New function.
>
> gcc/testsuite/
>
>         PR target/100865
>         * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Expect integer
>         broadcast.
>         * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
>         * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
>         * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
>         * gcc.target/i386/avx512f_cond_move.c: Also pass
>         -mprefer-vector-width=512 and expect integer broadcast.
>         * gcc.target/i386/pr100865-1.c: New test.
>         * gcc.target/i386/pr100865-2.c: Likewise.
>         * gcc.target/i386/pr100865-3.c: Likewise.
>         * gcc.target/i386/pr100865-4a.c: Likewise.
>         * gcc.target/i386/pr100865-4b.c: Likewise.
>         * gcc.target/i386/pr100865-5a.c: Likewise.
>         * gcc.target/i386/pr100865-5b.c: Likewise.
>         * gcc.target/i386/pr100865-6a.c: Likewise.
>         * gcc.target/i386/pr100865-6b.c: Likewise.
>         * gcc.target/i386/pr100865-6c.c: Likewise.
>         * gcc.target/i386/pr100865-7a.c: Likewise.
>         * gcc.target/i386/pr100865-7b.c: Likewise.
>         * gcc.target/i386/pr100865-7c.c: Likewise.
>         * gcc.target/i386/pr100865-8a.c: Likewise.
>         * gcc.target/i386/pr100865-8b.c: Likewise.
>         * gcc.target/i386/pr100865-9a.c: Likewise.
>         * gcc.target/i386/pr100865-9b.c: Likewise.
>         * gcc.target/i386/pr100865-10a.c: Likewise.
>         * gcc.target/i386/pr100865-10b.c: Likewise.
>         * gcc.target/i386/pr100865-11a.c: Likewise.
>         * gcc.target/i386/pr100865-11b.c: Likewise.
>         * gcc.target/i386/pr100865-12a.c: Likewise.
>         * gcc.target/i386/pr100865-12b.c: Likewise.
> ---
>  gcc/config/i386/i386-expand.c                 | 190 ++++++++++++++++--
>  gcc/config/i386/i386-protos.h                 |   2 +
>  gcc/config/i386/i386.c                        |  13 ++
>  .../i386/avx512f-broadcast-pr87767-1.c        |   7 +-
>  .../i386/avx512f-broadcast-pr87767-5.c        |   5 +-
>  .../gcc.target/i386/avx512f_cond_move.c       |   4 +-
>  .../i386/avx512vl-broadcast-pr87767-1.c       |  12 +-
>  .../i386/avx512vl-broadcast-pr87767-5.c       |   9 +-
>  gcc/testsuite/gcc.target/i386/pr100865-1.c    |  13 ++
>  gcc/testsuite/gcc.target/i386/pr100865-10a.c  |  33 +++
>  gcc/testsuite/gcc.target/i386/pr100865-10b.c  |   7 +
>  gcc/testsuite/gcc.target/i386/pr100865-11a.c  |  23 +++
>  gcc/testsuite/gcc.target/i386/pr100865-11b.c  |   8 +
>  gcc/testsuite/gcc.target/i386/pr100865-12a.c  |  20 ++
>  gcc/testsuite/gcc.target/i386/pr100865-12b.c  |   8 +
>  gcc/testsuite/gcc.target/i386/pr100865-2.c    |  14 ++
>  gcc/testsuite/gcc.target/i386/pr100865-3.c    |  15 ++
>  gcc/testsuite/gcc.target/i386/pr100865-4a.c   |  16 ++
>  gcc/testsuite/gcc.target/i386/pr100865-4b.c   |   9 +
>  gcc/testsuite/gcc.target/i386/pr100865-5a.c   |  16 ++
>  gcc/testsuite/gcc.target/i386/pr100865-5b.c   |   9 +
>  gcc/testsuite/gcc.target/i386/pr100865-6a.c   |  16 ++
>  gcc/testsuite/gcc.target/i386/pr100865-6b.c   |   9 +
>  gcc/testsuite/gcc.target/i386/pr100865-6c.c   |  16 ++
>  gcc/testsuite/gcc.target/i386/pr100865-7a.c   |  17 ++
>  gcc/testsuite/gcc.target/i386/pr100865-7b.c   |   9 +
>  gcc/testsuite/gcc.target/i386/pr100865-7c.c   |  17 ++
>  gcc/testsuite/gcc.target/i386/pr100865-8a.c   |  24 +++
>  gcc/testsuite/gcc.target/i386/pr100865-8b.c   |   7 +
>  gcc/testsuite/gcc.target/i386/pr100865-9a.c   |  25 +++
>  gcc/testsuite/gcc.target/i386/pr100865-9b.c   |   7 +
>  31 files changed, 556 insertions(+), 24 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-10a.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-10b.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-11a.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-11b.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-12a.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-12b.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-4a.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-4b.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-5a.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-5b.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-6a.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-6b.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-6c.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-7a.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-7b.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-7c.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-8a.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-8b.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-9a.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-9b.c
>
> diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> index e9763eb5b3e..e9e89c82764 100644
> --- a/gcc/config/i386/i386-expand.c
> +++ b/gcc/config/i386/i386-expand.c
> @@ -93,6 +93,9 @@ along with GCC; see the file COPYING3.  If not see
>  #include "i386-builtins.h"
>  #include "i386-expand.h"
>
> +static bool ix86_expand_vector_init_duplicate (bool, machine_mode, rtx,
> +                                              rtx);
> +
>  /* Split one or more double-mode RTL references into pairs of half-mode
>     references.  The RTL can be REG, offsettable MEM, integer constant, or
>     CONST_DOUBLE.  "operands" is a pointer to an array of double-mode RTLs to
> @@ -190,6 +193,83 @@ ix86_expand_clear (rtx dest)
>    emit_insn (tmp);
>  }
>
> +/* Return true if V can be broadcasted from an integer of WIDTH bits
> +   which is returned in VAL_BROADCAST.  Otherwise, return false.  */
> +
> +static bool
> +ix86_broadcast (HOST_WIDE_INT v, unsigned int width,
> +               HOST_WIDE_INT &val_broadcast)
> +{
> +  wide_int val = wi::uhwi (v, HOST_BITS_PER_WIDE_INT);
> +  val_broadcast = wi::extract_uhwi (val, 0, width);
> +  for (unsigned int i = width; i < HOST_BITS_PER_WIDE_INT; i += width)
> +    {
> +      HOST_WIDE_INT each = wi::extract_uhwi (val, i, width);
> +      if (val_broadcast != each)
> +       return false;
> +    }
> +  val_broadcast = sext_hwi (val_broadcast, width);
> +  return true;
> +}
> +
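For readers following the logic, here is a minimal standalone sketch of the check that ix86_broadcast performs above, written with plain uint64_t instead of wide_int.  The name is_broadcast and its signature are hypothetical and not part of the patch; WIDTH is assumed to be 8, 16, 32 or 64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for ix86_broadcast.  Return true if the 64-bit
   value V is one WIDTH-bit chunk repeated across all 64 bits, and store
   that chunk, sign-extended, in *ELT.  Assumes WIDTH is 8, 16, 32 or 64.  */
bool
is_broadcast (uint64_t v, unsigned width, int64_t *elt)
{
  /* Mask selecting the low WIDTH bits; avoid shifting by 64.  */
  uint64_t mask = width == 64 ? ~(uint64_t) 0
			      : (((uint64_t) 1 << width) - 1);
  uint64_t first = v & mask;

  /* Every higher chunk must equal the lowest one.  */
  for (unsigned i = width; i < 64; i += width)
    if (((v >> i) & mask) != first)
      return false;

  /* Sign-extend the WIDTH-bit chunk, mirroring the sext_hwi call.  */
  uint64_t sign = (uint64_t) 1 << (width - 1);
  *elt = (int64_t) ((first ^ sign) - sign);
  return true;
}
```

In this model 0x1f1f1f1f1f1f1f1f is a byte-width broadcast of 0x1f, while 0x00000001ffffffff only qualifies at 64 bits, which is why the caller tries QImode, HImode, SImode and DImode in turn.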
> +/* Convert the CONST_WIDE_INT operand OP to broadcast in MODE.  */
> +
> +static rtx
> +ix86_convert_const_wide_int_to_broadcast (machine_mode mode, rtx op)
> +{
> +  /* Don't use integer vector broadcast if we can't move from GPR to SSE
> +     register directly.  */
> +  if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
> +    return nullptr;
> +
> +  /* Convert CONST_WIDE_INT to a non-standard SSE constant integer
> +     broadcast only if vector broadcast is available.  */
> +  if (!(TARGET_AVX2
> +       || (TARGET_AVX
> +           && (GET_MODE_INNER (mode) == SImode
> +               || GET_MODE_INNER (mode) == DImode)))
This won't work for TI/OI/XImode; it may be better to keep just TARGET_AVX
here and move the TARGET_AVX2 check to ... (below)
> +      || !CONST_WIDE_INT_P (op)
> +      || standard_sse_constant_p (op, mode))
> +    return nullptr;
> +
> +  HOST_WIDE_INT val = CONST_WIDE_INT_ELT (op, 0);
> +  HOST_WIDE_INT val_broadcast;
> +  scalar_int_mode broadcast_mode;
> +  if (ix86_broadcast (val, GET_MODE_BITSIZE (QImode),
> +                     val_broadcast))
here..
> +    broadcast_mode = QImode;
> +  else if (ix86_broadcast (val, GET_MODE_BITSIZE (HImode),
> +                          val_broadcast))
> +    broadcast_mode = HImode;
And here.
> +  else if (ix86_broadcast (val, GET_MODE_BITSIZE (SImode),
> +                          val_broadcast))
> +    broadcast_mode = SImode;
> +  else if (TARGET_64BIT
> +          && ix86_broadcast (val, GET_MODE_BITSIZE (DImode),
> +                             val_broadcast))
> +    broadcast_mode = DImode;
> +  else
> +    return nullptr;
> +
> +  /* Check if OP can be broadcasted from VAL.  */
> +  for (int i = 1; i < CONST_WIDE_INT_NUNITS (op); i++)
> +    if (val != CONST_WIDE_INT_ELT (op, i))
> +      return nullptr;
> +
> +  unsigned int nunits = (GET_MODE_SIZE (mode)
> +                        / GET_MODE_SIZE (broadcast_mode));
> +  machine_mode vector_mode;
> +  if (!mode_for_vector (broadcast_mode, nunits).exists (&vector_mode))
> +    gcc_unreachable ();
> +  rtx target = ix86_gen_scratch_sse_rtx (vector_mode);
> +  bool ok = ix86_expand_vector_init_duplicate (false, vector_mode,
> +                                              target,
> +                                              GEN_INT (val_broadcast));
> +  gcc_assert (ok);
> +  target = lowpart_subreg (mode, target, vector_mode);
> +  return target;
> +}
> +
>  void
>  ix86_expand_move (machine_mode mode, rtx operands[])
>  {
> @@ -347,20 +427,29 @@ ix86_expand_move (machine_mode mode, rtx operands[])
>           && optimize)
>         op1 = copy_to_mode_reg (mode, op1);
>
> -      if (can_create_pseudo_p ()
> -         && CONST_DOUBLE_P (op1))
> +      if (can_create_pseudo_p ())
>         {
> -         /* If we are loading a floating point constant to a register,
> -            force the value to memory now, since we'll get better code
> -            out the back end.  */
> +         if (CONST_DOUBLE_P (op1))
> +           {
> +             /* If we are loading a floating point constant to a
> +                register, force the value to memory now, since we'll
> +                get better code out the back end.  */
>
> -         op1 = validize_mem (force_const_mem (mode, op1));
> -         if (!register_operand (op0, mode))
> +             op1 = validize_mem (force_const_mem (mode, op1));
> +             if (!register_operand (op0, mode))
> +               {
> +                 rtx temp = gen_reg_rtx (mode);
> +                 emit_insn (gen_rtx_SET (temp, op1));
> +                 emit_move_insn (op0, temp);
> +                 return;
> +               }
> +           }
> +         else if (GET_MODE_SIZE (mode) >= 16)
>             {
> -             rtx temp = gen_reg_rtx (mode);
> -             emit_insn (gen_rtx_SET (temp, op1));
> -             emit_move_insn (op0, temp);
> -             return;
> +             rtx tmp = ix86_convert_const_wide_int_to_broadcast
> +               (GET_MODE (op0), op1);
> +             if (tmp != nullptr)
> +               op1 = tmp;
>             }
>         }
>      }
> @@ -368,6 +457,54 @@ ix86_expand_move (machine_mode mode, rtx operands[])
>    emit_insn (gen_rtx_SET (op0, op1));
>  }
>
> +static rtx
> +ix86_broadcast_from_integer_constant (machine_mode mode, rtx op)
> +{
> +  int nunits = GET_MODE_NUNITS (mode);
> +  if (nunits < 2)
> +    return nullptr;
> +
> +  /* Don't use integer vector broadcast if we can't move from GPR to SSE
> +     register directly.  */
> +  if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
> +    return nullptr;
> +
> +  /* Don't broadcast from a standard SSE constant integer.  */
> +  if (standard_sse_constant_p (op, mode))
> +    return nullptr;
> +
> +  /* Don't broadcast from a 64-bit integer constant in 32-bit mode.  */
> +  if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT)
> +    return nullptr;
> +
> +  rtx constant = get_pool_constant (XEXP (op, 0));
> +  if (GET_CODE (constant) != CONST_VECTOR)
> +    return nullptr;
> +
> +  /* There could be some rtx like
> +     (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> +     but with "*.LC1" refer to V2DI constant vector.  */
> +  if (GET_MODE (constant) != mode)
> +    {
> +      constant = simplify_subreg (mode, constant, GET_MODE (constant),
> +                                 0);
> +      if (constant == nullptr || GET_CODE (constant) != CONST_VECTOR)
> +       return nullptr;
> +    }
> +
> +  rtx first = XVECEXP (constant, 0, 0);
> +
> +  for (int i = 1; i < nunits; ++i)
> +    {
> +      rtx tmp = XVECEXP (constant, 0, i);
> +      /* Vector duplicate value.  */
> +      if (!rtx_equal_p (tmp, first))
> +       return nullptr;
> +    }
> +
> +  return first;
> +}
> +
>  void
>  ix86_expand_vector_move (machine_mode mode, rtx operands[])
>  {
> @@ -407,7 +544,36 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[])
>           op1 = simplify_gen_subreg (mode, r, imode, SUBREG_BYTE (op1));
>         }
>        else
> -       op1 = validize_mem (force_const_mem (mode, op1));
> +       {
> +         machine_mode mode = GET_MODE (op0);
> +         rtx tmp = ix86_convert_const_wide_int_to_broadcast
> +           (mode, op1);
> +         if (tmp == nullptr)
> +           op1 = validize_mem (force_const_mem (mode, op1));
> +         else
> +           op1 = tmp;
> +       }
> +    }
> +
> +  if (can_create_pseudo_p ()
> +      && GET_MODE_SIZE (mode) >= 16
> +      && GET_MODE_CLASS (mode) == MODE_VECTOR_INT
> +      && (MEM_P (op1)
> +         && SYMBOL_REF_P (XEXP (op1, 0))
> +         && CONSTANT_POOL_ADDRESS_P (XEXP (op1, 0))))
> +    {
> +      rtx first = ix86_broadcast_from_integer_constant (mode, op1);
> +      if (first != nullptr)
> +       {
> +         /* Broadcast to XMM/YMM/ZMM register from an integer
> +            constant.  */
> +         op1 = ix86_gen_scratch_sse_rtx (mode);
> +         bool ok = ix86_expand_vector_init_duplicate (false, mode,
> +                                                      op1, first);
> +         gcc_assert (ok);
> +         emit_move_insn (op0, op1);
> +         return;
> +       }
>      }
>
>    /* We need to check memory alignment for SSE mode since attribute
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index 65fc307dc7b..71745b9a1ea 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -50,6 +50,8 @@ extern void ix86_reset_previous_fndecl (void);
>
>  extern bool ix86_using_red_zone (void);
>
> +extern rtx ix86_gen_scratch_sse_rtx (machine_mode);
> +
>  extern unsigned int ix86_regmode_natural_size (machine_mode);
>  #ifdef RTX_CODE
>  extern int standard_80387_constant_p (rtx);
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index c71c9e666a4..1c167c9a841 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -23126,6 +23126,19 @@ ix86_optab_supported_p (int op, machine_mode mode1, machine_mode,
>      }
>  }
>
> +/* Return a scratch register in MODE for vector load and store.  */
> +
> +rtx
> +ix86_gen_scratch_sse_rtx (machine_mode mode)
> +{
> +  if (TARGET_SSE)
> +    return gen_rtx_REG (mode, (TARGET_64BIT
> +                              ? LAST_REX_SSE_REG
> +                              : LAST_SSE_REG));
> +  else
> +    return gen_reg_rtx (mode);
> +}
> +
>  /* Address space support.
>
>     This is not "far pointers" in the 16-bit sense, but an easy way
> diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> index 0563e696316..a2664d87f29 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> @@ -2,8 +2,11 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -mavx512f -mavx512dq" } */
>  /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
> -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 } }  */
> -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 5 } }  */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 { target { ! ia32 } } } }  */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %zmm\[0-9\]+" 3 } } */
> +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %zmm\[0-9\]+" 3 { target { ! ia32 } } } } */
>
>  typedef int v16si  __attribute__ ((vector_size (64)));
>  typedef long long v8di  __attribute__ ((vector_size (64)));
> diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
> index ffbe95980ca..477f9ca1282 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
> @@ -2,8 +2,9 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -mavx512f" } */
>  /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
> -/* { dg-final { scan-assembler-times "\[^n\n\]*\\\{1to8\\\}" 4 } }  */
> -/* { dg-final { scan-assembler-times "\[^n\n\]*\\\{1to16\\\}" 4 } }  */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %zmm\[0-9\]+" 4 } } */
> +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %zmm\[0-9\]+" 4 { target { ! ia32 } } } } */
>
>  typedef int v16si  __attribute__ ((vector_size (64)));
>  typedef long long v8di  __attribute__ ((vector_size (64)));
> diff --git a/gcc/testsuite/gcc.target/i386/avx512f_cond_move.c b/gcc/testsuite/gcc.target/i386/avx512f_cond_move.c
> index 99a89f51202..ca49a585232 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512f_cond_move.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512f_cond_move.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O3 -mavx512f" } */
> -/* { dg-final { scan-assembler-times "(?:vpblendmd|vmovdqa32)\[ \\t\]+\[^\{\n\]*%zmm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 8 } } */
> +/* { dg-options "-O3 -mavx512f -mprefer-vector-width=512" } */
> +/* { dg-final { scan-assembler-times "(?:vpbroadcastd|vmovdqa32)\[ \\t\]+\[^\{\n\]*%zmm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 8 } } */
>
>  unsigned int x[128];
>  int y[128];
> diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> index c06369d93fd..f8eb99f0b5f 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> @@ -2,9 +2,15 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
>  /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
> -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 5 } }  */
> -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 10 } }  */
> -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 } }  */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 2 { target { ! ia32 } } } }  */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 { target { ! ia32 } } } }  */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 5 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 7 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 3 } } */
> +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 3 } } */
> +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %xmm\[0-9\]+" 3 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %ymm\[0-9\]+" 3 { target { ! ia32 } } } } */
>
>  typedef int v4si  __attribute__ ((vector_size (16)));
>  typedef int v8si  __attribute__ ((vector_size (32)));
> diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
> index 4998a9b8d51..32f6ac81841 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
> @@ -2,9 +2,12 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -mavx512f -mavx512vl" } */
>  /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
> -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 4 } }  */
> -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 8 } }  */
> -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 } }  */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 4 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 4 } } */
> +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 4 } } */
> +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %xmm\[0-9\]+" 4 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %ymm\[0-9\]+" 4 { target { ! ia32 } } } } */
>
>  typedef int v4si  __attribute__ ((vector_size (16)));
>  typedef int v8si  __attribute__ ((vector_size (32)));
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-1.c b/gcc/testsuite/gcc.target/i386/pr100865-1.c
> new file mode 100644
> index 00000000000..6c3097fb2a6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-1.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -march=x86-64" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 3, 16);
> +}
> +
> +/* { dg-final { scan-assembler-times "movdqa\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-10a.c b/gcc/testsuite/gcc.target/i386/pr100865-10a.c
> new file mode 100644
> index 00000000000..7ffc19e56a8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-10a.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O3 -march=skylake" } */
> +
> +extern __int128 array[16];
> +
> +#define MK_CONST128_BROADCAST(A) \
> +  ((((unsigned __int128) (unsigned char) A) << 120) \
> +   | (((unsigned __int128) (unsigned char) A) << 112) \
> +   | (((unsigned __int128) (unsigned char) A) << 104) \
> +   | (((unsigned __int128) (unsigned char) A) << 96) \
> +   | (((unsigned __int128) (unsigned char) A) << 88) \
> +   | (((unsigned __int128) (unsigned char) A) << 80) \
> +   | (((unsigned __int128) (unsigned char) A) << 72) \
> +   | (((unsigned __int128) (unsigned char) A) << 64) \
> +   | (((unsigned __int128) (unsigned char) A) << 56) \
> +   | (((unsigned __int128) (unsigned char) A) << 48) \
> +   | (((unsigned __int128) (unsigned char) A) << 40) \
> +   | (((unsigned __int128) (unsigned char) A) << 32) \
> +   | (((unsigned __int128) (unsigned char) A) << 24) \
> +   | (((unsigned __int128) (unsigned char) A) << 16) \
> +   | (((unsigned __int128) (unsigned char) A) << 8) \
> +   | ((unsigned __int128) (unsigned char) A) )
> +
> +void
> +foo (void)
> +{
> +  int i;
> +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> +    array[i] = MK_CONST128_BROADCAST (0x1f);
> +}
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-10b.c b/gcc/testsuite/gcc.target/i386/pr100865-10b.c
> new file mode 100644
> index 00000000000..edf52765c60
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-10b.c
> @@ -0,0 +1,7 @@
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O3 -march=skylake-avx512" } */
> +
> +#include "pr100865-10a.c"
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-11a.c b/gcc/testsuite/gcc.target/i386/pr100865-11a.c
> new file mode 100644
> index 00000000000..04ce1662f3c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-11a.c
> @@ -0,0 +1,23 @@
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O3 -march=skylake" } */
> +
> +extern __int128 array[16];
> +
> +#define MK_CONST128_BROADCAST(A) \
> +  ((((unsigned __int128) (unsigned long long) A) << 64) \
> +   | ((unsigned __int128) (unsigned long long) A) )
> +
> +#define MK_CONST128_BROADCAST_SIGNED(A) \
> +  ((__int128) MK_CONST128_BROADCAST (A))
> +
> +void
> +foo (void)
> +{
> +  int i;
> +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> +    array[i] = MK_CONST128_BROADCAST_SIGNED (-0x1ffffffffLL);
> +}
> +
> +/* { dg-final { scan-assembler-times "movabsq" 1 } } */
> +/* { dg-final { scan-assembler-times "(?:vpbroadcastq|vpunpcklqdq)\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-11b.c b/gcc/testsuite/gcc.target/i386/pr100865-11b.c
> new file mode 100644
> index 00000000000..12d55b9a642
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-11b.c
> @@ -0,0 +1,8 @@
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O3 -march=skylake-avx512" } */
> +
> +#include "pr100865-11a.c"
> +
> +/* { dg-final { scan-assembler-times "movabsq" 1 } } */
> +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-12a.c b/gcc/testsuite/gcc.target/i386/pr100865-12a.c
> new file mode 100644
> index 00000000000..d4833d44475
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-12a.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O3 -march=skylake" } */
> +
> +extern __int128 array[16];
> +
> +#define MK_CONST128_BROADCAST(A) \
> +  ((((unsigned __int128) (unsigned long long) A) << 64) \
> +   | ((unsigned __int128) (unsigned long long) A) )
> +
> +void
> +foo (void)
> +{
> +  int i;
> +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> +    array[i] = MK_CONST128_BROADCAST (0x1ffffffffLL);
> +}
> +
> +/* { dg-final { scan-assembler-times "movabsq" 1 } } */
> +/* { dg-final { scan-assembler-times "(?:vpbroadcastq|vpunpcklqdq)\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-12b.c b/gcc/testsuite/gcc.target/i386/pr100865-12b.c
> new file mode 100644
> index 00000000000..12d55b9a642
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-12b.c
> @@ -0,0 +1,8 @@
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O3 -march=skylake-avx512" } */
> +
> +#include "pr100865-11a.c"
> +
> +/* { dg-final { scan-assembler-times "movabsq" 1 } } */
> +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-2.c b/gcc/testsuite/gcc.target/i386/pr100865-2.c
> new file mode 100644
> index 00000000000..17efe2d72a3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-2.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -march=skylake" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 3, 16);
> +}
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
> +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-3.c b/gcc/testsuite/gcc.target/i386/pr100865-3.c
> new file mode 100644
> index 00000000000..b6dbcf7809b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-3.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -march=skylake-avx512" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 3, 16);
> +}
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
> +/* { dg-final { scan-assembler-not "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" } } */
> +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-4a.c b/gcc/testsuite/gcc.target/i386/pr100865-4a.c
> new file mode 100644
> index 00000000000..f55883598f9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-4a.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -march=skylake" } */
> +
> +extern char array[64];
> +
> +void
> +foo (void)
> +{
> +  int i;
> +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> +    array[i] = -45;
> +}
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, " 4 } } */
> +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-4b.c b/gcc/testsuite/gcc.target/i386/pr100865-4b.c
> new file mode 100644
> index 00000000000..f41e6147b4c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-4b.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -march=skylake-avx512" } */
> +
> +#include "pr100865-4a.c"
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, " 4 } } */
> +/* { dg-final { scan-assembler-not "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" } } */
> +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-5a.c b/gcc/testsuite/gcc.target/i386/pr100865-5a.c
> new file mode 100644
> index 00000000000..4149797fe81
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-5a.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=skylake" } */
> +
> +extern short array[64];
> +
> +void
> +foo (void)
> +{
> +  int i;
> +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> +    array[i] = -45;
> +}
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 4 } } */
> +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-5b.c b/gcc/testsuite/gcc.target/i386/pr100865-5b.c
> new file mode 100644
> index 00000000000..ded41b680d3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-5b.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=skylake-avx512" } */
> +
> +#include "pr100865-5a.c"
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqu16\[\\t \]%ymm\[0-9\]+, " 4 } } */
> +/* { dg-final { scan-assembler-not "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" } } */
> +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-6a.c b/gcc/testsuite/gcc.target/i386/pr100865-6a.c
> new file mode 100644
> index 00000000000..3fde549a10d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-6a.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=skylake" } */
> +
> +extern int array[64];
> +
> +void
> +foo (void)
> +{
> +  int i;
> +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> +    array[i] = -45;
> +}
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 8 } } */
> +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-6b.c b/gcc/testsuite/gcc.target/i386/pr100865-6b.c
> new file mode 100644
> index 00000000000..44e74c64e55
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-6b.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=skylake-avx512" } */
> +
> +#include "pr100865-6a.c"
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 8 } } */
> +/* { dg-final { scan-assembler-not "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" } } */
> +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-6c.c b/gcc/testsuite/gcc.target/i386/pr100865-6c.c
> new file mode 100644
> index 00000000000..46d31030ce8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-6c.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=skylake -mno-avx2" } */
> +
> +extern int array[64];
> +
> +void
> +foo (void)
> +{
> +  int i;
> +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> +    array[i] = -45;
> +}
> +
> +/* { dg-final { scan-assembler-times "vbroadcastss" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 8 } } */
> +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-7a.c b/gcc/testsuite/gcc.target/i386/pr100865-7a.c
> new file mode 100644
> index 00000000000..f6f2be91120
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-7a.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=skylake" } */
> +
> +extern long long int array[64];
> +
> +void
> +foo (void)
> +{
> +  int i;
> +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> +    array[i] = -45;
> +}
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+\[^\n\]*, %ymm\[0-9\]+" 1 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 16 } } */
> +/* { dg-final { scan-assembler-not "vpbroadcastq" { target ia32 } } } */
> +/* { dg-final { scan-assembler-not "vmovdqa" { target { ! ia32 } } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-7b.c b/gcc/testsuite/gcc.target/i386/pr100865-7b.c
> new file mode 100644
> index 00000000000..0a68820aa32
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-7b.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=skylake-avx512" } */
> +
> +#include "pr100865-7a.c"
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %ymm\[0-9\]+" 1 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+\[^\n\]*, %ymm\[0-9\]+" 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 16 } } */
> +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-7c.c b/gcc/testsuite/gcc.target/i386/pr100865-7c.c
> new file mode 100644
> index 00000000000..4d50bb7e2f6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-7c.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=skylake -mno-avx2" } */
> +
> +extern long long int array[64];
> +
> +void
> +foo (void)
> +{
> +  int i;
> +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> +    array[i] = -45;
> +}
> +
> +/* { dg-final { scan-assembler-times "vbroadcastsd" 1 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 16 } } */
> +/* { dg-final { scan-assembler-not "vbroadcastsd" { target ia32 } } } */
> +/* { dg-final { scan-assembler-not "vmovdqa" { target { ! ia32 } } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-8a.c b/gcc/testsuite/gcc.target/i386/pr100865-8a.c
> new file mode 100644
> index 00000000000..96e9f13204c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-8a.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O3 -march=skylake" } */
> +
> +extern __int128 array[16];
> +
> +#define MK_CONST128_BROADCAST(A) \
> +  ((((unsigned __int128) (unsigned int) A) << 96) \
> +   | (((unsigned __int128) (unsigned int) A) << 64) \
> +   | (((unsigned __int128) (unsigned int) A) << 32) \
> +   | ((unsigned __int128) (unsigned int) A) )
> +
> +#define MK_CONST128_BROADCAST_SIGNED(A) \
> +  ((__int128) MK_CONST128_BROADCAST (A))
> +
> +void
> +foo (void)
> +{
> +  int i;
> +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> +    array[i] = MK_CONST128_BROADCAST_SIGNED (-45);
> +}
> +
> +/* { dg-final { scan-assembler-times "(?:vpbroadcastq|vpshufd)\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-8b.c b/gcc/testsuite/gcc.target/i386/pr100865-8b.c
> new file mode 100644
> index 00000000000..99a10ad83bd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-8b.c
> @@ -0,0 +1,7 @@
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O3 -march=skylake-avx512" } */
> +
> +#include "pr100865-8a.c"
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-9a.c b/gcc/testsuite/gcc.target/i386/pr100865-9a.c
> new file mode 100644
> index 00000000000..45d0e0d0e2e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-9a.c
> @@ -0,0 +1,25 @@
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O3 -march=skylake" } */
> +
> +extern __int128 array[16];
> +
> +#define MK_CONST128_BROADCAST(A) \
> +  ((((unsigned __int128) (unsigned short) A) << 112) \
> +   | (((unsigned __int128) (unsigned short) A) << 96) \
> +   | (((unsigned __int128) (unsigned short) A) << 80) \
> +   | (((unsigned __int128) (unsigned short) A) << 64) \
> +   | (((unsigned __int128) (unsigned short) A) << 48) \
> +   | (((unsigned __int128) (unsigned short) A) << 32) \
> +   | (((unsigned __int128) (unsigned short) A) << 16) \
> +   | ((unsigned __int128) (unsigned short) A) )
> +
> +void
> +foo (void)
> +{
> +  int i;
> +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> +    array[i] = MK_CONST128_BROADCAST (0x1fff);
> +}
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-9b.c b/gcc/testsuite/gcc.target/i386/pr100865-9b.c
> new file mode 100644
> index 00000000000..14696248525
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-9b.c
> @@ -0,0 +1,7 @@
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O3 -march=skylake-avx512" } */
> +
> +#include "pr100865-9a.c"
> +
> +/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> --
> 2.31.1
>

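The new tests build 128-bit constants by replicating a narrower value and then expect a broadcast of the matching element size. As a rough illustration only, the plain C sketch below finds the narrowest element width whose replication reproduces such a constant. The helper name broadcast_elt_bits and the two-halves representation are assumptions made for this example; this is not the code in ix86_convert_const_wide_int_to_broadcast.

```c
#include <stdint.h>

/* Hypothetical helper, not GCC code: return the narrowest element width
   in bits (8, 16, 32 or 64) whose replication reproduces the 128-bit
   constant given as two 64-bit halves, or 0 if the halves differ.  */
static unsigned
broadcast_elt_bits (uint64_t hi, uint64_t lo)
{
  if (hi != lo)
    return 0;

  unsigned width = 64;
  /* Halve the width while both halves of the low element match.  All
     elements are equal at this point, so checking the low one suffices.  */
  while (width > 8)
    {
      unsigned half = width / 2;
      uint64_t mask = (UINT64_C (1) << half) - 1;
      if (((lo >> half) & mask) != (lo & mask))
        break;
      width = half;
    }
  return width;
}
```

Under these assumptions, the pr100865-12a.c constant (0x1ffffffff replicated) gives 64, the pr100865-8a.c constant (-45 as a 32-bit value) gives 32, the pr100865-9a.c constant (0x1fff) gives 16, and a memset byte such as 3 gives 8, which lines up with the q/d/w/b broadcast forms the tests scan for.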

-- 
BR,
Hongtao

* Re: [PATCH v5 2/2] x86: Add vec_duplicate<mode> expander
  2021-06-27 21:00       ` Richard Sandiford
@ 2021-06-28 12:16         ` H.J. Lu
  2021-06-28 12:36           ` Richard Sandiford
  0 siblings, 1 reply; 12+ messages in thread
From: H.J. Lu @ 2021-06-28 12:16 UTC (permalink / raw)
  To: H.J. Lu via Gcc-patches, Uros Bizjak, Jakub Jelinek, Hongtao Liu,
	Richard Biener, H.J. Lu, Richard Sandiford

On Sun, Jun 27, 2021 at 2:00 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> > On Sun, Jun 27, 2021 at 1:43 AM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> "H.J. Lu" <hjl.tools@gmail.com> writes:
> >> > 1. Update vec_duplicate to allow to fail so that backend can only allow
> >> > broadcasting an integer constant to a vector when broadcast instruction
> >> > is available.  This can be used by memset expander to avoid vec_duplicate
> >> > when loading from constant pool is more efficient.
> >>
> >> I don't see any changes in target-independent code though, other than
> >> the doc update.  It's still the case that (existing) uses of
> >> vec_duplicate_optab do not allow it to fail.
> >
> > I have a followup patch set on
> >
> > https://gitlab.com/x86-gcc/gcc/-/commits/users/hjl/pieces/broadcast
> >
> > to use it to expand memset with vector broadcast:
> >
> > https://gitlab.com/x86-gcc/gcc/-/commit/991c87f8a83ca736ae9ed92baa3ebadca289f6e3
> >
> > For SSE2 which doesn't have vector broadcast, the constant vector broadcast
> > expander returns FAIL and load from constant pool will be used.
>
> Hmm, but as Jeff and I mentioned in the earlier replies,
> vec_duplicate_optab shouldn't be used for constants.  Constants
> should go via the move expanders instead.
>
> In a previous message I suggested:
>
>   … would it work to change:
>
>         /* Try using vec_duplicate_optab for uniform vectors.  */
>         if (!TREE_SIDE_EFFECTS (exp)
>             && VECTOR_MODE_P (mode)
>             && eltmode == GET_MODE_INNER (mode)
>             && ((icode = optab_handler (vec_duplicate_optab, mode))
>                 != CODE_FOR_nothing)
>             && (elt = uniform_vector_p (exp)))
>
>   to something like:
>
>         /* Try using vec_duplicate_optab for uniform vectors.  */
>         if (!TREE_SIDE_EFFECTS (exp)
>             && VECTOR_MODE_P (mode)
>             && eltmode == GET_MODE_INNER (mode)
>             && (elt = uniform_vector_p (exp)))
>           {
>             if (TREE_CODE (elt) == INTEGER_CST
>                 || TREE_CODE (elt) == POLY_INT_CST
>                 || TREE_CODE (elt) == REAL_CST
>                 || TREE_CODE (elt) == FIXED_CST)
>               {
>                 rtx src = gen_const_vec_duplicate (mode, expand_normal (node));
>                 emit_move_insn (target, src);
>                 break;
>               }
>             …
>           }
>
> if that code was the source of the constant operand.  If we're adding a
> new use of vec_duplicate_optab then that should be similarly protected
> against constant operands.
>

Your comments apply to my initial vec_duplicate patch that caused the
gcc.dg/pr100239.c failure.  It has been fixed by

commit ffe3a37f54ab866d85bdde48c2a32be5e09d8515
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Jun 7 20:08:13 2021 +0200

    middle-end/100951 - make sure to generate VECTOR_CST in lowering

    When vector lowering creates piecewise ops make sure to create
    VECTOR_CSTs instead of CONSTRUCTORs when possible.

The problem I am running into now is in my memset vector broadcast
patch.  In order to optimize vector broadcast for memset, I need to
generate a pseudo register for

 __builtin_memset (ops, 3, 38);

only when vector broadcast is available:

  rtx target = nullptr;

  unsigned int nunits = GET_MODE_SIZE (mode) / GET_MODE_SIZE (QImode);
  machine_mode vector_mode;
  if (!mode_for_vector (QImode, nunits).exists (&vector_mode))
    gcc_unreachable ();

  enum insn_code icode = optab_handler (vec_duplicate_optab,
                                        vector_mode);
  if (icode != CODE_FOR_nothing)
    {
      rtx reg = targetm.gen_memset_scratch_rtx (vector_mode);
      class expand_operand ops[2];
      create_output_operand (&ops[0], reg, vector_mode);
      create_input_operand (&ops[1], data, QImode);
      if (maybe_expand_insn (icode, 2, ops))
        {
          if (!rtx_equal_p (reg, ops[0].value))
            emit_move_insn (reg, ops[0].value);
          target = lowpart_subreg (mode, reg, vector_mode);
        }
    }

  return target;  /* nullptr means: load from the constant pool.  */

-- 
H.J.

* Re: [PATCH v5 2/2] x86: Add vec_duplicate<mode> expander
  2021-06-28 12:16         ` H.J. Lu
@ 2021-06-28 12:36           ` Richard Sandiford
  2021-06-28 19:38             ` H.J. Lu
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Sandiford @ 2021-06-28 12:36 UTC (permalink / raw)
  To: H.J. Lu
  Cc: H.J. Lu via Gcc-patches, Uros Bizjak, Jakub Jelinek, Hongtao Liu,
	Richard Biener

"H.J. Lu" <hjl.tools@gmail.com> writes:
> On Sun, Jun 27, 2021 at 2:00 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
>> > On Sun, Jun 27, 2021 at 1:43 AM Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> "H.J. Lu" <hjl.tools@gmail.com> writes:
>> >> > 1. Update vec_duplicate to allow to fail so that backend can only allow
>> >> > broadcasting an integer constant to a vector when broadcast instruction
>> >> > is available.  This can be used by memset expander to avoid vec_duplicate
>> >> > when loading from constant pool is more efficient.
>> >>
>> >> I don't see any changes in target-independent code though, other than
>> >> the doc update.  It's still the case that (existing) uses of
>> >> vec_duplicate_optab do not allow it to fail.
>> >
>> > I have a followup patch set on
>> >
>> > https://gitlab.com/x86-gcc/gcc/-/commits/users/hjl/pieces/broadcast
>> >
>> > to use it to expand memset with vector broadcast:
>> >
>> > https://gitlab.com/x86-gcc/gcc/-/commit/991c87f8a83ca736ae9ed92baa3ebadca289f6e3
>> >
>> > For SSE2 which doesn't have vector broadcast, the constant vector broadcast
>> > expander returns FAIL and load from constant pool will be used.
>>
>> Hmm, but as Jeff and I mentioned in the earlier replies,
>> vec_duplicate_optab shouldn't be used for constants.  Constants
>> should go via the move expanders instead.
>>
>> In a previous message I suggested:
>>
>>   … would it work to change:
>>
>>         /* Try using vec_duplicate_optab for uniform vectors.  */
>>         if (!TREE_SIDE_EFFECTS (exp)
>>             && VECTOR_MODE_P (mode)
>>             && eltmode == GET_MODE_INNER (mode)
>>             && ((icode = optab_handler (vec_duplicate_optab, mode))
>>                 != CODE_FOR_nothing)
>>             && (elt = uniform_vector_p (exp)))
>>
>>   to something like:
>>
>>         /* Try using vec_duplicate_optab for uniform vectors.  */
>>         if (!TREE_SIDE_EFFECTS (exp)
>>             && VECTOR_MODE_P (mode)
>>             && eltmode == GET_MODE_INNER (mode)
>>             && (elt = uniform_vector_p (exp)))
>>           {
>>             if (TREE_CODE (elt) == INTEGER_CST
>>                 || TREE_CODE (elt) == POLY_INT_CST
>>                 || TREE_CODE (elt) == REAL_CST
>>                 || TREE_CODE (elt) == FIXED_CST)
>>               {
>>                 rtx src = gen_const_vec_duplicate (mode, expand_normal (node));
>>                 emit_move_insn (target, src);
>>                 break;
>>               }
>>             …
>>           }
>>
>> if that code was the source of the constant operand.  If we're adding a
>> new use of vec_duplicate_optab then that should be similarly protected
>> against constant operands.
>>
>
> Your comments apply to my initial vec_duplicate patch that caused the
> gcc.dg/pr100239.c failure.  It has been fixed by
>
> commit ffe3a37f54ab866d85bdde48c2a32be5e09d8515
> Author: Richard Biener <rguenther@suse.de>
> Date:   Mon Jun 7 20:08:13 2021 +0200
>
>     middle-end/100951 - make sure to generate VECTOR_CST in lowering
>
>     When vector lowering creates piecewise ops make sure to create
>     VECTOR_CSTs instead of CONSTRUCTORs when possible.
>
> The problem I am running into now is in my memset vector broadcast
> patch.  In order to optimize vector broadcast for memset, I need to
> generate a pseudo register for
>
>  __builtin_memset (ops, 3, 38);
>
> only when vector broadcast is available:
>
>   rtx target = nullptr;
>
>   unsigned int nunits = GET_MODE_SIZE (mode) / GET_MODE_SIZE (QImode);
>   machine_mode vector_mode;
>   if (!mode_for_vector (QImode, nunits).exists (&vector_mode))
>     gcc_unreachable ();
>
>   enum insn_code icode = optab_handler (vec_duplicate_optab,
>                                         vector_mode);
>   if (icode != CODE_FOR_nothing)
>     {
>       rtx reg = targetm.gen_memset_scratch_rtx (vector_mode);
>       class expand_operand ops[2];
>       create_output_operand (&ops[0], reg, vector_mode);
>       create_input_operand (&ops[1], data, QImode);
>       if (maybe_expand_insn (icode, 2, ops))
>         {
>           if (!rtx_equal_p (reg, ops[0].value))
>             emit_move_insn (reg, ops[0].value);
>           target = lowpart_subreg (mode, reg, vector_mode);
>         }
>     }
>
>   return target;  /* nullptr means: load from the constant pool.  */

I don't think this is a correct use of vec_duplicate_optab.  If the
scalar operand is a constant then the move should always go through
the move expanders instead, as a move from a CONST_VECTOR.
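To make the rule concrete, here is a toy model in plain C. Every name in it (fill_path, choose_fill_path, the two flags) is invented for illustration and none of it is GCC code; it only encodes the ordering described above: a constant scalar always takes the move-expander path, and vec_duplicate is considered only for non-constant scalars.

```c
#include <stdbool.h>

/* Hypothetical model of the expansion choice; not GCC code.  */
enum fill_path
{
  FILL_CONST_MOVE,    /* Move from a CONST_VECTOR via the move expanders.  */
  FILL_VEC_DUPLICATE, /* Broadcast a non-constant scalar.  */
  FILL_NONE           /* No vector path; the caller falls back.  */
};

static enum fill_path
choose_fill_path (bool scalar_is_constant, bool have_vec_duplicate)
{
  /* Constants never reach vec_duplicate, whether or not it exists.  */
  if (scalar_is_constant)
    return FILL_CONST_MOVE;
  if (have_vec_duplicate)
    return FILL_VEC_DUPLICATE;
  return FILL_NONE;
}
```

In this model the target's move expander, not the caller, decides whether a constant vector is best materialized by a broadcast or by a load from the constant pool.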

Thanks,
Richard

* Re: [PATCH v5 2/2] x86: Add vec_duplicate<mode> expander
  2021-06-28 12:36           ` Richard Sandiford
@ 2021-06-28 19:38             ` H.J. Lu
  2021-06-29  8:17               ` Richard Sandiford
  0 siblings, 1 reply; 12+ messages in thread
From: H.J. Lu @ 2021-06-28 19:38 UTC (permalink / raw)
  To: H.J. Lu via Gcc-patches, Uros Bizjak, Jakub Jelinek, Hongtao Liu,
	Richard Biener, Richard Sandiford

On Mon, Jun 28, 2021 at 5:36 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> "H.J. Lu" <hjl.tools@gmail.com> writes:
> > On Sun, Jun 27, 2021 at 2:00 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> >> > On Sun, Jun 27, 2021 at 1:43 AM Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> >>
> >> >> "H.J. Lu" <hjl.tools@gmail.com> writes:
> >> >> > 1. Update vec_duplicate to allow to fail so that backend can only allow
> >> >> > broadcasting an integer constant to a vector when broadcast instruction
> >> >> > is available.  This can be used by memset expander to avoid vec_duplicate
> >> >> > when loading from constant pool is more efficient.
> >> >>
> >> >> I don't see any changes in target-independent code though, other than
> >> >> the doc update.  It's still the case that (existing) uses of
> >> >> vec_duplicate_optab do not allow it to fail.
> >> >
> >> > I have a followup patch set on
> >> >
> >> > https://gitlab.com/x86-gcc/gcc/-/commits/users/hjl/pieces/broadcast
> >> >
> >> > to use it to expand memset with vector broadcast:
> >> >
> >> > https://gitlab.com/x86-gcc/gcc/-/commit/991c87f8a83ca736ae9ed92baa3ebadca289f6e3
> >> >
> >> > For SSE2 which doesn't have vector broadcast, the constant vector broadcast
> >> > expander returns FAIL and load from constant pool will be used.
> >>
> >> Hmm, but as Jeff and I mentioned in the earlier replies,
> >> vec_duplicate_optab shouldn't be used for constants.  Constants
> >> should go via the move expanders instead.
> >>
> >> In a previous message I suggested:
> >>
> >>   … would it work to change:
> >>
> >>         /* Try using vec_duplicate_optab for uniform vectors.  */
> >>         if (!TREE_SIDE_EFFECTS (exp)
> >>             && VECTOR_MODE_P (mode)
> >>             && eltmode == GET_MODE_INNER (mode)
> >>             && ((icode = optab_handler (vec_duplicate_optab, mode))
> >>                 != CODE_FOR_nothing)
> >>             && (elt = uniform_vector_p (exp)))
> >>
> >>   to something like:
> >>
> >>         /* Try using vec_duplicate_optab for uniform vectors.  */
> >>         if (!TREE_SIDE_EFFECTS (exp)
> >>             && VECTOR_MODE_P (mode)
> >>             && eltmode == GET_MODE_INNER (mode)
> >>             && (elt = uniform_vector_p (exp)))
> >>           {
> >>             if (TREE_CODE (elt) == INTEGER_CST
> >>                 || TREE_CODE (elt) == POLY_INT_CST
> >>                 || TREE_CODE (elt) == REAL_CST
> >>                 || TREE_CODE (elt) == FIXED_CST)
> >>               {
> >>                 rtx src = gen_const_vec_duplicate (mode, expand_normal (node));
> >>                 emit_move_insn (target, src);
> >>                 break;
> >>               }
> >>             …
> >>           }
> >>
> >> if that code was the source of the constant operand.  If we're adding a
> >> new use of vec_duplicate_optab then that should be similarly protected
> >> against constant operands.
> >>
> >
> > Your comments apply to my initial vec_duplicate patch that caused the
> > gcc.dg/pr100239.c failure.  It has been fixed by
> >
> > commit ffe3a37f54ab866d85bdde48c2a32be5e09d8515
> > Author: Richard Biener <rguenther@suse.de>
> > Date:   Mon Jun 7 20:08:13 2021 +0200
> >
> >     middle-end/100951 - make sure to generate VECTOR_CST in lowering
> >
> >     When vector lowering creates piecewise ops make sure to create
> >     VECTOR_CSTs instead of CONSTRUCTORs when possible.
> >
> > The problem I am running into now is in my memset vector broadcast
> > patch.  In order to optimize vector broadcast for memset, I need to
> > generate a pseudo register for
> >
> >  __builtin_memset (ops, 3, 38);
> >
> > only when vector broadcast is available:
> >
> >   rtx target = nullptr;
> >
> >   unsigned int nunits = GET_MODE_SIZE (mode) / GET_MODE_SIZE (QImode);
> >   machine_mode vector_mode;
> >   if (!mode_for_vector (QImode, nunits).exists (&vector_mode))
> >     gcc_unreachable ();
> >
> >   enum insn_code icode = optab_handler (vec_duplicate_optab,
> >                                         vector_mode);
> >   if (icode != CODE_FOR_nothing)
> >     {
> >       rtx reg = targetm.gen_memset_scratch_rtx (vector_mode);
> >       class expand_operand ops[2];
> >       create_output_operand (&ops[0], reg, vector_mode);
> >       create_input_operand (&ops[1], data, QImode);
> >       if (maybe_expand_insn (icode, 2, ops))
> >         {
> >           if (!rtx_equal_p (reg, ops[0].value))
> >             emit_move_insn (reg, ops[0].value);
> >           target = lowpart_subreg (mode, reg, vector_mode);
> >         }
> >     }
> >
> >   return target;  /* nullptr means: load from the constant pool.  */
>
> I don't think this is a correct use of vec_duplicate_optab.  If the
> scalar operand is a constant then the move should always go through
> the move expanders instead, as a move from a CONST_VECTOR.

Like this?

  enum insn_code icode = optab_handler (vec_duplicate_optab,
                                        vector_mode);
  if (icode != CODE_FOR_nothing)
    {
      rtx reg = targetm.gen_memset_scratch_rtx (vector_mode);
      if (CONST_INT_P (data))
        {
          /* Use the move expander with CONST_VECTOR.  */
          rtvec v = rtvec_alloc (nunits);
          for (unsigned int i = 0; i < nunits; i++)
            RTVEC_ELT (v, i) = data;
          rtx const_vec = gen_rtx_CONST_VECTOR (vector_mode, v);
          emit_move_insn (reg, const_vec);
        }
      else
        {
          class expand_operand ops[2];
          create_output_operand (&ops[0], reg, vector_mode);
          create_input_operand (&ops[1], data, QImode);
          expand_insn (icode, 2, ops);
          if (!rtx_equal_p (reg, ops[0].value))
            emit_move_insn (reg, ops[0].value);
        }
      target = lowpart_subreg (mode, reg, vector_mode);
    }


-- 
H.J.

* Re: [PATCH v5 1/2] x86: Convert CONST_WIDE_INT/CONST_VECTOR to broadcast
  2021-06-28  1:48   ` Hongtao Liu
@ 2021-06-29  0:40     ` H.J. Lu
  0 siblings, 0 replies; 12+ messages in thread
From: H.J. Lu @ 2021-06-29  0:40 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: GCC Patches, Uros Bizjak, Jakub Jelinek, Richard Sandiford,
	Richard Biener

On Sun, Jun 27, 2021 at 6:43 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Sun, Jun 27, 2021 at 4:02 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > 1. Update move expanders to convert the CONST_WIDE_INT and CONST_VECTOR
> > operands to vector broadcast from an integer with AVX2.
> > 2. Add ix86_gen_scratch_sse_rtx to return a scratch SSE register which
> > won't increase stack alignment requirement and blocks transformation by
> > the combine pass.
> >
> > A small benchmark:
> >
> > https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/memset/broadcast
> >
> > shows that broadcast is a little bit faster on Intel Core i7-8559U:
> >
> > $ make
> > gcc -g -I. -O2   -c -o test.o test.c
> > gcc -g   -c -o memory.o memory.S
> > gcc -g   -c -o broadcast.o broadcast.S
> > gcc -g   -c -o vec_dup_sse2.o vec_dup_sse2.S
> > gcc -o test test.o memory.o broadcast.o vec_dup_sse2.o
> > ./test
> > memory      : 147215
> > broadcast   : 121213
> > vec_dup_sse2: 171366
> > $
> >
> > broadcast is also smaller:
> >
> > $ size memory.o broadcast.o
> >    text    data     bss     dec     hex filename
> >     132       0       0     132      84 memory.o
> >     122       0       0     122      7a broadcast.o
> > $
> >
> > 3. Update PR 87767 tests to expect integer broadcast instead of broadcast
> > from memory.
> > 4. Update avx512f_cond_move.c to expect integer broadcast.
> >
> > A small benchmark:
> >
> > https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/vpaddd/broadcast
> >
> > shows that integer broadcast is faster than embedded memory broadcast:
> >
> > $ make
> > gcc -g -I. -O2 -march=skylake-avx512   -c -o test.o test.c
> > gcc -g   -c -o memory.o memory.S
> > gcc -g   -c -o broadcast.o broadcast.S
> > gcc -o test test.o memory.o broadcast.o
> > ./test
> > memory      : 425538
> > broadcast   : 375260
> > $
> >
> > gcc/
> >
> >         PR target/100865
> >         * config/i386/i386-expand.c (ix86_expand_vector_init_duplicate):
> >         New prototype.
> >         (ix86_byte_broadcast): New function.
> >         (ix86_convert_const_wide_int_to_broadcast): Likewise.
> >         (ix86_expand_move): Convert CONST_WIDE_INT to broadcast if mode
> >         size is 16 bytes or bigger.
> >         (ix86_broadcast_from_integer_constant): New function.
> >         (ix86_expand_vector_move): Convert CONST_WIDE_INT and CONST_VECTOR
> >         to broadcast if mode size is 16 bytes or bigger.
> >         * config/i386/i386-protos.h (ix86_gen_scratch_sse_rtx): New
> >         prototype.
> >         * config/i386/i386.c (ix86_gen_scratch_sse_rtx): New function.
> >
> > gcc/testsuite/
> >
> >         PR target/100865
> >         * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Expect integer
> >         broadcast.
> >         * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
> >         * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
> >         * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
> >         * gcc.target/i386/avx512f_cond_move.c: Also pass
> >         -mprefer-vector-width=512 and expect integer broadcast.
> >         * gcc.target/i386/pr100865-1.c: New test.
> >         * gcc.target/i386/pr100865-2.c: Likewise.
> >         * gcc.target/i386/pr100865-3.c: Likewise.
> >         * gcc.target/i386/pr100865-4a.c: Likewise.
> >         * gcc.target/i386/pr100865-4b.c: Likewise.
> >         * gcc.target/i386/pr100865-5a.c: Likewise.
> >         * gcc.target/i386/pr100865-5b.c: Likewise.
> >         * gcc.target/i386/pr100865-6a.c: Likewise.
> >         * gcc.target/i386/pr100865-6b.c: Likewise.
> >         * gcc.target/i386/pr100865-6c.c: Likewise.
> >         * gcc.target/i386/pr100865-7a.c: Likewise.
> >         * gcc.target/i386/pr100865-7b.c: Likewise.
> >         * gcc.target/i386/pr100865-7c.c: Likewise.
> >         * gcc.target/i386/pr100865-8a.c: Likewise.
> >         * gcc.target/i386/pr100865-8b.c: Likewise.
> >         * gcc.target/i386/pr100865-9a.c: Likewise.
> >         * gcc.target/i386/pr100865-9b.c: Likewise.
> >         * gcc.target/i386/pr100865-10a.c: Likewise.
> >         * gcc.target/i386/pr100865-10b.c: Likewise.
> >         * gcc.target/i386/pr100865-11a.c: Likewise.
> >         * gcc.target/i386/pr100865-11b.c: Likewise.
> >         * gcc.target/i386/pr100865-12a.c: Likewise.
> >         * gcc.target/i386/pr100865-12b.c: Likewise.
> > ---
> >  gcc/config/i386/i386-expand.c                 | 190 ++++++++++++++++--
> >  gcc/config/i386/i386-protos.h                 |   2 +
> >  gcc/config/i386/i386.c                        |  13 ++
> >  .../i386/avx512f-broadcast-pr87767-1.c        |   7 +-
> >  .../i386/avx512f-broadcast-pr87767-5.c        |   5 +-
> >  .../gcc.target/i386/avx512f_cond_move.c       |   4 +-
> >  .../i386/avx512vl-broadcast-pr87767-1.c       |  12 +-
> >  .../i386/avx512vl-broadcast-pr87767-5.c       |   9 +-
> >  gcc/testsuite/gcc.target/i386/pr100865-1.c    |  13 ++
> >  gcc/testsuite/gcc.target/i386/pr100865-10a.c  |  33 +++
> >  gcc/testsuite/gcc.target/i386/pr100865-10b.c  |   7 +
> >  gcc/testsuite/gcc.target/i386/pr100865-11a.c  |  23 +++
> >  gcc/testsuite/gcc.target/i386/pr100865-11b.c  |   8 +
> >  gcc/testsuite/gcc.target/i386/pr100865-12a.c  |  20 ++
> >  gcc/testsuite/gcc.target/i386/pr100865-12b.c  |   8 +
> >  gcc/testsuite/gcc.target/i386/pr100865-2.c    |  14 ++
> >  gcc/testsuite/gcc.target/i386/pr100865-3.c    |  15 ++
> >  gcc/testsuite/gcc.target/i386/pr100865-4a.c   |  16 ++
> >  gcc/testsuite/gcc.target/i386/pr100865-4b.c   |   9 +
> >  gcc/testsuite/gcc.target/i386/pr100865-5a.c   |  16 ++
> >  gcc/testsuite/gcc.target/i386/pr100865-5b.c   |   9 +
> >  gcc/testsuite/gcc.target/i386/pr100865-6a.c   |  16 ++
> >  gcc/testsuite/gcc.target/i386/pr100865-6b.c   |   9 +
> >  gcc/testsuite/gcc.target/i386/pr100865-6c.c   |  16 ++
> >  gcc/testsuite/gcc.target/i386/pr100865-7a.c   |  17 ++
> >  gcc/testsuite/gcc.target/i386/pr100865-7b.c   |   9 +
> >  gcc/testsuite/gcc.target/i386/pr100865-7c.c   |  17 ++
> >  gcc/testsuite/gcc.target/i386/pr100865-8a.c   |  24 +++
> >  gcc/testsuite/gcc.target/i386/pr100865-8b.c   |   7 +
> >  gcc/testsuite/gcc.target/i386/pr100865-9a.c   |  25 +++
> >  gcc/testsuite/gcc.target/i386/pr100865-9b.c   |   7 +
> >  31 files changed, 556 insertions(+), 24 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-10a.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-10b.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-11a.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-11b.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-12a.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-12b.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-3.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-4a.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-4b.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-5a.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-5b.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-6a.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-6b.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-6c.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-7a.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-7b.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-7c.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-8a.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-8b.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-9a.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr100865-9b.c
> >
> > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> > index e9763eb5b3e..e9e89c82764 100644
> > --- a/gcc/config/i386/i386-expand.c
> > +++ b/gcc/config/i386/i386-expand.c
> > @@ -93,6 +93,9 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "i386-builtins.h"
> >  #include "i386-expand.h"
> >
> > +static bool ix86_expand_vector_init_duplicate (bool, machine_mode, rtx,
> > +                                              rtx);
> > +
> >  /* Split one or more double-mode RTL references into pairs of half-mode
> >     references.  The RTL can be REG, offsettable MEM, integer constant, or
> >     CONST_DOUBLE.  "operands" is a pointer to an array of double-mode RTLs to
> > @@ -190,6 +193,83 @@ ix86_expand_clear (rtx dest)
> >    emit_insn (tmp);
> >  }
> >
> > +/* Return true if V can be broadcasted from an integer of WIDTH bits
> > +   which is returned in VAL_BROADCAST.  Otherwise, return false.  */
> > +
> > +static bool
> > +ix86_broadcast (HOST_WIDE_INT v, unsigned int width,
> > +               HOST_WIDE_INT &val_broadcast)
> > +{
> > +  wide_int val = wi::uhwi (v, HOST_BITS_PER_WIDE_INT);
> > +  val_broadcast = wi::extract_uhwi (val, 0, width);
> > +  for (unsigned int i = width; i < HOST_BITS_PER_WIDE_INT; i += width)
> > +    {
> > +      HOST_WIDE_INT each = wi::extract_uhwi (val, i, width);
> > +      if (val_broadcast != each)
> > +       return false;
> > +    }
> > +  val_broadcast = sext_hwi (val_broadcast, width);
> > +  return true;
> > +}
> > +
> > +/* Convert the CONST_WIDE_INT operand OP to broadcast in MODE.  */
> > +
> > +static rtx
> > +ix86_convert_const_wide_int_to_broadcast (machine_mode mode, rtx op)
> > +{
> > +  /* Don't use integer vector broadcast if we can't move from GPR to SSE
> > +     register directly.  */
> > +  if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
> > +    return nullptr;
> > +
> > +  /* Convert CONST_WIDE_INT to a non-standard SSE constant integer
> > +     broadcast only if vector broadcast is available.  */
> > +  if (!(TARGET_AVX2
> > +       || (TARGET_AVX
> > +           && (GET_MODE_INNER (mode) == SImode
> > +               || GET_MODE_INNER (mode) == DImode)))
> This won't work for TI/OI/XImode; it may be better to keep just TARGET_AVX
> here and move the TARGET_AVX2 check to ... (below)

Fixed in the v6 patch.

Thanks.
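
For readers following the thread without a GCC tree at hand, the check
done by ix86_broadcast in the quoted hunk can be approximated outside
the compiler.  This is only a rough sketch, not the patch itself: the
name is_broadcast and the use of plain uint64_t/int64_t instead of
HOST_WIDE_INT and the wi:: helpers are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical standalone analogue of ix86_broadcast (not GCC code).
   Return true if the 64-bit value V is one WIDTH-bit chunk repeated,
   and store that chunk, sign-extended, in *CHUNK.  */
bool
is_broadcast (uint64_t v, unsigned int width, int64_t *chunk)
{
  assert (width == 8 || width == 16 || width == 32 || width == 64);

  uint64_t mask = width == 64 ? ~UINT64_C (0) : (UINT64_C (1) << width) - 1;
  uint64_t first = v & mask;

  /* Every later WIDTH-bit chunk must match the lowest one.  */
  for (unsigned int i = width; i < 64; i += width)
    if (((v >> i) & mask) != first)
      return false;

  /* Sign-extend the chunk, mirroring the sext_hwi call in the patch.  */
  uint64_t sign = UINT64_C (1) << (width - 1);
  *chunk = (int64_t) ((first ^ sign) - sign);
  return true;
}
```

With this sketch, 0x1f1f1f1f1f1f1f1f at width 8 yields the chunk 31,
while 0x00000001ffffffff is rejected at width 32 and only accepted at
width 64.  The patch tries the widths in the same narrowest-first order
(QImode, HImode, SImode, then DImode).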

> > +      || !CONST_WIDE_INT_P (op)
> > +      || standard_sse_constant_p (op, mode))
> > +    return nullptr;
> > +
> > +  HOST_WIDE_INT val = CONST_WIDE_INT_ELT (op, 0);
> > +  HOST_WIDE_INT val_broadcast;
> > +  scalar_int_mode broadcast_mode;
> > +  if (ix86_broadcast (val, GET_MODE_BITSIZE (QImode),
> > +                     val_broadcast))
> ... here (the TARGET_AVX2 check).
> > +    broadcast_mode = QImode;
> > +  else if (ix86_broadcast (val, GET_MODE_BITSIZE (HImode),
> > +                          val_broadcast))
> > +    broadcast_mode = HImode;
> And here.
> > +  else if (ix86_broadcast (val, GET_MODE_BITSIZE (SImode),
> > +                          val_broadcast))
> > +    broadcast_mode = SImode;
> > +  else if (TARGET_64BIT
> > +          && ix86_broadcast (val, GET_MODE_BITSIZE (DImode),
> > +                             val_broadcast))
> > +    broadcast_mode = DImode;
> > +  else
> > +    return nullptr;
> > +
> > +  /* Check if OP can be broadcasted from VAL.  */
> > +  for (int i = 1; i < CONST_WIDE_INT_NUNITS (op); i++)
> > +    if (val != CONST_WIDE_INT_ELT (op, i))
> > +      return nullptr;
> > +
> > +  unsigned int nunits = (GET_MODE_SIZE (mode)
> > +                        / GET_MODE_SIZE (broadcast_mode));
> > +  machine_mode vector_mode;
> > +  if (!mode_for_vector (broadcast_mode, nunits).exists (&vector_mode))
> > +    gcc_unreachable ();
> > +  rtx target = ix86_gen_scratch_sse_rtx (vector_mode);
> > +  bool ok = ix86_expand_vector_init_duplicate (false, vector_mode,
> > +                                              target,
> > +                                              GEN_INT (val_broadcast));
> > +  gcc_assert (ok);
> > +  target = lowpart_subreg (mode, target, vector_mode);
> > +  return target;
> > +}
> > +
> >  void
> >  ix86_expand_move (machine_mode mode, rtx operands[])
> >  {
> > @@ -347,20 +427,29 @@ ix86_expand_move (machine_mode mode, rtx operands[])
> >           && optimize)
> >         op1 = copy_to_mode_reg (mode, op1);
> >
> > -      if (can_create_pseudo_p ()
> > -         && CONST_DOUBLE_P (op1))
> > +      if (can_create_pseudo_p ())
> >         {
> > -         /* If we are loading a floating point constant to a register,
> > -            force the value to memory now, since we'll get better code
> > -            out the back end.  */
> > +         if (CONST_DOUBLE_P (op1))
> > +           {
> > +             /* If we are loading a floating point constant to a
> > +                register, force the value to memory now, since we'll
> > +                get better code out the back end.  */
> >
> > -         op1 = validize_mem (force_const_mem (mode, op1));
> > -         if (!register_operand (op0, mode))
> > +             op1 = validize_mem (force_const_mem (mode, op1));
> > +             if (!register_operand (op0, mode))
> > +               {
> > +                 rtx temp = gen_reg_rtx (mode);
> > +                 emit_insn (gen_rtx_SET (temp, op1));
> > +                 emit_move_insn (op0, temp);
> > +                 return;
> > +               }
> > +           }
> > +         else if (GET_MODE_SIZE (mode) >= 16)
> >             {
> > -             rtx temp = gen_reg_rtx (mode);
> > -             emit_insn (gen_rtx_SET (temp, op1));
> > -             emit_move_insn (op0, temp);
> > -             return;
> > +             rtx tmp = ix86_convert_const_wide_int_to_broadcast
> > +               (GET_MODE (op0), op1);
> > +             if (tmp != nullptr)
> > +               op1 = tmp;
> >             }
> >         }
> >      }
> > @@ -368,6 +457,54 @@ ix86_expand_move (machine_mode mode, rtx operands[])
> >    emit_insn (gen_rtx_SET (op0, op1));
> >  }
> >
> > +static rtx
> > +ix86_broadcast_from_integer_constant (machine_mode mode, rtx op)
> > +{
> > +  int nunits = GET_MODE_NUNITS (mode);
> > +  if (nunits < 2)
> > +    return nullptr;
> > +
> > +  /* Don't use integer vector broadcast if we can't move from GPR to SSE
> > +     register directly.  */
> > +  if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
> > +    return nullptr;
> > +
> > +  /* Don't broadcast from a standard SSE constant integer.  */
> > +  if (standard_sse_constant_p (op, mode))
> > +    return nullptr;
> > +
> > +  /* Don't broadcast from a 64-bit integer constant in 32-bit mode.  */
> > +  if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT)
> > +    return nullptr;
> > +
> > +  rtx constant = get_pool_constant (XEXP (op, 0));
> > +  if (GET_CODE (constant) != CONST_VECTOR)
> > +    return nullptr;
> > +
> > +  /* There could be some rtx like
> > +     (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> > +     but with "*.LC1" referring to a V2DI constant vector.  */
> > +  if (GET_MODE (constant) != mode)
> > +    {
> > +      constant = simplify_subreg (mode, constant, GET_MODE (constant),
> > +                                 0);
> > +      if (constant == nullptr || GET_CODE (constant) != CONST_VECTOR)
> > +       return nullptr;
> > +    }
> > +
> > +  rtx first = XVECEXP (constant, 0, 0);
> > +
> > +  for (int i = 1; i < nunits; ++i)
> > +    {
> > +      rtx tmp = XVECEXP (constant, 0, i);
> > +      /* Vector duplicate value.  */
> > +      if (!rtx_equal_p (tmp, first))
> > +       return nullptr;
> > +    }
> > +
> > +  return first;
> > +}
> > +
> >  void
> >  ix86_expand_vector_move (machine_mode mode, rtx operands[])
> >  {
> > @@ -407,7 +544,36 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[])
> >           op1 = simplify_gen_subreg (mode, r, imode, SUBREG_BYTE (op1));
> >         }
> >        else
> > -       op1 = validize_mem (force_const_mem (mode, op1));
> > +       {
> > +         machine_mode mode = GET_MODE (op0);
> > +         rtx tmp = ix86_convert_const_wide_int_to_broadcast
> > +           (mode, op1);
> > +         if (tmp == nullptr)
> > +           op1 = validize_mem (force_const_mem (mode, op1));
> > +         else
> > +           op1 = tmp;
> > +       }
> > +    }
> > +
> > +  if (can_create_pseudo_p ()
> > +      && GET_MODE_SIZE (mode) >= 16
> > +      && GET_MODE_CLASS (mode) == MODE_VECTOR_INT
> > +      && (MEM_P (op1)
> > +         && SYMBOL_REF_P (XEXP (op1, 0))
> > +         && CONSTANT_POOL_ADDRESS_P (XEXP (op1, 0))))
> > +    {
> > +      rtx first = ix86_broadcast_from_integer_constant (mode, op1);
> > +      if (first != nullptr)
> > +       {
> > +         /* Broadcast to XMM/YMM/ZMM register from an integer
> > +            constant.  */
> > +         op1 = ix86_gen_scratch_sse_rtx (mode);
> > +         bool ok = ix86_expand_vector_init_duplicate (false, mode,
> > +                                                      op1, first);
> > +         gcc_assert (ok);
> > +         emit_move_insn (op0, op1);
> > +         return;
> > +       }
> >      }
> >
> >    /* We need to check memory alignment for SSE mode since attribute
> > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > index 65fc307dc7b..71745b9a1ea 100644
> > --- a/gcc/config/i386/i386-protos.h
> > +++ b/gcc/config/i386/i386-protos.h
> > @@ -50,6 +50,8 @@ extern void ix86_reset_previous_fndecl (void);
> >
> >  extern bool ix86_using_red_zone (void);
> >
> > +extern rtx ix86_gen_scratch_sse_rtx (machine_mode);
> > +
> >  extern unsigned int ix86_regmode_natural_size (machine_mode);
> >  #ifdef RTX_CODE
> >  extern int standard_80387_constant_p (rtx);
> > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > index c71c9e666a4..1c167c9a841 100644
> > --- a/gcc/config/i386/i386.c
> > +++ b/gcc/config/i386/i386.c
> > @@ -23126,6 +23126,19 @@ ix86_optab_supported_p (int op, machine_mode mode1, machine_mode,
> >      }
> >  }
> >
> > +/* Return a scratch register in MODE for vector load and store.  */
> > +
> > +rtx
> > +ix86_gen_scratch_sse_rtx (machine_mode mode)
> > +{
> > +  if (TARGET_SSE)
> > +    return gen_rtx_REG (mode, (TARGET_64BIT
> > +                              ? LAST_REX_SSE_REG
> > +                              : LAST_SSE_REG));
> > +  else
> > +    return gen_reg_rtx (mode);
> > +}
> > +
> >  /* Address space support.
> >
> >     This is not "far pointers" in the 16-bit sense, but an easy way
> > diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > index 0563e696316..a2664d87f29 100644
> > --- a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > @@ -2,8 +2,11 @@
> >  /* { dg-do compile } */
> >  /* { dg-options "-O2 -mavx512f -mavx512dq" } */
> >  /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
> > -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 } }  */
> > -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 5 } }  */
> > +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 { target { ! ia32 } } } }  */
> > +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 { target ia32 } } } */
> > +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %zmm\[0-9\]+" 3 } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %zmm\[0-9\]+" 3 { target { ! ia32 } } } } */
> >
> >  typedef int v16si  __attribute__ ((vector_size (64)));
> >  typedef long long v8di  __attribute__ ((vector_size (64)));
> > diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
> > index ffbe95980ca..477f9ca1282 100644
> > --- a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
> > +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
> > @@ -2,8 +2,9 @@
> >  /* { dg-do compile } */
> >  /* { dg-options "-O2 -mavx512f" } */
> >  /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
> > -/* { dg-final { scan-assembler-times "\[^n\n\]*\\\{1to8\\\}" 4 } }  */
> > -/* { dg-final { scan-assembler-times "\[^n\n\]*\\\{1to16\\\}" 4 } }  */
> > +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 { target ia32 } } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %zmm\[0-9\]+" 4 } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %zmm\[0-9\]+" 4 { target { ! ia32 } } } } */
> >
> >  typedef int v16si  __attribute__ ((vector_size (64)));
> >  typedef long long v8di  __attribute__ ((vector_size (64)));
> > diff --git a/gcc/testsuite/gcc.target/i386/avx512f_cond_move.c b/gcc/testsuite/gcc.target/i386/avx512f_cond_move.c
> > index 99a89f51202..ca49a585232 100644
> > --- a/gcc/testsuite/gcc.target/i386/avx512f_cond_move.c
> > +++ b/gcc/testsuite/gcc.target/i386/avx512f_cond_move.c
> > @@ -1,6 +1,6 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O3 -mavx512f" } */
> > -/* { dg-final { scan-assembler-times "(?:vpblendmd|vmovdqa32)\[ \\t\]+\[^\{\n\]*%zmm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 8 } } */
> > +/* { dg-options "-O3 -mavx512f -mprefer-vector-width=512" } */
> > +/* { dg-final { scan-assembler-times "(?:vpbroadcastd|vmovdqa32)\[ \\t\]+\[^\{\n\]*%zmm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 8 } } */
> >
> >  unsigned int x[128];
> >  int y[128];
> > diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > index c06369d93fd..f8eb99f0b5f 100644
> > --- a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > +++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > @@ -2,9 +2,15 @@
> >  /* { dg-do compile } */
> >  /* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> >  /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
> > -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 5 } }  */
> > -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 10 } }  */
> > -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 } }  */
> > +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 2 { target { ! ia32 } } } }  */
> > +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 { target { ! ia32 } } } }  */
> > +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 5 { target ia32 } } } */
> > +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 7 { target ia32 } } } */
> > +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 3 } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 3 } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %xmm\[0-9\]+" 3 { target { ! ia32 } } } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %ymm\[0-9\]+" 3 { target { ! ia32 } } } } */
> >
> >  typedef int v4si  __attribute__ ((vector_size (16)));
> >  typedef int v8si  __attribute__ ((vector_size (32)));
> > diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
> > index 4998a9b8d51..32f6ac81841 100644
> > --- a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
> > +++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
> > @@ -2,9 +2,12 @@
> >  /* { dg-do compile } */
> >  /* { dg-options "-O2 -mavx512f -mavx512vl" } */
> >  /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
> > -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 4 } }  */
> > -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 8 } }  */
> > -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 } }  */
> > +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 4 { target ia32 } } } */
> > +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 { target ia32 } } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 4 } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 4 } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %xmm\[0-9\]+" 4 { target { ! ia32 } } } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %ymm\[0-9\]+" 4 { target { ! ia32 } } } } */
> >
> >  typedef int v4si  __attribute__ ((vector_size (16)));
> >  typedef int v8si  __attribute__ ((vector_size (32)));
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-1.c b/gcc/testsuite/gcc.target/i386/pr100865-1.c
> > new file mode 100644
> > index 00000000000..6c3097fb2a6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-1.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile { target { ! ia32 } } } */
> > +/* { dg-options "-O2 -march=x86-64" } */
> > +
> > +extern char *dst;
> > +
> > +void
> > +foo (void)
> > +{
> > +  __builtin_memset (dst, 3, 16);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "movdqa\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> > +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-10a.c b/gcc/testsuite/gcc.target/i386/pr100865-10a.c
> > new file mode 100644
> > index 00000000000..7ffc19e56a8
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-10a.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-do compile { target int128 } } */
> > +/* { dg-options "-O3 -march=skylake" } */
> > +
> > +extern __int128 array[16];
> > +
> > +#define MK_CONST128_BROADCAST(A) \
> > +  ((((unsigned __int128) (unsigned char) A) << 120) \
> > +   | (((unsigned __int128) (unsigned char) A) << 112) \
> > +   | (((unsigned __int128) (unsigned char) A) << 104) \
> > +   | (((unsigned __int128) (unsigned char) A) << 96) \
> > +   | (((unsigned __int128) (unsigned char) A) << 88) \
> > +   | (((unsigned __int128) (unsigned char) A) << 80) \
> > +   | (((unsigned __int128) (unsigned char) A) << 72) \
> > +   | (((unsigned __int128) (unsigned char) A) << 64) \
> > +   | (((unsigned __int128) (unsigned char) A) << 56) \
> > +   | (((unsigned __int128) (unsigned char) A) << 48) \
> > +   | (((unsigned __int128) (unsigned char) A) << 40) \
> > +   | (((unsigned __int128) (unsigned char) A) << 32) \
> > +   | (((unsigned __int128) (unsigned char) A) << 24) \
> > +   | (((unsigned __int128) (unsigned char) A) << 16) \
> > +   | (((unsigned __int128) (unsigned char) A) << 8) \
> > +   | ((unsigned __int128) (unsigned char) A) )
> > +
> > +void
> > +foo (void)
> > +{
> > +  int i;
> > +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> > +    array[i] = MK_CONST128_BROADCAST (0x1f);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-10b.c b/gcc/testsuite/gcc.target/i386/pr100865-10b.c
> > new file mode 100644
> > index 00000000000..edf52765c60
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-10b.c
> > @@ -0,0 +1,7 @@
> > +/* { dg-do compile { target int128 } } */
> > +/* { dg-options "-O3 -march=skylake-avx512" } */
> > +
> > +#include "pr100865-10a.c"
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-11a.c b/gcc/testsuite/gcc.target/i386/pr100865-11a.c
> > new file mode 100644
> > index 00000000000..04ce1662f3c
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-11a.c
> > @@ -0,0 +1,23 @@
> > +/* { dg-do compile { target int128 } } */
> > +/* { dg-options "-O3 -march=skylake" } */
> > +
> > +extern __int128 array[16];
> > +
> > +#define MK_CONST128_BROADCAST(A) \
> > +  ((((unsigned __int128) (unsigned long long) A) << 64) \
> > +   | ((unsigned __int128) (unsigned long long) A) )
> > +
> > +#define MK_CONST128_BROADCAST_SIGNED(A) \
> > +  ((__int128) MK_CONST128_BROADCAST (A))
> > +
> > +void
> > +foo (void)
> > +{
> > +  int i;
> > +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> > +    array[i] = MK_CONST128_BROADCAST_SIGNED (-0x1ffffffffLL);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "movabsq" 1 } } */
> > +/* { dg-final { scan-assembler-times "(?:vpbroadcastq|vpunpcklqdq)\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-11b.c b/gcc/testsuite/gcc.target/i386/pr100865-11b.c
> > new file mode 100644
> > index 00000000000..12d55b9a642
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-11b.c
> > @@ -0,0 +1,8 @@
> > +/* { dg-do compile { target int128 } } */
> > +/* { dg-options "-O3 -march=skylake-avx512" } */
> > +
> > +#include "pr100865-11a.c"
> > +
> > +/* { dg-final { scan-assembler-times "movabsq" 1 } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-12a.c b/gcc/testsuite/gcc.target/i386/pr100865-12a.c
> > new file mode 100644
> > index 00000000000..d4833d44475
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-12a.c
> > @@ -0,0 +1,20 @@
> > +/* { dg-do compile { target int128 } } */
> > +/* { dg-options "-O3 -march=skylake" } */
> > +
> > +extern __int128 array[16];
> > +
> > +#define MK_CONST128_BROADCAST(A) \
> > +  ((((unsigned __int128) (unsigned long long) A) << 64) \
> > +   | ((unsigned __int128) (unsigned long long) A) )
> > +
> > +void
> > +foo (void)
> > +{
> > +  int i;
> > +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> > +    array[i] = MK_CONST128_BROADCAST (0x1ffffffffLL);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "movabsq" 1 } } */
> > +/* { dg-final { scan-assembler-times "(?:vpbroadcastq|vpunpcklqdq)\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-12b.c b/gcc/testsuite/gcc.target/i386/pr100865-12b.c
> > new file mode 100644
> > index 00000000000..12d55b9a642
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-12b.c
> > @@ -0,0 +1,8 @@
> > +/* { dg-do compile { target int128 } } */
> > +/* { dg-options "-O3 -march=skylake-avx512" } */
> > +
> > +#include "pr100865-11a.c"
> > +
> > +/* { dg-final { scan-assembler-times "movabsq" 1 } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-2.c b/gcc/testsuite/gcc.target/i386/pr100865-2.c
> > new file mode 100644
> > index 00000000000..17efe2d72a3
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-2.c
> > @@ -0,0 +1,14 @@
> > +/* { dg-do compile { target { ! ia32 } } } */
> > +/* { dg-options "-O2 -march=skylake" } */
> > +
> > +extern char *dst;
> > +
> > +void
> > +foo (void)
> > +{
> > +  __builtin_memset (dst, 3, 16);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
> > +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-3.c b/gcc/testsuite/gcc.target/i386/pr100865-3.c
> > new file mode 100644
> > index 00000000000..b6dbcf7809b
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-3.c
> > @@ -0,0 +1,15 @@
> > +/* { dg-do compile { target { ! ia32 } } } */
> > +/* { dg-options "-O2 -march=skylake-avx512" } */
> > +
> > +extern char *dst;
> > +
> > +void
> > +foo (void)
> > +{
> > +  __builtin_memset (dst, 3, 16);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
> > +/* { dg-final { scan-assembler-not "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" } } */
> > +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-4a.c b/gcc/testsuite/gcc.target/i386/pr100865-4a.c
> > new file mode 100644
> > index 00000000000..f55883598f9
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-4a.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile { target { ! ia32 } } } */
> > +/* { dg-options "-O2 -march=skylake" } */
> > +
> > +extern char array[64];
> > +
> > +void
> > +foo (void)
> > +{
> > +  int i;
> > +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> > +    array[i] = -45;
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, " 4 } } */
> > +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-4b.c b/gcc/testsuite/gcc.target/i386/pr100865-4b.c
> > new file mode 100644
> > index 00000000000..f41e6147b4c
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-4b.c
> > @@ -0,0 +1,9 @@
> > +/* { dg-do compile { target { ! ia32 } } } */
> > +/* { dg-options "-O2 -march=skylake-avx512" } */
> > +
> > +#include "pr100865-4a.c"
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, " 4 } } */
> > +/* { dg-final { scan-assembler-not "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" } } */
> > +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-5a.c b/gcc/testsuite/gcc.target/i386/pr100865-5a.c
> > new file mode 100644
> > index 00000000000..4149797fe81
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-5a.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=skylake" } */
> > +
> > +extern short array[64];
> > +
> > +void
> > +foo (void)
> > +{
> > +  int i;
> > +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> > +    array[i] = -45;
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 4 } } */
> > +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-5b.c b/gcc/testsuite/gcc.target/i386/pr100865-5b.c
> > new file mode 100644
> > index 00000000000..ded41b680d3
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-5b.c
> > @@ -0,0 +1,9 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=skylake-avx512" } */
> > +
> > +#include "pr100865-5a.c"
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqu16\[\\t \]%ymm\[0-9\]+, " 4 } } */
> > +/* { dg-final { scan-assembler-not "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" } } */
> > +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-6a.c b/gcc/testsuite/gcc.target/i386/pr100865-6a.c
> > new file mode 100644
> > index 00000000000..3fde549a10d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-6a.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=skylake" } */
> > +
> > +extern int array[64];
> > +
> > +void
> > +foo (void)
> > +{
> > +  int i;
> > +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> > +    array[i] = -45;
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 8 } } */
> > +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-6b.c b/gcc/testsuite/gcc.target/i386/pr100865-6b.c
> > new file mode 100644
> > index 00000000000..44e74c64e55
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-6b.c
> > @@ -0,0 +1,9 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=skylake-avx512" } */
> > +
> > +#include "pr100865-6a.c"
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 8 } } */
> > +/* { dg-final { scan-assembler-not "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" } } */
> > +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-6c.c b/gcc/testsuite/gcc.target/i386/pr100865-6c.c
> > new file mode 100644
> > index 00000000000..46d31030ce8
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-6c.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=skylake -mno-avx2" } */
> > +
> > +extern int array[64];
> > +
> > +void
> > +foo (void)
> > +{
> > +  int i;
> > +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> > +    array[i] = -45;
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vbroadcastss" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 8 } } */
> > +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-7a.c b/gcc/testsuite/gcc.target/i386/pr100865-7a.c
> > new file mode 100644
> > index 00000000000..f6f2be91120
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-7a.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=skylake" } */
> > +
> > +extern long long int array[64];
> > +
> > +void
> > +foo (void)
> > +{
> > +  int i;
> > +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> > +    array[i] = -45;
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+\[^\n\]*, %ymm\[0-9\]+" 1 { target { ! ia32 } } } } */
> > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 16 } } */
> > +/* { dg-final { scan-assembler-not "vpbroadcastq" { target ia32 } } } */
> > +/* { dg-final { scan-assembler-not "vmovdqa" { target { ! ia32 } } } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-7b.c b/gcc/testsuite/gcc.target/i386/pr100865-7b.c
> > new file mode 100644
> > index 00000000000..0a68820aa32
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-7b.c
> > @@ -0,0 +1,9 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=skylake-avx512" } */
> > +
> > +#include "pr100865-7a.c"
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %ymm\[0-9\]+" 1 { target { ! ia32 } } } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+\[^\n\]*, %ymm\[0-9\]+" 1 { target ia32 } } } */
> > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 16 } } */
> > +/* { dg-final { scan-assembler-not "vmovdqa" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-7c.c b/gcc/testsuite/gcc.target/i386/pr100865-7c.c
> > new file mode 100644
> > index 00000000000..4d50bb7e2f6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-7c.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=skylake -mno-avx2" } */
> > +
> > +extern long long int array[64];
> > +
> > +void
> > +foo (void)
> > +{
> > +  int i;
> > +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> > +    array[i] = -45;
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vbroadcastsd" 1 { target { ! ia32 } } } } */
> > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 16 } } */
> > +/* { dg-final { scan-assembler-not "vbroadcastsd" { target ia32 } } } */
> > +/* { dg-final { scan-assembler-not "vmovdqa" { target { ! ia32 } } } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-8a.c b/gcc/testsuite/gcc.target/i386/pr100865-8a.c
> > new file mode 100644
> > index 00000000000..96e9f13204c
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-8a.c
> > @@ -0,0 +1,24 @@
> > +/* { dg-do compile { target int128 } } */
> > +/* { dg-options "-O3 -march=skylake" } */
> > +
> > +extern __int128 array[16];
> > +
> > +#define MK_CONST128_BROADCAST(A) \
> > +  ((((unsigned __int128) (unsigned int) A) << 96) \
> > +   | (((unsigned __int128) (unsigned int) A) << 64) \
> > +   | (((unsigned __int128) (unsigned int) A) << 32) \
> > +   | ((unsigned __int128) (unsigned int) A) )
> > +
> > +#define MK_CONST128_BROADCAST_SIGNED(A) \
> > +  ((__int128) MK_CONST128_BROADCAST (A))
> > +
> > +void
> > +foo (void)
> > +{
> > +  int i;
> > +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> > +    array[i] = MK_CONST128_BROADCAST_SIGNED (-45);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "(?:vpbroadcastq|vpshufd)\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-8b.c b/gcc/testsuite/gcc.target/i386/pr100865-8b.c
> > new file mode 100644
> > index 00000000000..99a10ad83bd
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-8b.c
> > @@ -0,0 +1,7 @@
> > +/* { dg-do compile { target int128 } } */
> > +/* { dg-options "-O3 -march=skylake-avx512" } */
> > +
> > +#include "pr100865-8a.c"
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-9a.c b/gcc/testsuite/gcc.target/i386/pr100865-9a.c
> > new file mode 100644
> > index 00000000000..45d0e0d0e2e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-9a.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-do compile { target int128 } } */
> > +/* { dg-options "-O3 -march=skylake" } */
> > +
> > +extern __int128 array[16];
> > +
> > +#define MK_CONST128_BROADCAST(A) \
> > +  ((((unsigned __int128) (unsigned short) A) << 112) \
> > +   | (((unsigned __int128) (unsigned short) A) << 96) \
> > +   | (((unsigned __int128) (unsigned short) A) << 80) \
> > +   | (((unsigned __int128) (unsigned short) A) << 64) \
> > +   | (((unsigned __int128) (unsigned short) A) << 48) \
> > +   | (((unsigned __int128) (unsigned short) A) << 32) \
> > +   | (((unsigned __int128) (unsigned short) A) << 16) \
> > +   | ((unsigned __int128) (unsigned short) A) )
> > +
> > +void
> > +foo (void)
> > +{
> > +  int i;
> > +  for (i = 0; i < sizeof (array) / sizeof (array[0]); i++)
> > +    array[i] = MK_CONST128_BROADCAST (0x1fff);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr100865-9b.c b/gcc/testsuite/gcc.target/i386/pr100865-9b.c
> > new file mode 100644
> > index 00000000000..14696248525
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr100865-9b.c
> > @@ -0,0 +1,7 @@
> > +/* { dg-do compile { target int128 } } */
> > +/* { dg-options "-O3 -march=skylake-avx512" } */
> > +
> > +#include "pr100865-9a.c"
> > +
> > +/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
> > --
> > 2.31.1
> >
>
>
> --
> BR,
> Hongtao



-- 
H.J.


* Re: [PATCH v5 2/2] x86: Add vec_duplicate<mode> expander
  2021-06-28 19:38             ` H.J. Lu
@ 2021-06-29  8:17               ` Richard Sandiford
  0 siblings, 0 replies; 12+ messages in thread
From: Richard Sandiford @ 2021-06-29  8:17 UTC (permalink / raw)
  To: H.J. Lu via Gcc-patches
  Cc: Uros Bizjak, Jakub Jelinek, Hongtao Liu, Richard Biener, H.J. Lu

"H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> On Mon, Jun 28, 2021 at 5:36 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> "H.J. Lu" <hjl.tools@gmail.com> writes:
>> > On Sun, Jun 27, 2021 at 2:00 PM Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
>> >> > On Sun, Jun 27, 2021 at 1:43 AM Richard Sandiford
>> >> > <richard.sandiford@arm.com> wrote:
>> >> >>
>> >> >> "H.J. Lu" <hjl.tools@gmail.com> writes:
>> >> >> > 1. Update vec_duplicate to allow to fail so that backend can only allow
>> >> >> > broadcasting an integer constant to a vector when broadcast instruction
>> >> >> > is available.  This can be used by memset expander to avoid vec_duplicate
>> >> >> > when loading from constant pool is more efficient.
>> >> >>
>> >> >> I don't see any changes in target-independent code though, other than
>> >> >> the doc update.  It's still the case that (existing) uses of
>> >> >> vec_duplicate_optab do not allow it to fail.
>> >> >
>> >> > I have a followup patch set on
>> >> >
>> >> > https://gitlab.com/x86-gcc/gcc/-/commits/users/hjl/pieces/broadcast
>> >> >
>> >> > to use it to expand memset with vector broadcast:
>> >> >
>> >> > https://gitlab.com/x86-gcc/gcc/-/commit/991c87f8a83ca736ae9ed92baa3ebadca289f6e3
>> >> >
>> >> > For SSE2 which doesn't have vector broadcast, the constant vector broadcast
>> >> > expander returns FAIL and load from constant pool will be used.
>> >>
>> >> Hmm, but as Jeff and I mentioned in the earlier replies,
>> >> vec_duplicate_optab shouldn't be used for constants.  Constants
>> >> should go via the move expanders instead.
>> >>
>> >> In a previous message I suggested:
>> >>
>> >>   … would it work to change:
>> >>
>> >>         /* Try using vec_duplicate_optab for uniform vectors.  */
>> >>         if (!TREE_SIDE_EFFECTS (exp)
>> >>             && VECTOR_MODE_P (mode)
>> >>             && eltmode == GET_MODE_INNER (mode)
>> >>             && ((icode = optab_handler (vec_duplicate_optab, mode))
>> >>                 != CODE_FOR_nothing)
>> >>             && (elt = uniform_vector_p (exp)))
>> >>
>> >>   to something like:
>> >>
>> >>         /* Try using vec_duplicate_optab for uniform vectors.  */
>> >>         if (!TREE_SIDE_EFFECTS (exp)
>> >>             && VECTOR_MODE_P (mode)
>> >>             && eltmode == GET_MODE_INNER (mode)
>> >>             && (elt = uniform_vector_p (exp)))
>> >>           {
>> >>             if (TREE_CODE (elt) == INTEGER_CST
>> >>                 || TREE_CODE (elt) == POLY_INT_CST
>> >>                 || TREE_CODE (elt) == REAL_CST
>> >>                 || TREE_CODE (elt) == FIXED_CST)
>> >>               {
>> >>                 rtx src = gen_const_vec_duplicate (mode, expand_normal (node));
>> >>                 emit_move_insn (target, src);
>> >>                 break;
>> >>               }
>> >>             …
>> >>           }
>> >>
>> >> if that code was the source of the constant operand.  If we're adding a
>> >> new use of vec_duplicate_optab then that should be similarly protected
>> >> against constant operands.
>> >>
>> >
>> > Your comments apply to my initial vec_duplicate patch that caused the
>> > gcc.dg/pr100239.c failure.  It has been fixed by
>> >
>> > commit ffe3a37f54ab866d85bdde48c2a32be5e09d8515
>> > Author: Richard Biener <rguenther@suse.de>
>> > Date:   Mon Jun 7 20:08:13 2021 +0200
>> >
>> >     middle-end/100951 - make sure to generate VECTOR_CST in lowering
>> >
>> >     When vector lowering creates piecewise ops make sure to create
>> >     VECTOR_CSTs instead of CONSTRUCTORs when possible.
>> >
>> > The problem I am running into now is in my memset vector broadcast
>> > patch.  In order to optimize vector broadcast for memset, I need to
>> > generate a pseudo register for
>> >
>> >  __builtin_memset (ops, 3, 38);
>> >
>> > only when vector broadcast is available:
>> >
>> >   rtx target = nullptr;
>> >
>> >   unsigned int nunits = GET_MODE_SIZE (mode) / GET_MODE_SIZE (QImode);
>> >   machine_mode vector_mode;
>> >   if (!mode_for_vector (QImode, nunits).exists (&vector_mode))
>> >     gcc_unreachable ();
>> >
>> >   enum insn_code icode = optab_handler (vec_duplicate_optab,
>> >                                         vector_mode);
>> >   if (icode != CODE_FOR_nothing)
>> >     {
>> >       rtx reg = targetm.gen_memset_scratch_rtx (vector_mode);
>> >       class expand_operand ops[2];
>> >       create_output_operand (&ops[0], reg, vector_mode);
>> >       create_input_operand (&ops[1], data, QImode);
>> >       if (maybe_expand_insn (icode, 2, ops))
>> >         {
>> >           if (!rtx_equal_p (reg, ops[0].value))
>> >             emit_move_insn (reg, ops[0].value);
>> >           target = lowpart_subreg (mode, reg, vector_mode);
>> >         }
>> >     }
>> >
>> >   return target;  <<< Return nullptr to load from constant pool.
>>
>> I don't think this is a correct use of vec_duplicate_optab.  If the
>> scalar operand is a constant then the move should always go through
>> the move expanders instead, as a move from a CONST_VECTOR.
>
> Like this?
>
>   enum insn_code icode = optab_handler (vec_duplicate_optab,
>                                         vector_mode);
>   if (icode != CODE_FOR_nothing)
>     {
>       rtx reg = targetm.gen_memset_scratch_rtx (vector_mode);
>       if (CONST_INT_P (data))
>         {
>           /* Use the move expander with CONST_VECTOR.  */
>           rtvec v = rtvec_alloc (nunits);
>           for (unsigned int i = 0; i < nunits; i++)
>             RTVEC_ELT (v, i) = data;
>           rtx const_vec = gen_rtx_CONST_VECTOR (vector_mode, v);

FWIW, this loop can be written as:

        rtx const_vec = gen_const_vec_duplicate (vector_mode, data);

>           emit_move_insn (reg, const_vec);
>         }
>       else
>         {
>
>           class expand_operand ops[2];
>           create_output_operand (&ops[0], reg, vector_mode);
>           create_input_operand (&ops[1], data, QImode);
>           expand_insn (icode, 2, ops);
>           if (!rtx_equal_p (reg, ops[0].value))
>             emit_move_insn (reg, ops[0].value);
>         }
>       target = lowpart_subreg (mode, reg, vector_mode);
>     }

I guess what I don't understand here is why we want to move a
“vector_mode” constant and take the “mode” subreg of the result.
Won't that get folded to a “mode” move anyway?  It seems more natural
to emit it as a “mode” move from the start.

Also, IMO it looks odd to be guarding the constant case with the
optab check.

Perhaps this is more obvious in the context of the wider patch though.

But yeah, the use of the optab itself looks good.

Thanks,
Richard


Thread overview: 12+ messages
2021-06-26 20:02 [PATCH v5 0/2] x86: Convert CONST_WIDE_INT/CONST_VECTOR to broadcast H.J. Lu
2021-06-26 20:02 ` [PATCH v5 1/2] " H.J. Lu
2021-06-28  1:48   ` Hongtao Liu
2021-06-29  0:40     ` H.J. Lu
2021-06-26 20:02 ` [PATCH v5 2/2] x86: Add vec_duplicate<mode> expander H.J. Lu
2021-06-27  8:43   ` Richard Sandiford
2021-06-27 11:29     ` H.J. Lu
2021-06-27 21:00       ` Richard Sandiford
2021-06-28 12:16         ` H.J. Lu
2021-06-28 12:36           ` Richard Sandiford
2021-06-28 19:38             ` H.J. Lu
2021-06-29  8:17               ` Richard Sandiford
