public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
@ 2020-07-09  8:33 Hongtao Liu
  2020-07-10  9:24 ` Hongtao Liu
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Hongtao Liu @ 2020-07-09  8:33 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1200 bytes --]

Hi:
  For a constant vector having one duplicated value, there's no need
to put the whole vector in the constant pool, using embedded broadcast
instead.

  Bootstrap test is Ok, regression test for i386/x86-64 backend is ok.

gcc/ChangeLog:

        PR target/87767
        * config/i386/i386-features.c
        (replace_constant_pool_with_broadcast): New function.
        (constant_pool_broadcast): Ditto.
        (class pass_constant_pool_broadcast): New pass.
        (make_pass_constant_pool_broadcast): Ditto.
        * config/i386/i386-passes.def: Insert new pass after combine.
        * config/i386/i386-protos.h
        (make_pass_constant_pool_broadcast): Declare.
        * config/i386/sse.md (*avx512dq_mul<mode>3<mask_name>_bcst,
        *avx512f_mul<mode>3<mask_name>_bcst): New define_insn.

gcc/testsuite/ChangeLog:

        PR target/87767
        * gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
        * gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
        * gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
        * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
        * gcc.target/i386/pr92865-1.c: Adjust testcase.




-- 
BR,
Hongtao

[-- Attachment #2: 0001-Optimize-memory-broadcast-for-constant-vector-under-.patch --]
[-- Type: text/x-patch, Size: 17898 bytes --]

From b8f49299e3d23f927a659cd394e3099e3291a76f Mon Sep 17 00:00:00 2001
From: liuhongt <hongtao.liu@intel.com>
Date: Wed, 8 Jul 2020 17:14:36 +0800
Subject: [PATCH] Optimize memory broadcast for constant vector under AVX512.

For constant vector having one duplicated value, there's no need to put
whole vector in the constant pool, using embedded broadcast instead.

2020-07-09  Hongtao Liu  <hongtao.liu@intel.com>

gcc/ChangeLog:

	PR target/87767
	* config/i386/i386-features.c
	(replace_constant_pool_with_broadcast): New function.
	(constant_pool_broadcast): Ditto.
	(class pass_constant_pool_broadcast): New pass.
	(make_pass_constant_pool_broadcast): Ditto.
	* config/i386/i386-passes.def: Insert new pass after combine.
	* config/i386/i386-protos.h
	(make_pass_constant_pool_broadcast): Declare.
	* config/i386/sse.md (*avx512dq_mul<mode>3<mask_name>_bcst,
	*avx512f_mul<mode>3<mask_name>_bcst): New define_insn.

gcc/testsuite/ChangeLog:

	PR target/87767
	* gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/pr92865-1.c: Adjust testcase.
---
 gcc/config/i386/i386-features.c               | 146 ++++++++++++++++++
 gcc/config/i386/i386-passes.def               |   1 +
 gcc/config/i386/i386-protos.h                 |   1 +
 gcc/config/i386/sse.md                        |  25 +++
 .../i386/avx2-broadcast-pr87767-1.c           |  40 +++++
 .../i386/avx512f-broadcast-pr87767-1.c        |  66 ++++++++
 .../i386/avx512f-broadcast-pr87767-2.c        |  54 +++++++
 .../i386/avx512vl-broadcast-pr87767-1.c       |  40 +++++
 gcc/testsuite/gcc.target/i386/pr92865-1.c     |   9 +-
 9 files changed, 378 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 535fc7e981d..8f81d101382 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -2379,6 +2379,152 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
   return new pass_remove_partial_avx_dependency (ctxt);
 }
 
+/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
+   with embedded broadcast. i.e.transform
+
+     vpaddq .LC0(%rip), %zmm0, %zmm0
+     ret
+  .LC0:
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+
+    to
+
+     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0
+     ret
+  .LC0:
+    .quad 3  */
+static void
+replace_constant_pool_with_broadcast (rtx_insn* insn)
+{
+  subrtx_ptr_iterator::array_type array;
+  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
+    {
+      rtx *loc = *iter;
+      rtx x = *loc;
+      rtx broadcast_mem, vec_dup, constant, first;
+      machine_mode mode;
+      if (GET_CODE (x) != MEM
+	  || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
+	  || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
+	continue;
+
+      mode = GET_MODE (x);
+      if (!VECTOR_MODE_P (mode))
+	return;
+
+      constant = get_pool_constant (XEXP (x, 0));
+      first = XVECEXP (constant, 0, 0);
+      /* There could be some rtx like
+	 (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
+	 but with "*.LC1" refer to V2DI constant vector.  */
+      if (GET_MODE (constant) != mode)
+	return;
+
+      for (int i = 1; i < GET_MODE_NUNITS (mode); ++i)
+	{
+	  rtx tmp = XVECEXP (constant, 0, i);
+	  /* Only handle one-value const vector.  */
+	  if (!rtx_equal_p (tmp, first))
+	    return;
+	}
+
+      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
+      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
+      *loc = vec_dup;
+      INSN_CODE (insn) = -1;
+      /* Revert change if there's no corresponding pattern.  */
+      if (recog_memoized (insn) < 0)
+      	{
+      	  *loc = x;
+      	  recog_memoized (insn);
+      	}
+      /* At most 1 memory_operand in an insn.  */
+      return;
+    }
+}
+
+/* For const vector having one duplicated value, there's no need to put
+   whole vector in the constant pool when target supports embedded broadcast. */
+static unsigned int
+constant_pool_broadcast (void)
+{
+  timevar_push (TV_MACH_DEP);
+  rtx_insn *insn;
+
+  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
+    {
+      if (!INSN_P (insn))
+	continue;
+
+      /* Insns may appear inside a SEQUENCE.  Only check the patterns of
+	 insns, not any notes that may be attached.  We don't want to mark
+	 a constant just because it happens to appear in a REG_EQUIV note.  */
+      if (rtx_sequence *seq = dyn_cast <rtx_sequence *> (PATTERN (insn)))
+	{
+	  int i, n = seq->len ();
+	  for (i = 0; i < n; ++i)
+	    {
+	      rtx subinsn = seq->element (i);
+	      if (INSN_P (subinsn))
+		replace_constant_pool_with_broadcast (dyn_cast <rtx_insn *> (subinsn));
+	    }
+	}
+      else
+	replace_constant_pool_with_broadcast (insn);
+    }
+  timevar_pop (TV_MACH_DEP);
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_constant_pool_broadcast =
+{
+  RTL_PASS, /* type */
+  "cpb", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_MACH_DEP, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_constant_pool_broadcast : public rtl_opt_pass
+{
+public:
+  pass_constant_pool_broadcast (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return TARGET_AVX512F;
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      return constant_pool_broadcast ();
+    }
+}; // class pass_cpb
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_constant_pool_broadcast (gcc::context *ctxt)
+{
+  return new pass_constant_pool_broadcast (ctxt);
+}
+
 /* This compares the priority of target features in function DECL1
    and DECL2.  It returns positive value if DECL1 is higher priority,
    negative value if DECL2 is higher priority and 0 if they are the
diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
index d83c7b956b1..07ecf8e790f 100644
--- a/gcc/config/i386/i386-passes.def
+++ b/gcc/config/i386/i386-passes.def
@@ -33,3 +33,4 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
 
   INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
+  INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast);
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 7c2ce618f3f..6c6909b41dd 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -386,3 +386,4 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area
   (gcc::context *);
 extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
   (gcc::context *);
+extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *);
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 431571a4bc1..fbfb459c5bf 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -12127,6 +12127,19 @@
    (set_attr "prefix" "evex")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn "*avx512dq_mul<mode>3<mask_name>_bcst"
+  [(set (match_operand:VI8_AVX512VL 0 "register_operand" "=v")
+	(mult:VI8_AVX512VL
+	  (vec_duplicate:VI8_AVX512VL
+	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
+	  (match_operand:VI8_AVX512VL 2 "register_operand" "v")
+))]
+  "TARGET_AVX512DQ"
+  "vpmullq\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
+  [(set_attr "type" "sseimul")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "<sseinsnmode>")])
+
 (define_expand "mul<mode>3<mask_name>"
   [(set (match_operand:VI4_AVX512F 0 "register_operand")
 	(mult:VI4_AVX512F
@@ -12167,6 +12180,18 @@
    (set_attr "btver2_decode" "vector,vector,vector")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn "*avx512f_mul<mode>3<mask_name>_bcst"
+  [(set (match_operand:VI4_AVX512VL 0 "register_operand" "=v")
+	(mult:VI4_AVX512VL
+	  (vec_duplicate:VI4_AVX512VL
+	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
+	  (match_operand:VI4_AVX512VL 2 "register_operand" "v")))]
+  "TARGET_AVX512F"
+   "vpmulld\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
+  [(set_attr "type" "sseimul")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "<sseinsnmode>")])
+
 (define_expand "mul<mode>3"
   [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand")
 	(mult:VI8_AVX2_AVX512F
diff --git a/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..800ef1f957e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
@@ -0,0 +1,40 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx2" } */
+/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
+/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP 101;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..21249bc0cf9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
@@ -0,0 +1,66 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
+/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef int v16si  __attribute__ ((vector_size (64)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef long long v8di  __attribute__ ((vector_size (64)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef float v16sf  __attribute__ ((vector_size (64)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+typedef double v8df  __attribute__ ((vector_size (64)));
+
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP 101;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v16si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v8di, add, +);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v16sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+FOO (v8df, add, +);
+
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v16si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v8di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v16sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
+FOO (v8df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
new file mode 100644
index 00000000000..938346743c2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
@@ -0,0 +1,54 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
+
+#include<stdlib.h>
+#include<stdio.h>
+#include "avx512f-broadcast-pr87767-1.c"
+
+#define TEST(VTYPE, TYPE, N, OP_NAME, OP)		\
+  do							\
+    {							\
+      TYPE exp[N], src[N];				\
+      VTYPE res;					\
+      for (int i = 0; i < N; i++)			\
+	src[i] = i * i * 107;				\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);	\
+      for (int i = 0; i < N; i ++)			\
+	exp[i] = src[i] OP 101;				\
+      for (int j = 0; j < N; j++)			\
+	{						\
+	  if (res[j] != exp[j])				\
+	    abort();					\
+	}						\
+    }							\
+  while (0)
+
+int main()
+{
+  TEST (v4si, int, 4, add, +);
+  TEST (v8si, int, 8, add, +);
+  TEST (v16si, int, 16, add, +);
+  TEST (v2di, long long, 2, add, +);
+  TEST (v4di, long long, 4, add, +);
+  TEST (v8di, long long, 8, add, +);
+  TEST (v4sf, float, 4, add, +);
+  TEST (v8sf, float, 8, add, +);
+  TEST (v16sf, float, 16, add, +);
+  TEST (v2df, double, 2, add, +);
+  TEST (v4df, double, 4, add, +);
+  TEST (v8df, double, 8, add, +);
+
+  TEST (v4si, int, 4, mul, *);
+  TEST (v8si, int, 8, mul, *);
+  TEST (v16si, int, 16, mul, *);
+  TEST (v2di, long long, 2, mul, *);
+  TEST (v4di, long long, 4, mul, *);
+  TEST (v8di, long long, 8, mul, *);
+  TEST (v4sf, float, 4, mul, *);
+  TEST (v8sf, float, 8, mul, *);
+  TEST (v16sf, float, 16, mul, *);
+  TEST (v2df, double, 2, mul, *);
+  TEST (v4df, double, 4, mul, *);
+  TEST (v8df, double, 8, mul, *);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..ec159a68158
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
@@ -0,0 +1,40 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
+/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP 101;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/pr92865-1.c b/gcc/testsuite/gcc.target/i386/pr92865-1.c
index 49b5778a067..a37487d9af7 100644
--- a/gcc/testsuite/gcc.target/i386/pr92865-1.c
+++ b/gcc/testsuite/gcc.target/i386/pr92865-1.c
@@ -3,10 +3,11 @@
 /* { dg-options "-Ofast -mavx512f -mavx512bw -mxop" } */
 /* { dg-final { scan-assembler-times "vpcmp\[bwdq\]\[\t ]" 4 } } */
 /* { dg-final { scan-assembler-times "vpcmpu\[bwdq\]\[\t ]" 4 } } */
-/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 4 } } */
-/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 4 } } *
-/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 4 } } */
-/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 4 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 2 } } *
+/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpbroadcast\[bwqd\]\[\t ]" 16 } } */
 
 extern char arraysb[64];
 extern short arraysw[32];
-- 
2.18.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-07-09  8:33 [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512 Hongtao Liu
@ 2020-07-10  9:24 ` Hongtao Liu
  2020-07-17  7:24   ` Hongtao Liu
  2020-07-23  8:39 ` Jan Hubicka
  2020-08-27 12:24 ` Jakub Jelinek
  2 siblings, 1 reply; 30+ messages in thread
From: Hongtao Liu @ 2020-07-10  9:24 UTC (permalink / raw)
  To: GCC Patches, kirill.yukhin

+ maintainer.
cc H.J

On Thu, Jul 9, 2020 at 4:33 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> Hi:
>   For a constant vector having one duplicated value, there's no need
> to put the whole vector in the constant pool, using embedded broadcast
> instead.
>
>   Bootstrap test is Ok, regression test for i386/x86-64 backend is ok.
>
> gcc/ChangeLog:
>
>         PR target/87767
>         * config/i386/i386-features.c
>         (replace_constant_pool_with_broadcast): New function.
>         (constant_pool_broadcast): Ditto.
>         (class pass_constant_pool_broadcast): New pass.
>         (make_pass_constant_pool_broadcast): Ditto.
>         * config/i386/i386-passes.def: Insert new pass after combine.
>         * config/i386/i386-protos.h
>         (make_pass_constant_pool_broadcast): Declare.
>         * config/i386/sse.md (*avx512dq_mul<mode>3<mask_name>_bcst,
>         *avx512f_mul<mode>3<mask_name>_bcst): New define_insn.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/87767
>         * gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
>         * gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
>         * gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
>         * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
>         * gcc.target/i386/pr92865-1.c: Adjust testcase.
>
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-07-10  9:24 ` Hongtao Liu
@ 2020-07-17  7:24   ` Hongtao Liu
  0 siblings, 0 replies; 30+ messages in thread
From: Hongtao Liu @ 2020-07-17  7:24 UTC (permalink / raw)
  To: GCC Patches, kirill.yukhin

ping!

On Fri, Jul 10, 2020 at 5:24 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> + maintainer.
> cc H.J
>
> On Thu, Jul 9, 2020 at 4:33 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > Hi:
> >   For a constant vector having one duplicated value, there's no need
> > to put the whole vector in the constant pool, using embedded broadcast
> > instead.
> >
> >   Bootstrap test is Ok, regression test for i386/x86-64 backend is ok.
> >
> > gcc/ChangeLog:
> >
> >         PR target/87767
> >         * config/i386/i386-features.c
> >         (replace_constant_pool_with_broadcast): New function.
> >         (constant_pool_broadcast): Ditto.
> >         (class pass_constant_pool_broadcast): New pass.
> >         (make_pass_constant_pool_broadcast): Ditto.
> >         * config/i386/i386-passes.def: Insert new pass after combine.
> >         * config/i386/i386-protos.h
> >         (make_pass_constant_pool_broadcast): Declare.
> >         * config/i386/sse.md (*avx512dq_mul<mode>3<mask_name>_bcst,
> >         *avx512f_mul<mode>3<mask_name>_bcst): New define_insn.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         PR target/87767
> >         * gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
> >         * gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
> >         * gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
> >         * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
> >         * gcc.target/i386/pr92865-1.c: Adjust testcase.
> >
> >
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-07-09  8:33 [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512 Hongtao Liu
  2020-07-10  9:24 ` Hongtao Liu
@ 2020-07-23  8:39 ` Jan Hubicka
  2020-07-23 13:53   ` Hongtao Liu
  2020-08-27 12:24 ` Jakub Jelinek
  2 siblings, 1 reply; 30+ messages in thread
From: Jan Hubicka @ 2020-07-23  8:39 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: GCC Patches

Hello,
sorry for taking so long to get to this.
> diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> index 535fc7e981d..8f81d101382 100644
> --- a/gcc/config/i386/i386-features.c
> +++ b/gcc/config/i386/i386-features.c
> @@ -2379,6 +2379,152 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
>    return new pass_remove_partial_avx_dependency (ctxt);
>  }
>  
> +/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
> +   with embedded broadcast. i.e.transform
> +
> +     vpaddq .LC0(%rip), %zmm0, %zmm0
> +     ret
> +  .LC0:
> +    .quad 3
> +    .quad 3
> +    .quad 3
> +    .quad 3
> +    .quad 3
> +    .quad 3
> +    .quad 3
> +    .quad 3
> +
> +    to
> +
> +     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0

It seems to me that having a special purpose pass for this is bit 
overzelaous.  It seems to me that you can do same pattern matching via 
splitter and fit it into the usual insn splitting pass?

Honza
> +     ret
> +  .LC0:
> +    .quad 3  */
> +static void
> +replace_constant_pool_with_broadcast (rtx_insn* insn)
> +{
> +  subrtx_ptr_iterator::array_type array;
> +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> +    {
> +      rtx *loc = *iter;
> +      rtx x = *loc;
> +      rtx broadcast_mem, vec_dup, constant, first;
> +      machine_mode mode;
> +      if (GET_CODE (x) != MEM
> +	  || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
> +	  || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> +	continue;
> +
> +      mode = GET_MODE (x);
> +      if (!VECTOR_MODE_P (mode))
> +	return;
> +
> +      constant = get_pool_constant (XEXP (x, 0));
> +      first = XVECEXP (constant, 0, 0);
> +      /* There could be some rtx like
> +	 (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> +	 but with "*.LC1" refer to V2DI constant vector.  */
> +      if (GET_MODE (constant) != mode)
> +	return;
> +
> +      for (int i = 1; i < GET_MODE_NUNITS (mode); ++i)
> +	{
> +	  rtx tmp = XVECEXP (constant, 0, i);
> +	  /* Only handle one-value const vector.  */
> +	  if (!rtx_equal_p (tmp, first))
> +	    return;
> +	}
> +
> +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> +      *loc = vec_dup;
> +      INSN_CODE (insn) = -1;
> +      /* Revert change if there's no corresponding pattern.  */
> +      if (recog_memoized (insn) < 0)
> +      	{
> +      	  *loc = x;
> +      	  recog_memoized (insn);
> +      	}
> +      /* At most 1 memory_operand in an insn.  */
> +      return;
> +    }
> +}
> +
> +/* For const vector having one duplicated value, there's no need to put
> +   whole vector in the constant pool when target supports embedded broadcast. */
> +static unsigned int
> +constant_pool_broadcast (void)
> +{
> +  timevar_push (TV_MACH_DEP);
> +  rtx_insn *insn;
> +
> +  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
> +    {
> +      if (!INSN_P (insn))
> +	continue;
> +
> +      /* Insns may appear inside a SEQUENCE.  Only check the patterns of
> +	 insns, not any notes that may be attached.  We don't want to mark
> +	 a constant just because it happens to appear in a REG_EQUIV note.  */
> +      if (rtx_sequence *seq = dyn_cast <rtx_sequence *> (PATTERN (insn)))
> +	{
> +	  int i, n = seq->len ();
> +	  for (i = 0; i < n; ++i)
> +	    {
> +	      rtx subinsn = seq->element (i);
> +	      if (INSN_P (subinsn))
> +		replace_constant_pool_with_broadcast (dyn_cast <rtx_insn *> (subinsn));
> +	    }
> +	}
> +      else
> +	replace_constant_pool_with_broadcast (insn);
> +    }
> +  timevar_pop (TV_MACH_DEP);
> +  return 0;
> +}
> +
> +namespace {
> +
> +const pass_data pass_data_constant_pool_broadcast =
> +{
> +  RTL_PASS, /* type */
> +  "cpb", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_MACH_DEP, /* tv_id */
> +  0, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_df_finish, /* todo_flags_finish */
> +};
> +
> +class pass_constant_pool_broadcast : public rtl_opt_pass
> +{
> +public:
> +  pass_constant_pool_broadcast (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +    {
> +      return TARGET_AVX512F;
> +    }
> +
> +  virtual unsigned int execute (function *)
> +    {
> +      return constant_pool_broadcast ();
> +    }
> +}; // class pass_cpb
> +
> +} // anon namespace
> +
> +rtl_opt_pass *
> +make_pass_constant_pool_broadcast (gcc::context *ctxt)
> +{
> +  return new pass_constant_pool_broadcast (ctxt);
> +}
> +
>  /* This compares the priority of target features in function DECL1
>     and DECL2.  It returns positive value if DECL1 is higher priority,
>     negative value if DECL2 is higher priority and 0 if they are the
> diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
> index d83c7b956b1..07ecf8e790f 100644
> --- a/gcc/config/i386/i386-passes.def
> +++ b/gcc/config/i386/i386-passes.def
> @@ -33,3 +33,4 @@ along with GCC; see the file COPYING3.  If not see
>    INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
>  
>    INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
> +  INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast);
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index 7c2ce618f3f..6c6909b41dd 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -386,3 +386,4 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area
>    (gcc::context *);
>  extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
>    (gcc::context *);
> +extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *);
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 431571a4bc1..fbfb459c5bf 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -12127,6 +12127,19 @@
>     (set_attr "prefix" "evex")
>     (set_attr "mode" "<sseinsnmode>")])
>  
> +(define_insn "*avx512dq_mul<mode>3<mask_name>_bcst"
> +  [(set (match_operand:VI8_AVX512VL 0 "register_operand" "=v")
> +	(mult:VI8_AVX512VL
> +	  (vec_duplicate:VI8_AVX512VL
> +	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> +	  (match_operand:VI8_AVX512VL 2 "register_operand" "v")
> +))]
> +  "TARGET_AVX512DQ"
> +  "vpmullq\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> +  [(set_attr "type" "sseimul")
> +   (set_attr "prefix" "evex")
> +   (set_attr "mode" "<sseinsnmode>")])
> +
>  (define_expand "mul<mode>3<mask_name>"
>    [(set (match_operand:VI4_AVX512F 0 "register_operand")
>  	(mult:VI4_AVX512F
> @@ -12167,6 +12180,18 @@
>     (set_attr "btver2_decode" "vector,vector,vector")
>     (set_attr "mode" "<sseinsnmode>")])
>  
> +(define_insn "*avx512f_mul<mode>3<mask_name>_bcst"
> +  [(set (match_operand:VI4_AVX512VL 0 "register_operand" "=v")
> +	(mult:VI4_AVX512VL
> +	  (vec_duplicate:VI4_AVX512VL
> +	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> +	  (match_operand:VI4_AVX512VL 2 "register_operand" "v")))]
> +  "TARGET_AVX512F"
> +   "vpmulld\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> +  [(set_attr "type" "sseimul")
> +   (set_attr "prefix" "evex")
> +   (set_attr "mode" "<sseinsnmode>")])
> +
>  (define_expand "mul<mode>3"
>    [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand")
>  	(mult:VI8_AVX2_AVX512F
> diff --git a/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> new file mode 100644
> index 00000000000..800ef1f957e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> @@ -0,0 +1,40 @@
> +/* PR target/87767 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx2" } */
> +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> +
> +typedef int v4si  __attribute__ ((vector_size (16)));
> +typedef int v8si  __attribute__ ((vector_size (32)));
> +typedef long long v2di  __attribute__ ((vector_size (16)));
> +typedef long long v4di  __attribute__ ((vector_size (32)));
> +typedef float v4sf  __attribute__ ((vector_size (16)));
> +typedef float v8sf  __attribute__ ((vector_size (32)));
> +typedef double v2df  __attribute__ ((vector_size (16)));
> +typedef double v4df  __attribute__ ((vector_size (32)));
> +
> +#define FOO(VTYPE, OP_NAME, OP)			\
> +VTYPE						\
> + __attribute__ ((noipa))			\
> +foo_##OP_NAME##_##VTYPE (VTYPE a)		\
> +{						\
> +  return a OP 101;				\
> +}						\
> +
> +FOO (v4si, add, +);
> +FOO (v8si, add, +);
> +FOO (v2di, add, +);
> +FOO (v4di, add, +);
> +FOO (v4sf, add, +);
> +FOO (v8sf, add, +);
> +FOO (v2df, add, +);
> +FOO (v4df, add, +);
> +
> +FOO (v4si, mul, *);
> +FOO (v8si, mul, *);
> +FOO (v2di, mul, *);
> +FOO (v4di, mul, *);
> +FOO (v4sf, mul, *);
> +FOO (v8sf, mul, *);
> +FOO (v2df, mul, *);
> +FOO (v4df, mul, *);
> diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> new file mode 100644
> index 00000000000..21249bc0cf9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> @@ -0,0 +1,66 @@
> +/* PR target/87767 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> +
> +typedef int v4si  __attribute__ ((vector_size (16)));
> +typedef int v8si  __attribute__ ((vector_size (32)));
> +typedef int v16si  __attribute__ ((vector_size (64)));
> +typedef long long v2di  __attribute__ ((vector_size (16)));
> +typedef long long v4di  __attribute__ ((vector_size (32)));
> +typedef long long v8di  __attribute__ ((vector_size (64)));
> +typedef float v4sf  __attribute__ ((vector_size (16)));
> +typedef float v8sf  __attribute__ ((vector_size (32)));
> +typedef float v16sf  __attribute__ ((vector_size (64)));
> +typedef double v2df  __attribute__ ((vector_size (16)));
> +typedef double v4df  __attribute__ ((vector_size (32)));
> +typedef double v8df  __attribute__ ((vector_size (64)));
> +
> +#define FOO(VTYPE, OP_NAME, OP)			\
> +VTYPE						\
> + __attribute__ ((noipa))			\
> +foo_##OP_NAME##_##VTYPE (VTYPE a)		\
> +{						\
> +  return a OP 101;				\
> +}						\
> +
> +FOO (v4si, add, +);
> +FOO (v8si, add, +);
> +FOO (v16si, add, +);
> +FOO (v2di, add, +);
> +FOO (v4di, add, +);
> +FOO (v8di, add, +);
> +FOO (v4sf, add, +);
> +FOO (v8sf, add, +);
> +FOO (v16sf, add, +);
> +FOO (v2df, add, +);
> +FOO (v4df, add, +);
> +FOO (v8df, add, +);
> +
> +FOO (v4si, mul, *);
> +FOO (v8si, mul, *);
> +FOO (v16si, mul, *);
> +FOO (v2di, mul, *);
> +FOO (v4di, mul, *);
> +FOO (v8di, mul, *);
> +FOO (v4sf, mul, *);
> +FOO (v8sf, mul, *);
> +FOO (v16sf, mul, *);
> +FOO (v2df, mul, *);
> +FOO (v4df, mul, *);
> +FOO (v8df, mul, *);
> diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> new file mode 100644
> index 00000000000..938346743c2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> @@ -0,0 +1,54 @@
> +/* PR target/87767 */
> +/* { dg-do run } */
> +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> +
> +#include<stdlib.h>
> +#include<stdio.h>
> +#include "avx512f-broadcast-pr87767-1.c"
> +
> +#define TEST(VTYPE, TYPE, N, OP_NAME, OP)		\
> +  do							\
> +    {							\
> +      TYPE exp[N], src[N];				\
> +      VTYPE res;					\
> +      for (int i = 0; i < N; i++)			\
> +	src[i] = i * i * 107;				\
> +      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);	\
> +      for (int i = 0; i < N; i ++)			\
> +	exp[i] = src[i] OP 101;				\
> +      for (int j = 0; j < N; j++)			\
> +	{						\
> +	  if (res[j] != exp[j])				\
> +	    abort();					\
> +	}						\
> +    }							\
> +  while (0)
> +
> +int main()
> +{
> +  TEST (v4si, int, 4, add, +);
> +  TEST (v8si, int, 8, add, +);
> +  TEST (v16si, int, 16, add, +);
> +  TEST (v2di, long long, 2, add, +);
> +  TEST (v4di, long long, 4, add, +);
> +  TEST (v8di, long long, 8, add, +);
> +  TEST (v4sf, float, 4, add, +);
> +  TEST (v8sf, float, 8, add, +);
> +  TEST (v16sf, float, 16, add, +);
> +  TEST (v2df, double, 2, add, +);
> +  TEST (v4df, double, 4, add, +);
> +  TEST (v8df, double, 8, add, +);
> +
> +  TEST (v4si, int, 4, mul, *);
> +  TEST (v8si, int, 8, mul, *);
> +  TEST (v16si, int, 16, mul, *);
> +  TEST (v2di, long long, 2, mul, *);
> +  TEST (v4di, long long, 4, mul, *);
> +  TEST (v8di, long long, 8, mul, *);
> +  TEST (v4sf, float, 4, mul, *);
> +  TEST (v8sf, float, 8, mul, *);
> +  TEST (v16sf, float, 16, mul, *);
> +  TEST (v2df, double, 2, mul, *);
> +  TEST (v4df, double, 4, mul, *);
> +  TEST (v8df, double, 8, mul, *);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> new file mode 100644
> index 00000000000..ec159a68158
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> @@ -0,0 +1,40 @@
> +/* PR target/87767 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx512f" } */
> +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> +
> +typedef int v4si  __attribute__ ((vector_size (16)));
> +typedef int v8si  __attribute__ ((vector_size (32)));
> +typedef long long v2di  __attribute__ ((vector_size (16)));
> +typedef long long v4di  __attribute__ ((vector_size (32)));
> +typedef float v4sf  __attribute__ ((vector_size (16)));
> +typedef float v8sf  __attribute__ ((vector_size (32)));
> +typedef double v2df  __attribute__ ((vector_size (16)));
> +typedef double v4df  __attribute__ ((vector_size (32)));
> +
> +#define FOO(VTYPE, OP_NAME, OP)			\
> +VTYPE						\
> + __attribute__ ((noipa))			\
> +foo_##OP_NAME##_##VTYPE (VTYPE a)		\
> +{						\
> +  return a OP 101;				\
> +}						\
> +
> +FOO (v4si, add, +);
> +FOO (v8si, add, +);
> +FOO (v2di, add, +);
> +FOO (v4di, add, +);
> +FOO (v4sf, add, +);
> +FOO (v8sf, add, +);
> +FOO (v2df, add, +);
> +FOO (v4df, add, +);
> +
> +FOO (v4si, mul, *);
> +FOO (v8si, mul, *);
> +FOO (v2di, mul, *);
> +FOO (v4di, mul, *);
> +FOO (v4sf, mul, *);
> +FOO (v8sf, mul, *);
> +FOO (v2df, mul, *);
> +FOO (v4df, mul, *);
> diff --git a/gcc/testsuite/gcc.target/i386/pr92865-1.c b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> index 49b5778a067..a37487d9af7 100644
> --- a/gcc/testsuite/gcc.target/i386/pr92865-1.c
> +++ b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> @@ -3,10 +3,11 @@
>  /* { dg-options "-Ofast -mavx512f -mavx512bw -mxop" } */
>  /* { dg-final { scan-assembler-times "vpcmp\[bwdq\]\[\t ]" 4 } } */
>  /* { dg-final { scan-assembler-times "vpcmpu\[bwdq\]\[\t ]" 4 } } */
> -/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 4 } } */
> -/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 4 } } *
> -/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 4 } } */
> -/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 4 } } */
> +/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 2 } } *
> +/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpbroadcast\[bwqd\]\[\t ]" 16 } } */
>  
>  extern char arraysb[64];
>  extern short arraysw[32];
> -- 
> 2.18.1
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-07-23  8:39 ` Jan Hubicka
@ 2020-07-23 13:53   ` Hongtao Liu
  2020-07-24  2:37     ` Hongtao Liu
  0 siblings, 1 reply; 30+ messages in thread
From: Hongtao Liu @ 2020-07-23 13:53 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches

On Thu, Jul 23, 2020 at 4:39 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> Hello,
> sorry for taking so long to get to this.
> > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > index 535fc7e981d..8f81d101382 100644
> > --- a/gcc/config/i386/i386-features.c
> > +++ b/gcc/config/i386/i386-features.c
> > @@ -2379,6 +2379,152 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
> >    return new pass_remove_partial_avx_dependency (ctxt);
> >  }
> >
> > +/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
> > +   with embedded broadcast. i.e.transform
> > +
> > +     vpaddq .LC0(%rip), %zmm0, %zmm0
> > +     ret
> > +  .LC0:
> > +    .quad 3
> > +    .quad 3
> > +    .quad 3
> > +    .quad 3
> > +    .quad 3
> > +    .quad 3
> > +    .quad 3
> > +    .quad 3
> > +
> > +    to
> > +
> > +     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0
>
> It seems to me that having a special purpose pass for this is bit
> overzelaous.  It seems to me that you can do same pattern matching via
> splitter and fit it into the usual insn splitting pass?
>

From an implementation perspective, there could be lots of work, since
memory embedding broadcast is available for nearly every instruction
in AVX512. And for new added AVX512 instructions, we also need to add
a define_split for them.

> Honza
> > +     ret
> > +  .LC0:
> > +    .quad 3  */
> > +static void
> > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > +{
> > +  subrtx_ptr_iterator::array_type array;
> > +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > +    {
> > +      rtx *loc = *iter;
> > +      rtx x = *loc;
> > +      rtx broadcast_mem, vec_dup, constant, first;
> > +      machine_mode mode;
> > +      if (GET_CODE (x) != MEM
> > +       || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
> > +       || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > +     continue;
> > +
> > +      mode = GET_MODE (x);
> > +      if (!VECTOR_MODE_P (mode))
> > +     return;
> > +
> > +      constant = get_pool_constant (XEXP (x, 0));
> > +      first = XVECEXP (constant, 0, 0);
> > +      /* There could be some rtx like
> > +      (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> > +      but with "*.LC1" refer to V2DI constant vector.  */
> > +      if (GET_MODE (constant) != mode)
> > +     return;
> > +
> > +      for (int i = 1; i < GET_MODE_NUNITS (mode); ++i)
> > +     {
> > +       rtx tmp = XVECEXP (constant, 0, i);
> > +       /* Only handle one-value const vector.  */
> > +       if (!rtx_equal_p (tmp, first))
> > +         return;
> > +     }
> > +
> > +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > +      *loc = vec_dup;
> > +      INSN_CODE (insn) = -1;
> > +      /* Revert change if there's no corresponding pattern.  */
> > +      if (recog_memoized (insn) < 0)
> > +             {
> > +               *loc = x;
> > +               recog_memoized (insn);
> > +             }
> > +      /* At most 1 memory_operand in an insn.  */
> > +      return;
> > +    }
> > +}
> > +
> > +/* For const vector having one duplicated value, there's no need to put
> > +   whole vector in the constant pool when target supports embedded broadcast. */
> > +static unsigned int
> > +constant_pool_broadcast (void)
> > +{
> > +  timevar_push (TV_MACH_DEP);
> > +  rtx_insn *insn;
> > +
> > +  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
> > +    {
> > +      if (!INSN_P (insn))
> > +     continue;
> > +
> > +      /* Insns may appear inside a SEQUENCE.  Only check the patterns of
> > +      insns, not any notes that may be attached.  We don't want to mark
> > +      a constant just because it happens to appear in a REG_EQUIV note.  */
> > +      if (rtx_sequence *seq = dyn_cast <rtx_sequence *> (PATTERN (insn)))
> > +     {
> > +       int i, n = seq->len ();
> > +       for (i = 0; i < n; ++i)
> > +         {
> > +           rtx subinsn = seq->element (i);
> > +           if (INSN_P (subinsn))
> > +             replace_constant_pool_with_broadcast (dyn_cast <rtx_insn *> (subinsn));
> > +         }
> > +     }
> > +      else
> > +     replace_constant_pool_with_broadcast (insn);
> > +    }
> > +  timevar_pop (TV_MACH_DEP);
> > +  return 0;
> > +}
> > +
> > +namespace {
> > +
> > +const pass_data pass_data_constant_pool_broadcast =
> > +{
> > +  RTL_PASS, /* type */
> > +  "cpb", /* name */
> > +  OPTGROUP_NONE, /* optinfo_flags */
> > +  TV_MACH_DEP, /* tv_id */
> > +  0, /* properties_required */
> > +  0, /* properties_provided */
> > +  0, /* properties_destroyed */
> > +  0, /* todo_flags_start */
> > +  TODO_df_finish, /* todo_flags_finish */
> > +};
> > +
> > +class pass_constant_pool_broadcast : public rtl_opt_pass
> > +{
> > +public:
> > +  pass_constant_pool_broadcast (gcc::context *ctxt)
> > +    : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt)
> > +  {}
> > +
> > +  /* opt_pass methods: */
> > +  virtual bool gate (function *)
> > +    {
> > +      return TARGET_AVX512F;
> > +    }
> > +
> > +  virtual unsigned int execute (function *)
> > +    {
> > +      return constant_pool_broadcast ();
> > +    }
> > +}; // class pass_cpb
> > +
> > +} // anon namespace
> > +
> > +rtl_opt_pass *
> > +make_pass_constant_pool_broadcast (gcc::context *ctxt)
> > +{
> > +  return new pass_constant_pool_broadcast (ctxt);
> > +}
> > +
> >  /* This compares the priority of target features in function DECL1
> >     and DECL2.  It returns positive value if DECL1 is higher priority,
> >     negative value if DECL2 is higher priority and 0 if they are the
> > diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
> > index d83c7b956b1..07ecf8e790f 100644
> > --- a/gcc/config/i386/i386-passes.def
> > +++ b/gcc/config/i386/i386-passes.def
> > @@ -33,3 +33,4 @@ along with GCC; see the file COPYING3.  If not see
> >    INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
> >
> >    INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
> > +  INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast);
> > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > index 7c2ce618f3f..6c6909b41dd 100644
> > --- a/gcc/config/i386/i386-protos.h
> > +++ b/gcc/config/i386/i386-protos.h
> > @@ -386,3 +386,4 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area
> >    (gcc::context *);
> >  extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
> >    (gcc::context *);
> > +extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *);
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index 431571a4bc1..fbfb459c5bf 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -12127,6 +12127,19 @@
> >     (set_attr "prefix" "evex")
> >     (set_attr "mode" "<sseinsnmode>")])
> >
> > +(define_insn "*avx512dq_mul<mode>3<mask_name>_bcst"
> > +  [(set (match_operand:VI8_AVX512VL 0 "register_operand" "=v")
> > +     (mult:VI8_AVX512VL
> > +       (vec_duplicate:VI8_AVX512VL
> > +         (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> > +       (match_operand:VI8_AVX512VL 2 "register_operand" "v")
> > +))]
> > +  "TARGET_AVX512DQ"
> > +  "vpmullq\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> > +  [(set_attr "type" "sseimul")
> > +   (set_attr "prefix" "evex")
> > +   (set_attr "mode" "<sseinsnmode>")])
> > +
> >  (define_expand "mul<mode>3<mask_name>"
> >    [(set (match_operand:VI4_AVX512F 0 "register_operand")
> >       (mult:VI4_AVX512F
> > @@ -12167,6 +12180,18 @@
> >     (set_attr "btver2_decode" "vector,vector,vector")
> >     (set_attr "mode" "<sseinsnmode>")])
> >
> > +(define_insn "*avx512f_mul<mode>3<mask_name>_bcst"
> > +  [(set (match_operand:VI4_AVX512VL 0 "register_operand" "=v")
> > +     (mult:VI4_AVX512VL
> > +       (vec_duplicate:VI4_AVX512VL
> > +         (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> > +       (match_operand:VI4_AVX512VL 2 "register_operand" "v")))]
> > +  "TARGET_AVX512F"
> > +   "vpmulld\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> > +  [(set_attr "type" "sseimul")
> > +   (set_attr "prefix" "evex")
> > +   (set_attr "mode" "<sseinsnmode>")])
> > +
> >  (define_expand "mul<mode>3"
> >    [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand")
> >       (mult:VI8_AVX2_AVX512F
> > diff --git a/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> > new file mode 100644
> > index 00000000000..800ef1f957e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> > @@ -0,0 +1,40 @@
> > +/* PR target/87767 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mavx2" } */
> > +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> > +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> > +
> > +typedef int v4si  __attribute__ ((vector_size (16)));
> > +typedef int v8si  __attribute__ ((vector_size (32)));
> > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > +typedef double v2df  __attribute__ ((vector_size (16)));
> > +typedef double v4df  __attribute__ ((vector_size (32)));
> > +
> > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > +VTYPE                                                \
> > + __attribute__ ((noipa))                     \
> > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > +{                                            \
> > +  return a OP 101;                           \
> > +}                                            \
> > +
> > +FOO (v4si, add, +);
> > +FOO (v8si, add, +);
> > +FOO (v2di, add, +);
> > +FOO (v4di, add, +);
> > +FOO (v4sf, add, +);
> > +FOO (v8sf, add, +);
> > +FOO (v2df, add, +);
> > +FOO (v4df, add, +);
> > +
> > +FOO (v4si, mul, *);
> > +FOO (v8si, mul, *);
> > +FOO (v2di, mul, *);
> > +FOO (v4di, mul, *);
> > +FOO (v4sf, mul, *);
> > +FOO (v8sf, mul, *);
> > +FOO (v2df, mul, *);
> > +FOO (v4df, mul, *);
> > diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > new file mode 100644
> > index 00000000000..21249bc0cf9
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > @@ -0,0 +1,66 @@
> > +/* PR target/87767 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > +
> > +typedef int v4si  __attribute__ ((vector_size (16)));
> > +typedef int v8si  __attribute__ ((vector_size (32)));
> > +typedef int v16si  __attribute__ ((vector_size (64)));
> > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > +typedef long long v8di  __attribute__ ((vector_size (64)));
> > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > +typedef float v16sf  __attribute__ ((vector_size (64)));
> > +typedef double v2df  __attribute__ ((vector_size (16)));
> > +typedef double v4df  __attribute__ ((vector_size (32)));
> > +typedef double v8df  __attribute__ ((vector_size (64)));
> > +
> > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > +VTYPE                                                \
> > + __attribute__ ((noipa))                     \
> > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > +{                                            \
> > +  return a OP 101;                           \
> > +}                                            \
> > +
> > +FOO (v4si, add, +);
> > +FOO (v8si, add, +);
> > +FOO (v16si, add, +);
> > +FOO (v2di, add, +);
> > +FOO (v4di, add, +);
> > +FOO (v8di, add, +);
> > +FOO (v4sf, add, +);
> > +FOO (v8sf, add, +);
> > +FOO (v16sf, add, +);
> > +FOO (v2df, add, +);
> > +FOO (v4df, add, +);
> > +FOO (v8df, add, +);
> > +
> > +FOO (v4si, mul, *);
> > +FOO (v8si, mul, *);
> > +FOO (v16si, mul, *);
> > +FOO (v2di, mul, *);
> > +FOO (v4di, mul, *);
> > +FOO (v8di, mul, *);
> > +FOO (v4sf, mul, *);
> > +FOO (v8sf, mul, *);
> > +FOO (v16sf, mul, *);
> > +FOO (v2df, mul, *);
> > +FOO (v4df, mul, *);
> > +FOO (v8df, mul, *);
> > diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> > new file mode 100644
> > index 00000000000..938346743c2
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> > @@ -0,0 +1,54 @@
> > +/* PR target/87767 */
> > +/* { dg-do run } */
> > +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> > +
> > +#include<stdlib.h>
> > +#include<stdio.h>
> > +#include "avx512f-broadcast-pr87767-1.c"
> > +
> > +#define TEST(VTYPE, TYPE, N, OP_NAME, OP)            \
> > +  do                                                 \
> > +    {                                                        \
> > +      TYPE exp[N], src[N];                           \
> > +      VTYPE res;                                     \
> > +      for (int i = 0; i < N; i++)                    \
> > +     src[i] = i * i * 107;                           \
> > +      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);      \
> > +      for (int i = 0; i < N; i ++)                   \
> > +     exp[i] = src[i] OP 101;                         \
> > +      for (int j = 0; j < N; j++)                    \
> > +     {                                               \
> > +       if (res[j] != exp[j])                         \
> > +         abort();                                    \
> > +     }                                               \
> > +    }                                                        \
> > +  while (0)
> > +
> > +int main()
> > +{
> > +  TEST (v4si, int, 4, add, +);
> > +  TEST (v8si, int, 8, add, +);
> > +  TEST (v16si, int, 16, add, +);
> > +  TEST (v2di, long long, 2, add, +);
> > +  TEST (v4di, long long, 4, add, +);
> > +  TEST (v8di, long long, 8, add, +);
> > +  TEST (v4sf, float, 4, add, +);
> > +  TEST (v8sf, float, 8, add, +);
> > +  TEST (v16sf, float, 16, add, +);
> > +  TEST (v2df, double, 2, add, +);
> > +  TEST (v4df, double, 4, add, +);
> > +  TEST (v8df, double, 8, add, +);
> > +
> > +  TEST (v4si, int, 4, mul, *);
> > +  TEST (v8si, int, 8, mul, *);
> > +  TEST (v16si, int, 16, mul, *);
> > +  TEST (v2di, long long, 2, mul, *);
> > +  TEST (v4di, long long, 4, mul, *);
> > +  TEST (v8di, long long, 8, mul, *);
> > +  TEST (v4sf, float, 4, mul, *);
> > +  TEST (v8sf, float, 8, mul, *);
> > +  TEST (v16sf, float, 16, mul, *);
> > +  TEST (v2df, double, 2, mul, *);
> > +  TEST (v4df, double, 4, mul, *);
> > +  TEST (v8df, double, 8, mul, *);
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > new file mode 100644
> > index 00000000000..ec159a68158
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > @@ -0,0 +1,40 @@
> > +/* PR target/87767 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mavx512f" } */
> > +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> > +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> > +
> > +typedef int v4si  __attribute__ ((vector_size (16)));
> > +typedef int v8si  __attribute__ ((vector_size (32)));
> > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > +typedef double v2df  __attribute__ ((vector_size (16)));
> > +typedef double v4df  __attribute__ ((vector_size (32)));
> > +
> > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > +VTYPE                                                \
> > + __attribute__ ((noipa))                     \
> > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > +{                                            \
> > +  return a OP 101;                           \
> > +}                                            \
> > +
> > +FOO (v4si, add, +);
> > +FOO (v8si, add, +);
> > +FOO (v2di, add, +);
> > +FOO (v4di, add, +);
> > +FOO (v4sf, add, +);
> > +FOO (v8sf, add, +);
> > +FOO (v2df, add, +);
> > +FOO (v4df, add, +);
> > +
> > +FOO (v4si, mul, *);
> > +FOO (v8si, mul, *);
> > +FOO (v2di, mul, *);
> > +FOO (v4di, mul, *);
> > +FOO (v4sf, mul, *);
> > +FOO (v8sf, mul, *);
> > +FOO (v2df, mul, *);
> > +FOO (v4df, mul, *);
> > diff --git a/gcc/testsuite/gcc.target/i386/pr92865-1.c b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > index 49b5778a067..a37487d9af7 100644
> > --- a/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > +++ b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > @@ -3,10 +3,11 @@
> >  /* { dg-options "-Ofast -mavx512f -mavx512bw -mxop" } */
> >  /* { dg-final { scan-assembler-times "vpcmp\[bwdq\]\[\t ]" 4 } } */
> >  /* { dg-final { scan-assembler-times "vpcmpu\[bwdq\]\[\t ]" 4 } } */
> > -/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 4 } } */
> > -/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 4 } } *
> > -/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 4 } } */
> > -/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 4 } } */
> > +/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 2 } } */
> > +/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 2 } } *
> > +/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 2 } } */
> > +/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 2 } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcast\[bwqd\]\[\t ]" 16 } } */
> >
> >  extern char arraysb[64];
> >  extern short arraysw[32];
> > --
> > 2.18.1
> >
>


-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-07-23 13:53   ` Hongtao Liu
@ 2020-07-24  2:37     ` Hongtao Liu
  2020-08-04  6:05       ` Hongtao Liu
  0 siblings, 1 reply; 30+ messages in thread
From: Hongtao Liu @ 2020-07-24  2:37 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches

On Thu, Jul 23, 2020 at 9:53 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Thu, Jul 23, 2020 at 4:39 PM Jan Hubicka <hubicka@ucw.cz> wrote:
> >
> > Hello,
> > sorry for taking so long to get to this.
> > > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > > index 535fc7e981d..8f81d101382 100644
> > > --- a/gcc/config/i386/i386-features.c
> > > +++ b/gcc/config/i386/i386-features.c
> > > @@ -2379,6 +2379,152 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
> > >    return new pass_remove_partial_avx_dependency (ctxt);
> > >  }
> > >
> > > +/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
> > > +   with embedded broadcast. i.e.transform
> > > +
> > > +     vpaddq .LC0(%rip), %zmm0, %zmm0
> > > +     ret
> > > +  .LC0:
> > > +    .quad 3
> > > +    .quad 3
> > > +    .quad 3
> > > +    .quad 3
> > > +    .quad 3
> > > +    .quad 3
> > > +    .quad 3
> > > +    .quad 3
> > > +
> > > +    to
> > > +
> > > +     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0
> >
> > It seems to me that having a special purpose pass for this is bit
> > overzelaous.  It seems to me that you can do same pattern matching via
> > splitter and fit it into the usual insn splitting pass?
> >
>
> From an implementation perspective, there could be lots of work, since
> memory embedding broadcast is available for nearly every instruction
> in AVX512. And for new added AVX512 instructions, we also need to add
> a define_split for them.
>

I'll add more tests to show my point.

> > Honza
> > > +     ret
> > > +  .LC0:
> > > +    .quad 3  */
> > > +static void
> > > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > > +{
> > > +  subrtx_ptr_iterator::array_type array;
> > > +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > > +    {
> > > +      rtx *loc = *iter;
> > > +      rtx x = *loc;
> > > +      rtx broadcast_mem, vec_dup, constant, first;
> > > +      machine_mode mode;
> > > +      if (GET_CODE (x) != MEM
> > > +       || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
> > > +       || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > > +     continue;
> > > +
> > > +      mode = GET_MODE (x);
> > > +      if (!VECTOR_MODE_P (mode))
> > > +     return;
> > > +
> > > +      constant = get_pool_constant (XEXP (x, 0));
> > > +      first = XVECEXP (constant, 0, 0);
> > > +      /* There could be some rtx like
> > > +      (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> > > +      but with "*.LC1" refer to V2DI constant vector.  */
> > > +      if (GET_MODE (constant) != mode)
> > > +     return;
> > > +
> > > +      for (int i = 1; i < GET_MODE_NUNITS (mode); ++i)
> > > +     {
> > > +       rtx tmp = XVECEXP (constant, 0, i);
> > > +       /* Only handle one-value const vector.  */
> > > +       if (!rtx_equal_p (tmp, first))
> > > +         return;
> > > +     }
> > > +
> > > +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > > +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > > +      *loc = vec_dup;
> > > +      INSN_CODE (insn) = -1;
> > > +      /* Revert change if there's no corresponding pattern.  */
> > > +      if (recog_memoized (insn) < 0)
> > > +             {
> > > +               *loc = x;
> > > +               recog_memoized (insn);
> > > +             }
> > > +      /* At most 1 memory_operand in an insn.  */
> > > +      return;
> > > +    }
> > > +}
> > > +
> > > +/* For const vector having one duplicated value, there's no need to put
> > > +   whole vector in the constant pool when target supports embedded broadcast. */
> > > +static unsigned int
> > > +constant_pool_broadcast (void)
> > > +{
> > > +  timevar_push (TV_MACH_DEP);
> > > +  rtx_insn *insn;
> > > +
> > > +  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
> > > +    {
> > > +      if (!INSN_P (insn))
> > > +     continue;
> > > +
> > > +      /* Insns may appear inside a SEQUENCE.  Only check the patterns of
> > > +      insns, not any notes that may be attached.  We don't want to mark
> > > +      a constant just because it happens to appear in a REG_EQUIV note.  */
> > > +      if (rtx_sequence *seq = dyn_cast <rtx_sequence *> (PATTERN (insn)))
> > > +     {
> > > +       int i, n = seq->len ();
> > > +       for (i = 0; i < n; ++i)
> > > +         {
> > > +           rtx subinsn = seq->element (i);
> > > +           if (INSN_P (subinsn))
> > > +             replace_constant_pool_with_broadcast (dyn_cast <rtx_insn *> (subinsn));
> > > +         }
> > > +     }
> > > +      else
> > > +     replace_constant_pool_with_broadcast (insn);
> > > +    }
> > > +  timevar_pop (TV_MACH_DEP);
> > > +  return 0;
> > > +}
> > > +
> > > +namespace {
> > > +
> > > +const pass_data pass_data_constant_pool_broadcast =
> > > +{
> > > +  RTL_PASS, /* type */
> > > +  "cpb", /* name */
> > > +  OPTGROUP_NONE, /* optinfo_flags */
> > > +  TV_MACH_DEP, /* tv_id */
> > > +  0, /* properties_required */
> > > +  0, /* properties_provided */
> > > +  0, /* properties_destroyed */
> > > +  0, /* todo_flags_start */
> > > +  TODO_df_finish, /* todo_flags_finish */
> > > +};
> > > +
> > > +class pass_constant_pool_broadcast : public rtl_opt_pass
> > > +{
> > > +public:
> > > +  pass_constant_pool_broadcast (gcc::context *ctxt)
> > > +    : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt)
> > > +  {}
> > > +
> > > +  /* opt_pass methods: */
> > > +  virtual bool gate (function *)
> > > +    {
> > > +      return TARGET_AVX512F;
> > > +    }
> > > +
> > > +  virtual unsigned int execute (function *)
> > > +    {
> > > +      return constant_pool_broadcast ();
> > > +    }
> > > +}; // class pass_cpb
> > > +
> > > +} // anon namespace
> > > +
> > > +rtl_opt_pass *
> > > +make_pass_constant_pool_broadcast (gcc::context *ctxt)
> > > +{
> > > +  return new pass_constant_pool_broadcast (ctxt);
> > > +}
> > > +
> > >  /* This compares the priority of target features in function DECL1
> > >     and DECL2.  It returns positive value if DECL1 is higher priority,
> > >     negative value if DECL2 is higher priority and 0 if they are the
> > > diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
> > > index d83c7b956b1..07ecf8e790f 100644
> > > --- a/gcc/config/i386/i386-passes.def
> > > +++ b/gcc/config/i386/i386-passes.def
> > > @@ -33,3 +33,4 @@ along with GCC; see the file COPYING3.  If not see
> > >    INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
> > >
> > >    INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
> > > +  INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast);
> > > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > > index 7c2ce618f3f..6c6909b41dd 100644
> > > --- a/gcc/config/i386/i386-protos.h
> > > +++ b/gcc/config/i386/i386-protos.h
> > > @@ -386,3 +386,4 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area
> > >    (gcc::context *);
> > >  extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
> > >    (gcc::context *);
> > > +extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *);
> > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > index 431571a4bc1..fbfb459c5bf 100644
> > > --- a/gcc/config/i386/sse.md
> > > +++ b/gcc/config/i386/sse.md
> > > @@ -12127,6 +12127,19 @@
> > >     (set_attr "prefix" "evex")
> > >     (set_attr "mode" "<sseinsnmode>")])
> > >
> > > +(define_insn "*avx512dq_mul<mode>3<mask_name>_bcst"
> > > +  [(set (match_operand:VI8_AVX512VL 0 "register_operand" "=v")
> > > +     (mult:VI8_AVX512VL
> > > +       (vec_duplicate:VI8_AVX512VL
> > > +         (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> > > +       (match_operand:VI8_AVX512VL 2 "register_operand" "v")
> > > +))]
> > > +  "TARGET_AVX512DQ"
> > > +  "vpmullq\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> > > +  [(set_attr "type" "sseimul")
> > > +   (set_attr "prefix" "evex")
> > > +   (set_attr "mode" "<sseinsnmode>")])
> > > +
> > >  (define_expand "mul<mode>3<mask_name>"
> > >    [(set (match_operand:VI4_AVX512F 0 "register_operand")
> > >       (mult:VI4_AVX512F
> > > @@ -12167,6 +12180,18 @@
> > >     (set_attr "btver2_decode" "vector,vector,vector")
> > >     (set_attr "mode" "<sseinsnmode>")])
> > >
> > > +(define_insn "*avx512f_mul<mode>3<mask_name>_bcst"
> > > +  [(set (match_operand:VI4_AVX512VL 0 "register_operand" "=v")
> > > +     (mult:VI4_AVX512VL
> > > +       (vec_duplicate:VI4_AVX512VL
> > > +         (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> > > +       (match_operand:VI4_AVX512VL 2 "register_operand" "v")))]
> > > +  "TARGET_AVX512F"
> > > +   "vpmulld\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> > > +  [(set_attr "type" "sseimul")
> > > +   (set_attr "prefix" "evex")
> > > +   (set_attr "mode" "<sseinsnmode>")])
> > > +
> > >  (define_expand "mul<mode>3"
> > >    [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand")
> > >       (mult:VI8_AVX2_AVX512F
> > > diff --git a/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> > > new file mode 100644
> > > index 00000000000..800ef1f957e
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> > > @@ -0,0 +1,40 @@
> > > +/* PR target/87767 */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mavx2" } */
> > > +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> > > +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> > > +
> > > +typedef int v4si  __attribute__ ((vector_size (16)));
> > > +typedef int v8si  __attribute__ ((vector_size (32)));
> > > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > > +typedef double v2df  __attribute__ ((vector_size (16)));
> > > +typedef double v4df  __attribute__ ((vector_size (32)));
> > > +
> > > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > > +VTYPE                                                \
> > > + __attribute__ ((noipa))                     \
> > > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > > +{                                            \
> > > +  return a OP 101;                           \
> > > +}                                            \
> > > +
> > > +FOO (v4si, add, +);
> > > +FOO (v8si, add, +);
> > > +FOO (v2di, add, +);
> > > +FOO (v4di, add, +);
> > > +FOO (v4sf, add, +);
> > > +FOO (v8sf, add, +);
> > > +FOO (v2df, add, +);
> > > +FOO (v4df, add, +);
> > > +
> > > +FOO (v4si, mul, *);
> > > +FOO (v8si, mul, *);
> > > +FOO (v2di, mul, *);
> > > +FOO (v4di, mul, *);
> > > +FOO (v4sf, mul, *);
> > > +FOO (v8sf, mul, *);
> > > +FOO (v2df, mul, *);
> > > +FOO (v4df, mul, *);
> > > diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > > new file mode 100644
> > > index 00000000000..21249bc0cf9
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > > @@ -0,0 +1,66 @@
> > > +/* PR target/87767 */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > +
> > > +typedef int v4si  __attribute__ ((vector_size (16)));
> > > +typedef int v8si  __attribute__ ((vector_size (32)));
> > > +typedef int v16si  __attribute__ ((vector_size (64)));
> > > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > > +typedef long long v8di  __attribute__ ((vector_size (64)));
> > > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > > +typedef float v16sf  __attribute__ ((vector_size (64)));
> > > +typedef double v2df  __attribute__ ((vector_size (16)));
> > > +typedef double v4df  __attribute__ ((vector_size (32)));
> > > +typedef double v8df  __attribute__ ((vector_size (64)));
> > > +
> > > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > > +VTYPE                                                \
> > > + __attribute__ ((noipa))                     \
> > > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > > +{                                            \
> > > +  return a OP 101;                           \
> > > +}                                            \
> > > +
> > > +FOO (v4si, add, +);
> > > +FOO (v8si, add, +);
> > > +FOO (v16si, add, +);
> > > +FOO (v2di, add, +);
> > > +FOO (v4di, add, +);
> > > +FOO (v8di, add, +);
> > > +FOO (v4sf, add, +);
> > > +FOO (v8sf, add, +);
> > > +FOO (v16sf, add, +);
> > > +FOO (v2df, add, +);
> > > +FOO (v4df, add, +);
> > > +FOO (v8df, add, +);
> > > +
> > > +FOO (v4si, mul, *);
> > > +FOO (v8si, mul, *);
> > > +FOO (v16si, mul, *);
> > > +FOO (v2di, mul, *);
> > > +FOO (v4di, mul, *);
> > > +FOO (v8di, mul, *);
> > > +FOO (v4sf, mul, *);
> > > +FOO (v8sf, mul, *);
> > > +FOO (v16sf, mul, *);
> > > +FOO (v2df, mul, *);
> > > +FOO (v4df, mul, *);
> > > +FOO (v8df, mul, *);
> > > diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> > > new file mode 100644
> > > index 00000000000..938346743c2
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> > > @@ -0,0 +1,54 @@
> > > +/* PR target/87767 */
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> > > +
> > > +#include<stdlib.h>
> > > +#include<stdio.h>
> > > +#include "avx512f-broadcast-pr87767-1.c"
> > > +
> > > +#define TEST(VTYPE, TYPE, N, OP_NAME, OP)            \
> > > +  do                                                 \
> > > +    {                                                        \
> > > +      TYPE exp[N], src[N];                           \
> > > +      VTYPE res;                                     \
> > > +      for (int i = 0; i < N; i++)                    \
> > > +     src[i] = i * i * 107;                           \
> > > +      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);      \
> > > +      for (int i = 0; i < N; i ++)                   \
> > > +     exp[i] = src[i] OP 101;                         \
> > > +      for (int j = 0; j < N; j++)                    \
> > > +     {                                               \
> > > +       if (res[j] != exp[j])                         \
> > > +         abort();                                    \
> > > +     }                                               \
> > > +    }                                                        \
> > > +  while (0)
> > > +
> > > +int main()
> > > +{
> > > +  TEST (v4si, int, 4, add, +);
> > > +  TEST (v8si, int, 8, add, +);
> > > +  TEST (v16si, int, 16, add, +);
> > > +  TEST (v2di, long long, 2, add, +);
> > > +  TEST (v4di, long long, 4, add, +);
> > > +  TEST (v8di, long long, 8, add, +);
> > > +  TEST (v4sf, float, 4, add, +);
> > > +  TEST (v8sf, float, 8, add, +);
> > > +  TEST (v16sf, float, 16, add, +);
> > > +  TEST (v2df, double, 2, add, +);
> > > +  TEST (v4df, double, 4, add, +);
> > > +  TEST (v8df, double, 8, add, +);
> > > +
> > > +  TEST (v4si, int, 4, mul, *);
> > > +  TEST (v8si, int, 8, mul, *);
> > > +  TEST (v16si, int, 16, mul, *);
> > > +  TEST (v2di, long long, 2, mul, *);
> > > +  TEST (v4di, long long, 4, mul, *);
> > > +  TEST (v8di, long long, 8, mul, *);
> > > +  TEST (v4sf, float, 4, mul, *);
> > > +  TEST (v8sf, float, 8, mul, *);
> > > +  TEST (v16sf, float, 16, mul, *);
> > > +  TEST (v2df, double, 2, mul, *);
> > > +  TEST (v4df, double, 4, mul, *);
> > > +  TEST (v8df, double, 8, mul, *);
> > > +}
> > > diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > > new file mode 100644
> > > index 00000000000..ec159a68158
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > > @@ -0,0 +1,40 @@
> > > +/* PR target/87767 */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mavx512f" } */
> > > +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> > > +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> > > +
> > > +typedef int v4si  __attribute__ ((vector_size (16)));
> > > +typedef int v8si  __attribute__ ((vector_size (32)));
> > > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > > +typedef double v2df  __attribute__ ((vector_size (16)));
> > > +typedef double v4df  __attribute__ ((vector_size (32)));
> > > +
> > > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > > +VTYPE                                                \
> > > + __attribute__ ((noipa))                     \
> > > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > > +{                                            \
> > > +  return a OP 101;                           \
> > > +}                                            \
> > > +
> > > +FOO (v4si, add, +);
> > > +FOO (v8si, add, +);
> > > +FOO (v2di, add, +);
> > > +FOO (v4di, add, +);
> > > +FOO (v4sf, add, +);
> > > +FOO (v8sf, add, +);
> > > +FOO (v2df, add, +);
> > > +FOO (v4df, add, +);
> > > +
> > > +FOO (v4si, mul, *);
> > > +FOO (v8si, mul, *);
> > > +FOO (v2di, mul, *);
> > > +FOO (v4di, mul, *);
> > > +FOO (v4sf, mul, *);
> > > +FOO (v8sf, mul, *);
> > > +FOO (v2df, mul, *);
> > > +FOO (v4df, mul, *);
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr92865-1.c b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > > index 49b5778a067..a37487d9af7 100644
> > > --- a/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > > +++ b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > > @@ -3,10 +3,11 @@
> > >  /* { dg-options "-Ofast -mavx512f -mavx512bw -mxop" } */
> > >  /* { dg-final { scan-assembler-times "vpcmp\[bwdq\]\[\t ]" 4 } } */
> > >  /* { dg-final { scan-assembler-times "vpcmpu\[bwdq\]\[\t ]" 4 } } */
> > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 4 } } */
> > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 4 } } *
> > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 4 } } */
> > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 4 } } */
> > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 2 } } */
> > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 2 } } *
> > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 2 } } */
> > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 2 } } */
> > > +/* { dg-final { scan-assembler-times "vpbroadcast\[bwqd\]\[\t ]" 16 } } */
> > >
> > >  extern char arraysb[64];
> > >  extern short arraysw[32];
> > > --
> > > 2.18.1
> > >
> >
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-07-24  2:37     ` Hongtao Liu
@ 2020-08-04  6:05       ` Hongtao Liu
  2020-08-26 21:23         ` Jeff Law
  0 siblings, 1 reply; 30+ messages in thread
From: Hongtao Liu @ 2020-08-04  6:05 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 21769 bytes --]

Update patch.

There are a lot of avx512 define_insns which lack corresponding memory
broadcast version, i only add *avx512f_mul<mode>3<mask_name>_bcst and
*avx512dq_mul<mode>3<mask_name>_bcst in this patch.

On Fri, Jul 24, 2020 at 10:37 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Thu, Jul 23, 2020 at 9:53 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Thu, Jul 23, 2020 at 4:39 PM Jan Hubicka <hubicka@ucw.cz> wrote:
> > >
> > > Hello,
> > > sorry for taking so long to get to this.
> > > > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > > > index 535fc7e981d..8f81d101382 100644
> > > > --- a/gcc/config/i386/i386-features.c
> > > > +++ b/gcc/config/i386/i386-features.c
> > > > @@ -2379,6 +2379,152 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
> > > >    return new pass_remove_partial_avx_dependency (ctxt);
> > > >  }
> > > >
> > > > +/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
> > > > +   with embedded broadcast. i.e.transform
> > > > +
> > > > +     vpaddq .LC0(%rip), %zmm0, %zmm0
> > > > +     ret
> > > > +  .LC0:
> > > > +    .quad 3
> > > > +    .quad 3
> > > > +    .quad 3
> > > > +    .quad 3
> > > > +    .quad 3
> > > > +    .quad 3
> > > > +    .quad 3
> > > > +    .quad 3
> > > > +
> > > > +    to
> > > > +
> > > > +     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0
> > >
> > > It seems to me that having a special purpose pass for this is bit
> > > overzelaous.  It seems to me that you can do same pattern matching via
> > > splitter and fit it into the usual insn splitting pass?
> > >
> >
> > From an implementation perspective, there could be lots of work, since
> > memory embedding broadcast is available for nearly every instruction
> > in AVX512. And for new added AVX512 instructions, we also need to add
> > a define_split for them.
> >
>
> I'll add more tests to show my point.
>
> > > Honza
> > > > +     ret
> > > > +  .LC0:
> > > > +    .quad 3  */
> > > > +static void
> > > > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > > > +{
> > > > +  subrtx_ptr_iterator::array_type array;
> > > > +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > > > +    {
> > > > +      rtx *loc = *iter;
> > > > +      rtx x = *loc;
> > > > +      rtx broadcast_mem, vec_dup, constant, first;
> > > > +      machine_mode mode;
> > > > +      if (GET_CODE (x) != MEM
> > > > +       || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
> > > > +       || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > > > +     continue;
> > > > +
> > > > +      mode = GET_MODE (x);
> > > > +      if (!VECTOR_MODE_P (mode))
> > > > +     return;
> > > > +
> > > > +      constant = get_pool_constant (XEXP (x, 0));
> > > > +      first = XVECEXP (constant, 0, 0);
> > > > +      /* There could be some rtx like
> > > > +      (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> > > > +      but with "*.LC1" refer to V2DI constant vector.  */
> > > > +      if (GET_MODE (constant) != mode)
> > > > +     return;
> > > > +
> > > > +      for (int i = 1; i < GET_MODE_NUNITS (mode); ++i)
> > > > +     {
> > > > +       rtx tmp = XVECEXP (constant, 0, i);
> > > > +       /* Only handle one-value const vector.  */
> > > > +       if (!rtx_equal_p (tmp, first))
> > > > +         return;
> > > > +     }
> > > > +
> > > > +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > > > +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > > > +      *loc = vec_dup;
> > > > +      INSN_CODE (insn) = -1;
> > > > +      /* Revert change if there's no corresponding pattern.  */
> > > > +      if (recog_memoized (insn) < 0)
> > > > +             {
> > > > +               *loc = x;
> > > > +               recog_memoized (insn);
> > > > +             }
> > > > +      /* At most 1 memory_operand in an insn.  */
> > > > +      return;
> > > > +    }
> > > > +}
> > > > +
> > > > +/* For const vector having one duplicated value, there's no need to put
> > > > +   whole vector in the constant pool when target supports embedded broadcast. */
> > > > +static unsigned int
> > > > +constant_pool_broadcast (void)
> > > > +{
> > > > +  timevar_push (TV_MACH_DEP);
> > > > +  rtx_insn *insn;
> > > > +
> > > > +  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
> > > > +    {
> > > > +      if (!INSN_P (insn))
> > > > +     continue;
> > > > +
> > > > +      /* Insns may appear inside a SEQUENCE.  Only check the patterns of
> > > > +      insns, not any notes that may be attached.  We don't want to mark
> > > > +      a constant just because it happens to appear in a REG_EQUIV note.  */
> > > > +      if (rtx_sequence *seq = dyn_cast <rtx_sequence *> (PATTERN (insn)))
> > > > +     {
> > > > +       int i, n = seq->len ();
> > > > +       for (i = 0; i < n; ++i)
> > > > +         {
> > > > +           rtx subinsn = seq->element (i);
> > > > +           if (INSN_P (subinsn))
> > > > +             replace_constant_pool_with_broadcast (dyn_cast <rtx_insn *> (subinsn));
> > > > +         }
> > > > +     }
> > > > +      else
> > > > +     replace_constant_pool_with_broadcast (insn);
> > > > +    }
> > > > +  timevar_pop (TV_MACH_DEP);
> > > > +  return 0;
> > > > +}
> > > > +
> > > > +namespace {
> > > > +
> > > > +const pass_data pass_data_constant_pool_broadcast =
> > > > +{
> > > > +  RTL_PASS, /* type */
> > > > +  "cpb", /* name */
> > > > +  OPTGROUP_NONE, /* optinfo_flags */
> > > > +  TV_MACH_DEP, /* tv_id */
> > > > +  0, /* properties_required */
> > > > +  0, /* properties_provided */
> > > > +  0, /* properties_destroyed */
> > > > +  0, /* todo_flags_start */
> > > > +  TODO_df_finish, /* todo_flags_finish */
> > > > +};
> > > > +
> > > > +class pass_constant_pool_broadcast : public rtl_opt_pass
> > > > +{
> > > > +public:
> > > > +  pass_constant_pool_broadcast (gcc::context *ctxt)
> > > > +    : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt)
> > > > +  {}
> > > > +
> > > > +  /* opt_pass methods: */
> > > > +  virtual bool gate (function *)
> > > > +    {
> > > > +      return TARGET_AVX512F;
> > > > +    }
> > > > +
> > > > +  virtual unsigned int execute (function *)
> > > > +    {
> > > > +      return constant_pool_broadcast ();
> > > > +    }
> > > > +}; // class pass_cpb
> > > > +
> > > > +} // anon namespace
> > > > +
> > > > +rtl_opt_pass *
> > > > +make_pass_constant_pool_broadcast (gcc::context *ctxt)
> > > > +{
> > > > +  return new pass_constant_pool_broadcast (ctxt);
> > > > +}
> > > > +
> > > >  /* This compares the priority of target features in function DECL1
> > > >     and DECL2.  It returns positive value if DECL1 is higher priority,
> > > >     negative value if DECL2 is higher priority and 0 if they are the
> > > > diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
> > > > index d83c7b956b1..07ecf8e790f 100644
> > > > --- a/gcc/config/i386/i386-passes.def
> > > > +++ b/gcc/config/i386/i386-passes.def
> > > > @@ -33,3 +33,4 @@ along with GCC; see the file COPYING3.  If not see
> > > >    INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
> > > >
> > > >    INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
> > > > +  INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast);
> > > > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > > > index 7c2ce618f3f..6c6909b41dd 100644
> > > > --- a/gcc/config/i386/i386-protos.h
> > > > +++ b/gcc/config/i386/i386-protos.h
> > > > @@ -386,3 +386,4 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area
> > > >    (gcc::context *);
> > > >  extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
> > > >    (gcc::context *);
> > > > +extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *);
> > > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > > index 431571a4bc1..fbfb459c5bf 100644
> > > > --- a/gcc/config/i386/sse.md
> > > > +++ b/gcc/config/i386/sse.md
> > > > @@ -12127,6 +12127,19 @@
> > > >     (set_attr "prefix" "evex")
> > > >     (set_attr "mode" "<sseinsnmode>")])
> > > >
> > > > +(define_insn "*avx512dq_mul<mode>3<mask_name>_bcst"
> > > > +  [(set (match_operand:VI8_AVX512VL 0 "register_operand" "=v")
> > > > +     (mult:VI8_AVX512VL
> > > > +       (vec_duplicate:VI8_AVX512VL
> > > > +         (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> > > > +       (match_operand:VI8_AVX512VL 2 "register_operand" "v")
> > > > +))]
> > > > +  "TARGET_AVX512DQ"
> > > > +  "vpmullq\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> > > > +  [(set_attr "type" "sseimul")
> > > > +   (set_attr "prefix" "evex")
> > > > +   (set_attr "mode" "<sseinsnmode>")])
> > > > +
> > > >  (define_expand "mul<mode>3<mask_name>"
> > > >    [(set (match_operand:VI4_AVX512F 0 "register_operand")
> > > >       (mult:VI4_AVX512F
> > > > @@ -12167,6 +12180,18 @@
> > > >     (set_attr "btver2_decode" "vector,vector,vector")
> > > >     (set_attr "mode" "<sseinsnmode>")])
> > > >
> > > > +(define_insn "*avx512f_mul<mode>3<mask_name>_bcst"
> > > > +  [(set (match_operand:VI4_AVX512VL 0 "register_operand" "=v")
> > > > +     (mult:VI4_AVX512VL
> > > > +       (vec_duplicate:VI4_AVX512VL
> > > > +         (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> > > > +       (match_operand:VI4_AVX512VL 2 "register_operand" "v")))]
> > > > +  "TARGET_AVX512F"
> > > > +   "vpmulld\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> > > > +  [(set_attr "type" "sseimul")
> > > > +   (set_attr "prefix" "evex")
> > > > +   (set_attr "mode" "<sseinsnmode>")])
> > > > +
> > > >  (define_expand "mul<mode>3"
> > > >    [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand")
> > > >       (mult:VI8_AVX2_AVX512F
> > > > diff --git a/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> > > > new file mode 100644
> > > > index 00000000000..800ef1f957e
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> > > > @@ -0,0 +1,40 @@
> > > > +/* PR target/87767 */
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -mavx2" } */
> > > > +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> > > > +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> > > > +
> > > > +typedef int v4si  __attribute__ ((vector_size (16)));
> > > > +typedef int v8si  __attribute__ ((vector_size (32)));
> > > > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > > > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > > > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > > > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > > > +typedef double v2df  __attribute__ ((vector_size (16)));
> > > > +typedef double v4df  __attribute__ ((vector_size (32)));
> > > > +
> > > > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > > > +VTYPE                                                \
> > > > + __attribute__ ((noipa))                     \
> > > > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > > > +{                                            \
> > > > +  return a OP 101;                           \
> > > > +}                                            \
> > > > +
> > > > +FOO (v4si, add, +);
> > > > +FOO (v8si, add, +);
> > > > +FOO (v2di, add, +);
> > > > +FOO (v4di, add, +);
> > > > +FOO (v4sf, add, +);
> > > > +FOO (v8sf, add, +);
> > > > +FOO (v2df, add, +);
> > > > +FOO (v4df, add, +);
> > > > +
> > > > +FOO (v4si, mul, *);
> > > > +FOO (v8si, mul, *);
> > > > +FOO (v2di, mul, *);
> > > > +FOO (v4di, mul, *);
> > > > +FOO (v4sf, mul, *);
> > > > +FOO (v8sf, mul, *);
> > > > +FOO (v2df, mul, *);
> > > > +FOO (v4df, mul, *);
> > > > diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > > > new file mode 100644
> > > > index 00000000000..21249bc0cf9
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > > > @@ -0,0 +1,66 @@
> > > > +/* PR target/87767 */
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> > > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > > +
> > > > +typedef int v4si  __attribute__ ((vector_size (16)));
> > > > +typedef int v8si  __attribute__ ((vector_size (32)));
> > > > +typedef int v16si  __attribute__ ((vector_size (64)));
> > > > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > > > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > > > +typedef long long v8di  __attribute__ ((vector_size (64)));
> > > > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > > > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > > > +typedef float v16sf  __attribute__ ((vector_size (64)));
> > > > +typedef double v2df  __attribute__ ((vector_size (16)));
> > > > +typedef double v4df  __attribute__ ((vector_size (32)));
> > > > +typedef double v8df  __attribute__ ((vector_size (64)));
> > > > +
> > > > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > > > +VTYPE                                                \
> > > > + __attribute__ ((noipa))                     \
> > > > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > > > +{                                            \
> > > > +  return a OP 101;                           \
> > > > +}                                            \
> > > > +
> > > > +FOO (v4si, add, +);
> > > > +FOO (v8si, add, +);
> > > > +FOO (v16si, add, +);
> > > > +FOO (v2di, add, +);
> > > > +FOO (v4di, add, +);
> > > > +FOO (v8di, add, +);
> > > > +FOO (v4sf, add, +);
> > > > +FOO (v8sf, add, +);
> > > > +FOO (v16sf, add, +);
> > > > +FOO (v2df, add, +);
> > > > +FOO (v4df, add, +);
> > > > +FOO (v8df, add, +);
> > > > +
> > > > +FOO (v4si, mul, *);
> > > > +FOO (v8si, mul, *);
> > > > +FOO (v16si, mul, *);
> > > > +FOO (v2di, mul, *);
> > > > +FOO (v4di, mul, *);
> > > > +FOO (v8di, mul, *);
> > > > +FOO (v4sf, mul, *);
> > > > +FOO (v8sf, mul, *);
> > > > +FOO (v16sf, mul, *);
> > > > +FOO (v2df, mul, *);
> > > > +FOO (v4df, mul, *);
> > > > +FOO (v8df, mul, *);
> > > > diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> > > > new file mode 100644
> > > > index 00000000000..938346743c2
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> > > > @@ -0,0 +1,54 @@
> > > > +/* PR target/87767 */
> > > > +/* { dg-do run } */
> > > > +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> > > > +
> > > > +#include<stdlib.h>
> > > > +#include<stdio.h>
> > > > +#include "avx512f-broadcast-pr87767-1.c"
> > > > +
> > > > +#define TEST(VTYPE, TYPE, N, OP_NAME, OP)            \
> > > > +  do                                                 \
> > > > +    {                                                        \
> > > > +      TYPE exp[N], src[N];                           \
> > > > +      VTYPE res;                                     \
> > > > +      for (int i = 0; i < N; i++)                    \
> > > > +     src[i] = i * i * 107;                           \
> > > > +      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);      \
> > > > +      for (int i = 0; i < N; i ++)                   \
> > > > +     exp[i] = src[i] OP 101;                         \
> > > > +      for (int j = 0; j < N; j++)                    \
> > > > +     {                                               \
> > > > +       if (res[j] != exp[j])                         \
> > > > +         abort();                                    \
> > > > +     }                                               \
> > > > +    }                                                        \
> > > > +  while (0)
> > > > +
> > > > +int main()
> > > > +{
> > > > +  TEST (v4si, int, 4, add, +);
> > > > +  TEST (v8si, int, 8, add, +);
> > > > +  TEST (v16si, int, 16, add, +);
> > > > +  TEST (v2di, long long, 2, add, +);
> > > > +  TEST (v4di, long long, 4, add, +);
> > > > +  TEST (v8di, long long, 8, add, +);
> > > > +  TEST (v4sf, float, 4, add, +);
> > > > +  TEST (v8sf, float, 8, add, +);
> > > > +  TEST (v16sf, float, 16, add, +);
> > > > +  TEST (v2df, double, 2, add, +);
> > > > +  TEST (v4df, double, 4, add, +);
> > > > +  TEST (v8df, double, 8, add, +);
> > > > +
> > > > +  TEST (v4si, int, 4, mul, *);
> > > > +  TEST (v8si, int, 8, mul, *);
> > > > +  TEST (v16si, int, 16, mul, *);
> > > > +  TEST (v2di, long long, 2, mul, *);
> > > > +  TEST (v4di, long long, 4, mul, *);
> > > > +  TEST (v8di, long long, 8, mul, *);
> > > > +  TEST (v4sf, float, 4, mul, *);
> > > > +  TEST (v8sf, float, 8, mul, *);
> > > > +  TEST (v16sf, float, 16, mul, *);
> > > > +  TEST (v2df, double, 2, mul, *);
> > > > +  TEST (v4df, double, 4, mul, *);
> > > > +  TEST (v8df, double, 8, mul, *);
> > > > +}
> > > > diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > > > new file mode 100644
> > > > index 00000000000..ec159a68158
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > > > @@ -0,0 +1,40 @@
> > > > +/* PR target/87767 */
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -mavx512f" } */
> > > > +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> > > > +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> > > > +
> > > > +typedef int v4si  __attribute__ ((vector_size (16)));
> > > > +typedef int v8si  __attribute__ ((vector_size (32)));
> > > > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > > > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > > > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > > > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > > > +typedef double v2df  __attribute__ ((vector_size (16)));
> > > > +typedef double v4df  __attribute__ ((vector_size (32)));
> > > > +
> > > > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > > > +VTYPE                                                \
> > > > + __attribute__ ((noipa))                     \
> > > > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > > > +{                                            \
> > > > +  return a OP 101;                           \
> > > > +}                                            \
> > > > +
> > > > +FOO (v4si, add, +);
> > > > +FOO (v8si, add, +);
> > > > +FOO (v2di, add, +);
> > > > +FOO (v4di, add, +);
> > > > +FOO (v4sf, add, +);
> > > > +FOO (v8sf, add, +);
> > > > +FOO (v2df, add, +);
> > > > +FOO (v4df, add, +);
> > > > +
> > > > +FOO (v4si, mul, *);
> > > > +FOO (v8si, mul, *);
> > > > +FOO (v2di, mul, *);
> > > > +FOO (v4di, mul, *);
> > > > +FOO (v4sf, mul, *);
> > > > +FOO (v8sf, mul, *);
> > > > +FOO (v2df, mul, *);
> > > > +FOO (v4df, mul, *);
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr92865-1.c b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > > > index 49b5778a067..a37487d9af7 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > > > @@ -3,10 +3,11 @@
> > > >  /* { dg-options "-Ofast -mavx512f -mavx512bw -mxop" } */
> > > >  /* { dg-final { scan-assembler-times "vpcmp\[bwdq\]\[\t ]" 4 } } */
> > > >  /* { dg-final { scan-assembler-times "vpcmpu\[bwdq\]\[\t ]" 4 } } */
> > > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 4 } } */
> > > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 4 } } *
> > > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 4 } } */
> > > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 4 } } */
> > > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 2 } } */
> > > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 2 } } *
> > > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 2 } } */
> > > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 2 } } */
> > > > +/* { dg-final { scan-assembler-times "vpbroadcast\[bwqd\]\[\t ]" 16 } } */
> > > >
> > > >  extern char arraysb[64];
> > > >  extern short arraysw[32];
> > > > --
> > > > 2.18.1
> > > >
> > >
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

[-- Attachment #2: 0001-Optimize-memory-broadcast-for-constant-vector_V2.patch --]
[-- Type: text/x-patch, Size: 36139 bytes --]

From 8cd1b1ad4f7fecec3e7a22be09d8f282115a1c7d Mon Sep 17 00:00:00 2001
From: liuhongt <hongtao.liu@intel.com>
Date: Wed, 8 Jul 2020 17:14:36 +0800
Subject: [PATCH] Optimize memory broadcast for constant vector under AVX512.

For constant vector having one duplicated value, there's no need to put
whole vector in the constant pool, using embedded broadcast instead.

2020-07-09  Hongtao Liu  <hongtao.liu@intel.com>

gcc/ChangeLog:

	PR target/87767
	* config/i386/i386-features.c
	(replace_constant_pool_with_broadcast): New function.
	(constant_pool_broadcast): Ditto.
	(class pass_constant_pool_broadcast): New pass.
	(make_pass_constant_pool_broadcast): Ditto.
	* config/i386/i386-passes.def: Insert new pass after combine.
	* config/i386/i386-protos.h
	(make_pass_constant_pool_broadcast): Declare.
	* config/i386/sse.md (*avx512dq_mul<mode>3<mask_name>_bcst,
	*avx512f_mul<mode>3<mask_name>_bcst): New define_insn.
	* config/i386/avx512fintrin.h (_mm512_set1_ps,
	_mm512_set1_pd,_mm512_set1_epi32, _mm512_set1_epi64): Adjusted.

gcc/testsuite/ChangeLog:

	PR target/87767
	* gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-3.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-4.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-5.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-6.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-7.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/pr92865-1.c: Adjust testcase.
---
 gcc/config/i386/avx512fintrin.h               |  27 ++--
 gcc/config/i386/i386-features.c               | 146 ++++++++++++++++++
 gcc/config/i386/i386-passes.def               |   1 +
 gcc/config/i386/i386-protos.h                 |   1 +
 gcc/config/i386/sse.md                        |  25 +++
 .../i386/avx2-broadcast-pr87767-1.c           |  40 +++++
 .../i386/avx512f-broadcast-pr87767-1.c        |  78 ++++++++++
 .../i386/avx512f-broadcast-pr87767-2.c        |  61 ++++++++
 .../i386/avx512f-broadcast-pr87767-3.c        |  55 +++++++
 .../i386/avx512f-broadcast-pr87767-4.c        |  59 +++++++
 .../i386/avx512f-broadcast-pr87767-5.c        |  63 ++++++++
 .../i386/avx512f-broadcast-pr87767-6.c        |  58 +++++++
 .../i386/avx512f-broadcast-pr87767-7.c        | 121 +++++++++++++++
 .../i386/avx512vl-broadcast-pr87767-1.c       |  41 +++++
 gcc/testsuite/gcc.target/i386/pr92865-1.c     |   9 +-
 15 files changed, 763 insertions(+), 22 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c

diff --git a/gcc/config/i386/avx512fintrin.h b/gcc/config/i386/avx512fintrin.h
index 0d53dda3a27..729d5686d68 100644
--- a/gcc/config/i386/avx512fintrin.h
+++ b/gcc/config/i386/avx512fintrin.h
@@ -239,22 +239,17 @@ extern __inline __m512d
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_pd (double __A)
 {
-  return (__m512d) __builtin_ia32_broadcastsd512 (__extension__
-						  (__v2df) { __A, },
-						  (__v8df)
-						  _mm512_undefined_pd (),
-						  (__mmask8) -1);
+  return __extension__ (__m512d)(__v8df)
+    { __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 extern __inline __m512
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_ps (float __A)
 {
-  return (__m512) __builtin_ia32_broadcastss512 (__extension__
-						 (__v4sf) { __A, },
-						 (__v16sf)
-						 _mm512_undefined_ps (),
-						 (__mmask16) -1);
+  return __extension__ (__m512)(__v16sf)
+    { __A, __A, __A, __A, __A, __A, __A, __A,
+      __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 /* Create the vector [A B C D A B C D A B C D A B C D].  */
@@ -4072,10 +4067,9 @@ extern __inline __m512i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_epi32 (int __A)
 {
-  return (__m512i) __builtin_ia32_pbroadcastd512_gpr_mask (__A,
-							   (__v16si)
-							   _mm512_undefined_epi32 (),
-							   (__mmask16)(-1));
+  return (__m512i)(__v16si)
+    { __A, __A, __A, __A, __A, __A, __A, __A,
+      __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 extern __inline __m512i
@@ -4128,10 +4122,7 @@ extern __inline __m512i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_epi64 (long long __A)
 {
-  return (__m512i) __builtin_ia32_pbroadcastq512_gpr_mask (__A,
-							   (__v8di)
-							   _mm512_undefined_epi32 (),
-							   (__mmask8)(-1));
+  return (__m512i)(__v8di) { __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 extern __inline __m512i
diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 535fc7e981d..8f81d101382 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -2379,6 +2379,152 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
   return new pass_remove_partial_avx_dependency (ctxt);
 }
 
+/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
+   with embedded broadcast. i.e.transform
+
+     vpaddq .LC0(%rip), %zmm0, %zmm0
+     ret
+  .LC0:
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+
+    to
+
+     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0
+     ret
+  .LC0:
+    .quad 3  */
+static void
+replace_constant_pool_with_broadcast (rtx_insn* insn)
+{
+  subrtx_ptr_iterator::array_type array;
+  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
+    {
+      rtx *loc = *iter;
+      rtx x = *loc;
+      rtx broadcast_mem, vec_dup, constant, first;
+      machine_mode mode;
+      if (GET_CODE (x) != MEM
+	  || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
+	  || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
+	continue;
+
+      mode = GET_MODE (x);
+      if (!VECTOR_MODE_P (mode))
+	return;
+
+      constant = get_pool_constant (XEXP (x, 0));
+      first = XVECEXP (constant, 0, 0);
+      /* There could be some rtx like
+	 (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
+	 but with "*.LC1" refer to V2DI constant vector.  */
+      if (GET_MODE (constant) != mode)
+	return;
+
+      for (int i = 1; i < GET_MODE_NUNITS (mode); ++i)
+	{
+	  rtx tmp = XVECEXP (constant, 0, i);
+	  /* Only handle one-value const vector.  */
+	  if (!rtx_equal_p (tmp, first))
+	    return;
+	}
+
+      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
+      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
+      *loc = vec_dup;
+      INSN_CODE (insn) = -1;
+      /* Revert change if there's no corresponding pattern.  */
+      if (recog_memoized (insn) < 0)
+      	{
+      	  *loc = x;
+      	  recog_memoized (insn);
+      	}
+      /* At most 1 memory_operand in an insn.  */
+      return;
+    }
+}
+
+/* For const vector having one duplicated value, there's no need to put
+   whole vector in the constant pool when target supports embedded broadcast. */
+static unsigned int
+constant_pool_broadcast (void)
+{
+  timevar_push (TV_MACH_DEP);
+  rtx_insn *insn;
+
+  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
+    {
+      if (!INSN_P (insn))
+	continue;
+
+      /* Insns may appear inside a SEQUENCE.  Only check the patterns of
+	 insns, not any notes that may be attached.  We don't want to mark
+	 a constant just because it happens to appear in a REG_EQUIV note.  */
+      if (rtx_sequence *seq = dyn_cast <rtx_sequence *> (PATTERN (insn)))
+	{
+	  int i, n = seq->len ();
+	  for (i = 0; i < n; ++i)
+	    {
+	      rtx subinsn = seq->element (i);
+	      if (INSN_P (subinsn))
+		replace_constant_pool_with_broadcast (dyn_cast <rtx_insn *> (subinsn));
+	    }
+	}
+      else
+	replace_constant_pool_with_broadcast (insn);
+    }
+  timevar_pop (TV_MACH_DEP);
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_constant_pool_broadcast =
+{
+  RTL_PASS, /* type */
+  "cpb", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_MACH_DEP, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_constant_pool_broadcast : public rtl_opt_pass
+{
+public:
+  pass_constant_pool_broadcast (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return TARGET_AVX512F;
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      return constant_pool_broadcast ();
+    }
+}; // class pass_cpb
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_constant_pool_broadcast (gcc::context *ctxt)
+{
+  return new pass_constant_pool_broadcast (ctxt);
+}
+
 /* This compares the priority of target features in function DECL1
    and DECL2.  It returns positive value if DECL1 is higher priority,
    negative value if DECL2 is higher priority and 0 if they are the
diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
index d83c7b956b1..07ecf8e790f 100644
--- a/gcc/config/i386/i386-passes.def
+++ b/gcc/config/i386/i386-passes.def
@@ -33,3 +33,4 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
 
   INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
+  INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast);
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 0b95c57b1a0..812d3e06cf1 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -386,3 +386,4 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area
   (gcc::context *);
 extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
   (gcc::context *);
+extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *);
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index b6348de67cb..6bd780caff6 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -12121,6 +12121,19 @@
    (set_attr "prefix" "evex")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn "*avx512dq_mul<mode>3<mask_name>_bcst"
+  [(set (match_operand:VI8_AVX512VL 0 "register_operand" "=v")
+	(mult:VI8_AVX512VL
+	  (vec_duplicate:VI8_AVX512VL
+	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
+	  (match_operand:VI8_AVX512VL 2 "register_operand" "v")
+))]
+  "TARGET_AVX512DQ"
+  "vpmullq\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
+  [(set_attr "type" "sseimul")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "<sseinsnmode>")])
+
 (define_expand "mul<mode>3<mask_name>"
   [(set (match_operand:VI4_AVX512F 0 "register_operand")
 	(mult:VI4_AVX512F
@@ -12161,6 +12174,18 @@
    (set_attr "btver2_decode" "vector,vector,vector")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn "*avx512f_mul<mode>3<mask_name>_bcst"
+  [(set (match_operand:VI4_AVX512VL 0 "register_operand" "=v")
+	(mult:VI4_AVX512VL
+	  (vec_duplicate:VI4_AVX512VL
+	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
+	  (match_operand:VI4_AVX512VL 2 "register_operand" "v")))]
+  "TARGET_AVX512F"
+   "vpmulld\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
+  [(set_attr "type" "sseimul")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "<sseinsnmode>")])
+
 (define_expand "mul<mode>3"
   [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand")
 	(mult:VI8_AVX2_AVX512F
diff --git a/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..800ef1f957e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
@@ -0,0 +1,40 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx2" } */
+/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
+/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP 101;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..610d9c6e2dd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
@@ -0,0 +1,78 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
+/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpsub\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpsub\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpsub\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpsub\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef int v16si  __attribute__ ((vector_size (64)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef long long v8di  __attribute__ ((vector_size (64)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef float v16sf  __attribute__ ((vector_size (64)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+typedef double v8df  __attribute__ ((vector_size (64)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP CONSTANT;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v16si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v8di, add, +);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v16sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+FOO (v8df, add, +);
+
+FOO (v4si, sub, -);
+FOO (v8si, sub, -);
+FOO (v16si, sub, -);
+FOO (v2di, sub, -);
+FOO (v4di, sub, -);
+FOO (v8di, sub, -);
+
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v16si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v8di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v16sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
+FOO (v8df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
new file mode 100644
index 00000000000..88b6fe77b18
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
@@ -0,0 +1,61 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
+
+#include<stdlib.h>
+#include<stdio.h>
+#include "avx512f-broadcast-pr87767-1.c"
+
+#define TEST(VTYPE, TYPE, N, OP_NAME, OP)		\
+  do							\
+    {							\
+      TYPE exp[N], src[N];				\
+      VTYPE res;					\
+      for (int i = 0; i < N; i++)			\
+	src[i] = i * i * 107;				\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);	\
+      for (int i = 0; i < N; i ++)			\
+	exp[i] = src[i] OP CONSTANT;			\
+      for (int j = 0; j < N; j++)			\
+	{						\
+	  if (res[j] != exp[j])				\
+	    abort();					\
+	}						\
+    }							\
+  while (0)
+
+int main()
+{
+  TEST (v4si, int, 4, add, +);
+  TEST (v8si, int, 8, add, +);
+  TEST (v16si, int, 16, add, +);
+  TEST (v2di, long long, 2, add, +);
+  TEST (v4di, long long, 4, add, +);
+  TEST (v8di, long long, 8, add, +);
+  TEST (v4sf, float, 4, add, +);
+  TEST (v8sf, float, 8, add, +);
+  TEST (v16sf, float, 16, add, +);
+  TEST (v2df, double, 2, add, +);
+  TEST (v4df, double, 4, add, +);
+  TEST (v8df, double, 8, add, +);
+
+  TEST (v4si, int, 4, sub, -);
+  TEST (v8si, int, 8, sub, -);
+  TEST (v16si, int, 16, sub, -);
+  TEST (v2di, long long, 2, sub, -);
+  TEST (v4di, long long, 4, sub, -);
+  TEST (v8di, long long, 8, sub, -);
+
+  TEST (v4si, int, 4, mul, *);
+  TEST (v8si, int, 8, mul, *);
+  TEST (v16si, int, 16, mul, *);
+  TEST (v2di, long long, 2, mul, *);
+  TEST (v4di, long long, 4, mul, *);
+  TEST (v8di, long long, 8, mul, *);
+  TEST (v4sf, float, 4, mul, *);
+  TEST (v8sf, float, 8, mul, *);
+  TEST (v16sf, float, 16, mul, *);
+  TEST (v2df, double, 2, mul, *);
+  TEST (v4df, double, 4, mul, *);
+  TEST (v8df, double, 8, mul, *);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c
new file mode 100644
index 00000000000..94c183b7a5a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c
@@ -0,0 +1,55 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-final { scan-assembler-times "(?:vfmadd|vfmsub)\[^\n\]*\\\{1to2\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "(?:vfmadd|vfmsub)\[^\n\]*\\\{1to4\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "(?:vfmadd|vfmsub)\[^\n\]*\\\{1to8\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "(?:vfmadd|vfmsub)\[^\n\]*\\\{1to16\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "(?:vfnmadd|vfnmsub)\[^\n\]*\\\{1to2\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "(?:vfnmadd|vfnmsub)\[^\n\]*\\\{1to4\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "(?:vfnmadd|vfnmsub)\[^\n\]*\\\{1to8\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "(?:vfnmadd|vfnmsub)\[^\n\]*\\\{1to16\\\}" 2 } }  */
+
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef float v16sf  __attribute__ ((vector_size (64)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+typedef double v8df  __attribute__ ((vector_size (64)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP1, OP2)		\
+VTYPE						\
+ __attribute__ ((noipa))			\
+ foo_##OP_NAME##_##VTYPE (VTYPE a, VTYPE b)	\
+{						\
+  return (OP1 a * b) OP2 CONSTANT;		\
+}						\
+
+FOO (v4sf, fma,, +);
+FOO (v8sf, fma,, +);
+FOO (v16sf, fma,, +);
+FOO (v2df, fma,, +);
+FOO (v4df, fma,, +);
+FOO (v8df, fma,, +);
+
+FOO (v4sf, fms,, -);
+FOO (v8sf, fms,, -);
+FOO (v16sf, fms,, -);
+FOO (v2df, fms,, -);
+FOO (v4df, fms,, -);
+FOO (v8df, fms,, -);
+
+FOO (v4sf, fnma, -, +);
+FOO (v8sf, fnma, -, +);
+FOO (v16sf, fnma, -, +);
+FOO (v2df, fnma, -, +);
+FOO (v4df, fnma, -, +);
+FOO (v8df, fnma, -, +);
+
+FOO (v4sf, fnms, -, -);
+FOO (v8sf, fnms, -, -);
+FOO (v16sf, fnms, -, -);
+FOO (v2df, fnms, -, -);
+FOO (v4df, fnms, -, -);
+FOO (v8df, fnms, -, -);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c
new file mode 100644
index 00000000000..17268968f9e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c
@@ -0,0 +1,59 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+
+#include<stdlib.h>
+#include<stdio.h>
+#include "avx512f-broadcast-pr87767-3.c"
+
+#define TEST(VTYPE, TYPE, N, OP_NAME, OP1, OP2)				\
+  do									\
+    {									\
+      TYPE exp[N], src1[N], src2[N];					\
+      VTYPE res;							\
+      for (int i = 0; i < N; i++)					\
+	{								\
+	  src1[i] = i * i * 107.2f;					\
+	  src2[i] = i * 2.f - 404.f;					\
+	}								\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src1[0], *(VTYPE*)&src2[0]); \
+      for (int i = 0; i < N; i ++)					\
+	exp[i] = (OP1 src1[i] * src2[i]) OP2 CONSTANT;			\
+      for (int j = 0; j < N; j++)					\
+	{								\
+	  if (res[j] != exp[j])						\
+	    abort();							\
+	}								\
+    }									\
+  while (0)
+
+int main()
+{
+  TEST (v4sf, float, 4, fma,, +);
+  TEST (v8sf, float, 8, fma,, +);
+  TEST (v16sf, float, 16, fma,, +);
+  TEST (v2df, double, 2, fma,, +);
+  TEST (v4df, double, 4, fma,, +);
+  TEST (v8df, double, 8, fma,, +);
+
+  TEST (v4sf, float, 4, fms,, -);
+  TEST (v8sf, float, 8, fms,, -);
+  TEST (v16sf, float, 16, fms,, -);
+  TEST (v2df, double, 2, fms,, -);
+  TEST (v4df, double, 4, fms,, -);
+  TEST (v8df, double, 8, fms,, -);
+
+  TEST (v4sf, float, 4, fnma,-, +);
+  TEST (v8sf, float, 8, fnma,-, +);
+  TEST (v16sf, float, 16, fnma,-, +);
+  TEST (v2df, double, 2, fnma,-, +);
+  TEST (v4df, double, 4, fnma,-, +);
+  TEST (v8df, double, 8, fnma,-, +);
+
+  TEST (v4sf, float, 4, fnms,-, -);
+  TEST (v8sf, float, 8, fnms,-, -);
+  TEST (v16sf, float, 16, fnms,-, -);
+  TEST (v2df, double, 2, fnms,-, -);
+  TEST (v4df, double, 4, fnms,-, -);
+  TEST (v8df, double, 8, fnms,-, -);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
new file mode 100644
index 00000000000..139bba75686
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
@@ -0,0 +1,63 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-final { scan-assembler-times "vpandn\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpandn\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpandn\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpandn\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpand\[^n\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpand\[^n\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpand\[^n\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpand\[^n\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpor\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpor\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpor\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpor\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpxor\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpxor\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpxor\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpxor\[^\n\]*\\\{1to16\\\}" 1 } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef int v16si  __attribute__ ((vector_size (64)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef long long v8di  __attribute__ ((vector_size (64)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP1, OP2)		\
+VTYPE						\
+ __attribute__ ((noipa))			\
+ foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return (OP1 a) OP2 CONSTANT;			\
+}						\
+
+FOO (v4si, andnot, ~, &);
+FOO (v8si, andnot, ~, &);
+FOO (v16si, andnot, ~, &);
+FOO (v2di, andnot, ~, &);
+FOO (v4di, andnot, ~, &);
+FOO (v8di, andnot, ~, &);
+
+FOO (v4si, and,, &);
+FOO (v8si, and,, &);
+FOO (v16si, and,, &);
+FOO (v2di, and,, &);
+FOO (v4di, and,, &);
+FOO (v8di, and,, &);
+
+FOO (v4si, or,, |);
+FOO (v8si, or,, |);
+FOO (v16si, or,, |);
+FOO (v2di, or,, |);
+FOO (v4di, or,, |);
+FOO (v8di, or,, |);
+
+FOO (v4si, xor,, ^);
+FOO (v8si, xor,, ^);
+FOO (v16si, xor,, ^);
+FOO (v2di, xor,, ^);
+FOO (v4di, xor,, ^);
+FOO (v8di, xor,, ^);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c
new file mode 100644
index 00000000000..892b52858da
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c
@@ -0,0 +1,58 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+
+#include<stdlib.h>
+#include<stdio.h>
+#include "avx512f-broadcast-pr87767-5.c"
+
+#define TEST(VTYPE, TYPE, N, OP_NAME, OP1, OP2)				\
+  do									\
+    {									\
+      TYPE exp[N], src[N];						\
+      VTYPE res;							\
+      for (int i = 0; i < N; i++)					\
+	{								\
+	  src[i] = i * i * 107;						\
+	}								\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);			\
+      for (int i = 0; i < N; i ++)					\
+	exp[i] = (OP1 src[i]) OP2 CONSTANT;				\
+      for (int j = 0; j < N; j++)					\
+	{								\
+	  if (res[j] != exp[j])						\
+	    abort();							\
+	}								\
+    }									\
+  while (0)
+
+int main()
+{
+  TEST (v4si, int, 4, andnot, ~, &);
+  TEST (v8si, int, 8, andnot, ~, &);
+  TEST (v16si, int, 16, andnot, ~, &);
+  TEST (v2di, long long, 2, andnot, ~, &);
+  TEST (v4di, long long, 4, andnot, ~, &);
+  TEST (v8di, long long, 8, andnot, ~, &);
+
+  TEST (v4si, int, 4, and,, &);
+  TEST (v8si, int, 8, and,, &);
+  TEST (v16si, int, 16, and,, &);
+  TEST (v2di, long long, 2, and,, &);
+  TEST (v4di, long long, 4, and,, &);
+  TEST (v8di, long long, 8, and,, &);
+
+  TEST (v4si, int, 4, or,, |);
+  TEST (v8si, int, 8, or,, |);
+  TEST (v16si, int, 16, or,, |);
+  TEST (v2di, long long, 2, or,, |);
+  TEST (v4di, long long, 4, or,, |);
+  TEST (v8di, long long, 8, or,, |);
+
+  TEST (v4si, int, 4, xor,, ^);
+  TEST (v8si, int, 8, xor,, ^);
+  TEST (v16si, int, 16, xor,, ^);
+  TEST (v2di, long long, 2, xor,, ^);
+  TEST (v4di, long long, 4, xor,, ^);
+  TEST (v8di, long long, 8, xor,, ^);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c
new file mode 100644
index 00000000000..a8f145d8589
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c
@@ -0,0 +1,121 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to16\\\}" 1 } }  */
+
+#include<immintrin.h>
+
+#define CONSTANT 101
+
+#define FOO(VTYPE, OP_NAME, LEN, SUFFIX, MTYPE)			\
+  VTYPE									\
+  __attribute__ ((noipa))						\
+  _mm##LEN##_foo_##OP_NAME##_##SUFFIX (VTYPE dst, VTYPE src, MTYPE m)	\
+  {									\
+    return  _mm##LEN##_mask_##OP_NAME##_##SUFFIX (dst, m, src,		\
+						  _mm##LEN##_set1_##SUFFIX (CONSTANT)); \
+  }									\
+
+#define FOOZ(VTYPE, OP_NAME, LEN, SUFFIX, MTYPE)			\
+  VTYPE									\
+  __attribute__ ((noipa))						\
+  _mm##LEN##_fooz_##OP_NAME##_##SUFFIX (VTYPE dst, VTYPE src, MTYPE m)	\
+  {									\
+    return  _mm##LEN##_maskz_##OP_NAME##_##SUFFIX (m, dst, src,		\
+						  _mm##LEN##_set1_##SUFFIX (CONSTANT)); \
+  }									\
+
+FOO (__m512, add, 512, ps, __mmask16);
+FOO (__m256, add, 256, ps, __mmask8);
+FOO (__m128, add,, ps, __mmask8);
+
+FOO (__m512, sub, 512, ps, __mmask16);
+FOO (__m256, sub, 256, ps, __mmask8);
+FOO (__m128, sub,, ps, __mmask8);
+
+FOO (__m512, mul, 512, ps, __mmask16);
+FOO (__m256, mul, 256, ps, __mmask8);
+FOO (__m128, mul,, ps, __mmask8);
+
+FOO (__m512, div, 512, ps, __mmask16);
+FOO (__m256, div, 256, ps, __mmask8);
+FOO (__m128, div,, ps, __mmask8);
+
+FOOZ (__m512, fmadd, 512, ps, __mmask16);
+FOOZ (__m256, fmadd, 256, ps, __mmask8);
+FOOZ (__m128, fmadd,, ps, __mmask8);
+
+FOOZ (__m512, fmsub, 512, ps, __mmask16);
+FOOZ (__m256, fmsub, 256, ps, __mmask8);
+FOOZ (__m128, fmsub,, ps, __mmask8);
+
+FOOZ (__m512, fnmadd, 512, ps, __mmask16);
+FOOZ (__m256, fnmadd, 256, ps, __mmask8);
+FOOZ (__m128, fnmadd,, ps, __mmask8);
+
+FOOZ (__m512, fnmsub, 512, ps, __mmask16);
+FOOZ (__m256, fnmsub, 256, ps, __mmask8);
+FOOZ (__m128, fnmsub,, ps, __mmask8);
+
+FOO (__m512d, add, 512, pd, __mmask8);
+FOO (__m256d, add, 256, pd, __mmask8);
+FOO (__m128d, add,, pd, __mmask8);
+
+FOO (__m512d, sub, 512, pd, __mmask8);
+FOO (__m256d, sub, 256, pd, __mmask8);
+FOO (__m128d, sub,, pd, __mmask8);
+
+FOO (__m512d, mul, 512, pd, __mmask8);
+FOO (__m256d, mul, 256, pd, __mmask8);
+FOO (__m128d, mul,, pd, __mmask8);
+
+FOO (__m512d, div, 512, pd, __mmask8);
+FOO (__m256d, div, 256, pd, __mmask8);
+FOO (__m128d, div,, pd, __mmask8);
+
+FOOZ (__m512d, fmadd, 512, pd, __mmask8);
+FOOZ (__m256d, fmadd, 256, pd, __mmask8);
+FOOZ (__m128d, fmadd,, pd, __mmask8);
+
+FOOZ (__m512d, fmsub, 512, pd, __mmask8);
+FOOZ (__m256d, fmsub, 256, pd, __mmask8);
+FOOZ (__m128d, fmsub,, pd, __mmask8);
+
+FOOZ (__m512d, fnmadd, 512, pd, __mmask8);
+FOOZ (__m256d, fnmadd, 256, pd, __mmask8);
+FOOZ (__m128d, fnmadd,, pd, __mmask8);
+
+FOOZ (__m512d, fnmsub, 512, pd, __mmask8);
+FOOZ (__m256d, fnmsub, 256, pd, __mmask8);
+FOOZ (__m128d, fnmsub,, pd, __mmask8);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..00d2fb1b955
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
@@ -0,0 +1,41 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
+/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP CONSTANT;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/pr92865-1.c b/gcc/testsuite/gcc.target/i386/pr92865-1.c
index 49b5778a067..a37487d9af7 100644
--- a/gcc/testsuite/gcc.target/i386/pr92865-1.c
+++ b/gcc/testsuite/gcc.target/i386/pr92865-1.c
@@ -3,10 +3,11 @@
 /* { dg-options "-Ofast -mavx512f -mavx512bw -mxop" } */
 /* { dg-final { scan-assembler-times "vpcmp\[bwdq\]\[\t ]" 4 } } */
 /* { dg-final { scan-assembler-times "vpcmpu\[bwdq\]\[\t ]" 4 } } */
-/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 4 } } */
-/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 4 } } *
-/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 4 } } */
-/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 4 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 2 } } *
+/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpbroadcast\[bwqd\]\[\t ]" 16 } } */
 
 extern char arraysb[64];
 extern short arraysw[32];
-- 
2.18.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-04  6:05       ` Hongtao Liu
@ 2020-08-26 21:23         ` Jeff Law
  2020-08-27 11:09           ` Jan Hubicka
  0 siblings, 1 reply; 30+ messages in thread
From: Jeff Law @ 2020-08-26 21:23 UTC (permalink / raw)
  To: Hongtao Liu, Jan Hubicka; +Cc: GCC Patches

On Tue, 2020-08-04 at 14:05 +0800, Hongtao Liu via Gcc-patches wrote:
> Update patch.
> 
> There are a lot of avx512 define_insns which lack corresponding memory
> broadcast version, i only add *avx512f_mul<mode>3<mask_name>_bcst and
> *avx512dq_mul<mode>3<mask_name>_bcst in this patch.
> 
> On Fri, Jul 24, 2020 at 10:37 AM Hongtao Liu <crazylht@gmail.com> wrote:
> > On Thu, Jul 23, 2020 at 9:53 PM Hongtao Liu <crazylht@gmail.com> wrote:
> > > On Thu, Jul 23, 2020 at 4:39 PM Jan Hubicka <hubicka@ucw.cz> wrote:
> > > > Hello,
> > > > sorry for taking so long to get to this.
> > > > > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > > > > index 535fc7e981d..8f81d101382 100644
> > > > > --- a/gcc/config/i386/i386-features.c
> > > > > +++ b/gcc/config/i386/i386-features.c
> > > > > @@ -2379,6 +2379,152 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
> > > > >    return new pass_remove_partial_avx_dependency (ctxt);
> > > > >  }
> > > > > 
> > > > > +/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
> > > > > +   with embedded broadcast. i.e.transform
> > > > > +
> > > > > +     vpaddq .LC0(%rip), %zmm0, %zmm0
> > > > > +     ret
> > > > > +  .LC0:
> > > > > +    .quad 3
> > > > > +    .quad 3
> > > > > +    .quad 3
> > > > > +    .quad 3
> > > > > +    .quad 3
> > > > > +    .quad 3
> > > > > +    .quad 3
> > > > > +    .quad 3
> > > > > +
> > > > > +    to
> > > > > +
> > > > > +     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0
> > > > 
> > > > It seems to me that having a special purpose pass for this is bit
> > > > overzelaous.  It seems to me that you can do same pattern matching via
> > > > splitter and fit it into the usual insn splitting pass?
> > > > 
> > > 
> > > From an implementation perspective, there could be lots of work, since
> > > memory embedding broadcast is available for nearly every instruction
> > > in AVX512. And for new added AVX512 instructions, we also need to add
> > > a define_split for them.
> > > 
> > 
> > 


> > +/* For const vector having one duplicated value, there's no need to put
> > > > > +   whole vector in the constant pool when target supports embedded broadcast. */
> > > > > +static unsigned int
> > > > > +constant_pool_broadcast (void)
> > > > > +{
> > > > > +  timevar_push (TV_MACH_DEP);
> > > > > +  rtx_insn *insn;
> > > > > +
> > > > > +  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
> > > > > +    {
> > > > > +      if (!INSN_P (insn))
> > > > > +     continue;
> > > > > +
> > > > > +      /* Insns may appear inside a SEQUENCE.  Only check the patterns of
> > > > > +      insns, not any notes that may be attached.  We don't want to mark
> > > > > +      a constant just because it happens to appear in a REG_EQUIV note.  */
Under what circumstances are we seeing a SEQUENCE in the x86 backend?  I'm
surprised we need to handle that case.

So your pass modifies the insn in place, which is fine.  But do we actually
remove the original constant pool entry if it's no longer used?  If not, does
this patch actually save anything (memory bandwidth perhaps?)

Is there an existing pass over the RTL chain where this would work so that it's
more compile-time efficient?

jeff


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-26 21:23         ` Jeff Law
@ 2020-08-27 11:09           ` Jan Hubicka
  0 siblings, 0 replies; 30+ messages in thread
From: Jan Hubicka @ 2020-08-27 11:09 UTC (permalink / raw)
  To: Jeff Law; +Cc: Hongtao Liu, GCC Patches

> Under what circumstances are we seeing a SEQUENCE in the x86 backend?  I'm
> surprised we need to handle that case.
> 
> So your pass modifies the insn in place, which is fine.  But do we actually
> remove the original constant pool entry if it's no longer used?  If not, does
> this patch actually save anything (memory bandwidth perhaps?)

Constant pool entries are output only if actually used by asm output, so
this could just work.
> 
> Is there an existing pass over the RTL chain where this would work so that it's
> more compile-time efficient?

I was also concerned about adding yet another pass and wanted to look
bit more into posibility to make this a part of peephole pass.  While it
is true that the usual way to write it (adding extra pattern for every
instruction)  is a lot of work I was thinking if we can perhaps just add
quite generic define_peephole which will match everything containing
broadcast via predicate, call into the expander that will try to build
mathcing instruction and fail otherwise.  While it is still bit of a
hack I think it may be less intrusive then yet another machine specific
pass.

Honza
> 
> jeff
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-07-09  8:33 [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512 Hongtao Liu
  2020-07-10  9:24 ` Hongtao Liu
  2020-07-23  8:39 ` Jan Hubicka
@ 2020-08-27 12:24 ` Jakub Jelinek
  2020-08-27 13:07   ` Richard Biener
                     ` (2 more replies)
  2 siblings, 3 replies; 30+ messages in thread
From: Jakub Jelinek @ 2020-08-27 12:24 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: GCC Patches

On Thu, Jul 09, 2020 at 04:33:46PM +0800, Hongtao Liu via Gcc-patches wrote:
> +static void
> +replace_constant_pool_with_broadcast (rtx_insn* insn)
> +{
> +  subrtx_ptr_iterator::array_type array;
> +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> +    {
> +      rtx *loc = *iter;
> +      rtx x = *loc;
> +      rtx broadcast_mem, vec_dup, constant, first;
> +      machine_mode mode;
> +      if (GET_CODE (x) != MEM

MEM_P

> +	  || GET_CODE (XEXP (x, 0)) != SYMBOL_REF

SYMBOL_REF_P

> +	  || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> +	continue;
> +
> +      mode = GET_MODE (x);
> +      if (!VECTOR_MODE_P (mode))
> +	return;
> +
> +      constant = get_pool_constant (XEXP (x, 0));
> +      first = XVECEXP (constant, 0, 0);

Shouldn't this verify first that GET_CODE (constant) == CONST_VECTOR
and punt otherwise?

> +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> +      *loc = vec_dup;
> +      INSN_CODE (insn) = -1;
> +      /* Revert change if there's no corresponding pattern.  */
> +      if (recog_memoized (insn) < 0)
> +      	{
> +      	  *loc = x;
> +      	  recog_memoized (insn);
> +      	}

The usual way of doing this would be through
  validate_change (insn, loc, vec_dup, 0);

Also, isn't the pass also useful for TARGET_AVX and above (but in that case
only if it is a simple memory load)?  Or are avx/avx2 broadcast slower than
full vector loads?

As Jeff wrote, I wonder if when successfully replacing those pool constants
the old constant pool entries will be omitted.

Another thing I wonder about is whether more analysis shouldn't be used.
E.g. if the constant pool entry is already emitted into .rodata anyway
(e.g. some earlier function needed it), using the broadcast will mean
actually larger .rodata.  If {1to8} and similar is as fast as reading all
the same elements from memory (or faster), perhaps in that case it should
broadcast from the first element of the existing constant pool full vector
rather than creating a new one.
And similarly, perhaps the function should look at all constant pool entries
in the current function (not yet emitted into .rodata) and if it would
succeed for some and not for others, either use broadcast from its first
element or not perform it for the others too.

	Jakub


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-27 12:24 ` Jakub Jelinek
@ 2020-08-27 13:07   ` Richard Biener
  2020-08-27 13:20     ` Jakub Jelinek
  2020-08-28 17:18   ` Hongtao Liu
  2020-09-01  9:55   ` Hongtao Liu
  2 siblings, 1 reply; 30+ messages in thread
From: Richard Biener @ 2020-08-27 13:07 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Hongtao Liu, GCC Patches

On Thu, Aug 27, 2020 at 2:25 PM Jakub Jelinek via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Thu, Jul 09, 2020 at 04:33:46PM +0800, Hongtao Liu via Gcc-patches wrote:
> > +static void
> > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > +{
> > +  subrtx_ptr_iterator::array_type array;
> > +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > +    {
> > +      rtx *loc = *iter;
> > +      rtx x = *loc;
> > +      rtx broadcast_mem, vec_dup, constant, first;
> > +      machine_mode mode;
> > +      if (GET_CODE (x) != MEM
>
> MEM_P
>
> > +       || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
>
> SYMBOL_REF_P
>
> > +       || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > +     continue;
> > +
> > +      mode = GET_MODE (x);
> > +      if (!VECTOR_MODE_P (mode))
> > +     return;
> > +
> > +      constant = get_pool_constant (XEXP (x, 0));
> > +      first = XVECEXP (constant, 0, 0);
>
> Shouldn't this verify first that GET_CODE (constant) == CONST_VECTOR
> and punt otherwise?
>
> > +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > +      *loc = vec_dup;
> > +      INSN_CODE (insn) = -1;
> > +      /* Revert change if there's no corresponding pattern.  */
> > +      if (recog_memoized (insn) < 0)
> > +             {
> > +               *loc = x;
> > +               recog_memoized (insn);
> > +             }
>
> The usual way of doing this would be through
>   validate_change (insn, loc, vec_dup, 0);
>
> Also, isn't the pass also useful for TARGET_AVX and above (but in that case
> only if it is a simple memory load)?  Or are avx/avx2 broadcast slower than
> full vector loads?
>
> As Jeff wrote, I wonder if when successfully replacing those pool constants
> the old constant pool entries will be omitted.
>
> Another thing I wonder about is whether more analysis shouldn't be used.
> E.g. if the constant pool entry is already emitted into .rodata anyway
> (e.g. some earlier function needed it), using the broadcast will mean
> actually larger .rodata.  If {1to8} and similar is as fast as reading all
> the same elements from memory (or faster), perhaps in that case it should
> broadcast from the first element of the existing constant pool full vector
> rather than creating a new one.
> And similarly, perhaps the function should look at all constant pool entries
> in the current function (not yet emitted into .rodata) and if it would
> succeed for some and not for others, either use broadcast from its first
> element or not perform it for the others too.

IIRC I once implemented this (re-using vector constant components
for non-vector pool entries) but it was quite hackish and never merged
it seems.

Richard.

>         Jakub
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-27 13:07   ` Richard Biener
@ 2020-08-27 13:20     ` Jakub Jelinek
  2020-08-28  6:47       ` Richard Biener
  0 siblings, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2020-08-27 13:20 UTC (permalink / raw)
  To: Richard Biener; +Cc: Hongtao Liu, GCC Patches

On Thu, Aug 27, 2020 at 03:07:59PM +0200, Richard Biener wrote:
> > Also, isn't the pass also useful for TARGET_AVX and above (but in that case
> > only if it is a simple memory load)?  Or are avx/avx2 broadcast slower than
> > full vector loads?
> >
> > As Jeff wrote, I wonder if when successfully replacing those pool constants
> > the old constant pool entries will be omitted.
> >
> > Another thing I wonder about is whether more analysis shouldn't be used.
> > E.g. if the constant pool entry is already emitted into .rodata anyway
> > (e.g. some earlier function needed it), using the broadcast will mean
> > actually larger .rodata.  If {1to8} and similar is as fast as reading all
> > the same elements from memory (or faster), perhaps in that case it should
> > broadcast from the first element of the existing constant pool full vector
> > rather than creating a new one.
> > And similarly, perhaps the function should look at all constant pool entries
> > in the current function (not yet emitted into .rodata) and if it would
> > succeed for some and not for others, either use broadcast from its first
> > element or not perform it for the others too.
> 
> IIRC I once implemented this (re-using vector constant components
> for non-vector pool entries) but it was quite hackish and never merged
> it seems.

If the generic constant pool code could do it, it would of course simplify
this pass.  Not sure if the case where earlier function emits already some
smaller constant and later function needs a CONST_VECTOR containing that can
be handled at all (probably not), but if the same function has both scalar
pool entries and CONST_VECTOR ones that contain those, or already emitted
CONST_VECTOR pool entry has them, it shouldn't be that hard, at least for
targets with symbol aliases, e.g. by using .LC33 = .LC24 or .LC34 = .LC24 + 8
where .LC33 or .LC34 would be the scalar pool entry label and .LC24
CONST_VECTOR containing those.

Seems constant pool marking is performed during
mark_constant_pool called during final from assemble_start_function or
assemble_end_function, so if the pass replaces the constants before final
and the constants are unused, they won't be emitted.

	Jakub


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-27 13:20     ` Jakub Jelinek
@ 2020-08-28  6:47       ` Richard Biener
  2020-08-28  8:52         ` Jakub Jelinek
  0 siblings, 1 reply; 30+ messages in thread
From: Richard Biener @ 2020-08-28  6:47 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Hongtao Liu, GCC Patches

On Thu, Aug 27, 2020 at 3:20 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Thu, Aug 27, 2020 at 03:07:59PM +0200, Richard Biener wrote:
> > > Also, isn't the pass also useful for TARGET_AVX and above (but in that case
> > > only if it is a simple memory load)?  Or are avx/avx2 broadcast slower than
> > > full vector loads?
> > >
> > > As Jeff wrote, I wonder if when successfully replacing those pool constants
> > > the old constant pool entries will be omitted.
> > >
> > > Another thing I wonder about is whether more analysis shouldn't be used.
> > > E.g. if the constant pool entry is already emitted into .rodata anyway
> > > (e.g. some earlier function needed it), using the broadcast will mean
> > > actually larger .rodata.  If {1to8} and similar is as fast as reading all
> > > the same elements from memory (or faster), perhaps in that case it should
> > > broadcast from the first element of the existing constant pool full vector
> > > rather than creating a new one.
> > > And similarly, perhaps the function should look at all constant pool entries
> > > in the current function (not yet emitted into .rodata) and if it would
> > > succeed for some and not for others, either use broadcast from its first
> > > element or not perform it for the others too.
> >
> > IIRC I once implemented this (re-using vector constant components
> > for non-vector pool entries) but it was quite hackish and never merged
> > it seems.
>
> If the generic constant pool code could do it, it would of course simplify
> this pass.  Not sure if the case where earlier function emits already some
> smaller constant and later function needs a CONST_VECTOR containing that can
> be handled at all (probably not), but if the same function has both scalar
> pool entries and CONST_VECTOR ones that contain those, or already emitted
> CONST_VECTOR pool entry has them, it shouldn't be that hard, at least for
> targets with symbol aliases, e.g. by using .LC33 = .LC24 or .LC34 = .LC24 + 8
> where .LC33 or .LC34 would be the scalar pool entry label and .LC24
> CONST_VECTOR containing those.
>
> Seems constant pool marking is performed during
> mark_constant_pool called during final from assemble_start_function or
> assemble_end_function, so if the pass replaces the constants before final
> and the constants are unused, they won't be emitted.

IIRC elsewhere it was discussed to use ld to perform merging by
emitting separate rodata sections for constant sizes (4, 8, 16, 32, 64
byte sizes).
ld could always direct 8 byte constant refs to the larger pools (sub-)entry.

As for GCCs constant pool code the issue is in the way lookup works
(hashing) and my earlier patch was recording an additional descriptor
for the first vector element IIRC.  So indeed if first the scalar is emitted
and then the vector this won't work easily (we'd need to be able to
associate multiple labes with a constant), but it could be made work, too.

I guess it would be interesting to experiment with function local pools
ordered by accesses to reduce memory bandwith cost (and pressure
on prefetchers) for memory bandwidth starved code.

Richard.

>         Jakub
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-28  6:47       ` Richard Biener
@ 2020-08-28  8:52         ` Jakub Jelinek
  2020-08-28 10:36           ` Richard Biener
  0 siblings, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2020-08-28  8:52 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

On Fri, Aug 28, 2020 at 08:47:06AM +0200, Richard Biener via Gcc-patches wrote:
> IIRC elsewhere it was discussed to use ld to perform merging by
> emitting separate rodata sections for constant sizes (4, 8, 16, 32, 64
> byte sizes).

ld does that already, and gcc too.

> ld could always direct 8 byte constant refs to the larger pools (sub-)entry.

But there is no way to express in ELF that something like that would be
acceptable.

I meant something like the following, which on e.g. a dumb:

typedef float V __attribute__((vector_size (4 * sizeof (float))));

void
foo (V *p, float *q)
{
  p[0] += (V) { 1.0f, 2.0f, 3.0f, 4.0f };
  q[0] += 4.0f;
  q[1] -= 3.0f;
  q[17] -= 2.0f;
  q[31] += 1.0f;
}

testcase merges all the 4 scalar constant pool entries into the CONST_VECTOR
one.

I'm punting for section anchors and not doing it in the per-function (i.e.
non-shared) constant pools simply because I don't know them well enough,
don't know whether backends use the offsets for something etc.
For section anchors, I guess it would need to be done before (re)computing the
offsets and arrange for the desc->mark < 0 entries not to be considered as
objects in the object block, for non-shared pools, perhaps it would be
enough to call the new function from output_constant_pool before calling
recompute_pool_offsets and adjust recompute_pool_offsets to ignore
desc->mark < 0.

2020-08-28  Jakub Jelinek  <jakub@redhat.com>

	* varasm.c (output_constant_pool_contents): Emit desc->mark < 0
	entries as aliases.
	(optimize_constant_pool): New function.
	(output_shared_constant_pool): Call it if TARGET_SUPPORTS_ALIASES.

--- gcc/varasm.c.jj	2020-07-28 15:39:10.091755086 +0200
+++ gcc/varasm.c	2020-08-28 10:38:10.207636849 +0200
@@ -4198,7 +4198,27 @@ output_constant_pool_contents (struct rt
   class constant_descriptor_rtx *desc;
 
   for (desc = pool->first; desc ; desc = desc->next)
-    if (desc->mark)
+    if (desc->mark < 0)
+      {
+#ifdef ASM_OUTPUT_DEF
+        const char *name = targetm.strip_name_encoding (XSTR (desc->sym, 0));
+        char label[256];
+        char buffer[256 + 32];
+        const char *p;
+
+        ASM_GENERATE_INTERNAL_LABEL (label, "LC", ~desc->mark);
+	p = targetm.strip_name_encoding (label);
+	if (desc->offset)
+	  {
+	    sprintf (buffer, "%s+%ld", p, (long) (desc->offset));
+	    p = buffer;
+	  }
+	ASM_OUTPUT_DEF (asm_out_file, name, p);
+#else
+	gcc_unreachable ();
+#endif
+      }
+    else if (desc->mark)
       {
 	/* If the constant is part of an object_block, make sure that
 	   the constant has been positioned within its block, but do not
@@ -4216,6 +4236,52 @@ output_constant_pool_contents (struct rt
       }
 }
 
+/* Attempt to optimize constant pool POOL.  If it contains both CONST_VECTOR
+   constants and scalar constants with the values of CONST_VECTOR elements,
+   try to alias the scalar constants with the CONST_VECTOR elements.  */
+
+static void
+optimize_constant_pool (struct rtx_constant_pool *pool)
+{
+  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
+    if (desc->mark > 0
+	&& GET_CODE (desc->constant) == CONST_VECTOR
+	&& VECTOR_MODE_P (desc->mode)
+	&& GET_MODE_CLASS (desc->mode) != MODE_VECTOR_BOOL
+	&& !(SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
+	     && SYMBOL_REF_BLOCK (desc->sym))
+	&& desc->labelno >= 0)
+      {
+	scalar_mode submode = GET_MODE_INNER (desc->mode);
+	unsigned int subalign = MIN (desc->align, GET_MODE_BITSIZE (submode));
+	int units = GET_MODE_NUNITS (desc->mode);
+
+	for (int i = 0; i < units; i++)
+	  {
+	    if (i != 0
+		&& rtx_equal_p (CONST_VECTOR_ELT (desc->constant, i),
+				CONST_VECTOR_ELT (desc->constant, i - 1)))
+	      continue;
+
+	    constant_descriptor_rtx tmp;
+	    tmp.constant = CONST_VECTOR_ELT (desc->constant, i);
+	    tmp.mode = submode;
+	    hashval_t hash = const_rtx_hash (tmp.constant);
+	    constant_descriptor_rtx *eldesc
+	      = pool->const_rtx_htab->find_with_hash (&tmp, hash);
+	    if (eldesc
+		&& eldesc->mark > 0
+		&& eldesc->align <= subalign
+		&& !(SYMBOL_REF_HAS_BLOCK_INFO_P (eldesc->sym)
+		     && SYMBOL_REF_BLOCK (eldesc->sym)))
+	      {
+		eldesc->mark = ~desc->labelno;
+		eldesc->offset = i * GET_MODE_SIZE (submode);
+	      }
+	  }
+      }
+}
+
 /* Mark all constants that are used in the current function, then write
    out the function's private constant pool.  */
 
@@ -4251,6 +4317,9 @@ output_constant_pool (const char *fnname
 void
 output_shared_constant_pool (void)
 {
+  if (TARGET_SUPPORTS_ALIASES)
+    optimize_constant_pool (shared_constant_pool);
+
   output_constant_pool_contents (shared_constant_pool);
 }
 \f


	Jakub


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-28  8:52         ` Jakub Jelinek
@ 2020-08-28 10:36           ` Richard Biener
  2020-08-28 10:47             ` Jakub Jelinek
  0 siblings, 1 reply; 30+ messages in thread
From: Richard Biener @ 2020-08-28 10:36 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

On Fri, Aug 28, 2020 at 10:52 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Fri, Aug 28, 2020 at 08:47:06AM +0200, Richard Biener via Gcc-patches wrote:
> > IIRC elsewhere it was discussed to use ld to perform merging by
> > emitting separate rodata sections for constant sizes (4, 8, 16, 32, 64
> > byte sizes).
>
> ld does that already, and gcc too.
>
> > ld could always direct 8 byte constant refs to the larger pools (sub-)entry.
>
> But there is no way to express in ELF that something like that would be
> acceptable.

Hmm, I see.

> I meant something like the following, which on e.g. a dumb:
>
> typedef float V __attribute__((vector_size (4 * sizeof (float))));
>
> void
> foo (V *p, float *q)
> {
>   p[0] += (V) { 1.0f, 2.0f, 3.0f, 4.0f };
>   q[0] += 4.0f;
>   q[1] -= 3.0f;
>   q[17] -= 2.0f;
>   q[31] += 1.0f;
> }
>
> testcase merges all the 4 scalar constant pool entries into the CONST_VECTOR
> one.
>
> I'm punting for section anchors and not doing it in the per-function (i.e.
> non-shared) constant pools simply because I don't know them well enough,
> don't know whether backends use the offsets for something etc.
> For section anchors, I guess it would need to be done before (re)computing the
> offsets and arrange for the desc->mark < 0 entries not to be considered as
> objects in the object block, for non-shared pools, perhaps it would be
> enough to call the new function from output_constant_pool before calling
> recompute_pool_offsets and adjust recompute_pool_offsets to ignore
> desc->mark < 0.

Guess this would work indeed.  It's probably quite common to have
both vector and non-vector constants because of vectorization
and scalar epilogues.  But note that elsewhere we're using
the largest component mode to emit vector constant pool entries
to share { -1, -1, -1, -1 } and {-1l, -1l } for example so while the
below code works for FP modes it likely will break down for
integer modes?

Richard.

> 2020-08-28  Jakub Jelinek  <jakub@redhat.com>
>
>         * varasm.c (output_constant_pool_contents): Emit desc->mark < 0
>         entries as aliases.
>         (optimize_constant_pool): New function.
>         (output_shared_constant_pool): Call it if TARGET_SUPPORTS_ALIASES.
>
> --- gcc/varasm.c.jj     2020-07-28 15:39:10.091755086 +0200
> +++ gcc/varasm.c        2020-08-28 10:38:10.207636849 +0200
> @@ -4198,7 +4198,27 @@ output_constant_pool_contents (struct rt
>    class constant_descriptor_rtx *desc;
>
>    for (desc = pool->first; desc ; desc = desc->next)
> -    if (desc->mark)
> +    if (desc->mark < 0)
> +      {
> +#ifdef ASM_OUTPUT_DEF
> +        const char *name = targetm.strip_name_encoding (XSTR (desc->sym, 0));
> +        char label[256];
> +        char buffer[256 + 32];
> +        const char *p;
> +
> +        ASM_GENERATE_INTERNAL_LABEL (label, "LC", ~desc->mark);
> +       p = targetm.strip_name_encoding (label);
> +       if (desc->offset)
> +         {
> +           sprintf (buffer, "%s+%ld", p, (long) (desc->offset));
> +           p = buffer;
> +         }
> +       ASM_OUTPUT_DEF (asm_out_file, name, p);
> +#else
> +       gcc_unreachable ();
> +#endif
> +      }
> +    else if (desc->mark)
>        {
>         /* If the constant is part of an object_block, make sure that
>            the constant has been positioned within its block, but do not
> @@ -4216,6 +4236,52 @@ output_constant_pool_contents (struct rt
>        }
>  }
>
> +/* Attempt to optimize constant pool POOL.  If it contains both CONST_VECTOR
> +   constants and scalar constants with the values of CONST_VECTOR elements,
> +   try to alias the scalar constants with the CONST_VECTOR elements.  */
> +
> +static void
> +optimize_constant_pool (struct rtx_constant_pool *pool)
> +{
> +  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
> +    if (desc->mark > 0
> +       && GET_CODE (desc->constant) == CONST_VECTOR
> +       && VECTOR_MODE_P (desc->mode)
> +       && GET_MODE_CLASS (desc->mode) != MODE_VECTOR_BOOL
> +       && !(SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
> +            && SYMBOL_REF_BLOCK (desc->sym))
> +       && desc->labelno >= 0)
> +      {
> +       scalar_mode submode = GET_MODE_INNER (desc->mode);
> +       unsigned int subalign = MIN (desc->align, GET_MODE_BITSIZE (submode));
> +       int units = GET_MODE_NUNITS (desc->mode);
> +
> +       for (int i = 0; i < units; i++)
> +         {
> +           if (i != 0
> +               && rtx_equal_p (CONST_VECTOR_ELT (desc->constant, i),
> +                               CONST_VECTOR_ELT (desc->constant, i - 1)))
> +             continue;
> +
> +           constant_descriptor_rtx tmp;
> +           tmp.constant = CONST_VECTOR_ELT (desc->constant, i);
> +           tmp.mode = submode;
> +           hashval_t hash = const_rtx_hash (tmp.constant);
> +           constant_descriptor_rtx *eldesc
> +             = pool->const_rtx_htab->find_with_hash (&tmp, hash);
> +           if (eldesc
> +               && eldesc->mark > 0
> +               && eldesc->align <= subalign
> +               && !(SYMBOL_REF_HAS_BLOCK_INFO_P (eldesc->sym)
> +                    && SYMBOL_REF_BLOCK (eldesc->sym)))
> +             {
> +               eldesc->mark = ~desc->labelno;
> +               eldesc->offset = i * GET_MODE_SIZE (submode);
> +             }
> +         }
> +      }
> +}
> +
>  /* Mark all constants that are used in the current function, then write
>     out the function's private constant pool.  */
>
> @@ -4251,6 +4317,9 @@ output_constant_pool (const char *fnname
>  void
>  output_shared_constant_pool (void)
>  {
> +  if (TARGET_SUPPORTS_ALIASES)
> +    optimize_constant_pool (shared_constant_pool);
> +
>    output_constant_pool_contents (shared_constant_pool);
>  }
>
>
>
>         Jakub
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-28 10:36           ` Richard Biener
@ 2020-08-28 10:47             ` Jakub Jelinek
  2020-08-28 11:06               ` Richard Biener
  0 siblings, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2020-08-28 10:47 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

On Fri, Aug 28, 2020 at 12:36:00PM +0200, Richard Biener wrote:
> Guess this would work indeed.  It's probably quite common to have
> both vector and non-vector constants because of vectorization
> and scalar epilogues.  But note that elsewhere we're using
> the largest component mode to emit vector constant pool entries
> to share { -1, -1, -1, -1 } and {-1l, -1l } for example so while the
> below code works for FP modes it likely will break down for
> integer modes?

I don't see why it would break, it will not optimize { -1LL, -1LL }
vs. -1 scalar, sure, but it uses the hash and equality function the
rtl constant pool uses, which means it compares both the constants
(rtx_equal_p) and mode we have recorded for it.
Of course, on x86_64 integer scalar constants will pretty much never appear
in the constant pool, so guess we'll need a different target for testing
that.

	Jakub


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-28 10:47             ` Jakub Jelinek
@ 2020-08-28 11:06               ` Richard Biener
  2020-08-28 11:26                 ` Jakub Jelinek
  2020-08-28 14:53                 ` Jakub Jelinek
  0 siblings, 2 replies; 30+ messages in thread
From: Richard Biener @ 2020-08-28 11:06 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

On Fri, Aug 28, 2020 at 12:47 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Fri, Aug 28, 2020 at 12:36:00PM +0200, Richard Biener wrote:
> > Guess this would work indeed.  It's probably quite common to have
> > both vector and non-vector constants because of vectorization
> > and scalar epilogues.  But note that elsewhere we're using
> > the largest component mode to emit vector constant pool entries
> > to share { -1, -1, -1, -1 } and {-1l, -1l } for example so while the
> > below code works for FP modes it likely will break down for
> > integer modes?
>
> I don't see why it would break, it will not optimize { -1LL, -1LL }
> vs. -1 scalar, sure, but it uses the hash and equality function the
> rtl constant pool uses, which means it compares both the constants
> (rtx_equal_p) and mode we have recorded for it.

Oh, I thought my patch for PR54201 was installed but it was not
(I grepped my patch folder...).  PR54201 complains about something
similar but then different ;)  Guess post-processing that would also be
possible but also a bit awkward.  Maybe we should hash the
full byte representation instead of the components.

Richard.

> Of course, on x86_64 integer scalar constants will pretty much never appear
> in the constant pool, so guess we'll need a different target for testing
> that.
>
>         Jakub
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-28 11:06               ` Richard Biener
@ 2020-08-28 11:26                 ` Jakub Jelinek
  2020-08-28 14:53                 ` Jakub Jelinek
  1 sibling, 0 replies; 30+ messages in thread
From: Jakub Jelinek @ 2020-08-28 11:26 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

On Fri, Aug 28, 2020 at 01:06:40PM +0200, Richard Biener wrote:
> > On Fri, Aug 28, 2020 at 12:36:00PM +0200, Richard Biener wrote:
> > > Guess this would work indeed.  It's probably quite common to have
> > > both vector and non-vector constants because of vectorization
> > > and scalar epilogues.  But note that elsewhere we're using
> > > the largest component mode to emit vector constant pool entries
> > > to share { -1, -1, -1, -1 } and {-1l, -1l } for example so while the
> > > below code works for FP modes it likely will break down for
> > > integer modes?
> >
> > I don't see why it would break, it will not optimize { -1LL, -1LL }
> > vs. -1 scalar, sure, but it uses the hash and equality function the
> > rtl constant pool uses, which means it compares both the constants
> > (rtx_equal_p) and mode we have recorded for it.
> 
> Oh, I thought my patch for PR54201 was installed but it was not
> (I grepped my patch folder...).  PR54201 complains about something
> similar but then different ;)  Guess post-processing that would also be
> possible but also a bit awkward.  Maybe we should hash the
> full byte representation instead of the components.

I think we in general can't, because the constants can contain even things
like LABEL_REFs, so not everything is actually representable as host byte
stream.
And I'm not convinced doing it in force_constant_mem is best, because at
that point we really don't know if the constants will not be optimized away.

We could extend what my patch does though, by sorting the pool entries
(those with mark > 0 only) by descreasing size and for those for which
native_encode_rtx is successful, enter into a hash table an entry for their
whole byte representation, then half of it until each bytes (possibly taking
into account the alignment too).

As for section anchors, handling that seems far less important, as e.g. on
both powerpc64 and aarch64 I see (unless -fno-merge-constants) the constant
pool emitted outside of the blocks and so optimized by the posted patch.

	Jakub


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-28 11:06               ` Richard Biener
  2020-08-28 11:26                 ` Jakub Jelinek
@ 2020-08-28 14:53                 ` Jakub Jelinek
  2020-08-28 16:07                   ` Richard Sandiford
  1 sibling, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2020-08-28 14:53 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

On Fri, Aug 28, 2020 at 01:06:40PM +0200, Richard Biener via Gcc-patches wrote:
> > I don't see why it would break, it will not optimize { -1LL, -1LL }
> > vs. -1 scalar, sure, but it uses the hash and equality function the
> > rtl constant pool uses, which means it compares both the constants
> > (rtx_equal_p) and mode we have recorded for it.
> 
> Oh, I thought my patch for PR54201 was installed but it was not
> (I grepped my patch folder...).  PR54201 complains about something
> similar but then different ;)  Guess post-processing that would also be
> possible but also a bit awkward.  Maybe we should hash the
> full byte representation instead of the components.

Here is an adjusted patch that ought to merge even the same sized different
mode vectors with the same byte representation, etc.
It won't really help with avoiding the multiple reads of the constant in the
same function, but as you found, your patch doesn't help with that either.
Your patch isn't really incompatible with what the patch below does, though
I wonder whether a) it wouldn't be better to always canonicalize to an
integral mode with as few elts as possible even e.g. for floats b) whether
asserting that it simplify_rtx succeeds is safe, whether it shouldn't just
canonicalize if the canonicalization works and just do what it previously
did otherwise.

The following patch puts all pool entries which can be natively encoded
into a vector, sorts it by decreasing size, determines minimum size
of a pool entry and adds hash elts for each (aligned) min_size or wider
power of two-ish portion of the pool constant in addition to the whole pool
constant byte representation.

2020-08-28  Jakub Jelinek  <jakub@redhat.com>

	PR middle-end/54201
	* varasm.c: Include alloc-pool.h.
	(output_constant_pool_contents): Emit desc->mark < 0 entries as
	aliases.
	(struct constant_descriptor_rtx_data): New type.
	(constant_descriptor_rtx_data_cmp): New function.
	(struct const_rtx_data_hasher): New type.
	(const_rtx_data_hasher::hash, const_rtx_data_hasher::equal): New
	methods.
	(optimize_constant_pool): New function.
	(output_shared_constant_pool): Call it if TARGET_SUPPORTS_ALIASES.

--- gcc/varasm.c.jj	2020-07-28 15:39:10.091755086 +0200
+++ gcc/varasm.c	2020-08-28 15:37:30.605076961 +0200
@@ -57,6 +57,7 @@ along with GCC; see the file COPYING3.
 #include "asan.h"
 #include "rtl-iter.h"
 #include "file-prefix-map.h" /* remap_debug_filename()  */
+#include "alloc-pool.h"
 
 #ifdef XCOFF_DEBUGGING_INFO
 #include "xcoffout.h"		/* Needed for external data declarations.  */
@@ -4198,7 +4199,27 @@ output_constant_pool_contents (struct rt
   class constant_descriptor_rtx *desc;
 
   for (desc = pool->first; desc ; desc = desc->next)
-    if (desc->mark)
+    if (desc->mark < 0)
+      {
+#ifdef ASM_OUTPUT_DEF
+	const char *name = targetm.strip_name_encoding (XSTR (desc->sym, 0));
+	char label[256];
+	char buffer[256 + 32];
+	const char *p;
+
+	ASM_GENERATE_INTERNAL_LABEL (label, "LC", ~desc->mark);
+	p = targetm.strip_name_encoding (label);
+	if (desc->offset)
+	  {
+	    sprintf (buffer, "%s+%ld", p, desc->offset);
+	    p = buffer;
+	  }
+	ASM_OUTPUT_DEF (asm_out_file, name, p);
+#else
+	gcc_unreachable ();
+#endif
+      }
+    else if (desc->mark)
       {
 	/* If the constant is part of an object_block, make sure that
 	   the constant has been positioned within its block, but do not
@@ -4216,6 +4237,159 @@ output_constant_pool_contents (struct rt
       }
 }
 
+struct constant_descriptor_rtx_data {
+  constant_descriptor_rtx *desc;
+  target_unit *bytes;
+  unsigned short size;
+  unsigned short offset;
+  unsigned int hash;
+};
+
+/* qsort callback to sort constant_descriptor_rtx_data * vector by
+   decreasing size.  */
+
+static int
+constant_descriptor_rtx_data_cmp (const void *p1, const void *p2)
+{
+  constant_descriptor_rtx_data *const data1
+    = *(constant_descriptor_rtx_data * const *) p1;
+  constant_descriptor_rtx_data *const data2
+    = *(constant_descriptor_rtx_data * const *) p2;
+  if (data1->size > data2->size)
+    return -1;
+  if (data1->size < data2->size)
+    return 1;
+  if (data1->hash < data2->hash)
+    return -1;
+  gcc_assert (data1->hash > data2->hash);
+  return 1;
+}
+
+struct const_rtx_data_hasher : nofree_ptr_hash<constant_descriptor_rtx_data>
+{
+  static hashval_t hash (constant_descriptor_rtx_data *);
+  static bool equal (constant_descriptor_rtx_data *,
+		     constant_descriptor_rtx_data *);
+};
+
+/* Hash and compare functions for const_rtx_data_htab.  */
+
+hashval_t
+const_rtx_data_hasher::hash (constant_descriptor_rtx_data *data)
+{
+  return data->hash;
+}
+
+bool
+const_rtx_data_hasher::equal (constant_descriptor_rtx_data *x,
+			      constant_descriptor_rtx_data *y)
+{
+  if (x->hash != y->hash || x->size != y->size)
+    return 0;
+  unsigned int align1 = x->desc->align;
+  unsigned int align2 = y->desc->align;
+  unsigned int offset1 = (x->offset * BITS_PER_UNIT) & (align1 - 1);
+  unsigned int offset2 = (y->offset * BITS_PER_UNIT) & (align2 - 1);
+  if (offset1)
+    align1 = least_bit_hwi (offset1);
+  if (offset2)
+    align2 = least_bit_hwi (offset2);
+  if (align2 > align1)
+    return 0;
+  if (memcmp (x->bytes, y->bytes, x->size * sizeof (target_unit)) != 0)
+    return 0;
+  return 1;
+}
+
+/* Attempt to optimize constant pool POOL.  If it contains both CONST_VECTOR
+   constants and scalar constants with the values of CONST_VECTOR elements,
+   try to alias the scalar constants with the CONST_VECTOR elements.  */
+
+static void
+optimize_constant_pool (struct rtx_constant_pool *pool)
+{
+  auto_vec<target_unit, 128> buffer;
+  auto_vec<constant_descriptor_rtx_data *, 128> vec;
+  object_allocator<constant_descriptor_rtx_data>
+    data_pool ("constant_descriptor_rtx_data_pool");
+  int idx = 0;
+  size_t size = 0;
+  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
+    if (desc->mark > 0
+	&& ! (SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
+	      && SYMBOL_REF_BLOCK (desc->sym)))
+      {
+	buffer.truncate (0);
+	if (native_encode_rtx (desc->mode, desc->constant, buffer, 0,
+			       GET_MODE_SIZE (desc->mode)))
+	  {
+	    constant_descriptor_rtx_data *data = data_pool.allocate ();
+	    data->desc = desc;
+	    data->bytes = NULL;
+	    data->size = GET_MODE_SIZE (desc->mode);
+	    data->offset = 0;
+	    data->hash = idx++;
+	    size += data->size;
+	    vec.safe_push (data);
+	  }
+      }
+  if (idx)
+    {
+      vec.qsort (constant_descriptor_rtx_data_cmp);
+      unsigned min_size = vec.last ()->size;
+      target_unit *bytes = XNEWVEC (target_unit, size);
+      unsigned int i;
+      constant_descriptor_rtx_data *data;
+      hash_table<const_rtx_data_hasher> * htab
+	= new hash_table<const_rtx_data_hasher> (31);
+      size = 0;
+      FOR_EACH_VEC_ELT (vec, i, data)
+	{
+	  buffer.truncate (0);
+	  native_encode_rtx (data->desc->mode, data->desc->constant,
+			     buffer, 0, data->size);
+	  memcpy (bytes + size, buffer.address (), data->size);
+	  data->bytes = bytes + size;
+	  data->hash = iterative_hash (data->bytes,
+				       data->size * sizeof (target_unit), 0);
+	  size += data->size;
+	  constant_descriptor_rtx_data **slot
+	    = htab->find_slot_with_hash (data, data->hash, INSERT);
+	  if (*slot)
+	    {
+	      data->desc->mark = ~(*slot)->desc->labelno;
+	      data->desc->offset = (*slot)->offset;
+	    }
+	  else
+	    {
+	      unsigned int sz = 1 << floor_log2 (data->size);
+
+	      *slot = data;
+	      for (sz >>= 1; sz >= min_size; sz >>= 1)
+		for (unsigned off = 0; off + sz <= data->size; off += sz)
+		  {
+		    constant_descriptor_rtx_data tmp;
+		    tmp.desc = data->desc;
+		    tmp.bytes = data->bytes + off;
+		    tmp.size = sz;
+		    tmp.offset = off;
+		    tmp.hash = iterative_hash (tmp.bytes,
+					       sz * sizeof (target_unit), 0);
+		    slot = htab->find_slot_with_hash (&tmp, tmp.hash, INSERT);
+		    if (*slot == NULL)
+		      {
+			*slot = data_pool.allocate ();
+			**slot = tmp;
+		      }
+		  }
+	    }
+	}
+      delete htab;
+      XDELETE (bytes);
+    }
+  data_pool.release ();
+}
+
 /* Mark all constants that are used in the current function, then write
    out the function's private constant pool.  */
 
@@ -4251,6 +4425,10 @@ output_constant_pool (const char *fnname
 void
 output_shared_constant_pool (void)
 {
+  if (optimize
+      && TARGET_SUPPORTS_ALIASES)
+    optimize_constant_pool (shared_constant_pool);
+
   output_constant_pool_contents (shared_constant_pool);
 }
 \f


	Jakub


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-28 14:53                 ` Jakub Jelinek
@ 2020-08-28 16:07                   ` Richard Sandiford
  2020-08-28 16:25                     ` Jakub Jelinek
  0 siblings, 1 reply; 30+ messages in thread
From: Richard Sandiford @ 2020-08-28 16:07 UTC (permalink / raw)
  To: Jakub Jelinek via Gcc-patches; +Cc: Richard Biener, Jakub Jelinek

Thanks for doing this.  I don't feel qualified to review the full
patch, but one thing:

Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> +  auto_vec<target_unit, 128> buffer;
> +  auto_vec<constant_descriptor_rtx_data *, 128> vec;
> +  object_allocator<constant_descriptor_rtx_data>
> +    data_pool ("constant_descriptor_rtx_data_pool");
> +  int idx = 0;
> +  size_t size = 0;
> +  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
> +    if (desc->mark > 0
> +	&& ! (SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
> +	      && SYMBOL_REF_BLOCK (desc->sym)))
> +      {
> +	buffer.truncate (0);

128 isn't big enough for all targets (e.g. aarch64 with
-msve-vector-bits=2048), so I think we still need a reserve
call here.

Thanks,
Richard

> +	if (native_encode_rtx (desc->mode, desc->constant, buffer, 0,
> +			       GET_MODE_SIZE (desc->mode)))
> +	  {
> +	    constant_descriptor_rtx_data *data = data_pool.allocate ();
> +	    data->desc = desc;
> +	    data->bytes = NULL;
> +	    data->size = GET_MODE_SIZE (desc->mode);
> +	    data->offset = 0;
> +	    data->hash = idx++;
> +	    size += data->size;
> +	    vec.safe_push (data);
> +	  }
> +      }
> +  if (idx)
> +    {
> +      vec.qsort (constant_descriptor_rtx_data_cmp);
> +      unsigned min_size = vec.last ()->size;
> +      target_unit *bytes = XNEWVEC (target_unit, size);
> +      unsigned int i;
> +      constant_descriptor_rtx_data *data;
> +      hash_table<const_rtx_data_hasher> * htab
> +	= new hash_table<const_rtx_data_hasher> (31);
> +      size = 0;
> +      FOR_EACH_VEC_ELT (vec, i, data)
> +	{
> +	  buffer.truncate (0);
> +	  native_encode_rtx (data->desc->mode, data->desc->constant,
> +			     buffer, 0, data->size);
> +	  memcpy (bytes + size, buffer.address (), data->size);
> +	  data->bytes = bytes + size;
> +	  data->hash = iterative_hash (data->bytes,
> +				       data->size * sizeof (target_unit), 0);
> +	  size += data->size;
> +	  constant_descriptor_rtx_data **slot
> +	    = htab->find_slot_with_hash (data, data->hash, INSERT);
> +	  if (*slot)
> +	    {
> +	      data->desc->mark = ~(*slot)->desc->labelno;
> +	      data->desc->offset = (*slot)->offset;
> +	    }
> +	  else
> +	    {
> +	      unsigned int sz = 1 << floor_log2 (data->size);
> +
> +	      *slot = data;
> +	      for (sz >>= 1; sz >= min_size; sz >>= 1)
> +		for (unsigned off = 0; off + sz <= data->size; off += sz)
> +		  {
> +		    constant_descriptor_rtx_data tmp;
> +		    tmp.desc = data->desc;
> +		    tmp.bytes = data->bytes + off;
> +		    tmp.size = sz;
> +		    tmp.offset = off;
> +		    tmp.hash = iterative_hash (tmp.bytes,
> +					       sz * sizeof (target_unit), 0);
> +		    slot = htab->find_slot_with_hash (&tmp, tmp.hash, INSERT);
> +		    if (*slot == NULL)
> +		      {
> +			*slot = data_pool.allocate ();
> +			**slot = tmp;
> +		      }
> +		  }
> +	    }
> +	}
> +      delete htab;
> +      XDELETE (bytes);
> +    }
> +  data_pool.release ();
> +}
> +
>  /* Mark all constants that are used in the current function, then write
>     out the function's private constant pool.  */
>  
> @@ -4251,6 +4425,10 @@ output_constant_pool (const char *fnname
>  void
>  output_shared_constant_pool (void)
>  {
> +  if (optimize
> +      && TARGET_SUPPORTS_ALIASES)
> +    optimize_constant_pool (shared_constant_pool);
> +
>    output_constant_pool_contents (shared_constant_pool);
>  }
>  \f
>
>
> 	Jakub

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-28 16:07                   ` Richard Sandiford
@ 2020-08-28 16:25                     ` Jakub Jelinek
  2020-08-30  9:24                       ` Jakub Jelinek
  0 siblings, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2020-08-28 16:25 UTC (permalink / raw)
  To: Richard Biener, richard.sandiford; +Cc: gcc-patches

On Fri, Aug 28, 2020 at 05:07:11PM +0100, Richard Sandiford wrote:
> Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > +  auto_vec<target_unit, 128> buffer;
> > +  auto_vec<constant_descriptor_rtx_data *, 128> vec;
> > +  object_allocator<constant_descriptor_rtx_data>
> > +    data_pool ("constant_descriptor_rtx_data_pool");
> > +  int idx = 0;
> > +  size_t size = 0;
> > +  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
> > +    if (desc->mark > 0
> > +	&& ! (SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
> > +	      && SYMBOL_REF_BLOCK (desc->sym)))
> > +      {
> > +	buffer.truncate (0);
> 
> 128 isn't big enough for all targets (e.g. aarch64 with
> -msve-vector-bits=2048), so I think we still need a reserve
> call here.

You're right, thanks for spotting it, I've missed native_encode_rtx will do
quick_push rather than safe_push.

Updated patch below, it shouldn't be needed in the second loop, because
the first loop should already grow it to the largest size.

2020-08-28  Jakub Jelinek  <jakub@redhat.com>

	PR middle-end/54201
	* varasm.c: Include alloc-pool.h.
	(output_constant_pool_contents): Emit desc->mark < 0 entries as
	aliases.
	(struct constant_descriptor_rtx_data): New type.
	(constant_descriptor_rtx_data_cmp): New function.
	(struct const_rtx_data_hasher): New type.
	(const_rtx_data_hasher::hash, const_rtx_data_hasher::equal): New
	methods.
	(optimize_constant_pool): New function.
	(output_shared_constant_pool): Call it if TARGET_SUPPORTS_ALIASES.

--- gcc/varasm.c.jj	2020-07-28 15:39:10.091755086 +0200
+++ gcc/varasm.c	2020-08-28 18:21:58.943759578 +0200
@@ -57,6 +57,7 @@ along with GCC; see the file COPYING3.
 #include "asan.h"
 #include "rtl-iter.h"
 #include "file-prefix-map.h" /* remap_debug_filename()  */
+#include "alloc-pool.h"
 
 #ifdef XCOFF_DEBUGGING_INFO
 #include "xcoffout.h"		/* Needed for external data declarations.  */
@@ -4198,7 +4199,27 @@ output_constant_pool_contents (struct rt
   class constant_descriptor_rtx *desc;
 
   for (desc = pool->first; desc ; desc = desc->next)
-    if (desc->mark)
+    if (desc->mark < 0)
+      {
+#ifdef ASM_OUTPUT_DEF
+	const char *name = targetm.strip_name_encoding (XSTR (desc->sym, 0));
+	char label[256];
+	char buffer[256 + 32];
+	const char *p;
+
+	ASM_GENERATE_INTERNAL_LABEL (label, "LC", ~desc->mark);
+	p = targetm.strip_name_encoding (label);
+	if (desc->offset)
+	  {
+	    sprintf (buffer, "%s+%ld", p, desc->offset);
+	    p = buffer;
+	  }
+	ASM_OUTPUT_DEF (asm_out_file, name, p);
+#else
+	gcc_unreachable ();
+#endif
+      }
+    else if (desc->mark)
       {
 	/* If the constant is part of an object_block, make sure that
 	   the constant has been positioned within its block, but do not
@@ -4216,6 +4237,160 @@ output_constant_pool_contents (struct rt
       }
 }
 
+struct constant_descriptor_rtx_data {
+  constant_descriptor_rtx *desc;
+  target_unit *bytes;
+  unsigned short size;
+  unsigned short offset;
+  unsigned int hash;
+};
+
+/* qsort callback to sort constant_descriptor_rtx_data * vector by
+   decreasing size.  */
+
+static int
+constant_descriptor_rtx_data_cmp (const void *p1, const void *p2)
+{
+  constant_descriptor_rtx_data *const data1
+    = *(constant_descriptor_rtx_data * const *) p1;
+  constant_descriptor_rtx_data *const data2
+    = *(constant_descriptor_rtx_data * const *) p2;
+  if (data1->size > data2->size)
+    return -1;
+  if (data1->size < data2->size)
+    return 1;
+  if (data1->hash < data2->hash)
+    return -1;
+  gcc_assert (data1->hash > data2->hash);
+  return 1;
+}
+
+struct const_rtx_data_hasher : nofree_ptr_hash<constant_descriptor_rtx_data>
+{
+  static hashval_t hash (constant_descriptor_rtx_data *);
+  static bool equal (constant_descriptor_rtx_data *,
+		     constant_descriptor_rtx_data *);
+};
+
+/* Hash and compare functions for const_rtx_data_htab.  */
+
+hashval_t
+const_rtx_data_hasher::hash (constant_descriptor_rtx_data *data)
+{
+  return data->hash;
+}
+
+bool
+const_rtx_data_hasher::equal (constant_descriptor_rtx_data *x,
+			      constant_descriptor_rtx_data *y)
+{
+  if (x->hash != y->hash || x->size != y->size)
+    return 0;
+  unsigned int align1 = x->desc->align;
+  unsigned int align2 = y->desc->align;
+  unsigned int offset1 = (x->offset * BITS_PER_UNIT) & (align1 - 1);
+  unsigned int offset2 = (y->offset * BITS_PER_UNIT) & (align2 - 1);
+  if (offset1)
+    align1 = least_bit_hwi (offset1);
+  if (offset2)
+    align2 = least_bit_hwi (offset2);
+  if (align2 > align1)
+    return 0;
+  if (memcmp (x->bytes, y->bytes, x->size * sizeof (target_unit)) != 0)
+    return 0;
+  return 1;
+}
+
+/* Attempt to optimize constant pool POOL.  If it contains both CONST_VECTOR
+   constants and scalar constants with the values of CONST_VECTOR elements,
+   try to alias the scalar constants with the CONST_VECTOR elements.  */
+
+static void
+optimize_constant_pool (struct rtx_constant_pool *pool)
+{
+  auto_vec<target_unit, 128> buffer;
+  auto_vec<constant_descriptor_rtx_data *, 128> vec;
+  object_allocator<constant_descriptor_rtx_data>
+    data_pool ("constant_descriptor_rtx_data_pool");
+  int idx = 0;
+  size_t size = 0;
+  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
+    if (desc->mark > 0
+	&& ! (SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
+	      && SYMBOL_REF_BLOCK (desc->sym)))
+      {
+	buffer.truncate (0);
+	buffer.reserve (GET_MODE_SIZE (desc->mode));
+	if (native_encode_rtx (desc->mode, desc->constant, buffer, 0,
+			       GET_MODE_SIZE (desc->mode)))
+	  {
+	    constant_descriptor_rtx_data *data = data_pool.allocate ();
+	    data->desc = desc;
+	    data->bytes = NULL;
+	    data->size = GET_MODE_SIZE (desc->mode);
+	    data->offset = 0;
+	    data->hash = idx++;
+	    size += data->size;
+	    vec.safe_push (data);
+	  }
+      }
+  if (idx)
+    {
+      vec.qsort (constant_descriptor_rtx_data_cmp);
+      unsigned min_size = vec.last ()->size;
+      target_unit *bytes = XNEWVEC (target_unit, size);
+      unsigned int i;
+      constant_descriptor_rtx_data *data;
+      hash_table<const_rtx_data_hasher> * htab
+	= new hash_table<const_rtx_data_hasher> (31);
+      size = 0;
+      FOR_EACH_VEC_ELT (vec, i, data)
+	{
+	  buffer.truncate (0);
+	  native_encode_rtx (data->desc->mode, data->desc->constant,
+			     buffer, 0, data->size);
+	  memcpy (bytes + size, buffer.address (), data->size);
+	  data->bytes = bytes + size;
+	  data->hash = iterative_hash (data->bytes,
+				       data->size * sizeof (target_unit), 0);
+	  size += data->size;
+	  constant_descriptor_rtx_data **slot
+	    = htab->find_slot_with_hash (data, data->hash, INSERT);
+	  if (*slot)
+	    {
+	      data->desc->mark = ~(*slot)->desc->labelno;
+	      data->desc->offset = (*slot)->offset;
+	    }
+	  else
+	    {
+	      unsigned int sz = 1 << floor_log2 (data->size);
+
+	      *slot = data;
+	      for (sz >>= 1; sz >= min_size; sz >>= 1)
+		for (unsigned off = 0; off + sz <= data->size; off += sz)
+		  {
+		    constant_descriptor_rtx_data tmp;
+		    tmp.desc = data->desc;
+		    tmp.bytes = data->bytes + off;
+		    tmp.size = sz;
+		    tmp.offset = off;
+		    tmp.hash = iterative_hash (tmp.bytes,
+					       sz * sizeof (target_unit), 0);
+		    slot = htab->find_slot_with_hash (&tmp, tmp.hash, INSERT);
+		    if (*slot == NULL)
+		      {
+			*slot = data_pool.allocate ();
+			**slot = tmp;
+		      }
+		  }
+	    }
+	}
+      delete htab;
+      XDELETE (bytes);
+    }
+  data_pool.release ();
+}
+
 /* Mark all constants that are used in the current function, then write
    out the function's private constant pool.  */
 
@@ -4251,6 +4426,10 @@ output_constant_pool (const char *fnname
 void
 output_shared_constant_pool (void)
 {
+  if (optimize
+      && TARGET_SUPPORTS_ALIASES)
+    optimize_constant_pool (shared_constant_pool);
+
   output_constant_pool_contents (shared_constant_pool);
 }
 \f


	Jakub


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-27 12:24 ` Jakub Jelinek
  2020-08-27 13:07   ` Richard Biener
@ 2020-08-28 17:18   ` Hongtao Liu
  2020-09-01  9:55   ` Hongtao Liu
  2 siblings, 0 replies; 30+ messages in thread
From: Hongtao Liu @ 2020-08-28 17:18 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

On Thu, Aug 27, 2020 at 8:24 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Thu, Jul 09, 2020 at 04:33:46PM +0800, Hongtao Liu via Gcc-patches wrote:
> > +static void
> > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > +{
> > +  subrtx_ptr_iterator::array_type array;
> > +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > +    {
> > +      rtx *loc = *iter;
> > +      rtx x = *loc;
> > +      rtx broadcast_mem, vec_dup, constant, first;
> > +      machine_mode mode;
> > +      if (GET_CODE (x) != MEM
>
> MEM_P
>
> > +       || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
>
> SYMBOL_REF_P
>
> > +       || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > +     continue;
> > +
> > +      mode = GET_MODE (x);
> > +      if (!VECTOR_MODE_P (mode))
> > +     return;
> > +
> > +      constant = get_pool_constant (XEXP (x, 0));
> > +      first = XVECEXP (constant, 0, 0);
>
> Shouldn't this verify first that GET_CODE (constant) == CONST_VECTOR
> and punt otherwise?
>
> > +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > +      *loc = vec_dup;
> > +      INSN_CODE (insn) = -1;
> > +      /* Revert change if there's no corresponding pattern.  */
> > +      if (recog_memoized (insn) < 0)
> > +             {
> > +               *loc = x;
> > +               recog_memoized (insn);
> > +             }
>
> The usual way of doing this would be through
>   validate_change (insn, loc, vec_dup, 0);
>
> Also, isn't the pass also useful for TARGET_AVX and above (but in that case
> only if it is a simple memory load)?  Or are avx/avx2 broadcast slower than
> full vector loads?
>

Yes, broadcast is insufficient. refer to [1]
[1]. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87767#c5

Since pass_combine won't combine load instruction into memory operand,
i.e. it wouldn't combine *mov mem xmm* and *vaddps xmm, xmm, xmm* into
*vaddps mem, xmm, xmm*
It just left this to ira, broadcast under avx512f isn't excluded.

Maybe patch should be rewritten by using post reload splitter, then
there would be no concern of an extra pass, but still have the latter
issue as you mentioned.

> As Jeff wrote, I wonder if when successfully replacing those pool constants
> the old constant pool entries will be omitted.
>
> Another thing I wonder about is whether more analysis shouldn't be used.
> E.g. if the constant pool entry is already emitted into .rodata anyway
> (e.g. some earlier function needed it), using the broadcast will mean

Yes, some later function may need it either, so we need a global view
to decide the replacement, hope it could be done by generic constant
pool code.

> actually larger .rodata.  If {1to8} and similar is as fast as reading all
> the same elements from memory (or faster), perhaps in that case it should
> broadcast from the first element of the existing constant pool full vector
> rather than creating a new one.
> And similarly, perhaps the function should look at all constant pool entries
> in the current function (not yet emitted into .rodata) and if it would
> succeed for some and not for others, either use broadcast from its first
> element or not perform it for the others too.
>
>         Jakub
>


-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-28 16:25                     ` Jakub Jelinek
@ 2020-08-30  9:24                       ` Jakub Jelinek
  2020-08-31  8:18                         ` Richard Biener
  0 siblings, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2020-08-30  9:24 UTC (permalink / raw)
  To: Richard Biener, richard.sandiford, gcc-patches

On Fri, Aug 28, 2020 at 06:25:46PM +0200, Jakub Jelinek via Gcc-patches wrote:
> You're right, thanks for spotting it, I've missed native_encode_rtx will do
> quick_push rather than safe_push.
> 
> Updated patch below, it shouldn't be needed in the second loop, because
> the first loop should already grow it to the largest size.

Testing beyond a bug in i386.md revealed also that I've lost a cast to long
to avoid breaking 32-bit bootstrap.

This is the version that passed bootstrap/regtest on both x86_64-linux and
i686-linux.  In both bootstraps/regtests together, it saved (from the
statistics I've gathered) 63104 .rodata bytes (before constant merging),
in 6814 hits of the data->desc->mark = ~(*slot)->desc->labelno;.

Ok for trunk?

2020-08-30  Jakub Jelinek  <jakub@redhat.com>

	PR middle-end/54201
	* varasm.c: Include alloc-pool.h.
	(output_constant_pool_contents): Emit desc->mark < 0 entries as
	aliases.
	(struct constant_descriptor_rtx_data): New type.
	(constant_descriptor_rtx_data_cmp): New function.
	(struct const_rtx_data_hasher): New type.
	(const_rtx_data_hasher::hash, const_rtx_data_hasher::equal): New
	methods.
	(optimize_constant_pool): New function.
	(output_shared_constant_pool): Call it if TARGET_SUPPORTS_ALIASES.

--- gcc/varasm.c.jj	2020-07-28 15:39:10.091755086 +0200
+++ gcc/varasm.c	2020-08-28 18:21:58.943759578 +0200
@@ -57,6 +57,7 @@ along with GCC; see the file COPYING3.
 #include "asan.h"
 #include "rtl-iter.h"
 #include "file-prefix-map.h" /* remap_debug_filename()  */
+#include "alloc-pool.h"
 
 #ifdef XCOFF_DEBUGGING_INFO
 #include "xcoffout.h"		/* Needed for external data declarations.  */
@@ -4198,7 +4199,27 @@ output_constant_pool_contents (struct rt
   class constant_descriptor_rtx *desc;
 
   for (desc = pool->first; desc ; desc = desc->next)
-    if (desc->mark)
+    if (desc->mark < 0)
+      {
+#ifdef ASM_OUTPUT_DEF
+	const char *name = targetm.strip_name_encoding (XSTR (desc->sym, 0));
+	char label[256];
+	char buffer[256 + 32];
+	const char *p;
+
+	ASM_GENERATE_INTERNAL_LABEL (label, "LC", ~desc->mark);
+	p = targetm.strip_name_encoding (label);
+	if (desc->offset)
+	  {
+	    sprintf (buffer, "%s+%ld", p, (long) (desc->offset));
+	    p = buffer;
+	  }
+	ASM_OUTPUT_DEF (asm_out_file, name, p);
+#else
+	gcc_unreachable ();
+#endif
+      }
+    else if (desc->mark)
       {
 	/* If the constant is part of an object_block, make sure that
 	   the constant has been positioned within its block, but do not
@@ -4216,6 +4237,160 @@ output_constant_pool_contents (struct rt
       }
 }
 
+struct constant_descriptor_rtx_data {
+  constant_descriptor_rtx *desc;
+  target_unit *bytes;
+  unsigned short size;
+  unsigned short offset;
+  unsigned int hash;
+};
+
+/* qsort callback to sort constant_descriptor_rtx_data * vector by
+   decreasing size.  */
+
+static int
+constant_descriptor_rtx_data_cmp (const void *p1, const void *p2)
+{
+  constant_descriptor_rtx_data *const data1
+    = *(constant_descriptor_rtx_data * const *) p1;
+  constant_descriptor_rtx_data *const data2
+    = *(constant_descriptor_rtx_data * const *) p2;
+  if (data1->size > data2->size)
+    return -1;
+  if (data1->size < data2->size)
+    return 1;
+  if (data1->hash < data2->hash)
+    return -1;
+  gcc_assert (data1->hash > data2->hash);
+  return 1;
+}
+
+struct const_rtx_data_hasher : nofree_ptr_hash<constant_descriptor_rtx_data>
+{
+  static hashval_t hash (constant_descriptor_rtx_data *);
+  static bool equal (constant_descriptor_rtx_data *,
+		     constant_descriptor_rtx_data *);
+};
+
+/* Hash and compare functions for const_rtx_data_htab.  */
+
+hashval_t
+const_rtx_data_hasher::hash (constant_descriptor_rtx_data *data)
+{
+  return data->hash;
+}
+
+bool
+const_rtx_data_hasher::equal (constant_descriptor_rtx_data *x,
+			      constant_descriptor_rtx_data *y)
+{
+  if (x->hash != y->hash || x->size != y->size)
+    return 0;
+  unsigned int align1 = x->desc->align;
+  unsigned int align2 = y->desc->align;
+  unsigned int offset1 = (x->offset * BITS_PER_UNIT) & (align1 - 1);
+  unsigned int offset2 = (y->offset * BITS_PER_UNIT) & (align2 - 1);
+  if (offset1)
+    align1 = least_bit_hwi (offset1);
+  if (offset2)
+    align2 = least_bit_hwi (offset2);
+  if (align2 > align1)
+    return 0;
+  if (memcmp (x->bytes, y->bytes, x->size * sizeof (target_unit)) != 0)
+    return 0;
+  return 1;
+}
+
+/* Attempt to optimize constant pool POOL.  If it contains both CONST_VECTOR
+   constants and scalar constants with the values of CONST_VECTOR elements,
+   try to alias the scalar constants with the CONST_VECTOR elements.  */
+
+static void
+optimize_constant_pool (struct rtx_constant_pool *pool)
+{
+  auto_vec<target_unit, 128> buffer;
+  auto_vec<constant_descriptor_rtx_data *, 128> vec;
+  object_allocator<constant_descriptor_rtx_data>
+    data_pool ("constant_descriptor_rtx_data_pool");
+  int idx = 0;
+  size_t size = 0;
+  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
+    if (desc->mark > 0
+	&& ! (SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
+	      && SYMBOL_REF_BLOCK (desc->sym)))
+      {
+	buffer.truncate (0);
+	buffer.reserve (GET_MODE_SIZE (desc->mode));
+	if (native_encode_rtx (desc->mode, desc->constant, buffer, 0,
+			       GET_MODE_SIZE (desc->mode)))
+	  {
+	    constant_descriptor_rtx_data *data = data_pool.allocate ();
+	    data->desc = desc;
+	    data->bytes = NULL;
+	    data->size = GET_MODE_SIZE (desc->mode);
+	    data->offset = 0;
+	    data->hash = idx++;
+	    size += data->size;
+	    vec.safe_push (data);
+	  }
+      }
+  if (idx)
+    {
+      vec.qsort (constant_descriptor_rtx_data_cmp);
+      unsigned min_size = vec.last ()->size;
+      target_unit *bytes = XNEWVEC (target_unit, size);
+      unsigned int i;
+      constant_descriptor_rtx_data *data;
+      hash_table<const_rtx_data_hasher> * htab
+	= new hash_table<const_rtx_data_hasher> (31);
+      size = 0;
+      FOR_EACH_VEC_ELT (vec, i, data)
+	{
+	  buffer.truncate (0);
+	  native_encode_rtx (data->desc->mode, data->desc->constant,
+			     buffer, 0, data->size);
+	  memcpy (bytes + size, buffer.address (), data->size);
+	  data->bytes = bytes + size;
+	  data->hash = iterative_hash (data->bytes,
+				       data->size * sizeof (target_unit), 0);
+	  size += data->size;
+	  constant_descriptor_rtx_data **slot
+	    = htab->find_slot_with_hash (data, data->hash, INSERT);
+	  if (*slot)
+	    {
+	      data->desc->mark = ~(*slot)->desc->labelno;
+	      data->desc->offset = (*slot)->offset;
+	    }
+	  else
+	    {
+	      unsigned int sz = 1 << floor_log2 (data->size);
+
+	      *slot = data;
+	      for (sz >>= 1; sz >= min_size; sz >>= 1)
+		for (unsigned off = 0; off + sz <= data->size; off += sz)
+		  {
+		    constant_descriptor_rtx_data tmp;
+		    tmp.desc = data->desc;
+		    tmp.bytes = data->bytes + off;
+		    tmp.size = sz;
+		    tmp.offset = off;
+		    tmp.hash = iterative_hash (tmp.bytes,
+					       sz * sizeof (target_unit), 0);
+		    slot = htab->find_slot_with_hash (&tmp, tmp.hash, INSERT);
+		    if (*slot == NULL)
+		      {
+			*slot = data_pool.allocate ();
+			**slot = tmp;
+		      }
+		  }
+	    }
+	}
+      delete htab;
+      XDELETE (bytes);
+    }
+  data_pool.release ();
+}
+
 /* Mark all constants that are used in the current function, then write
    out the function's private constant pool.  */
 
@@ -4251,6 +4426,10 @@ output_constant_pool (const char *fnname
 void
 output_shared_constant_pool (void)
 {
+  if (optimize
+      && TARGET_SUPPORTS_ALIASES)
+    optimize_constant_pool (shared_constant_pool);
+
   output_constant_pool_contents (shared_constant_pool);
 }
 \f


	Jakub


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-30  9:24                       ` Jakub Jelinek
@ 2020-08-31  8:18                         ` Richard Biener
  0 siblings, 0 replies; 30+ messages in thread
From: Richard Biener @ 2020-08-31  8:18 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Richard Sandiford, GCC Patches

On Sun, Aug 30, 2020 at 11:24 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Fri, Aug 28, 2020 at 06:25:46PM +0200, Jakub Jelinek via Gcc-patches wrote:
> > You're right, thanks for spotting it, I've missed native_encode_rtx will do
> > quick_push rather than safe_push.
> >
> > Updated patch below, it shouldn't be needed in the second loop, because
> > the first loop should already grow it to the largest size.
>
> Testing beyond a bug in i386.md revealed also that I've lost a cast to long
> to avoid breaking 32-bit bootstrap.
>
> This is the version that passed bootstrap/regtest on both x86_64-linux and
> i686-linux.  In both bootstraps/regtests together, it saved (from the
> statistics I've gathered) 63104 .rodata bytes (before constant merging),
> in 6814 hits of the data->desc->mark = ~(*slot)->desc->labelno;.
>
> Ok for trunk?

OK.

Thanks,
Richard.

> 2020-08-30  Jakub Jelinek  <jakub@redhat.com>
>
>         PR middle-end/54201
>         * varasm.c: Include alloc-pool.h.
>         (output_constant_pool_contents): Emit desc->mark < 0 entries as
>         aliases.
>         (struct constant_descriptor_rtx_data): New type.
>         (constant_descriptor_rtx_data_cmp): New function.
>         (struct const_rtx_data_hasher): New type.
>         (const_rtx_data_hasher::hash, const_rtx_data_hasher::equal): New
>         methods.
>         (optimize_constant_pool): New function.
>         (output_shared_constant_pool): Call it if TARGET_SUPPORTS_ALIASES.
>
> --- gcc/varasm.c.jj     2020-07-28 15:39:10.091755086 +0200
> +++ gcc/varasm.c        2020-08-28 18:21:58.943759578 +0200
> @@ -57,6 +57,7 @@ along with GCC; see the file COPYING3.
>  #include "asan.h"
>  #include "rtl-iter.h"
>  #include "file-prefix-map.h" /* remap_debug_filename()  */
> +#include "alloc-pool.h"
>
>  #ifdef XCOFF_DEBUGGING_INFO
>  #include "xcoffout.h"          /* Needed for external data declarations.  */
> @@ -4198,7 +4199,27 @@ output_constant_pool_contents (struct rt
>    class constant_descriptor_rtx *desc;
>
>    for (desc = pool->first; desc ; desc = desc->next)
> -    if (desc->mark)
> +    if (desc->mark < 0)
> +      {
> +#ifdef ASM_OUTPUT_DEF
> +       const char *name = targetm.strip_name_encoding (XSTR (desc->sym, 0));
> +       char label[256];
> +       char buffer[256 + 32];
> +       const char *p;
> +
> +       ASM_GENERATE_INTERNAL_LABEL (label, "LC", ~desc->mark);
> +       p = targetm.strip_name_encoding (label);
> +       if (desc->offset)
> +         {
> +           sprintf (buffer, "%s+%ld", p, (long) (desc->offset));
> +           p = buffer;
> +         }
> +       ASM_OUTPUT_DEF (asm_out_file, name, p);
> +#else
> +       gcc_unreachable ();
> +#endif
> +      }
> +    else if (desc->mark)
>        {
>         /* If the constant is part of an object_block, make sure that
>            the constant has been positioned within its block, but do not
> @@ -4216,6 +4237,160 @@ output_constant_pool_contents (struct rt
>        }
>  }
>
> +struct constant_descriptor_rtx_data {
> +  constant_descriptor_rtx *desc;
> +  target_unit *bytes;
> +  unsigned short size;
> +  unsigned short offset;
> +  unsigned int hash;
> +};
> +
> +/* qsort callback to sort constant_descriptor_rtx_data * vector by
> +   decreasing size.  */
> +
> +static int
> +constant_descriptor_rtx_data_cmp (const void *p1, const void *p2)
> +{
> +  constant_descriptor_rtx_data *const data1
> +    = *(constant_descriptor_rtx_data * const *) p1;
> +  constant_descriptor_rtx_data *const data2
> +    = *(constant_descriptor_rtx_data * const *) p2;
> +  if (data1->size > data2->size)
> +    return -1;
> +  if (data1->size < data2->size)
> +    return 1;
> +  if (data1->hash < data2->hash)
> +    return -1;
> +  gcc_assert (data1->hash > data2->hash);
> +  return 1;
> +}
> +
> +struct const_rtx_data_hasher : nofree_ptr_hash<constant_descriptor_rtx_data>
> +{
> +  static hashval_t hash (constant_descriptor_rtx_data *);
> +  static bool equal (constant_descriptor_rtx_data *,
> +                    constant_descriptor_rtx_data *);
> +};
> +
> +/* Hash and compare functions for const_rtx_data_htab.  */
> +
> +hashval_t
> +const_rtx_data_hasher::hash (constant_descriptor_rtx_data *data)
> +{
> +  return data->hash;
> +}
> +
> +bool
> +const_rtx_data_hasher::equal (constant_descriptor_rtx_data *x,
> +                             constant_descriptor_rtx_data *y)
> +{
> +  if (x->hash != y->hash || x->size != y->size)
> +    return 0;
> +  unsigned int align1 = x->desc->align;
> +  unsigned int align2 = y->desc->align;
> +  unsigned int offset1 = (x->offset * BITS_PER_UNIT) & (align1 - 1);
> +  unsigned int offset2 = (y->offset * BITS_PER_UNIT) & (align2 - 1);
> +  if (offset1)
> +    align1 = least_bit_hwi (offset1);
> +  if (offset2)
> +    align2 = least_bit_hwi (offset2);
> +  if (align2 > align1)
> +    return 0;
> +  if (memcmp (x->bytes, y->bytes, x->size * sizeof (target_unit)) != 0)
> +    return 0;
> +  return 1;
> +}
> +
> +/* Attempt to optimize constant pool POOL.  If it contains both CONST_VECTOR
> +   constants and scalar constants with the values of CONST_VECTOR elements,
> +   try to alias the scalar constants with the CONST_VECTOR elements.  */
> +
> +static void
> +optimize_constant_pool (struct rtx_constant_pool *pool)
> +{
> +  auto_vec<target_unit, 128> buffer;
> +  auto_vec<constant_descriptor_rtx_data *, 128> vec;
> +  object_allocator<constant_descriptor_rtx_data>
> +    data_pool ("constant_descriptor_rtx_data_pool");
> +  int idx = 0;
> +  size_t size = 0;
> +  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
> +    if (desc->mark > 0
> +       && ! (SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
> +             && SYMBOL_REF_BLOCK (desc->sym)))
> +      {
> +       buffer.truncate (0);
> +       buffer.reserve (GET_MODE_SIZE (desc->mode));
> +       if (native_encode_rtx (desc->mode, desc->constant, buffer, 0,
> +                              GET_MODE_SIZE (desc->mode)))
> +         {
> +           constant_descriptor_rtx_data *data = data_pool.allocate ();
> +           data->desc = desc;
> +           data->bytes = NULL;
> +           data->size = GET_MODE_SIZE (desc->mode);
> +           data->offset = 0;
> +           data->hash = idx++;
> +           size += data->size;
> +           vec.safe_push (data);
> +         }
> +      }
> +  if (idx)
> +    {
> +      vec.qsort (constant_descriptor_rtx_data_cmp);
> +      unsigned min_size = vec.last ()->size;
> +      target_unit *bytes = XNEWVEC (target_unit, size);
> +      unsigned int i;
> +      constant_descriptor_rtx_data *data;
> +      hash_table<const_rtx_data_hasher> * htab
> +       = new hash_table<const_rtx_data_hasher> (31);
> +      size = 0;
> +      FOR_EACH_VEC_ELT (vec, i, data)
> +       {
> +         buffer.truncate (0);
> +         native_encode_rtx (data->desc->mode, data->desc->constant,
> +                            buffer, 0, data->size);
> +         memcpy (bytes + size, buffer.address (), data->size);
> +         data->bytes = bytes + size;
> +         data->hash = iterative_hash (data->bytes,
> +                                      data->size * sizeof (target_unit), 0);
> +         size += data->size;
> +         constant_descriptor_rtx_data **slot
> +           = htab->find_slot_with_hash (data, data->hash, INSERT);
> +         if (*slot)
> +           {
> +             data->desc->mark = ~(*slot)->desc->labelno;
> +             data->desc->offset = (*slot)->offset;
> +           }
> +         else
> +           {
> +             unsigned int sz = 1 << floor_log2 (data->size);
> +
> +             *slot = data;
> +             for (sz >>= 1; sz >= min_size; sz >>= 1)
> +               for (unsigned off = 0; off + sz <= data->size; off += sz)
> +                 {
> +                   constant_descriptor_rtx_data tmp;
> +                   tmp.desc = data->desc;
> +                   tmp.bytes = data->bytes + off;
> +                   tmp.size = sz;
> +                   tmp.offset = off;
> +                   tmp.hash = iterative_hash (tmp.bytes,
> +                                              sz * sizeof (target_unit), 0);
> +                   slot = htab->find_slot_with_hash (&tmp, tmp.hash, INSERT);
> +                   if (*slot == NULL)
> +                     {
> +                       *slot = data_pool.allocate ();
> +                       **slot = tmp;
> +                     }
> +                 }
> +           }
> +       }
> +      delete htab;
> +      XDELETE (bytes);
> +    }
> +  data_pool.release ();
> +}
> +
>  /* Mark all constants that are used in the current function, then write
>     out the function's private constant pool.  */
>
> @@ -4251,6 +4426,10 @@ output_constant_pool (const char *fnname
>  void
>  output_shared_constant_pool (void)
>  {
> +  if (optimize
> +      && TARGET_SUPPORTS_ALIASES)
> +    optimize_constant_pool (shared_constant_pool);
> +
>    output_constant_pool_contents (shared_constant_pool);
>  }
>
>
>
>         Jakub
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-08-27 12:24 ` Jakub Jelinek
  2020-08-27 13:07   ` Richard Biener
  2020-08-28 17:18   ` Hongtao Liu
@ 2020-09-01  9:55   ` Hongtao Liu
  2020-09-01 10:11     ` Jakub Jelinek
  2 siblings, 1 reply; 30+ messages in thread
From: Hongtao Liu @ 2020-09-01  9:55 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 2056 bytes --]

On Thu, Aug 27, 2020 at 8:24 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Thu, Jul 09, 2020 at 04:33:46PM +0800, Hongtao Liu via Gcc-patches wrote:
> > +static void
> > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > +{
> > +  subrtx_ptr_iterator::array_type array;
> > +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > +    {
> > +      rtx *loc = *iter;
> > +      rtx x = *loc;
> > +      rtx broadcast_mem, vec_dup, constant, first;
> > +      machine_mode mode;
> > +      if (GET_CODE (x) != MEM
>
> MEM_P
>

Changed.

> > +       || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
>
> SYMBOL_REF_P

Changed.

>
> > +       || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > +     continue;
> > +
> > +      mode = GET_MODE (x);
> > +      if (!VECTOR_MODE_P (mode))
> > +     return;
> > +
> > +      constant = get_pool_constant (XEXP (x, 0));
> > +      first = XVECEXP (constant, 0, 0);
>
> Shouldn't this verify first that GET_CODE (constant) == CONST_VECTOR
> and punt otherwise?
>

Changed.

> > +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > +      *loc = vec_dup;
> > +      INSN_CODE (insn) = -1;
> > +      /* Revert change if there's no corresponding pattern.  */
> > +      if (recog_memoized (insn) < 0)
> > +             {
> > +               *loc = x;
> > +               recog_memoized (insn);
> > +             }
>
> The usual way of doing this would be through
>   validate_change (insn, loc, vec_dup, 0);
>

Changed.

> Under what circumstances are we seeing a SEQUENCE in the x86 backend?  I'm
> surprised we need to handle that case.
>

Remove handling of SEQUENCE.

>Is there an existing pass over the RTL chain where this would work so that it's
>more compile-time efficient?
>

I tried define_split, but there's too many of them(considering usage
of define_subst for mask).
Also for new added instructions which support embedded broadcast,
corresponding define_split needs to be added.

Update patch.

--
BR,
Hongtao

[-- Attachment #2: 0001-Optimize-memory-broadcast-for-constant-vector-under-.patch --]
[-- Type: text/x-patch, Size: 37277 bytes --]

From 5bf6ca178da59eb239ae1bf1df429336018d95e0 Mon Sep 17 00:00:00 2001
From: liuhongt <hongtao.liu@intel.com>
Date: Wed, 8 Jul 2020 17:14:36 +0800
Subject: [PATCH] Optimize memory broadcast for constant vector under AVX512.

For constant vector having one duplicated value, there's no need to put
whole vector in the constant pool, using embedded broadcast instead.

2020-07-09  Hongtao Liu  <hongtao.liu@intel.com>

gcc/ChangeLog:

	PR target/87767
	* config/i386/i386-features.c
	(replace_constant_pool_with_broadcast): New function.
	(constant_pool_broadcast): Ditto.
	(class pass_constant_pool_broadcast): New pass.
	(make_pass_constant_pool_broadcast): Ditto.
	* config/i386/i386-passes.def: Insert new pass after combine.
	* config/i386/i386-protos.h
	(make_pass_constant_pool_broadcast): Declare.
	* config/i386/sse.md (*avx512dq_mul<mode>3<mask_name>_bcst,
	*avx512f_mul<mode>3<mask_name>_bcst): New define_insn.
	* config/i386/avx512fintrin.h (_mm512_set1_ps,
	_mm512_set1_pd,_mm512_set1_epi32, _mm512_set1_epi64): Adjusted.

gcc/testsuite/ChangeLog:

	PR target/87767
	* gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-3.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-4.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-5.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-6.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-7.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-2.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-3.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-4.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-5.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-6.c: New test.
---
 gcc/config/i386/avx512fintrin.h               |  27 ++--
 gcc/config/i386/i386-features.c               | 131 ++++++++++++++++++
 gcc/config/i386/i386-passes.def               |   1 +
 gcc/config/i386/i386-protos.h                 |   1 +
 gcc/config/i386/sse.md                        |  24 ++++
 .../i386/avx2-broadcast-pr87767-1.c           |  40 ++++++
 .../i386/avx512f-broadcast-pr87767-1.c        |  30 ++++
 .../i386/avx512f-broadcast-pr87767-2.c        |  42 ++++++
 .../i386/avx512f-broadcast-pr87767-3.c        |  30 ++++
 .../i386/avx512f-broadcast-pr87767-4.c        |  42 ++++++
 .../i386/avx512f-broadcast-pr87767-5.c        |  26 ++++
 .../i386/avx512f-broadcast-pr87767-6.c        |  41 ++++++
 .../i386/avx512f-broadcast-pr87767-7.c        | 121 ++++++++++++++++
 .../i386/avx512vl-broadcast-pr87767-1.c       |  45 ++++++
 .../i386/avx512vl-broadcast-pr87767-2.c       |  59 ++++++++
 .../i386/avx512vl-broadcast-pr87767-3.c       |  37 +++++
 .../i386/avx512vl-broadcast-pr87767-4.c       |  56 ++++++++
 .../i386/avx512vl-broadcast-pr87767-5.c       |  37 +++++
 .../i386/avx512vl-broadcast-pr87767-6.c       |  55 ++++++++
 19 files changed, 827 insertions(+), 18 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-6.c

diff --git a/gcc/config/i386/avx512fintrin.h b/gcc/config/i386/avx512fintrin.h
index 0d53dda3a27..729d5686d68 100644
--- a/gcc/config/i386/avx512fintrin.h
+++ b/gcc/config/i386/avx512fintrin.h
@@ -239,22 +239,17 @@ extern __inline __m512d
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_pd (double __A)
 {
-  return (__m512d) __builtin_ia32_broadcastsd512 (__extension__
-						  (__v2df) { __A, },
-						  (__v8df)
-						  _mm512_undefined_pd (),
-						  (__mmask8) -1);
+  return __extension__ (__m512d)(__v8df)
+    { __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 extern __inline __m512
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_ps (float __A)
 {
-  return (__m512) __builtin_ia32_broadcastss512 (__extension__
-						 (__v4sf) { __A, },
-						 (__v16sf)
-						 _mm512_undefined_ps (),
-						 (__mmask16) -1);
+  return __extension__ (__m512)(__v16sf)
+    { __A, __A, __A, __A, __A, __A, __A, __A,
+      __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 /* Create the vector [A B C D A B C D A B C D A B C D].  */
@@ -4072,10 +4067,9 @@ extern __inline __m512i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_epi32 (int __A)
 {
-  return (__m512i) __builtin_ia32_pbroadcastd512_gpr_mask (__A,
-							   (__v16si)
-							   _mm512_undefined_epi32 (),
-							   (__mmask16)(-1));
+  return (__m512i)(__v16si)
+    { __A, __A, __A, __A, __A, __A, __A, __A,
+      __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 extern __inline __m512i
@@ -4128,10 +4122,7 @@ extern __inline __m512i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_epi64 (long long __A)
 {
-  return (__m512i) __builtin_ia32_pbroadcastq512_gpr_mask (__A,
-							   (__v8di)
-							   _mm512_undefined_epi32 (),
-							   (__mmask8)(-1));
+  return (__m512i)(__v8di) { __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 extern __inline __m512i
diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 535fc7e981d..27cccd8d388 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -2379,6 +2379,137 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
   return new pass_remove_partial_avx_dependency (ctxt);
 }
 
+/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
+   with embedded broadcast. i.e.transform
+
+     vpaddq .LC0(%rip), %zmm0, %zmm0
+     ret
+  .LC0:
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+
+    to
+
+     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0
+     ret
+  .LC0:
+    .quad 3  */
+static void
+replace_constant_pool_with_broadcast (rtx_insn* insn)
+{
+  subrtx_ptr_iterator::array_type array;
+  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
+    {
+      rtx *loc = *iter;
+      rtx x = *loc;
+      rtx broadcast_mem, vec_dup, constant, first;
+      machine_mode mode;
+
+      /* Constant pool.  */
+      if (!MEM_P (x)
+	  || !SYMBOL_REF_P (XEXP (x, 0))
+	  || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
+	continue;
+
+      /* Const vector.  */
+      mode = GET_MODE (x);
+      if (!VECTOR_MODE_P (mode))
+	return;
+      constant = get_pool_constant (XEXP (x, 0));
+      if (GET_CODE (constant) != CONST_VECTOR)
+	return;
+
+      first = XVECEXP (constant, 0, 0);
+      /* There could be some rtx like
+	 (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
+	 but with "*.LC1" refer to V2DI constant vector.  */
+      if (GET_MODE (constant) != mode)
+	return;
+
+      for (int i = 1; i < GET_MODE_NUNITS (mode); ++i)
+	{
+	  rtx tmp = XVECEXP (constant, 0, i);
+	  /* Vector duplicate value.  */
+	  if (!rtx_equal_p (tmp, first))
+	    return;
+	}
+
+      /* Replace with embedded broadcast.  */
+      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
+      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
+      validate_change (insn, loc, vec_dup, 0);
+
+      /* At most 1 memory_operand in an insn.  */
+      return;
+    }
+}
+
+/* For const vector having one duplicated value, there's no need to put
+   whole vector in the constant pool when target supports embedded broadcast. */
+static unsigned int
+constant_pool_broadcast (void)
+{
+  timevar_push (TV_MACH_DEP);
+  rtx_insn *insn;
+
+  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
+    {
+      if (!INSN_P (insn))
+	continue;
+      replace_constant_pool_with_broadcast (insn);
+    }
+  timevar_pop (TV_MACH_DEP);
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_constant_pool_broadcast =
+{
+  RTL_PASS, /* type */
+  "cpb", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_MACH_DEP, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_constant_pool_broadcast : public rtl_opt_pass
+{
+public:
+  pass_constant_pool_broadcast (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return TARGET_AVX512F;
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      return constant_pool_broadcast ();
+    }
+}; // class pass_cpb
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_constant_pool_broadcast (gcc::context *ctxt)
+{
+  return new pass_constant_pool_broadcast (ctxt);
+}
+
 /* This compares the priority of target features in function DECL1
    and DECL2.  It returns positive value if DECL1 is higher priority,
    negative value if DECL2 is higher priority and 0 if they are the
diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
index d83c7b956b1..07ecf8e790f 100644
--- a/gcc/config/i386/i386-passes.def
+++ b/gcc/config/i386/i386-passes.def
@@ -33,3 +33,4 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
 
   INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
+  INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast);
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index b6088f22d55..c5b700efd0e 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -386,3 +386,4 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area
   (gcc::context *);
 extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
   (gcc::context *);
+extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *);
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 8250325e1a3..a728b979f01 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -12120,6 +12120,18 @@ (define_insn "avx512dq_mul<mode>3<mask_name>"
    (set_attr "prefix" "evex")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn "*avx512dq_mul<mode>3<mask_name>_bcst"
+  [(set (match_operand:VI8_AVX512VL 0 "register_operand" "=v")
+	(mult:VI8_AVX512VL
+	  (vec_duplicate:VI8_AVX512VL
+	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
+	  (match_operand:VI8_AVX512VL 2 "register_operand" "v")))]
+  "TARGET_AVX512DQ"
+  "vpmullq\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
+  [(set_attr "type" "sseimul")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "<sseinsnmode>")])
+
 (define_expand "mul<mode>3<mask_name>"
   [(set (match_operand:VI4_AVX512F 0 "register_operand")
 	(mult:VI4_AVX512F
@@ -12160,6 +12172,18 @@ (define_insn "*<sse4_1_avx2>_mul<mode>3<mask_name>"
    (set_attr "btver2_decode" "vector,vector,vector")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn "*avx512f_mul<mode>3<mask_name>_bcst"
+  [(set (match_operand:VI4_AVX512VL 0 "register_operand" "=v")
+	(mult:VI4_AVX512VL
+	  (vec_duplicate:VI4_AVX512VL
+	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
+	  (match_operand:VI4_AVX512VL 2 "register_operand" "v")))]
+  "TARGET_AVX512F"
+   "vpmulld\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
+  [(set_attr "type" "sseimul")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "<sseinsnmode>")])
+
 (define_expand "mul<mode>3"
   [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand")
 	(mult:VI8_AVX2_AVX512F
diff --git a/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..aee1680ed98
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
@@ -0,0 +1,40 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx2 -mno-avx512f" } */
+/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
+/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP 101;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..a8ee5f5faf1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
@@ -0,0 +1,30 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512dq" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 5 } }  */
+
+typedef int v16si  __attribute__ ((vector_size (64)));
+typedef long long v8di  __attribute__ ((vector_size (64)));
+typedef float v16sf  __attribute__ ((vector_size (64)));
+typedef double v8df  __attribute__ ((vector_size (64)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP CONSTANT;				\
+}						\
+
+FOO (v16si, add, +);
+FOO (v8di, add, +);
+FOO (v16sf, add, +);
+FOO (v8df, add, +);
+FOO (v16si, sub, -);
+FOO (v8di, sub, -);
+FOO (v16si, mul, *);
+FOO (v8di, mul, *);
+FOO (v16sf, mul, *);
+FOO (v8df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
new file mode 100644
index 00000000000..30cf5809c3d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
@@ -0,0 +1,42 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512dq" } */
+/* { dg-require-effective-target avx512dq } */
+
+#define AVX512DQ
+#include "avx512f-helper.h"
+
+#include "avx512f-broadcast-pr87767-1.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP)		\
+  do							\
+    {							\
+      TYPE exp[N], src[N];				\
+      VTYPE res;					\
+      for (int i = 0; i < N; i++)			\
+	src[i] = i * i * 107;				\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);	\
+      for (int i = 0; i < N; i ++)			\
+	exp[i] = src[i] OP CONSTANT;			\
+      for (int j = 0; j < N; j++)			\
+	{						\
+	  if (res[j] != exp[j])				\
+	    abort();					\
+	}						\
+    }							\
+  while (0)
+
+void
+test_512 (void)
+{
+  RTEST (v16si, int, 16, add, +);
+  RTEST (v8di, long long, 8, add, +);
+  RTEST (v16sf, float, 16, add, +);
+  RTEST (v8df, double, 8, add, +);
+  RTEST (v16si, int, 16, sub, -);
+  RTEST (v8di, long long, 8, sub, -);
+  RTEST (v16si, int, 16, mul, *);
+  RTEST (v8di, long long, 8, mul, *);
+  RTEST (v16sf, float, 16, mul, *);
+  RTEST (v8df, double, 8, mul, *);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c
new file mode 100644
index 00000000000..c2f22c4ee5a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c
@@ -0,0 +1,30 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 4 } }  */
+
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef float v16sf  __attribute__ ((vector_size (64)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+typedef double v8df  __attribute__ ((vector_size (64)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP1, OP2)		\
+VTYPE						\
+ __attribute__ ((noipa))			\
+ foo_##OP_NAME##_##VTYPE (VTYPE a, VTYPE b)	\
+{						\
+  return (OP1 a * b) OP2 CONSTANT;		\
+}						\
+
+FOO (v16sf, fma,, +);
+FOO (v8df, fma,, +);
+FOO (v16sf, fms,, -);
+FOO (v8df, fms,, -);
+FOO (v16sf, fnma, -, +);
+FOO (v8df, fnma, -, +);
+FOO (v16sf, fnms, -, -);
+FOO (v8df, fnms, -, -);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c
new file mode 100644
index 00000000000..dabe91b11c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c
@@ -0,0 +1,42 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-require-effective-target avx512f } */
+
+#include "avx512f-helper.h"
+
+#include "avx512f-broadcast-pr87767-3.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP1, OP2)				\
+  do									\
+    {									\
+      TYPE exp[N], src1[N], src2[N];					\
+      VTYPE res;							\
+      for (int i = 0; i < N; i++)					\
+	{								\
+	  src1[i] = i * i * 107.2f;					\
+	  src2[i] = i * 2.f - 404.f;					\
+	}								\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src1[0], *(VTYPE*)&src2[0]); \
+      for (int i = 0; i < N; i ++)					\
+	exp[i] = (OP1 src1[i] * src2[i]) OP2 CONSTANT;			\
+      for (int j = 0; j < N; j++)					\
+	{								\
+	  if (res[j] != exp[j])						\
+	    abort();							\
+	}								\
+    }									\
+  while (0)
+
+void
+test_512 (void)
+{
+  RTEST (v16sf, float, 16, fma,, +);
+  RTEST (v8df, double, 8, fma,, +);
+  RTEST (v16sf, float, 16, fms,, -);
+  RTEST (v8df, double, 8, fms,, -);
+  RTEST (v16sf, float, 16, fnma,-, +);
+  RTEST (v8df, double, 8, fnma,-, +);
+  RTEST (v16sf, float, 16, fnms,-, -);
+  RTEST (v8df, double, 8, fnms,-, -);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
new file mode 100644
index 00000000000..72e1098ccbe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
@@ -0,0 +1,26 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-final { scan-assembler-times "\[^n\n\]*\\\{1to8\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^n\n\]*\\\{1to16\\\}" 4 } }  */
+
+typedef int v16si  __attribute__ ((vector_size (64)));
+typedef long long v8di  __attribute__ ((vector_size (64)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP1, OP2)		\
+VTYPE						\
+ __attribute__ ((noipa))			\
+ foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return (OP1 a) OP2 CONSTANT;			\
+}						\
+
+FOO (v16si, andnot, ~, &);
+FOO (v8di, andnot, ~, &);
+FOO (v16si, and,, &);
+FOO (v8di, and,, &);
+FOO (v16si, or,, |);
+FOO (v8di, or,, |);
+FOO (v16si, xor,, ^);
+FOO (v8di, xor,, ^);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c
new file mode 100644
index 00000000000..f288f83158c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c
@@ -0,0 +1,41 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-require-effective-target avx512f} */
+
+#include "avx512f-helper.h"
+
+#include "avx512f-broadcast-pr87767-5.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP1, OP2)				\
+  do									\
+    {									\
+      TYPE exp[N], src[N];						\
+      VTYPE res;							\
+      for (int i = 0; i < N; i++)					\
+	{								\
+	  src[i] = i * i * 107;						\
+	}								\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);			\
+      for (int i = 0; i < N; i ++)					\
+	exp[i] = (OP1 src[i]) OP2 CONSTANT;				\
+      for (int j = 0; j < N; j++)					\
+	{								\
+	  if (res[j] != exp[j])						\
+	    abort();							\
+	}								\
+    }									\
+  while (0)
+
+void
+test_512 (void)
+{
+  RTEST (v16si, int, 16, andnot, ~, &);
+  RTEST (v8di, long long, 8, andnot, ~, &);
+  RTEST (v16si, int, 16, and,, &);
+  RTEST (v8di, long long, 8, and,, &);
+  RTEST (v16si, int, 16, or,, |);
+  RTEST (v8di, long long, 8, or,, |);
+  RTEST (v16si, int, 16, xor,, ^);
+  RTEST (v8di, long long, 8, xor,, ^);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c
new file mode 100644
index 00000000000..a8f145d8589
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c
@@ -0,0 +1,121 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to16\\\}" 1 } }  */
+
+#include<immintrin.h>
+
+#define CONSTANT 101
+
+#define FOO(VTYPE, OP_NAME, LEN, SUFFIX, MTYPE)			\
+  VTYPE									\
+  __attribute__ ((noipa))						\
+  _mm##LEN##_foo_##OP_NAME##_##SUFFIX (VTYPE dst, VTYPE src, MTYPE m)	\
+  {									\
+    return  _mm##LEN##_mask_##OP_NAME##_##SUFFIX (dst, m, src,		\
+						  _mm##LEN##_set1_##SUFFIX (CONSTANT)); \
+  }									\
+
+#define FOOZ(VTYPE, OP_NAME, LEN, SUFFIX, MTYPE)			\
+  VTYPE									\
+  __attribute__ ((noipa))						\
+  _mm##LEN##_fooz_##OP_NAME##_##SUFFIX (VTYPE dst, VTYPE src, MTYPE m)	\
+  {									\
+    return  _mm##LEN##_maskz_##OP_NAME##_##SUFFIX (m, dst, src,		\
+						  _mm##LEN##_set1_##SUFFIX (CONSTANT)); \
+  }									\
+
+FOO (__m512, add, 512, ps, __mmask16);
+FOO (__m256, add, 256, ps, __mmask8);
+FOO (__m128, add,, ps, __mmask8);
+
+FOO (__m512, sub, 512, ps, __mmask16);
+FOO (__m256, sub, 256, ps, __mmask8);
+FOO (__m128, sub,, ps, __mmask8);
+
+FOO (__m512, mul, 512, ps, __mmask16);
+FOO (__m256, mul, 256, ps, __mmask8);
+FOO (__m128, mul,, ps, __mmask8);
+
+FOO (__m512, div, 512, ps, __mmask16);
+FOO (__m256, div, 256, ps, __mmask8);
+FOO (__m128, div,, ps, __mmask8);
+
+FOOZ (__m512, fmadd, 512, ps, __mmask16);
+FOOZ (__m256, fmadd, 256, ps, __mmask8);
+FOOZ (__m128, fmadd,, ps, __mmask8);
+
+FOOZ (__m512, fmsub, 512, ps, __mmask16);
+FOOZ (__m256, fmsub, 256, ps, __mmask8);
+FOOZ (__m128, fmsub,, ps, __mmask8);
+
+FOOZ (__m512, fnmadd, 512, ps, __mmask16);
+FOOZ (__m256, fnmadd, 256, ps, __mmask8);
+FOOZ (__m128, fnmadd,, ps, __mmask8);
+
+FOOZ (__m512, fnmsub, 512, ps, __mmask16);
+FOOZ (__m256, fnmsub, 256, ps, __mmask8);
+FOOZ (__m128, fnmsub,, ps, __mmask8);
+
+FOO (__m512d, add, 512, pd, __mmask8);
+FOO (__m256d, add, 256, pd, __mmask8);
+FOO (__m128d, add,, pd, __mmask8);
+
+FOO (__m512d, sub, 512, pd, __mmask8);
+FOO (__m256d, sub, 256, pd, __mmask8);
+FOO (__m128d, sub,, pd, __mmask8);
+
+FOO (__m512d, mul, 512, pd, __mmask8);
+FOO (__m256d, mul, 256, pd, __mmask8);
+FOO (__m128d, mul,, pd, __mmask8);
+
+FOO (__m512d, div, 512, pd, __mmask8);
+FOO (__m256d, div, 256, pd, __mmask8);
+FOO (__m128d, div,, pd, __mmask8);
+
+FOOZ (__m512d, fmadd, 512, pd, __mmask8);
+FOOZ (__m256d, fmadd, 256, pd, __mmask8);
+FOOZ (__m128d, fmadd,, pd, __mmask8);
+
+FOOZ (__m512d, fmsub, 512, pd, __mmask8);
+FOOZ (__m256d, fmsub, 256, pd, __mmask8);
+FOOZ (__m128d, fmsub,, pd, __mmask8);
+
+FOOZ (__m512d, fnmadd, 512, pd, __mmask8);
+FOOZ (__m256d, fnmadd, 256, pd, __mmask8);
+FOOZ (__m128d, fnmadd,, pd, __mmask8);
+
+FOOZ (__m512d, fnmsub, 512, pd, __mmask8);
+FOOZ (__m256d, fnmsub, 256, pd, __mmask8);
+FOOZ (__m128d, fnmsub,, pd, __mmask8);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..397e287134c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
@@ -0,0 +1,45 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 5 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 10 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP CONSTANT;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v4si, sub, -);
+FOO (v8si, sub, -);
+FOO (v2di, sub, -);
+FOO (v4di, sub, -);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-2.c
new file mode 100644
index 00000000000..9b796ac7124
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-2.c
@@ -0,0 +1,59 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512dq -mavx512vl" } */
+/* { dg-require-effective-target avx512dq } */
+/* { dg-require-effective-target avx512vl } */
+
+#define AVX512DQ
+#define AVX512VL
+#include "avx512f-helper.h"
+
+#include "avx512vl-broadcast-pr87767-1.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP)		\
+  do							\
+    {							\
+      TYPE exp[N], src[N];				\
+      VTYPE res;					\
+      for (int i = 0; i < N; i++)			\
+	src[i] = i * i * 107;				\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);	\
+      for (int i = 0; i < N; i ++)			\
+	exp[i] = src[i] OP CONSTANT;			\
+      for (int j = 0; j < N; j++)			\
+	{						\
+	  if (res[j] != exp[j])				\
+	    abort();					\
+	}						\
+    }							\
+  while (0)
+
+void
+test_256 (void)
+{
+  RTEST (v8si, int, 8, add, +);
+  RTEST (v4di, long long, 4, add, +);
+  RTEST (v8sf, float, 8, add, +);
+  RTEST (v4df, double, 4, add, +);
+  RTEST (v8si, int, 8, sub, -);
+  RTEST (v4di, long long, 4, sub, -);
+  RTEST (v8si, int, 8, mul, *);
+  RTEST (v4di, long long, 4, mul, *);
+  RTEST (v8sf, float, 8, mul, *);
+  RTEST (v4df, double, 4, mul, *);
+}
+
+void
+test_128 (void)
+{
+  RTEST (v4si, int, 4, add, +);
+  RTEST (v2di, long long, 2, add, +);
+  RTEST (v4sf, float, 4, add, +);
+  RTEST (v2df, double, 2, add, +);
+  RTEST (v4si, int, 4, sub, -);
+  RTEST (v2di, long long, 2, sub, -);
+  RTEST (v4si, int, 4, mul, *);
+  RTEST (v2di, long long, 2, mul, *);
+  RTEST (v4sf, float, 4, mul, *);
+  RTEST (v2df, double, 2, mul, *);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-3.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-3.c
new file mode 100644
index 00000000000..aedfb16016e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-3.c
@@ -0,0 +1,37 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 8 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 } }  */
+
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP1, OP2)		\
+VTYPE						\
+ __attribute__ ((noipa))			\
+ foo_##OP_NAME##_##VTYPE (VTYPE a, VTYPE b)	\
+{						\
+  return (OP1 a * b) OP2 CONSTANT;		\
+}						\
+
+FOO (v4sf, fma,, +);
+FOO (v8sf, fma,, +);
+FOO (v2df, fma,, +);
+FOO (v4df, fma,, +);
+FOO (v4sf, fms,, -);
+FOO (v8sf, fms,, -);
+FOO (v2df, fms,, -);
+FOO (v4df, fms,, -);
+FOO (v4sf, fnma, -, +);
+FOO (v8sf, fnma, -, +);
+FOO (v2df, fnma, -, +);
+FOO (v4df, fnma, -, +);
+FOO (v4sf, fnms, -, -);
+FOO (v8sf, fnms, -, -);
+FOO (v2df, fnms, -, -);
+FOO (v4df, fnms, -, -);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-4.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-4.c
new file mode 100644
index 00000000000..40b8eb9929d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-4.c
@@ -0,0 +1,56 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-require-effective-target avx512vl } */
+
+#define AVX512VL
+#include "avx512f-helper.h"
+
+#include "avx512vl-broadcast-pr87767-3.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP1, OP2)				\
+  do									\
+    {									\
+      TYPE exp[N], src1[N], src2[N];					\
+      VTYPE res;							\
+      for (int i = 0; i < N; i++)					\
+	{								\
+	  src1[i] = i * i * 107.2f;					\
+	  src2[i] = i * 2.f - 404.f;					\
+	}								\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src1[0], *(VTYPE*)&src2[0]); \
+      for (int i = 0; i < N; i ++)					\
+	exp[i] = (OP1 src1[i] * src2[i]) OP2 CONSTANT;			\
+      for (int j = 0; j < N; j++)					\
+	{								\
+	  if (res[j] != exp[j])						\
+	    abort();							\
+	}								\
+    }									\
+  while (0)
+
+void
+test_256 (void)
+{
+  RTEST (v8sf, float, 8, fma,, +);
+  RTEST (v4df, double, 4, fma,, +);
+  RTEST (v8sf, float, 8, fms,, -);
+  RTEST (v4df, double, 4, fms,, -);
+  RTEST (v8sf, float, 8, fnma,-, +);
+  RTEST (v4df, double, 4, fnma,-, +);
+  RTEST (v8sf, float, 8, fnms,-, -);
+  RTEST (v4df, double, 4, fnms,-, -);
+}
+
+void
+test_128 (void)
+{
+  RTEST (v4sf, float, 4, fma,, +);
+  RTEST (v2df, double, 2, fma,, +);
+  RTEST (v4sf, float, 4, fms,, -);
+  RTEST (v2df, double, 2, fms,, -);
+  RTEST (v4sf, float, 4, fnma,-, +);
+  RTEST (v2df, double, 2, fnma,-, +);
+  RTEST (v4sf, float, 4, fnms,-, -);
+  RTEST (v2df, double, 2, fnms,-, -);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
new file mode 100644
index 00000000000..1e9460faa9d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
@@ -0,0 +1,37 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 8 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP1, OP2)		\
+VTYPE						\
+ __attribute__ ((noipa))			\
+ foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return (OP1 a) OP2 CONSTANT;			\
+}						\
+
+FOO (v4si, andnot, ~, &);
+FOO (v8si, andnot, ~, &);
+FOO (v2di, andnot, ~, &);
+FOO (v4di, andnot, ~, &);
+FOO (v4si, and,, &);
+FOO (v8si, and,, &);
+FOO (v2di, and,, &);
+FOO (v4di, and,, &);
+FOO (v4si, or,, |);
+FOO (v8si, or,, |);
+FOO (v2di, or,, |);
+FOO (v4di, or,, |);
+FOO (v4si, xor,, ^);
+FOO (v8si, xor,, ^);
+FOO (v2di, xor,, ^);
+FOO (v4di, xor,, ^);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-6.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-6.c
new file mode 100644
index 00000000000..493a76f0917
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-6.c
@@ -0,0 +1,55 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-require-effective-target avx512vl } */
+
+#define AVX512VL
+#include "avx512f-helper.h"
+
+#include "avx512vl-broadcast-pr87767-5.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP1, OP2)				\
+  do									\
+    {									\
+      TYPE exp[N], src[N];						\
+      VTYPE res;							\
+      for (int i = 0; i < N; i++)					\
+	{								\
+	  src[i] = i * i * 107;						\
+	}								\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);			\
+      for (int i = 0; i < N; i ++)					\
+	exp[i] = (OP1 src[i]) OP2 CONSTANT;				\
+      for (int j = 0; j < N; j++)					\
+	{								\
+	  if (res[j] != exp[j])						\
+	    abort();							\
+	}								\
+    }									\
+  while (0)
+
+void
+test_256 (void)
+{
+  RTEST (v8si, int, 8, andnot, ~, &);
+  RTEST (v4di, long long, 4, andnot, ~, &);
+  RTEST (v8si, int, 8, and,, &);
+  RTEST (v4di, long long, 4, and,, &);
+  RTEST (v8si, int, 8, or,, |);
+  RTEST (v4di, long long, 4, or,, |);
+  RTEST (v8si, int, 8, xor,, ^);
+  RTEST (v4di, long long, 4, xor,, ^);
+}
+
+void
+test_128 (void)
+{
+  RTEST (v4si, int, 4, andnot, ~, &);
+  RTEST (v2di, long long, 2, andnot, ~, &);
+  RTEST (v4si, int, 4, and,, &);
+  RTEST (v2di, long long, 2, and,, &);
+  RTEST (v4si, int, 4, or,, |);
+  RTEST (v2di, long long, 2, or,, |);
+  RTEST (v4si, int, 4, xor,, ^);
+  RTEST (v2di, long long, 2, xor,, ^);
+}
-- 
2.18.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-09-01  9:55   ` Hongtao Liu
@ 2020-09-01 10:11     ` Jakub Jelinek
  2020-09-02  1:57       ` Hongtao Liu
  0 siblings, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2020-09-01 10:11 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: GCC Patches

On Tue, Sep 01, 2020 at 05:55:18PM +0800, Hongtao Liu wrote:
> I tried define_split, but there's too many of them(considering usage
> of define_subst for mask).
> Also for new added instructions which support embedded broadcast,
> corresponding define_split needs to be added.

One pass that could (sometimes) handle it is the rpad pass which is also
added right after the combiner like your pass.
So, couldn't you call replace_constant_pool_with_broadcast from both
your pass body loop and remove_partial_avx_dependency, the latter guarded
with if (TARGET_AVX512F), and change the new pass gate to be false when
the rpad pass gate is true?

> +static void
> +replace_constant_pool_with_broadcast (rtx_insn* insn)

Formatting, * should be after space, not before it.

> +      first = XVECEXP (constant, 0, 0);
> +      /* There could be some rtx like
> +	 (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> +	 but with "*.LC1" refer to V2DI constant vector.  */
> +      if (GET_MODE (constant) != mode)
> +	return;

Is there a reason why don't you want to handle that?
You could do instead:
      if (GET_MODE (constant) != mode)
	{
	  constant = simplify_subreg (mode, constant, GET_MODE (constant), 0);
	  if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR)
	    return;
	}
      first = XVECEXP (constant, 0, 0);

Other than that LGTM.

	Jakub


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-09-01 10:11     ` Jakub Jelinek
@ 2020-09-02  1:57       ` Hongtao Liu
  2020-09-02  9:58         ` Jakub Jelinek
  0 siblings, 1 reply; 30+ messages in thread
From: Hongtao Liu @ 2020-09-02  1:57 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1742 bytes --]

On Tue, Sep 1, 2020 at 6:11 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Tue, Sep 01, 2020 at 05:55:18PM +0800, Hongtao Liu wrote:
> > I tried define_split, but there's too many of them(considering usage
> > of define_subst for mask).
> > Also for new added instructions which support embedded broadcast,
> > corresponding define_split needs to be added.
>
> One pass that could (sometimes) handle it is the rpad pass which is also
> added right after the combiner like your pass.
> So, couldn't you call replace_constant_pool_with_broadcast from both
> your pass body loop and remove_partial_avx_dependency, the latter guarded
> with if (TARGET_AVX512F), and change the new pass gate to be false when
> the rpad pass gate is true?
>

Yes, it could save compile time when both this pass and rpad pass are available.
Changed.

> > +static void
> > +replace_constant_pool_with_broadcast (rtx_insn* insn)
>
> Formatting, * should be after space, not before it.
>

Changed.

> > +      first = XVECEXP (constant, 0, 0);
> > +      /* There could be some rtx like
> > +      (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> > +      but with "*.LC1" refer to V2DI constant vector.  */
> > +      if (GET_MODE (constant) != mode)
> > +     return;
>
> Is there a reason why don't you want to handle that?

just avoid error caused by different element number.

> You could do instead:
>       if (GET_MODE (constant) != mode)
>         {
>           constant = simplify_subreg (mode, constant, GET_MODE (constant), 0);
>           if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR)
>             return;
>         }
>       first = XVECEXP (constant, 0, 0);
>

Changed.

> Other than that LGTM.
>
>         Jakub
>


-- 
BR,
Hongtao

[-- Attachment #2: 0001-Optimize-memory-broadcast-for-constant-vector-under-_V4.patch --]
[-- Type: text/x-patch, Size: 38491 bytes --]

From e438639545c7bad5bb17b20a16efcc1871166093 Mon Sep 17 00:00:00 2001
From: liuhongt <hongtao.liu@intel.com>
Date: Wed, 8 Jul 2020 17:14:36 +0800
Subject: [PATCH] Optimize memory broadcast for constant vector under AVX512.

For constant vector having one duplicated value, there's no need to put
whole vector in the constant pool, using embedded broadcast instead.

2020-07-09  Hongtao Liu  <hongtao.liu@intel.com>

gcc/ChangeLog:

	PR target/87767
	* config/i386/i386-features.c
	(replace_constant_pool_with_broadcast): New function.
	(constant_pool_broadcast): Ditto.
	(class pass_constant_pool_broadcast): New pass.
	(make_pass_constant_pool_broadcast): Ditto.
	(remove_partial_avx_dependency): Call
	replace_constant_pool_with_broadcast under TARGET_AVX512F, it
	would save compile time when both pass rpad and cpb are
	available.
	* config/i386/i386-passes.def: Insert new pass after combine.
	* config/i386/i386-protos.h
	(make_pass_constant_pool_broadcast): Declare.
	* config/i386/sse.md (*avx512dq_mul<mode>3<mask_name>_bcst,
	*avx512f_mul<mode>3<mask_name>_bcst): New define_insn.
	* config/i386/avx512fintrin.h (_mm512_set1_ps,
	_mm512_set1_pd,_mm512_set1_epi32, _mm512_set1_epi64): Adjusted.

gcc/testsuite/ChangeLog:

	PR target/87767
	* gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-3.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-4.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-5.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-6.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-7.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-2.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-3.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-4.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-5.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-6.c: New test.
---
 gcc/config/i386/avx512fintrin.h               |  27 ++--
 gcc/config/i386/i386-features.c               | 147 ++++++++++++++++++
 gcc/config/i386/i386-passes.def               |   1 +
 gcc/config/i386/i386-protos.h                 |   1 +
 gcc/config/i386/sse.md                        |  24 +++
 .../i386/avx2-broadcast-pr87767-1.c           |  40 +++++
 .../i386/avx512f-broadcast-pr87767-1.c        |  30 ++++
 .../i386/avx512f-broadcast-pr87767-2.c        |  42 +++++
 .../i386/avx512f-broadcast-pr87767-3.c        |  30 ++++
 .../i386/avx512f-broadcast-pr87767-4.c        |  42 +++++
 .../i386/avx512f-broadcast-pr87767-5.c        |  26 ++++
 .../i386/avx512f-broadcast-pr87767-6.c        |  41 +++++
 .../i386/avx512f-broadcast-pr87767-7.c        | 121 ++++++++++++++
 .../i386/avx512vl-broadcast-pr87767-1.c       |  45 ++++++
 .../i386/avx512vl-broadcast-pr87767-2.c       |  59 +++++++
 .../i386/avx512vl-broadcast-pr87767-3.c       |  37 +++++
 .../i386/avx512vl-broadcast-pr87767-4.c       |  56 +++++++
 .../i386/avx512vl-broadcast-pr87767-5.c       |  37 +++++
 .../i386/avx512vl-broadcast-pr87767-6.c       |  55 +++++++
 19 files changed, 843 insertions(+), 18 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-6.c

diff --git a/gcc/config/i386/avx512fintrin.h b/gcc/config/i386/avx512fintrin.h
index 0d53dda3a27..729d5686d68 100644
--- a/gcc/config/i386/avx512fintrin.h
+++ b/gcc/config/i386/avx512fintrin.h
@@ -239,22 +239,17 @@ extern __inline __m512d
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_pd (double __A)
 {
-  return (__m512d) __builtin_ia32_broadcastsd512 (__extension__
-						  (__v2df) { __A, },
-						  (__v8df)
-						  _mm512_undefined_pd (),
-						  (__mmask8) -1);
+  return __extension__ (__m512d)(__v8df)
+    { __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 extern __inline __m512
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_ps (float __A)
 {
-  return (__m512) __builtin_ia32_broadcastss512 (__extension__
-						 (__v4sf) { __A, },
-						 (__v16sf)
-						 _mm512_undefined_ps (),
-						 (__mmask16) -1);
+  return __extension__ (__m512)(__v16sf)
+    { __A, __A, __A, __A, __A, __A, __A, __A,
+      __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 /* Create the vector [A B C D A B C D A B C D A B C D].  */
@@ -4072,10 +4067,9 @@ extern __inline __m512i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_epi32 (int __A)
 {
-  return (__m512i) __builtin_ia32_pbroadcastd512_gpr_mask (__A,
-							   (__v16si)
-							   _mm512_undefined_epi32 (),
-							   (__mmask16)(-1));
+  return (__m512i)(__v16si)
+    { __A, __A, __A, __A, __A, __A, __A, __A,
+      __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 extern __inline __m512i
@@ -4128,10 +4122,7 @@ extern __inline __m512i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_epi64 (long long __A)
 {
-  return (__m512i) __builtin_ia32_pbroadcastq512_gpr_mask (__A,
-							   (__v8di)
-							   _mm512_undefined_epi32 (),
-							   (__mmask8)(-1));
+  return (__m512i)(__v8di) { __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 extern __inline __m512i
diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 535fc7e981d..d36b8fea650 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -2162,6 +2162,81 @@ make_pass_insert_endbr_and_patchable_area (gcc::context *ctxt)
   return new pass_insert_endbr_and_patchable_area (ctxt);
 }
 
+/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
+   with embedded broadcast. i.e.transform
+
+     vpaddq .LC0(%rip), %zmm0, %zmm0
+     ret
+  .LC0:
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+
+    to
+
+     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0
+     ret
+  .LC0:
+    .quad 3  */
+static void
+replace_constant_pool_with_broadcast (rtx_insn *insn)
+{
+  subrtx_ptr_iterator::array_type array;
+  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
+    {
+      rtx *loc = *iter;
+      rtx x = *loc;
+      rtx broadcast_mem, vec_dup, constant, first;
+      machine_mode mode;
+
+      /* Constant pool.  */
+      if (!MEM_P (x)
+	  || !SYMBOL_REF_P (XEXP (x, 0))
+	  || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
+	continue;
+
+      /* Const vector.  */
+      mode = GET_MODE (x);
+      if (!VECTOR_MODE_P (mode))
+	return;
+      constant = get_pool_constant (XEXP (x, 0));
+      if (GET_CODE (constant) != CONST_VECTOR)
+	return;
+
+      first = XVECEXP (constant, 0, 0);
+      /* There could be some rtx like
+	 (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
+	 but with "*.LC1" refer to V2DI constant vector.  */
+      if (GET_MODE (constant) != mode)
+	{
+	  constant = simplify_subreg (mode, constant, GET_MODE (constant), 0);
+	  if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR)
+	    return;
+	}
+
+      for (int i = 1; i < GET_MODE_NUNITS (mode); ++i)
+	{
+	  rtx tmp = XVECEXP (constant, 0, i);
+	  /* Vector duplicate value.  */
+	  if (!rtx_equal_p (tmp, first))
+	    return;
+	}
+
+      /* Replace with embedded broadcast.  */
+      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
+      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
+      validate_change (insn, loc, vec_dup, 0);
+
+      /* At most 1 memory_operand in an insn.  */
+      return;
+    }
+}
+
 /* At entry of the nearest common dominator for basic blocks with
    conversions, generate a single
 	vxorps %xmmN, %xmmN, %xmmN
@@ -2197,6 +2272,10 @@ remove_partial_avx_dependency (void)
 	  if (!NONDEBUG_INSN_P (insn))
 	    continue;
 
+	  /* Hanlde AVX512 embedded broadcast here to save compile time.  */
+	  if (TARGET_AVX512F)
+	    replace_constant_pool_with_broadcast (insn);
+
 	  set = single_set (insn);
 	  if (!set)
 	    continue;
@@ -2379,6 +2458,74 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
   return new pass_remove_partial_avx_dependency (ctxt);
 }
 
+/* For const vector having one duplicated value, there's no need to put
+   whole vector in the constant pool when target supports embedded broadcast. */
+static unsigned int
+constant_pool_broadcast (void)
+{
+  timevar_push (TV_MACH_DEP);
+  rtx_insn *insn;
+
+  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
+    {
+      if (!INSN_P (insn))
+	continue;
+      replace_constant_pool_with_broadcast (insn);
+    }
+  timevar_pop (TV_MACH_DEP);
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_constant_pool_broadcast =
+{
+  RTL_PASS, /* type */
+  "cpb", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_MACH_DEP, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_constant_pool_broadcast : public rtl_opt_pass
+{
+public:
+  pass_constant_pool_broadcast (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      /* Return false if rpad pass gate is true.
+	 replace_constant_pool_with_broadcast is called
+	 from both this pass and rpad pass.  */
+      return (TARGET_AVX512F
+	      && !(TARGET_AVX
+		   && TARGET_SSE_PARTIAL_REG_DEPENDENCY
+		   && TARGET_SSE_MATH
+		   && optimize
+		   && optimize_function_for_speed_p (cfun)));
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      return constant_pool_broadcast ();
+    }
+}; // class pass_cpb
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_constant_pool_broadcast (gcc::context *ctxt)
+{
+  return new pass_constant_pool_broadcast (ctxt);
+}
+
 /* This compares the priority of target features in function DECL1
    and DECL2.  It returns positive value if DECL1 is higher priority,
    negative value if DECL2 is higher priority and 0 if they are the
diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
index d83c7b956b1..07ecf8e790f 100644
--- a/gcc/config/i386/i386-passes.def
+++ b/gcc/config/i386/i386-passes.def
@@ -33,3 +33,4 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
 
   INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
+  INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast);
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index b6088f22d55..c5b700efd0e 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -386,3 +386,4 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area
   (gcc::context *);
 extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
   (gcc::context *);
+extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *);
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 8250325e1a3..a728b979f01 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -12120,6 +12120,18 @@ (define_insn "avx512dq_mul<mode>3<mask_name>"
    (set_attr "prefix" "evex")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn "*avx512dq_mul<mode>3<mask_name>_bcst"
+  [(set (match_operand:VI8_AVX512VL 0 "register_operand" "=v")
+	(mult:VI8_AVX512VL
+	  (vec_duplicate:VI8_AVX512VL
+	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
+	  (match_operand:VI8_AVX512VL 2 "register_operand" "v")))]
+  "TARGET_AVX512DQ"
+  "vpmullq\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
+  [(set_attr "type" "sseimul")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "<sseinsnmode>")])
+
 (define_expand "mul<mode>3<mask_name>"
   [(set (match_operand:VI4_AVX512F 0 "register_operand")
 	(mult:VI4_AVX512F
@@ -12160,6 +12172,18 @@ (define_insn "*<sse4_1_avx2>_mul<mode>3<mask_name>"
    (set_attr "btver2_decode" "vector,vector,vector")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn "*avx512f_mul<mode>3<mask_name>_bcst"
+  [(set (match_operand:VI4_AVX512VL 0 "register_operand" "=v")
+	(mult:VI4_AVX512VL
+	  (vec_duplicate:VI4_AVX512VL
+	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
+	  (match_operand:VI4_AVX512VL 2 "register_operand" "v")))]
+  "TARGET_AVX512F"
+   "vpmulld\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
+  [(set_attr "type" "sseimul")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "<sseinsnmode>")])
+
 (define_expand "mul<mode>3"
   [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand")
 	(mult:VI8_AVX2_AVX512F
diff --git a/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..aee1680ed98
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
@@ -0,0 +1,40 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx2 -mno-avx512f" } */
+/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
+/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP 101;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..a8ee5f5faf1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
@@ -0,0 +1,30 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512dq" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 5 } }  */
+
+typedef int v16si  __attribute__ ((vector_size (64)));
+typedef long long v8di  __attribute__ ((vector_size (64)));
+typedef float v16sf  __attribute__ ((vector_size (64)));
+typedef double v8df  __attribute__ ((vector_size (64)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP CONSTANT;				\
+}						\
+
+FOO (v16si, add, +);
+FOO (v8di, add, +);
+FOO (v16sf, add, +);
+FOO (v8df, add, +);
+FOO (v16si, sub, -);
+FOO (v8di, sub, -);
+FOO (v16si, mul, *);
+FOO (v8di, mul, *);
+FOO (v16sf, mul, *);
+FOO (v8df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
new file mode 100644
index 00000000000..30cf5809c3d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
@@ -0,0 +1,42 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512dq" } */
+/* { dg-require-effective-target avx512dq } */
+
+#define AVX512DQ
+#include "avx512f-helper.h"
+
+#include "avx512f-broadcast-pr87767-1.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP)		\
+  do							\
+    {							\
+      TYPE exp[N], src[N];				\
+      VTYPE res;					\
+      for (int i = 0; i < N; i++)			\
+	src[i] = i * i * 107;				\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);	\
+      for (int i = 0; i < N; i ++)			\
+	exp[i] = src[i] OP CONSTANT;			\
+      for (int j = 0; j < N; j++)			\
+	{						\
+	  if (res[j] != exp[j])				\
+	    abort();					\
+	}						\
+    }							\
+  while (0)
+
+void
+test_512 (void)
+{
+  RTEST (v16si, int, 16, add, +);
+  RTEST (v8di, long long, 8, add, +);
+  RTEST (v16sf, float, 16, add, +);
+  RTEST (v8df, double, 8, add, +);
+  RTEST (v16si, int, 16, sub, -);
+  RTEST (v8di, long long, 8, sub, -);
+  RTEST (v16si, int, 16, mul, *);
+  RTEST (v8di, long long, 8, mul, *);
+  RTEST (v16sf, float, 16, mul, *);
+  RTEST (v8df, double, 8, mul, *);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c
new file mode 100644
index 00000000000..c2f22c4ee5a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c
@@ -0,0 +1,30 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 4 } }  */
+
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef float v16sf  __attribute__ ((vector_size (64)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+typedef double v8df  __attribute__ ((vector_size (64)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP1, OP2)		\
+VTYPE						\
+ __attribute__ ((noipa))			\
+ foo_##OP_NAME##_##VTYPE (VTYPE a, VTYPE b)	\
+{						\
+  return (OP1 a * b) OP2 CONSTANT;		\
+}						\
+
+FOO (v16sf, fma,, +);
+FOO (v8df, fma,, +);
+FOO (v16sf, fms,, -);
+FOO (v8df, fms,, -);
+FOO (v16sf, fnma, -, +);
+FOO (v8df, fnma, -, +);
+FOO (v16sf, fnms, -, -);
+FOO (v8df, fnms, -, -);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c
new file mode 100644
index 00000000000..dabe91b11c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c
@@ -0,0 +1,42 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-require-effective-target avx512f } */
+
+#include "avx512f-helper.h"
+
+#include "avx512f-broadcast-pr87767-3.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP1, OP2)				\
+  do									\
+    {									\
+      TYPE exp[N], src1[N], src2[N];					\
+      VTYPE res;							\
+      for (int i = 0; i < N; i++)					\
+	{								\
+	  src1[i] = i * i * 107.2f;					\
+	  src2[i] = i * 2.f - 404.f;					\
+	}								\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src1[0], *(VTYPE*)&src2[0]); \
+      for (int i = 0; i < N; i ++)					\
+	exp[i] = (OP1 src1[i] * src2[i]) OP2 CONSTANT;			\
+      for (int j = 0; j < N; j++)					\
+	{								\
+	  if (res[j] != exp[j])						\
+	    abort();							\
+	}								\
+    }									\
+  while (0)
+
+void
+test_512 (void)
+{
+  RTEST (v16sf, float, 16, fma,, +);
+  RTEST (v8df, double, 8, fma,, +);
+  RTEST (v16sf, float, 16, fms,, -);
+  RTEST (v8df, double, 8, fms,, -);
+  RTEST (v16sf, float, 16, fnma,-, +);
+  RTEST (v8df, double, 8, fnma,-, +);
+  RTEST (v16sf, float, 16, fnms,-, -);
+  RTEST (v8df, double, 8, fnms,-, -);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
new file mode 100644
index 00000000000..72e1098ccbe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
@@ -0,0 +1,26 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-final { scan-assembler-times "\[^n\n\]*\\\{1to8\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^n\n\]*\\\{1to16\\\}" 4 } }  */
+
+typedef int v16si  __attribute__ ((vector_size (64)));
+typedef long long v8di  __attribute__ ((vector_size (64)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP1, OP2)		\
+VTYPE						\
+ __attribute__ ((noipa))			\
+ foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return (OP1 a) OP2 CONSTANT;			\
+}						\
+
+FOO (v16si, andnot, ~, &);
+FOO (v8di, andnot, ~, &);
+FOO (v16si, and,, &);
+FOO (v8di, and,, &);
+FOO (v16si, or,, |);
+FOO (v8di, or,, |);
+FOO (v16si, xor,, ^);
+FOO (v8di, xor,, ^);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c
new file mode 100644
index 00000000000..f288f83158c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c
@@ -0,0 +1,41 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-require-effective-target avx512f} */
+
+#include "avx512f-helper.h"
+
+#include "avx512f-broadcast-pr87767-5.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP1, OP2)				\
+  do									\
+    {									\
+      TYPE exp[N], src[N];						\
+      VTYPE res;							\
+      for (int i = 0; i < N; i++)					\
+	{								\
+	  src[i] = i * i * 107;						\
+	}								\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);			\
+      for (int i = 0; i < N; i ++)					\
+	exp[i] = (OP1 src[i]) OP2 CONSTANT;				\
+      for (int j = 0; j < N; j++)					\
+	{								\
+	  if (res[j] != exp[j])						\
+	    abort();							\
+	}								\
+    }									\
+  while (0)
+
+void
+test_512 (void)
+{
+  RTEST (v16si, int, 16, andnot, ~, &);
+  RTEST (v8di, long long, 8, andnot, ~, &);
+  RTEST (v16si, int, 16, and,, &);
+  RTEST (v8di, long long, 8, and,, &);
+  RTEST (v16si, int, 16, or,, |);
+  RTEST (v8di, long long, 8, or,, |);
+  RTEST (v16si, int, 16, xor,, ^);
+  RTEST (v8di, long long, 8, xor,, ^);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c
new file mode 100644
index 00000000000..a8f145d8589
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c
@@ -0,0 +1,121 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to16\\\}" 1 } }  */
+
+#include<immintrin.h>
+
+#define CONSTANT 101
+
+#define FOO(VTYPE, OP_NAME, LEN, SUFFIX, MTYPE)			\
+  VTYPE									\
+  __attribute__ ((noipa))						\
+  _mm##LEN##_foo_##OP_NAME##_##SUFFIX (VTYPE dst, VTYPE src, MTYPE m)	\
+  {									\
+    return  _mm##LEN##_mask_##OP_NAME##_##SUFFIX (dst, m, src,		\
+						  _mm##LEN##_set1_##SUFFIX (CONSTANT)); \
+  }									\
+
+#define FOOZ(VTYPE, OP_NAME, LEN, SUFFIX, MTYPE)			\
+  VTYPE									\
+  __attribute__ ((noipa))						\
+  _mm##LEN##_fooz_##OP_NAME##_##SUFFIX (VTYPE dst, VTYPE src, MTYPE m)	\
+  {									\
+    return  _mm##LEN##_maskz_##OP_NAME##_##SUFFIX (m, dst, src,		\
+						  _mm##LEN##_set1_##SUFFIX (CONSTANT)); \
+  }									\
+
+FOO (__m512, add, 512, ps, __mmask16);
+FOO (__m256, add, 256, ps, __mmask8);
+FOO (__m128, add,, ps, __mmask8);
+
+FOO (__m512, sub, 512, ps, __mmask16);
+FOO (__m256, sub, 256, ps, __mmask8);
+FOO (__m128, sub,, ps, __mmask8);
+
+FOO (__m512, mul, 512, ps, __mmask16);
+FOO (__m256, mul, 256, ps, __mmask8);
+FOO (__m128, mul,, ps, __mmask8);
+
+FOO (__m512, div, 512, ps, __mmask16);
+FOO (__m256, div, 256, ps, __mmask8);
+FOO (__m128, div,, ps, __mmask8);
+
+FOOZ (__m512, fmadd, 512, ps, __mmask16);
+FOOZ (__m256, fmadd, 256, ps, __mmask8);
+FOOZ (__m128, fmadd,, ps, __mmask8);
+
+FOOZ (__m512, fmsub, 512, ps, __mmask16);
+FOOZ (__m256, fmsub, 256, ps, __mmask8);
+FOOZ (__m128, fmsub,, ps, __mmask8);
+
+FOOZ (__m512, fnmadd, 512, ps, __mmask16);
+FOOZ (__m256, fnmadd, 256, ps, __mmask8);
+FOOZ (__m128, fnmadd,, ps, __mmask8);
+
+FOOZ (__m512, fnmsub, 512, ps, __mmask16);
+FOOZ (__m256, fnmsub, 256, ps, __mmask8);
+FOOZ (__m128, fnmsub,, ps, __mmask8);
+
+FOO (__m512d, add, 512, pd, __mmask8);
+FOO (__m256d, add, 256, pd, __mmask8);
+FOO (__m128d, add,, pd, __mmask8);
+
+FOO (__m512d, sub, 512, pd, __mmask8);
+FOO (__m256d, sub, 256, pd, __mmask8);
+FOO (__m128d, sub,, pd, __mmask8);
+
+FOO (__m512d, mul, 512, pd, __mmask8);
+FOO (__m256d, mul, 256, pd, __mmask8);
+FOO (__m128d, mul,, pd, __mmask8);
+
+FOO (__m512d, div, 512, pd, __mmask8);
+FOO (__m256d, div, 256, pd, __mmask8);
+FOO (__m128d, div,, pd, __mmask8);
+
+FOOZ (__m512d, fmadd, 512, pd, __mmask8);
+FOOZ (__m256d, fmadd, 256, pd, __mmask8);
+FOOZ (__m128d, fmadd,, pd, __mmask8);
+
+FOOZ (__m512d, fmsub, 512, pd, __mmask8);
+FOOZ (__m256d, fmsub, 256, pd, __mmask8);
+FOOZ (__m128d, fmsub,, pd, __mmask8);
+
+FOOZ (__m512d, fnmadd, 512, pd, __mmask8);
+FOOZ (__m256d, fnmadd, 256, pd, __mmask8);
+FOOZ (__m128d, fnmadd,, pd, __mmask8);
+
+FOOZ (__m512d, fnmsub, 512, pd, __mmask8);
+FOOZ (__m256d, fnmsub, 256, pd, __mmask8);
+FOOZ (__m128d, fnmsub,, pd, __mmask8);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..397e287134c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
@@ -0,0 +1,45 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 5 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 10 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP CONSTANT;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v4si, sub, -);
+FOO (v8si, sub, -);
+FOO (v2di, sub, -);
+FOO (v4di, sub, -);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-2.c
new file mode 100644
index 00000000000..9b796ac7124
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-2.c
@@ -0,0 +1,59 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512dq -mavx512vl" } */
+/* { dg-require-effective-target avx512dq } */
+/* { dg-require-effective-target avx512vl } */
+
+#define AVX512DQ
+#define AVX512VL
+#include "avx512f-helper.h"
+
+#include "avx512vl-broadcast-pr87767-1.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP)		\
+  do							\
+    {							\
+      TYPE exp[N], src[N];				\
+      VTYPE res;					\
+      for (int i = 0; i < N; i++)			\
+	src[i] = i * i * 107;				\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);	\
+      for (int i = 0; i < N; i ++)			\
+	exp[i] = src[i] OP CONSTANT;			\
+      for (int j = 0; j < N; j++)			\
+	{						\
+	  if (res[j] != exp[j])				\
+	    abort();					\
+	}						\
+    }							\
+  while (0)
+
+void
+test_256 (void)
+{
+  RTEST (v8si, int, 8, add, +);
+  RTEST (v4di, long long, 4, add, +);
+  RTEST (v8sf, float, 8, add, +);
+  RTEST (v4df, double, 4, add, +);
+  RTEST (v8si, int, 8, sub, -);
+  RTEST (v4di, long long, 4, sub, -);
+  RTEST (v8si, int, 8, mul, *);
+  RTEST (v4di, long long, 4, mul, *);
+  RTEST (v8sf, float, 8, mul, *);
+  RTEST (v4df, double, 4, mul, *);
+}
+
+void
+test_128 (void)
+{
+  RTEST (v4si, int, 4, add, +);
+  RTEST (v2di, long long, 2, add, +);
+  RTEST (v4sf, float, 4, add, +);
+  RTEST (v2df, double, 2, add, +);
+  RTEST (v4si, int, 4, sub, -);
+  RTEST (v2di, long long, 2, sub, -);
+  RTEST (v4si, int, 4, mul, *);
+  RTEST (v2di, long long, 2, mul, *);
+  RTEST (v4sf, float, 4, mul, *);
+  RTEST (v2df, double, 2, mul, *);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-3.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-3.c
new file mode 100644
index 00000000000..aedfb16016e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-3.c
@@ -0,0 +1,37 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 8 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 } }  */
+
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP1, OP2)		\
+VTYPE						\
+ __attribute__ ((noipa))			\
+ foo_##OP_NAME##_##VTYPE (VTYPE a, VTYPE b)	\
+{						\
+  return (OP1 a * b) OP2 CONSTANT;		\
+}						\
+
+FOO (v4sf, fma,, +);
+FOO (v8sf, fma,, +);
+FOO (v2df, fma,, +);
+FOO (v4df, fma,, +);
+FOO (v4sf, fms,, -);
+FOO (v8sf, fms,, -);
+FOO (v2df, fms,, -);
+FOO (v4df, fms,, -);
+FOO (v4sf, fnma, -, +);
+FOO (v8sf, fnma, -, +);
+FOO (v2df, fnma, -, +);
+FOO (v4df, fnma, -, +);
+FOO (v4sf, fnms, -, -);
+FOO (v8sf, fnms, -, -);
+FOO (v2df, fnms, -, -);
+FOO (v4df, fnms, -, -);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-4.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-4.c
new file mode 100644
index 00000000000..40b8eb9929d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-4.c
@@ -0,0 +1,56 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-require-effective-target avx512vl } */
+
+#define AVX512VL
+#include "avx512f-helper.h"
+
+#include "avx512vl-broadcast-pr87767-3.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP1, OP2)				\
+  do									\
+    {									\
+      TYPE exp[N], src1[N], src2[N];					\
+      VTYPE res;							\
+      for (int i = 0; i < N; i++)					\
+	{								\
+	  src1[i] = i * i * 107.2f;					\
+	  src2[i] = i * 2.f - 404.f;					\
+	}								\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src1[0], *(VTYPE*)&src2[0]); \
+      for (int i = 0; i < N; i ++)					\
+	exp[i] = (OP1 src1[i] * src2[i]) OP2 CONSTANT;			\
+      for (int j = 0; j < N; j++)					\
+	{								\
+	  if (res[j] != exp[j])						\
+	    abort();							\
+	}								\
+    }									\
+  while (0)
+
+void
+test_256 (void)
+{
+  RTEST (v8sf, float, 8, fma,, +);
+  RTEST (v4df, double, 4, fma,, +);
+  RTEST (v8sf, float, 8, fms,, -);
+  RTEST (v4df, double, 4, fms,, -);
+  RTEST (v8sf, float, 8, fnma,-, +);
+  RTEST (v4df, double, 4, fnma,-, +);
+  RTEST (v8sf, float, 8, fnms,-, -);
+  RTEST (v4df, double, 4, fnms,-, -);
+}
+
+void
+test_128 (void)
+{
+  RTEST (v4sf, float, 4, fma,, +);
+  RTEST (v2df, double, 2, fma,, +);
+  RTEST (v4sf, float, 4, fms,, -);
+  RTEST (v2df, double, 2, fms,, -);
+  RTEST (v4sf, float, 4, fnma,-, +);
+  RTEST (v2df, double, 2, fnma,-, +);
+  RTEST (v4sf, float, 4, fnms,-, -);
+  RTEST (v2df, double, 2, fnms,-, -);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
new file mode 100644
index 00000000000..1e9460faa9d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
@@ -0,0 +1,37 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 8 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP1, OP2)		\
+VTYPE						\
+ __attribute__ ((noipa))			\
+ foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return (OP1 a) OP2 CONSTANT;			\
+}						\
+
+FOO (v4si, andnot, ~, &);
+FOO (v8si, andnot, ~, &);
+FOO (v2di, andnot, ~, &);
+FOO (v4di, andnot, ~, &);
+FOO (v4si, and,, &);
+FOO (v8si, and,, &);
+FOO (v2di, and,, &);
+FOO (v4di, and,, &);
+FOO (v4si, or,, |);
+FOO (v8si, or,, |);
+FOO (v2di, or,, |);
+FOO (v4di, or,, |);
+FOO (v4si, xor,, ^);
+FOO (v8si, xor,, ^);
+FOO (v2di, xor,, ^);
+FOO (v4di, xor,, ^);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-6.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-6.c
new file mode 100644
index 00000000000..493a76f0917
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-6.c
@@ -0,0 +1,55 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-require-effective-target avx512vl } */
+
+#define AVX512VL
+#include "avx512f-helper.h"
+
+#include "avx512vl-broadcast-pr87767-5.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP1, OP2)				\
+  do									\
+    {									\
+      TYPE exp[N], src[N];						\
+      VTYPE res;							\
+      for (int i = 0; i < N; i++)					\
+	{								\
+	  src[i] = i * i * 107;						\
+	}								\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);			\
+      for (int i = 0; i < N; i ++)					\
+	exp[i] = (OP1 src[i]) OP2 CONSTANT;				\
+      for (int j = 0; j < N; j++)					\
+	{								\
+	  if (res[j] != exp[j])						\
+	    abort();							\
+	}								\
+    }									\
+  while (0)
+
+void
+test_256 (void)
+{
+  RTEST (v8si, int, 8, andnot, ~, &);
+  RTEST (v4di, long long, 4, andnot, ~, &);
+  RTEST (v8si, int, 8, and,, &);
+  RTEST (v4di, long long, 4, and,, &);
+  RTEST (v8si, int, 8, or,, |);
+  RTEST (v4di, long long, 4, or,, |);
+  RTEST (v8si, int, 8, xor,, ^);
+  RTEST (v4di, long long, 4, xor,, ^);
+}
+
+void
+test_128 (void)
+{
+  RTEST (v4si, int, 4, andnot, ~, &);
+  RTEST (v2di, long long, 2, andnot, ~, &);
+  RTEST (v4si, int, 4, and,, &);
+  RTEST (v2di, long long, 2, and,, &);
+  RTEST (v4si, int, 4, or,, |);
+  RTEST (v2di, long long, 2, or,, |);
+  RTEST (v4si, int, 4, xor,, ^);
+  RTEST (v2di, long long, 2, xor,, ^);
+}
-- 
2.18.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-09-02  1:57       ` Hongtao Liu
@ 2020-09-02  9:58         ` Jakub Jelinek
  2020-09-03  2:11           ` Hongtao Liu
  0 siblings, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2020-09-02  9:58 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: GCC Patches

On Wed, Sep 02, 2020 at 09:57:08AM +0800, Hongtao Liu via Gcc-patches wrote:
> +
> +      first = XVECEXP (constant, 0, 0);
> +      /* There could be some rtx like
> +	 (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> +	 but with "*.LC1" refer to V2DI constant vector.  */
> +      if (GET_MODE (constant) != mode)
> +	{
> +	  constant = simplify_subreg (mode, constant, GET_MODE (constant), 0);
> +	  if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR)
> +	    return;
> +	}

The
      first = XVECEXP (constant, 0, 0);
line needs to be after this if, not before it, otherwise it will miscompile
things or just ICE.

> @@ -2197,6 +2272,10 @@ remove_partial_avx_dependency (void)
>  	  if (!NONDEBUG_INSN_P (insn))
>  	    continue;
>  
> +	  /* Hanlde AVX512 embedded broadcast here to save compile time.  */

s/Hanlde/Handle/

> +  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
> +    {
> +      if (!INSN_P (insn))
> +	continue;
> +      replace_constant_pool_with_broadcast (insn);
> +    }

Perhaps instead do:
  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
    if (INSN_P (insn))
      replace_constant_pool_with_broadcast (insn);
?

> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +    {
> +      /* Return false if rpad pass gate is true.
> +	 replace_constant_pool_with_broadcast is called
> +	 from both this pass and rpad pass.  */
> +      return (TARGET_AVX512F
> +	      && !(TARGET_AVX
> +		   && TARGET_SSE_PARTIAL_REG_DEPENDENCY
> +		   && TARGET_SSE_MATH
> +		   && optimize
> +		   && optimize_function_for_speed_p (cfun)));

I think this could be a maintainance nightmare.
Perhaps instead add

static bool
remove_partial_avx_dependency_gate ()
{
  return (TARGET_AVX
	  && TARGET_SSE_PARTIAL_REG_DEPENDENCY
	  && TARGET_SSE_MATH
	  && optimize
	  && optimize_function_for_speed_p (cfun));
}
after the remove_partial_avx_dependency function definition,
change pass_remove_partial_avx_dependency gate body to
      return remove_partial_avx_dependency_gate ();
and in pass_constant_pool_broadcast::gate do
      return (TARGET_AVX512F && !remove_partial_avx_dependency_gate ();
(with the comment you have there)?

LGTM with those changes.

	Jakub


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-09-02  9:58         ` Jakub Jelinek
@ 2020-09-03  2:11           ` Hongtao Liu
  2020-09-03  7:27             ` Jakub Jelinek
  0 siblings, 1 reply; 30+ messages in thread
From: Hongtao Liu @ 2020-09-03  2:11 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 2734 bytes --]

On Wed, Sep 2, 2020 at 5:58 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Wed, Sep 02, 2020 at 09:57:08AM +0800, Hongtao Liu via Gcc-patches wrote:
> > +
> > +      first = XVECEXP (constant, 0, 0);
> > +      /* There could be some rtx like
> > +      (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> > +      but with "*.LC1" refer to V2DI constant vector.  */
> > +      if (GET_MODE (constant) != mode)
> > +     {
> > +       constant = simplify_subreg (mode, constant, GET_MODE (constant), 0);
> > +       if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR)
> > +         return;
> > +     }
>
> The
>       first = XVECEXP (constant, 0, 0);
> line needs to be after this if, not before it, otherwise it will miscompile
> things or just ICE.
>

Changed.

> > @@ -2197,6 +2272,10 @@ remove_partial_avx_dependency (void)
> >         if (!NONDEBUG_INSN_P (insn))
> >           continue;
> >
> > +       /* Hanlde AVX512 embedded broadcast here to save compile time.  */
>
> s/Hanlde/Handle/
>

Changed, sorry for the typo.

> > +  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
> > +    {
> > +      if (!INSN_P (insn))
> > +     continue;
> > +      replace_constant_pool_with_broadcast (insn);
> > +    }
>
> Perhaps instead do:
>   for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
>     if (INSN_P (insn))
>       replace_constant_pool_with_broadcast (insn);
> ?
>

Changed.

> > +  /* opt_pass methods: */
> > +  virtual bool gate (function *)
> > +    {
> > +      /* Return false if rpad pass gate is true.
> > +      replace_constant_pool_with_broadcast is called
> > +      from both this pass and rpad pass.  */
> > +      return (TARGET_AVX512F
> > +           && !(TARGET_AVX
> > +                && TARGET_SSE_PARTIAL_REG_DEPENDENCY
> > +                && TARGET_SSE_MATH
> > +                && optimize
> > +                && optimize_function_for_speed_p (cfun)));
>
> I think this could be a maintainance nightmare.
> Perhaps instead add
>

Yes, a common interface should be added as bellow, changed.

> static bool
> remove_partial_avx_dependency_gate ()
> {
>   return (TARGET_AVX
>           && TARGET_SSE_PARTIAL_REG_DEPENDENCY
>           && TARGET_SSE_MATH
>           && optimize
>           && optimize_function_for_speed_p (cfun));
> }
> after the remove_partial_avx_dependency function definition,
> change pass_remove_partial_avx_dependency gate body to
>       return remove_partial_avx_dependency_gate ();
> and in pass_constant_pool_broadcast::gate do
>       return (TARGET_AVX512F && !remove_partial_avx_dependency_gate ();
> (with the comment you have there)?
>
> LGTM with those changes.
>
>         Jakub
>

Thanks for the review, update patch.

-- 
BR,
Hongtao

[-- Attachment #2: 0001-Optimize-memory-broadcast-for-constant-vector-under-_V5.patch --]
[-- Type: text/x-patch, Size: 39244 bytes --]

From acf3825279190ca0540bb4704f66568fdbe06ce8 Mon Sep 17 00:00:00 2001
From: liuhongt <hongtao.liu@intel.com>
Date: Wed, 8 Jul 2020 17:14:36 +0800
Subject: [PATCH] Optimize memory broadcast for constant vector under AVX512.

For constant vector having one duplicated value, there's no need to put
whole vector in the constant pool, using embedded broadcast instead.

2020-07-09  Hongtao Liu  <hongtao.liu@intel.com>

gcc/ChangeLog:

	PR target/87767
	* config/i386/i386-features.c
	(replace_constant_pool_with_broadcast): New function.
	(constant_pool_broadcast): Ditto.
	(class pass_constant_pool_broadcast): New pass.
	(make_pass_constant_pool_broadcast): Ditto.
	(remove_partial_avx_dependency): Call
	replace_constant_pool_with_broadcast under TARGET_AVX512F, it
	would save compile time when both pass rpad and cpb are
	available.
	(remove_partial_avx_dependency_gate): New function.
	(class pass_remove_partial_avx_dependency::gate): Call
	remove_partial_avx_dependency_gate.
	* config/i386/i386-passes.def: Insert new pass after combine.
	* config/i386/i386-protos.h
	(make_pass_constant_pool_broadcast): Declare.
	* config/i386/sse.md (*avx512dq_mul<mode>3<mask_name>_bcst):
	New define_insn.
	(*avx512f_mul<mode>3<mask_name>_bcst): Ditto.
	* config/i386/avx512fintrin.h (_mm512_set1_ps,
	_mm512_set1_pd,_mm512_set1_epi32, _mm512_set1_epi64): Adjusted.

gcc/testsuite/ChangeLog:

	PR target/87767
	* gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-3.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-4.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-5.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-6.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-7.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-2.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-3.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-4.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-5.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-6.c: New test.
---
 gcc/config/i386/avx512fintrin.h               |  27 +--
 gcc/config/i386/i386-features.c               | 157 +++++++++++++++++-
 gcc/config/i386/i386-passes.def               |   1 +
 gcc/config/i386/i386-protos.h                 |   1 +
 gcc/config/i386/sse.md                        |  24 +++
 .../i386/avx2-broadcast-pr87767-1.c           |  40 +++++
 .../i386/avx512f-broadcast-pr87767-1.c        |  30 ++++
 .../i386/avx512f-broadcast-pr87767-2.c        |  42 +++++
 .../i386/avx512f-broadcast-pr87767-3.c        |  30 ++++
 .../i386/avx512f-broadcast-pr87767-4.c        |  42 +++++
 .../i386/avx512f-broadcast-pr87767-5.c        |  26 +++
 .../i386/avx512f-broadcast-pr87767-6.c        |  41 +++++
 .../i386/avx512f-broadcast-pr87767-7.c        | 121 ++++++++++++++
 .../i386/avx512vl-broadcast-pr87767-1.c       |  45 +++++
 .../i386/avx512vl-broadcast-pr87767-2.c       |  59 +++++++
 .../i386/avx512vl-broadcast-pr87767-3.c       |  37 +++++
 .../i386/avx512vl-broadcast-pr87767-4.c       |  56 +++++++
 .../i386/avx512vl-broadcast-pr87767-5.c       |  37 +++++
 .../i386/avx512vl-broadcast-pr87767-6.c       |  55 ++++++
 19 files changed, 848 insertions(+), 23 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-6.c

diff --git a/gcc/config/i386/avx512fintrin.h b/gcc/config/i386/avx512fintrin.h
index 0d53dda3a27..729d5686d68 100644
--- a/gcc/config/i386/avx512fintrin.h
+++ b/gcc/config/i386/avx512fintrin.h
@@ -239,22 +239,17 @@ extern __inline __m512d
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_pd (double __A)
 {
-  return (__m512d) __builtin_ia32_broadcastsd512 (__extension__
-						  (__v2df) { __A, },
-						  (__v8df)
-						  _mm512_undefined_pd (),
-						  (__mmask8) -1);
+  return __extension__ (__m512d)(__v8df)
+    { __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 extern __inline __m512
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_ps (float __A)
 {
-  return (__m512) __builtin_ia32_broadcastss512 (__extension__
-						 (__v4sf) { __A, },
-						 (__v16sf)
-						 _mm512_undefined_ps (),
-						 (__mmask16) -1);
+  return __extension__ (__m512)(__v16sf)
+    { __A, __A, __A, __A, __A, __A, __A, __A,
+      __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 /* Create the vector [A B C D A B C D A B C D A B C D].  */
@@ -4072,10 +4067,9 @@ extern __inline __m512i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_epi32 (int __A)
 {
-  return (__m512i) __builtin_ia32_pbroadcastd512_gpr_mask (__A,
-							   (__v16si)
-							   _mm512_undefined_epi32 (),
-							   (__mmask16)(-1));
+  return (__m512i)(__v16si)
+    { __A, __A, __A, __A, __A, __A, __A, __A,
+      __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 extern __inline __m512i
@@ -4128,10 +4122,7 @@ extern __inline __m512i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_set1_epi64 (long long __A)
 {
-  return (__m512i) __builtin_ia32_pbroadcastq512_gpr_mask (__A,
-							   (__v8di)
-							   _mm512_undefined_epi32 (),
-							   (__mmask8)(-1));
+  return (__m512i)(__v8di) { __A, __A, __A, __A, __A, __A, __A, __A };
 }
 
 extern __inline __m512i
diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 535fc7e981d..620f7f157f4 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -2162,6 +2162,81 @@ make_pass_insert_endbr_and_patchable_area (gcc::context *ctxt)
   return new pass_insert_endbr_and_patchable_area (ctxt);
 }
 
+/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
+   with embedded broadcast. i.e.transform
+
+     vpaddq .LC0(%rip), %zmm0, %zmm0
+     ret
+  .LC0:
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+
+    to
+
+     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0
+     ret
+  .LC0:
+    .quad 3  */
+static void
+replace_constant_pool_with_broadcast (rtx_insn *insn)
+{
+  subrtx_ptr_iterator::array_type array;
+  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
+    {
+      rtx *loc = *iter;
+      rtx x = *loc;
+      rtx broadcast_mem, vec_dup, constant, first;
+      machine_mode mode;
+
+      /* Constant pool.  */
+      if (!MEM_P (x)
+	  || !SYMBOL_REF_P (XEXP (x, 0))
+	  || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
+	continue;
+
+      /* Const vector.  */
+      mode = GET_MODE (x);
+      if (!VECTOR_MODE_P (mode))
+	return;
+      constant = get_pool_constant (XEXP (x, 0));
+      if (GET_CODE (constant) != CONST_VECTOR)
+	return;
+
+      /* There could be some rtx like
+	 (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
+	 but with "*.LC1" refer to V2DI constant vector.  */
+      if (GET_MODE (constant) != mode)
+	{
+	  constant = simplify_subreg (mode, constant, GET_MODE (constant), 0);
+	  if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR)
+	    return;
+	}
+      first = XVECEXP (constant, 0, 0);
+
+      for (int i = 1; i < GET_MODE_NUNITS (mode); ++i)
+	{
+	  rtx tmp = XVECEXP (constant, 0, i);
+	  /* Vector duplicate value.  */
+	  if (!rtx_equal_p (tmp, first))
+	    return;
+	}
+
+      /* Replace with embedded broadcast.  */
+      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
+      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
+      validate_change (insn, loc, vec_dup, 0);
+
+      /* At most 1 memory_operand in an insn.  */
+      return;
+    }
+}
+
 /* At entry of the nearest common dominator for basic blocks with
    conversions, generate a single
 	vxorps %xmmN, %xmmN, %xmmN
@@ -2197,6 +2272,10 @@ remove_partial_avx_dependency (void)
 	  if (!NONDEBUG_INSN_P (insn))
 	    continue;
 
+	  /* Handle AVX512 embedded broadcast here to save compile time.  */
+	  if (TARGET_AVX512F)
+	    replace_constant_pool_with_broadcast (insn);
+
 	  set = single_set (insn);
 	  if (!set)
 	    continue;
@@ -2333,6 +2412,16 @@ remove_partial_avx_dependency (void)
   return 0;
 }
 
+static bool
+remove_partial_avx_dependency_gate ()
+{
+  return (TARGET_AVX
+	  && TARGET_SSE_PARTIAL_REG_DEPENDENCY
+	  && TARGET_SSE_MATH
+	  && optimize
+	  && optimize_function_for_speed_p (cfun));
+}
+
 namespace {
 
 const pass_data pass_data_remove_partial_avx_dependency =
@@ -2358,11 +2447,7 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *)
     {
-      return (TARGET_AVX
-	      && TARGET_SSE_PARTIAL_REG_DEPENDENCY
-	      && TARGET_SSE_MATH
-	      && optimize
-	      && optimize_function_for_speed_p (cfun));
+      return remove_partial_avx_dependency_gate ();
     }
 
   virtual unsigned int execute (function *)
@@ -2379,6 +2464,68 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
   return new pass_remove_partial_avx_dependency (ctxt);
 }
 
+/* For const vector having one duplicated value, there's no need to put
+   whole vector in the constant pool when target supports embedded broadcast. */
+static unsigned int
+constant_pool_broadcast (void)
+{
+  timevar_push (TV_MACH_DEP);
+  rtx_insn *insn;
+
+  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
+    {
+      if (INSN_P (insn))
+	replace_constant_pool_with_broadcast (insn);
+    }
+  timevar_pop (TV_MACH_DEP);
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_constant_pool_broadcast =
+{
+  RTL_PASS, /* type */
+  "cpb", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_MACH_DEP, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_constant_pool_broadcast : public rtl_opt_pass
+{
+public:
+  pass_constant_pool_broadcast (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      /* Return false if rpad pass gate is true.
+	 replace_constant_pool_with_broadcast is called
+	 from both this pass and rpad pass.  */
+      return (TARGET_AVX512F && !remove_partial_avx_dependency_gate ());
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      return constant_pool_broadcast ();
+    }
+}; // class pass_cpb
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_constant_pool_broadcast (gcc::context *ctxt)
+{
+  return new pass_constant_pool_broadcast (ctxt);
+}
+
 /* This compares the priority of target features in function DECL1
    and DECL2.  It returns positive value if DECL1 is higher priority,
    negative value if DECL2 is higher priority and 0 if they are the
diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
index d83c7b956b1..07ecf8e790f 100644
--- a/gcc/config/i386/i386-passes.def
+++ b/gcc/config/i386/i386-passes.def
@@ -33,3 +33,4 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
 
   INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
+  INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast);
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index b6088f22d55..c5b700efd0e 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -386,3 +386,4 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area
   (gcc::context *);
 extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
   (gcc::context *);
+extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *);
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 8250325e1a3..a728b979f01 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -12120,6 +12120,18 @@ (define_insn "avx512dq_mul<mode>3<mask_name>"
    (set_attr "prefix" "evex")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn "*avx512dq_mul<mode>3<mask_name>_bcst"
+  [(set (match_operand:VI8_AVX512VL 0 "register_operand" "=v")
+	(mult:VI8_AVX512VL
+	  (vec_duplicate:VI8_AVX512VL
+	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
+	  (match_operand:VI8_AVX512VL 2 "register_operand" "v")))]
+  "TARGET_AVX512DQ"
+  "vpmullq\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
+  [(set_attr "type" "sseimul")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "<sseinsnmode>")])
+
 (define_expand "mul<mode>3<mask_name>"
   [(set (match_operand:VI4_AVX512F 0 "register_operand")
 	(mult:VI4_AVX512F
@@ -12160,6 +12172,18 @@ (define_insn "*<sse4_1_avx2>_mul<mode>3<mask_name>"
    (set_attr "btver2_decode" "vector,vector,vector")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn "*avx512f_mul<mode>3<mask_name>_bcst"
+  [(set (match_operand:VI4_AVX512VL 0 "register_operand" "=v")
+	(mult:VI4_AVX512VL
+	  (vec_duplicate:VI4_AVX512VL
+	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
+	  (match_operand:VI4_AVX512VL 2 "register_operand" "v")))]
+  "TARGET_AVX512F"
+   "vpmulld\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
+  [(set_attr "type" "sseimul")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "<sseinsnmode>")])
+
 (define_expand "mul<mode>3"
   [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand")
 	(mult:VI8_AVX2_AVX512F
diff --git a/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..aee1680ed98
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
@@ -0,0 +1,40 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx2 -mno-avx512f" } */
+/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
+/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP 101;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..a8ee5f5faf1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
@@ -0,0 +1,30 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512dq" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 5 } }  */
+
+typedef int v16si  __attribute__ ((vector_size (64)));
+typedef long long v8di  __attribute__ ((vector_size (64)));
+typedef float v16sf  __attribute__ ((vector_size (64)));
+typedef double v8df  __attribute__ ((vector_size (64)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP CONSTANT;				\
+}						\
+
+FOO (v16si, add, +);
+FOO (v8di, add, +);
+FOO (v16sf, add, +);
+FOO (v8df, add, +);
+FOO (v16si, sub, -);
+FOO (v8di, sub, -);
+FOO (v16si, mul, *);
+FOO (v8di, mul, *);
+FOO (v16sf, mul, *);
+FOO (v8df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
new file mode 100644
index 00000000000..30cf5809c3d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
@@ -0,0 +1,42 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512dq" } */
+/* { dg-require-effective-target avx512dq } */
+
+#define AVX512DQ
+#include "avx512f-helper.h"
+
+#include "avx512f-broadcast-pr87767-1.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP)		\
+  do							\
+    {							\
+      TYPE exp[N], src[N];				\
+      VTYPE res;					\
+      for (int i = 0; i < N; i++)			\
+	src[i] = i * i * 107;				\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);	\
+      for (int i = 0; i < N; i ++)			\
+	exp[i] = src[i] OP CONSTANT;			\
+      for (int j = 0; j < N; j++)			\
+	{						\
+	  if (res[j] != exp[j])				\
+	    abort();					\
+	}						\
+    }							\
+  while (0)
+
+void
+test_512 (void)
+{
+  RTEST (v16si, int, 16, add, +);
+  RTEST (v8di, long long, 8, add, +);
+  RTEST (v16sf, float, 16, add, +);
+  RTEST (v8df, double, 8, add, +);
+  RTEST (v16si, int, 16, sub, -);
+  RTEST (v8di, long long, 8, sub, -);
+  RTEST (v16si, int, 16, mul, *);
+  RTEST (v8di, long long, 8, mul, *);
+  RTEST (v16sf, float, 16, mul, *);
+  RTEST (v8df, double, 8, mul, *);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c
new file mode 100644
index 00000000000..c2f22c4ee5a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-3.c
@@ -0,0 +1,30 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 4 } }  */
+
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef float v16sf  __attribute__ ((vector_size (64)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+typedef double v8df  __attribute__ ((vector_size (64)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP1, OP2)		\
+VTYPE						\
+ __attribute__ ((noipa))			\
+ foo_##OP_NAME##_##VTYPE (VTYPE a, VTYPE b)	\
+{						\
+  return (OP1 a * b) OP2 CONSTANT;		\
+}						\
+
+FOO (v16sf, fma,, +);
+FOO (v8df, fma,, +);
+FOO (v16sf, fms,, -);
+FOO (v8df, fms,, -);
+FOO (v16sf, fnma, -, +);
+FOO (v8df, fnma, -, +);
+FOO (v16sf, fnms, -, -);
+FOO (v8df, fnms, -, -);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c
new file mode 100644
index 00000000000..dabe91b11c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-4.c
@@ -0,0 +1,42 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-require-effective-target avx512f } */
+
+#include "avx512f-helper.h"
+
+#include "avx512f-broadcast-pr87767-3.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP1, OP2)				\
+  do									\
+    {									\
+      TYPE exp[N], src1[N], src2[N];					\
+      VTYPE res;							\
+      for (int i = 0; i < N; i++)					\
+	{								\
+	  src1[i] = i * i * 107.2f;					\
+	  src2[i] = i * 2.f - 404.f;					\
+	}								\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src1[0], *(VTYPE*)&src2[0]); \
+      for (int i = 0; i < N; i ++)					\
+	exp[i] = (OP1 src1[i] * src2[i]) OP2 CONSTANT;			\
+      for (int j = 0; j < N; j++)					\
+	{								\
+	  if (res[j] != exp[j])						\
+	    abort();							\
+	}								\
+    }									\
+  while (0)
+
+void
+test_512 (void)
+{
+  RTEST (v16sf, float, 16, fma,, +);
+  RTEST (v8df, double, 8, fma,, +);
+  RTEST (v16sf, float, 16, fms,, -);
+  RTEST (v8df, double, 8, fms,, -);
+  RTEST (v16sf, float, 16, fnma,-, +);
+  RTEST (v8df, double, 8, fnma,-, +);
+  RTEST (v16sf, float, 16, fnms,-, -);
+  RTEST (v8df, double, 8, fnms,-, -);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
new file mode 100644
index 00000000000..72e1098ccbe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
@@ -0,0 +1,26 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-final { scan-assembler-times "\[^n\n\]*\\\{1to8\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^n\n\]*\\\{1to16\\\}" 4 } }  */
+
+typedef int v16si  __attribute__ ((vector_size (64)));
+typedef long long v8di  __attribute__ ((vector_size (64)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP1, OP2)		\
+VTYPE						\
+ __attribute__ ((noipa))			\
+ foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return (OP1 a) OP2 CONSTANT;			\
+}						\
+
+FOO (v16si, andnot, ~, &);
+FOO (v8di, andnot, ~, &);
+FOO (v16si, and,, &);
+FOO (v8di, and,, &);
+FOO (v16si, or,, |);
+FOO (v8di, or,, |);
+FOO (v16si, xor,, ^);
+FOO (v8di, xor,, ^);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c
new file mode 100644
index 00000000000..f288f83158c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-6.c
@@ -0,0 +1,41 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-require-effective-target avx512f} */
+
+#include "avx512f-helper.h"
+
+#include "avx512f-broadcast-pr87767-5.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP1, OP2)				\
+  do									\
+    {									\
+      TYPE exp[N], src[N];						\
+      VTYPE res;							\
+      for (int i = 0; i < N; i++)					\
+	{								\
+	  src[i] = i * i * 107;						\
+	}								\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);			\
+      for (int i = 0; i < N; i ++)					\
+	exp[i] = (OP1 src[i]) OP2 CONSTANT;				\
+      for (int j = 0; j < N; j++)					\
+	{								\
+	  if (res[j] != exp[j])						\
+	    abort();							\
+	}								\
+    }									\
+  while (0)
+
+void
+test_512 (void)
+{
+  RTEST (v16si, int, 16, andnot, ~, &);
+  RTEST (v8di, long long, 8, andnot, ~, &);
+  RTEST (v16si, int, 16, and,, &);
+  RTEST (v8di, long long, 8, and,, &);
+  RTEST (v16si, int, 16, or,, |);
+  RTEST (v8di, long long, 8, or,, |);
+  RTEST (v16si, int, 16, xor,, ^);
+  RTEST (v8di, long long, 8, xor,, ^);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c
new file mode 100644
index 00000000000..a8f145d8589
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-7.c
@@ -0,0 +1,121 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vsub\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vdiv\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfmsub\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vfnmsub\[^\n\]*\\\{1to16\\\}" 1 } }  */
+
+#include<immintrin.h>
+
+#define CONSTANT 101
+
+#define FOO(VTYPE, OP_NAME, LEN, SUFFIX, MTYPE)			\
+  VTYPE									\
+  __attribute__ ((noipa))						\
+  _mm##LEN##_foo_##OP_NAME##_##SUFFIX (VTYPE dst, VTYPE src, MTYPE m)	\
+  {									\
+    return  _mm##LEN##_mask_##OP_NAME##_##SUFFIX (dst, m, src,		\
+						  _mm##LEN##_set1_##SUFFIX (CONSTANT)); \
+  }									\
+
+#define FOOZ(VTYPE, OP_NAME, LEN, SUFFIX, MTYPE)			\
+  VTYPE									\
+  __attribute__ ((noipa))						\
+  _mm##LEN##_fooz_##OP_NAME##_##SUFFIX (VTYPE dst, VTYPE src, MTYPE m)	\
+  {									\
+    return  _mm##LEN##_maskz_##OP_NAME##_##SUFFIX (m, dst, src,		\
+						  _mm##LEN##_set1_##SUFFIX (CONSTANT)); \
+  }									\
+
+FOO (__m512, add, 512, ps, __mmask16);
+FOO (__m256, add, 256, ps, __mmask8);
+FOO (__m128, add,, ps, __mmask8);
+
+FOO (__m512, sub, 512, ps, __mmask16);
+FOO (__m256, sub, 256, ps, __mmask8);
+FOO (__m128, sub,, ps, __mmask8);
+
+FOO (__m512, mul, 512, ps, __mmask16);
+FOO (__m256, mul, 256, ps, __mmask8);
+FOO (__m128, mul,, ps, __mmask8);
+
+FOO (__m512, div, 512, ps, __mmask16);
+FOO (__m256, div, 256, ps, __mmask8);
+FOO (__m128, div,, ps, __mmask8);
+
+FOOZ (__m512, fmadd, 512, ps, __mmask16);
+FOOZ (__m256, fmadd, 256, ps, __mmask8);
+FOOZ (__m128, fmadd,, ps, __mmask8);
+
+FOOZ (__m512, fmsub, 512, ps, __mmask16);
+FOOZ (__m256, fmsub, 256, ps, __mmask8);
+FOOZ (__m128, fmsub,, ps, __mmask8);
+
+FOOZ (__m512, fnmadd, 512, ps, __mmask16);
+FOOZ (__m256, fnmadd, 256, ps, __mmask8);
+FOOZ (__m128, fnmadd,, ps, __mmask8);
+
+FOOZ (__m512, fnmsub, 512, ps, __mmask16);
+FOOZ (__m256, fnmsub, 256, ps, __mmask8);
+FOOZ (__m128, fnmsub,, ps, __mmask8);
+
+FOO (__m512d, add, 512, pd, __mmask8);
+FOO (__m256d, add, 256, pd, __mmask8);
+FOO (__m128d, add,, pd, __mmask8);
+
+FOO (__m512d, sub, 512, pd, __mmask8);
+FOO (__m256d, sub, 256, pd, __mmask8);
+FOO (__m128d, sub,, pd, __mmask8);
+
+FOO (__m512d, mul, 512, pd, __mmask8);
+FOO (__m256d, mul, 256, pd, __mmask8);
+FOO (__m128d, mul,, pd, __mmask8);
+
+FOO (__m512d, div, 512, pd, __mmask8);
+FOO (__m256d, div, 256, pd, __mmask8);
+FOO (__m128d, div,, pd, __mmask8);
+
+FOOZ (__m512d, fmadd, 512, pd, __mmask8);
+FOOZ (__m256d, fmadd, 256, pd, __mmask8);
+FOOZ (__m128d, fmadd,, pd, __mmask8);
+
+FOOZ (__m512d, fmsub, 512, pd, __mmask8);
+FOOZ (__m256d, fmsub, 256, pd, __mmask8);
+FOOZ (__m128d, fmsub,, pd, __mmask8);
+
+FOOZ (__m512d, fnmadd, 512, pd, __mmask8);
+FOOZ (__m256d, fnmadd, 256, pd, __mmask8);
+FOOZ (__m128d, fnmadd,, pd, __mmask8);
+
+FOOZ (__m512d, fnmsub, 512, pd, __mmask8);
+FOOZ (__m256d, fnmsub, 256, pd, __mmask8);
+FOOZ (__m128d, fnmsub,, pd, __mmask8);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..397e287134c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
@@ -0,0 +1,45 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 5 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 10 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP CONSTANT;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v4si, sub, -);
+FOO (v8si, sub, -);
+FOO (v2di, sub, -);
+FOO (v4di, sub, -);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-2.c
new file mode 100644
index 00000000000..9b796ac7124
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-2.c
@@ -0,0 +1,59 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512dq -mavx512vl" } */
+/* { dg-require-effective-target avx512dq } */
+/* { dg-require-effective-target avx512vl } */
+
+#define AVX512DQ
+#define AVX512VL
+#include "avx512f-helper.h"
+
+#include "avx512vl-broadcast-pr87767-1.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP)		\
+  do							\
+    {							\
+      TYPE exp[N], src[N];				\
+      VTYPE res;					\
+      for (int i = 0; i < N; i++)			\
+	src[i] = i * i * 107;				\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);	\
+      for (int i = 0; i < N; i ++)			\
+	exp[i] = src[i] OP CONSTANT;			\
+      for (int j = 0; j < N; j++)			\
+	{						\
+	  if (res[j] != exp[j])				\
+	    abort();					\
+	}						\
+    }							\
+  while (0)
+
+void
+test_256 (void)
+{
+  RTEST (v8si, int, 8, add, +);
+  RTEST (v4di, long long, 4, add, +);
+  RTEST (v8sf, float, 8, add, +);
+  RTEST (v4df, double, 4, add, +);
+  RTEST (v8si, int, 8, sub, -);
+  RTEST (v4di, long long, 4, sub, -);
+  RTEST (v8si, int, 8, mul, *);
+  RTEST (v4di, long long, 4, mul, *);
+  RTEST (v8sf, float, 8, mul, *);
+  RTEST (v4df, double, 4, mul, *);
+}
+
+void
+test_128 (void)
+{
+  RTEST (v4si, int, 4, add, +);
+  RTEST (v2di, long long, 2, add, +);
+  RTEST (v4sf, float, 4, add, +);
+  RTEST (v2df, double, 2, add, +);
+  RTEST (v4si, int, 4, sub, -);
+  RTEST (v2di, long long, 2, sub, -);
+  RTEST (v4si, int, 4, mul, *);
+  RTEST (v2di, long long, 2, mul, *);
+  RTEST (v4sf, float, 4, mul, *);
+  RTEST (v2df, double, 2, mul, *);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-3.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-3.c
new file mode 100644
index 00000000000..aedfb16016e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-3.c
@@ -0,0 +1,37 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 8 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 } }  */
+
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP1, OP2)		\
+VTYPE						\
+ __attribute__ ((noipa))			\
+ foo_##OP_NAME##_##VTYPE (VTYPE a, VTYPE b)	\
+{						\
+  return (OP1 a * b) OP2 CONSTANT;		\
+}						\
+
+FOO (v4sf, fma,, +);
+FOO (v8sf, fma,, +);
+FOO (v2df, fma,, +);
+FOO (v4df, fma,, +);
+FOO (v4sf, fms,, -);
+FOO (v8sf, fms,, -);
+FOO (v2df, fms,, -);
+FOO (v4df, fms,, -);
+FOO (v4sf, fnma, -, +);
+FOO (v8sf, fnma, -, +);
+FOO (v2df, fnma, -, +);
+FOO (v4df, fnma, -, +);
+FOO (v4sf, fnms, -, -);
+FOO (v8sf, fnms, -, -);
+FOO (v2df, fnms, -, -);
+FOO (v4df, fnms, -, -);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-4.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-4.c
new file mode 100644
index 00000000000..40b8eb9929d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-4.c
@@ -0,0 +1,56 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-require-effective-target avx512vl } */
+
+#define AVX512VL
+#include "avx512f-helper.h"
+
+#include "avx512vl-broadcast-pr87767-3.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP1, OP2)				\
+  do									\
+    {									\
+      TYPE exp[N], src1[N], src2[N];					\
+      VTYPE res;							\
+      for (int i = 0; i < N; i++)					\
+	{								\
+	  src1[i] = i * i * 107.2f;					\
+	  src2[i] = i * 2.f - 404.f;					\
+	}								\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src1[0], *(VTYPE*)&src2[0]); \
+      for (int i = 0; i < N; i ++)					\
+	exp[i] = (OP1 src1[i] * src2[i]) OP2 CONSTANT;			\
+      for (int j = 0; j < N; j++)					\
+	{								\
+	  if (res[j] != exp[j])						\
+	    abort();							\
+	}								\
+    }									\
+  while (0)
+
+void
+test_256 (void)
+{
+  RTEST (v8sf, float, 8, fma,, +);
+  RTEST (v4df, double, 4, fma,, +);
+  RTEST (v8sf, float, 8, fms,, -);
+  RTEST (v4df, double, 4, fms,, -);
+  RTEST (v8sf, float, 8, fnma,-, +);
+  RTEST (v4df, double, 4, fnma,-, +);
+  RTEST (v8sf, float, 8, fnms,-, -);
+  RTEST (v4df, double, 4, fnms,-, -);
+}
+
+void
+test_128 (void)
+{
+  RTEST (v4sf, float, 4, fma,, +);
+  RTEST (v2df, double, 2, fma,, +);
+  RTEST (v4sf, float, 4, fms,, -);
+  RTEST (v2df, double, 2, fms,, -);
+  RTEST (v4sf, float, 4, fnma,-, +);
+  RTEST (v2df, double, 2, fnma,-, +);
+  RTEST (v4sf, float, 4, fnms,-, -);
+  RTEST (v2df, double, 2, fnms,-, -);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
new file mode 100644
index 00000000000..1e9460faa9d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
@@ -0,0 +1,37 @@
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 8 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+
+#define CONSTANT 101;
+#define FOO(VTYPE, OP_NAME, OP1, OP2)		\
+VTYPE						\
+ __attribute__ ((noipa))			\
+ foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return (OP1 a) OP2 CONSTANT;			\
+}						\
+
+FOO (v4si, andnot, ~, &);
+FOO (v8si, andnot, ~, &);
+FOO (v2di, andnot, ~, &);
+FOO (v4di, andnot, ~, &);
+FOO (v4si, and,, &);
+FOO (v8si, and,, &);
+FOO (v2di, and,, &);
+FOO (v4di, and,, &);
+FOO (v4si, or,, |);
+FOO (v8si, or,, |);
+FOO (v2di, or,, |);
+FOO (v4di, or,, |);
+FOO (v4si, xor,, ^);
+FOO (v8si, xor,, ^);
+FOO (v2di, xor,, ^);
+FOO (v4di, xor,, ^);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-6.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-6.c
new file mode 100644
index 00000000000..493a76f0917
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-6.c
@@ -0,0 +1,55 @@
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512vl" } */
+/* { dg-require-effective-target avx512vl } */
+
+#define AVX512VL
+#include "avx512f-helper.h"
+
+#include "avx512vl-broadcast-pr87767-5.c"
+
+#define RTEST(VTYPE, TYPE, N, OP_NAME, OP1, OP2)				\
+  do									\
+    {									\
+      TYPE exp[N], src[N];						\
+      VTYPE res;							\
+      for (int i = 0; i < N; i++)					\
+	{								\
+	  src[i] = i * i * 107;						\
+	}								\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);			\
+      for (int i = 0; i < N; i ++)					\
+	exp[i] = (OP1 src[i]) OP2 CONSTANT;				\
+      for (int j = 0; j < N; j++)					\
+	{								\
+	  if (res[j] != exp[j])						\
+	    abort();							\
+	}								\
+    }									\
+  while (0)
+
+void
+test_256 (void)
+{
+  RTEST (v8si, int, 8, andnot, ~, &);
+  RTEST (v4di, long long, 4, andnot, ~, &);
+  RTEST (v8si, int, 8, and,, &);
+  RTEST (v4di, long long, 4, and,, &);
+  RTEST (v8si, int, 8, or,, |);
+  RTEST (v4di, long long, 4, or,, |);
+  RTEST (v8si, int, 8, xor,, ^);
+  RTEST (v4di, long long, 4, xor,, ^);
+}
+
+void
+test_128 (void)
+{
+  RTEST (v4si, int, 4, andnot, ~, &);
+  RTEST (v2di, long long, 2, andnot, ~, &);
+  RTEST (v4si, int, 4, and,, &);
+  RTEST (v2di, long long, 2, and,, &);
+  RTEST (v4si, int, 4, or,, |);
+  RTEST (v2di, long long, 2, or,, |);
+  RTEST (v4si, int, 4, xor,, ^);
+  RTEST (v2di, long long, 2, xor,, ^);
+}
-- 
2.18.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
  2020-09-03  2:11           ` Hongtao Liu
@ 2020-09-03  7:27             ` Jakub Jelinek
  0 siblings, 0 replies; 30+ messages in thread
From: Jakub Jelinek @ 2020-09-03  7:27 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: GCC Patches

On Thu, Sep 03, 2020 at 10:11:14AM +0800, Hongtao Liu wrote:
> Thanks for the review, update patch.

Ok for trunk, thanks.

> From acf3825279190ca0540bb4704f66568fdbe06ce8 Mon Sep 17 00:00:00 2001
> From: liuhongt <hongtao.liu@intel.com>
> Date: Wed, 8 Jul 2020 17:14:36 +0800
> Subject: [PATCH] Optimize memory broadcast for constant vector under AVX512.
> 
> For constant vector having one duplicated value, there's no need to put
> whole vector in the constant pool, using embedded broadcast instead.
> 
> 2020-07-09  Hongtao Liu  <hongtao.liu@intel.com>
> 
> gcc/ChangeLog:
> 
> 	PR target/87767
> 	* config/i386/i386-features.c
> 	(replace_constant_pool_with_broadcast): New function.
> 	(constant_pool_broadcast): Ditto.
> 	(class pass_constant_pool_broadcast): New pass.
> 	(make_pass_constant_pool_broadcast): Ditto.
> 	(remove_partial_avx_dependency): Call
> 	replace_constant_pool_with_broadcast under TARGET_AVX512F, it
> 	would save compile time when both pass rpad and cpb are
> 	available.
> 	(remove_partial_avx_dependency_gate): New function.
> 	(class pass_remove_partial_avx_dependency::gate): Call
> 	remove_partial_avx_dependency_gate.
> 	* config/i386/i386-passes.def: Insert new pass after combine.
> 	* config/i386/i386-protos.h
> 	(make_pass_constant_pool_broadcast): Declare.
> 	* config/i386/sse.md (*avx512dq_mul<mode>3<mask_name>_bcst):
> 	New define_insn.
> 	(*avx512f_mul<mode>3<mask_name>_bcst): Ditto.
> 	* config/i386/avx512fintrin.h (_mm512_set1_ps,
> 	_mm512_set1_pd,_mm512_set1_epi32, _mm512_set1_epi64): Adjusted.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR target/87767
> 	* gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
> 	* gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
> 	* gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
> 	* gcc.target/i386/avx512f-broadcast-pr87767-3.c: New test.
> 	* gcc.target/i386/avx512f-broadcast-pr87767-4.c: New test.
> 	* gcc.target/i386/avx512f-broadcast-pr87767-5.c: New test.
> 	* gcc.target/i386/avx512f-broadcast-pr87767-6.c: New test.
> 	* gcc.target/i386/avx512f-broadcast-pr87767-7.c: New test.
> 	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
> 	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
> 	* gcc.target/i386/avx512vl-broadcast-pr87767-2.c: New test.
> 	* gcc.target/i386/avx512vl-broadcast-pr87767-3.c: New test.
> 	* gcc.target/i386/avx512vl-broadcast-pr87767-4.c: New test.
> 	* gcc.target/i386/avx512vl-broadcast-pr87767-5.c: New test.
> 	* gcc.target/i386/avx512vl-broadcast-pr87767-6.c: New test.

	Jakub


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2020-09-03  7:27 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-09  8:33 [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512 Hongtao Liu
2020-07-10  9:24 ` Hongtao Liu
2020-07-17  7:24   ` Hongtao Liu
2020-07-23  8:39 ` Jan Hubicka
2020-07-23 13:53   ` Hongtao Liu
2020-07-24  2:37     ` Hongtao Liu
2020-08-04  6:05       ` Hongtao Liu
2020-08-26 21:23         ` Jeff Law
2020-08-27 11:09           ` Jan Hubicka
2020-08-27 12:24 ` Jakub Jelinek
2020-08-27 13:07   ` Richard Biener
2020-08-27 13:20     ` Jakub Jelinek
2020-08-28  6:47       ` Richard Biener
2020-08-28  8:52         ` Jakub Jelinek
2020-08-28 10:36           ` Richard Biener
2020-08-28 10:47             ` Jakub Jelinek
2020-08-28 11:06               ` Richard Biener
2020-08-28 11:26                 ` Jakub Jelinek
2020-08-28 14:53                 ` Jakub Jelinek
2020-08-28 16:07                   ` Richard Sandiford
2020-08-28 16:25                     ` Jakub Jelinek
2020-08-30  9:24                       ` Jakub Jelinek
2020-08-31  8:18                         ` Richard Biener
2020-08-28 17:18   ` Hongtao Liu
2020-09-01  9:55   ` Hongtao Liu
2020-09-01 10:11     ` Jakub Jelinek
2020-09-02  1:57       ` Hongtao Liu
2020-09-02  9:58         ` Jakub Jelinek
2020-09-03  2:11           ` Hongtao Liu
2020-09-03  7:27             ` Jakub Jelinek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).