public inbox for gcc-patches@gcc.gnu.org
* [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
@ 2021-05-13  9:23 Hongtao Liu
  2021-05-13  9:40 ` Uros Bizjak
  0 siblings, 1 reply; 45+ messages in thread
From: Hongtao Liu @ 2021-05-13  9:23 UTC (permalink / raw)
  To: GCC Patches; +Cc: Uros Bizjak, Jakub Jelinek, H. J. Lu

[-- Attachment #1: Type: text/plain, Size: 2685 bytes --]

Hi:
  When __builtin_ia32_vzeroupper is called explicitly, the corresponding
vzeroupper pattern does not carry any CLOBBERs or SETs before LRA,
which leads to incorrect optimization in pass_reload.
To solve this problem, this patch introduces a pre_reload splitter
which adds CLOBBERs to vzeroupper's pattern; that is enough to fix
the wrong code reported in the PR.

At the same time, so that post-reload CSE can still optimize the low
128 bits (which vzeroupper preserves), this patch also transforms
those CLOBBERs into self-SETs in pass_vzeroupper.

This works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
are callee-saved. There, even if the function has no other use of
xmm6-xmm15, pro_epilog will save and restore those registers purely
because of vzeroupper's pattern, which is obviously redundant. To
eliminate this redundancy, a post_reload splitter is introduced which
drops those SETs, and an epilogue_completed splitter adds them back.
This looks safe because there is no CSE between the post-reload split2
pass and the epilogue_completed split3 pass (is that assumption
correct?). Frame info also needs updating in pro_epilog, which now
saves and restores xmm6-xmm15 only if they have a use other than the
explicit vzeroupper pattern.
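
To make the flow concrete, the explicit vzeroupper insn goes through
roughly these shapes (a hand-written sketch for the 64-bit case, with
most registers elided; actual RTL dumps differ in detail):

  ;; As expanded: no register information at all.
  (unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)

  ;; After the pre-reload splitter: CLOBBERs make the kills visible
  ;; to LRA.
  (parallel [(unspec_volatile [(const_int 1)] UNSPECV_VZEROUPPER)
             (clobber (reg:V2DI 20 xmm0))
             ...
             (clobber (reg:V2DI 51 xmm15))])

  ;; After pass_vzeroupper: self-SETs let post-reload CSE see that
  ;; the low 128 bits survive.
  (parallel [(unspec_volatile [(const_int 1)] UNSPECV_VZEROUPPER)
             (set (reg:V2DI 20 xmm0) (reg:V2DI 20 xmm0))
             ...
             (set (reg:V2DI 51 xmm15) (reg:V2DI 51 xmm15))])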

  Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
  Ok for trunk?

gcc/ChangeLog:

        PR target/82735
        * config/i386/i386-expand.c (ix86_expand_builtin): Count
        number of __builtin_ia32_vzeroupper.
        * config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
        Transform CLOBBERs to SETs for explicit vzeroupper pattern so
        that CSE can optimize lower 128 bits.
        * config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog):
        New.
        (ix86_save_reg): If there's no use of xmm6~xmm15 other than
        explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
        REGNO.
        (ix86_finalize_stack_frame_flags): Recompute frame layout if
        there's explicit vzeroupper under TARGET_64BIT_MS_ABI.
        * config/i386/i386.h (struct machine_function): Change type of
        has_explicit_vzeroupper from BOOL_BITFIELD to unsigned int.
        * config/i386/sse.md (*avx_vzeroupper_2): New post-reload
        splitter which will drop all SETs for explicit vzeroupper
        patterns.
        (*avx_vzeroupper_1): Generate SET reg to reg instead of
        CLOBBER, and add pre-reload splitter after it.

gcc/testsuite/ChangeLog:

        PR target/82735
        * gcc.target/i386/pr82735-1.c: New test.
        * gcc.target/i386/pr82735-2.c: New test.
        * gcc.target/i386/pr82735-3.c: New test.
        * gcc.target/i386/pr82735-4.c: New test.
        * gcc.target/i386/pr82735-5.c: New test.


-- 
BR,
Hongtao

[-- Attachment #2: 0001-i386-Fix-_mm256_zeroupper-to-notify-LRA-that-vzeroup.patch --]
[-- Type: text/x-patch, Size: 17592 bytes --]

From d53b0c6934ea499c9f87df963661b627e7e977bf Mon Sep 17 00:00:00 2001
From: liuhongt <hongtao.liu@intel.com>
Date: Wed, 12 May 2021 14:20:54 +0800
Subject: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper
 will kill sse registers.

When __builtin_ia32_vzeroupper is called explicitly, the corresponding
vzeroupper pattern does not carry any CLOBBERs or SETs before LRA,
which leads to incorrect optimization in pass_reload.
To solve this problem, this patch introduces a pre_reload splitter
which adds CLOBBERs to vzeroupper's pattern; that is enough to fix
the wrong code reported in the PR.

At the same time, so that post-reload CSE can still optimize the low
128 bits (which vzeroupper preserves), this patch also transforms
those CLOBBERs into self-SETs in pass_vzeroupper.

This works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
are callee-saved. There, even if the function has no other use of
xmm6-xmm15, pro_epilog will save and restore those registers purely
because of vzeroupper's pattern, which is obviously redundant. To
eliminate this redundancy, a post_reload splitter is introduced which
drops those SETs, and an epilogue_completed splitter adds them back.
This looks safe because there is no CSE between the post-reload split2
pass and the epilogue_completed split3 pass (is that assumption
correct?). Frame info also needs updating in pro_epilog, which now
saves and restores xmm6-xmm15 only if they have a use other than the
explicit vzeroupper pattern.

gcc/ChangeLog:

	PR target/82735
	* config/i386/i386-expand.c (ix86_expand_builtin): Count
	number of __builtin_ia32_vzeroupper.
	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
	Transform CLOBBERs to SETs for explicit vzeroupper pattern so
	that CSE can optimize lower 128 bits.
	* config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog):
	New.
	(ix86_save_reg): If there's no use of xmm6~xmm15 other than
	explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
	REGNO.
	(ix86_finalize_stack_frame_flags): Recompute frame layout if
	there's explicit vzeroupper under TARGET_64BIT_MS_ABI.
	* config/i386/i386.h (struct machine_function): Change type of
	has_explicit_vzeroupper from BOOL_BITFIELD to unsigned int.
	* config/i386/sse.md (*avx_vzeroupper_2): New post-reload
	splitter which will drop all SETs for explicit vzeroupper
	patterns.
	(*avx_vzeroupper_1): Generate SET reg to reg instead of
	CLOBBER, and add pre-reload splitter after it.

gcc/testsuite/ChangeLog:

	PR target/82735
	* gcc.target/i386/pr82735-1.c: New test.
	* gcc.target/i386/pr82735-2.c: New test.
	* gcc.target/i386/pr82735-3.c: New test.
	* gcc.target/i386/pr82735-4.c: New test.
	* gcc.target/i386/pr82735-5.c: New test.
---
 gcc/config/i386/i386-expand.c             |  2 +-
 gcc/config/i386/i386-features.c           | 25 ++++++++++-
 gcc/config/i386/i386.c                    | 23 ++++++++++
 gcc/config/i386/i386.h                    |  8 ++--
 gcc/config/i386/sse.md                    | 48 +++++++++++++++++++-
 gcc/testsuite/gcc.target/i386/pr82735-1.c | 29 ++++++++++++
 gcc/testsuite/gcc.target/i386/pr82735-2.c | 21 +++++++++
 gcc/testsuite/gcc.target/i386/pr82735-3.c |  5 +++
 gcc/testsuite/gcc.target/i386/pr82735-4.c | 48 ++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr82735-5.c | 54 +++++++++++++++++++++++
 10 files changed, 256 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-5.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index fee4d07b7fd..7f3326a12b2 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -13233,7 +13233,7 @@ rdseed_step:
       return 0;
 
     case IX86_BUILTIN_VZEROUPPER:
-      cfun->machine->has_explicit_vzeroupper = true;
+      cfun->machine->has_explicit_vzeroupper++;
       break;
 
     default:
diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 77783a154b6..6b2179f16cb 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -1827,8 +1827,31 @@ ix86_add_reg_usage_to_vzerouppers (void)
 	{
 	  if (!NONDEBUG_INSN_P (insn))
 	    continue;
+	  /* Transform CLOBBERs to SETs so that the lower 128 bits of SSE
+	     registers will be able to cross vzeroupper in post-reload CSE.  */
 	  if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
-	    ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
+	    {
+	      if (XVECEXP (XVECEXP (PATTERN (insn), 0, 0), 0, 0) == const1_rtx)
+		{
+		  unsigned int nregs = TARGET_64BIT ? 16 : 8;
+		  rtvec vec = rtvec_alloc (nregs + 1);
+		  RTVEC_ELT (vec, 0) = XVECEXP (PATTERN (insn), 0, 0);
+		  for (unsigned int i = 0; i < nregs; ++i)
+		    {
+		      unsigned int regno = GET_SSE_REGNO (i);
+		      rtx reg = gen_rtx_REG (V2DImode, regno);
+		      RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg);
+		    }
+		  XVEC (PATTERN (insn), 0) = vec;
+		  INSN_CODE (insn) = -1;
+		  df_insn_rescan (insn);
+		}
+	      else
+		{
+		  gcc_assert (XVECLEN (PATTERN (insn), 0) == 1);
+		  ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
+		}
+	    }
 	  df_simulate_one_insn_backwards (bb, insn, live_regs);
 	}
     }
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 780da108a7c..4d4d7dbbc82 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -6170,6 +6170,17 @@ ix86_hard_regno_scratch_ok (unsigned int regno)
 	      && df_regs_ever_live_p (regno)));
 }
 
+/* Return true if explicit usage of __builtin_ia32_vzeroupper
+   should be specially handled in pro_epilog.  */
+static bool
+ix86_handle_explicit_vzeroupper_in_pro_epilog ()
+{
+  return (cfun->machine->has_explicit_vzeroupper
+	  && TARGET_64BIT_MS_ABI
+	  && !epilogue_completed
+	  && reload_completed);
+}
+
 /* Return TRUE if we need to save REGNO.  */
 
 bool
@@ -6244,6 +6255,16 @@ ix86_save_reg (unsigned int regno, bool maybe_eh_return, bool ignore_outlined)
       && !cfun->machine->no_drap_save_restore)
     return true;
 
+  /* Under TARGET_64BIT_MS_ABI, if xmm6~xmm15 have no use other
+     than the explicit vzeroupper, there is no need to save
+     REGNO.  */
+  if (ix86_handle_explicit_vzeroupper_in_pro_epilog ()
+      && (IN_RANGE (regno, FIRST_SSE_REG + 6, LAST_SSE_REG)
+	  || IN_RANGE (regno, FIRST_REX_SSE_REG, LAST_REX_SSE_REG)))
+    return df_regs_ever_live_p (regno)
+      ? df_hard_reg_used_count (regno) > cfun->machine->has_explicit_vzeroupper
+      : false;
+
   return (df_regs_ever_live_p (regno)
 	  && !call_used_or_fixed_reg_p (regno)
 	  && (regno != HARD_FRAME_POINTER_REGNUM || !frame_pointer_needed));
@@ -8046,6 +8067,8 @@ ix86_finalize_stack_frame_flags (void)
     recompute_frame_layout_p = true;
   crtl->stack_realign_needed = stack_realign;
   crtl->stack_realign_finalized = true;
+  if (ix86_handle_explicit_vzeroupper_in_pro_epilog ())
+    recompute_frame_layout_p = true;
   if (recompute_frame_layout_p)
     ix86_compute_frame_layout ();
 }
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 97d6f3863cb..c0855a936ac 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2654,10 +2654,6 @@ struct GTY(()) machine_function {
   /* True if the function needs a stack frame.  */
   BOOL_BITFIELD stack_frame_required : 1;
 
-  /* True if __builtin_ia32_vzeroupper () has been expanded in current
-     function.  */
-  BOOL_BITFIELD has_explicit_vzeroupper : 1;
-
   /* True if we should act silently, rather than raise an error for
      invalid calls.  */
   BOOL_BITFIELD silent_p : 1;
@@ -2665,6 +2661,10 @@ struct GTY(()) machine_function {
   /* The largest alignment, in bytes, of stack slot actually used.  */
   unsigned int max_used_stack_alignment;
 
+  /* Number of times __builtin_ia32_vzeroupper () has been expanded
+     in the current function.  */
+  unsigned int has_explicit_vzeroupper;
+
   /* During prologue/epilogue generation, the current frame state.
      Otherwise, the frame state at the end of the prologue.  */
   struct machine_frame_state fs;
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 897cf3eaea9..489fa02fa20 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -20626,7 +20626,7 @@ (define_insn_and_split "*avx_vzeroupper_1"
       else
 	{
 	  rtx reg = gen_rtx_REG (V2DImode, regno);
-	  RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+	  RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg);
 	}
     }
   operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
@@ -20638,6 +20638,52 @@ (define_insn_and_split "*avx_vzeroupper_1"
    (set_attr "btver2_decode" "vector")
    (set_attr "mode" "OI")])
 
+(define_split
+  [(match_parallel 0 "vzeroupper_pattern"
+     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
+  "TARGET_AVX && ix86_pre_reload_split ()"
+  [(match_dup 0)]
+{
+  /* When vzeroupper is explicitly used, make it clear to LRA that
+     the instruction kills the SSE registers.  */
+  gcc_assert (cfun->machine->has_explicit_vzeroupper);
+  unsigned int nregs = TARGET_64BIT ? 16 : 8;
+  rtvec vec = rtvec_alloc (nregs + 1);
+  RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode,
+						gen_rtvec (1, const1_rtx),
+						UNSPECV_VZEROUPPER);
+  for (unsigned int i = 0; i < nregs; ++i)
+    {
+      unsigned int regno = GET_SSE_REGNO (i);
+      rtx reg = gen_rtx_REG (V2DImode, regno);
+      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+    }
+  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
+})
+
+(define_insn_and_split "*avx_vzeroupper_2"
+  [(match_parallel 0 "vzeroupper_pattern"
+     [(unspec_volatile [(const_int 1)] UNSPECV_VZEROUPPER)])]
+  "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 16 : 8) + 1"
+  "vzeroupper"
+  "&& reload_completed && TARGET_64BIT_MS_ABI"
+  [(const_int 0)]
+{
+  /* To avoid redundant save and restore in pro_and_epilog, drop the
+     SETs/CLOBBERs added by the pre-reload splitter or by pass_vzeroupper.
+     This looks safe since there is no CSE optimization between
+     post-reload split2 and epilogue_completed split3.  */
+  gcc_assert (cfun->machine->has_explicit_vzeroupper);
+  emit_insn (gen_avx_vzeroupper ());
+  DONE;
+}
+  [(set_attr "type" "sse")
+   (set_attr "modrm" "0")
+   (set_attr "memory" "none")
+   (set_attr "prefix" "vex")
+   (set_attr "btver2_decode" "vector")
+   (set_attr "mode" "OI")])
+
 (define_mode_attr pbroadcast_evex_isa
   [(V64QI "avx512bw") (V32QI "avx512bw") (V16QI "avx512bw")
    (V32HI "avx512bw") (V16HI "avx512bw") (V8HI "avx512bw")
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-1.c b/gcc/testsuite/gcc.target/i386/pr82735-1.c
new file mode 100644
index 00000000000..1a63b9ae9c9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-1.c
@@ -0,0 +1,29 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx" } */
+/* { dg-require-effective-target avx } */
+
+#include "avx-check.h"
+
+void
+__attribute__ ((noipa))
+mtest(char *dest)
+{
+  __m256i ymm1 = _mm256_set1_epi8((char)0x1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm1);
+  _mm256_zeroupper();
+  __m256i ymm2 = _mm256_set1_epi8((char)0x1);
+  _mm256_storeu_si256((__m256i *)dest, ymm2);
+}
+
+void
+avx_test ()
+{
+  char buf[64];
+  for (int i = 0; i != 64; i++)
+    buf[i] = 2;
+  mtest (buf);
+
+  for (int i = 0; i < 32; ++i)
+    if (buf[i] != 1)
+      __builtin_abort ();
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-2.c b/gcc/testsuite/gcc.target/i386/pr82735-2.c
new file mode 100644
index 00000000000..48d0d6e983d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-2.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx -O2" } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  /* xmm1 can be propagated to xmm2 by CSE.  */
+  __m128i xmm1 = _mm_set1_epi8((char)0x1);
+  _mm_storeu_si128((__m128i *)(dest + 32), xmm1);
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  __m128i xmm2 = _mm_set1_epi8((char)0x1);
+  _mm_storeu_si128((__m128i *)dest, xmm2);
+}
+
+/* Darwin local constant symbol is "lC0", ELF targets ".LC0" */
+/* { dg-final { scan-assembler-times {(?n)vmovdqa\t\.?[Ll]C0[^,]*, %xmm[0-9]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-3.c b/gcc/testsuite/gcc.target/i386/pr82735-3.c
new file mode 100644
index 00000000000..e3f801e6924
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-3.c
@@ -0,0 +1,5 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx -O2 -mabi=ms" } */
+/* { dg-final { scan-assembler-not {(?n)xmm([6-9]|1[0-5])} } } */
+
+#include "pr82735-2.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-4.c b/gcc/testsuite/gcc.target/i386/pr82735-4.c
new file mode 100644
index 00000000000..78c0a6cb2c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-4.c
@@ -0,0 +1,48 @@
+/* { dg-do compile { target { ! ia32 } } }  */
+/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
+  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
+		"vmovdqa\t%%ymm0, %1\n\t"
+		"vmovdqa\t%%ymm0, %2\n\t"
+		"vmovdqa\t%%ymm0, %3\n\t"
+		"vmovdqa\t%%ymm0, %4\n\t"
+		"vmovdqa\t%%ymm0, %5\n\t"
+		"vmovdqa\t%%ymm0, %6\n\t"
+		"vmovdqa\t%%ymm0, %7\n\t"
+		"vmovdqa\t%%ymm0, %8\n\t"
+		"vmovdqa\t%%ymm0, %9\n\t"
+		"vmovdqa\t%%ymm0, %10\n\t"
+		"vmovdqa\t%%ymm0, %11\n\t"
+		"vmovdqa\t%%ymm0, %12\n\t"
+		"vmovdqa\t%%ymm0, %13\n\t"
+		"vmovdqa\t%%ymm0, %14\n\t"
+		"vmovdqa\t%%ymm0, %15\n\t"
+		: "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
+		  "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
+		  "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
+		  "=v"(ymm0)
+		::);
+  _mm256_zeroupper();
+  _mm256_storeu_si256((__m256i *)dest, ymm1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-5.c b/gcc/testsuite/gcc.target/i386/pr82735-5.c
new file mode 100644
index 00000000000..2a58cbe52d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-5.c
@@ -0,0 +1,54 @@
+/* { dg-do compile { target { ! ia32 } } }  */
+/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
+  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
+		"vmovdqa\t%%ymm0, %1\n\t"
+		"vmovdqa\t%%ymm0, %2\n\t"
+		"vmovdqa\t%%ymm0, %3\n\t"
+		"vmovdqa\t%%ymm0, %4\n\t"
+		"vmovdqa\t%%ymm0, %5\n\t"
+		"vmovdqa\t%%ymm0, %6\n\t"
+		"vmovdqa\t%%ymm0, %7\n\t"
+		"vmovdqa\t%%ymm0, %8\n\t"
+		"vmovdqa\t%%ymm0, %9\n\t"
+		"vmovdqa\t%%ymm0, %10\n\t"
+		"vmovdqa\t%%ymm0, %11\n\t"
+		"vmovdqa\t%%ymm0, %12\n\t"
+		"vmovdqa\t%%ymm0, %13\n\t"
+		"vmovdqa\t%%ymm0, %14\n\t"
+		"vmovdqa\t%%ymm0, %15\n\t"
+		: "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
+		  "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
+		  "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
+		  "=v"(ymm0)
+		::);
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_storeu_si256((__m256i *)dest, ymm1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
+}
-- 
2.18.1



* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-13  9:23 [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735] Hongtao Liu
@ 2021-05-13  9:40 ` Uros Bizjak
  2021-05-13  9:43   ` Uros Bizjak
  0 siblings, 1 reply; 45+ messages in thread
From: Uros Bizjak @ 2021-05-13  9:40 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: GCC Patches, Jakub Jelinek, H. J. Lu

On Thu, May 13, 2021 at 11:18 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> Hi:
>   When __builtin_ia32_vzeroupper is called explicitly, the corresponding
> vzeroupper pattern does not carry any CLOBBERS or SETs before LRA,
> which leads to incorrect optimization in pass_reload.
> In order to solve this problem, this patch introduces a pre_reload
> splitter which adds CLOBBERS to vzeroupper's pattern, it can solve the
> problem in pr.
>
> At the same time, in order to optimize the low 128 bits in
> post_reload CSE, this patch also transforms those CLOBBERS to SETs in
> pass_vzeroupper.
>
> It works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
> are callee-saved, so even if there're no other uses of xmm6-xmm15 in the
> function, because of vzeroupper's pattern, pro_epilog will save and
> restore those registers, which is obviously redundant. In order to
> eliminate this redundancy, a post_reload splitter is introduced, which
> drops those SETs, until epilogue_completed splitter adds those SETs
> back, it looks to be safe since there's no CSE between post_reload
> split2 and epilogue_completed split3??? Also frame info needs to be
> updated in pro_epilog, which saves and restores xmm6-xmm15 only if
> there's usage other than explicit vzeroupper pattern.
>
>   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
>   Ok for trunk?

Some time ago support for CLOBBER_HIGH RTX was added (and later
removed for some reason). Perhaps we could resurrect the patch for the
purpose of ferrying 128bit modes via vzeroupper RTX?

+(define_split
+  [(match_parallel 0 "vzeroupper_pattern"
+     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
+  "TARGET_AVX && ix86_pre_reload_split ()"
+  [(match_dup 0)]
+{
+  /* When vzeroupper is explictly used, for LRA purpose, make it clear
+     the instruction kills sse registers.  */
+  gcc_assert (cfun->machine->has_explicit_vzeroupper);
+  unsigned int nregs = TARGET_64BIT ? 16 : 8;
+  rtvec vec = rtvec_alloc (nregs + 1);
+  RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode,
+                        gen_rtvec (1, const1_rtx),
+                        UNSPECV_VZEROUPPER);
+  for (unsigned int i = 0; i < nregs; ++i)
+    {
+      unsigned int regno = GET_SSE_REGNO (i);
+      rtx reg = gen_rtx_REG (V2DImode, regno);
+      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+    }
+  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
+})

Wouldn't this also kill lower 128bit values that are not touched by
vzeroupper? A CLOBBER_HIGH would be more appropriate here.
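
For comparison, a clobber_high-based pattern would look something like
this (just a sketch, using the semantics of the removed CLOBBER_HIGH
support, where only the bits above the named mode are clobbered):

  (parallel [(unspec_volatile [(const_int 1)] UNSPECV_VZEROUPPER)
             (clobber_high (reg:V2DI 20 xmm0))
             ...
             (clobber_high (reg:V2DI 51 xmm15))])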

Uros.


> gcc/ChangeLog:
>
>         PR target/82735
>         * config/i386/i386-expand.c (ix86_expand_builtin): Count
>         number of __builtin_ia32_vzeroupper.
>         * config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
>         Transform CLOBBERs to SETs for explicit vzeroupper pattern so
>         that CSE can optimize lower 128 bits.
>         * config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog):
>         New.
>         (ix86_save_reg): If there's no use of xmm6~xmm15 other than
>         explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
>         REGNO.
>         (ix86_finalize_stack_frame_flags): Recompute frame layout if
>         there's explicit vzeroupper under TARGET_64BIT_MS_ABI.
>         * config/i386/i386.h (struct machine_function): Change type of
>         has_explicit_vzeroupper from BOOL_BITFILED to unsigned int.
>         * config/i386/sse.md (*avx_vzeroupper_2): New post-reload
>         splitter which will drop all SETs for explicit vzeroupper
>         patterns.
>         (*avx_vzeroupper_1): Generate SET reg to reg instead of
>         CLOBBER, and add pre-reload splitter after it.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/82735
>         * gcc.target/i386/pr82735-1.c: New test.
>         * gcc.target/i386/pr82735-2.c: New test.
>         * gcc.target/i386/pr82735-3.c: New test.
>         * gcc.target/i386/pr82735-4.c: New test.
>         * gcc.target/i386/pr82735-5.c: New test.
>
>
> --
> BR,
> Hongtao


* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-13  9:40 ` Uros Bizjak
@ 2021-05-13  9:43   ` Uros Bizjak
  2021-05-13  9:54     ` Jakub Jelinek
  0 siblings, 1 reply; 45+ messages in thread
From: Uros Bizjak @ 2021-05-13  9:43 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: GCC Patches, Jakub Jelinek, H. J. Lu

On Thu, May 13, 2021 at 11:40 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Thu, May 13, 2021 at 11:18 AM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > Hi:
> >   When __builtin_ia32_vzeroupper is called explicitly, the corresponding
> > vzeroupper pattern does not carry any CLOBBERS or SETs before LRA,
> > which leads to incorrect optimization in pass_reload.
> > In order to solve this problem, this patch introduces a pre_reload
> > splitter which adds CLOBBERS to vzeroupper's pattern, it can solve the
> > problem in pr.
> >
> > At the same time, in order to optimize the low 128 bits in
> > post_reload CSE, this patch also transforms those CLOBBERS to SETs in
> > pass_vzeroupper.
> >
> > It works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
> > are callee-saved, so even if there're no other uses of xmm6-xmm15 in the
> > function, because of vzeroupper's pattern, pro_epilog will save and
> > restore those registers, which is obviously redundant. In order to
> > eliminate this redundancy, a post_reload splitter is introduced, which
> > drops those SETs, until epilogue_completed splitter adds those SETs
> > back, it looks to be safe since there's no CSE between post_reload
> > split2 and epilogue_completed split3??? Also frame info needs to be
> > updated in pro_epilog, which saves and restores xmm6-xmm15 only if
> > there's usage other than explicit vzeroupper pattern.
> >
> >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
> >   Ok for trunk?
>
> Some time ago a support for CLOBBER_HIGH RTX was added (and later
> removed for some reason). Perhaps we could resurrect the patch for the
> purpose of ferrying 128bit modes via vzeroupper RTX?

https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html

Uros.

>
> +(define_split
> +  [(match_parallel 0 "vzeroupper_pattern"
> +     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> +  "TARGET_AVX && ix86_pre_reload_split ()"
> +  [(match_dup 0)]
> +{
> +  /* When vzeroupper is explictly used, for LRA purpose, make it clear
> +     the instruction kills sse registers.  */
> +  gcc_assert (cfun->machine->has_explicit_vzeroupper);
> +  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> +  rtvec vec = rtvec_alloc (nregs + 1);
> +  RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode,
> +                        gen_rtvec (1, const1_rtx),
> +                        UNSPECV_VZEROUPPER);
> +  for (unsigned int i = 0; i < nregs; ++i)
> +    {
> +      unsigned int regno = GET_SSE_REGNO (i);
> +      rtx reg = gen_rtx_REG (V2DImode, regno);
> +      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
> +    }
> +  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
> +})
>
> Wouldn't this also kill lower 128bit values that are not touched by
> vzeroupper? A CLOBBER_HIGH would be more appropriate here.
>
> Uros.
>
>
> > gcc/ChangeLog:
> >
> >         PR target/82735
> >         * config/i386/i386-expand.c (ix86_expand_builtin): Count
> >         number of __builtin_ia32_vzeroupper.
> >         * config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
> >         Transform CLOBBERs to SETs for explicit vzeroupper pattern so
> >         that CSE can optimize lower 128 bits.
> >         * config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog):
> >         New.
> >         (ix86_save_reg): If there's no use of xmm6~xmm15 other than
> >         explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
> >         REGNO.
> >         (ix86_finalize_stack_frame_flags): Recompute frame layout if
> >         there's explicit vzeroupper under TARGET_64BIT_MS_ABI.
> >         * config/i386/i386.h (struct machine_function): Change type of
> >         has_explicit_vzeroupper from BOOL_BITFILED to unsigned int.
> >         * config/i386/sse.md (*avx_vzeroupper_2): New post-reload
> >         splitter which will drop all SETs for explicit vzeroupper
> >         patterns.
> >         (*avx_vzeroupper_1): Generate SET reg to reg instead of
> >         CLOBBER, and add pre-reload splitter after it.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         PR target/82735
> >         * gcc.target/i386/pr82735-1.c: New test.
> >         * gcc.target/i386/pr82735-2.c: New test.
> >         * gcc.target/i386/pr82735-3.c: New test.
> >         * gcc.target/i386/pr82735-4.c: New test.
> >         * gcc.target/i386/pr82735-5.c: New test.
> >
> >
> > --
> > BR,
> > Hongtao


* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-13  9:43   ` Uros Bizjak
@ 2021-05-13  9:54     ` Jakub Jelinek
  2021-05-13 11:32       ` Richard Sandiford
  0 siblings, 1 reply; 45+ messages in thread
From: Jakub Jelinek @ 2021-05-13  9:54 UTC (permalink / raw)
  To: Uros Bizjak, Richard Sandiford; +Cc: Hongtao Liu, GCC Patches, H. J. Lu

On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
> > >   Ok for trunk?
> >
> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
> > removed for some reason). Perhaps we could resurrect the patch for the
> > purpose of ferrying 128bit modes via vzeroupper RTX?
> 
> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html

https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
is where it got removed, CCing Richard.

> > +(define_split
> > +  [(match_parallel 0 "vzeroupper_pattern"
> > +     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> > +  "TARGET_AVX && ix86_pre_reload_split ()"
> > +  [(match_dup 0)]
> > +{
> > +  /* When vzeroupper is explictly used, for LRA purpose, make it clear
> > +     the instruction kills sse registers.  */
> > +  gcc_assert (cfun->machine->has_explicit_vzeroupper);
> > +  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> > +  rtvec vec = rtvec_alloc (nregs + 1);
> > +  RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode,
> > +                        gen_rtvec (1, const1_rtx),
> > +                        UNSPECV_VZEROUPPER);
> > +  for (unsigned int i = 0; i < nregs; ++i)
> > +    {
> > +      unsigned int regno = GET_SSE_REGNO (i);
> > +      rtx reg = gen_rtx_REG (V2DImode, regno);
> > +      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
> > +    }
> > +  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
> > +})
> >
> > Wouldn't this also kill lower 128bit values that are not touched by
> > vzeroupper? A CLOBBER_HIGH would be more appropriate here.

Yes, it would.  But normally the only xmm* hard regs live across an
explicit user vzeroupper would be local and global register variables,
and I think the 1st scheduler etc. shouldn't extend the lifetime of
xmm hard regs across an UNSPEC_VOLATILE.

	Jakub



* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-13  9:54     ` Jakub Jelinek
@ 2021-05-13 11:32       ` Richard Sandiford
  2021-05-13 11:37         ` Jakub Jelinek
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Sandiford @ 2021-05-13 11:32 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Uros Bizjak, Hongtao Liu, GCC Patches, H. J. Lu

Jakub Jelinek <jakub@redhat.com> writes:
> On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
>> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
>> > >   Ok for trunk?
>> >
>> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
>> > removed for some reason). Perhaps we could resurrect the patch for the
>> > purpose of ferrying 128bit modes via vzeroupper RTX?
>> 
>> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html
>
> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
> is where it got removed, CCing Richard.

Yeah.  Initially clobber_high seemed like the best approach for
handling the tlsdesc thing, but in practice it was too difficult
to shoe-horn the concept in after the fact, when so much rtl
infrastructure wasn't prepared to deal with it.  The old support
didn't handle all cases and passes correctly, and handled others
suboptimally.

I think it would be worth using the same approach as
https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
vzeroupper: represent the instructions as call_insns in which the
call has a special vzeroupper ABI.  I think that's likely to lead
to better code than clobber_high would (or at least, it did for tlsdesc).
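
Concretely, the insn would become a call_insn whose PATTERN pairs a
dummy call with the existing unspec, something like (purely a sketch;
the exact shape is up for discussion):

  (parallel [(call (mem:QI (const_int 0)) (const_int 0))
             (unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])

with TARGET_INSN_CALLEE_ABI mapping such insns to an ABI whose only
clobbers are the upper halves of the SSE registers.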

Thanks,
Richard


* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-13 11:32       ` Richard Sandiford
@ 2021-05-13 11:37         ` Jakub Jelinek
  2021-05-13 11:52           ` Richard Sandiford
  0 siblings, 1 reply; 45+ messages in thread
From: Jakub Jelinek @ 2021-05-13 11:37 UTC (permalink / raw)
  To: Uros Bizjak, Hongtao Liu, GCC Patches, H. J. Lu, richard.sandiford

On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote:
> Jakub Jelinek <jakub@redhat.com> writes:
> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
> >> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
> >> > >   Ok for trunk?
> >> >
> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
> >> > removed for some reason). Perhaps we could resurrect the patch for the
> >> > purpose of ferrying 128bit modes via vzeroupper RTX?
> >> 
> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html
> >
> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
> > is where it got removed, CCing Richard.
> 
> Yeah.  Initially clobber_high seemed like the best appraoch for
> handling the tlsdesc thing, but in practice it was too difficult
> to shoe-horn the concept in after the fact, when so much rtl
> infrastructure wasn't prepared to deal with it.  The old support
> didn't handle all cases and passes correctly, and handled others
> suboptimally.
> 
> I think it would be worth using the same approach as
> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
> vzeroupper: represent the instructions as call_insns in which the
> call has a special vzeroupper ABI.  I think that's likely to lead
> to better code than clobber_high would (or at least, it did for tlsdesc).

Perhaps a magic call_insn that is split post-reload into a normal insn
with the sets then?

	Jakub



* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-13 11:37         ` Jakub Jelinek
@ 2021-05-13 11:52           ` Richard Sandiford
  2021-05-14  2:27             ` Hongtao Liu
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Sandiford @ 2021-05-13 11:52 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Uros Bizjak, Hongtao Liu, GCC Patches, H. J. Lu

Jakub Jelinek <jakub@redhat.com> writes:
> On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote:
>> Jakub Jelinek <jakub@redhat.com> writes:
>> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
>> >> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
>> >> > >   Ok for trunk?
>> >> >
>> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
>> >> > removed for some reason). Perhaps we could resurrect the patch for the
>> >> > purpose of ferrying 128bit modes via vzeroupper RTX?
>> >> 
>> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html
>> >
>> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
>> > is where it got removed, CCing Richard.
>> 
>> Yeah.  Initially clobber_high seemed like the best appraoch for
>> handling the tlsdesc thing, but in practice it was too difficult
>> to shoe-horn the concept in after the fact, when so much rtl
>> infrastructure wasn't prepared to deal with it.  The old support
>> didn't handle all cases and passes correctly, and handled others
>> suboptimally.
>> 
>> I think it would be worth using the same approach as
>> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
>> vzeroupper: represent the instructions as call_insns in which the
>> call has a special vzeroupper ABI.  I think that's likely to lead
>> to better code than clobber_high would (or at least, it did for tlsdesc).
>
> Perhaps a magic call_insn that is split post-reload into a normal insn
> with the sets then?

I'd be tempted to treat it as a call_insn throughout.  The unspec_volatile
means that we can't move the instruction, so converting a call_insn to an
insn isn't likely to help from that point of view.  The sets are also
likely to be handled suboptimally compared to the more accurate register
information attached to the call: all code that handles calls has to be
prepared to deal with partial clobbers, whereas most code dealing with
sets will assume that the set does useful work, and that the rhs of the
set is live.

Thanks,
Richard



* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-13 11:52           ` Richard Sandiford
@ 2021-05-14  2:27             ` Hongtao Liu
  2021-05-17  8:44               ` Hongtao Liu
  0 siblings, 1 reply; 45+ messages in thread
From: Hongtao Liu @ 2021-05-14  2:27 UTC (permalink / raw)
  To: Jakub Jelinek, Uros Bizjak, Hongtao Liu, GCC Patches, H. J. Lu,
	Richard Sandiford

On Thu, May 13, 2021 at 7:52 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Jakub Jelinek <jakub@redhat.com> writes:
> > On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote:
> >> Jakub Jelinek <jakub@redhat.com> writes:
> >> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
> >> >> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
> >> >> > >   Ok for trunk?
> >> >> >
> >> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
> >> >> > removed for some reason). Perhaps we could resurrect the patch for the
> >> >> > purpose of ferrying 128bit modes via vzeroupper RTX?
> >> >>
> >> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html
> >> >
> >> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
> >> > is where it got removed, CCing Richard.
> >>
> >> Yeah.  Initially clobber_high seemed like the best appraoch for
> >> handling the tlsdesc thing, but in practice it was too difficult
> >> to shoe-horn the concept in after the fact, when so much rtl
> >> infrastructure wasn't prepared to deal with it.  The old support
> >> didn't handle all cases and passes correctly, and handled others
> >> suboptimally.
> >>
> >> I think it would be worth using the same approach as
> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
> >> vzeroupper: represent the instructions as call_insns in which the
> >> call has a special vzeroupper ABI.  I think that's likely to lead
> >> to better code than clobber_high would (or at least, it did for tlsdesc).

From an implementation perspective, I guess you mean we should
implement TARGET_INSN_CALLEE_ABI and TARGET_FNTYPE_ABI in the i386
backend.
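
Something like this, I suppose (a rough, untested sketch; ABI_VZEROUPPER
stands for a hypothetical new ABI id whose clobber set would be
initialized to contain just the upper halves of the SSE registers):

static const predefined_function_abi &
ix86_insn_callee_abi (const rtx_insn *insn)
{
  /* Route explicit vzeroupper call_insns to the dedicated ABI;
     everything else keeps the default ABI (id 0).  */
  if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
    return function_abis[ABI_VZEROUPPER];
  return function_abis[0];
}

#undef TARGET_INSN_CALLEE_ABI
#define TARGET_INSN_CALLEE_ABI ix86_insn_callee_abi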

> >
> > Perhaps a magic call_insn that is split post-reload into a normal insn
> > with the sets then?
>
> I'd be tempted to treat it is a call_insn throughout.  The unspec_volatile
> means that we can't move the instruction, so converting a call_insn to an
> insn isn't likely to help from that point of view.  The sets are also
> likely to be handled suboptimally compared to the more accurate register
> information attached to the call: all code that handles calls has to be
> prepared to deal with partial clobbers, whereas most code dealing with
> sets will assume that the set does useful work, and that the rhs of the
> set is live.
>
> Thanks,
> Richard
>


-- 
BR,
Hongtao


* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-14  2:27             ` Hongtao Liu
@ 2021-05-17  8:44               ` Hongtao Liu
  2021-05-17  9:56                 ` Richard Sandiford
  0 siblings, 1 reply; 45+ messages in thread
From: Hongtao Liu @ 2021-05-17  8:44 UTC (permalink / raw)
  To: Jakub Jelinek, Uros Bizjak, Hongtao Liu, GCC Patches, H. J. Lu,
	Richard Sandiford

On Fri, May 14, 2021 at 10:27 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Thu, May 13, 2021 at 7:52 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Jakub Jelinek <jakub@redhat.com> writes:
> > > On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote:
> > >> Jakub Jelinek <jakub@redhat.com> writes:
> > >> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
> > >> >> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
> > >> >> > >   Ok for trunk?
> > >> >> >
> > >> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
> > >> >> > removed for some reason). Perhaps we could resurrect the patch for the
> > >> >> > purpose of ferrying 128bit modes via vzeroupper RTX?
> > >> >>
> > >> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html
> > >> >
> > >> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
> > >> > is where it got removed, CCing Richard.
> > >>
> > >> Yeah.  Initially clobber_high seemed like the best appraoch for
> > >> handling the tlsdesc thing, but in practice it was too difficult
> > >> to shoe-horn the concept in after the fact, when so much rtl
> > >> infrastructure wasn't prepared to deal with it.  The old support
> > >> didn't handle all cases and passes correctly, and handled others
> > >> suboptimally.
> > >>
> > >> I think it would be worth using the same approach as
> > >> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
> > >> vzeroupper: represent the instructions as call_insns in which the
> > >> call has a special vzeroupper ABI.  I think that's likely to lead
> > >> to better code than clobber_high would (or at least, it did for tlsdesc).
>
> From an implementation perspective, I guess you're meaning we should
> implement TARGET_INSN_CALLEE_ABI and TARGET_FNTYPE_ABI in the i386
> backend.
>
When I implemented the vzeroupper pattern as a call_insn and defined
TARGET_INSN_CALLEE_ABI for it, I got several failures. They're related
to two things:

1. requires_stack_frame_p returns true for vzeroupper, but it should
   return false.
2. In subst_stack_regs, vzeroupper shouldn't kill arguments.

I've tried a rough patch (below) which fixes those failures.
Unfortunately, I don't have an arm machine to test on, so I want to
ask: would the change below break something in the arm backend?

modified   gcc/reg-stack.c
@@ -174,6 +174,7 @@
 #include "reload.h"
 #include "tree-pass.h"
 #include "rtl-iter.h"
+#include "function-abi.h"

 #ifdef STACK_REGS

@@ -2385,7 +2386,7 @@ subst_stack_regs (rtx_insn *insn, stack_ptr regstack)
   bool control_flow_insn_deleted = false;
   int i;

-  if (CALL_P (insn))
+  if (CALL_P (insn) && insn_callee_abi (insn).id () == 0)
     {
       int top = regstack->top;

modified   gcc/shrink-wrap.c
@@ -58,7 +58,12 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
   unsigned regno;

   if (CALL_P (insn))
-    return !SIBLING_CALL_P (insn);
+    {
+      if (insn_callee_abi (insn).id () != 0)
+	return false;
+      else
+	return !SIBLING_CALL_P (insn);
+    }

   /* We need a frame to get the unique CFA expected by the unwinder.  */
   if (cfun->can_throw_non_call_exceptions && can_throw_internal (insn))
> > >
> > > Perhaps a magic call_insn that is split post-reload into a normal insn
> > > with the sets then?
> >
> > I'd be tempted to treat it is a call_insn throughout.  The unspec_volatile
> > means that we can't move the instruction, so converting a call_insn to an
> > insn isn't likely to help from that point of view.  The sets are also
> > likely to be handled suboptimally compared to the more accurate register
> > information attached to the call: all code that handles calls has to be
> > prepared to deal with partial clobbers, whereas most code dealing with
> > sets will assume that the set does useful work, and that the rhs of the
> > set is live.
> >
> > Thanks,
> > Richard
> >
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-17  8:44               ` Hongtao Liu
@ 2021-05-17  9:56                 ` Richard Sandiford
  2021-05-18 13:12                   ` Hongtao Liu
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Sandiford @ 2021-05-17  9:56 UTC (permalink / raw)
  To: Hongtao Liu via Gcc-patches
  Cc: Jakub Jelinek, Uros Bizjak, Hongtao Liu, H. J. Lu

Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Fri, May 14, 2021 at 10:27 AM Hongtao Liu <crazylht@gmail.com> wrote:
>>
>> On Thu, May 13, 2021 at 7:52 PM Richard Sandiford
>> <richard.sandiford@arm.com> wrote:
>> >
>> > Jakub Jelinek <jakub@redhat.com> writes:
>> > > On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote:
>> > >> Jakub Jelinek <jakub@redhat.com> writes:
>> > >> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
>> > >> >> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
>> > >> >> > >   Ok for trunk?
>> > >> >> >
>> > >> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
>> > >> >> > removed for some reason). Perhaps we could resurrect the patch for the
>> > >> >> > purpose of ferrying 128bit modes via vzeroupper RTX?
>> > >> >>
>> > >> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html
>> > >> >
>> > >> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
>> > >> > is where it got removed, CCing Richard.
>> > >>
>> > >> Yeah.  Initially clobber_high seemed like the best appraoch for
>> > >> handling the tlsdesc thing, but in practice it was too difficult
>> > >> to shoe-horn the concept in after the fact, when so much rtl
>> > >> infrastructure wasn't prepared to deal with it.  The old support
>> > >> didn't handle all cases and passes correctly, and handled others
>> > >> suboptimally.
>> > >>
>> > >> I think it would be worth using the same approach as
>> > >> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
>> > >> vzeroupper: represent the instructions as call_insns in which the
>> > >> call has a special vzeroupper ABI.  I think that's likely to lead
>> > >> to better code than clobber_high would (or at least, it did for tlsdesc).
>>
>> From an implementation perspective, I guess you're meaning we should
>> implement TARGET_INSN_CALLEE_ABI and TARGET_FNTYPE_ABI in the i386
>> backend.
>>
> When I implemented the vzeroupper pattern as call_insn and defined
> TARGET_INSN_CALLEE_ABI for it, I got several failures. they're related
> to 2 parts
>
> 1. requires_stack_frame_p return true for vzeroupper which should be false.
> 2. in subst_stack_regs, vzeroupper shouldn't kill arguments
>
> I've tried a rough patch like below, it works for those failures,
> unfortunately, I don't have an arm machine to test, so I want to ask
> would the below change break something in the arm backend?

ABI id 0 just means the default ABI.  Real calls can use other ABIs
besides the default.  That said…

> modified   gcc/reg-stack.c
> @@ -174,6 +174,7 @@
>  #include "reload.h"
>  #include "tree-pass.h"
>  #include "rtl-iter.h"
> +#include "function-abi.h"
>
>  #ifdef STACK_REGS
>
> @@ -2385,7 +2386,7 @@ subst_stack_regs (rtx_insn *insn, stack_ptr regstack)
>    bool control_flow_insn_deleted = false;
>    int i;
>
> -  if (CALL_P (insn))
> +  if (CALL_P (insn) && insn_callee_abi (insn).id () == 0)
>      {
>        int top = regstack->top;

…reg-stack.c is effectively x86-specific code, so checking id 0 here
wouldn't affect anything else.  It doesn't feel very future-proof
though, since x86 could use ABIs other than 0 for real calls in future.

AIUI the property that matters here isn't the ABI, but that the target
of the call doesn't reference stack registers.  That can be true for
real calls too, with -fipa-ra.

> modified   gcc/shrink-wrap.c
> @@ -58,7 +58,12 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
>    unsigned regno;
>
>    if (CALL_P (insn))
> -    return !SIBLING_CALL_P (insn);
> +    {
> +      if (insn_callee_abi (insn).id () != 0)
> +	return false;
> +      else
> +	return !SIBLING_CALL_P (insn);
> +    }

TBH I'm not sure why off-hand this function needs to treat non-sibling
calls specially, rather than rely on normal DF information.  Calls have
a use of the stack pointer, so we should return true for that reason:

	/* The stack ptr is used (honorarily) by a CALL insn.  */
	df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
		       NULL, bb, insn_info, DF_REF_REG_USE,
		       DF_REF_CALL_STACK_USAGE | flags);

I guess this is something we should suppress for fake calls though.
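
i.e. something like this in df-scan.c (a sketch, using the fake-call
marking suggested just below):

	/* The stack ptr is used (honorarily) by a CALL insn,
	   unless the insn is a fake call.  */
	if (!RTX_FLAG (insn_info->insn, used))
	  df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
			 NULL, bb, insn_info, DF_REF_REG_USE,
			 DF_REF_CALL_STACK_USAGE | flags);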

It looks like the rtx “used” flag is unused for INSNs, so we could
use that as a CALL_INSN flag that indicates a fake call.  We would just
need to make:

      /* For all other RTXes clear the used flag on the copy.  */
      RTX_FLAG (copy, used) = 0;

conditional on !INSN_P.
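
i.e. in the rtx copying code, roughly ("orig" stands for whatever the
source rtx is called there):

      /* For all other RTXes clear the used flag on the copy, but
	 keep it on INSNs, where it now marks a fake call.  */
      if (!INSN_P (orig))
	RTX_FLAG (copy, used) = 0;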

Thanks,
Richard


* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-17  9:56                 ` Richard Sandiford
@ 2021-05-18 13:12                   ` Hongtao Liu
  2021-05-18 15:18                     ` Richard Sandiford
  0 siblings, 1 reply; 45+ messages in thread
From: Hongtao Liu @ 2021-05-18 13:12 UTC (permalink / raw)
  To: Hongtao Liu via Gcc-patches, Jakub Jelinek, Uros Bizjak,
	Hongtao Liu, H. J. Lu, Richard Sandiford

On Mon, May 17, 2021 at 5:56 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Fri, May 14, 2021 at 10:27 AM Hongtao Liu <crazylht@gmail.com> wrote:
> >>
> >> On Thu, May 13, 2021 at 7:52 PM Richard Sandiford
> >> <richard.sandiford@arm.com> wrote:
> >> >
> >> > Jakub Jelinek <jakub@redhat.com> writes:
> >> > > On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote:
> >> > >> Jakub Jelinek <jakub@redhat.com> writes:
> >> > >> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
> >> > >> >> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
> >> > >> >> > >   Ok for trunk?
> >> > >> >> >
> >> > >> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
> >> > >> >> > removed for some reason). Perhaps we could resurrect the patch for the
> >> > >> >> > purpose of ferrying 128bit modes via vzeroupper RTX?
> >> > >> >>
> >> > >> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html
> >> > >> >
> >> > >> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
> >> > >> > is where it got removed, CCing Richard.
> >> > >>
> >> > >> Yeah.  Initially clobber_high seemed like the best appraoch for
> >> > >> handling the tlsdesc thing, but in practice it was too difficult
> >> > >> to shoe-horn the concept in after the fact, when so much rtl
> >> > >> infrastructure wasn't prepared to deal with it.  The old support
> >> > >> didn't handle all cases and passes correctly, and handled others
> >> > >> suboptimally.
> >> > >>
> >> > >> I think it would be worth using the same approach as
> >> > >> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
> >> > >> vzeroupper: represent the instructions as call_insns in which the
> >> > >> call has a special vzeroupper ABI.  I think that's likely to lead
> >> > >> to better code than clobber_high would (or at least, it did for tlsdesc).
> >>
> >> From an implementation perspective, I guess you're meaning we should
> >> implement TARGET_INSN_CALLEE_ABI and TARGET_FNTYPE_ABI in the i386
> >> backend.
> >>
> > When I implemented the vzeroupper pattern as call_insn and defined
> > TARGET_INSN_CALLEE_ABI for it, I got several failures. they're related
> > to 2 parts
> >
> > 1. requires_stack_frame_p return true for vzeroupper which should be false.
> > 2. in subst_stack_regs, vzeroupper shouldn't kill arguments
> >
> > I've tried a rough patch like below, it works for those failures,
> > unfortunately, I don't have an arm machine to test, so I want to ask
> > would the below change break something in the arm backend?
>
> ABI id 0 just means the default ABI.  Real calls can use other ABIs
> besides the default.  That said…
>
> > modified   gcc/reg-stack.c
> > @@ -174,6 +174,7 @@
> >  #include "reload.h"
> >  #include "tree-pass.h"
> >  #include "rtl-iter.h"
> > +#include "function-abi.h"
> >
> >  #ifdef STACK_REGS
> >
> > @@ -2385,7 +2386,7 @@ subst_stack_regs (rtx_insn *insn, stack_ptr regstack)
> >    bool control_flow_insn_deleted = false;
> >    int i;
> >
> > -  if (CALL_P (insn))
> > +  if (CALL_P (insn) && insn_callee_abi (insn).id () == 0)
> >      {
> >        int top = regstack->top;
>
> …reg-stack.c is effectively x86-specific code, so checking id 0 here
> wouldn't affect anything else.  It doesn't feel very future-proof
> though, since x86 could use ABIs other than 0 for real calls in future.
>
> AIUI the property that matters here isn't the ABI, but that the target
> of the call doesn't reference stack registers.  That can be true for
> real calls too, with -fipa-ra.
>
> > modified   gcc/shrink-wrap.c
> > @@ -58,7 +58,12 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
> >    unsigned regno;
> >
> >    if (CALL_P (insn))
> > -    return !SIBLING_CALL_P (insn);
> > +    {
> > +      if (insn_callee_abi (insn).id () != 0)
> > +	return false;
> > +      else
> > +	return !SIBLING_CALL_P (insn);
> > +    }
>
> TBH I'm not sure why off-hand this function needs to treat non-sibling
> calls specially, rather than rely on normal DF information.  Calls have
> a use of the stack pointer, so we should return true for that reason:
>
>         /* The stack ptr is used (honorarily) by a CALL insn.  */
>         df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
>                        NULL, bb, insn_info, DF_REF_REG_USE,
>                        DF_REF_CALL_STACK_USAGE | flags);
>
> I guess this is something we should suppress for fake calls though.
>
> It looks like the rtx “used” flag is unused for INSNs, so we could
> use that as a CALL_INSN flag that indicates a fake call.  We could just
> need to make:
>
>       /* For all other RTXes clear the used flag on the copy.  */
>       RTX_FLAG (copy, used) = 0;
>
> conditional on !INSN_P.
>
I got another error in

@@ -83,6 +83,9 @@ control_flow_insn_p (const rtx_insn *insn)
       return true;

     case CALL_INSN:
+      /* CALL_INSNs use the "used" flag to indicate a fake call.  */
+      if (RTX_FLAG (insn, used))
+	break;

and a performance issue in

modified   gcc/final.c
@@ -4498,7 +4498,8 @@ leaf_function_p (void)
   for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
     {
       if (CALL_P (insn)
> -	  && ! SIBLING_CALL_P (insn))
> +	  && ! SIBLING_CALL_P (insn)
> +	  && !RTX_FLAG (insn, used))
  return 0;
       if (NONJUMP_INSN_P (insn)

Also, grepping for CALL_P or CALL_INSN in the GCC sources shows many
places which hold the assumption that CALL_P/CALL_INSN is a real call.
Considering that vzeroupper is used a lot in the i386 backend, I'm a
bit worried that this implementation approach will be a bottomless
pit.

> Thanks,
> Richard



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-18 13:12                   ` Hongtao Liu
@ 2021-05-18 15:18                     ` Richard Sandiford
  2021-05-25  6:04                       ` Hongtao Liu
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Sandiford @ 2021-05-18 15:18 UTC (permalink / raw)
  To: Hongtao Liu via Gcc-patches
  Cc: Jakub Jelinek, Uros Bizjak, Hongtao Liu, H. J. Lu

Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Mon, May 17, 2021 at 5:56 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>> It looks like the rtx “used” flag is unused for INSNs, so we could
>> use that as a CALL_INSN flag that indicates a fake call.  We would just
>> need to make:
>>
>>       /* For all other RTXes clear the used flag on the copy.  */
>>       RTX_FLAG (copy, used) = 0;
>>
>> conditional on !INSN_P.
>>
> I got another error in
>
> @@ -83,6 +83,9 @@ control_flow_insn_p (const rtx_insn *insn)
>        return true;
>
>      case CALL_INSN:
> +      /* CALL_INSNs use the "used" flag to indicate a fake call.  */
> +      if (RTX_FLAG (insn, used))
> +	break;

I guess this is because of the nonlocal_goto condition?  If so, that
could be fixed by adding a REG_EH_REGION note of INT_MIN.  Even if we
don't do that, I think the fix belongs in nonlocal_goto instead.
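
E.g. something like this when emitting the fake call (untested sketch;
the pattern name is just illustrative):

  rtx_insn *insn = emit_call_insn (gen_avx_vzeroupper_callee_abi ());
  /* Tell can_nonlocal_goto that this call cannot do a non-local goto.  */
  make_reg_eh_region_note (insn, 0, INT_MIN);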

> and a performance issue in
>
> modified   gcc/final.c
> @@ -4498,7 +4498,8 @@ leaf_function_p (void)
>    for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
>      {
>        if (CALL_P (insn)
> > -	  && ! SIBLING_CALL_P (insn))
> > +	  && ! SIBLING_CALL_P (insn)
> > +	  && !RTX_FLAG (insn, used))
>   return 0;
>        if (NONJUMP_INSN_P (insn)
>
> Also, grepping for CALL_P or CALL_INSN in the GCC sources shows many
> places which hold the assumption that CALL_P/CALL_INSN is a real call.
> Considering that vzeroupper is used a lot in the i386 backend, I'm a
> bit worried that this implementation approach will be a bottomless
> pit.

Maybe, but I think the same is true for CLOBBER_HIGH.  If we have
a third alternative then we should consider it, but I think the
call approach is still going to be less problematic than CLOBBER_HIGH.

The main advantage of the call approach is that the CALL_P handling
is (mostly) conservatively correct and performance problems are just
a one-line change.  The CLOBBER_HIGH approach instead requires
changes to the way that passes track liveness information for
non-call instructions (so is much more than a one-line change).
Also, treating a CLOBBER_HIGH like a CLOBBER isn't conservatively
correct, because other code might be relying on part of the register
being preserved.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-18 15:18                     ` Richard Sandiford
@ 2021-05-25  6:04                       ` Hongtao Liu
  2021-05-25  6:30                         ` Hongtao Liu
  0 siblings, 1 reply; 45+ messages in thread
From: Hongtao Liu @ 2021-05-25  6:04 UTC (permalink / raw)
  To: Hongtao Liu via Gcc-patches, Jakub Jelinek, Uros Bizjak,
	Hongtao Liu, H. J. Lu, Richard Sandiford

On Tue, May 18, 2021 at 11:18 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Mon, May 17, 2021 at 5:56 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >> It looks like the rtx “used” flag is unused for INSNs, so we could
> >> use that as a CALL_INSN flag that indicates a fake call.  We would just
> >> need to make:
> >>
> >>       /* For all other RTXes clear the used flag on the copy.  */
> >>       RTX_FLAG (copy, used) = 0;
> >>
> >> conditional on !INSN_P.
> >>
> > I got another error in
> >
> > @@ -83,6 +83,9 @@ control_flow_insn_p (const rtx_insn *insn)
> >        return true;
> >
> >      case CALL_INSN:
> > +      /* CALL_INSNs use the "used" flag to indicate a fake call.  */
> > +      if (RTX_FLAG (insn, used))
> > +	break;
>
> I guess this is because of the nonlocal_goto condition?  If so, that
> could be fixed by adding a REG_EH_REGION note of INT_MIN.  Even if we
> don't do that, I think the fix belongs in nonlocal_goto instead.
>
This is the error info.  IMHO, shouldn't the fix be in control_flow_insn_p?

../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:
In function ‘foo’:
../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1:
error: in basic block 5:
   21 | }
      | ^
../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1:
error: flow control insn inside a basic block
(call_insn 77 50 86 5 (parallel [
            (call (mem:QI (unspec_volatile [
                            (const_int 0 [0])
                        ] UNSPECV_VZEROUPPER) [0  S1 A8])
                (const_int 0 [0]))
            (unspec [
                    (const_int 1 [0x1])
                ] UNSPEC_CALLEE_ABI)
        ]) -1
     (nil)
    (nil))
during RTL pass: pro_and_epilogue
../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1:
internal compiler error: in rtl_verify_bb_insns, at cfgrtl.c:2797
0x129a2a3 _fatal_insn(char const*, rtx_def const*, char const*, int,
char const*)
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/rtl-error.c:108
0xcb8834 rtl_verify_bb_insns
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2797
0xcb8b09 rtl_verify_flow_info_1
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2883
0xcb9284 rtl_verify_flow_info
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:3125
0xc9f44d verify_flow_info()
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfghooks.c:267
0xcb21b7 checking_verify_flow_info
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfghooks.h:212
0xcb6a3c commit_edge_insertions()
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2115
0xebfcb8 thread_prologue_and_epilogue_insns()
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6136
0xec07db rest_of_handle_thread_prologue_and_epilogue
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6510
0xec09b8 execute
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6586
>
> Thanks,
> Richard



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-25  6:04                       ` Hongtao Liu
@ 2021-05-25  6:30                         ` Hongtao Liu
  2021-05-27  5:07                           ` Hongtao Liu
  0 siblings, 1 reply; 45+ messages in thread
From: Hongtao Liu @ 2021-05-25  6:30 UTC (permalink / raw)
  To: Hongtao Liu via Gcc-patches, Jakub Jelinek, Uros Bizjak,
	Hongtao Liu, H. J. Lu, Richard Sandiford

On Tue, May 25, 2021 at 2:04 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Tue, May 18, 2021 at 11:18 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > On Mon, May 17, 2021 at 5:56 PM Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > >> It looks like the rtx “used” flag is unused for INSNs, so we could
> > >> use that as a CALL_INSN flag that indicates a fake call.  We would just
> > >> need to make:
> > >>
> > >>       /* For all other RTXes clear the used flag on the copy.  */
> > >>       RTX_FLAG (copy, used) = 0;
> > >>
> > >> conditional on !INSN_P.
> > >>
> > > I got another error in
> > >
> > > @@ -83,6 +83,9 @@ control_flow_insn_p (const rtx_insn *insn)
> > >        return true;
> > >
> > >      case CALL_INSN:
> > > +      /* CALL_INSNs use the "used" flag to indicate a fake call.  */
> > > +      if (RTX_FLAG (insn, used))
> > > +	break;
> >
> > I guess this is because of the nonlocal_goto condition?  If so, that
Oh, I guess you mean can_nonlocal_goto, which is inside
control_flow_insn_p.  Sorry for the noise.
> > could be fixed by adding a REG_EH_REGION note of INT_MIN.  Even if we
> > don't do that, I think the fix belongs in nonlocal_goto instead.
> >
> This is the error info.  IMHO, shouldn't the fix be in control_flow_insn_p?
>
> ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:
> In function ‘foo’:
> ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1:
> error: in basic block 5:
>    21 | }
>       | ^
> ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1:
> error: flow control insn inside a basic block
> (call_insn 77 50 86 5 (parallel [
>             (call (mem:QI (unspec_volatile [
>                             (const_int 0 [0])
>                         ] UNSPECV_VZEROUPPER) [0  S1 A8])
>                 (const_int 0 [0]))
>             (unspec [
>                     (const_int 1 [0x1])
>                 ] UNSPEC_CALLEE_ABI)
>         ]) -1
>      (nil)
>     (nil))
> during RTL pass: pro_and_epilogue
> ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1:
> internal compiler error: in rtl_verify_bb_insns, at cfgrtl.c:2797
> 0x129a2a3 _fatal_insn(char const*, rtx_def const*, char const*, int,
> char const*)
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/rtl-error.c:108
> 0xcb8834 rtl_verify_bb_insns
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2797
> 0xcb8b09 rtl_verify_flow_info_1
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2883
> 0xcb9284 rtl_verify_flow_info
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:3125
> 0xc9f44d verify_flow_info()
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfghooks.c:267
> 0xcb21b7 checking_verify_flow_info
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfghooks.h:212
> 0xcb6a3c commit_edge_insertions()
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2115
> 0xebfcb8 thread_prologue_and_epilogue_insns()
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6136
> 0xec07db rest_of_handle_thread_prologue_and_epilogue
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6510
> 0xec09b8 execute
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6586
> >
> > Thanks,
> > Richard
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-25  6:30                         ` Hongtao Liu
@ 2021-05-27  5:07                           ` Hongtao Liu
  2021-05-27  7:05                             ` Uros Bizjak
  2021-05-27  7:20                             ` [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735] Jakub Jelinek
  0 siblings, 2 replies; 45+ messages in thread
From: Hongtao Liu @ 2021-05-27  5:07 UTC (permalink / raw)
  To: Hongtao Liu via Gcc-patches, Jakub Jelinek, Uros Bizjak,
	Hongtao Liu, H. J. Lu, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 2824 bytes --]

Hi:
  This is an updated patch which implements vzeroupper as a call_insn
with a special vzeroupper ABI.  This patch also reverts r11-7684,
r10-6451 and r10-3677, which tried to fix the same issue but in a
different way.
  Bootstrapped and regtested on x86_64-linux-gnu{-m32,} and
x86_64-linux-gnu{-m32 -march=cascadelake,-march=cascadelake}.
  Also tested the patch on SPEC2017 and eembc; no performance impact, as expected.
  Ok for trunk?

gcc/ChangeLog:

        PR target/82735
        * config/i386/i386-expand.c (ix86_expand_builtin): Remove
        assignment of cfun->machine->has_explicit_vzeroupper.
        * config/i386/i386-features.c
        (ix86_add_reg_usage_to_vzerouppers): Delete.
        (ix86_add_reg_usage_to_vzeroupper): Ditto.
        (rest_of_handle_insert_vzeroupper): Remove
        ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
        of the function.
        (gate): Remove cfun->machine->has_explicit_vzeroupper.
        * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
        Declare.
        * config/i386/i386.c (ix86_insn_callee_abi): New function.
        (ix86_initialize_callee_abi): Ditto.
        (ix86_expand_avx_vzeroupper): Ditto.
        (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
        ABI.
        (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
        * config/i386/i386.h (enum i386_insn_callee_abi_index): New.
        (struct GTY(()) machine_function): Delete
        has_explicit_vzeroupper.
        * config/i386/i386.md (enum unspec): New member
        UNSPEC_CALLEE_ABI.
        * config/i386/predicates.md (vzeroupper_pattern): Adjust.
        * config/i386/sse.md (avx_vzeroupper): Call
        ix86_expand_avx_vzeroupper.
        (*avx_vzeroupper): Rename to ..
        (avx_vzeroupper_callee_abi): .. this, and adjust pattern as
        call_insn which has a special vzeroupper ABI.
        (*avx_vzeroupper_1): Deleted.
        * df-scan.c (df_get_call_refs): When call_insn is a fake call,
        it won't use the stack pointer reg.
        * final.c (leaf_function_p): When call_insn is a fake call, it
        won't affect caller as a leaf function.
        * reg-stack.c (callee_clobbers_any_stack_reg): New.
        (subst_stack_regs): When call_insn doesn't clobber any stack
        reg, don't clear the arguments.
        * rtl.c (shallow_copy_rtx): Don't clear the used flag when orig
        is an insn.
        * shrink-wrap.c (requires_stack_frame_p): No need for stack
        frame for a fake call.

gcc/testsuite/ChangeLog:

        PR target/82735
        * gcc.target/i386/pr82735-1.c: New test.
        * gcc.target/i386/pr82735-2.c: New test.
        * gcc.target/i386/pr82735-3.c: New test.
        * gcc.target/i386/pr82735-4.c: New test.
        * gcc.target/i386/pr82735-5.c: New test.

[-- Attachment #2: 0001-i386-Fix-_mm256_zeroupper-by-representing-the-instru.patch --]
[-- Type: text/x-patch, Size: 27017 bytes --]

From c898d16891ed4e9cf104ee36914399da0bff4a23 Mon Sep 17 00:00:00 2001
From: liuhongt <hongtao.liu@intel.com>
Date: Wed, 12 May 2021 14:20:54 +0800
Subject: [PATCH] [i386] Fix _mm256_zeroupper by representing the instructions
 as call_insns in which the call has a special vzeroupper ABI.

When __builtin_ia32_vzeroupper is called explicitly, the corresponding
vzeroupper pattern does not carry any CLOBBERS or SETs before LRA,
which leads to incorrect optimization in pass_reload.  In order to
solve this problem, this patch represents the instructions as call_insns
in which the call has a special vzeroupper ABI.

gcc/ChangeLog:

	PR target/82735
	* config/i386/i386-expand.c (ix86_expand_builtin): Remove
	assignment of cfun->machine->has_explicit_vzeroupper.
	* config/i386/i386-features.c
	(ix86_add_reg_usage_to_vzerouppers): Delete.
	(ix86_add_reg_usage_to_vzeroupper): Ditto.
	(rest_of_handle_insert_vzeroupper): Remove
	ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
	of the function.
	(gate): Remove cfun->machine->has_explicit_vzeroupper.
	* config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
	Declare.
	* config/i386/i386.c (ix86_insn_callee_abi): New function.
	(ix86_initialize_callee_abi): Ditto.
	(ix86_expand_avx_vzeroupper): Ditto.
	(ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
	ABI.
	(TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
	* config/i386/i386.h (enum i386_insn_callee_abi_index): New.
	(struct GTY(()) machine_function): Delete
	has_explicit_vzeroupper.
	* config/i386/i386.md (enum unspec): New member
	UNSPEC_CALLEE_ABI.
	* config/i386/predicates.md (vzeroupper_pattern): Adjust.
	* config/i386/sse.md (avx_vzeroupper): Call
	ix86_expand_avx_vzeroupper.
	(*avx_vzeroupper): Rename to ..
	(avx_vzeroupper_callee_abi): .. this, and adjust pattern as
	call_insn which has a special vzeroupper ABI.
	(*avx_vzeroupper_1): Deleted.
	* df-scan.c (df_get_call_refs): When call_insn is a fake call,
	it won't use the stack pointer reg.
	* final.c (leaf_function_p): When call_insn is a fake call, it
	won't affect caller as a leaf function.
	* reg-stack.c (callee_clobbers_any_stack_reg): New.
	(subst_stack_regs): When call_insn doesn't clobber any stack
	reg, don't clear the arguments.
	* rtl.c (shallow_copy_rtx): Don't clear the used flag when orig
	is an insn.
	* shrink-wrap.c (requires_stack_frame_p): No need for stack
	frame for a fake call.

gcc/testsuite/ChangeLog:

	PR target/82735
	* gcc.target/i386/pr82735-1.c: New test.
	* gcc.target/i386/pr82735-2.c: New test.
	* gcc.target/i386/pr82735-3.c: New test.
	* gcc.target/i386/pr82735-4.c: New test.
	* gcc.target/i386/pr82735-5.c: New test.
---
 gcc/config/i386/i386-expand.c             |  4 -
 gcc/config/i386/i386-features.c           | 99 +++--------------------
 gcc/config/i386/i386-protos.h             |  2 +
 gcc/config/i386/i386.c                    | 53 +++++++++++-
 gcc/config/i386/i386.h                    | 12 ++-
 gcc/config/i386/i386.md                   |  4 +
 gcc/config/i386/predicates.md             |  5 +-
 gcc/config/i386/sse.md                    | 58 ++++---------
 gcc/df-scan.c                             |  4 +-
 gcc/final.c                               |  3 +-
 gcc/reg-stack.c                           | 18 ++++-
 gcc/rtl.c                                 |  6 +-
 gcc/shrink-wrap.c                         |  3 +-
 gcc/testsuite/gcc.target/i386/pr82735-1.c | 29 +++++++
 gcc/testsuite/gcc.target/i386/pr82735-2.c | 22 +++++
 gcc/testsuite/gcc.target/i386/pr82735-3.c |  5 ++
 gcc/testsuite/gcc.target/i386/pr82735-4.c | 48 +++++++++++
 gcc/testsuite/gcc.target/i386/pr82735-5.c | 54 +++++++++++++
 18 files changed, 282 insertions(+), 147 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-5.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index 9f3d41955a2..d25d59aa4e7 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -13282,10 +13282,6 @@ rdseed_step:
 
       return 0;
 
-    case IX86_BUILTIN_VZEROUPPER:
-      cfun->machine->has_explicit_vzeroupper = true;
-      break;
-
     default:
       break;
     }
diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 77783a154b6..a25769ae478 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -1768,92 +1768,22 @@ convert_scalars_to_vector (bool timode_p)
   return 0;
 }
 
-/* Modify the vzeroupper pattern in INSN so that it describes the effect
-   that the instruction has on the SSE registers.  LIVE_REGS are the set
-   of registers that are live across the instruction.
-
-   For a live register R we use:
-
-     (set (reg:V2DF R) (reg:V2DF R))
-
-   which preserves the low 128 bits but clobbers the upper bits.  */
-
-static void
-ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
-{
-  rtx pattern = PATTERN (insn);
-  unsigned int nregs = TARGET_64BIT ? 16 : 8;
-  unsigned int npats = nregs;
-  for (unsigned int i = 0; i < nregs; ++i)
-    {
-      unsigned int regno = GET_SSE_REGNO (i);
-      if (!bitmap_bit_p (live_regs, regno))
-	npats--;
-    }
-  if (npats == 0)
-    return;
-  rtvec vec = rtvec_alloc (npats + 1);
-  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
-  for (unsigned int i = 0, j = 0; i < nregs; ++i)
-    {
-      unsigned int regno = GET_SSE_REGNO (i);
-      if (!bitmap_bit_p (live_regs, regno))
-	continue;
-      rtx reg = gen_rtx_REG (V2DImode, regno);
-      ++j;
-      RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg);
-    }
-  XVEC (pattern, 0) = vec;
-  INSN_CODE (insn) = -1;
-  df_insn_rescan (insn);
-}
-
-/* Walk the vzeroupper instructions in the function and annotate them
-   with the effect that they have on the SSE registers.  */
-
-static void
-ix86_add_reg_usage_to_vzerouppers (void)
-{
-  basic_block bb;
-  rtx_insn *insn;
-  auto_bitmap live_regs;
-
-  df_analyze ();
-  FOR_EACH_BB_FN (bb, cfun)
-    {
-      bitmap_copy (live_regs, df_get_live_out (bb));
-      df_simulate_initialize_backwards (bb, live_regs);
-      FOR_BB_INSNS_REVERSE (bb, insn)
-	{
-	  if (!NONDEBUG_INSN_P (insn))
-	    continue;
-	  if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
-	    ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
-	  df_simulate_one_insn_backwards (bb, insn, live_regs);
-	}
-    }
-}
-
 static unsigned int
 rest_of_handle_insert_vzeroupper (void)
 {
-  if (TARGET_VZEROUPPER
-      && flag_expensive_optimizations
-      && !optimize_size)
-    {
-      /* vzeroupper instructions are inserted immediately after reload to
-	 account for possible spills from 256bit or 512bit registers.  The pass
-	 reuses mode switching infrastructure by re-running mode insertion
-	 pass, so disable entities that have already been processed.  */
-      for (int i = 0; i < MAX_386_ENTITIES; i++)
-	ix86_optimize_mode_switching[i] = 0;
+  /* vzeroupper instructions are inserted immediately after reload to
+     account for possible spills from 256bit or 512bit registers.  The pass
+     reuses mode switching infrastructure by re-running mode insertion
+     pass, so disable entities that have already been processed.  */
+  for (int i = 0; i < MAX_386_ENTITIES; i++)
+    ix86_optimize_mode_switching[i] = 0;
 
-      ix86_optimize_mode_switching[AVX_U128] = 1;
+  ix86_optimize_mode_switching[AVX_U128] = 1;
 
-      /* Call optimize_mode_switching.  */
-      g->get_passes ()->execute_pass_mode_switching ();
-    }
-  ix86_add_reg_usage_to_vzerouppers ();
+  /* Call optimize_mode_switching.  */
+  g->get_passes ()->execute_pass_mode_switching ();
+
+  df_analyze ();
   return 0;
 }
 
@@ -1882,11 +1812,8 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *)
     {
-      return TARGET_AVX
-	     && ((TARGET_VZEROUPPER
-		  && flag_expensive_optimizations
-		  && !optimize_size)
-		 || cfun->machine->has_explicit_vzeroupper);
+      return TARGET_AVX && TARGET_VZEROUPPER
+	&& flag_expensive_optimizations && !optimize_size;
     }
 
   virtual unsigned int execute (function *)
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 7782cf1163f..e81b9872c6c 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -51,6 +51,7 @@ extern void ix86_reset_previous_fndecl (void);
 extern bool ix86_using_red_zone (void);
 
 extern unsigned int ix86_regmode_natural_size (machine_mode);
+
 #ifdef RTX_CODE
 extern int standard_80387_constant_p (rtx);
 extern const char *standard_80387_constant_opcode (rtx);
@@ -216,6 +217,7 @@ extern rtx ix86_split_stack_guard (void);
 extern void ix86_move_vector_high_sse_to_mmx (rtx);
 extern void ix86_split_mmx_pack (rtx[], enum rtx_code);
 extern void ix86_split_mmx_punpck (rtx[], bool);
+extern void ix86_expand_avx_vzeroupper (void);
 
 #ifdef TREE_CODE
 extern void init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree, int);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 743d8a25fe3..cecc07e7371 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -19494,15 +19494,63 @@ ix86_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
   return false;
 }
 
+/* Implement TARGET_INSN_CALLEE_ABI.  */
+
+const predefined_function_abi &
+ix86_insn_callee_abi (const rtx_insn *insn)
+{
+  unsigned int abi_id = 0;
+  rtx pat = PATTERN (insn);
+  if (vzeroupper_pattern (pat, VOIDmode))
+    abi_id = I386_VZEROUPPER;
+
+  return function_abis[abi_id];
+}
+
+/* Initialize function_abis with the corresponding abi_id;
+   currently we only handle vzeroupper.  */
+void
+ix86_initialize_callee_abi (unsigned int abi_id)
+{
+  gcc_assert (abi_id == I386_VZEROUPPER);
+  predefined_function_abi &vzeroupper_abi = function_abis[abi_id];
+  if (!vzeroupper_abi.initialized_p ())
+    {
+      HARD_REG_SET full_reg_clobbers;
+      CLEAR_HARD_REG_SET (full_reg_clobbers);
+      vzeroupper_abi.initialize (I386_VZEROUPPER, full_reg_clobbers);
+    }
+}
+
+void
+ix86_expand_avx_vzeroupper (void)
+{
+  /* Initialize vzeroupper_abi here.  */
+  ix86_initialize_callee_abi (I386_VZEROUPPER);
+  rtx_insn *insn = emit_call_insn (gen_avx_vzeroupper_callee_abi ());
+  /* Make can_nonlocal_goto return false for this call.  */
+  make_reg_eh_region_note (insn, 0, INT_MIN);
+  /* The "used" flag on a call_insn indicates it's a fake call.  */
+  RTX_FLAG (insn, used) = 1;
+}
+
+
 /* Implement TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  The only ABI that
    saves SSE registers across calls is Win64 (thus no need to check the
    current ABI here), and with AVX enabled Win64 only guarantees that
    the low 16 bytes are saved.  */
 
 static bool
-ix86_hard_regno_call_part_clobbered (unsigned int, unsigned int regno,
+ix86_hard_regno_call_part_clobbered (unsigned int abi_id, unsigned int regno,
 				     machine_mode mode)
 {
+  /* Special ABI for vzeroupper which only clobbers the upper part of sse regs.  */
+  if (abi_id == I386_VZEROUPPER)
+    return (GET_MODE_SIZE (mode) > 16
+	    && ((TARGET_64BIT
+		 && (IN_RANGE (regno, FIRST_REX_SSE_REG, LAST_REX_SSE_REG)))
+		|| (IN_RANGE (regno, FIRST_SSE_REG, LAST_SSE_REG))));
+
   return SSE_REGNO_P (regno) && GET_MODE_SIZE (mode) > 16;
 }
 
@@ -23916,6 +23964,9 @@ ix86_run_selftests (void)
 #define TARGET_HARD_REGNO_CALL_PART_CLOBBERED \
   ix86_hard_regno_call_part_clobbered
 
+#undef TARGET_INSN_CALLEE_ABI
+#define TARGET_INSN_CALLEE_ABI ix86_insn_callee_abi
+
 #undef TARGET_CAN_CHANGE_MODE_CLASS
 #define TARGET_CAN_CHANGE_MODE_CLASS ix86_can_change_mode_class
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 53d503fc6e0..9d07769169d 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1177,6 +1177,14 @@ extern const char *host_detect_local_cpu (int argc, const char **argv);
 
 #define KEEP_AGGREGATE_RETURN_POINTER 0
 \f
+
+enum i386_insn_callee_abi_index
+{
+  I386_DEFAULT,		/* Default function ABI.  */
+  I386_VZEROUPPER,	/* For vzeroupper.  */
+  I386_UNKNOWN
+};
+
 /* Define the classes of registers for register constraints in the
    machine description.  Also define ranges of constants.
 
@@ -2659,10 +2667,6 @@ struct GTY(()) machine_function {
   /* True if the function needs a stack frame.  */
   BOOL_BITFIELD stack_frame_required : 1;
 
-  /* True if __builtin_ia32_vzeroupper () has been expanded in current
-     function.  */
-  BOOL_BITFIELD has_explicit_vzeroupper : 1;
-
   /* True if we should act silently, rather than raise an error for
      invalid calls.  */
   BOOL_BITFIELD silent_p : 1;
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 2fc8fae30f3..bd012ab1d21 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -191,6 +191,10 @@ (define_c_enum "unspec" [
   ;; For MOVDIRI and MOVDIR64B support
   UNSPEC_MOVDIRI
   UNSPEC_MOVDIR64B
+
+  ;; For insn_callee_abi:
+  UNSPEC_CALLEE_ABI
+
 ])
 
 (define_c_enum "unspecv" [
diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index abd307ebdb8..7c5b7482833 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -1599,8 +1599,9 @@ (define_predicate "vzeroall_pattern"
 ;; return true if OP is a vzeroupper pattern.
 (define_predicate "vzeroupper_pattern"
   (and (match_code "parallel")
-       (match_code "unspec_volatile" "a")
-       (match_test "XINT (XVECEXP (op, 0, 0), 1) == UNSPECV_VZEROUPPER")))
+       (match_code "unspec" "b")
+       (match_test "XINT (XVECEXP (op, 0, 1), 1) == UNSPEC_CALLEE_ABI")
+       (match_test "XVECEXP (XVECEXP (op, 0, 1), 0, 0) == const1_rtx")))
 
 ;; Return true if OP is an addsub vec_merge operation
 (define_predicate "addsub_vm_operator"
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index a4503ddcb73..86cf44bae14 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -20857,14 +20857,22 @@ (define_insn "*avx_vzeroall"
 ;; if the upper 128bits are unused.  Initially we expand the instructions
 ;; as though they had no effect on the SSE registers, but later add SETs and
 ;; CLOBBERs to the PARALLEL to model the real effect.
+
 (define_expand "avx_vzeroupper"
-  [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
-  "TARGET_AVX")
+  [(parallel [(call (mem:QI (unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER))
+		    (const_int 0))
+	     (unspec [(const_int 1)] UNSPEC_CALLEE_ABI)])]
+  "TARGET_AVX"
+{
+  ix86_expand_avx_vzeroupper ();
+  DONE;
+})
 
-(define_insn "*avx_vzeroupper"
-  [(match_parallel 0 "vzeroupper_pattern"
-     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
-  "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 16 : 8) + 1"
+(define_insn "avx_vzeroupper_callee_abi"
+  [(call (mem:QI (unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER))
+	 (const_int 0))
+    (unspec [(const_int 1)] UNSPEC_CALLEE_ABI)]
+  "TARGET_AVX"
   "vzeroupper"
   [(set_attr "type" "sse")
    (set_attr "modrm" "0")
@@ -20873,44 +20881,6 @@ (define_insn "*avx_vzeroupper"
    (set_attr "btver2_decode" "vector")
    (set_attr "mode" "OI")])
 
-(define_insn_and_split "*avx_vzeroupper_1"
-  [(match_parallel 0 "vzeroupper_pattern"
-     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
-  "TARGET_AVX && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1"
-  "#"
-  "&& epilogue_completed"
-  [(match_dup 0)]
-{
-  /* For IPA-RA purposes, make it clear the instruction clobbers
-     even XMM registers not mentioned explicitly in the pattern.  */
-  unsigned int nregs = TARGET_64BIT ? 16 : 8;
-  unsigned int npats = XVECLEN (operands[0], 0);
-  rtvec vec = rtvec_alloc (nregs + 1);
-  RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0);
-  for (unsigned int i = 0, j = 1; i < nregs; ++i)
-    {
-      unsigned int regno = GET_SSE_REGNO (i);
-      if (j < npats
-	  && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno)
-	{
-	  RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j);
-	  j++;
-	}
-      else
-	{
-	  rtx reg = gen_rtx_REG (V2DImode, regno);
-	  RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
-	}
-    }
-  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
-}
-  [(set_attr "type" "sse")
-   (set_attr "modrm" "0")
-   (set_attr "memory" "none")
-   (set_attr "prefix" "vex")
-   (set_attr "btver2_decode" "vector")
-   (set_attr "mode" "OI")])
-
 (define_mode_attr pbroadcast_evex_isa
   [(V64QI "avx512bw") (V32QI "avx512bw") (V16QI "avx512bw")
    (V32HI "avx512bw") (V16HI "avx512bw") (V8HI "avx512bw")
diff --git a/gcc/df-scan.c b/gcc/df-scan.c
index 6691c3e8357..845702d9630 100644
--- a/gcc/df-scan.c
+++ b/gcc/df-scan.c
@@ -3090,7 +3090,9 @@ df_get_call_refs (class df_collection_rec *collection_rec,
 
   for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
     {
-      if (i == STACK_POINTER_REGNUM)
+      /* CALL_INSNs use the "used" flag to indicate a fake call.  */
+      if (i == STACK_POINTER_REGNUM
+	  && !RTX_FLAG (insn_info->insn, used))
 	/* The stack ptr is used (honorarily) by a CALL insn.  */
 	df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
 		       NULL, bb, insn_info, DF_REF_REG_USE,
diff --git a/gcc/final.c b/gcc/final.c
index e0a70fcd830..5f8b7b006c2 100644
--- a/gcc/final.c
+++ b/gcc/final.c
@@ -4109,7 +4109,8 @@ leaf_function_p (void)
   for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
     {
       if (CALL_P (insn)
-	  && ! SIBLING_CALL_P (insn))
+	  && ! SIBLING_CALL_P (insn)
+	  && !RTX_FLAG (insn, used))
 	return 0;
       if (NONJUMP_INSN_P (insn)
 	  && GET_CODE (PATTERN (insn)) == SEQUENCE
diff --git a/gcc/reg-stack.c b/gcc/reg-stack.c
index 25210f0c17f..1d9ea035cf4 100644
--- a/gcc/reg-stack.c
+++ b/gcc/reg-stack.c
@@ -174,6 +174,7 @@
 #include "reload.h"
 #include "tree-pass.h"
 #include "rtl-iter.h"
+#include "function-abi.h"
 
 #ifdef STACK_REGS
 
@@ -2368,6 +2369,18 @@ subst_asm_stack_regs (rtx_insn *insn, stack_ptr regstack)
 	    }
       }
 }
+
+/* Return true if a function call is allowed to alter some or all bits
+   of any stack reg.  */
+static bool
+callee_clobbers_any_stack_reg (const function_abi &callee_abi)
+{
+  for (unsigned regno = FIRST_STACK_REG; regno <= LAST_STACK_REG; regno++)
+    if (callee_abi.clobbers_at_least_part_of_reg_p (regno))
+      return true;
+  return false;
+}
+
 \f
 /* Substitute stack hard reg numbers for stack virtual registers in
    INSN.  Non-stack register numbers are not changed.  REGSTACK is the
@@ -2382,7 +2395,10 @@ subst_stack_regs (rtx_insn *insn, stack_ptr regstack)
   bool control_flow_insn_deleted = false;
   int i;
 
-  if (CALL_P (insn))
+  /* If the target of the call doesn't clobber any stack registers,
+     don't clear the arguments.  */
+  if (CALL_P (insn)
+      && callee_clobbers_any_stack_reg (insn_callee_abi (insn)))
     {
       int top = regstack->top;
 
diff --git a/gcc/rtl.c b/gcc/rtl.c
index b0ba1ff684c..aaee882f5ca 100644
--- a/gcc/rtl.c
+++ b/gcc/rtl.c
@@ -395,8 +395,10 @@ shallow_copy_rtx (const_rtx orig MEM_STAT_DECL)
     case SCRATCH:
       break;
     default:
-      /* For all other RTXes clear the used flag on the copy.  */
-      RTX_FLAG (copy, used) = 0;
+      /* For all other RTXes clear the used flag on the copy.
+	 CALL_INSNs use the "used" flag to indicate a fake call.  */
+      if (!INSN_P (orig))
+	RTX_FLAG (copy, used) = 0;
       break;
     }
   return copy;
diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c
index ba7b5cd56fd..07f9a081dd3 100644
--- a/gcc/shrink-wrap.c
+++ b/gcc/shrink-wrap.c
@@ -57,7 +57,8 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
   HARD_REG_SET hardregs;
   unsigned regno;
 
-  if (CALL_P (insn))
+  /* CALL_INSNs use the "used" flag to indicate a fake call.  */
+  if (CALL_P (insn) && !RTX_FLAG (insn, used))
     return !SIBLING_CALL_P (insn);
 
   /* We need a frame to get the unique CFA expected by the unwinder.  */
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-1.c b/gcc/testsuite/gcc.target/i386/pr82735-1.c
new file mode 100644
index 00000000000..1a63b9ae9c9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-1.c
@@ -0,0 +1,29 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx" } */
+/* { dg-require-effective-target avx } */
+
+#include "avx-check.h"
+
+void
+__attribute__ ((noipa))
+mtest(char *dest)
+{
+  __m256i ymm1 = _mm256_set1_epi8((char)0x1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm1);
+  _mm256_zeroupper();
+  __m256i ymm2 = _mm256_set1_epi8((char)0x1);
+  _mm256_storeu_si256((__m256i *)dest, ymm2);
+}
+
+void
+avx_test ()
+{
+  char buf[64];
+  for (int i = 0; i != 64; i++)
+    buf[i] = 2;
+  mtest (buf);
+
+  for (int i = 0; i < 32; ++i)
+    if (buf[i] != 1)
+      __builtin_abort ();
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-2.c b/gcc/testsuite/gcc.target/i386/pr82735-2.c
new file mode 100644
index 00000000000..ac9d006f794
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-2.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx -O2" } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  /* xmm1 can be propagated to xmm2 by CSE.  */
+  __m128i xmm1 = _mm_set_epi8(0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
+			      0x9, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16);
+  _mm_storeu_si128((__m128i *)(dest + 32), xmm1);
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  __m128i xmm2 = xmm1;
+  _mm_storeu_si128((__m128i *)dest, xmm2);
+}
+
+/* Darwin local constant symbol is "lC0", ELF targets ".LC0" */
+/* { dg-final { scan-assembler-times {(?n)vmovdqa\t\.?[Ll]C0[^,]*, %xmm[0-9]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-3.c b/gcc/testsuite/gcc.target/i386/pr82735-3.c
new file mode 100644
index 00000000000..e3f801e6924
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-3.c
@@ -0,0 +1,5 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx -O2 -mabi=ms" } */
+/* { dg-final { scan-assembler-not {(?n)xmm([6-9]|1[0-5])} } } */
+
+#include "pr82735-2.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-4.c b/gcc/testsuite/gcc.target/i386/pr82735-4.c
new file mode 100644
index 00000000000..78c0a6cb2c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-4.c
@@ -0,0 +1,48 @@
+/* { dg-do compile { target { ! ia32 } } }  */
+/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
+  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
+		"vmovdqa\t%%ymm0, %1\n\t"
+		"vmovdqa\t%%ymm0, %2\n\t"
+		"vmovdqa\t%%ymm0, %3\n\t"
+		"vmovdqa\t%%ymm0, %4\n\t"
+		"vmovdqa\t%%ymm0, %5\n\t"
+		"vmovdqa\t%%ymm0, %6\n\t"
+		"vmovdqa\t%%ymm0, %7\n\t"
+		"vmovdqa\t%%ymm0, %8\n\t"
+		"vmovdqa\t%%ymm0, %9\n\t"
+		"vmovdqa\t%%ymm0, %10\n\t"
+		"vmovdqa\t%%ymm0, %11\n\t"
+		"vmovdqa\t%%ymm0, %12\n\t"
+		"vmovdqa\t%%ymm0, %13\n\t"
+		"vmovdqa\t%%ymm0, %14\n\t"
+		"vmovdqa\t%%ymm0, %15\n\t"
+		: "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
+		  "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
+		  "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
+		  "=v"(ymm0)
+		::);
+  _mm256_zeroupper();
+  _mm256_storeu_si256((__m256i *)dest, ymm1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-5.c b/gcc/testsuite/gcc.target/i386/pr82735-5.c
new file mode 100644
index 00000000000..2a58cbe52d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-5.c
@@ -0,0 +1,54 @@
+/* { dg-do compile { target { ! ia32 } } }  */
+/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
+  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
+		"vmovdqa\t%%ymm0, %1\n\t"
+		"vmovdqa\t%%ymm0, %2\n\t"
+		"vmovdqa\t%%ymm0, %3\n\t"
+		"vmovdqa\t%%ymm0, %4\n\t"
+		"vmovdqa\t%%ymm0, %5\n\t"
+		"vmovdqa\t%%ymm0, %6\n\t"
+		"vmovdqa\t%%ymm0, %7\n\t"
+		"vmovdqa\t%%ymm0, %8\n\t"
+		"vmovdqa\t%%ymm0, %9\n\t"
+		"vmovdqa\t%%ymm0, %10\n\t"
+		"vmovdqa\t%%ymm0, %11\n\t"
+		"vmovdqa\t%%ymm0, %12\n\t"
+		"vmovdqa\t%%ymm0, %13\n\t"
+		"vmovdqa\t%%ymm0, %14\n\t"
+		"vmovdqa\t%%ymm0, %15\n\t"
+		: "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
+		  "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
+		  "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
+		  "=v"(ymm0)
+		::);
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_storeu_si256((__m256i *)dest, ymm1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
+}
-- 
2.18.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-27  5:07                           ` Hongtao Liu
@ 2021-05-27  7:05                             ` Uros Bizjak
  2021-06-01  2:24                               ` Hongtao Liu
  2021-06-03  6:54                               ` [PATCH 1/2] CALL_INSN may not be a real function call liuhongt
  2021-05-27  7:20                             ` [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735] Jakub Jelinek
  1 sibling, 2 replies; 45+ messages in thread
From: Uros Bizjak @ 2021-05-27  7:05 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: Hongtao Liu via Gcc-patches, Jakub Jelinek, H. J. Lu, Richard Sandiford

On Thu, May 27, 2021 at 7:03 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> Hi:
>   This is an updated patch which implements vzeroupper as a call_insn
> with a special vzeroupper ABI.  This patch also reverts r11-7684,
> r10-6451 and r10-3677, which tried to fix the same issue but in a
> different way.
>   Bootstrapped and regtested on x86_64-linux-gnu{-m32,} and
> x86_64-linux-gnu{-m32 -march=cascadelake,-march=cascadelake}.
>   Also tested the patch on SPEC2017 and eembc; no performance impact, as expected.
>   Ok for trunk?
>
> gcc/ChangeLog:
>
>         PR target/82735
>         * config/i386/i386-expand.c (ix86_expand_builtin): Remove
>         assignment of cfun->machine->has_explicit_vzeroupper.
>         * config/i386/i386-features.c
>         (ix86_add_reg_usage_to_vzerouppers): Delete.
>         (ix86_add_reg_usage_to_vzeroupper): Ditto.
>         (rest_of_handle_insert_vzeroupper): Remove
>         ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
>         of the function.
>         (gate): Remove cfun->machine->has_explicit_vzeroupper.
>         * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
>         Declare.
>         * config/i386/i386.c (ix86_insn_callee_abi): New function.
>         (ix86_initialize_callee_abi): Ditto.
>         (ix86_expand_avx_vzeroupper): Ditto.
>         (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
>         ABI.
>         (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
>         * config/i386/i386.h (enum i386_insn_callee_abi_index): New.
>         (struct GTY(()) machine_function): Delete
>         has_explicit_vzeroupper.
>         * config/i386/i386.md (enum unspec): New member
>         UNSPEC_CALLEE_ABI.
>         * config/i386/predicates.md (vzeroupper_pattern): Adjust.
>         * config/i386/sse.md (avx_vzeroupper): Call
>         ix86_expand_avx_vzeroupper.
>         (*avx_vzeroupper): Rename to ..
>         (avx_vzeroupper_callee_abi): .. this, and adjust pattern as
>         call_insn which has a special vzeroupper ABI.
>         (*avx_vzeroupper_1): Deleted.
>         * df-scan.c (df_get_call_refs): When call_insn is a fake call,
>         it won't use the stack pointer reg.
>         * final.c (leaf_function_p): When call_insn is a fake call, it
>         won't affect caller as a leaf function.
>         * reg-stack.c (callee_clobbers_any_stack_reg): New.
>         (subst_stack_regs): When call_insn doesn't clobber any stack
>         reg, don't clear the arguments.
>         * rtl.c (shallow_copy_rtx): Don't clear the used flag when orig
>         is an insn.
>         * shrink-wrap.c (requires_stack_frame_p): No need for stack
>         frame for a fake call.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/82735
>         * gcc.target/i386/pr82735-1.c: New test.
>         * gcc.target/i386/pr82735-2.c: New test.
>         * gcc.target/i386/pr82735-3.c: New test.
>         * gcc.target/i386/pr82735-4.c: New test.
>         * gcc.target/i386/pr82735-5.c: New test.

Please split the patch into a middle-end part and a target part.  The
middle-end part should be approved first.

 (define_expand "avx_vzeroupper"
-  [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
-  "TARGET_AVX")
+  [(parallel [(call (mem:QI (unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER))
+            (const_int 0))
+         (unspec [(const_int 1)] UNSPEC_CALLEE_ABI)])]

The call insn doesn't look like a valid RTX. Why not just:

+  [(parallel [(call (mem:QI (const_int 0))
+            (const_int 0))

for a fake call? Also, UNSPECV_VZEROUPPER can be removed this way since
the const_int 1 of UNSPEC_CALLEE_ABI is now used to detect vzeroupper.

Also, you don't need the avx_vzeroupper pattern just to call
ix86_expand_avx_vzeroupper.  Just call the function directly from the
call site:

    case AVX_U128:
      if (mode == AVX_U128_CLEAN)
    emit_insn (gen_avx_vzeroupper ());
      break;
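
i.e. just (sketch):

    case AVX_U128:
      if (mode == AVX_U128_CLEAN)
	ix86_expand_avx_vzeroupper ();
      break;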

+         (unspec [(const_int 1)] UNSPEC_CALLEE_ABI)])]

Can this const_int 1 be somehow more descriptive? Perhaps use
define_constants to define the I386_VZEROUPPER ABI id and use it in .md
as well as in .c files.
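
Something along these lines (untested sketch, names illustrative;
genconstants emits define_constants values into insn-constants.h, so
they are visible from .c files as well):

(define_constants
  [(ABI_DEFAULT		0)
   (ABI_VZEROUPPER	1)])

and then in the pattern:

  (unspec [(const_int ABI_VZEROUPPER)] UNSPEC_CALLEE_ABI)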

Uros.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-27  5:07                           ` Hongtao Liu
  2021-05-27  7:05                             ` Uros Bizjak
@ 2021-05-27  7:20                             ` Jakub Jelinek
  2021-05-27 10:50                               ` Richard Sandiford
  1 sibling, 1 reply; 45+ messages in thread
From: Jakub Jelinek @ 2021-05-27  7:20 UTC (permalink / raw)
  To: Hongtao Liu, Richard Sandiford, Eric Botcazou
  Cc: gcc-patches, Uros Bizjak, H. J. Lu

On Thu, May 27, 2021 at 01:07:09PM +0800, Hongtao Liu via Gcc-patches wrote:
> +  /* The "used" flag on a call_insn indicates it's a fake call.  */
> +  RTX_FLAG (insn, used) = 1;

> +      /* CALL_INSNs use the "used" flag to indicate a fake call.  */
> +      if (i == STACK_POINTER_REGNUM
> +	  && !RTX_FLAG (insn_info->insn, used))

> -	  && ! SIBLING_CALL_P (insn))
> +	  && ! SIBLING_CALL_P (insn)
> +	  && !RTX_FLAG (insn, used))

> -      /* For all other RTXes clear the used flag on the copy.  */
> -      RTX_FLAG (copy, used) = 0;
> +      /* For all other RTXes clear the used flag on the copy.
> +	 CALL_INSNs use the "used" flag to indicate a fake call.  */
> +      if (!INSN_P (orig))
> +	RTX_FLAG (copy, used) = 0;
>        break;
>      }
>    return copy;
> @@ -57,7 +57,8 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
>    HARD_REG_SET hardregs;
>    unsigned regno;
>  
> -  if (CALL_P (insn))
> +  /* CALL_INSNs use the "used" flag to indicate a fake call.  */
> +  if (CALL_P (insn) && !RTX_FLAG (insn, used))
>      return !SIBLING_CALL_P (insn);

Please define a macro for this in rtl.h (and mention it above the used;
member too in a comment; see all the other comments in there), like:
/* 1 if RTX is a call_insn for a fake call.  */
#define FAKE_CALL_P(RTX)					\
  (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)
Though, I'm also not sure if used can actually be used for this,
because it is used e.g. in emit-rtl.c for verification of RTL sharing.
Though, it seems no other rtl flag is free for CALL_INSN.
Could this fake call flag sit on the CALL rtx instead?

	Jakub


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-27  7:20                             ` [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735] Jakub Jelinek
@ 2021-05-27 10:50                               ` Richard Sandiford
  2021-06-01  2:22                                 ` Hongtao Liu
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Sandiford @ 2021-05-27 10:50 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Hongtao Liu, Eric Botcazou, gcc-patches, Uros Bizjak, H. J. Lu

Jakub Jelinek <jakub@redhat.com> writes:
> On Thu, May 27, 2021 at 01:07:09PM +0800, Hongtao Liu via Gcc-patches wrote:
>> +  /* The "used" flag on a call_insn indicates it's a fake call.  */
>> +  RTX_FLAG (insn, used) = 1;
>
>> +      /* CALL_INSNs use the "used" flag to indicate a fake call.  */
>> +      if (i == STACK_POINTER_REGNUM
>> +	  && !RTX_FLAG (insn_info->insn, used))
>
>> -	  && ! SIBLING_CALL_P (insn))
>> +	  && ! SIBLING_CALL_P (insn)
>> +	  && !RTX_FLAG (insn, used))
>
>> -      /* For all other RTXes clear the used flag on the copy.  */
>> -      RTX_FLAG (copy, used) = 0;
>> +      /* For all other RTXes clear the used flag on the copy.
>> +	 CALL_INSNs use the "used" flag to indicate a fake call.  */
>> +      if (!INSN_P (orig))
>> +	RTX_FLAG (copy, used) = 0;
>>        break;
>>      }
>>    return copy;
>> @@ -57,7 +57,8 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
>>    HARD_REG_SET hardregs;
>>    unsigned regno;
>>  
>> -  if (CALL_P (insn))
>> +  /* CALL_INSNs use the "used" flag to indicate a fake call.  */
>> +  if (CALL_P (insn) && !RTX_FLAG (insn, used))
>>      return !SIBLING_CALL_P (insn);
>
> Please define a macro for this in rtl.h (and mention it above the used;
> member too in a comment; see all the other comments in there), like:
> /* 1 if RTX is a call_insn for a fake call.  */
> #define FAKE_CALL_P(RTX)					\
>   (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)
> Though, I'm also not sure if used can actually be used for this,
> because it is used e.g. in emit-rtl.c for verification of RTL sharing.

I thought it should be OK, since:

- copy_rtx_if_shared_1 and mark_used_flags do nothing for insns
- verify_rtx_sharing is only called for parts of an insn, rather than
  an insn itself

I guess an alternative would be to add a new rtx_code for fake call
insns and use CALL_P to test for both.  However, that would lose the
property that the default behaviour is conservatively correct
(even for direct checks of CALL_INSN), so the flag IMO seems better.

Thanks,
Richard

> Though, it seems no other rtl flag is free for CALL_INSN.
> Could this fake call flag sit on the CALL rtx instead?
>
> 	Jakub

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-27 10:50                               ` Richard Sandiford
@ 2021-06-01  2:22                                 ` Hongtao Liu
  2021-06-01  2:25                                   ` Hongtao Liu
  0 siblings, 1 reply; 45+ messages in thread
From: Hongtao Liu @ 2021-06-01  2:22 UTC (permalink / raw)
  To: Jakub Jelinek, Hongtao Liu, Eric Botcazou, GCC Patches,
	Uros Bizjak, H. J. Lu, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 2516 bytes --]

On Thu, May 27, 2021 at 6:50 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Jakub Jelinek <jakub@redhat.com> writes:
> > On Thu, May 27, 2021 at 01:07:09PM +0800, Hongtao Liu via Gcc-patches wrote:
> >> +  /* The "used" flag on a call_insn indicates it's a fake call.  */
> >> +  RTX_FLAG (insn, used) = 1;
> >
> >> +      /* CALL_INSNs use the "used" flag to indicate a fake call.  */
> >> +      if (i == STACK_POINTER_REGNUM
> >> +      && !RTX_FLAG (insn_info->insn, used))
> >
> >> -      && ! SIBLING_CALL_P (insn))
> >> +      && ! SIBLING_CALL_P (insn)
> >> +      && !RTX_FLAG (insn, used))
> >
> >> -      /* For all other RTXes clear the used flag on the copy.  */
> >> -      RTX_FLAG (copy, used) = 0;
> >> +      /* For all other RTXes clear the used flag on the copy.
> >> +     CALL_INSNs use the "used" flag to indicate a fake call.  */
> >> +      if (!INSN_P (orig))
> >> +    RTX_FLAG (copy, used) = 0;
> >>        break;
> >>      }
> >>    return copy;
> >> @@ -57,7 +57,8 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
> >>    HARD_REG_SET hardregs;
> >>    unsigned regno;
> >>
> >> -  if (CALL_P (insn))
> >> +  /* CALL_INSNs use the "used" flag to indicate a fake call.  */
> >> +  if (CALL_P (insn) && !RTX_FLAG (insn, used))
> >>      return !SIBLING_CALL_P (insn);
> >
> > Please define a macro for this in rtl.h (and mention it above the used;
> > member too in a comment; see all the other comments in there), like:
> > /* 1 if RTX is a call_insn for a fake call.  */
> > #define FAKE_CALL_P(RTX)                                      \
> >   (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)
Changed.
> > Though, I'm also not sure if used can actually be used for this,
> > because it is used e.g. in emit-rtl.c for verification of RTL sharing.
>
> I thought it should be OK, since:
>
> - copy_rtx_if_shared_1 and mark_used_flags do nothing for insns
> - verify_rtx_sharing is only called for parts of an insn, rather than
>   an insn itself
>
> I guess an alternative would be to add a new rtx_code for fake call
> insns and use CALL_P to test for both.  However, that would lose the
> property that the default behaviour is conservatively correct
> (even for direct checks of CALL_INSN), so the flag IMO seems better.
>
> Thanks,
> Richard
>
> > Though, it seems no other rtl flag is free for CALL_INSN.
> > Could this fake call flag sit on the CALL rtx instead?
> >
> >       Jakub

Updated separate patch for the middle-end part.

-- 
BR,
Hongtao

[-- Attachment #2: 0001-CALL_INSN-may-not-be-a-real-function-call.patch --]
[-- Type: text/x-patch, Size: 5007 bytes --]

From 537822e0d54aa324c520a4b504dcfe882b363c7b Mon Sep 17 00:00:00 2001
From: liuhongt <hongtao.liu@intel.com>
Date: Tue, 1 Jun 2021 09:00:57 +0800
Subject: [PATCH 1/2] CALL_INSN may not be a real function call.

Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a
fake call, it won't have its own function stack.

gcc/ChangeLog

	PR target/82735
	* df-scan.c (df_get_call_refs): When call_insn is a fake call,
	it won't use the stack pointer reg.
	* final.c (leaf_function_p): When call_insn is a fake call, it
	won't affect caller as a leaf function.
	* reg-stack.c (callee_clobbers_any_stack_reg): New.
	(subst_stack_regs): When call_insn doesn't clobber any stack
	reg, don't clear the arguments.
	* rtl.c (shallow_copy_rtx): Don't clear the used flag when orig
	is an insn.
	* shrink-wrap.c (requires_stack_frame_p): No need for stack
	frame for a fake call.
	* rtl.h (FAKE_CALL_P): New macro.
---
 gcc/df-scan.c     |  3 ++-
 gcc/final.c       |  3 ++-
 gcc/reg-stack.c   | 18 +++++++++++++++++-
 gcc/rtl.c         |  6 ++++--
 gcc/rtl.h         |  5 +++++
 gcc/shrink-wrap.c |  2 +-
 6 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/gcc/df-scan.c b/gcc/df-scan.c
index 6691c3e8357..1268536b3f0 100644
--- a/gcc/df-scan.c
+++ b/gcc/df-scan.c
@@ -3090,7 +3090,8 @@ df_get_call_refs (class df_collection_rec *collection_rec,
 
   for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
     {
-      if (i == STACK_POINTER_REGNUM)
+      if (i == STACK_POINTER_REGNUM
+	  && !FAKE_CALL_P (insn_info->insn))
 	/* The stack ptr is used (honorarily) by a CALL insn.  */
 	df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
 		       NULL, bb, insn_info, DF_REF_REG_USE,
diff --git a/gcc/final.c b/gcc/final.c
index e0a70fcd830..817f7722cb2 100644
--- a/gcc/final.c
+++ b/gcc/final.c
@@ -4109,7 +4109,8 @@ leaf_function_p (void)
   for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
     {
       if (CALL_P (insn)
-	  && ! SIBLING_CALL_P (insn))
+	  && ! SIBLING_CALL_P (insn)
+	  && ! FAKE_CALL_P (insn))
 	return 0;
       if (NONJUMP_INSN_P (insn)
 	  && GET_CODE (PATTERN (insn)) == SEQUENCE
diff --git a/gcc/reg-stack.c b/gcc/reg-stack.c
index 25210f0c17f..1d9ea035cf4 100644
--- a/gcc/reg-stack.c
+++ b/gcc/reg-stack.c
@@ -174,6 +174,7 @@
 #include "reload.h"
 #include "tree-pass.h"
 #include "rtl-iter.h"
+#include "function-abi.h"
 
 #ifdef STACK_REGS
 
@@ -2368,6 +2369,18 @@ subst_asm_stack_regs (rtx_insn *insn, stack_ptr regstack)
 	    }
       }
 }
+
+/* Return true if a function call is allowed to alter some or all bits
+   of any stack reg.  */
+static bool
+callee_clobbers_any_stack_reg (const function_abi & callee_abi)
+{
+  for (unsigned regno = FIRST_STACK_REG; regno <= LAST_STACK_REG; regno++)
+    if (callee_abi.clobbers_at_least_part_of_reg_p (regno))
+      return true;
+  return false;
+}
+
 \f
 /* Substitute stack hard reg numbers for stack virtual registers in
    INSN.  Non-stack register numbers are not changed.  REGSTACK is the
@@ -2382,7 +2395,10 @@ subst_stack_regs (rtx_insn *insn, stack_ptr regstack)
   bool control_flow_insn_deleted = false;
   int i;
 
-  if (CALL_P (insn))
+  /* If the target of the call doesn't clobber any stack registers,
+     don't clear the arguments.  */
+  if (CALL_P (insn)
+      && callee_clobbers_any_stack_reg (insn_callee_abi (insn)))
     {
       int top = regstack->top;
 
diff --git a/gcc/rtl.c b/gcc/rtl.c
index b0ba1ff684c..aaee882f5ca 100644
--- a/gcc/rtl.c
+++ b/gcc/rtl.c
@@ -395,8 +395,10 @@ shallow_copy_rtx (const_rtx orig MEM_STAT_DECL)
     case SCRATCH:
       break;
     default:
-      /* For all other RTXes clear the used flag on the copy.  */
-      RTX_FLAG (copy, used) = 0;
+      /* For all other RTXes clear the used flag on the copy.
+	 CALL_INSN uses the "used" flag to indicate it's a fake call.  */
+      if (!INSN_P (orig))
+	RTX_FLAG (copy, used) = 0;
       break;
     }
   return copy;
diff --git a/gcc/rtl.h b/gcc/rtl.h
index 35178b5bfac..5ed0d6dd6fa 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -839,6 +839,11 @@ struct GTY(()) rtvec_def {
 /* Predicate yielding nonzero iff X is a call insn.  */
 #define CALL_P(X) (GET_CODE (X) == CALL_INSN)
 
+/* 1 if RTX is a call_insn for a fake call.
+   CALL_INSN uses the "used" flag to indicate it's a fake call.  */
+#define FAKE_CALL_P(RTX)                                        \
+  (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)
+
 /* Predicate yielding nonzero iff X is an insn that cannot jump.  */
 #define NONJUMP_INSN_P(X) (GET_CODE (X) == INSN)
 
diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c
index ba7b5cd56fd..5e60f34f749 100644
--- a/gcc/shrink-wrap.c
+++ b/gcc/shrink-wrap.c
@@ -57,7 +57,7 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
   HARD_REG_SET hardregs;
   unsigned regno;
 
-  if (CALL_P (insn))
+  if (CALL_P (insn) && !FAKE_CALL_P (insn))
     return !SIBLING_CALL_P (insn);
 
   /* We need a frame to get the unique CFA expected by the unwinder.  */
-- 
2.18.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-05-27  7:05                             ` Uros Bizjak
@ 2021-06-01  2:24                               ` Hongtao Liu
  2021-06-03  6:54                               ` [PATCH 1/2] CALL_INSN may not be a real function call liuhongt
  1 sibling, 0 replies; 45+ messages in thread
From: Hongtao Liu @ 2021-06-01  2:24 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Hongtao Liu via Gcc-patches, Jakub Jelinek, H. J. Lu, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 6454 bytes --]

On Thu, May 27, 2021 at 3:05 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Thu, May 27, 2021 at 7:03 AM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > Hi:
> >   This is an updated patch which implements vzeroupper as a call_insn
> > that has a special vzeroupper ABI.  In this patch I also reverted
> > r11-7684, r10-6451 and r10-3677, which fixed the same issue but in a
> > different way.
> >   Bootstrapped and regtested on x86_64-linux-gnu{-m32,} and
> > x86_64-linux-gnu{-m32 -march=cascadelake,-march=cascadelake}.
> >   Also tested the patch on SPEC2017 and EEMBC; no performance impact, as expected.
> >   Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> >         PR target/82735
> >         * config/i386/i386-expand.c (ix86_expand_builtin): Remove
> >         assignment of cfun->machine->has_explicit_vzeroupper.
> >         * config/i386/i386-features.c
> >         (ix86_add_reg_usage_to_vzerouppers): Delete.
> >         (ix86_add_reg_usage_to_vzeroupper): Ditto.
> >         (rest_of_handle_insert_vzeroupper): Remove
> >         ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
> >         of the function.
> >         (gate): Remove cfun->machine->has_explicit_vzeroupper.
> >         * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
> >         Declared.
> >         * config/i386/i386.c (ix86_insn_callee_abi): New function.
> >         (ix86_initialize_callee_abi): Ditto.
> >         (ix86_expand_avx_vzeroupper): Ditto.
> >         (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
> >         ABI.
> >         (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
> >         * config/i386/i386.h (enum i386_insn_callee_abi_index): New.
> >         (struct GTY(()) machine_function): Delete
> >         has_explicit_vzeroupper.
> >         * config/i386/i386.md (enum unspec): New member
> >         UNSPEC_CALLEE_ABI.
> >         * config/i386/predicates.md (vzeroupper_pattern): Adjust.
> >         * config/i386/sse.md (avx_vzeroupper): Call
> >         ix86_expand_avx_vzeroupper.
> >         (*avx_vzeroupper): Rename to ..
> >         (avx_vzeroupper_callee_abi): .. this, and adjust pattern as
> >         call_insn which has a special vzeroupper ABI.
> >         (*avx_vzeroupper_1): Deleted.
> >         * df-scan.c (df_get_call_refs): When call_insn is a fake call,
> >         it won't use the stack pointer reg.
> >         * final.c (leaf_function_p): When call_insn is a fake call, it
> >         won't affect the caller as a leaf function.
> >         * reg-stack.c (callee_clobbers_any_stack_reg): New.
> >         (subst_stack_regs): When call_insn doesn't clobber any stack
> >         reg, don't clear the arguments.
> >         * rtl.c (shallow_copy_rtx): Don't clear the used flag when
> >         orig is an insn.
> >         * shrink-wrap.c (requires_stack_frame_p): No need for a stack
> >         frame for a fake call.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         PR target/82735
> >         * gcc.target/i386/pr82735-1.c: New test.
> >         * gcc.target/i386/pr82735-2.c: New test.
> >         * gcc.target/i386/pr82735-3.c: New test.
> >         * gcc.target/i386/pr82735-4.c: New test.
> >         * gcc.target/i386/pr82735-5.c: New test.
>
> Please split the patch into a middle-end part and a target part.  The
> middle-end part should be approved first.
>
>  (define_expand "avx_vzeroupper"
> -  [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> -  "TARGET_AVX")
> +  [(parallel [(call (mem:QI (unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER))
> +            (const_int 0))
> +         (unspec [(const_int 1)] UNSPEC_CALLEE_ABI)])]
>
> The call insn doesn't look like a valid RTX. Why not just:
>
> +  [(parallel [(call (mem:QI (const_int 0))
> +            (const_int 0))
>
> for a fake call? Also, UNSPECV_VZEROUPPER can be removed this way since
> the const_int 1 of UNSPEC_CALLEE_ABI is now used to detect vzeroupper.
>
Changed.
> Also, you don't need the avx_vzeroupper pattern to just call
> ix86_expand_avx_vzeroupper. Just call the function directly from the
> call site:
>
>     case AVX_U128:
>       if (mode == AVX_U128_CLEAN)
>     emit_insn (gen_avx_vzeroupper ());
>       break;
>
Changed.
> +         (unspec [(const_int 1)] UNSPEC_CALLEE_ABI)])]
>
> Can this const_int 1 be somehow more descriptive? Perhaps use
> define_constant to define I386_VZEROUPPER ABI and use it in .md as
> well as .c files.
Changed.
>
> Uros.

Updated separate patch for the backend part.

gcc/ChangeLog:

        PR target/82735
        * config/i386/i386-expand.c (ix86_expand_builtin): Remove
        assignment of cfun->machine->has_explicit_vzeroupper.
        * config/i386/i386-features.c
        (ix86_add_reg_usage_to_vzerouppers): Delete.
        (ix86_add_reg_usage_to_vzeroupper): Ditto.
        (rest_of_handle_insert_vzeroupper): Remove
        ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
        of the function.
        (gate): Remove cfun->machine->has_explicit_vzeroupper.
        * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
        Declared.
        * config/i386/i386.c (ix86_insn_callee_abi): New function.
        (ix86_initialize_callee_abi): Ditto.
        (ix86_expand_avx_vzeroupper): Ditto.
        (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
        ABI.
        (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
        (ix86_emit_mode_set): Call ix86_expand_avx_vzeroupper
        directly.
        * config/i386/i386.h (struct GTY(()) machine_function): Delete
        has_explicit_vzeroupper.
        * config/i386/i386.md (enum unspec): New member
        UNSPEC_CALLEE_ABI.
        (I386_DEFAULT,I386_VZEROUPPER,I386_UNKNOWN): New
        define_constants for insn callee abi index.
        * config/i386/predicates.md (vzeroupper_pattern): Adjust.
        * config/i386/sse.md (UNSPECV_VZEROUPPER): Deleted.
        (avx_vzeroupper): Call ix86_expand_avx_vzeroupper.
        (*avx_vzeroupper): Rename to ..
        (avx_vzeroupper_callee_abi): .. this, and adjust pattern as
        call_insn which has a special vzeroupper ABI.
        (*avx_vzeroupper_1): Deleted.

gcc/testsuite/ChangeLog:

        PR target/82735
        * gcc.target/i386/pr82735-1.c: New test.
        * gcc.target/i386/pr82735-2.c: New test.
        * gcc.target/i386/pr82735-3.c: New test.
        * gcc.target/i386/pr82735-4.c: New test.
        * gcc.target/i386/pr82735-5.c: New test.
-- 
BR,
Hongtao

[-- Attachment #2: 0002-Fix-_mm256_zeroupper-by-representing-the-instruction.patch --]
[-- Type: text/x-patch, Size: 23027 bytes --]

From ec9f1fab9bebc0341c6c7b079c43fe68242de064 Mon Sep 17 00:00:00 2001
From: liuhongt <hongtao.liu@intel.com>
Date: Tue, 1 Jun 2021 09:09:44 +0800
Subject: [PATCH 2/2] Fix _mm256_zeroupper by representing the instructions as
 call_insns in which the call has a special vzeroupper ABI.

When __builtin_ia32_vzeroupper is called explicitly, the corresponding
vzeroupper pattern does not carry any CLOBBERS or SETs before LRA,
which leads to incorrect optimization in pass_reload. In order to
solve this problem, this patch refines the instructions as call_insns
in which the call has a special vzeroupper ABI.
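
Concretely (pattern shapes taken from the sse.md hunk below), the
vzeroupper pattern changes from a bare unspec_volatile to a call plus
an unspec carrying the callee ABI index:

  ;; before
  (parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])

  ;; after
  (parallel [(call (mem:QI (const_int 0))
                   (const_int 0))
             (unspec [(const_int I386_VZEROUPPER)] UNSPEC_CALLEE_ABI)])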

gcc/ChangeLog:

	PR target/82735
	* config/i386/i386-expand.c (ix86_expand_builtin): Remove
	assignment of cfun->machine->has_explicit_vzeroupper.
	* config/i386/i386-features.c
	(ix86_add_reg_usage_to_vzerouppers): Delete.
	(ix86_add_reg_usage_to_vzeroupper): Ditto.
	(rest_of_handle_insert_vzeroupper): Remove
	ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
	of the function.
	(gate): Remove cfun->machine->has_explicit_vzeroupper.
	* config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
	Declared.
	* config/i386/i386.c (ix86_insn_callee_abi): New function.
	(ix86_initialize_callee_abi): Ditto.
	(ix86_expand_avx_vzeroupper): Ditto.
	(ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
	ABI.
	(TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
	(ix86_emit_mode_set): Call ix86_expand_avx_vzeroupper
	directly.
	* config/i386/i386.h (struct GTY(()) machine_function): Delete
	has_explicit_vzeroupper.
	* config/i386/i386.md (enum unspec): New member
	UNSPEC_CALLEE_ABI.
	(I386_DEFAULT,I386_VZEROUPPER,I386_UNKNOWN): New
	define_constants for insn callee abi index.
	* config/i386/predicates.md (vzeroupper_pattern): Adjust.
	* config/i386/sse.md (UNSPECV_VZEROUPPER): Deleted.
	(avx_vzeroupper): Call ix86_expand_avx_vzeroupper.
	(*avx_vzeroupper): Rename to ..
	(avx_vzeroupper_callee_abi): .. this, and adjust pattern as
	call_insn which has a special vzeroupper ABI.
	(*avx_vzeroupper_1): Deleted.

gcc/testsuite/ChangeLog:

	PR target/82735
	* gcc.target/i386/pr82735-1.c: New test.
	* gcc.target/i386/pr82735-2.c: New test.
	* gcc.target/i386/pr82735-3.c: New test.
	* gcc.target/i386/pr82735-4.c: New test.
	* gcc.target/i386/pr82735-5.c: New test.
---
 gcc/config/i386/i386-expand.c             |  4 -
 gcc/config/i386/i386-features.c           | 99 +++--------------------
 gcc/config/i386/i386-protos.h             |  1 +
 gcc/config/i386/i386.c                    | 55 ++++++++++++-
 gcc/config/i386/i386.h                    |  4 -
 gcc/config/i386/i386.md                   | 10 +++
 gcc/config/i386/predicates.md             |  5 +-
 gcc/config/i386/sse.md                    | 59 ++++----------
 gcc/testsuite/gcc.target/i386/pr82735-1.c | 29 +++++++
 gcc/testsuite/gcc.target/i386/pr82735-2.c | 22 +++++
 gcc/testsuite/gcc.target/i386/pr82735-3.c |  5 ++
 gcc/testsuite/gcc.target/i386/pr82735-4.c | 48 +++++++++++
 gcc/testsuite/gcc.target/i386/pr82735-5.c | 54 +++++++++++++
 13 files changed, 252 insertions(+), 143 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-5.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index 9f3d41955a2..d25d59aa4e7 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -13282,10 +13282,6 @@ rdseed_step:
 
       return 0;
 
-    case IX86_BUILTIN_VZEROUPPER:
-      cfun->machine->has_explicit_vzeroupper = true;
-      break;
-
     default:
       break;
     }
diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 77783a154b6..a25769ae478 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -1768,92 +1768,22 @@ convert_scalars_to_vector (bool timode_p)
   return 0;
 }
 
-/* Modify the vzeroupper pattern in INSN so that it describes the effect
-   that the instruction has on the SSE registers.  LIVE_REGS are the set
-   of registers that are live across the instruction.
-
-   For a live register R we use:
-
-     (set (reg:V2DF R) (reg:V2DF R))
-
-   which preserves the low 128 bits but clobbers the upper bits.  */
-
-static void
-ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
-{
-  rtx pattern = PATTERN (insn);
-  unsigned int nregs = TARGET_64BIT ? 16 : 8;
-  unsigned int npats = nregs;
-  for (unsigned int i = 0; i < nregs; ++i)
-    {
-      unsigned int regno = GET_SSE_REGNO (i);
-      if (!bitmap_bit_p (live_regs, regno))
-	npats--;
-    }
-  if (npats == 0)
-    return;
-  rtvec vec = rtvec_alloc (npats + 1);
-  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
-  for (unsigned int i = 0, j = 0; i < nregs; ++i)
-    {
-      unsigned int regno = GET_SSE_REGNO (i);
-      if (!bitmap_bit_p (live_regs, regno))
-	continue;
-      rtx reg = gen_rtx_REG (V2DImode, regno);
-      ++j;
-      RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg);
-    }
-  XVEC (pattern, 0) = vec;
-  INSN_CODE (insn) = -1;
-  df_insn_rescan (insn);
-}
-
-/* Walk the vzeroupper instructions in the function and annotate them
-   with the effect that they have on the SSE registers.  */
-
-static void
-ix86_add_reg_usage_to_vzerouppers (void)
-{
-  basic_block bb;
-  rtx_insn *insn;
-  auto_bitmap live_regs;
-
-  df_analyze ();
-  FOR_EACH_BB_FN (bb, cfun)
-    {
-      bitmap_copy (live_regs, df_get_live_out (bb));
-      df_simulate_initialize_backwards (bb, live_regs);
-      FOR_BB_INSNS_REVERSE (bb, insn)
-	{
-	  if (!NONDEBUG_INSN_P (insn))
-	    continue;
-	  if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
-	    ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
-	  df_simulate_one_insn_backwards (bb, insn, live_regs);
-	}
-    }
-}
-
 static unsigned int
 rest_of_handle_insert_vzeroupper (void)
 {
-  if (TARGET_VZEROUPPER
-      && flag_expensive_optimizations
-      && !optimize_size)
-    {
-      /* vzeroupper instructions are inserted immediately after reload to
-	 account for possible spills from 256bit or 512bit registers.  The pass
-	 reuses mode switching infrastructure by re-running mode insertion
-	 pass, so disable entities that have already been processed.  */
-      for (int i = 0; i < MAX_386_ENTITIES; i++)
-	ix86_optimize_mode_switching[i] = 0;
+  /* vzeroupper instructions are inserted immediately after reload to
+     account for possible spills from 256bit or 512bit registers.  The pass
+     reuses mode switching infrastructure by re-running mode insertion
+     pass, so disable entities that have already been processed.  */
+  for (int i = 0; i < MAX_386_ENTITIES; i++)
+    ix86_optimize_mode_switching[i] = 0;
 
-      ix86_optimize_mode_switching[AVX_U128] = 1;
+  ix86_optimize_mode_switching[AVX_U128] = 1;
 
-      /* Call optimize_mode_switching.  */
-      g->get_passes ()->execute_pass_mode_switching ();
-    }
-  ix86_add_reg_usage_to_vzerouppers ();
+  /* Call optimize_mode_switching.  */
+  g->get_passes ()->execute_pass_mode_switching ();
+
+  df_analyze ();
   return 0;
 }
 
@@ -1882,11 +1812,8 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *)
     {
-      return TARGET_AVX
-	     && ((TARGET_VZEROUPPER
-		  && flag_expensive_optimizations
-		  && !optimize_size)
-		 || cfun->machine->has_explicit_vzeroupper);
+      return TARGET_AVX && TARGET_VZEROUPPER
+	&& flag_expensive_optimizations && !optimize_size;
     }
 
   virtual unsigned int execute (function *)
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 7782cf1163f..e6ac9390777 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -216,6 +216,7 @@ extern rtx ix86_split_stack_guard (void);
 extern void ix86_move_vector_high_sse_to_mmx (rtx);
 extern void ix86_split_mmx_pack (rtx[], enum rtx_code);
 extern void ix86_split_mmx_punpck (rtx[], bool);
+extern void ix86_expand_avx_vzeroupper (void);
 
 #ifdef TREE_CODE
 extern void init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree, int);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 743d8a25fe3..f0b66dd0d56 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -14426,7 +14426,7 @@ ix86_emit_mode_set (int entity, int mode, int prev_mode ATTRIBUTE_UNUSED,
       break;
     case AVX_U128:
       if (mode == AVX_U128_CLEAN)
-	emit_insn (gen_avx_vzeroupper ());
+	ix86_expand_avx_vzeroupper ();
       break;
     case I387_ROUNDEVEN:
     case I387_TRUNC:
@@ -19494,15 +19494,63 @@ ix86_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
   return false;
 }
 
+/* Implement TARGET_INSN_CALLEE_ABI.  */
+
+const predefined_function_abi &
+ix86_insn_callee_abi (const rtx_insn *insn)
+{
+  unsigned int abi_id = 0;
+  rtx pat = PATTERN (insn);
+  if (vzeroupper_pattern (pat, VOIDmode))
+    abi_id = I386_VZEROUPPER;
+
+  return function_abis[abi_id];
+}
+
+/* Initialize function_abis with corresponding abi_id,
+   currently only vzeroupper is handled.  */
+void
+ix86_initialize_callee_abi (unsigned int abi_id)
+{
+  gcc_assert (abi_id == I386_VZEROUPPER);
+  predefined_function_abi &vzeroupper_abi = function_abis[abi_id];
+  if (!vzeroupper_abi.initialized_p ())
+    {
+      HARD_REG_SET full_reg_clobbers;
+      CLEAR_HARD_REG_SET (full_reg_clobbers);
+      vzeroupper_abi.initialize (I386_VZEROUPPER, full_reg_clobbers);
+    }
+}
+
+void
+ix86_expand_avx_vzeroupper (void)
+{
+  /* Initialize vzeroupper_abi here.  */
+  ix86_initialize_callee_abi (I386_VZEROUPPER);
+  rtx_insn *insn = emit_call_insn (gen_avx_vzeroupper_callee_abi ());
+  /* Return false for non-local goto in can_nonlocal_goto.  */
+  make_reg_eh_region_note (insn, 0, INT_MIN);
+  /* The "used" flag on a call_insn indicates that it's a fake call.  */
+  RTX_FLAG (insn, used) = 1;
+}
+
+
 /* Implement TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  The only ABI that
    saves SSE registers across calls is Win64 (thus no need to check the
    current ABI here), and with AVX enabled Win64 only guarantees that
    the low 16 bytes are saved.  */
 
 static bool
-ix86_hard_regno_call_part_clobbered (unsigned int, unsigned int regno,
+ix86_hard_regno_call_part_clobbered (unsigned int abi_id, unsigned int regno,
 				     machine_mode mode)
 {
+  /* Special ABI for vzeroupper which only clobbers the higher part of sse regs.  */
+  if (abi_id == I386_VZEROUPPER)
+      return (GET_MODE_SIZE (mode) > 16
+	      && ((TARGET_64BIT
+		   && (IN_RANGE (regno, FIRST_REX_SSE_REG, LAST_REX_SSE_REG)))
+		  || (IN_RANGE (regno, FIRST_SSE_REG, LAST_SSE_REG))));
+
   return SSE_REGNO_P (regno) && GET_MODE_SIZE (mode) > 16;
 }
 
@@ -23916,6 +23964,9 @@ ix86_run_selftests (void)
 #define TARGET_HARD_REGNO_CALL_PART_CLOBBERED \
   ix86_hard_regno_call_part_clobbered
 
+#undef TARGET_INSN_CALLEE_ABI
+#define TARGET_INSN_CALLEE_ABI ix86_insn_callee_abi
+
 #undef TARGET_CAN_CHANGE_MODE_CLASS
 #define TARGET_CAN_CHANGE_MODE_CLASS ix86_can_change_mode_class
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 53d503fc6e0..919d0b2418a 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2659,10 +2659,6 @@ struct GTY(()) machine_function {
   /* True if the function needs a stack frame.  */
   BOOL_BITFIELD stack_frame_required : 1;
 
-  /* True if __builtin_ia32_vzeroupper () has been expanded in current
-     function.  */
-  BOOL_BITFIELD has_explicit_vzeroupper : 1;
-
   /* True if we should act silently, rather than raise an error for
      invalid calls.  */
   BOOL_BITFIELD silent_p : 1;
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 2fc8fae30f3..5d9f5aa39ac 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -191,6 +191,10 @@ (define_c_enum "unspec" [
   ;; For MOVDIRI and MOVDIR64B support
   UNSPEC_MOVDIRI
   UNSPEC_MOVDIR64B
+
+  ;; For insn_callee_abi:
+  UNSPEC_CALLEE_ABI
+
 ])
 
 (define_c_enum "unspecv" [
@@ -447,6 +451,12 @@ (define_constants
    (FIRST_PSEUDO_REG		76)
   ])
 
+;; Insn callee abi index.
+(define_constants
+  [(I386_DEFAULT	0)
+   (I386_VZEROUPPER	1)
+   (I386_UNKNOWN	2)])
+
 ;; Insns whose names begin with "x86_" are emitted by gen_FOO calls
 ;; from i386.c.
 
diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index abd307ebdb8..8b787553f32 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -1599,8 +1599,9 @@ (define_predicate "vzeroall_pattern"
 ;; return true if OP is a vzeroupper pattern.
 (define_predicate "vzeroupper_pattern"
   (and (match_code "parallel")
-       (match_code "unspec_volatile" "a")
-       (match_test "XINT (XVECEXP (op, 0, 0), 1) == UNSPECV_VZEROUPPER")))
+       (match_code "unspec" "b")
+       (match_test "XINT (XVECEXP (op, 0, 1), 1) == UNSPEC_CALLEE_ABI")
+       (match_test "INTVAL (XVECEXP (XVECEXP (op, 0, 1), 0, 0)) == I386_VZEROUPPER")))
 
 ;; Return true if OP is an addsub vec_merge operation
 (define_predicate "addsub_vm_operator"
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index a4503ddcb73..949347a3247 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -205,7 +205,6 @@ (define_c_enum "unspecv" [
   UNSPECV_MONITOR
   UNSPECV_MWAIT
   UNSPECV_VZEROALL
-  UNSPECV_VZEROUPPER
 
   ;; For KEYLOCKER
   UNSPECV_LOADIWKEY
@@ -20857,14 +20856,22 @@ (define_insn "*avx_vzeroall"
 ;; if the upper 128bits are unused.  Initially we expand the instructions
 ;; as though they had no effect on the SSE registers, but later add SETs and
 ;; CLOBBERs to the PARALLEL to model the real effect.
+
 (define_expand "avx_vzeroupper"
-  [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
-  "TARGET_AVX")
+  [(parallel [(call (mem:QI (const_int 0))
+		    (const_int 0))
+	     (unspec [(const_int I386_VZEROUPPER)] UNSPEC_CALLEE_ABI)])]
+  "TARGET_AVX"
+{
+  ix86_expand_avx_vzeroupper ();
+  DONE;
+})
 
-(define_insn "*avx_vzeroupper"
-  [(match_parallel 0 "vzeroupper_pattern"
-     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
-  "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 16 : 8) + 1"
+(define_insn "avx_vzeroupper_callee_abi"
+  [(call (mem:QI (const_int 0))
+	 (const_int 0))
+    (unspec [(const_int I386_VZEROUPPER)] UNSPEC_CALLEE_ABI)]
+  "TARGET_AVX"
   "vzeroupper"
   [(set_attr "type" "sse")
    (set_attr "modrm" "0")
@@ -20873,44 +20880,6 @@ (define_insn "*avx_vzeroupper"
    (set_attr "btver2_decode" "vector")
    (set_attr "mode" "OI")])
 
-(define_insn_and_split "*avx_vzeroupper_1"
-  [(match_parallel 0 "vzeroupper_pattern"
-     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
-  "TARGET_AVX && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1"
-  "#"
-  "&& epilogue_completed"
-  [(match_dup 0)]
-{
-  /* For IPA-RA purposes, make it clear the instruction clobbers
-     even XMM registers not mentioned explicitly in the pattern.  */
-  unsigned int nregs = TARGET_64BIT ? 16 : 8;
-  unsigned int npats = XVECLEN (operands[0], 0);
-  rtvec vec = rtvec_alloc (nregs + 1);
-  RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0);
-  for (unsigned int i = 0, j = 1; i < nregs; ++i)
-    {
-      unsigned int regno = GET_SSE_REGNO (i);
-      if (j < npats
-	  && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno)
-	{
-	  RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j);
-	  j++;
-	}
-      else
-	{
-	  rtx reg = gen_rtx_REG (V2DImode, regno);
-	  RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
-	}
-    }
-  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
-}
-  [(set_attr "type" "sse")
-   (set_attr "modrm" "0")
-   (set_attr "memory" "none")
-   (set_attr "prefix" "vex")
-   (set_attr "btver2_decode" "vector")
-   (set_attr "mode" "OI")])
-
 (define_mode_attr pbroadcast_evex_isa
   [(V64QI "avx512bw") (V32QI "avx512bw") (V16QI "avx512bw")
    (V32HI "avx512bw") (V16HI "avx512bw") (V8HI "avx512bw")
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-1.c b/gcc/testsuite/gcc.target/i386/pr82735-1.c
new file mode 100644
index 00000000000..1a63b9ae9c9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-1.c
@@ -0,0 +1,29 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx" } */
+/* { dg-require-effective-target avx } */
+
+#include "avx-check.h"
+
+void
+__attribute__ ((noipa))
+mtest(char *dest)
+{
+  __m256i ymm1 = _mm256_set1_epi8((char)0x1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm1);
+  _mm256_zeroupper();
+  __m256i ymm2 = _mm256_set1_epi8((char)0x1);
+  _mm256_storeu_si256((__m256i *)dest, ymm2);
+}
+
+void
+avx_test ()
+{
+  char buf[64];
+  for (int i = 0; i != 64; i++)
+    buf[i] = 2;
+  mtest (buf);
+
+  for (int i = 0; i < 32; ++i)
+    if (buf[i] != 1)
+      __builtin_abort ();
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-2.c b/gcc/testsuite/gcc.target/i386/pr82735-2.c
new file mode 100644
index 00000000000..ac9d006f794
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-2.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx -O2" } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  /* xmm1 can be propagated to xmm2 by CSE.  */
+  __m128i xmm1 = _mm_set_epi8(0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
+			      0x9, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16);
+  _mm_storeu_si128((__m128i *)(dest + 32), xmm1);
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  __m128i xmm2 = xmm1;
+  _mm_storeu_si128((__m128i *)dest, xmm2);
+}
+
+/* Darwin local constant symbol is "lC0", ELF targets ".LC0" */
+/* { dg-final { scan-assembler-times {(?n)vmovdqa\t\.?[Ll]C0[^,]*, %xmm[0-9]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-3.c b/gcc/testsuite/gcc.target/i386/pr82735-3.c
new file mode 100644
index 00000000000..e3f801e6924
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-3.c
@@ -0,0 +1,5 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx -O2 -mabi=ms" } */
+/* { dg-final { scan-assembler-not {(?n)xmm([6-9]|1[0-5])} } } */
+
+#include "pr82735-2.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-4.c b/gcc/testsuite/gcc.target/i386/pr82735-4.c
new file mode 100644
index 00000000000..78c0a6cb2c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-4.c
@@ -0,0 +1,48 @@
+/* { dg-do compile { target { ! ia32 } } }  */
+/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
+  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
+		"vmovdqa\t%%ymm0, %1\n\t"
+		"vmovdqa\t%%ymm0, %2\n\t"
+		"vmovdqa\t%%ymm0, %3\n\t"
+		"vmovdqa\t%%ymm0, %4\n\t"
+		"vmovdqa\t%%ymm0, %5\n\t"
+		"vmovdqa\t%%ymm0, %6\n\t"
+		"vmovdqa\t%%ymm0, %7\n\t"
+		"vmovdqa\t%%ymm0, %8\n\t"
+		"vmovdqa\t%%ymm0, %9\n\t"
+		"vmovdqa\t%%ymm0, %10\n\t"
+		"vmovdqa\t%%ymm0, %11\n\t"
+		"vmovdqa\t%%ymm0, %12\n\t"
+		"vmovdqa\t%%ymm0, %13\n\t"
+		"vmovdqa\t%%ymm0, %14\n\t"
+		"vmovdqa\t%%ymm0, %15\n\t"
+		: "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
+		  "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
+		  "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
+		  "=v"(ymm0)
+		::);
+  _mm256_zeroupper();
+  _mm256_storeu_si256((__m256i *)dest, ymm1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-5.c b/gcc/testsuite/gcc.target/i386/pr82735-5.c
new file mode 100644
index 00000000000..2a58cbe52d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-5.c
@@ -0,0 +1,54 @@
+/* { dg-do compile { target { ! ia32 } } }  */
+/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
+  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
+		"vmovdqa\t%%ymm0, %1\n\t"
+		"vmovdqa\t%%ymm0, %2\n\t"
+		"vmovdqa\t%%ymm0, %3\n\t"
+		"vmovdqa\t%%ymm0, %4\n\t"
+		"vmovdqa\t%%ymm0, %5\n\t"
+		"vmovdqa\t%%ymm0, %6\n\t"
+		"vmovdqa\t%%ymm0, %7\n\t"
+		"vmovdqa\t%%ymm0, %8\n\t"
+		"vmovdqa\t%%ymm0, %9\n\t"
+		"vmovdqa\t%%ymm0, %10\n\t"
+		"vmovdqa\t%%ymm0, %11\n\t"
+		"vmovdqa\t%%ymm0, %12\n\t"
+		"vmovdqa\t%%ymm0, %13\n\t"
+		"vmovdqa\t%%ymm0, %14\n\t"
+		"vmovdqa\t%%ymm0, %15\n\t"
+		: "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
+		  "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
+		  "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
+		  "=v"(ymm0)
+		::);
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_storeu_si256((__m256i *)dest, ymm1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
+}
-- 
2.18.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]
  2021-06-01  2:22                                 ` Hongtao Liu
@ 2021-06-01  2:25                                   ` Hongtao Liu
  0 siblings, 0 replies; 45+ messages in thread
From: Hongtao Liu @ 2021-06-01  2:25 UTC (permalink / raw)
  To: Jakub Jelinek, Hongtao Liu, Eric Botcazou, GCC Patches,
	Uros Bizjak, H. J. Lu, Richard Sandiford

On Tue, Jun 1, 2021 at 10:22 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Thu, May 27, 2021 at 6:50 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Jakub Jelinek <jakub@redhat.com> writes:
> > > On Thu, May 27, 2021 at 01:07:09PM +0800, Hongtao Liu via Gcc-patches wrote:
> > >> +  /* The "used" flag on a call_insn indicates that it's a fake call.  */
> > >> +  RTX_FLAG (insn, used) = 1;
> > >
> > >> +      /* CALL_INSN uses the "used" flag to indicate it's a fake call.  */
> > >> +      if (i == STACK_POINTER_REGNUM
> > >> +      && !RTX_FLAG (insn_info->insn, used))
> > >
> > >> -      && ! SIBLING_CALL_P (insn))
> > >> +      && ! SIBLING_CALL_P (insn)
> > >> +      && !RTX_FLAG (insn, used))
> > >
> > >> -      /* For all other RTXes clear the used flag on the copy.  */
> > >> -      RTX_FLAG (copy, used) = 0;
> > >> +      /* For all other RTXes clear the used flag on the copy.
> > >> +     CALL_INSN uses the "used" flag to indicate it's a fake call.  */
> > >> +      if (!INSN_P (orig))
> > >> +    RTX_FLAG (copy, used) = 0;
> > >>        break;
> > >>      }
> > >>    return copy;
> > >> @@ -57,7 +57,8 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
> > >>    HARD_REG_SET hardregs;
> > >>    unsigned regno;
> > >>
> > >> -  if (CALL_P (insn))
> > >> +  /* CALL_INSN uses the "used" flag to indicate it's a fake call.  */
> > >> +  if (CALL_P (insn) && !RTX_FLAG (insn, used))
> > >>      return !SIBLING_CALL_P (insn);
> > >
> > > Please define a macro for this in rtl.h (and mention it above used;
> > > member too in a comment, see all the other comments in there), like:
> > > /* 1 if RTX is a call_insn for a fake call.  */
> > > #define FAKE_CALL_P(RTX)                                      \
> > >   (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)
> Changed.
> > > Though, I'm also not sure if used can be actually used for this,
> > > because it is used e.g. in emit-rtl.c for verification of RTL sharing.
> >
> > I thought it should be OK, since:
> >
> > - copy_rtx_if_shared_1 and mark_used_flags do nothing for insns
> > - verify_rtx_sharing is only called for parts of an insn, rather than
> >   an insn itself
> >
> > I guess an alternative would be to add a new rtx_code for fake call
> > insns and use CALL_P to test for both.  However, that would lose the
> > property that the default behaviour is conservatively correct
> > (even for direct checks of CALL_INSN), so the flag IMO seems better.
> >
> > Thanks,
> > Richard
> >
> > > Though, it seems no other rtl flag is free for CALL_INSN.
> > > Could this fake call flag sit on the CALL rtx instead?
> > >
> > >       Jakub
>
> Updated separate patch for the middle-end part.

gcc/ChangeLog

        PR target/82735
        * df-scan.c (df_get_call_refs): When call_insn is a fake call,
        it won't use the stack pointer reg.
        * final.c (leaf_function_p): When call_insn is a fake call, it
        won't affect the caller as a leaf function.
        * reg-stack.c (callee_clobbers_any_stack_reg): New.
        (subst_stack_regs): When call_insn doesn't clobber any stack
        reg, don't clear the arguments.
        * rtl.c (shallow_copy_rtx): Don't clear the used flag when orig
        is an insn.
        * shrink-wrap.c (requires_stack_frame_p): No need for a stack
        frame for a fake call.
        * rtl.h (FAKE_CALL_P): New macro.

>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-05-27  7:05                             ` Uros Bizjak
  2021-06-01  2:24                               ` Hongtao Liu
@ 2021-06-03  6:54                               ` liuhongt
  2021-06-03  6:54                                 ` [PATCH 2/2] Fix _mm256_zeroupper by representing the instructions as call_insns in which the call has a special vzeroupper ABI liuhongt
                                                   ` (3 more replies)
  1 sibling, 4 replies; 45+ messages in thread
From: liuhongt @ 2021-06-03  6:54 UTC (permalink / raw)
  To: gcc-patches, jakub, richard.sandiford, ubizjak

Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a
fake call, it won't have its own function stack.
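
One consequence worth spelling out (a sketch using the names from the
reg-stack.c hunk below): subst_stack_regs now pops the x87 argument
registers only when the callee's ABI can clobber a stack register at
all.  Since the vzeroupper ABI introduced in patch 2/2 clobbers no x87
registers, a fake vzeroupper call leaves the register stack untouched:

  if (CALL_P (insn)
      && callee_clobbers_any_stack_reg (insn_callee_abi (insn)))
    {
      /* ... the existing code that pops the argument registers ... */
    }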

gcc/ChangeLog

	PR target/82735
	* df-scan.c (df_get_call_refs): When call_insn is a fake call,
	it won't use the stack pointer reg.
	* final.c (leaf_function_p): When call_insn is a fake call, it
	won't affect the caller as a leaf function.
	* reg-stack.c (callee_clobbers_any_stack_reg): New.
	(subst_stack_regs): When call_insn doesn't clobber any stack
	reg, don't clear the arguments.
	* rtl.c (shallow_copy_rtx): Don't clear the used flag when orig
	is an insn.
	* shrink-wrap.c (requires_stack_frame_p): No need for a stack
	frame for a fake call.
	* rtl.h (FAKE_CALL_P): New macro.
---
 gcc/df-scan.c     |  3 ++-
 gcc/final.c       |  3 ++-
 gcc/reg-stack.c   | 18 +++++++++++++++++-
 gcc/rtl.c         |  6 ++++--
 gcc/rtl.h         |  5 +++++
 gcc/shrink-wrap.c |  2 +-
 6 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/gcc/df-scan.c b/gcc/df-scan.c
index 6691c3e8357..1268536b3f0 100644
--- a/gcc/df-scan.c
+++ b/gcc/df-scan.c
@@ -3090,7 +3090,8 @@ df_get_call_refs (class df_collection_rec *collection_rec,
 
   for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
     {
-      if (i == STACK_POINTER_REGNUM)
+      if (i == STACK_POINTER_REGNUM
+	  && !FAKE_CALL_P (insn_info->insn))
 	/* The stack ptr is used (honorarily) by a CALL insn.  */
 	df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
 		       NULL, bb, insn_info, DF_REF_REG_USE,
diff --git a/gcc/final.c b/gcc/final.c
index e0a70fcd830..817f7722cb2 100644
--- a/gcc/final.c
+++ b/gcc/final.c
@@ -4109,7 +4109,8 @@ leaf_function_p (void)
   for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
     {
       if (CALL_P (insn)
-	  && ! SIBLING_CALL_P (insn))
+	  && ! SIBLING_CALL_P (insn)
+	  && ! FAKE_CALL_P (insn))
 	return 0;
       if (NONJUMP_INSN_P (insn)
 	  && GET_CODE (PATTERN (insn)) == SEQUENCE
diff --git a/gcc/reg-stack.c b/gcc/reg-stack.c
index 25210f0c17f..1d9ea035cf4 100644
--- a/gcc/reg-stack.c
+++ b/gcc/reg-stack.c
@@ -174,6 +174,7 @@
 #include "reload.h"
 #include "tree-pass.h"
 #include "rtl-iter.h"
+#include "function-abi.h"
 
 #ifdef STACK_REGS
 
@@ -2368,6 +2369,18 @@ subst_asm_stack_regs (rtx_insn *insn, stack_ptr regstack)
 	    }
       }
 }
+
+/* Return true if a function call is allowed to alter some or all bits
+   of any stack reg.  */
+static bool
+callee_clobbers_any_stack_reg (const function_abi & callee_abi)
+{
+  for (unsigned regno = FIRST_STACK_REG; regno <= LAST_STACK_REG; regno++)
+    if (callee_abi.clobbers_at_least_part_of_reg_p (regno))
+      return true;
+  return false;
+}
+
 \f
 /* Substitute stack hard reg numbers for stack virtual registers in
    INSN.  Non-stack register numbers are not changed.  REGSTACK is the
@@ -2382,7 +2395,10 @@ subst_stack_regs (rtx_insn *insn, stack_ptr regstack)
   bool control_flow_insn_deleted = false;
   int i;
 
-  if (CALL_P (insn))
+  /* If the target of the call doesn't clobber any stack registers,
+     don't clear the arguments.  */
+  if (CALL_P (insn)
+      && callee_clobbers_any_stack_reg (insn_callee_abi (insn)))
     {
       int top = regstack->top;
 
diff --git a/gcc/rtl.c b/gcc/rtl.c
index b0ba1ff684c..aaee882f5ca 100644
--- a/gcc/rtl.c
+++ b/gcc/rtl.c
@@ -395,8 +395,10 @@ shallow_copy_rtx (const_rtx orig MEM_STAT_DECL)
     case SCRATCH:
       break;
     default:
-      /* For all other RTXes clear the used flag on the copy.  */
-      RTX_FLAG (copy, used) = 0;
+      /* For all other RTXes clear the used flag on the copy.
+	 CALL_INSN uses the "used" flag to indicate it's a fake call.  */
+      if (!INSN_P (orig))
+	RTX_FLAG (copy, used) = 0;
       break;
     }
   return copy;
diff --git a/gcc/rtl.h b/gcc/rtl.h
index 35178b5bfac..5ed0d6dd6fa 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -839,6 +839,11 @@ struct GTY(()) rtvec_def {
 /* Predicate yielding nonzero iff X is a call insn.  */
 #define CALL_P(X) (GET_CODE (X) == CALL_INSN)
 
+/* 1 if RTX is a call_insn for a fake call.
+   CALL_INSN uses the "used" flag to indicate it's a fake call.  */
+#define FAKE_CALL_P(RTX)                                        \
+  (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)
+
 /* Predicate yielding nonzero iff X is an insn that cannot jump.  */
 #define NONJUMP_INSN_P(X) (GET_CODE (X) == INSN)
 
diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c
index ba7b5cd56fd..5e60f34f749 100644
--- a/gcc/shrink-wrap.c
+++ b/gcc/shrink-wrap.c
@@ -57,7 +57,7 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
   HARD_REG_SET hardregs;
   unsigned regno;
 
-  if (CALL_P (insn))
+  if (CALL_P (insn) && !FAKE_CALL_P (insn))
     return !SIBLING_CALL_P (insn);
 
   /* We need a frame to get the unique CFA expected by the unwinder.  */
-- 
2.18.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 2/2] Fix _mm256_zeroupper by representing the instructions as call_insns in which the call has a special vzeroupper ABI.
  2021-06-03  6:54                               ` [PATCH 1/2] CALL_INSN may not be a real function call liuhongt
@ 2021-06-03  6:54                                 ` liuhongt
  2021-06-04  2:56                                   ` Hongtao Liu
  2021-06-04  6:26                                   ` Uros Bizjak
  2021-06-04  2:55                                 ` [PATCH 1/2] CALL_INSN may not be a real function call Hongtao Liu
                                                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 45+ messages in thread
From: liuhongt @ 2021-06-03  6:54 UTC (permalink / raw)
  To: gcc-patches, jakub, richard.sandiford, ubizjak

When __builtin_ia32_vzeroupper is called explicitly, the corresponding
vzeroupper pattern does not carry any CLOBBERS or SETs before LRA,
which leads to incorrect optimization in pass_reload. In order to
solve this problem, this patch refines the instructions as call_insns
in which the call has a special vzeroupper ABI.
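
The failure mode, condensed from the mtest function in the pr82735-1.c
testcase below: without any CLOBBERs on the pattern, post-reload
optimizers could satisfy the second vector from a register whose upper
128 bits had already been zeroed by the vzeroupper.

  __m256i ymm1 = _mm256_set1_epi8 ((char) 0x1);
  _mm256_storeu_si256 ((__m256i *) (dest + 32), ymm1);
  _mm256_zeroupper ();	/* upper halves of all ymm registers die here */
  __m256i ymm2 = _mm256_set1_epi8 ((char) 0x1);
  _mm256_storeu_si256 ((__m256i *) dest, ymm2);	/* must store 32 bytes of 0x1 */

With the call_insn representation, insn_callee_abi tells the register
allocator exactly which registers the fake call partially clobbers, so
ymm2 is recomputed or reloaded correctly.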

gcc/ChangeLog:

	PR target/82735
	* config/i386/i386-expand.c (ix86_expand_builtin): Remove
	assignment of cfun->machine->has_explicit_vzeroupper.
	* config/i386/i386-features.c
	(ix86_add_reg_usage_to_vzerouppers): Delete.
	(ix86_add_reg_usage_to_vzeroupper): Ditto.
	(rest_of_handle_insert_vzeroupper): Remove
	ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
	of the function.
	(gate): Remove cfun->machine->has_explicit_vzeroupper.
	* config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
	Declared.
	* config/i386/i386.c (ix86_insn_callee_abi): New function.
	(ix86_initialize_callee_abi): Ditto.
	(ix86_expand_avx_vzeroupper): Ditto.
	(ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
	ABI.
	(TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
	(ix86_emit_mode_set): Call ix86_expand_avx_vzeroupper
	directly.
	* config/i386/i386.h (struct GTY(()) machine_function): Delete
	has_explicit_vzeroupper.
	* config/i386/i386.md (enum unspec): New member
	UNSPEC_CALLEE_ABI.
	(I386_DEFAULT,I386_VZEROUPPER,I386_UNKNOWN): New
	define_constants for insn callee abi index.
	* config/i386/predicates.md (vzeroupper_pattern): Adjust.
	* config/i386/sse.md (UNSPECV_VZEROUPPER): Deleted.
	(avx_vzeroupper): Call ix86_expand_avx_vzeroupper.
	(*avx_vzeroupper): Rename to ..
	(avx_vzeroupper_callee_abi): .. this, and adjust pattern as
	call_insn which has a special vzeroupper ABI.
	(*avx_vzeroupper_1): Deleted.

gcc/testsuite/ChangeLog:

	PR target/82735
	* gcc.target/i386/pr82735-1.c: New test.
	* gcc.target/i386/pr82735-2.c: New test.
	* gcc.target/i386/pr82735-3.c: New test.
	* gcc.target/i386/pr82735-4.c: New test.
	* gcc.target/i386/pr82735-5.c: New test.
---
 gcc/config/i386/i386-expand.c             |  4 -
 gcc/config/i386/i386-features.c           | 99 +++--------------------
 gcc/config/i386/i386-protos.h             |  1 +
 gcc/config/i386/i386.c                    | 55 ++++++++++++-
 gcc/config/i386/i386.h                    |  4 -
 gcc/config/i386/i386.md                   | 10 +++
 gcc/config/i386/predicates.md             |  5 +-
 gcc/config/i386/sse.md                    | 59 ++++----------
 gcc/testsuite/gcc.target/i386/pr82735-1.c | 29 +++++++
 gcc/testsuite/gcc.target/i386/pr82735-2.c | 22 +++++
 gcc/testsuite/gcc.target/i386/pr82735-3.c |  5 ++
 gcc/testsuite/gcc.target/i386/pr82735-4.c | 48 +++++++++++
 gcc/testsuite/gcc.target/i386/pr82735-5.c | 54 +++++++++++++
 13 files changed, 252 insertions(+), 143 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-5.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index 9f3d41955a2..d25d59aa4e7 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -13282,10 +13282,6 @@ rdseed_step:
 
       return 0;
 
-    case IX86_BUILTIN_VZEROUPPER:
-      cfun->machine->has_explicit_vzeroupper = true;
-      break;
-
     default:
       break;
     }
diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 77783a154b6..a25769ae478 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -1768,92 +1768,22 @@ convert_scalars_to_vector (bool timode_p)
   return 0;
 }
 
-/* Modify the vzeroupper pattern in INSN so that it describes the effect
-   that the instruction has on the SSE registers.  LIVE_REGS are the set
-   of registers that are live across the instruction.
-
-   For a live register R we use:
-
-     (set (reg:V2DF R) (reg:V2DF R))
-
-   which preserves the low 128 bits but clobbers the upper bits.  */
-
-static void
-ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
-{
-  rtx pattern = PATTERN (insn);
-  unsigned int nregs = TARGET_64BIT ? 16 : 8;
-  unsigned int npats = nregs;
-  for (unsigned int i = 0; i < nregs; ++i)
-    {
-      unsigned int regno = GET_SSE_REGNO (i);
-      if (!bitmap_bit_p (live_regs, regno))
-	npats--;
-    }
-  if (npats == 0)
-    return;
-  rtvec vec = rtvec_alloc (npats + 1);
-  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
-  for (unsigned int i = 0, j = 0; i < nregs; ++i)
-    {
-      unsigned int regno = GET_SSE_REGNO (i);
-      if (!bitmap_bit_p (live_regs, regno))
-	continue;
-      rtx reg = gen_rtx_REG (V2DImode, regno);
-      ++j;
-      RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg);
-    }
-  XVEC (pattern, 0) = vec;
-  INSN_CODE (insn) = -1;
-  df_insn_rescan (insn);
-}
-
-/* Walk the vzeroupper instructions in the function and annotate them
-   with the effect that they have on the SSE registers.  */
-
-static void
-ix86_add_reg_usage_to_vzerouppers (void)
-{
-  basic_block bb;
-  rtx_insn *insn;
-  auto_bitmap live_regs;
-
-  df_analyze ();
-  FOR_EACH_BB_FN (bb, cfun)
-    {
-      bitmap_copy (live_regs, df_get_live_out (bb));
-      df_simulate_initialize_backwards (bb, live_regs);
-      FOR_BB_INSNS_REVERSE (bb, insn)
-	{
-	  if (!NONDEBUG_INSN_P (insn))
-	    continue;
-	  if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
-	    ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
-	  df_simulate_one_insn_backwards (bb, insn, live_regs);
-	}
-    }
-}
-
 static unsigned int
 rest_of_handle_insert_vzeroupper (void)
 {
-  if (TARGET_VZEROUPPER
-      && flag_expensive_optimizations
-      && !optimize_size)
-    {
-      /* vzeroupper instructions are inserted immediately after reload to
-	 account for possible spills from 256bit or 512bit registers.  The pass
-	 reuses mode switching infrastructure by re-running mode insertion
-	 pass, so disable entities that have already been processed.  */
-      for (int i = 0; i < MAX_386_ENTITIES; i++)
-	ix86_optimize_mode_switching[i] = 0;
+  /* vzeroupper instructions are inserted immediately after reload to
+     account for possible spills from 256bit or 512bit registers.  The pass
+     reuses mode switching infrastructure by re-running mode insertion
+     pass, so disable entities that have already been processed.  */
+  for (int i = 0; i < MAX_386_ENTITIES; i++)
+    ix86_optimize_mode_switching[i] = 0;
 
-      ix86_optimize_mode_switching[AVX_U128] = 1;
+  ix86_optimize_mode_switching[AVX_U128] = 1;
 
-      /* Call optimize_mode_switching.  */
-      g->get_passes ()->execute_pass_mode_switching ();
-    }
-  ix86_add_reg_usage_to_vzerouppers ();
+  /* Call optimize_mode_switching.  */
+  g->get_passes ()->execute_pass_mode_switching ();
+
+  df_analyze ();
   return 0;
 }
 
@@ -1882,11 +1812,8 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *)
     {
-      return TARGET_AVX
-	     && ((TARGET_VZEROUPPER
-		  && flag_expensive_optimizations
-		  && !optimize_size)
-		 || cfun->machine->has_explicit_vzeroupper);
+      return TARGET_AVX && TARGET_VZEROUPPER
+	&& flag_expensive_optimizations && !optimize_size;
     }
 
   virtual unsigned int execute (function *)
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 7782cf1163f..e6ac9390777 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -216,6 +216,7 @@ extern rtx ix86_split_stack_guard (void);
 extern void ix86_move_vector_high_sse_to_mmx (rtx);
 extern void ix86_split_mmx_pack (rtx[], enum rtx_code);
 extern void ix86_split_mmx_punpck (rtx[], bool);
+extern void ix86_expand_avx_vzeroupper (void);
 
 #ifdef TREE_CODE
 extern void init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree, int);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 743d8a25fe3..f0b66dd0d56 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -14426,7 +14426,7 @@ ix86_emit_mode_set (int entity, int mode, int prev_mode ATTRIBUTE_UNUSED,
       break;
     case AVX_U128:
       if (mode == AVX_U128_CLEAN)
-	emit_insn (gen_avx_vzeroupper ());
+	ix86_expand_avx_vzeroupper ();
       break;
     case I387_ROUNDEVEN:
     case I387_TRUNC:
@@ -19494,15 +19494,63 @@ ix86_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
   return false;
 }
 
+/* Implement TARGET_INSN_CALLEE_ABI.  */
+
+const predefined_function_abi &
+ix86_insn_callee_abi (const rtx_insn *insn)
+{
+  unsigned int abi_id = 0;
+  rtx pat = PATTERN (insn);
+  if (vzeroupper_pattern (pat, VOIDmode))
+    abi_id = I386_VZEROUPPER;
+
+  return function_abis[abi_id];
+}
+
+/* Initialize function_abis with corresponding abi_id,
+   currently only vzeroupper is handled.  */
+void
+ix86_initialize_callee_abi (unsigned int abi_id)
+{
+  gcc_assert (abi_id == I386_VZEROUPPER);
+  predefined_function_abi &vzeroupper_abi = function_abis[abi_id];
+  if (!vzeroupper_abi.initialized_p ())
+    {
+      HARD_REG_SET full_reg_clobbers;
+      CLEAR_HARD_REG_SET (full_reg_clobbers);
+      vzeroupper_abi.initialize (I386_VZEROUPPER, full_reg_clobbers);
+    }
+}
+
+void
+ix86_expand_avx_vzeroupper (void)
+{
+  /* Initialize vzeroupper_abi here.  */
+  ix86_initialize_callee_abi (I386_VZEROUPPER);
+  rtx_insn *insn = emit_call_insn (gen_avx_vzeroupper_callee_abi ());
+  /* Return false for non-local goto in can_nonlocal_goto.  */
+  make_reg_eh_region_note (insn, 0, INT_MIN);
+  /* The "used" flag on a call_insn indicates that it's a fake call.  */
+  RTX_FLAG (insn, used) = 1;
+}
+
+
 /* Implement TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  The only ABI that
    saves SSE registers across calls is Win64 (thus no need to check the
    current ABI here), and with AVX enabled Win64 only guarantees that
    the low 16 bytes are saved.  */
 
 static bool
-ix86_hard_regno_call_part_clobbered (unsigned int, unsigned int regno,
+ix86_hard_regno_call_part_clobbered (unsigned int abi_id, unsigned int regno,
 				     machine_mode mode)
 {
+  /* Special ABI for vzeroupper which only clobbers the higher part of sse regs.  */
+  if (abi_id == I386_VZEROUPPER)
+      return (GET_MODE_SIZE (mode) > 16
+	      && ((TARGET_64BIT
+		   && (IN_RANGE (regno, FIRST_REX_SSE_REG, LAST_REX_SSE_REG)))
+		  || (IN_RANGE (regno, FIRST_SSE_REG, LAST_SSE_REG))));
+
   return SSE_REGNO_P (regno) && GET_MODE_SIZE (mode) > 16;
 }
 
@@ -23916,6 +23964,9 @@ ix86_run_selftests (void)
 #define TARGET_HARD_REGNO_CALL_PART_CLOBBERED \
   ix86_hard_regno_call_part_clobbered
 
+#undef TARGET_INSN_CALLEE_ABI
+#define TARGET_INSN_CALLEE_ABI ix86_insn_callee_abi
+
 #undef TARGET_CAN_CHANGE_MODE_CLASS
 #define TARGET_CAN_CHANGE_MODE_CLASS ix86_can_change_mode_class
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 53d503fc6e0..919d0b2418a 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2659,10 +2659,6 @@ struct GTY(()) machine_function {
   /* True if the function needs a stack frame.  */
   BOOL_BITFIELD stack_frame_required : 1;
 
-  /* True if __builtin_ia32_vzeroupper () has been expanded in current
-     function.  */
-  BOOL_BITFIELD has_explicit_vzeroupper : 1;
-
   /* True if we should act silently, rather than raise an error for
      invalid calls.  */
   BOOL_BITFIELD silent_p : 1;
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 2fc8fae30f3..5d9f5aa39ac 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -191,6 +191,10 @@ (define_c_enum "unspec" [
   ;; For MOVDIRI and MOVDIR64B support
   UNSPEC_MOVDIRI
   UNSPEC_MOVDIR64B
+
+  ;; For insn_callee_abi:
+  UNSPEC_CALLEE_ABI
+
 ])
 
 (define_c_enum "unspecv" [
@@ -447,6 +451,12 @@ (define_constants
    (FIRST_PSEUDO_REG		76)
   ])
 
+;; Insn callee abi index.
+(define_constants
+  [(I386_DEFAULT	0)
+   (I386_VZEROUPPER	1)
+   (I386_UNKNOWN	2)])
+
 ;; Insns whose names begin with "x86_" are emitted by gen_FOO calls
 ;; from i386.c.
 
diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index abd307ebdb8..8b787553f32 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -1599,8 +1599,9 @@ (define_predicate "vzeroall_pattern"
 ;; return true if OP is a vzeroupper pattern.
 (define_predicate "vzeroupper_pattern"
   (and (match_code "parallel")
-       (match_code "unspec_volatile" "a")
-       (match_test "XINT (XVECEXP (op, 0, 0), 1) == UNSPECV_VZEROUPPER")))
+       (match_code "unspec" "b")
+       (match_test "XINT (XVECEXP (op, 0, 1), 1) == UNSPEC_CALLEE_ABI")
+       (match_test "INTVAL (XVECEXP (XVECEXP (op, 0, 1), 0, 0)) == I386_VZEROUPPER")))
 
 ;; Return true if OP is an addsub vec_merge operation
 (define_predicate "addsub_vm_operator"
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index a4503ddcb73..949347a3247 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -205,7 +205,6 @@ (define_c_enum "unspecv" [
   UNSPECV_MONITOR
   UNSPECV_MWAIT
   UNSPECV_VZEROALL
-  UNSPECV_VZEROUPPER
 
   ;; For KEYLOCKER
   UNSPECV_LOADIWKEY
@@ -20857,14 +20856,22 @@ (define_insn "*avx_vzeroall"
 ;; if the upper 128bits are unused.  Initially we expand the instructions
 ;; as though they had no effect on the SSE registers, but later add SETs and
 ;; CLOBBERs to the PARALLEL to model the real effect.
+
 (define_expand "avx_vzeroupper"
-  [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
-  "TARGET_AVX")
+  [(parallel [(call (mem:QI (const_int 0))
+		    (const_int 0))
+	     (unspec [(const_int I386_VZEROUPPER)] UNSPEC_CALLEE_ABI)])]
+  "TARGET_AVX"
+{
+  ix86_expand_avx_vzeroupper ();
+  DONE;
+})
 
-(define_insn "*avx_vzeroupper"
-  [(match_parallel 0 "vzeroupper_pattern"
-     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
-  "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 16 : 8) + 1"
+(define_insn "avx_vzeroupper_callee_abi"
+  [(call (mem:QI (const_int 0))
+	 (const_int 0))
+    (unspec [(const_int I386_VZEROUPPER)] UNSPEC_CALLEE_ABI)]
+  "TARGET_AVX"
   "vzeroupper"
   [(set_attr "type" "sse")
    (set_attr "modrm" "0")
@@ -20873,44 +20880,6 @@ (define_insn "*avx_vzeroupper"
    (set_attr "btver2_decode" "vector")
    (set_attr "mode" "OI")])
 
-(define_insn_and_split "*avx_vzeroupper_1"
-  [(match_parallel 0 "vzeroupper_pattern"
-     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
-  "TARGET_AVX && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1"
-  "#"
-  "&& epilogue_completed"
-  [(match_dup 0)]
-{
-  /* For IPA-RA purposes, make it clear the instruction clobbers
-     even XMM registers not mentioned explicitly in the pattern.  */
-  unsigned int nregs = TARGET_64BIT ? 16 : 8;
-  unsigned int npats = XVECLEN (operands[0], 0);
-  rtvec vec = rtvec_alloc (nregs + 1);
-  RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0);
-  for (unsigned int i = 0, j = 1; i < nregs; ++i)
-    {
-      unsigned int regno = GET_SSE_REGNO (i);
-      if (j < npats
-	  && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno)
-	{
-	  RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j);
-	  j++;
-	}
-      else
-	{
-	  rtx reg = gen_rtx_REG (V2DImode, regno);
-	  RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
-	}
-    }
-  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
-}
-  [(set_attr "type" "sse")
-   (set_attr "modrm" "0")
-   (set_attr "memory" "none")
-   (set_attr "prefix" "vex")
-   (set_attr "btver2_decode" "vector")
-   (set_attr "mode" "OI")])
-
 (define_mode_attr pbroadcast_evex_isa
   [(V64QI "avx512bw") (V32QI "avx512bw") (V16QI "avx512bw")
    (V32HI "avx512bw") (V16HI "avx512bw") (V8HI "avx512bw")
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-1.c b/gcc/testsuite/gcc.target/i386/pr82735-1.c
new file mode 100644
index 00000000000..1a63b9ae9c9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-1.c
@@ -0,0 +1,29 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx" } */
+/* { dg-require-effective-target avx } */
+
+#include "avx-check.h"
+
+void
+__attribute__ ((noipa))
+mtest(char *dest)
+{
+  __m256i ymm1 = _mm256_set1_epi8((char)0x1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm1);
+  _mm256_zeroupper();
+  __m256i ymm2 = _mm256_set1_epi8((char)0x1);
+  _mm256_storeu_si256((__m256i *)dest, ymm2);
+}
+
+void
+avx_test ()
+{
+  char buf[64];
+  for (int i = 0; i != 64; i++)
+    buf[i] = 2;
+  mtest (buf);
+
+  for (int i = 0; i < 32; ++i)
+    if (buf[i] != 1)
+      __builtin_abort ();
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-2.c b/gcc/testsuite/gcc.target/i386/pr82735-2.c
new file mode 100644
index 00000000000..ac9d006f794
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-2.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx -O2" } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  /* xmm1 can be propagated to xmm2 by CSE.  */
+  __m128i xmm1 = _mm_set_epi8(0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
+			      0x9, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16);
+  _mm_storeu_si128((__m128i *)(dest + 32), xmm1);
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  __m128i xmm2 = xmm1;
+  _mm_storeu_si128((__m128i *)dest, xmm2);
+}
+
+/* Darwin local constant symbol is "lC0", ELF targets use ".LC0".  */
+/* { dg-final { scan-assembler-times {(?n)vmovdqa\t\.?[Ll]C0[^,]*, %xmm[0-9]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-3.c b/gcc/testsuite/gcc.target/i386/pr82735-3.c
new file mode 100644
index 00000000000..e3f801e6924
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-3.c
@@ -0,0 +1,5 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx -O2 -mabi=ms" } */
+/* { dg-final { scan-assembler-not {(?n)xmm([6-9]|1[0-5])} } } */
+
+#include "pr82735-2.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-4.c b/gcc/testsuite/gcc.target/i386/pr82735-4.c
new file mode 100644
index 00000000000..78c0a6cb2c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-4.c
@@ -0,0 +1,48 @@
+/* { dg-do compile { target { ! ia32 } } }  */
+/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
+  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
+		"vmovdqa\t%%ymm0, %1\n\t"
+		"vmovdqa\t%%ymm0, %2\n\t"
+		"vmovdqa\t%%ymm0, %3\n\t"
+		"vmovdqa\t%%ymm0, %4\n\t"
+		"vmovdqa\t%%ymm0, %5\n\t"
+		"vmovdqa\t%%ymm0, %6\n\t"
+		"vmovdqa\t%%ymm0, %7\n\t"
+		"vmovdqa\t%%ymm0, %8\n\t"
+		"vmovdqa\t%%ymm0, %9\n\t"
+		"vmovdqa\t%%ymm0, %10\n\t"
+		"vmovdqa\t%%ymm0, %11\n\t"
+		"vmovdqa\t%%ymm0, %12\n\t"
+		"vmovdqa\t%%ymm0, %13\n\t"
+		"vmovdqa\t%%ymm0, %14\n\t"
+		"vmovdqa\t%%ymm0, %15\n\t"
+		: "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
+		  "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
+		  "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
+		  "=v"(ymm0)
+		::);
+  _mm256_zeroupper();
+  _mm256_storeu_si256((__m256i *)dest, ymm1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-5.c b/gcc/testsuite/gcc.target/i386/pr82735-5.c
new file mode 100644
index 00000000000..2a58cbe52d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-5.c
@@ -0,0 +1,54 @@
+/* { dg-do compile { target { ! ia32 } } }  */
+/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
+  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
+		"vmovdqa\t%%ymm0, %1\n\t"
+		"vmovdqa\t%%ymm0, %2\n\t"
+		"vmovdqa\t%%ymm0, %3\n\t"
+		"vmovdqa\t%%ymm0, %4\n\t"
+		"vmovdqa\t%%ymm0, %5\n\t"
+		"vmovdqa\t%%ymm0, %6\n\t"
+		"vmovdqa\t%%ymm0, %7\n\t"
+		"vmovdqa\t%%ymm0, %8\n\t"
+		"vmovdqa\t%%ymm0, %9\n\t"
+		"vmovdqa\t%%ymm0, %10\n\t"
+		"vmovdqa\t%%ymm0, %11\n\t"
+		"vmovdqa\t%%ymm0, %12\n\t"
+		"vmovdqa\t%%ymm0, %13\n\t"
+		"vmovdqa\t%%ymm0, %14\n\t"
+		"vmovdqa\t%%ymm0, %15\n\t"
+		: "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
+		  "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
+		  "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
+		  "=v"(ymm0)
+		::);
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_storeu_si256((__m256i *)dest, ymm1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
+}
-- 
2.18.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-06-03  6:54                               ` [PATCH 1/2] CALL_INSN may not be a real function call liuhongt
  2021-06-03  6:54                                 ` [PATCH 2/2] Fix _mm256_zeroupper by representing the instructions as call_insns in which the call has a special vzeroupper ABI liuhongt
@ 2021-06-04  2:55                                 ` Hongtao Liu
  2021-06-04  7:50                                 ` Jakub Jelinek
  2021-07-05 23:30                                 ` Segher Boessenkool
  3 siblings, 0 replies; 45+ messages in thread
From: Hongtao Liu @ 2021-06-04  2:55 UTC (permalink / raw)
  To: liuhongt; +Cc: GCC Patches, Jakub Jelinek, Richard Sandiford, Uros Bizjak

Ping,
This is the split-out middle-end patch, a follow-up to
https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571544.html

On Thu, Jun 3, 2021 at 2:54 PM liuhongt via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a
> fake call, it won't have its own function stack.
>
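> For reference, target-independent passes that look for real calls are
> expected to also test the new flag.  A minimal sketch of such a check
> (illustrative only: count_real_calls is a hypothetical helper, not part
> of this patch; CALL_P, SIBLING_CALL_P, get_insns and NEXT_INSN are
> existing GCC interfaces):
> 
>   /* Count the calls in the current function that really transfer
>      control, skipping fake call_insns such as vzeroupper.  */
>   static int
>   count_real_calls (void)
>   {
>     int n = 0;
>     for (rtx_insn *insn = get_insns (); insn; insn = NEXT_INSN (insn))
>       if (CALL_P (insn)
> 	  && !SIBLING_CALL_P (insn)
> 	  && !FAKE_CALL_P (insn))
> 	n++;
>     return n;
>   }
> 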
> gcc/ChangeLog
>
>         PR target/82735
>         * df-scan.c (df_get_call_refs): When call_insn is a fake call,
>         it won't use stack pointer reg.
>         * final.c (leaf_function_p): When call_insn is a fake call, it
>         won't affect caller as a leaf function.
>         * reg-stack.c (callee_clobbers_any_stack_reg): New.
>         (subst_stack_regs): When call_insn doesn't clobber any stack
>         reg, don't clear the arguments.
>         * rtl.c (shallow_copy_rtx): Don't clear flag used when orig is
>         an insn.
>         * shrink-wrap.c (requires_stack_frame_p): No need for stack
>         frame for a fake call.
>         * rtl.h (FAKE_CALL_P): New macro.
> ---
>  gcc/df-scan.c     |  3 ++-
>  gcc/final.c       |  3 ++-
>  gcc/reg-stack.c   | 18 +++++++++++++++++-
>  gcc/rtl.c         |  6 ++++--
>  gcc/rtl.h         |  5 +++++
>  gcc/shrink-wrap.c |  2 +-
>  6 files changed, 31 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/df-scan.c b/gcc/df-scan.c
> index 6691c3e8357..1268536b3f0 100644
> --- a/gcc/df-scan.c
> +++ b/gcc/df-scan.c
> @@ -3090,7 +3090,8 @@ df_get_call_refs (class df_collection_rec *collection_rec,
>
>    for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
>      {
> -      if (i == STACK_POINTER_REGNUM)
> +      if (i == STACK_POINTER_REGNUM
> +         && !FAKE_CALL_P (insn_info->insn))
>         /* The stack ptr is used (honorarily) by a CALL insn.  */
>         df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
>                        NULL, bb, insn_info, DF_REF_REG_USE,
> diff --git a/gcc/final.c b/gcc/final.c
> index e0a70fcd830..817f7722cb2 100644
> --- a/gcc/final.c
> +++ b/gcc/final.c
> @@ -4109,7 +4109,8 @@ leaf_function_p (void)
>    for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
>      {
>        if (CALL_P (insn)
> -         && ! SIBLING_CALL_P (insn))
> +         && ! SIBLING_CALL_P (insn)
> +         && ! FAKE_CALL_P (insn))
>         return 0;
>        if (NONJUMP_INSN_P (insn)
>           && GET_CODE (PATTERN (insn)) == SEQUENCE
> diff --git a/gcc/reg-stack.c b/gcc/reg-stack.c
> index 25210f0c17f..1d9ea035cf4 100644
> --- a/gcc/reg-stack.c
> +++ b/gcc/reg-stack.c
> @@ -174,6 +174,7 @@
>  #include "reload.h"
>  #include "tree-pass.h"
>  #include "rtl-iter.h"
> +#include "function-abi.h"
>
>  #ifdef STACK_REGS
>
> @@ -2368,6 +2369,18 @@ subst_asm_stack_regs (rtx_insn *insn, stack_ptr regstack)
>             }
>        }
>  }
> +
> +/* Return true if a function call is allowed to alter some or all bits
> +   of any stack reg.  */
> +static bool
> +callee_clobbers_any_stack_reg (const function_abi & callee_abi)
> +{
> +  for (unsigned regno = FIRST_STACK_REG; regno <= LAST_STACK_REG; regno++)
> +    if (callee_abi.clobbers_at_least_part_of_reg_p (regno))
> +      return true;
> +  return false;
> +}
> +
>
>  /* Substitute stack hard reg numbers for stack virtual registers in
>     INSN.  Non-stack register numbers are not changed.  REGSTACK is the
> @@ -2382,7 +2395,10 @@ subst_stack_regs (rtx_insn *insn, stack_ptr regstack)
>    bool control_flow_insn_deleted = false;
>    int i;
>
> -  if (CALL_P (insn))
> +  /* If the target of the call doesn't clobber any stack registers,
> +     don't clear the arguments.  */
> +  if (CALL_P (insn)
> +      && callee_clobbers_any_stack_reg (insn_callee_abi (insn)))
>      {
>        int top = regstack->top;
>
> diff --git a/gcc/rtl.c b/gcc/rtl.c
> index b0ba1ff684c..aaee882f5ca 100644
> --- a/gcc/rtl.c
> +++ b/gcc/rtl.c
> @@ -395,8 +395,10 @@ shallow_copy_rtx (const_rtx orig MEM_STAT_DECL)
>      case SCRATCH:
>        break;
>      default:
> -      /* For all other RTXes clear the used flag on the copy.  */
> -      RTX_FLAG (copy, used) = 0;
> +      /* For all other RTXes clear the used flag on the copy.
> +        CALL_INSN uses the "used" flag to indicate it's a fake call.  */
> +      if (!INSN_P (orig))
> +       RTX_FLAG (copy, used) = 0;
>        break;
>      }
>    return copy;
> diff --git a/gcc/rtl.h b/gcc/rtl.h
> index 35178b5bfac..5ed0d6dd6fa 100644
> --- a/gcc/rtl.h
> +++ b/gcc/rtl.h
> @@ -839,6 +839,11 @@ struct GTY(()) rtvec_def {
>  /* Predicate yielding nonzero iff X is a call insn.  */
>  #define CALL_P(X) (GET_CODE (X) == CALL_INSN)
>
> +/* 1 if RTX is a call_insn for a fake call.
> +   CALL_INSN uses the "used" flag to indicate it's a fake call.  */
> +#define FAKE_CALL_P(RTX)                                        \
> +  (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)
> +
>  /* Predicate yielding nonzero iff X is an insn that cannot jump.  */
>  #define NONJUMP_INSN_P(X) (GET_CODE (X) == INSN)
>
> diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c
> index ba7b5cd56fd..5e60f34f749 100644
> --- a/gcc/shrink-wrap.c
> +++ b/gcc/shrink-wrap.c
> @@ -57,7 +57,7 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
>    HARD_REG_SET hardregs;
>    unsigned regno;
>
> -  if (CALL_P (insn))
> +  if (CALL_P (insn) && !FAKE_CALL_P (insn))
>      return !SIBLING_CALL_P (insn);
>
>    /* We need a frame to get the unique CFA expected by the unwinder.  */
> --
> 2.18.1
>


-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/2] Fix _mm256_zeroupper by representing the instructions as call_insns in which the call has a special vzeroupper ABI.
  2021-06-03  6:54                                 ` [PATCH 2/2] Fix _mm256_zeroupper by representing the instructions as call_insns in which the call has a special vzeroupper ABI liuhongt
@ 2021-06-04  2:56                                   ` Hongtao Liu
  2021-06-04  6:26                                   ` Uros Bizjak
  1 sibling, 0 replies; 45+ messages in thread
From: Hongtao Liu @ 2021-06-04  2:56 UTC (permalink / raw)
  To: liuhongt; +Cc: GCC Patches, Jakub Jelinek, Richard Sandiford, Uros Bizjak

Ping

This is the split-out backend patch, a follow-up to
https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571545.html

On Thu, Jun 3, 2021 at 2:55 PM liuhongt via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> When __builtin_ia32_vzeroupper is called explicitly, the corresponding
> vzeroupper pattern does not carry any CLOBBERS or SETs before LRA,
> which leads to incorrect optimization in pass_reload. In order to
> solve this problem, this patch refines the instructions as call_insns in
> which the call has a special vzeroupper ABI.
>
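> With the call-level ABI in place, RTL passes can recover the effect of
> the fake call through the generic function-abi interface instead of
> pattern matching.  A minimal sketch (illustrative only:
> vzeroupper_clobbers_ymm0_upper_p is a hypothetical helper, not part of
> this patch; insn_callee_abi and function_abi::clobbers_reg_p are
> existing GCC interfaces):
> 
>   /* Return true if the callee ABI of INSN clobbers at least part of
>      ymm0 when it holds a 256-bit value.  For the vzeroupper ABI below
>      this is true, since the upper 128 bits are clobbered while the
>      low 128 bits survive.  */
>   static bool
>   vzeroupper_clobbers_ymm0_upper_p (const rtx_insn *insn)
>   {
>     function_abi abi = insn_callee_abi (insn);
>     return abi.clobbers_reg_p (V4DImode, FIRST_SSE_REG);
>   }
> 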
> gcc/ChangeLog:
>
>         PR target/82735
>         * config/i386/i386-expand.c (ix86_expand_builtin): Remove
>         assignment of cfun->machine->has_explicit_vzeroupper.
>         * config/i386/i386-features.c
>         (ix86_add_reg_usage_to_vzerouppers): Delete.
>         (ix86_add_reg_usage_to_vzeroupper): Ditto.
>         (rest_of_handle_insert_vzeroupper): Remove
>         ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
>         of the function.
>         (gate): Remove cfun->machine->has_explicit_vzeroupper.
>         * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
>         Declared.
>         * config/i386/i386.c (ix86_insn_callee_abi): New function.
>         (ix86_initialize_callee_abi): Ditto.
>         (ix86_expand_avx_vzeroupper): Ditto.
>         (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
>         ABI.
>         (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
>         (ix86_emit_mode_set): Call ix86_expand_avx_vzeroupper
>         directly.
>         * config/i386/i386.h (struct GTY(()) machine_function): Delete
>         has_explicit_vzeroupper.
>         * config/i386/i386.md (enum unspec): New member
>         UNSPEC_CALLEE_ABI.
>         (I386_DEFAULT,I386_VZEROUPPER,I386_UNKNOWN): New
>         define_constants for insn callee abi index.
>         * config/i386/predicates.md (vzeroupper_pattern): Adjust.
>         * config/i386/sse.md (UNSPECV_VZEROUPPER): Deleted.
>         (avx_vzeroupper): Call ix86_expand_avx_vzeroupper.
>         (*avx_vzeroupper): Rename to ..
>         (avx_vzeroupper_callee_abi): .. this, and adjust pattern as
>         call_insn which has a special vzeroupper ABI.
>         (*avx_vzeroupper_1): Deleted.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/82735
>         * gcc.target/i386/pr82735-1.c: New test.
>         * gcc.target/i386/pr82735-2.c: New test.
>         * gcc.target/i386/pr82735-3.c: New test.
>         * gcc.target/i386/pr82735-4.c: New test.
>         * gcc.target/i386/pr82735-5.c: New test.
> ---
>  gcc/config/i386/i386-expand.c             |  4 -
>  gcc/config/i386/i386-features.c           | 99 +++--------------------
>  gcc/config/i386/i386-protos.h             |  1 +
>  gcc/config/i386/i386.c                    | 55 ++++++++++++-
>  gcc/config/i386/i386.h                    |  4 -
>  gcc/config/i386/i386.md                   | 10 +++
>  gcc/config/i386/predicates.md             |  5 +-
>  gcc/config/i386/sse.md                    | 59 ++++----------
>  gcc/testsuite/gcc.target/i386/pr82735-1.c | 29 +++++++
>  gcc/testsuite/gcc.target/i386/pr82735-2.c | 22 +++++
>  gcc/testsuite/gcc.target/i386/pr82735-3.c |  5 ++
>  gcc/testsuite/gcc.target/i386/pr82735-4.c | 48 +++++++++++
>  gcc/testsuite/gcc.target/i386/pr82735-5.c | 54 +++++++++++++
>  13 files changed, 252 insertions(+), 143 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-4.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-5.c
>
> diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> index 9f3d41955a2..d25d59aa4e7 100644
> --- a/gcc/config/i386/i386-expand.c
> +++ b/gcc/config/i386/i386-expand.c
> @@ -13282,10 +13282,6 @@ rdseed_step:
>
>        return 0;
>
> -    case IX86_BUILTIN_VZEROUPPER:
> -      cfun->machine->has_explicit_vzeroupper = true;
> -      break;
> -
>      default:
>        break;
>      }
> diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> index 77783a154b6..a25769ae478 100644
> --- a/gcc/config/i386/i386-features.c
> +++ b/gcc/config/i386/i386-features.c
> @@ -1768,92 +1768,22 @@ convert_scalars_to_vector (bool timode_p)
>    return 0;
>  }
>
> -/* Modify the vzeroupper pattern in INSN so that it describes the effect
> -   that the instruction has on the SSE registers.  LIVE_REGS are the set
> -   of registers that are live across the instruction.
> -
> -   For a live register R we use:
> -
> -     (set (reg:V2DF R) (reg:V2DF R))
> -
> -   which preserves the low 128 bits but clobbers the upper bits.  */
> -
> -static void
> -ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
> -{
> -  rtx pattern = PATTERN (insn);
> -  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> -  unsigned int npats = nregs;
> -  for (unsigned int i = 0; i < nregs; ++i)
> -    {
> -      unsigned int regno = GET_SSE_REGNO (i);
> -      if (!bitmap_bit_p (live_regs, regno))
> -       npats--;
> -    }
> -  if (npats == 0)
> -    return;
> -  rtvec vec = rtvec_alloc (npats + 1);
> -  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
> -  for (unsigned int i = 0, j = 0; i < nregs; ++i)
> -    {
> -      unsigned int regno = GET_SSE_REGNO (i);
> -      if (!bitmap_bit_p (live_regs, regno))
> -       continue;
> -      rtx reg = gen_rtx_REG (V2DImode, regno);
> -      ++j;
> -      RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg);
> -    }
> -  XVEC (pattern, 0) = vec;
> -  INSN_CODE (insn) = -1;
> -  df_insn_rescan (insn);
> -}
> -
> -/* Walk the vzeroupper instructions in the function and annotate them
> -   with the effect that they have on the SSE registers.  */
> -
> -static void
> -ix86_add_reg_usage_to_vzerouppers (void)
> -{
> -  basic_block bb;
> -  rtx_insn *insn;
> -  auto_bitmap live_regs;
> -
> -  df_analyze ();
> -  FOR_EACH_BB_FN (bb, cfun)
> -    {
> -      bitmap_copy (live_regs, df_get_live_out (bb));
> -      df_simulate_initialize_backwards (bb, live_regs);
> -      FOR_BB_INSNS_REVERSE (bb, insn)
> -       {
> -         if (!NONDEBUG_INSN_P (insn))
> -           continue;
> -         if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
> -           ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
> -         df_simulate_one_insn_backwards (bb, insn, live_regs);
> -       }
> -    }
> -}
> -
>  static unsigned int
>  rest_of_handle_insert_vzeroupper (void)
>  {
> -  if (TARGET_VZEROUPPER
> -      && flag_expensive_optimizations
> -      && !optimize_size)
> -    {
> -      /* vzeroupper instructions are inserted immediately after reload to
> -        account for possible spills from 256bit or 512bit registers.  The pass
> -        reuses mode switching infrastructure by re-running mode insertion
> -        pass, so disable entities that have already been processed.  */
> -      for (int i = 0; i < MAX_386_ENTITIES; i++)
> -       ix86_optimize_mode_switching[i] = 0;
> +  /* vzeroupper instructions are inserted immediately after reload to
> +     account for possible spills from 256bit or 512bit registers.  The pass
> +     reuses mode switching infrastructure by re-running mode insertion
> +     pass, so disable entities that have already been processed.  */
> +  for (int i = 0; i < MAX_386_ENTITIES; i++)
> +    ix86_optimize_mode_switching[i] = 0;
>
> -      ix86_optimize_mode_switching[AVX_U128] = 1;
> +  ix86_optimize_mode_switching[AVX_U128] = 1;
>
> -      /* Call optimize_mode_switching.  */
> -      g->get_passes ()->execute_pass_mode_switching ();
> -    }
> -  ix86_add_reg_usage_to_vzerouppers ();
> +  /* Call optimize_mode_switching.  */
> +  g->get_passes ()->execute_pass_mode_switching ();
> +
> +  df_analyze ();
>    return 0;
>  }
>
> @@ -1882,11 +1812,8 @@ public:
>    /* opt_pass methods: */
>    virtual bool gate (function *)
>      {
> -      return TARGET_AVX
> -            && ((TARGET_VZEROUPPER
> -                 && flag_expensive_optimizations
> -                 && !optimize_size)
> -                || cfun->machine->has_explicit_vzeroupper);
> +      return TARGET_AVX && TARGET_VZEROUPPER
> +       && flag_expensive_optimizations && !optimize_size;
>      }
>
>    virtual unsigned int execute (function *)
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index 7782cf1163f..e6ac9390777 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -216,6 +216,7 @@ extern rtx ix86_split_stack_guard (void);
>  extern void ix86_move_vector_high_sse_to_mmx (rtx);
>  extern void ix86_split_mmx_pack (rtx[], enum rtx_code);
>  extern void ix86_split_mmx_punpck (rtx[], bool);
> +extern void ix86_expand_avx_vzeroupper (void);
>
>  #ifdef TREE_CODE
>  extern void init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree, int);
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 743d8a25fe3..f0b66dd0d56 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -14426,7 +14426,7 @@ ix86_emit_mode_set (int entity, int mode, int prev_mode ATTRIBUTE_UNUSED,
>        break;
>      case AVX_U128:
>        if (mode == AVX_U128_CLEAN)
> -       emit_insn (gen_avx_vzeroupper ());
> +       ix86_expand_avx_vzeroupper ();
>        break;
>      case I387_ROUNDEVEN:
>      case I387_TRUNC:
> @@ -19494,15 +19494,63 @@ ix86_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
>    return false;
>  }
>
> +/* Implement TARGET_INSN_CALLEE_ABI.  */
> +
> +const predefined_function_abi &
> +ix86_insn_callee_abi (const rtx_insn *insn)
> +{
> +  unsigned int abi_id = 0;
> +  rtx pat = PATTERN (insn);
> +  if (vzeroupper_pattern (pat, VOIDmode))
> +    abi_id = I386_VZEROUPPER;
> +
> +  return function_abis[abi_id];
> +}
> +
> +/* Initialize function_abis with corresponding abi_id,
> +   currently only vzeroupper is handled.  */
> +void
> +ix86_initialize_callee_abi (unsigned int abi_id)
> +{
> +  gcc_assert (abi_id == I386_VZEROUPPER);
> +  predefined_function_abi &vzeroupper_abi = function_abis[abi_id];
> +  if (!vzeroupper_abi.initialized_p ())
> +    {
> +      HARD_REG_SET full_reg_clobbers;
> +      CLEAR_HARD_REG_SET (full_reg_clobbers);
> +      vzeroupper_abi.initialize (I386_VZEROUPPER, full_reg_clobbers);
> +    }
> +}
> +
> +void
> +ix86_expand_avx_vzeroupper (void)
> +{
> +  /* Initialize vzeroupper_abi here.  */
> +  ix86_initialize_callee_abi (I386_VZEROUPPER);
> +  rtx_insn *insn = emit_call_insn (gen_avx_vzeroupper_callee_abi ());
> +  /* Return false for non-local goto in can_nonlocal_goto.  */
> +  make_reg_eh_region_note (insn, 0, INT_MIN);
> +  /* The "used" flag of a call_insn indicates it's a fake call.  */
> +  RTX_FLAG (insn, used) = 1;
> +}
> +
> +
>  /* Implement TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  The only ABI that
>     saves SSE registers across calls is Win64 (thus no need to check the
>     current ABI here), and with AVX enabled Win64 only guarantees that
>     the low 16 bytes are saved.  */
>
>  static bool
> -ix86_hard_regno_call_part_clobbered (unsigned int, unsigned int regno,
> +ix86_hard_regno_call_part_clobbered (unsigned int abi_id, unsigned int regno,
>                                      machine_mode mode)
>  {
> +  /* Special ABI for vzeroupper which only clobbers the upper part of SSE regs.  */
> +  if (abi_id == I386_VZEROUPPER)
> +      return (GET_MODE_SIZE (mode) > 16
> +             && ((TARGET_64BIT
> +                  && (IN_RANGE (regno, FIRST_REX_SSE_REG, LAST_REX_SSE_REG)))
> +                 || (IN_RANGE (regno, FIRST_SSE_REG, LAST_SSE_REG))));
> +
>    return SSE_REGNO_P (regno) && GET_MODE_SIZE (mode) > 16;
>  }
>
> @@ -23916,6 +23964,9 @@ ix86_run_selftests (void)
>  #define TARGET_HARD_REGNO_CALL_PART_CLOBBERED \
>    ix86_hard_regno_call_part_clobbered
>
> +#undef TARGET_INSN_CALLEE_ABI
> +#define TARGET_INSN_CALLEE_ABI ix86_insn_callee_abi
> +
>  #undef TARGET_CAN_CHANGE_MODE_CLASS
>  #define TARGET_CAN_CHANGE_MODE_CLASS ix86_can_change_mode_class
>
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index 53d503fc6e0..919d0b2418a 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -2659,10 +2659,6 @@ struct GTY(()) machine_function {
>    /* True if the function needs a stack frame.  */
>    BOOL_BITFIELD stack_frame_required : 1;
>
> -  /* True if __builtin_ia32_vzeroupper () has been expanded in current
> -     function.  */
> -  BOOL_BITFIELD has_explicit_vzeroupper : 1;
> -
>    /* True if we should act silently, rather than raise an error for
>       invalid calls.  */
>    BOOL_BITFIELD silent_p : 1;
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 2fc8fae30f3..5d9f5aa39ac 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -191,6 +191,10 @@ (define_c_enum "unspec" [
>    ;; For MOVDIRI and MOVDIR64B support
>    UNSPEC_MOVDIRI
>    UNSPEC_MOVDIR64B
> +
> +  ;; For insn_callee_abi:
> +  UNSPEC_CALLEE_ABI
> +
>  ])
>
>  (define_c_enum "unspecv" [
> @@ -447,6 +451,12 @@ (define_constants
>     (FIRST_PSEUDO_REG           76)
>    ])
>
> +;; Insn callee abi index.
> +(define_constants
> +  [(I386_DEFAULT       0)
> +   (I386_VZEROUPPER    1)
> +   (I386_UNKNOWN       2)])
> +
>  ;; Insns whose names begin with "x86_" are emitted by gen_FOO calls
>  ;; from i386.c.
>
> diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> index abd307ebdb8..8b787553f32 100644
> --- a/gcc/config/i386/predicates.md
> +++ b/gcc/config/i386/predicates.md
> @@ -1599,8 +1599,9 @@ (define_predicate "vzeroall_pattern"
>  ;; return true if OP is a vzeroupper pattern.
>  (define_predicate "vzeroupper_pattern"
>    (and (match_code "parallel")
> -       (match_code "unspec_volatile" "a")
> -       (match_test "XINT (XVECEXP (op, 0, 0), 1) == UNSPECV_VZEROUPPER")))
> +       (match_code "unspec" "b")
> +       (match_test "XINT (XVECEXP (op, 0, 1), 1) == UNSPEC_CALLEE_ABI")
> +       (match_test "INTVAL (XVECEXP (XVECEXP (op, 0, 1), 0, 0)) == I386_VZEROUPPER")))
>
>  ;; Return true if OP is an addsub vec_merge operation
>  (define_predicate "addsub_vm_operator"
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index a4503ddcb73..949347a3247 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -205,7 +205,6 @@ (define_c_enum "unspecv" [
>    UNSPECV_MONITOR
>    UNSPECV_MWAIT
>    UNSPECV_VZEROALL
> -  UNSPECV_VZEROUPPER
>
>    ;; For KEYLOCKER
>    UNSPECV_LOADIWKEY
> @@ -20857,14 +20856,22 @@ (define_insn "*avx_vzeroall"
>  ;; if the upper 128bits are unused.  Initially we expand the instructions
>  ;; as though they had no effect on the SSE registers, but later add SETs and
>  ;; CLOBBERs to the PARALLEL to model the real effect.
> +
>  (define_expand "avx_vzeroupper"
> -  [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> -  "TARGET_AVX")
> +  [(parallel [(call (mem:QI (const_int 0))
> +                   (const_int 0))
> +            (unspec [(const_int I386_VZEROUPPER)] UNSPEC_CALLEE_ABI)])]
> +  "TARGET_AVX"
> +{
> +  ix86_expand_avx_vzeroupper ();
> +  DONE;
> +})
>
> -(define_insn "*avx_vzeroupper"
> -  [(match_parallel 0 "vzeroupper_pattern"
> -     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> -  "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 16 : 8) + 1"
> +(define_insn "avx_vzeroupper_callee_abi"
> +  [(call (mem:QI (const_int 0))
> +        (const_int 0))
> +    (unspec [(const_int I386_VZEROUPPER)] UNSPEC_CALLEE_ABI)]
> +  "TARGET_AVX"
>    "vzeroupper"
>    [(set_attr "type" "sse")
>     (set_attr "modrm" "0")
> @@ -20873,44 +20880,6 @@ (define_insn "*avx_vzeroupper"
>     (set_attr "btver2_decode" "vector")
>     (set_attr "mode" "OI")])
>
> -(define_insn_and_split "*avx_vzeroupper_1"
> -  [(match_parallel 0 "vzeroupper_pattern"
> -     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> -  "TARGET_AVX && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1"
> -  "#"
> -  "&& epilogue_completed"
> -  [(match_dup 0)]
> -{
> -  /* For IPA-RA purposes, make it clear the instruction clobbers
> -     even XMM registers not mentioned explicitly in the pattern.  */
> -  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> -  unsigned int npats = XVECLEN (operands[0], 0);
> -  rtvec vec = rtvec_alloc (nregs + 1);
> -  RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0);
> -  for (unsigned int i = 0, j = 1; i < nregs; ++i)
> -    {
> -      unsigned int regno = GET_SSE_REGNO (i);
> -      if (j < npats
> -         && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno)
> -       {
> -         RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j);
> -         j++;
> -       }
> -      else
> -       {
> -         rtx reg = gen_rtx_REG (V2DImode, regno);
> -         RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
> -       }
> -    }
> -  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
> -}
> -  [(set_attr "type" "sse")
> -   (set_attr "modrm" "0")
> -   (set_attr "memory" "none")
> -   (set_attr "prefix" "vex")
> -   (set_attr "btver2_decode" "vector")
> -   (set_attr "mode" "OI")])
> -
>  (define_mode_attr pbroadcast_evex_isa
>    [(V64QI "avx512bw") (V32QI "avx512bw") (V16QI "avx512bw")
>     (V32HI "avx512bw") (V16HI "avx512bw") (V8HI "avx512bw")
> diff --git a/gcc/testsuite/gcc.target/i386/pr82735-1.c b/gcc/testsuite/gcc.target/i386/pr82735-1.c
> new file mode 100644
> index 00000000000..1a63b9ae9c9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr82735-1.c
> @@ -0,0 +1,29 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -mavx" } */
> +/* { dg-require-effective-target avx } */
> +
> +#include "avx-check.h"
> +
> +void
> +__attribute__ ((noipa))
> +mtest(char *dest)
> +{
> +  __m256i ymm1 = _mm256_set1_epi8((char)0x1);
> +  _mm256_storeu_si256((__m256i *)(dest + 32), ymm1);
> +  _mm256_zeroupper();
> +  __m256i ymm2 = _mm256_set1_epi8((char)0x1);
> +  _mm256_storeu_si256((__m256i *)dest, ymm2);
> +}
> +
> +void
> +avx_test ()
> +{
> +  char buf[64];
> +  for (int i = 0; i != 64; i++)
> +    buf[i] = 2;
> +  mtest (buf);
> +
> +  for (int i = 0; i < 32; ++i)
> +    if (buf[i] != 1)
> +      __builtin_abort ();
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr82735-2.c b/gcc/testsuite/gcc.target/i386/pr82735-2.c
> new file mode 100644
> index 00000000000..ac9d006f794
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr82735-2.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx -O2" } */
> +
> +#include <immintrin.h>
> +
> +void test(char *dest)
> +{
> +  /* xmm1 can be propagated to xmm2 by CSE.  */
> +  __m128i xmm1 = _mm_set_epi8(0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
> +                             0x9, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16);
> +  _mm_storeu_si128((__m128i *)(dest + 32), xmm1);
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  __m128i xmm2 = xmm1;
> +  _mm_storeu_si128((__m128i *)dest, xmm2);
> +}
> +
> +/* Darwin local constant symbol is "lC0", ELF targets use ".LC0".  */
> +/* { dg-final { scan-assembler-times {(?n)vmovdqa\t\.?[Ll]C0[^,]*, %xmm[0-9]} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr82735-3.c b/gcc/testsuite/gcc.target/i386/pr82735-3.c
> new file mode 100644
> index 00000000000..e3f801e6924
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr82735-3.c
> @@ -0,0 +1,5 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx -O2 -mabi=ms" } */
> +/* { dg-final { scan-assembler-not {(?n)xmm([6-9]|1[0-5])} } } */
> +
> +#include "pr82735-2.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr82735-4.c b/gcc/testsuite/gcc.target/i386/pr82735-4.c
> new file mode 100644
> index 00000000000..78c0a6cb2c8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr82735-4.c
> @@ -0,0 +1,48 @@
> +/* { dg-do compile { target { ! ia32 } } }  */
> +/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
> +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
> +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
> +
> +#include <immintrin.h>
> +
> +void test(char *dest)
> +{
> +  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
> +  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
> +               "vmovdqa\t%%ymm0, %1\n\t"
> +               "vmovdqa\t%%ymm0, %2\n\t"
> +               "vmovdqa\t%%ymm0, %3\n\t"
> +               "vmovdqa\t%%ymm0, %4\n\t"
> +               "vmovdqa\t%%ymm0, %5\n\t"
> +               "vmovdqa\t%%ymm0, %6\n\t"
> +               "vmovdqa\t%%ymm0, %7\n\t"
> +               "vmovdqa\t%%ymm0, %8\n\t"
> +               "vmovdqa\t%%ymm0, %9\n\t"
> +               "vmovdqa\t%%ymm0, %10\n\t"
> +               "vmovdqa\t%%ymm0, %11\n\t"
> +               "vmovdqa\t%%ymm0, %12\n\t"
> +               "vmovdqa\t%%ymm0, %13\n\t"
> +               "vmovdqa\t%%ymm0, %14\n\t"
> +               "vmovdqa\t%%ymm0, %15\n\t"
> +               : "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
> +                 "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
> +                 "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
> +                 "=v"(ymm0)
> +               ::);
> +  _mm256_zeroupper();
> +  _mm256_storeu_si256((__m256i *)dest, ymm1);
> +  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr82735-5.c b/gcc/testsuite/gcc.target/i386/pr82735-5.c
> new file mode 100644
> index 00000000000..2a58cbe52d0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr82735-5.c
> @@ -0,0 +1,54 @@
> +/* { dg-do compile { target { ! ia32 } } }  */
> +/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
> +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
> +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
> +
> +#include <immintrin.h>
> +
> +void test(char *dest)
> +{
> +  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
> +  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
> +               "vmovdqa\t%%ymm0, %1\n\t"
> +               "vmovdqa\t%%ymm0, %2\n\t"
> +               "vmovdqa\t%%ymm0, %3\n\t"
> +               "vmovdqa\t%%ymm0, %4\n\t"
> +               "vmovdqa\t%%ymm0, %5\n\t"
> +               "vmovdqa\t%%ymm0, %6\n\t"
> +               "vmovdqa\t%%ymm0, %7\n\t"
> +               "vmovdqa\t%%ymm0, %8\n\t"
> +               "vmovdqa\t%%ymm0, %9\n\t"
> +               "vmovdqa\t%%ymm0, %10\n\t"
> +               "vmovdqa\t%%ymm0, %11\n\t"
> +               "vmovdqa\t%%ymm0, %12\n\t"
> +               "vmovdqa\t%%ymm0, %13\n\t"
> +               "vmovdqa\t%%ymm0, %14\n\t"
> +               "vmovdqa\t%%ymm0, %15\n\t"
> +               : "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
> +                 "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
> +                 "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
> +                 "=v"(ymm0)
> +               ::);
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_storeu_si256((__m256i *)dest, ymm1);
> +  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
> +}
> --
> 2.18.1
>


-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/2] Fix _mm256_zeroupper by representing the instructions as call_insns in which the call has a special vzeroupper ABI.
  2021-06-03  6:54                                 ` [PATCH 2/2] Fix _mm256_zeroupper by representing the instructions as call_insns in which the call has a special vzeroupper ABI liuhongt
  2021-06-04  2:56                                   ` Hongtao Liu
@ 2021-06-04  6:26                                   ` Uros Bizjak
  2021-06-04  6:34                                     ` Hongtao Liu
  1 sibling, 1 reply; 45+ messages in thread
From: Uros Bizjak @ 2021-06-04  6:26 UTC (permalink / raw)
  To: liuhongt; +Cc: gcc-patches, Jakub Jelinek, Richard Sandiford

On Thu, Jun 3, 2021 at 8:54 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> When __builtin_ia32_vzeroupper is called explicitly, the corresponding
> vzeroupper pattern does not carry any CLOBBERS or SETs before LRA,
> which leads to incorrect optimization in pass_reload. In order to
> solve this problem, this patch refines the instructions as call_insns in
> which the call has a special vzeroupper ABI.
>
> gcc/ChangeLog:
>
>         PR target/82735
>         * config/i386/i386-expand.c (ix86_expand_builtin): Remove
>         assignment of cfun->machine->has_explicit_vzeroupper.
>         * config/i386/i386-features.c
>         (ix86_add_reg_usage_to_vzerouppers): Delete.
>         (ix86_add_reg_usage_to_vzeroupper): Ditto.
>         (rest_of_handle_insert_vzeroupper): Remove
>         ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
>         of the function.
>         (gate): Remove cfun->machine->has_explicit_vzeroupper.
>         * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
>         Declared.
>         * config/i386/i386.c (ix86_insn_callee_abi): New function.
>         (ix86_initialize_callee_abi): Ditto.
>         (ix86_expand_avx_vzeroupper): Ditto.
>         (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
>         ABI.
>         (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
>         (ix86_emit_mode_set): Call ix86_expand_avx_vzeroupper
>         directly.
>         * config/i386/i386.h (struct GTY(()) machine_function): Delete
>         has_explicit_vzeroupper.
>         * config/i386/i386.md (enum unspec): New member
>         UNSPEC_CALLEE_ABI.
>         (I386_DEFAULT,I386_VZEROUPPER,I386_UNKNOWN): New
>         define_constants for insn callee abi index.
>         * config/i386/predicates.md (vzeroupper_pattern): Adjust.
>         * config/i386/sse.md (UNSPECV_VZEROUPPER): Deleted.
>         (avx_vzeroupper): Call ix86_expand_avx_vzeroupper.
>         (*avx_vzeroupper): Rename to ..
>         (avx_vzeroupper_callee_abi): .. this, and adjust pattern as
>         call_insn which has a special vzeroupper ABI.
>         (*avx_vzeroupper_1): Deleted.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/82735
>         * gcc.target/i386/pr82735-1.c: New test.
>         * gcc.target/i386/pr82735-2.c: New test.
>         * gcc.target/i386/pr82735-3.c: New test.
>         * gcc.target/i386/pr82735-4.c: New test.
>         * gcc.target/i386/pr82735-5.c: New test.

LGTM, with a small nit below.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-expand.c             |  4 -
>  gcc/config/i386/i386-features.c           | 99 +++--------------------
>  gcc/config/i386/i386-protos.h             |  1 +
>  gcc/config/i386/i386.c                    | 55 ++++++++++++-
>  gcc/config/i386/i386.h                    |  4 -
>  gcc/config/i386/i386.md                   | 10 +++
>  gcc/config/i386/predicates.md             |  5 +-
>  gcc/config/i386/sse.md                    | 59 ++++----------
>  gcc/testsuite/gcc.target/i386/pr82735-1.c | 29 +++++++
>  gcc/testsuite/gcc.target/i386/pr82735-2.c | 22 +++++
>  gcc/testsuite/gcc.target/i386/pr82735-3.c |  5 ++
>  gcc/testsuite/gcc.target/i386/pr82735-4.c | 48 +++++++++++
>  gcc/testsuite/gcc.target/i386/pr82735-5.c | 54 +++++++++++++
>  13 files changed, 252 insertions(+), 143 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-4.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-5.c
>
> diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> index 9f3d41955a2..d25d59aa4e7 100644
> --- a/gcc/config/i386/i386-expand.c
> +++ b/gcc/config/i386/i386-expand.c
> @@ -13282,10 +13282,6 @@ rdseed_step:
>
>        return 0;
>
> -    case IX86_BUILTIN_VZEROUPPER:
> -      cfun->machine->has_explicit_vzeroupper = true;
> -      break;
> -
>      default:
>        break;
>      }
> diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> index 77783a154b6..a25769ae478 100644
> --- a/gcc/config/i386/i386-features.c
> +++ b/gcc/config/i386/i386-features.c
> @@ -1768,92 +1768,22 @@ convert_scalars_to_vector (bool timode_p)
>    return 0;
>  }
>
> -/* Modify the vzeroupper pattern in INSN so that it describes the effect
> -   that the instruction has on the SSE registers.  LIVE_REGS are the set
> -   of registers that are live across the instruction.
> -
> -   For a live register R we use:
> -
> -     (set (reg:V2DF R) (reg:V2DF R))
> -
> -   which preserves the low 128 bits but clobbers the upper bits.  */
> -
> -static void
> -ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
> -{
> -  rtx pattern = PATTERN (insn);
> -  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> -  unsigned int npats = nregs;
> -  for (unsigned int i = 0; i < nregs; ++i)
> -    {
> -      unsigned int regno = GET_SSE_REGNO (i);
> -      if (!bitmap_bit_p (live_regs, regno))
> -       npats--;
> -    }
> -  if (npats == 0)
> -    return;
> -  rtvec vec = rtvec_alloc (npats + 1);
> -  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
> -  for (unsigned int i = 0, j = 0; i < nregs; ++i)
> -    {
> -      unsigned int regno = GET_SSE_REGNO (i);
> -      if (!bitmap_bit_p (live_regs, regno))
> -       continue;
> -      rtx reg = gen_rtx_REG (V2DImode, regno);
> -      ++j;
> -      RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg);
> -    }
> -  XVEC (pattern, 0) = vec;
> -  INSN_CODE (insn) = -1;
> -  df_insn_rescan (insn);
> -}
> -
> -/* Walk the vzeroupper instructions in the function and annotate them
> -   with the effect that they have on the SSE registers.  */
> -
> -static void
> -ix86_add_reg_usage_to_vzerouppers (void)
> -{
> -  basic_block bb;
> -  rtx_insn *insn;
> -  auto_bitmap live_regs;
> -
> -  df_analyze ();
> -  FOR_EACH_BB_FN (bb, cfun)
> -    {
> -      bitmap_copy (live_regs, df_get_live_out (bb));
> -      df_simulate_initialize_backwards (bb, live_regs);
> -      FOR_BB_INSNS_REVERSE (bb, insn)
> -       {
> -         if (!NONDEBUG_INSN_P (insn))
> -           continue;
> -         if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
> -           ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
> -         df_simulate_one_insn_backwards (bb, insn, live_regs);
> -       }
> -    }
> -}
> -
>  static unsigned int
>  rest_of_handle_insert_vzeroupper (void)
>  {
> -  if (TARGET_VZEROUPPER
> -      && flag_expensive_optimizations
> -      && !optimize_size)
> -    {
> -      /* vzeroupper instructions are inserted immediately after reload to
> -        account for possible spills from 256bit or 512bit registers.  The pass
> -        reuses mode switching infrastructure by re-running mode insertion
> -        pass, so disable entities that have already been processed.  */
> -      for (int i = 0; i < MAX_386_ENTITIES; i++)
> -       ix86_optimize_mode_switching[i] = 0;
> +  /* vzeroupper instructions are inserted immediately after reload to
> +     account for possible spills from 256bit or 512bit registers.  The pass
> +     reuses mode switching infrastructure by re-running mode insertion
> +     pass, so disable entities that have already been processed.  */
> +  for (int i = 0; i < MAX_386_ENTITIES; i++)
> +    ix86_optimize_mode_switching[i] = 0;
>
> -      ix86_optimize_mode_switching[AVX_U128] = 1;
> +  ix86_optimize_mode_switching[AVX_U128] = 1;
>
> -      /* Call optimize_mode_switching.  */
> -      g->get_passes ()->execute_pass_mode_switching ();
> -    }
> -  ix86_add_reg_usage_to_vzerouppers ();
> +  /* Call optimize_mode_switching.  */
> +  g->get_passes ()->execute_pass_mode_switching ();
> +
> +  df_analyze ();
>    return 0;
>  }
>
> @@ -1882,11 +1812,8 @@ public:
>    /* opt_pass methods: */
>    virtual bool gate (function *)
>      {
> -      return TARGET_AVX
> -            && ((TARGET_VZEROUPPER
> -                 && flag_expensive_optimizations
> -                 && !optimize_size)
> -                || cfun->machine->has_explicit_vzeroupper);
> +      return TARGET_AVX && TARGET_VZEROUPPER
> +       && flag_expensive_optimizations && !optimize_size;
>      }
>
>    virtual unsigned int execute (function *)
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index 7782cf1163f..e6ac9390777 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -216,6 +216,7 @@ extern rtx ix86_split_stack_guard (void);
>  extern void ix86_move_vector_high_sse_to_mmx (rtx);
>  extern void ix86_split_mmx_pack (rtx[], enum rtx_code);
>  extern void ix86_split_mmx_punpck (rtx[], bool);
> +extern void ix86_expand_avx_vzeroupper (void);
>
>  #ifdef TREE_CODE
>  extern void init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree, int);
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 743d8a25fe3..f0b66dd0d56 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -14426,7 +14426,7 @@ ix86_emit_mode_set (int entity, int mode, int prev_mode ATTRIBUTE_UNUSED,
>        break;
>      case AVX_U128:
>        if (mode == AVX_U128_CLEAN)
> -       emit_insn (gen_avx_vzeroupper ());
> +       ix86_expand_avx_vzeroupper ();
>        break;
>      case I387_ROUNDEVEN:
>      case I387_TRUNC:
> @@ -19494,15 +19494,63 @@ ix86_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
>    return false;
>  }
>
> +/* Implement TARGET_INSN_CALLEE_ABI.  */
> +
> +const predefined_function_abi &
> +ix86_insn_callee_abi (const rtx_insn *insn)
> +{
> +  unsigned int abi_id = 0;
> +  rtx pat = PATTERN (insn);
> +  if (vzeroupper_pattern (pat, VOIDmode))
> +    abi_id = I386_VZEROUPPER;
> +
> +  return function_abis[abi_id];
> +}
> +
> +/* Initialize function_abis with corresponding abi_id,
> +   currently only vzeroupper is handled.  */
> +void
> +ix86_initialize_callee_abi (unsigned int abi_id)
> +{
> +  gcc_assert (abi_id == I386_VZEROUPPER);
> +  predefined_function_abi &vzeroupper_abi = function_abis[abi_id];
> +  if (!vzeroupper_abi.initialized_p ())
> +    {
> +      HARD_REG_SET full_reg_clobbers;
> +      CLEAR_HARD_REG_SET (full_reg_clobbers);
> +      vzeroupper_abi.initialize (I386_VZEROUPPER, full_reg_clobbers);
> +    }
> +}
> +
> +void
> +ix86_expand_avx_vzeroupper (void)
> +{
> +  /* Initialize vzeroupper_abi here.  */
> +  ix86_initialize_callee_abi (I386_VZEROUPPER);
> +  rtx_insn *insn = emit_call_insn (gen_avx_vzeroupper_callee_abi ());
> +  /* Return false for non-local goto in can_nonlocal_goto.  */
> +  make_reg_eh_region_note (insn, 0, INT_MIN);
> +  /* The "used" flag of a call_insn indicates it's a fake call.  */
> +  RTX_FLAG (insn, used) = 1;
> +}
> +
> +
>  /* Implement TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  The only ABI that
>     saves SSE registers across calls is Win64 (thus no need to check the
>     current ABI here), and with AVX enabled Win64 only guarantees that
>     the low 16 bytes are saved.  */
>
>  static bool
> -ix86_hard_regno_call_part_clobbered (unsigned int, unsigned int regno,
> +ix86_hard_regno_call_part_clobbered (unsigned int abi_id, unsigned int regno,
>                                      machine_mode mode)
>  {
> +  /* Special ABI for vzeroupper which only clobbers the upper part of SSE regs.  */
> +  if (abi_id == I386_VZEROUPPER)
> +      return (GET_MODE_SIZE (mode) > 16
> +             && ((TARGET_64BIT
> +                  && (IN_RANGE (regno, FIRST_REX_SSE_REG, LAST_REX_SSE_REG)))
> +                 || (IN_RANGE (regno, FIRST_SSE_REG, LAST_SSE_REG))));
> +
>    return SSE_REGNO_P (regno) && GET_MODE_SIZE (mode) > 16;
>  }
>
> @@ -23916,6 +23964,9 @@ ix86_run_selftests (void)
>  #define TARGET_HARD_REGNO_CALL_PART_CLOBBERED \
>    ix86_hard_regno_call_part_clobbered
>
> +#undef TARGET_INSN_CALLEE_ABI
> +#define TARGET_INSN_CALLEE_ABI ix86_insn_callee_abi
> +
>  #undef TARGET_CAN_CHANGE_MODE_CLASS
>  #define TARGET_CAN_CHANGE_MODE_CLASS ix86_can_change_mode_class
>
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index 53d503fc6e0..919d0b2418a 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -2659,10 +2659,6 @@ struct GTY(()) machine_function {
>    /* True if the function needs a stack frame.  */
>    BOOL_BITFIELD stack_frame_required : 1;
>
> -  /* True if __builtin_ia32_vzeroupper () has been expanded in current
> -     function.  */
> -  BOOL_BITFIELD has_explicit_vzeroupper : 1;
> -
>    /* True if we should act silently, rather than raise an error for
>       invalid calls.  */
>    BOOL_BITFIELD silent_p : 1;
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 2fc8fae30f3..5d9f5aa39ac 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -191,6 +191,10 @@ (define_c_enum "unspec" [
>    ;; For MOVDIRI and MOVDIR64B support
>    UNSPEC_MOVDIRI
>    UNSPEC_MOVDIR64B
> +
> +  ;; For insn_callee_abi:
> +  UNSPEC_CALLEE_ABI
> +
>  ])
>
>  (define_c_enum "unspecv" [
> @@ -447,6 +451,12 @@ (define_constants
>     (FIRST_PSEUDO_REG           76)
>    ])
>
> +;; Insn callee abi index.
> +(define_constants
> +  [(I386_DEFAULT       0)
> +   (I386_VZEROUPPER    1)
> +   (I386_UNKNOWN       2)])

Please name these ABI_DEFAULT, ABI_VZEROUPPER and ABI_UNKNOWN.

> +
>  ;; Insns whose names begin with "x86_" are emitted by gen_FOO calls
>  ;; from i386.c.
>
> diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> index abd307ebdb8..8b787553f32 100644
> --- a/gcc/config/i386/predicates.md
> +++ b/gcc/config/i386/predicates.md
> @@ -1599,8 +1599,9 @@ (define_predicate "vzeroall_pattern"
>  ;; return true if OP is a vzeroupper pattern.
>  (define_predicate "vzeroupper_pattern"
>    (and (match_code "parallel")
> -       (match_code "unspec_volatile" "a")
> -       (match_test "XINT (XVECEXP (op, 0, 0), 1) == UNSPECV_VZEROUPPER")))
> +       (match_code "unspec" "b")
> +       (match_test "XINT (XVECEXP (op, 0, 1), 1) == UNSPEC_CALLEE_ABI")
> +       (match_test "INTVAL (XVECEXP (XVECEXP (op, 0, 1), 0, 0)) == I386_VZEROUPPER")))
>
>  ;; Return true if OP is an addsub vec_merge operation
>  (define_predicate "addsub_vm_operator"
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index a4503ddcb73..949347a3247 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -205,7 +205,6 @@ (define_c_enum "unspecv" [
>    UNSPECV_MONITOR
>    UNSPECV_MWAIT
>    UNSPECV_VZEROALL
> -  UNSPECV_VZEROUPPER
>
>    ;; For KEYLOCKER
>    UNSPECV_LOADIWKEY
> @@ -20857,14 +20856,22 @@ (define_insn "*avx_vzeroall"
>  ;; if the upper 128bits are unused.  Initially we expand the instructions
>  ;; as though they had no effect on the SSE registers, but later add SETs and
>  ;; CLOBBERs to the PARALLEL to model the real effect.
> +
>  (define_expand "avx_vzeroupper"
> -  [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> -  "TARGET_AVX")
> +  [(parallel [(call (mem:QI (const_int 0))
> +                   (const_int 0))
> +            (unspec [(const_int I386_VZEROUPPER)] UNSPEC_CALLEE_ABI)])]
> +  "TARGET_AVX"
> +{
> +  ix86_expand_avx_vzeroupper ();
> +  DONE;
> +})
>
> -(define_insn "*avx_vzeroupper"
> -  [(match_parallel 0 "vzeroupper_pattern"
> -     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> -  "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 16 : 8) + 1"
> +(define_insn "avx_vzeroupper_callee_abi"
> +  [(call (mem:QI (const_int 0))
> +        (const_int 0))
> +    (unspec [(const_int I386_VZEROUPPER)] UNSPEC_CALLEE_ABI)]
> +  "TARGET_AVX"
>    "vzeroupper"
>    [(set_attr "type" "sse")
>     (set_attr "modrm" "0")
> @@ -20873,44 +20880,6 @@ (define_insn "*avx_vzeroupper"
>     (set_attr "btver2_decode" "vector")
>     (set_attr "mode" "OI")])
>
> -(define_insn_and_split "*avx_vzeroupper_1"
> -  [(match_parallel 0 "vzeroupper_pattern"
> -     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> -  "TARGET_AVX && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1"
> -  "#"
> -  "&& epilogue_completed"
> -  [(match_dup 0)]
> -{
> -  /* For IPA-RA purposes, make it clear the instruction clobbers
> -     even XMM registers not mentioned explicitly in the pattern.  */
> -  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> -  unsigned int npats = XVECLEN (operands[0], 0);
> -  rtvec vec = rtvec_alloc (nregs + 1);
> -  RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0);
> -  for (unsigned int i = 0, j = 1; i < nregs; ++i)
> -    {
> -      unsigned int regno = GET_SSE_REGNO (i);
> -      if (j < npats
> -         && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno)
> -       {
> -         RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j);
> -         j++;
> -       }
> -      else
> -       {
> -         rtx reg = gen_rtx_REG (V2DImode, regno);
> -         RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
> -       }
> -    }
> -  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
> -}
> -  [(set_attr "type" "sse")
> -   (set_attr "modrm" "0")
> -   (set_attr "memory" "none")
> -   (set_attr "prefix" "vex")
> -   (set_attr "btver2_decode" "vector")
> -   (set_attr "mode" "OI")])
> -
>  (define_mode_attr pbroadcast_evex_isa
>    [(V64QI "avx512bw") (V32QI "avx512bw") (V16QI "avx512bw")
>     (V32HI "avx512bw") (V16HI "avx512bw") (V8HI "avx512bw")
> diff --git a/gcc/testsuite/gcc.target/i386/pr82735-1.c b/gcc/testsuite/gcc.target/i386/pr82735-1.c
> new file mode 100644
> index 00000000000..1a63b9ae9c9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr82735-1.c
> @@ -0,0 +1,29 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -mavx" } */
> +/* { dg-require-effective-target avx } */
> +
> +#include "avx-check.h"
> +
> +void
> +__attribute__ ((noipa))
> +mtest(char *dest)
> +{
> +  __m256i ymm1 = _mm256_set1_epi8((char)0x1);
> +  _mm256_storeu_si256((__m256i *)(dest + 32), ymm1);
> +  _mm256_zeroupper();
> +  __m256i ymm2 = _mm256_set1_epi8((char)0x1);
> +  _mm256_storeu_si256((__m256i *)dest, ymm2);
> +}
> +
> +void
> +avx_test ()
> +{
> +  char buf[64];
> +  for (int i = 0; i != 64; i++)
> +    buf[i] = 2;
> +  mtest (buf);
> +
> +  for (int i = 0; i < 32; ++i)
> +    if (buf[i] != 1)
> +      __builtin_abort ();
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr82735-2.c b/gcc/testsuite/gcc.target/i386/pr82735-2.c
> new file mode 100644
> index 00000000000..ac9d006f794
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr82735-2.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx -O2" } */
> +
> +#include <immintrin.h>
> +
> +void test(char *dest)
> +{
> +  /* xmm1 can be propagated to xmm2 by CSE.  */
> +  __m128i xmm1 = _mm_set_epi8(0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
> +                             0x9, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16);
> +  _mm_storeu_si128((__m128i *)(dest + 32), xmm1);
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  __m128i xmm2 = xmm1;
> +  _mm_storeu_si128((__m128i *)dest, xmm2);
> +}
> +
> +/* Darwin local constant symbol is "lC0", ELF targets ".LC0" */
> +/* { dg-final { scan-assembler-times {(?n)vmovdqa\t\.?[Ll]C0[^,]*, %xmm[0-9]} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr82735-3.c b/gcc/testsuite/gcc.target/i386/pr82735-3.c
> new file mode 100644
> index 00000000000..e3f801e6924
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr82735-3.c
> @@ -0,0 +1,5 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx -O2 -mabi=ms" } */
> +/* { dg-final { scan-assembler-not {(?n)xmm([6-9]|1[0-5])} } } */
> +
> +#include "pr82735-2.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr82735-4.c b/gcc/testsuite/gcc.target/i386/pr82735-4.c
> new file mode 100644
> index 00000000000..78c0a6cb2c8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr82735-4.c
> @@ -0,0 +1,48 @@
> +/* { dg-do compile { target { ! ia32 } } }  */
> +/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
> +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
> +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
> +
> +#include <immintrin.h>
> +
> +void test(char *dest)
> +{
> +  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
> +  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
> +               "vmovdqa\t%%ymm0, %1\n\t"
> +               "vmovdqa\t%%ymm0, %2\n\t"
> +               "vmovdqa\t%%ymm0, %3\n\t"
> +               "vmovdqa\t%%ymm0, %4\n\t"
> +               "vmovdqa\t%%ymm0, %5\n\t"
> +               "vmovdqa\t%%ymm0, %6\n\t"
> +               "vmovdqa\t%%ymm0, %7\n\t"
> +               "vmovdqa\t%%ymm0, %8\n\t"
> +               "vmovdqa\t%%ymm0, %9\n\t"
> +               "vmovdqa\t%%ymm0, %10\n\t"
> +               "vmovdqa\t%%ymm0, %11\n\t"
> +               "vmovdqa\t%%ymm0, %12\n\t"
> +               "vmovdqa\t%%ymm0, %13\n\t"
> +               "vmovdqa\t%%ymm0, %14\n\t"
> +               "vmovdqa\t%%ymm0, %15\n\t"
> +               : "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
> +                 "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
> +                 "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
> +                 "=v"(ymm0)
> +               ::);
> +  _mm256_zeroupper();
> +  _mm256_storeu_si256((__m256i *)dest, ymm1);
> +  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr82735-5.c b/gcc/testsuite/gcc.target/i386/pr82735-5.c
> new file mode 100644
> index 00000000000..2a58cbe52d0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr82735-5.c
> @@ -0,0 +1,54 @@
> +/* { dg-do compile { target { ! ia32 } } }  */
> +/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
> +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
> +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
> +
> +#include <immintrin.h>
> +
> +void test(char *dest)
> +{
> +  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
> +  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
> +               "vmovdqa\t%%ymm0, %1\n\t"
> +               "vmovdqa\t%%ymm0, %2\n\t"
> +               "vmovdqa\t%%ymm0, %3\n\t"
> +               "vmovdqa\t%%ymm0, %4\n\t"
> +               "vmovdqa\t%%ymm0, %5\n\t"
> +               "vmovdqa\t%%ymm0, %6\n\t"
> +               "vmovdqa\t%%ymm0, %7\n\t"
> +               "vmovdqa\t%%ymm0, %8\n\t"
> +               "vmovdqa\t%%ymm0, %9\n\t"
> +               "vmovdqa\t%%ymm0, %10\n\t"
> +               "vmovdqa\t%%ymm0, %11\n\t"
> +               "vmovdqa\t%%ymm0, %12\n\t"
> +               "vmovdqa\t%%ymm0, %13\n\t"
> +               "vmovdqa\t%%ymm0, %14\n\t"
> +               "vmovdqa\t%%ymm0, %15\n\t"
> +               : "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
> +                 "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
> +                 "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
> +                 "=v"(ymm0)
> +               ::);
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_zeroupper();
> +  _mm256_storeu_si256((__m256i *)dest, ymm1);
> +  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
> +  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
> +}
> --
> 2.18.1
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/2] Fix _mm256_zeroupper by representing the instructions as call_insns in which the call has a special vzeroupper ABI.
  2021-06-04  6:26                                   ` Uros Bizjak
@ 2021-06-04  6:34                                     ` Hongtao Liu
  2021-06-07 19:04                                       ` [PATCH] x86: Don't compile pr82735-[345].c for x32 H.J. Lu
  0 siblings, 1 reply; 45+ messages in thread
From: Hongtao Liu @ 2021-06-04  6:34 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: liuhongt, Jakub Jelinek, Richard Sandiford, gcc-patches

On Fri, Jun 4, 2021 at 2:27 PM Uros Bizjak via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Thu, Jun 3, 2021 at 8:54 AM liuhongt <hongtao.liu@intel.com> wrote:
> >
> > When __builtin_ia32_vzeroupper is called explicitly, the corresponding
> > vzeroupper pattern does not carry any CLOBBERS or SETs before LRA,
> > which leads to incorrect optimization in pass_reload. In order to
> > solve this problem, this patch refines the instructions as call_insns in
> > which the call has a special vzeroupper ABI.
> >
> > gcc/ChangeLog:
> >
> >         PR target/82735
> >         * config/i386/i386-expand.c (ix86_expand_builtin): Remove
> >         assignment of cfun->machine->has_explicit_vzeroupper.
> >         * config/i386/i386-features.c
> >         (ix86_add_reg_usage_to_vzerouppers): Delete.
> >         (ix86_add_reg_usage_to_vzeroupper): Ditto.
> >         (rest_of_handle_insert_vzeroupper): Remove
> >         ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
> >         of the function.
> >         (gate): Remove cfun->machine->has_explicit_vzeroupper.
> >         * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
> >         Declared.
> >         * config/i386/i386.c (ix86_insn_callee_abi): New function.
> >         (ix86_initialize_callee_abi): Ditto.
> >         (ix86_expand_avx_vzeroupper): Ditto.
> >         (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
> >         ABI.
> >         (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
> >         (ix86_emit_mode_set): Call ix86_expand_avx_vzeroupper
> >         directly.
> >         * config/i386/i386.h (struct GTY(()) machine_function): Delete
> >         has_explicit_vzeroupper.
> >         * config/i386/i386.md (enum unspec): New member
> >         UNSPEC_CALLEE_ABI.
> >         (I386_DEFAULT,I386_VZEROUPPER,I386_UNKNOWN): New
> >         define_constants for insn callee abi index.
> >         * config/i386/predicates.md (vzeroupper_pattern): Adjust.
> >         * config/i386/sse.md (UNSPECV_VZEROUPPER): Deleted.
> >         (avx_vzeroupper): Call ix86_expand_avx_vzeroupper.
> >         (*avx_vzeroupper): Rename to ..
> >         (avx_vzeroupper_callee_abi): .. this, and adjust pattern as
> >         call_insn which has a special vzeroupper ABI.
> >         (*avx_vzeroupper_1): Deleted.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         PR target/82735
> >         * gcc.target/i386/pr82735-1.c: New test.
> >         * gcc.target/i386/pr82735-2.c: New test.
> >         * gcc.target/i386/pr82735-3.c: New test.
> >         * gcc.target/i386/pr82735-4.c: New test.
> >         * gcc.target/i386/pr82735-5.c: New test.
>
> LGTM, with a small nit below.
>
> Thanks,
> Uros.
>
> > ---
> >  gcc/config/i386/i386-expand.c             |  4 -
> >  gcc/config/i386/i386-features.c           | 99 +++--------------------
> >  gcc/config/i386/i386-protos.h             |  1 +
> >  gcc/config/i386/i386.c                    | 55 ++++++++++++-
> >  gcc/config/i386/i386.h                    |  4 -
> >  gcc/config/i386/i386.md                   | 10 +++
> >  gcc/config/i386/predicates.md             |  5 +-
> >  gcc/config/i386/sse.md                    | 59 ++++----------
> >  gcc/testsuite/gcc.target/i386/pr82735-1.c | 29 +++++++
> >  gcc/testsuite/gcc.target/i386/pr82735-2.c | 22 +++++
> >  gcc/testsuite/gcc.target/i386/pr82735-3.c |  5 ++
> >  gcc/testsuite/gcc.target/i386/pr82735-4.c | 48 +++++++++++
> >  gcc/testsuite/gcc.target/i386/pr82735-5.c | 54 +++++++++++++
> >  13 files changed, 252 insertions(+), 143 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-3.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-4.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-5.c
> >
> > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> > index 9f3d41955a2..d25d59aa4e7 100644
> > --- a/gcc/config/i386/i386-expand.c
> > +++ b/gcc/config/i386/i386-expand.c
> > @@ -13282,10 +13282,6 @@ rdseed_step:
> >
> >        return 0;
> >
> > -    case IX86_BUILTIN_VZEROUPPER:
> > -      cfun->machine->has_explicit_vzeroupper = true;
> > -      break;
> > -
> >      default:
> >        break;
> >      }
> > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > index 77783a154b6..a25769ae478 100644
> > --- a/gcc/config/i386/i386-features.c
> > +++ b/gcc/config/i386/i386-features.c
> > @@ -1768,92 +1768,22 @@ convert_scalars_to_vector (bool timode_p)
> >    return 0;
> >  }
> >
> > -/* Modify the vzeroupper pattern in INSN so that it describes the effect
> > -   that the instruction has on the SSE registers.  LIVE_REGS are the set
> > -   of registers that are live across the instruction.
> > -
> > -   For a live register R we use:
> > -
> > -     (set (reg:V2DF R) (reg:V2DF R))
> > -
> > -   which preserves the low 128 bits but clobbers the upper bits.  */
> > -
> > -static void
> > -ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
> > -{
> > -  rtx pattern = PATTERN (insn);
> > -  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> > -  unsigned int npats = nregs;
> > -  for (unsigned int i = 0; i < nregs; ++i)
> > -    {
> > -      unsigned int regno = GET_SSE_REGNO (i);
> > -      if (!bitmap_bit_p (live_regs, regno))
> > -       npats--;
> > -    }
> > -  if (npats == 0)
> > -    return;
> > -  rtvec vec = rtvec_alloc (npats + 1);
> > -  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
> > -  for (unsigned int i = 0, j = 0; i < nregs; ++i)
> > -    {
> > -      unsigned int regno = GET_SSE_REGNO (i);
> > -      if (!bitmap_bit_p (live_regs, regno))
> > -       continue;
> > -      rtx reg = gen_rtx_REG (V2DImode, regno);
> > -      ++j;
> > -      RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg);
> > -    }
> > -  XVEC (pattern, 0) = vec;
> > -  INSN_CODE (insn) = -1;
> > -  df_insn_rescan (insn);
> > -}
> > -
> > -/* Walk the vzeroupper instructions in the function and annotate them
> > -   with the effect that they have on the SSE registers.  */
> > -
> > -static void
> > -ix86_add_reg_usage_to_vzerouppers (void)
> > -{
> > -  basic_block bb;
> > -  rtx_insn *insn;
> > -  auto_bitmap live_regs;
> > -
> > -  df_analyze ();
> > -  FOR_EACH_BB_FN (bb, cfun)
> > -    {
> > -      bitmap_copy (live_regs, df_get_live_out (bb));
> > -      df_simulate_initialize_backwards (bb, live_regs);
> > -      FOR_BB_INSNS_REVERSE (bb, insn)
> > -       {
> > -         if (!NONDEBUG_INSN_P (insn))
> > -           continue;
> > -         if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
> > -           ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
> > -         df_simulate_one_insn_backwards (bb, insn, live_regs);
> > -       }
> > -    }
> > -}
> > -
> >  static unsigned int
> >  rest_of_handle_insert_vzeroupper (void)
> >  {
> > -  if (TARGET_VZEROUPPER
> > -      && flag_expensive_optimizations
> > -      && !optimize_size)
> > -    {
> > -      /* vzeroupper instructions are inserted immediately after reload to
> > -        account for possible spills from 256bit or 512bit registers.  The pass
> > -        reuses mode switching infrastructure by re-running mode insertion
> > -        pass, so disable entities that have already been processed.  */
> > -      for (int i = 0; i < MAX_386_ENTITIES; i++)
> > -       ix86_optimize_mode_switching[i] = 0;
> > +  /* vzeroupper instructions are inserted immediately after reload to
> > +     account for possible spills from 256bit or 512bit registers.  The pass
> > +     reuses mode switching infrastructure by re-running mode insertion
> > +     pass, so disable entities that have already been processed.  */
> > +  for (int i = 0; i < MAX_386_ENTITIES; i++)
> > +    ix86_optimize_mode_switching[i] = 0;
> >
> > -      ix86_optimize_mode_switching[AVX_U128] = 1;
> > +  ix86_optimize_mode_switching[AVX_U128] = 1;
> >
> > -      /* Call optimize_mode_switching.  */
> > -      g->get_passes ()->execute_pass_mode_switching ();
> > -    }
> > -  ix86_add_reg_usage_to_vzerouppers ();
> > +  /* Call optimize_mode_switching.  */
> > +  g->get_passes ()->execute_pass_mode_switching ();
> > +
> > +  df_analyze ();
> >    return 0;
> >  }
> >
> > @@ -1882,11 +1812,8 @@ public:
> >    /* opt_pass methods: */
> >    virtual bool gate (function *)
> >      {
> > -      return TARGET_AVX
> > -            && ((TARGET_VZEROUPPER
> > -                 && flag_expensive_optimizations
> > -                 && !optimize_size)
> > -                || cfun->machine->has_explicit_vzeroupper);
> > +      return TARGET_AVX && TARGET_VZEROUPPER
> > +       && flag_expensive_optimizations && !optimize_size;
> >      }
> >
> >    virtual unsigned int execute (function *)
> > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > index 7782cf1163f..e6ac9390777 100644
> > --- a/gcc/config/i386/i386-protos.h
> > +++ b/gcc/config/i386/i386-protos.h
> > @@ -216,6 +216,7 @@ extern rtx ix86_split_stack_guard (void);
> >  extern void ix86_move_vector_high_sse_to_mmx (rtx);
> >  extern void ix86_split_mmx_pack (rtx[], enum rtx_code);
> >  extern void ix86_split_mmx_punpck (rtx[], bool);
> > +extern void ix86_expand_avx_vzeroupper (void);
> >
> >  #ifdef TREE_CODE
> >  extern void init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree, int);
> > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > index 743d8a25fe3..f0b66dd0d56 100644
> > --- a/gcc/config/i386/i386.c
> > +++ b/gcc/config/i386/i386.c
> > @@ -14426,7 +14426,7 @@ ix86_emit_mode_set (int entity, int mode, int prev_mode ATTRIBUTE_UNUSED,
> >        break;
> >      case AVX_U128:
> >        if (mode == AVX_U128_CLEAN)
> > -       emit_insn (gen_avx_vzeroupper ());
> > +       ix86_expand_avx_vzeroupper ();
> >        break;
> >      case I387_ROUNDEVEN:
> >      case I387_TRUNC:
> > @@ -19494,15 +19494,63 @@ ix86_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
> >    return false;
> >  }
> >
> > +/* Implement TARGET_INSN_CALLEE_ABI.  */
> > +
> > +const predefined_function_abi &
> > +ix86_insn_callee_abi (const rtx_insn *insn)
> > +{
> > +  unsigned int abi_id = 0;
> > +  rtx pat = PATTERN (insn);
> > +  if (vzeroupper_pattern (pat, VOIDmode))
> > +    abi_id = I386_VZEROUPPER;
> > +
> > +  return function_abis[abi_id];
> > +}
> > +
> > +/* Initialize function_abis with corresponding abi_id,
> > +   currently only handles vzeroupper.  */
> > +void
> > +ix86_initialize_callee_abi (unsigned int abi_id)
> > +{
> > +  gcc_assert (abi_id == I386_VZEROUPPER);
> > +  predefined_function_abi &vzeroupper_abi = function_abis[abi_id];
> > +  if (!vzeroupper_abi.initialized_p ())
> > +    {
> > +      HARD_REG_SET full_reg_clobbers;
> > +      CLEAR_HARD_REG_SET (full_reg_clobbers);
> > +      vzeroupper_abi.initialize (I386_VZEROUPPER, full_reg_clobbers);
> > +    }
> > +}
> > +
> > +void
> > +ix86_expand_avx_vzeroupper (void)
> > +{
> > +  /* Initialize vzeroupper_abi here.  */
> > +  ix86_initialize_callee_abi (I386_VZEROUPPER);
> > +  rtx_insn *insn = emit_call_insn (gen_avx_vzeroupper_callee_abi ());
> > +  /* Return false for non-local goto in can_nonlocal_goto.  */
> > +  make_reg_eh_region_note (insn, 0, INT_MIN);
> > +  /* The used flag of a call_insn indicates it's a fake call.  */
> > +  RTX_FLAG (insn, used) = 1;
> > +}
> > +
> > +
> >  /* Implement TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  The only ABI that
> >     saves SSE registers across calls is Win64 (thus no need to check the
> >     current ABI here), and with AVX enabled Win64 only guarantees that
> >     the low 16 bytes are saved.  */
> >
> >  static bool
> > -ix86_hard_regno_call_part_clobbered (unsigned int, unsigned int regno,
> > +ix86_hard_regno_call_part_clobbered (unsigned int abi_id, unsigned int regno,
> >                                      machine_mode mode)
> >  {
> > +  /* Special ABI for vzeroupper which only clobbers the upper part of sse regs.  */
> > +  if (abi_id == I386_VZEROUPPER)
> > +      return (GET_MODE_SIZE (mode) > 16
> > +             && ((TARGET_64BIT
> > +                  && (IN_RANGE (regno, FIRST_REX_SSE_REG, LAST_REX_SSE_REG)))
> > +                 || (IN_RANGE (regno, FIRST_SSE_REG, LAST_SSE_REG))));
> > +
> >    return SSE_REGNO_P (regno) && GET_MODE_SIZE (mode) > 16;
> >  }
> >
> > @@ -23916,6 +23964,9 @@ ix86_run_selftests (void)
> >  #define TARGET_HARD_REGNO_CALL_PART_CLOBBERED \
> >    ix86_hard_regno_call_part_clobbered
> >
> > +#undef TARGET_INSN_CALLEE_ABI
> > +#define TARGET_INSN_CALLEE_ABI ix86_insn_callee_abi
> > +
> >  #undef TARGET_CAN_CHANGE_MODE_CLASS
> >  #define TARGET_CAN_CHANGE_MODE_CLASS ix86_can_change_mode_class
> >
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index 53d503fc6e0..919d0b2418a 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -2659,10 +2659,6 @@ struct GTY(()) machine_function {
> >    /* True if the function needs a stack frame.  */
> >    BOOL_BITFIELD stack_frame_required : 1;
> >
> > -  /* True if __builtin_ia32_vzeroupper () has been expanded in current
> > -     function.  */
> > -  BOOL_BITFIELD has_explicit_vzeroupper : 1;
> > -
> >    /* True if we should act silently, rather than raise an error for
> >       invalid calls.  */
> >    BOOL_BITFIELD silent_p : 1;
> > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> > index 2fc8fae30f3..5d9f5aa39ac 100644
> > --- a/gcc/config/i386/i386.md
> > +++ b/gcc/config/i386/i386.md
> > @@ -191,6 +191,10 @@ (define_c_enum "unspec" [
> >    ;; For MOVDIRI and MOVDIR64B support
> >    UNSPEC_MOVDIRI
> >    UNSPEC_MOVDIR64B
> > +
> > +  ;; For insn_callee_abi:
> > +  UNSPEC_CALLEE_ABI
> > +
> >  ])
> >
> >  (define_c_enum "unspecv" [
> > @@ -447,6 +451,12 @@ (define_constants
> >     (FIRST_PSEUDO_REG           76)
> >    ])
> >
> > +;; Insn callee abi index.
> > +(define_constants
> > +  [(I386_DEFAULT       0)
> > +   (I386_VZEROUPPER    1)
> > +   (I386_UNKNOWN       2)])
>
> Please name these ABI_DEFAULT, ABI_VZEROUPPER and ABI_UNKNOWN.
Yes, thanks for the review.
I'll hold this patch until the middle-end part is approved.
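
As context for the middle-end part: once patch 1/2 (FAKE_CALL_P plus the
TARGET_INSN_CALLEE_ABI hook) is in, the register allocator no longer needs
CLOBBERs in the pattern; it asks the insn for its ABI instead.  A minimal
sketch of such a query, assuming the function-abi.h names of that time
(insn_callee_abi, function_abi::clobbers_reg_p); this is a sketch of the
idea, not the committed API:

  /* Does this (possibly fake) call clobber REGNO when it holds MODE?
     For the vzeroupper ABI this is true only for SSE registers in
     modes wider than 16 bytes, i.e. only bits above the low 128 die.  */
  static bool
  call_clobbers_reg_p (const rtx_insn *insn, machine_mode mode,
                       unsigned int regno)
  {
    function_abi abi = insn_callee_abi (insn);
    return abi.clobbers_reg_p (mode, regno);
  }
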
>
> > +
> >  ;; Insns whose names begin with "x86_" are emitted by gen_FOO calls
> >  ;; from i386.c.
> >
> > diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> > index abd307ebdb8..8b787553f32 100644
> > --- a/gcc/config/i386/predicates.md
> > +++ b/gcc/config/i386/predicates.md
> > @@ -1599,8 +1599,9 @@ (define_predicate "vzeroall_pattern"
> >  ;; return true if OP is a vzeroupper pattern.
> >  (define_predicate "vzeroupper_pattern"
> >    (and (match_code "parallel")
> > -       (match_code "unspec_volatile" "a")
> > -       (match_test "XINT (XVECEXP (op, 0, 0), 1) == UNSPECV_VZEROUPPER")))
> > +       (match_code "unspec" "b")
> > +       (match_test "XINT (XVECEXP (op, 0, 1), 1) == UNSPEC_CALLEE_ABI")
> > +       (match_test "INTVAL (XVECEXP (XVECEXP (op, 0, 1), 0, 0)) == I386_VZEROUPPER")))
> >
> >  ;; Return true if OP is an addsub vec_merge operation
> >  (define_predicate "addsub_vm_operator"
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index a4503ddcb73..949347a3247 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -205,7 +205,6 @@ (define_c_enum "unspecv" [
> >    UNSPECV_MONITOR
> >    UNSPECV_MWAIT
> >    UNSPECV_VZEROALL
> > -  UNSPECV_VZEROUPPER
> >
> >    ;; For KEYLOCKER
> >    UNSPECV_LOADIWKEY
> > @@ -20857,14 +20856,22 @@ (define_insn "*avx_vzeroall"
> >  ;; if the upper 128bits are unused.  Initially we expand the instructions
> >  ;; as though they had no effect on the SSE registers, but later add SETs and
> >  ;; CLOBBERs to the PARALLEL to model the real effect.
> > +
> >  (define_expand "avx_vzeroupper"
> > -  [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> > -  "TARGET_AVX")
> > +  [(parallel [(call (mem:QI (const_int 0))
> > +                   (const_int 0))
> > +            (unspec [(const_int I386_VZEROUPPER)] UNSPEC_CALLEE_ABI)])]
> > +  "TARGET_AVX"
> > +{
> > +  ix86_expand_avx_vzeroupper ();
> > +  DONE;
> > +})
> >
> > -(define_insn "*avx_vzeroupper"
> > -  [(match_parallel 0 "vzeroupper_pattern"
> > -     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> > -  "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 16 : 8) + 1"
> > +(define_insn "avx_vzeroupper_callee_abi"
> > +  [(call (mem:QI (const_int 0))
> > +        (const_int 0))
> > +    (unspec [(const_int I386_VZEROUPPER)] UNSPEC_CALLEE_ABI)]
> > +  "TARGET_AVX"
> >    "vzeroupper"
> >    [(set_attr "type" "sse")
> >     (set_attr "modrm" "0")
> > @@ -20873,44 +20880,6 @@ (define_insn "*avx_vzeroupper"
> >     (set_attr "btver2_decode" "vector")
> >     (set_attr "mode" "OI")])
> >
> > -(define_insn_and_split "*avx_vzeroupper_1"
> > -  [(match_parallel 0 "vzeroupper_pattern"
> > -     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> > -  "TARGET_AVX && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1"
> > -  "#"
> > -  "&& epilogue_completed"
> > -  [(match_dup 0)]
> > -{
> > -  /* For IPA-RA purposes, make it clear the instruction clobbers
> > -     even XMM registers not mentioned explicitly in the pattern.  */
> > -  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> > -  unsigned int npats = XVECLEN (operands[0], 0);
> > -  rtvec vec = rtvec_alloc (nregs + 1);
> > -  RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0);
> > -  for (unsigned int i = 0, j = 1; i < nregs; ++i)
> > -    {
> > -      unsigned int regno = GET_SSE_REGNO (i);
> > -      if (j < npats
> > -         && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno)
> > -       {
> > -         RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j);
> > -         j++;
> > -       }
> > -      else
> > -       {
> > -         rtx reg = gen_rtx_REG (V2DImode, regno);
> > -         RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
> > -       }
> > -    }
> > -  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
> > -}
> > -  [(set_attr "type" "sse")
> > -   (set_attr "modrm" "0")
> > -   (set_attr "memory" "none")
> > -   (set_attr "prefix" "vex")
> > -   (set_attr "btver2_decode" "vector")
> > -   (set_attr "mode" "OI")])
> > -
> >  (define_mode_attr pbroadcast_evex_isa
> >    [(V64QI "avx512bw") (V32QI "avx512bw") (V16QI "avx512bw")
> >     (V32HI "avx512bw") (V16HI "avx512bw") (V8HI "avx512bw")
> > diff --git a/gcc/testsuite/gcc.target/i386/pr82735-1.c b/gcc/testsuite/gcc.target/i386/pr82735-1.c
> > new file mode 100644
> > index 00000000000..1a63b9ae9c9
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr82735-1.c
> > @@ -0,0 +1,29 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O2 -mavx" } */
> > +/* { dg-require-effective-target avx } */
> > +
> > +#include "avx-check.h"
> > +
> > +void
> > +__attribute__ ((noipa))
> > +mtest(char *dest)
> > +{
> > +  __m256i ymm1 = _mm256_set1_epi8((char)0x1);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32), ymm1);
> > +  _mm256_zeroupper();
> > +  __m256i ymm2 = _mm256_set1_epi8((char)0x1);
> > +  _mm256_storeu_si256((__m256i *)dest, ymm2);
> > +}
> > +
> > +void
> > +avx_test ()
> > +{
> > +  char buf[64];
> > +  for (int i = 0; i != 64; i++)
> > +    buf[i] = 2;
> > +  mtest (buf);
> > +
> > +  for (int i = 0; i < 32; ++i)
> > +    if (buf[i] != 1)
> > +      __builtin_abort ();
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/pr82735-2.c b/gcc/testsuite/gcc.target/i386/pr82735-2.c
> > new file mode 100644
> > index 00000000000..ac9d006f794
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr82735-2.c
> > @@ -0,0 +1,22 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-mavx -O2" } */
> > +
> > +#include <immintrin.h>
> > +
> > +void test(char *dest)
> > +{
> > +  /* xmm1 can be propagated to xmm2 by CSE.  */
> > +  __m128i xmm1 = _mm_set_epi8(0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
> > +                             0x9, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16);
> > +  _mm_storeu_si128((__m128i *)(dest + 32), xmm1);
> > +  _mm256_zeroupper();
> > +  _mm256_zeroupper();
> > +  _mm256_zeroupper();
> > +  _mm256_zeroupper();
> > +  _mm256_zeroupper();
> > +  __m128i xmm2 = xmm1;
> > +  _mm_storeu_si128((__m128i *)dest, xmm2);
> > +}
> > +
> > +/* Darwin local constant symbol is "lC0", ELF targets ".LC0" */
> > +/* { dg-final { scan-assembler-times {(?n)vmovdqa\t\.?[Ll]C0[^,]*, %xmm[0-9]} 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr82735-3.c b/gcc/testsuite/gcc.target/i386/pr82735-3.c
> > new file mode 100644
> > index 00000000000..e3f801e6924
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr82735-3.c
> > @@ -0,0 +1,5 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-mavx -O2 -mabi=ms" } */
> > +/* { dg-final { scan-assembler-not {(?n)xmm([6-9]|1[0-5])} } } */
> > +
> > +#include "pr82735-2.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr82735-4.c b/gcc/testsuite/gcc.target/i386/pr82735-4.c
> > new file mode 100644
> > index 00000000000..78c0a6cb2c8
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr82735-4.c
> > @@ -0,0 +1,48 @@
> > +/* { dg-do compile { target { ! ia32 } } }  */
> > +/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
> > +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
> > +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
> > +
> > +#include <immintrin.h>
> > +
> > +void test(char *dest)
> > +{
> > +  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
> > +  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
> > +               "vmovdqa\t%%ymm0, %1\n\t"
> > +               "vmovdqa\t%%ymm0, %2\n\t"
> > +               "vmovdqa\t%%ymm0, %3\n\t"
> > +               "vmovdqa\t%%ymm0, %4\n\t"
> > +               "vmovdqa\t%%ymm0, %5\n\t"
> > +               "vmovdqa\t%%ymm0, %6\n\t"
> > +               "vmovdqa\t%%ymm0, %7\n\t"
> > +               "vmovdqa\t%%ymm0, %8\n\t"
> > +               "vmovdqa\t%%ymm0, %9\n\t"
> > +               "vmovdqa\t%%ymm0, %10\n\t"
> > +               "vmovdqa\t%%ymm0, %11\n\t"
> > +               "vmovdqa\t%%ymm0, %12\n\t"
> > +               "vmovdqa\t%%ymm0, %13\n\t"
> > +               "vmovdqa\t%%ymm0, %14\n\t"
> > +               "vmovdqa\t%%ymm0, %15\n\t"
> > +               : "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
> > +                 "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
> > +                 "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
> > +                 "=v"(ymm0)
> > +               ::);
> > +  _mm256_zeroupper();
> > +  _mm256_storeu_si256((__m256i *)dest, ymm1);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/pr82735-5.c b/gcc/testsuite/gcc.target/i386/pr82735-5.c
> > new file mode 100644
> > index 00000000000..2a58cbe52d0
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr82735-5.c
> > @@ -0,0 +1,54 @@
> > +/* { dg-do compile { target { ! ia32 } } }  */
> > +/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
> > +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
> > +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
> > +
> > +#include <immintrin.h>
> > +
> > +void test(char *dest)
> > +{
> > +  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
> > +  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
> > +               "vmovdqa\t%%ymm0, %1\n\t"
> > +               "vmovdqa\t%%ymm0, %2\n\t"
> > +               "vmovdqa\t%%ymm0, %3\n\t"
> > +               "vmovdqa\t%%ymm0, %4\n\t"
> > +               "vmovdqa\t%%ymm0, %5\n\t"
> > +               "vmovdqa\t%%ymm0, %6\n\t"
> > +               "vmovdqa\t%%ymm0, %7\n\t"
> > +               "vmovdqa\t%%ymm0, %8\n\t"
> > +               "vmovdqa\t%%ymm0, %9\n\t"
> > +               "vmovdqa\t%%ymm0, %10\n\t"
> > +               "vmovdqa\t%%ymm0, %11\n\t"
> > +               "vmovdqa\t%%ymm0, %12\n\t"
> > +               "vmovdqa\t%%ymm0, %13\n\t"
> > +               "vmovdqa\t%%ymm0, %14\n\t"
> > +               "vmovdqa\t%%ymm0, %15\n\t"
> > +               : "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
> > +                 "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
> > +                 "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
> > +                 "=v"(ymm0)
> > +               ::);
> > +  _mm256_zeroupper();
> > +  _mm256_zeroupper();
> > +  _mm256_zeroupper();
> > +  _mm256_zeroupper();
> > +  _mm256_zeroupper();
> > +  _mm256_zeroupper();
> > +  _mm256_zeroupper();
> > +  _mm256_storeu_si256((__m256i *)dest, ymm1);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
> > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
> > +}
> > --
> > 2.18.1
> >



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-06-03  6:54                               ` [PATCH 1/2] CALL_INSN may not be a real function call liuhongt
  2021-06-03  6:54                                 ` [PATCH 2/2] Fix _mm256_zeroupper by representing the instructions as call_insns in which the call has a special vzeroupper ABI liuhongt
  2021-06-04  2:55                                 ` [PATCH 1/2] CALL_INSN may not be a real function call Hongtao Liu
@ 2021-06-04  7:50                                 ` Jakub Jelinek
  2021-07-05 23:30                                 ` Segher Boessenkool
  3 siblings, 0 replies; 45+ messages in thread
From: Jakub Jelinek @ 2021-06-04  7:50 UTC (permalink / raw)
  To: liuhongt; +Cc: gcc-patches, richard.sandiford, ubizjak

On Thu, Jun 03, 2021 at 02:54:07PM +0800, liuhongt wrote:
> Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a
> fake call, it won't have its own function stack.
> 
> gcc/ChangeLog
> 
> 	PR target/82735
> 	* df-scan.c (df_get_call_refs): When call_insn is a fake call,
> 	it won't use stack pointer reg.
> 	* final.c (leaf_function_p): When call_insn is a fake call, it
> 	won't prevent the caller from being a leaf function.
> 	* reg-stack.c (callee_clobbers_any_stack_reg): New.
> 	(subst_stack_regs): When call_insn doesn't clobber any stack
> 	reg, don't clear the arguments.
> 	* rtl.c (shallow_copy_rtx): Don't clear the used flag when orig
> 	is an insn.
> 	* shrink-wrap.c (requires_stack_frame_p): No need for stack
> 	frame for a fake call.
> 	* rtl.h (FAKE_CALL_P): New macro.
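
For reference, the rtl.h piece is tiny: it reuses the "used" bit, which had
no meaning on a CALL_INSN before, to mark calls that are not real transfers
of control.  A minimal sketch of the macro and of a typical guard in the
passes listed above; the committed wording may differ slightly:

  /* 1 if RTX is a call_insn for a fake call.
     Reuses the "used" bit, otherwise unused on CALL_INSNs.  */
  #define FAKE_CALL_P(RTX) \
    (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)

  /* E.g. in leaf_function_p: a fake call such as vzeroupper must not
     make the caller a non-leaf function.  */
  if (CALL_P (insn) && !FAKE_CALL_P (insn))
    return 0;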

Ok, thanks.

	Jakub


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH] x86: Don't compile pr82735-[345].c for x32
  2021-06-04  6:34                                     ` Hongtao Liu
@ 2021-06-07 19:04                                       ` H.J. Lu
  0 siblings, 0 replies; 45+ messages in thread
From: H.J. Lu @ 2021-06-07 19:04 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: Uros Bizjak, Jakub Jelinek, Richard Sandiford, liuhongt, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 28299 bytes --]

On Thu, Jun 3, 2021 at 11:31 PM Hongtao Liu via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Fri, Jun 4, 2021 at 2:27 PM Uros Bizjak via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Thu, Jun 3, 2021 at 8:54 AM liuhongt <hongtao.liu@intel.com> wrote:
> > >
> > > When __builtin_ia32_vzeroupper is called explicitly, the corresponding
> > > vzeroupper pattern does not carry any CLOBBERS or SETs before LRA,
> > > which leads to incorrect optimization in pass_reload. In order to
> > > solve this problem, this patch refine instructions as call_insns in
> > > which the call has a special vzeroupper ABI.
> > >
> > > gcc/ChangeLog:
> > >
> > >         PR target/82735
> > >         * config/i386/i386-expand.c (ix86_expand_builtin): Remove
> > >         assignment of cfun->machine->has_explicit_vzeroupper.
> > >         * config/i386/i386-features.c
> > >         (ix86_add_reg_usage_to_vzerouppers): Delete.
> > >         (ix86_add_reg_usage_to_vzeroupper): Ditto.
> > >         (rest_of_handle_insert_vzeroupper): Remove
> > >         ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
> > >         of the function.
> > >         (gate): Remove cfun->machine->has_explicit_vzeroupper.
> > >         * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
> > >         Declared.
> > >         * config/i386/i386.c (ix86_insn_callee_abi): New function.
> > >         (ix86_initialize_callee_abi): Ditto.
> > >         (ix86_expand_avx_vzeroupper): Ditto.
> > >         (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
> > >         ABI.
> > >         (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
> > >         (ix86_emit_mode_set): Call ix86_expand_avx_vzeroupper
> > >         directly.
> > >         * config/i386/i386.h (struct GTY(()) machine_function): Delete
> > >         has_explicit_vzeroupper.
> > >         * config/i386/i386.md (enum unspec): New member
> > >         UNSPEC_CALLEE_ABI.
> > >         (I386_DEFAULT,I386_VZEROUPPER,I386_UNKNOWN): New
> > >         define_constants for insn callee abi index.
> > >         * config/i386/predicates.md (vzeroupper_pattern): Adjust.
> > >         * config/i386/sse.md (UNSPECV_VZEROUPPER): Deleted.
> > >         (avx_vzeroupper): Call ix86_expand_avx_vzeroupper.
> > >         (*avx_vzeroupper): Rename to ..
> > >         (avx_vzeroupper_callee_abi): .. this, and adjust pattern as
> > >         call_insn which has a special vzeroupper ABI.
> > >         (*avx_vzeroupper_1): Deleted.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         PR target/82735
> > >         * gcc.target/i386/pr82735-1.c: New test.
> > >         * gcc.target/i386/pr82735-2.c: New test.
> > >         * gcc.target/i386/pr82735-3.c: New test.
> > >         * gcc.target/i386/pr82735-4.c: New test.
> > >         * gcc.target/i386/pr82735-5.c: New test.
> >
> > LGTM, with a small nit below.
> >
> > Thanks,
> > Uros.
> >
> > > ---
> > >  gcc/config/i386/i386-expand.c             |  4 -
> > >  gcc/config/i386/i386-features.c           | 99 +++--------------------
> > >  gcc/config/i386/i386-protos.h             |  1 +
> > >  gcc/config/i386/i386.c                    | 55 ++++++++++++-
> > >  gcc/config/i386/i386.h                    |  4 -
> > >  gcc/config/i386/i386.md                   | 10 +++
> > >  gcc/config/i386/predicates.md             |  5 +-
> > >  gcc/config/i386/sse.md                    | 59 ++++----------
> > >  gcc/testsuite/gcc.target/i386/pr82735-1.c | 29 +++++++
> > >  gcc/testsuite/gcc.target/i386/pr82735-2.c | 22 +++++
> > >  gcc/testsuite/gcc.target/i386/pr82735-3.c |  5 ++
> > >  gcc/testsuite/gcc.target/i386/pr82735-4.c | 48 +++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr82735-5.c | 54 +++++++++++++
> > >  13 files changed, 252 insertions(+), 143 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-2.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-3.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-4.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-5.c
> > >
> > > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> > > index 9f3d41955a2..d25d59aa4e7 100644
> > > --- a/gcc/config/i386/i386-expand.c
> > > +++ b/gcc/config/i386/i386-expand.c
> > > @@ -13282,10 +13282,6 @@ rdseed_step:
> > >
> > >        return 0;
> > >
> > > -    case IX86_BUILTIN_VZEROUPPER:
> > > -      cfun->machine->has_explicit_vzeroupper = true;
> > > -      break;
> > > -
> > >      default:
> > >        break;
> > >      }
> > > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > > index 77783a154b6..a25769ae478 100644
> > > --- a/gcc/config/i386/i386-features.c
> > > +++ b/gcc/config/i386/i386-features.c
> > > @@ -1768,92 +1768,22 @@ convert_scalars_to_vector (bool timode_p)
> > >    return 0;
> > >  }
> > >
> > > -/* Modify the vzeroupper pattern in INSN so that it describes the effect
> > > -   that the instruction has on the SSE registers.  LIVE_REGS are the set
> > > -   of registers that are live across the instruction.
> > > -
> > > -   For a live register R we use:
> > > -
> > > -     (set (reg:V2DF R) (reg:V2DF R))
> > > -
> > > -   which preserves the low 128 bits but clobbers the upper bits.  */
> > > -
> > > -static void
> > > -ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
> > > -{
> > > -  rtx pattern = PATTERN (insn);
> > > -  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> > > -  unsigned int npats = nregs;
> > > -  for (unsigned int i = 0; i < nregs; ++i)
> > > -    {
> > > -      unsigned int regno = GET_SSE_REGNO (i);
> > > -      if (!bitmap_bit_p (live_regs, regno))
> > > -       npats--;
> > > -    }
> > > -  if (npats == 0)
> > > -    return;
> > > -  rtvec vec = rtvec_alloc (npats + 1);
> > > -  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
> > > -  for (unsigned int i = 0, j = 0; i < nregs; ++i)
> > > -    {
> > > -      unsigned int regno = GET_SSE_REGNO (i);
> > > -      if (!bitmap_bit_p (live_regs, regno))
> > > -       continue;
> > > -      rtx reg = gen_rtx_REG (V2DImode, regno);
> > > -      ++j;
> > > -      RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg);
> > > -    }
> > > -  XVEC (pattern, 0) = vec;
> > > -  INSN_CODE (insn) = -1;
> > > -  df_insn_rescan (insn);
> > > -}
> > > -
> > > -/* Walk the vzeroupper instructions in the function and annotate them
> > > -   with the effect that they have on the SSE registers.  */
> > > -
> > > -static void
> > > -ix86_add_reg_usage_to_vzerouppers (void)
> > > -{
> > > -  basic_block bb;
> > > -  rtx_insn *insn;
> > > -  auto_bitmap live_regs;
> > > -
> > > -  df_analyze ();
> > > -  FOR_EACH_BB_FN (bb, cfun)
> > > -    {
> > > -      bitmap_copy (live_regs, df_get_live_out (bb));
> > > -      df_simulate_initialize_backwards (bb, live_regs);
> > > -      FOR_BB_INSNS_REVERSE (bb, insn)
> > > -       {
> > > -         if (!NONDEBUG_INSN_P (insn))
> > > -           continue;
> > > -         if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
> > > -           ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
> > > -         df_simulate_one_insn_backwards (bb, insn, live_regs);
> > > -       }
> > > -    }
> > > -}
> > > -
> > >  static unsigned int
> > >  rest_of_handle_insert_vzeroupper (void)
> > >  {
> > > -  if (TARGET_VZEROUPPER
> > > -      && flag_expensive_optimizations
> > > -      && !optimize_size)
> > > -    {
> > > -      /* vzeroupper instructions are inserted immediately after reload to
> > > -        account for possible spills from 256bit or 512bit registers.  The pass
> > > -        reuses mode switching infrastructure by re-running mode insertion
> > > -        pass, so disable entities that have already been processed.  */
> > > -      for (int i = 0; i < MAX_386_ENTITIES; i++)
> > > -       ix86_optimize_mode_switching[i] = 0;
> > > +  /* vzeroupper instructions are inserted immediately after reload to
> > > +     account for possible spills from 256bit or 512bit registers.  The pass
> > > +     reuses mode switching infrastructure by re-running mode insertion
> > > +     pass, so disable entities that have already been processed.  */
> > > +  for (int i = 0; i < MAX_386_ENTITIES; i++)
> > > +    ix86_optimize_mode_switching[i] = 0;
> > >
> > > -      ix86_optimize_mode_switching[AVX_U128] = 1;
> > > +  ix86_optimize_mode_switching[AVX_U128] = 1;
> > >
> > > -      /* Call optimize_mode_switching.  */
> > > -      g->get_passes ()->execute_pass_mode_switching ();
> > > -    }
> > > -  ix86_add_reg_usage_to_vzerouppers ();
> > > +  /* Call optimize_mode_switching.  */
> > > +  g->get_passes ()->execute_pass_mode_switching ();
> > > +
> > > +  df_analyze ();
> > >    return 0;
> > >  }
> > >
> > > @@ -1882,11 +1812,8 @@ public:
> > >    /* opt_pass methods: */
> > >    virtual bool gate (function *)
> > >      {
> > > -      return TARGET_AVX
> > > -            && ((TARGET_VZEROUPPER
> > > -                 && flag_expensive_optimizations
> > > -                 && !optimize_size)
> > > -                || cfun->machine->has_explicit_vzeroupper);
> > > +      return TARGET_AVX && TARGET_VZEROUPPER
> > > +       && flag_expensive_optimizations && !optimize_size;
> > >      }
> > >
> > >    virtual unsigned int execute (function *)
> > > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > > index 7782cf1163f..e6ac9390777 100644
> > > --- a/gcc/config/i386/i386-protos.h
> > > +++ b/gcc/config/i386/i386-protos.h
> > > @@ -216,6 +216,7 @@ extern rtx ix86_split_stack_guard (void);
> > >  extern void ix86_move_vector_high_sse_to_mmx (rtx);
> > >  extern void ix86_split_mmx_pack (rtx[], enum rtx_code);
> > >  extern void ix86_split_mmx_punpck (rtx[], bool);
> > > +extern void ix86_expand_avx_vzeroupper (void);
> > >
> > >  #ifdef TREE_CODE
> > >  extern void init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree, int);
> > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > > index 743d8a25fe3..f0b66dd0d56 100644
> > > --- a/gcc/config/i386/i386.c
> > > +++ b/gcc/config/i386/i386.c
> > > @@ -14426,7 +14426,7 @@ ix86_emit_mode_set (int entity, int mode, int prev_mode ATTRIBUTE_UNUSED,
> > >        break;
> > >      case AVX_U128:
> > >        if (mode == AVX_U128_CLEAN)
> > > -       emit_insn (gen_avx_vzeroupper ());
> > > +       ix86_expand_avx_vzeroupper ();
> > >        break;
> > >      case I387_ROUNDEVEN:
> > >      case I387_TRUNC:
> > > @@ -19494,15 +19494,63 @@ ix86_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
> > >    return false;
> > >  }
> > >
> > > +/* Implement TARGET_INSN_CALLEE_ABI.  */
> > > +
> > > +const predefined_function_abi &
> > > +ix86_insn_callee_abi (const rtx_insn *insn)
> > > +{
> > > +  unsigned int abi_id = 0;
> > > +  rtx pat = PATTERN (insn);
> > > +  if (vzeroupper_pattern (pat, VOIDmode))
> > > +    abi_id = I386_VZEROUPPER;
> > > +
> > > +  return function_abis[abi_id];
> > > +}
> > > +
> > > +/* Initialize function_abis with corresponding abi_id,
> > > +   currently only handles vzeroupper.  */
> > > +void
> > > +ix86_initialize_callee_abi (unsigned int abi_id)
> > > +{
> > > +  gcc_assert (abi_id == I386_VZEROUPPER);
> > > +  predefined_function_abi &vzeroupper_abi = function_abis[abi_id];
> > > +  if (!vzeroupper_abi.initialized_p ())
> > > +    {
> > > +      HARD_REG_SET full_reg_clobbers;
> > > +      CLEAR_HARD_REG_SET (full_reg_clobbers);
> > > +      vzeroupper_abi.initialize (I386_VZEROUPPER, full_reg_clobbers);
> > > +    }
> > > +}
> > > +
> > > +void
> > > +ix86_expand_avx_vzeroupper (void)
> > > +{
> > > +  /* Initialize vzeroupper_abi here.  */
> > > +  ix86_initialize_callee_abi (I386_VZEROUPPER);
> > > +  rtx_insn *insn = emit_call_insn (gen_avx_vzeroupper_callee_abi ());
> > > +  /* Return false for non-local goto in can_nonlocal_goto.  */
> > > +  make_reg_eh_region_note (insn, 0, INT_MIN);
> > > +  /* The used flag of a call_insn indicates it's a fake call.  */
> > > +  RTX_FLAG (insn, used) = 1;
> > > +}
> > > +
> > > +
> > >  /* Implement TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  The only ABI that
> > >     saves SSE registers across calls is Win64 (thus no need to check the
> > >     current ABI here), and with AVX enabled Win64 only guarantees that
> > >     the low 16 bytes are saved.  */
> > >
> > >  static bool
> > > -ix86_hard_regno_call_part_clobbered (unsigned int, unsigned int regno,
> > > +ix86_hard_regno_call_part_clobbered (unsigned int abi_id, unsigned int regno,
> > >                                      machine_mode mode)
> > >  {
> > > +  /* Special ABI for vzeroupper which only clobbers the upper part of sse regs.  */
> > > +  if (abi_id == I386_VZEROUPPER)
> > > +      return (GET_MODE_SIZE (mode) > 16
> > > +             && ((TARGET_64BIT
> > > +                  && (IN_RANGE (regno, FIRST_REX_SSE_REG, LAST_REX_SSE_REG)))
> > > +                 || (IN_RANGE (regno, FIRST_SSE_REG, LAST_SSE_REG))));
> > > +
> > >    return SSE_REGNO_P (regno) && GET_MODE_SIZE (mode) > 16;
> > >  }
> > >
> > > @@ -23916,6 +23964,9 @@ ix86_run_selftests (void)
> > >  #define TARGET_HARD_REGNO_CALL_PART_CLOBBERED \
> > >    ix86_hard_regno_call_part_clobbered
> > >
> > > +#undef TARGET_INSN_CALLEE_ABI
> > > +#define TARGET_INSN_CALLEE_ABI ix86_insn_callee_abi
> > > +
> > >  #undef TARGET_CAN_CHANGE_MODE_CLASS
> > >  #define TARGET_CAN_CHANGE_MODE_CLASS ix86_can_change_mode_class
> > >
> > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > > index 53d503fc6e0..919d0b2418a 100644
> > > --- a/gcc/config/i386/i386.h
> > > +++ b/gcc/config/i386/i386.h
> > > @@ -2659,10 +2659,6 @@ struct GTY(()) machine_function {
> > >    /* True if the function needs a stack frame.  */
> > >    BOOL_BITFIELD stack_frame_required : 1;
> > >
> > > -  /* True if __builtin_ia32_vzeroupper () has been expanded in current
> > > -     function.  */
> > > -  BOOL_BITFIELD has_explicit_vzeroupper : 1;
> > > -
> > >    /* True if we should act silently, rather than raise an error for
> > >       invalid calls.  */
> > >    BOOL_BITFIELD silent_p : 1;
> > > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> > > index 2fc8fae30f3..5d9f5aa39ac 100644
> > > --- a/gcc/config/i386/i386.md
> > > +++ b/gcc/config/i386/i386.md
> > > @@ -191,6 +191,10 @@ (define_c_enum "unspec" [
> > >    ;; For MOVDIRI and MOVDIR64B support
> > >    UNSPEC_MOVDIRI
> > >    UNSPEC_MOVDIR64B
> > > +
> > > +  ;; For insn_callee_abi:
> > > +  UNSPEC_CALLEE_ABI
> > > +
> > >  ])
> > >
> > >  (define_c_enum "unspecv" [
> > > @@ -447,6 +451,12 @@ (define_constants
> > >     (FIRST_PSEUDO_REG           76)
> > >    ])
> > >
> > > +;; Insn callee abi index.
> > > +(define_constants
> > > +  [(I386_DEFAULT       0)
> > > +   (I386_VZEROUPPER    1)
> > > +   (I386_UNKNOWN       2)])
> >
> > Please name these ABI_DEFAULT, ABI_VZEROUPPER and ABI_UNKNOWN.
> Yes, thanks for the review.
> I'll hold this patch until middle-end part is approved.
> >
> > > +
> > >  ;; Insns whose names begin with "x86_" are emitted by gen_FOO calls
> > >  ;; from i386.c.
> > >
> > > diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> > > index abd307ebdb8..8b787553f32 100644
> > > --- a/gcc/config/i386/predicates.md
> > > +++ b/gcc/config/i386/predicates.md
> > > @@ -1599,8 +1599,9 @@ (define_predicate "vzeroall_pattern"
> > >  ;; return true if OP is a vzeroupper pattern.
> > >  (define_predicate "vzeroupper_pattern"
> > >    (and (match_code "parallel")
> > > -       (match_code "unspec_volatile" "a")
> > > -       (match_test "XINT (XVECEXP (op, 0, 0), 1) == UNSPECV_VZEROUPPER")))
> > > +       (match_code "unspec" "b")
> > > +       (match_test "XINT (XVECEXP (op, 0, 1), 1) == UNSPEC_CALLEE_ABI")
> > > +       (match_test "INTVAL (XVECEXP (XVECEXP (op, 0, 1), 0, 0)) == I386_VZEROUPPER")))
> > >
> > >  ;; Return true if OP is an addsub vec_merge operation
> > >  (define_predicate "addsub_vm_operator"
> > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > index a4503ddcb73..949347a3247 100644
> > > --- a/gcc/config/i386/sse.md
> > > +++ b/gcc/config/i386/sse.md
> > > @@ -205,7 +205,6 @@ (define_c_enum "unspecv" [
> > >    UNSPECV_MONITOR
> > >    UNSPECV_MWAIT
> > >    UNSPECV_VZEROALL
> > > -  UNSPECV_VZEROUPPER
> > >
> > >    ;; For KEYLOCKER
> > >    UNSPECV_LOADIWKEY
> > > @@ -20857,14 +20856,22 @@ (define_insn "*avx_vzeroall"
> > >  ;; if the upper 128bits are unused.  Initially we expand the instructions
> > >  ;; as though they had no effect on the SSE registers, but later add SETs and
> > >  ;; CLOBBERs to the PARALLEL to model the real effect.
> > > +
> > >  (define_expand "avx_vzeroupper"
> > > -  [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> > > -  "TARGET_AVX")
> > > +  [(parallel [(call (mem:QI (const_int 0))
> > > +                   (const_int 0))
> > > +            (unspec [(const_int I386_VZEROUPPER)] UNSPEC_CALLEE_ABI)])]
> > > +  "TARGET_AVX"
> > > +{
> > > +  ix86_expand_avx_vzeroupper ();
> > > +  DONE;
> > > +})
> > >
> > > -(define_insn "*avx_vzeroupper"
> > > -  [(match_parallel 0 "vzeroupper_pattern"
> > > -     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> > > -  "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 16 : 8) + 1"
> > > +(define_insn "avx_vzeroupper_callee_abi"
> > > +  [(call (mem:QI (const_int 0))
> > > +        (const_int 0))
> > > +    (unspec [(const_int I386_VZEROUPPER)] UNSPEC_CALLEE_ABI)]
> > > +  "TARGET_AVX"
> > >    "vzeroupper"
> > >    [(set_attr "type" "sse")
> > >     (set_attr "modrm" "0")
> > > @@ -20873,44 +20880,6 @@ (define_insn "*avx_vzeroupper"
> > >     (set_attr "btver2_decode" "vector")
> > >     (set_attr "mode" "OI")])
> > >
> > > -(define_insn_and_split "*avx_vzeroupper_1"
> > > -  [(match_parallel 0 "vzeroupper_pattern"
> > > -     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> > > -  "TARGET_AVX && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1"
> > > -  "#"
> > > -  "&& epilogue_completed"
> > > -  [(match_dup 0)]
> > > -{
> > > -  /* For IPA-RA purposes, make it clear the instruction clobbers
> > > -     even XMM registers not mentioned explicitly in the pattern.  */
> > > -  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> > > -  unsigned int npats = XVECLEN (operands[0], 0);
> > > -  rtvec vec = rtvec_alloc (nregs + 1);
> > > -  RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0);
> > > -  for (unsigned int i = 0, j = 1; i < nregs; ++i)
> > > -    {
> > > -      unsigned int regno = GET_SSE_REGNO (i);
> > > -      if (j < npats
> > > -         && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno)
> > > -       {
> > > -         RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j);
> > > -         j++;
> > > -       }
> > > -      else
> > > -       {
> > > -         rtx reg = gen_rtx_REG (V2DImode, regno);
> > > -         RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
> > > -       }
> > > -    }
> > > -  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
> > > -}
> > > -  [(set_attr "type" "sse")
> > > -   (set_attr "modrm" "0")
> > > -   (set_attr "memory" "none")
> > > -   (set_attr "prefix" "vex")
> > > -   (set_attr "btver2_decode" "vector")
> > > -   (set_attr "mode" "OI")])
> > > -
> > >  (define_mode_attr pbroadcast_evex_isa
> > >    [(V64QI "avx512bw") (V32QI "avx512bw") (V16QI "avx512bw")
> > >     (V32HI "avx512bw") (V16HI "avx512bw") (V8HI "avx512bw")
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr82735-1.c b/gcc/testsuite/gcc.target/i386/pr82735-1.c
> > > new file mode 100644
> > > index 00000000000..1a63b9ae9c9
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr82735-1.c
> > > @@ -0,0 +1,29 @@
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O2 -mavx" } */
> > > +/* { dg-require-effective-target avx } */
> > > +
> > > +#include "avx-check.h"
> > > +
> > > +void
> > > +__attribute__ ((noipa))
> > > +mtest(char *dest)
> > > +{
> > > +  __m256i ymm1 = _mm256_set1_epi8((char)0x1);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32), ymm1);
> > > +  _mm256_zeroupper();
> > > +  __m256i ymm2 = _mm256_set1_epi8((char)0x1);
> > > +  _mm256_storeu_si256((__m256i *)dest, ymm2);
> > > +}
> > > +
> > > +void
> > > +avx_test ()
> > > +{
> > > +  char buf[64];
> > > +  for (int i = 0; i != 64; i++)
> > > +    buf[i] = 2;
> > > +  mtest (buf);
> > > +
> > > +  for (int i = 0; i < 32; ++i)
> > > +    if (buf[i] != 1)
> > > +      __builtin_abort ();
> > > +}
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr82735-2.c b/gcc/testsuite/gcc.target/i386/pr82735-2.c
> > > new file mode 100644
> > > index 00000000000..ac9d006f794
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr82735-2.c
> > > @@ -0,0 +1,22 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-mavx -O2" } */
> > > +
> > > +#include <immintrin.h>
> > > +
> > > +void test(char *dest)
> > > +{
> > > +  /* xmm1 can be propagated to xmm2 by CSE.  */
> > > +  __m128i xmm1 = _mm_set_epi8(0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
> > > +                             0x9, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16);
> > > +  _mm_storeu_si128((__m128i *)(dest + 32), xmm1);
> > > +  _mm256_zeroupper();
> > > +  _mm256_zeroupper();
> > > +  _mm256_zeroupper();
> > > +  _mm256_zeroupper();
> > > +  _mm256_zeroupper();
> > > +  __m128i xmm2 = xmm1;
> > > +  _mm_storeu_si128((__m128i *)dest, xmm2);
> > > +}
> > > +
> > > +/* Darwin local constant symbol is "lC0", ELF targets ".LC0" */
> > > +/* { dg-final { scan-assembler-times {(?n)vmovdqa\t\.?[Ll]C0[^,]*, %xmm[0-9]} 1 } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr82735-3.c b/gcc/testsuite/gcc.target/i386/pr82735-3.c
> > > new file mode 100644
> > > index 00000000000..e3f801e6924
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr82735-3.c
> > > @@ -0,0 +1,5 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-mavx -O2 -mabi=ms" } */
> > > +/* { dg-final { scan-assembler-not {(?n)xmm([6-9]|1[0-5])} } } */
> > > +
> > > +#include "pr82735-2.c"
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr82735-4.c b/gcc/testsuite/gcc.target/i386/pr82735-4.c
> > > new file mode 100644
> > > index 00000000000..78c0a6cb2c8
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr82735-4.c
> > > @@ -0,0 +1,48 @@
> > > +/* { dg-do compile { target { ! ia32 } } }  */
> > > +/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
> > > +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
> > > +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
> > > +
> > > +#include <immintrin.h>
> > > +
> > > +void test(char *dest)
> > > +{
> > > +  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
> > > +  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
> > > +               "vmovdqa\t%%ymm0, %1\n\t"
> > > +               "vmovdqa\t%%ymm0, %2\n\t"
> > > +               "vmovdqa\t%%ymm0, %3\n\t"
> > > +               "vmovdqa\t%%ymm0, %4\n\t"
> > > +               "vmovdqa\t%%ymm0, %5\n\t"
> > > +               "vmovdqa\t%%ymm0, %6\n\t"
> > > +               "vmovdqa\t%%ymm0, %7\n\t"
> > > +               "vmovdqa\t%%ymm0, %8\n\t"
> > > +               "vmovdqa\t%%ymm0, %9\n\t"
> > > +               "vmovdqa\t%%ymm0, %10\n\t"
> > > +               "vmovdqa\t%%ymm0, %11\n\t"
> > > +               "vmovdqa\t%%ymm0, %12\n\t"
> > > +               "vmovdqa\t%%ymm0, %13\n\t"
> > > +               "vmovdqa\t%%ymm0, %14\n\t"
> > > +               "vmovdqa\t%%ymm0, %15\n\t"
> > > +               : "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
> > > +                 "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
> > > +                 "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
> > > +                 "=v"(ymm0)
> > > +               ::);
> > > +  _mm256_zeroupper();
> > > +  _mm256_storeu_si256((__m256i *)dest, ymm1);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
> > > +}
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr82735-5.c b/gcc/testsuite/gcc.target/i386/pr82735-5.c
> > > new file mode 100644
> > > index 00000000000..2a58cbe52d0
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr82735-5.c
> > > @@ -0,0 +1,54 @@
> > > +/* { dg-do compile { target { ! ia32 } } }  */
> > > +/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
> > > +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
> > > +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
> > > +
> > > +#include <immintrin.h>
> > > +
> > > +void test(char *dest)
> > > +{
> > > +  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
> > > +  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
> > > +               "vmovdqa\t%%ymm0, %1\n\t"
> > > +               "vmovdqa\t%%ymm0, %2\n\t"
> > > +               "vmovdqa\t%%ymm0, %3\n\t"
> > > +               "vmovdqa\t%%ymm0, %4\n\t"
> > > +               "vmovdqa\t%%ymm0, %5\n\t"
> > > +               "vmovdqa\t%%ymm0, %6\n\t"
> > > +               "vmovdqa\t%%ymm0, %7\n\t"
> > > +               "vmovdqa\t%%ymm0, %8\n\t"
> > > +               "vmovdqa\t%%ymm0, %9\n\t"
> > > +               "vmovdqa\t%%ymm0, %10\n\t"
> > > +               "vmovdqa\t%%ymm0, %11\n\t"
> > > +               "vmovdqa\t%%ymm0, %12\n\t"
> > > +               "vmovdqa\t%%ymm0, %13\n\t"
> > > +               "vmovdqa\t%%ymm0, %14\n\t"
> > > +               "vmovdqa\t%%ymm0, %15\n\t"
> > > +               : "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
> > > +                 "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
> > > +                 "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
> > > +                 "=v"(ymm0)
> > > +               ::);
> > > +  _mm256_zeroupper();
> > > +  _mm256_zeroupper();
> > > +  _mm256_zeroupper();
> > > +  _mm256_zeroupper();
> > > +  _mm256_zeroupper();
> > > +  _mm256_zeroupper();
> > > +  _mm256_zeroupper();
> > > +  _mm256_storeu_si256((__m256i *)dest, ymm1);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
> > > +  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
> > > +}
> > > --
> > > 2.18.1
> > >

Since -mabi=ms isn't compatible with x32, skip pr82735-[345].c for x32.
I am checking it into the master branch.

-- 
H.J.

[-- Attachment #2: 0001-x86-Don-t-compile-pr82735-345-.c-for-x32.patch --]
[-- Type: text/x-patch, Size: 2363 bytes --]

From 5e646e1c2ebc8339260a2a44e99f7f04446e9120 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Mon, 7 Jun 2021 11:43:25 -0700
Subject: [PATCH] x86: Don't compile pr82735-[345].c for x32

Since -mabi=ms isn't compatible with x32, skip pr82735-[345].c for x32.

	PR target/82735
	* gcc.target/i386/pr82735-3.c: Don't compile for x32.
	* gcc.target/i386/pr82735-4.c: Likewise.
	* gcc.target/i386/pr82735-5.c: Likewise.
---
 gcc/testsuite/gcc.target/i386/pr82735-3.c | 2 +-
 gcc/testsuite/gcc.target/i386/pr82735-4.c | 2 +-
 gcc/testsuite/gcc.target/i386/pr82735-5.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/testsuite/gcc.target/i386/pr82735-3.c b/gcc/testsuite/gcc.target/i386/pr82735-3.c
index e3f801e6924..854087d81a0 100644
--- a/gcc/testsuite/gcc.target/i386/pr82735-3.c
+++ b/gcc/testsuite/gcc.target/i386/pr82735-3.c
@@ -1,4 +1,4 @@
-/* { dg-do compile } */
+/* { dg-do compile { target { ! x32 } } } */
 /* { dg-options "-mavx -O2 -mabi=ms" } */
 /* { dg-final { scan-assembler-not {(?n)xmm([6-9]|1[0-5])} } } */
 
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-4.c b/gcc/testsuite/gcc.target/i386/pr82735-4.c
index 78c0a6cb2c8..2218ffc08ab 100644
--- a/gcc/testsuite/gcc.target/i386/pr82735-4.c
+++ b/gcc/testsuite/gcc.target/i386/pr82735-4.c
@@ -1,4 +1,4 @@
-/* { dg-do compile { target { ! ia32 } } }  */
+/* { dg-do compile { target { lp64 } } }  */
 /* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
 /* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
 /* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-5.c b/gcc/testsuite/gcc.target/i386/pr82735-5.c
index 2a58cbe52d0..a9ef05f4e7b 100644
--- a/gcc/testsuite/gcc.target/i386/pr82735-5.c
+++ b/gcc/testsuite/gcc.target/i386/pr82735-5.c
@@ -1,4 +1,4 @@
-/* { dg-do compile { target { ! ia32 } } }  */
+/* { dg-do compile { target { lp64 } } }  */
 /* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
 /* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
 /* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
-- 
2.31.1



* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-06-03  6:54                               ` [PATCH 1/2] CALL_INSN may not be a real function call liuhongt
                                                   ` (2 preceding siblings ...)
  2021-06-04  7:50                                 ` Jakub Jelinek
@ 2021-07-05 23:30                                 ` Segher Boessenkool
  2021-07-06  0:03                                   ` Jeff Law
  2021-07-06  1:37                                   ` Hongtao Liu
  3 siblings, 2 replies; 45+ messages in thread
From: Segher Boessenkool @ 2021-07-05 23:30 UTC (permalink / raw)
  To: liuhongt; +Cc: gcc-patches, jakub, richard.sandiford, ubizjak

Hi!

I ran into this in shrink-wrap.c today.
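For reference, the check in question looks roughly like this (a simplified
sketch based on the patch; the helper name is illustrative, and the real
requires_stack_frame_p also inspects the registers an insn uses and sets):

/* Sketch: with the patch, a CALL_INSN marked FAKE_CALL_P no longer
   forces a stack frame during shrink-wrapping.  */
static bool
insn_requires_stack_frame_p (rtx_insn *insn)
{
  if (!NONDEBUG_INSN_P (insn))
    return false;
  if (CALL_P (insn))
    /* A "fake" call such as vzeroupper is exempted here.  */
    return !FAKE_CALL_P (insn);
  /* The register-usage checks of the real function are omitted.  */
  return false;
}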

On Thu, Jun 03, 2021 at 02:54:07PM +0800, liuhongt via Gcc-patches wrote:
> Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a
> fake call, it won't have its own function stack.

Could you document somewhere what a "fake call" *is*?  Including what
that means to RTL, how this is expected to be used, etc.?  In rtl.h is
fine with me, but as it is, no one can know when to use this.  What does
"its own function stack" mean in the description here?  You can only put
FAKE_CALL on functions that do not have a stack frame?  But that is
never true on x86, so that cannot be it, unless there isn't a call
instruction at all?  But then, why use an RTL call insn for this?

Other targets simply do not use an RTL "call" when they want to hide
such an instruction, why can't you do that here, wouldn't that work much
better?  There are many more insns that you may want to hide.  The
traditional solution is to use unspecs, which very directly hides all
details.


Segher


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-05 23:30                                 ` Segher Boessenkool
@ 2021-07-06  0:03                                   ` Jeff Law
  2021-07-06  1:49                                     ` Hongtao Liu
  2021-07-07 14:55                                     ` Segher Boessenkool
  2021-07-06  1:37                                   ` Hongtao Liu
  1 sibling, 2 replies; 45+ messages in thread
From: Jeff Law @ 2021-07-06  0:03 UTC (permalink / raw)
  To: Segher Boessenkool, liuhongt; +Cc: jakub, richard.sandiford, gcc-patches



On 7/5/2021 5:30 PM, Segher Boessenkool wrote:
> Hi!
>
> I ran into this in shrink-wrap.c today.
>
> On Thu, Jun 03, 2021 at 02:54:07PM +0800, liuhongt via Gcc-patches wrote:
>> Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a
>> fake call, it won't have its own function stack.
> Could you document somewhere what a "fake call" *is*?  Including what
> that means to RTL, how this is expected to be used, etc.?  In rtl.h is
> fine with me, but as it is, no one can know when to use this.  What does
> "its own function stack" mean in the description here?  You can only put
> FAKE_CALL on functions that do not have a stack frame?  But that is
> never true on x86, so that cannot be it, unless there isn't a call
> instruction at all?  But then, why use an RTL call insn for this?
>
> Other targets simply do not use an RTL "call" when they want to hide
> such an instruction, why can't you do that here, wouldn't that work much
> better?  There are many more insns that you may want to hide.  The
> traditional solution is to use unspecs, which very directly hides all
> details.
It reminds me a bit of millicode calls on the PA or calls to special 
routines in libgcc.  They're calls to functions, but those functions do 
not follow the standard ABI.  I'd like to remove 
INSN_REFERENCES_ARE_DELAYED and instead use the new fake call mechanism, 
but I haven't tried it or even looked at the fake call bits enough to 
know if that's possible.

jeff


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-05 23:30                                 ` Segher Boessenkool
  2021-07-06  0:03                                   ` Jeff Law
@ 2021-07-06  1:37                                   ` Hongtao Liu
  2021-07-07  2:44                                     ` Hongtao Liu
  1 sibling, 1 reply; 45+ messages in thread
From: Hongtao Liu @ 2021-07-06  1:37 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: liuhongt, Jakub Jelinek, Richard Sandiford, GCC Patches

On Tue, Jul 6, 2021 at 7:31 AM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> Hi!
>
> I ran into this in shrink-wrap.c today.
>
> On Thu, Jun 03, 2021 at 02:54:07PM +0800, liuhongt via Gcc-patches wrote:
> > Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a
> > fake call, it won't have its own function stack.
>
> Could you document somewhere what a "fake call" *is*?  Including what
> that means to RTL, how this is expected to be used, etc.?  In rtl.h is
A fake call is used for TARGET_INSN_CALLEE_ABI; I'll add comments for
#define FAKE_CALL_P(RTX) in rtl.h.
> fine with me, but as it is, no one can know when to use this.  What does
> "its own function stack" mean in the description here?  You can only put
> FAKE_CALL on functions that do not have a stack frame?  But that is
> never true on x86, so that cannot be it, unless there isn't a call
> instruction at all?  But then, why use an RTL call insn for this?
>
> Other targets simply do not use an RTL "call" when they want to hide
> such an instruction, why can't you do that here, wouldn't that work much
> better?  There are many more insns that you may want to hide.  The
> traditional solution is to use unspecs, which very directly hides all
> details.

It's explained here:
> >> Yeah.  Initially clobber_high seemed like the best appraoch for
> >> handling the tlsdesc thing, but in practice it was too difficult
> >> to shoe-horn the concept in after the fact, when so much rtl
> >> infrastructure wasn't prepared to deal with it.  The old support
> >> didn't handle all cases and passes correctly, and handled others
> >> suboptimally.
> >>
> >> I think it would be worth using the same approach as
> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
> >> vzeroupper: represent the instructions as call_insns in which the
> >> call has a special vzeroupper ABI.  I think that's likely to lead
> >> to better code than clobber_high would (or at least, it did for tlsdesc).

refer to [1] for more details
[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570265.html
>
>
> Segher



-- 
BR,
Hongtao


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-06  0:03                                   ` Jeff Law
@ 2021-07-06  1:49                                     ` Hongtao Liu
  2021-07-07 14:55                                     ` Segher Boessenkool
  1 sibling, 0 replies; 45+ messages in thread
From: Hongtao Liu @ 2021-07-06  1:49 UTC (permalink / raw)
  To: Jeff Law
  Cc: Segher Boessenkool, liuhongt, Jakub Jelinek, Richard Sandiford,
	GCC Patches

On Tue, Jul 6, 2021 at 8:03 AM Jeff Law via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
>
>
> On 7/5/2021 5:30 PM, Segher Boessenkool wrote:
> > Hi!
> >
> > I ran into this in shrink-wrap.c today.
> >
> > On Thu, Jun 03, 2021 at 02:54:07PM +0800, liuhongt via Gcc-patches wrote:
> >> Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a
> >> fake call, it won't have its own function stack.
> > Could you document somewhere what a "fake call" *is*?  Including what
> > that means to RTL, how this is expected to be used, etc.?  In rtl.h is
> > fine with me, but as it is, no one can know when to use this.  What does
> > "its own function stack" mean in the description here?  You can only put
> > FAKE_CALL on functions that do not have a stack frame?  But that is
> > never true on x86, so that cannot be it, unless there isn't a call
> > instruction at all?  But then, why use an RTL call insn for this?
> >
> > Other targets simply do not use an RTL "call" when they want to hide
> > such an instruction, why can't you do that here, wouldn't that work much
> > better?  There are many more insns that you may want to hide.  The
> > traditional solution is to use unspecs, which very directly hides all
> > details.
> It reminds me a bit of millicode calls on the PA or calls to special
> routines in libgcc.  They're calls to functions, but those functions do
> not follow the standard ABI.  I'd like to remove
> INSN_REFERENCES_ARE_DELAYED and instead use the new fake call mechanism,
> but I haven't tried it or even looked at the fake call bits enough to
> know if that's possible.
A fake call is used for TARGET_INSN_CALLEE_ABI, which in turn is used for
vzeroupper on i386.
vzeroupper clobbers the high parts of the ymm registers but leaves the low
parts unchanged; defining it as a call_insn with a special callee ABI lets
RA/CSE know that this instruction kills the high parts of the ymm
registers while still being able to optimize with the low parts.
I didn't handle FAKE_CALL_P thoroughly in the RTL, but only changed the
necessary parts so that my patch survives the regression tests (this also
fixed some optimization issues I observed).
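The marking itself is tiny; roughly this, following the patch (a sketch,
the exact committed code may differ slightly):

/* Sketch of the expander from the patch: emit vzeroupper as a
   call_insn with the special callee ABI, then mark it as a fake
   call so later passes know it is not a real control transfer.  */
void
ix86_expand_avx_vzeroupper (void)
{
  rtx_insn *insn = emit_call_insn (gen_avx_vzeroupper_callee_abi ());
  FAKE_CALL_P (insn) = 1;
}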
> jeff



-- 
BR,
Hongtao


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-06  1:37                                   ` Hongtao Liu
@ 2021-07-07  2:44                                     ` Hongtao Liu
  2021-07-07  8:15                                       ` Richard Biener
  0 siblings, 1 reply; 45+ messages in thread
From: Hongtao Liu @ 2021-07-07  2:44 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: liuhongt, Jakub Jelinek, Richard Sandiford, GCC Patches

On Tue, Jul 6, 2021 at 9:37 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Tue, Jul 6, 2021 at 7:31 AM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> >
> > Hi!
> >
> > I ran into this in shrink-wrap.c today.
> >
> > On Thu, Jun 03, 2021 at 02:54:07PM +0800, liuhongt via Gcc-patches wrote:
> > > Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a
> > > fake call, it won't have its own function stack.
> >
> > Could you document somewhere what a "fake call" *is*?  Including what
> > that means to RTL, how this is expected to be used, etc.?  In rtl.h is
> A fake call is used for TARGET_INSN_CALLEE_ABI; I'll add comments for
> #define FAKE_CALL_P(RTX) in rtl.h.


Here's the patch I'm going to check in.

    Document FAKE_CALL_P in comments.

    gcc/ChangeLog:

            * rtl.h (FAKE_CALL_P): Add comments for FAKE_CALL_P.

diff --git a/gcc/rtl.h b/gcc/rtl.h
index 5ed0d6dd6fa..9afc60f08d8 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -840,7 +840,13 @@ struct GTY(()) rtvec_def {
 #define CALL_P(X) (GET_CODE (X) == CALL_INSN)

 /* 1 if RTX is a call_insn for a fake call.
-   CALL_INSN use "used" flag to indicate it's a fake call.  */
+   CALL_INSN use "used" flag to indicate it's a fake call.
+   Used by the x86 vzeroupper instruction,
+   in order to solve the problem of partial clobber registers,
+   vzeroupper is defined as a call_insn with a special callee_abi,
+   but it is not a real call and therefore has no function stack
+   of its own.
+   NB: FAKE_CALL_P is not handled thoroughly in the RTL.  */
 #define FAKE_CALL_P(RTX)                                        \
   (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)




-- 
BR,
Hongtao


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-07  2:44                                     ` Hongtao Liu
@ 2021-07-07  8:15                                       ` Richard Biener
  2021-07-07 14:52                                         ` Segher Boessenkool
  2021-07-07 15:52                                         ` Hongtao Liu
  0 siblings, 2 replies; 45+ messages in thread
From: Richard Biener @ 2021-07-07  8:15 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: Segher Boessenkool, Jakub Jelinek, Richard Sandiford, liuhongt,
	GCC Patches

On Wed, Jul 7, 2021 at 4:40 AM Hongtao Liu via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Tue, Jul 6, 2021 at 9:37 AM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Tue, Jul 6, 2021 at 7:31 AM Segher Boessenkool
> > <segher@kernel.crashing.org> wrote:
> > >
> > > Hi!
> > >
> > > I ran into this in shrink-wrap.c today.
> > >
> > > On Thu, Jun 03, 2021 at 02:54:07PM +0800, liuhongt via Gcc-patches wrote:
> > > > Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a
> > > > fake call, it won't have its own function stack.
> > >
> > > Could you document somewhere what a "fake call" *is*?  Including what
> > > that means to RTL, how this is expected to be used, etc.?  In rtl.h is
> > A fake call is used for TARGET_INSN_CALLEE_ABI; I'll add comments for
> > #define FAKE_CALL_P(RTX) in rtl.h.
>
>
> Here's the patch I'm going to check in.
>
>     Document FAKE_CALL_P in comments.
>
>     gcc/ChangeLog:
>
>             * rtl.h (FAKE_CALL_P): Add comments for FAKE_CALL_P.
>
> diff --git a/gcc/rtl.h b/gcc/rtl.h
> index 5ed0d6dd6fa..9afc60f08d8 100644
> --- a/gcc/rtl.h
> +++ b/gcc/rtl.h
> @@ -840,7 +840,13 @@ struct GTY(()) rtvec_def {
>  #define CALL_P(X) (GET_CODE (X) == CALL_INSN)
>
>  /* 1 if RTX is a call_insn for a fake call.
> -   CALL_INSN use "used" flag to indicate it's a fake call.  */
> +   CALL_INSN use "used" flag to indicate it's a fake call.
> +   Used by the x86 vzeroupper instruction,
> +   in order to solve the problem of partial clobber registers,
> +   vzeroupper is defined as a call_insn with a special callee_abi,
> +   but it is not a real call and therefore has no function stack
> +   of its own.

I think that's a bit vague - you could then say a sibling or tail call
to a function
that doesn't set up a stack frame is fake as well?  Maybe

 "CALL_INSN use "used" flag to indicate the instruction
  does not transfer control."

thus that this call is not affecting regular control flow? (it might
eventually still trap and thus cause non-call EH?)

Not sure if "no function stack of its own" is a good constraint,
vzeroupper does not perform any call or jump.

> +   NB: FAKE_CALL_P is not handled thoroughly in the RTL.  */
>  #define FAKE_CALL_P(RTX)                                        \
>    (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)
>
>
>
>
> --
> BR,
> Hongtao


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-07  8:15                                       ` Richard Biener
@ 2021-07-07 14:52                                         ` Segher Boessenkool
  2021-07-07 15:23                                           ` Hongtao Liu
  2021-07-07 15:32                                           ` Hongtao Liu
  2021-07-07 15:52                                         ` Hongtao Liu
  1 sibling, 2 replies; 45+ messages in thread
From: Segher Boessenkool @ 2021-07-07 14:52 UTC (permalink / raw)
  To: Richard Biener
  Cc: Hongtao Liu, Jakub Jelinek, Richard Sandiford, liuhongt, GCC Patches

Hi!

On Wed, Jul 07, 2021 at 10:15:08AM +0200, Richard Biener wrote:
> On Wed, Jul 7, 2021 at 4:40 AM Hongtao Liu via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> > On Tue, Jul 6, 2021 at 9:37 AM Hongtao Liu <crazylht@gmail.com> wrote:
> > > On Tue, Jul 6, 2021 at 7:31 AM Segher Boessenkool
> > > <segher@kernel.crashing.org> wrote:
> > > > I ran into this in shrink-wrap.c today.
> > > >
> > > > On Thu, Jun 03, 2021 at 02:54:07PM +0800, liuhongt via Gcc-patches wrote:
> > > > > Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a
> > > > > fake call, it won't have its own function stack.
> > > >
> > > > Could you document somewhere what a "fake call" *is*?  Including what
> > > > that means to RTL, how this is expected to be used, etc.?  In rtl.h is
> > > A fake call is used for TARGET_INSN_CALLEE_ABI; I'll add comments for
> > > #define FAKE_CALL_P(RTX) in rtl.h.
> >
> >
> > Here's the patch I'm going to check in.

Which doesn't do any of the things I asked for :-(  It doesn't say what
a "fake call" is, it doesn't say what its semantics are, it doesn't say
how it is expected to be used.

So, a "FAKE_CALL" is very much a *real* call, on the RTL level, which is
where we are here.  But you want it to be treated differently because it
will eventually be replaced by different insns.

This causes all kinds of unrelated code to need confusing changes, made
much worse because the name "FAKE_CALL" is the opposite of what it does.

As long as your description of it only says how it is (ab)used in one
case, I will call it a hack, and a gross hack at that.


> > --- a/gcc/rtl.h
> > +++ b/gcc/rtl.h
> > @@ -840,7 +840,13 @@ struct GTY(()) rtvec_def {
> >  #define CALL_P(X) (GET_CODE (X) == CALL_INSN)
> >
> >  /* 1 if RTX is a call_insn for a fake call.
> > -   CALL_INSN use "used" flag to indicate it's a fake call.  */
> > +   CALL_INSN use "used" flag to indicate it's a fake call.
> > +   Used by the x86 vzeroupper instruction,
> > +   in order to solve the problem of partial clobber registers,
> > +   vzeroupper is defined as a call_insn with a special callee_abi,
> > +   but it is not a real call and therefore has no function stack
> > +   of its own.

So because of this one thing (you need to insert partial clobbers) you
force all kinds of unrelated code to have changes, namely, code that
needs to do something with calls, but now you do not want to have that
done on some calls because you promise that call will disappear
eventually, and it cannot cause any problems in the mean time?

I am not convinced.  This is not design, this is a terrible hack, this
is the opposite direction we should go in.

> that doesn't set up a stack frame is fake as well?  Maybe
> 
>  "CALL_INSN use "used" flag to indicate the instruction
>   does not transfer control."
> 
> thus that this call is not affecting regular control flow? (it might
> eventually still trap and thus cause non-call EH?)

How it is used in shrink-wrap requires it to not have a stack frame (in
the compiler sense).

> Not sure if "no function stack of its own" is a good constraint,
> vzeroupper does not perform any call or jump.

Yeah.  This stuff needs a rethink.

What is wrong with just using an unspec and clobbers?
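For concreteness, that is essentially what the removed *avx_vzeroupper_1
splitter used to build; a sketch in expander form, reusing the old
UNSPECV_VZEROUPPER unspec from the quoted diff (illustrative only):

/* Sketch of the unspec-and-clobbers shape: a PARALLEL of an
   unspec_volatile plus a full CLOBBER of every SSE register.  */
static void
emit_vzeroupper_with_clobbers (void)
{
  unsigned int nregs = TARGET_64BIT ? 16 : 8;
  rtvec vec = rtvec_alloc (nregs + 1);
  RTVEC_ELT (vec, 0)
    = gen_rtx_UNSPEC_VOLATILE (VOIDmode, gen_rtvec (1, const0_rtx),
                               UNSPECV_VZEROUPPER);
  for (unsigned int i = 0; i < nregs; ++i)
    RTVEC_ELT (vec, i + 1)
      = gen_rtx_CLOBBER (VOIDmode,
                         gen_rtx_REG (V2DImode, GET_SSE_REGNO (i)));
  emit_insn (gen_rtx_PARALLEL (VOIDmode, vec));
}

A plain CLOBBER here covers the full 128-bit register, which is what the
reply below objects to.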


Segher


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-06  0:03                                   ` Jeff Law
  2021-07-06  1:49                                     ` Hongtao Liu
@ 2021-07-07 14:55                                     ` Segher Boessenkool
  2021-07-07 17:56                                       ` Jeff Law
  1 sibling, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2021-07-07 14:55 UTC (permalink / raw)
  To: Jeff Law; +Cc: liuhongt, jakub, richard.sandiford, gcc-patches

On Mon, Jul 05, 2021 at 06:03:21PM -0600, Jeff Law wrote:
> It reminds me a bit of millicode calls on the PA or calls to special 
> routines in libgcc.  They're calls to functions, but those functions do 
> not follow the standard ABI.

Something with CALL_INSN_FUNCTION_USAGE?  And maybe some clobbers?
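A sketch of that direction, attaching the effects to the call's usage list
with the existing use_reg/clobber_reg helpers (the wrapper name is
illustrative):

/* Sketch: record non-ABI register effects of a call on its
   CALL_INSN_FUNCTION_USAGE list instead of a special callee ABI.  */
static void
record_millicode_effects (rtx_insn *call_insn, rtx used_reg,
                          rtx clobbered_reg)
{
  use_reg (&CALL_INSN_FUNCTION_USAGE (call_insn), used_reg);
  clobber_reg (&CALL_INSN_FUNCTION_USAGE (call_insn), clobbered_reg);
}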


Segher


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-07 14:52                                         ` Segher Boessenkool
@ 2021-07-07 15:23                                           ` Hongtao Liu
  2021-07-07 23:42                                             ` Segher Boessenkool
  2021-07-07 15:32                                           ` Hongtao Liu
  1 sibling, 1 reply; 45+ messages in thread
From: Hongtao Liu @ 2021-07-07 15:23 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Richard Biener, Jakub Jelinek, Richard Sandiford, liuhongt, GCC Patches

On Wed, Jul 7, 2021 at 10:54 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> Hi!
>
> On Wed, Jul 07, 2021 at 10:15:08AM +0200, Richard Biener wrote:
> > On Wed, Jul 7, 2021 at 4:40 AM Hongtao Liu via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > > On Tue, Jul 6, 2021 at 9:37 AM Hongtao Liu <crazylht@gmail.com> wrote:
> > > > On Tue, Jul 6, 2021 at 7:31 AM Segher Boessenkool
> > > > <segher@kernel.crashing.org> wrote:
> > > > > I ran into this in shrink-wrap.c today.
> > > > >
> > > > > On Thu, Jun 03, 2021 at 02:54:07PM +0800, liuhongt via Gcc-patches wrote:
> > > > > > Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a
> > > > > > fake call, it won't have its own function stack.
> > > > >
> > > > > Could you document somewhere what a "fake call" *is*?  Including what
> > > > > that means to RTL, how this is expected to be used, etc.?  In rtl.h is
> > > > A fake call is used for TARGET_INSN_CALLEE_ABI; I'll add comments for
> > > > #define FAKE_CALL_P(RTX) in rtl.h.
> > >
> > >
> > > Here's the patch I'm going to check in.
>
> Which doesn't do any of the things I asked for :-(  It doesn't say what
> a "fake call" is, it doesn't say what its semantics are, it doesn't say
> how it is expected to be used.
>
> So, a "FAKE_CALL" is very much a *real* call, on the RTL level, which is
> where we are here.  But you want it to be treated differently because it
> will eventually be replaced by different insns.
>
> This causes all kinds of unrelated code to need confusing changes, made
> much worse because the name "FAKE_CALL" is the opposite of what it does.
>
> As long as your description of it only says how it is (ab)used in one
> case, I will call it a hack, and a gross hack at that.
>
>
> > > --- a/gcc/rtl.h
> > > +++ b/gcc/rtl.h
> > > @@ -840,7 +840,13 @@ struct GTY(()) rtvec_def {
> > >  #define CALL_P(X) (GET_CODE (X) == CALL_INSN)
> > >
> > >  /* 1 if RTX is a call_insn for a fake call.
> > > -   CALL_INSN use "used" flag to indicate it's a fake call.  */
> > > +   CALL_INSN use "used" flag to indicate it's a fake call.
> > > +   Used by the x86 vzeroupper instruction,
> > > +   in order to solve the problem of partial clobber registers,
> > > +   vzeroupper is defined as a call_insn with a special callee_abi,
> > > +   but it is not a real call and therefore has no function stack
> > > +   of its own.
>
> So because of this one thing (you need to insert partial clobbers) you
> force all kinds of unrelated code to have changes, namely, code that
> needs to do something with calls, but now you do not want to have that
> done on some calls because you promise that call will disappear
> eventually, and it cannot cause any problems in the mean time?
>
> I am not convinced.  This is not design, this is a terrible hack, this
> is the opposite direction we should go in.
>
> > that doesn't set up a stack frame is fake as well?  Maybe
> >
> >  "CALL_INSN use "used" flag to indicate the instruction
> >   does not transfer control."
> >
> > thus that this call is not affecting regular control flow? (it might
> > eventually still trap and thus cause non-call EH?)
>
> How it is used in shrink-wrap requires it to not have a stack frame (in
> the compiler sense).
>
> > Not sure if "no function stack of its own" is a good constraint,
> > vzeroupper does not perform any call or jump.
>
> Yeah.  This stuff needs a rethink.
>
> What is wrong with just using an unspec and clobbers?
>
It's a partial and **potential** clobber; if we add an unspec and clobbers
for the whole pack (8 or 16 xmm registers), it will force save/restore
of registers that aren't really needed in the function, especially for
the 64-bit MS ABI, where the lower 128 bits are preserved across function calls.
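With the callee-ABI approach, passes can instead ask mode by mode whether
a value survives the instruction. A sketch, assuming the function-abi.h
interface (the register choice is illustrative):

/* Sketch: under the vzeroupper ABI only modes wider than 16 bytes
   are clobbered, so a 128-bit value in an MS-ABI callee-saved xmm
   register survives and no save/restore is forced.  */
void
query_vzeroupper_clobbers (const rtx_insn *vzeroupper_insn)
{
  function_abi abi = insn_callee_abi (vzeroupper_insn);
  unsigned int regno = FIRST_SSE_REG + 6;  /* xmm6.  */
  /* The 256-bit view is clobbered: its high half dies...  */
  gcc_assert (abi.clobbers_reg_p (V8SFmode, regno));
  /* ...but the 128-bit view survives vzeroupper.  */
  gcc_assert (!abi.clobbers_reg_p (V4SFmode, regno));
}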
>
> Segher



-- 
BR,
Hongtao


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-07 14:52                                         ` Segher Boessenkool
  2021-07-07 15:23                                           ` Hongtao Liu
@ 2021-07-07 15:32                                           ` Hongtao Liu
  2021-07-07 23:54                                             ` Segher Boessenkool
  1 sibling, 1 reply; 45+ messages in thread
From: Hongtao Liu @ 2021-07-07 15:32 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Richard Biener, Jakub Jelinek, Richard Sandiford, liuhongt, GCC Patches

On Wed, Jul 7, 2021 at 10:54 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> Hi!
>
> On Wed, Jul 07, 2021 at 10:15:08AM +0200, Richard Biener wrote:
> > On Wed, Jul 7, 2021 at 4:40 AM Hongtao Liu via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > > On Tue, Jul 6, 2021 at 9:37 AM Hongtao Liu <crazylht@gmail.com> wrote:
> > > > On Tue, Jul 6, 2021 at 7:31 AM Segher Boessenkool
> > > > <segher@kernel.crashing.org> wrote:
> > > > > I ran into this in shrink-wrap.c today.
> > > > >
> > > > > On Thu, Jun 03, 2021 at 02:54:07PM +0800, liuhongt via Gcc-patches wrote:
> > > > > > Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a
> > > > > > fake call, it won't have its own function stack.
> > > > >
> > > > > Could you document somewhere what a "fake call" *is*?  Including what
> > > > > that means to RTL, how this is expected to be used, etc.?  In rtl.h is
> > > > A fake call is used for TARGET_INSN_CALLEE_ABI; I'll add comments for
> > > > #define FAKE_CALL_P(RTX) in rtl.h.
> > >
> > >
> > > Here's the patch I'm going to check in.
>
> Which doesn't do any of the things I asked for :-(  It doesn't say what
> a "fake call" is, it doesn't say what its semantics are, it doesn't say
> how it is expected to be used.
>
> So, a "FAKE_CALL" is very much a *real* call, on the RTL level, which is
> where we are here.  But you want it to be treated differently because it
> will eventually be replaced by different insns.
It's a CALL_INSN at the RTL level, but it's just a normal instruction:
it doesn't have a call stack, and it doesn't affect the control
flow.
>
> This causes all kinds of unrelated code to need confusing changes, made
> much worse because the name "FAKE_CALL" is the opposite of what it does.
>
> As long as your description of it only says how it is (ab)used in one
> case, I will call it a hack, and a gross hack at that.
>
>
> > > --- a/gcc/rtl.h
> > > +++ b/gcc/rtl.h
> > > @@ -840,7 +840,13 @@ struct GTY(()) rtvec_def {
> > >  #define CALL_P(X) (GET_CODE (X) == CALL_INSN)
> > >
> > >  /* 1 if RTX is a call_insn for a fake call.
> > > -   CALL_INSN use "used" flag to indicate it's a fake call.  */
> > > +   CALL_INSN use "used" flag to indicate it's a fake call.
> > > +   Used by the x86 vzeroupper instruction,
> > > +   in order to solve the problem of partial clobber registers,
> > > +   vzeroupper is defined as a call_insn with a special callee_abi,
> > > +   but it is not a real call and therefore has no function stack
> > > +   of its own.
>
> So because of this one thing (you need to insert partial clobbers) you
> force all kinds of unrelated code to have changes, namely, code that
> needs to do something with calls, but now you do not want to have that
> done on some calls because you promise that call will disappear
> eventually, and it cannot cause any problems in the mean time?
>
> I am not convinced.  This is not design, this is a terrible hack, this
> is the opposite direction we should go in.

Quote from  https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570634.html

> Also I grepped for CALL_P or CALL_INSN in the GCC sources; there are many
> places which hold the assumption that CALL_P/CALL_INSN is a real call.
> Considering that vzeroupper is used a lot on the i386 backend, I'm a
> bit worried that this implementation solution will be a bottomless
> pit.

Maybe, but I think the same is true for CLOBBER_HIGH.  If we have
a third alternative then we should consider it, but I think the
call approach is still going to be less problematic than CLOBBER_HIGH.

The main advantage of the call approach is that the CALL_P handling
is (mostly) conservatively correct and performance problems are just
a one-line change.  The CLOBBER_HIGH approach instead requires
changes to the way that passes track liveness information for
non-call instructions (so is much more than a one-line change).
Also, treating a CLOBBER_HIGH like a CLOBBER isn't conservatively
correct, because other code might be relying on part of the register
being preserved.

>
> > that doesn't set up a stack frame is fake as well?  Maybe
> >
> >  "CALL_INSN use "used" flag to indicate the instruction
> >   does not transfer control."
> >
> > thus that this call is not affecting regular control flow? (it might
> > eventually still trap and thus cause non-call EH?)
>
> How it is used in shrink-wrap requires it to not have a stack frame (in
> the compiler sense).
>
> > Not sure if "no function stack of its own" is a good constraint,
> > vzeroupper does not perform any call or jump.
>
> Yeah.  This stuff needs a rethink.
>
> What is wrong with just using an unspec and clobbers?
>
>
> Segher



-- 
BR,
Hongtao


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-07  8:15                                       ` Richard Biener
  2021-07-07 14:52                                         ` Segher Boessenkool
@ 2021-07-07 15:52                                         ` Hongtao Liu
  1 sibling, 0 replies; 45+ messages in thread
From: Hongtao Liu @ 2021-07-07 15:52 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, Jakub Jelinek, Richard Sandiford, liuhongt,
	GCC Patches

On Wed, Jul 7, 2021 at 4:15 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Jul 7, 2021 at 4:40 AM Hongtao Liu via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Tue, Jul 6, 2021 at 9:37 AM Hongtao Liu <crazylht@gmail.com> wrote:
> > >
> > > On Tue, Jul 6, 2021 at 7:31 AM Segher Boessenkool
> > > <segher@kernel.crashing.org> wrote:
> > > >
> > > > Hi!
> > > >
> > > > I ran into this in shrink-wrap.c today.
> > > >
> > > > On Thu, Jun 03, 2021 at 02:54:07PM +0800, liuhongt via Gcc-patches wrote:
> > > > > Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a
> > > > > fake call, it won't have its own function stack.
> > > >
> > > > Could you document somewhere what a "fake call" *is*?  Including what
> > > > that means to RTL, how this is expected to be used, etc.?  In rtl.h is
> > > A fake call is used for TARGET_INSN_CALLEE_ABI; I'll add comments for
> > > #define FAKE_CALL_P(RTX) in rtl.h.
> >
> >
> > Here's the patch I'm going to check in.
> >
> >     Document FAKE_CALL_P in comments.
> >
> >     gcc/ChangeLog:
> >
> >             * rtl.h (FAKE_CALL_P): Add comments for FAKE_CALL_P.
> >
> > diff --git a/gcc/rtl.h b/gcc/rtl.h
> > index 5ed0d6dd6fa..9afc60f08d8 100644
> > --- a/gcc/rtl.h
> > +++ b/gcc/rtl.h
> > @@ -840,7 +840,13 @@ struct GTY(()) rtvec_def {
> >  #define CALL_P(X) (GET_CODE (X) == CALL_INSN)
> >
> >  /* 1 if RTX is a call_insn for a fake call.
> > -   CALL_INSN use "used" flag to indicate it's a fake call.  */
> > +   CALL_INSN use "used" flag to indicate it's a fake call.
> > +   Used by the x86 vzeroupper instruction,
> > +   in order to solve the problem of partial clobber registers,
> > +   vzeroupper is defined as a call_insn with a special callee_abi,
> > +   but it is not a real call and therefore has no function stack
> > +   of its own.
>
> I think that's a bit vague - you could then say a sibling or tail call
> to a function
> that doesn't set up a stack frame is fake as well?  Maybe
>
>  "CALL_INSN use "used" flag to indicate the instruction
>   does not transfer control."
>
> thus that this call is not affecting regular control flow? (it might
> eventually still trap and thus cause non-call EH?)
>
> Not sure if "no function stack of its own" is a good constraint,
> vzeroupper does not perform any call or jump.
>
How about this?

 /* 1 if RTX is a call_insn for a fake call.
-   CALL_INSN use "used" flag to indicate it's a fake call.  */
+   CALL_INSN use "used" flag to indicate it's a fake call.
+   FIXME: it's only used by x86 vzeroupper to indicate it's
+   a normal instruction which doesn't have function stack
+   and doesn't impact control flow, and FAKE_CALL_P is not
+   handled thoroughly in the RTL, PR82735.  */

> > +   NB: FAKE_CALL_P is not handled thoroughly in the RTL.  */
> >  #define FAKE_CALL_P(RTX)                                        \
> >    (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)
> >
> >
> >
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-07 14:55                                     ` Segher Boessenkool
@ 2021-07-07 17:56                                       ` Jeff Law
  0 siblings, 0 replies; 45+ messages in thread
From: Jeff Law @ 2021-07-07 17:56 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: liuhongt, jakub, richard.sandiford, gcc-patches



On 7/7/2021 8:55 AM, Segher Boessenkool wrote:
> On Mon, Jul 05, 2021 at 06:03:21PM -0600, Jeff Law wrote:
>> It reminds me a bit of millicode calls on the PA or calls to special
>> routines in libgcc.  They're calls to functions, but those functions do
>> not follow the standard ABI.
> Something with CALL_INSN_FUNCTION_USAGE?  And maybe some clobbers?
I don't remember all the details on the PA side and the decision to 
express mul, div, mod as regular insns pre-dates my involvement in the 
PA port (hard to believe, but true).  I'd hazard a guess the goal behind 
making them regular insns was to not inhibit leaf function detection, 
avoid caller-saves around the "calls" and such. CALL_INSN_FUNCTION_USAGE 
didn't exist until the mid 90s.  I wouldn't be surprised if we could 
migrate the millicode calls to the CALL_INSN_FUNCTION_USAGE model.


jeff


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-07 15:23                                           ` Hongtao Liu
@ 2021-07-07 23:42                                             ` Segher Boessenkool
  2021-07-08  4:14                                               ` Hongtao Liu
  0 siblings, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2021-07-07 23:42 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: Richard Biener, Jakub Jelinek, Richard Sandiford, liuhongt, GCC Patches

On Wed, Jul 07, 2021 at 11:23:48PM +0800, Hongtao Liu wrote:
> On Wed, Jul 7, 2021 at 10:54 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:

[ snip some old stuff ]

> > Yeah.  This stuff needs a rethink.
> >
> > What is wrong with just using an unspec and clobbers?
> >
> > It's a partial and **potential** clobber;

All RTL "clobber" is always a potential clobber; it never guarantees the
existing value does not survive.  You can pass it through some unspec to
make this more explicit.  You will have to add some hook that CSE can
use to figure out what bits are conserved by your target-specific
construct, as you should have done in the first place.  This will be
much less work for you too, compared to actually checking if all
existing GCC code needs to check "FAKE_CALL_P" or not (instead of just
hoping it works now, as you do).


Segher


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-07 15:32                                           ` Hongtao Liu
@ 2021-07-07 23:54                                             ` Segher Boessenkool
  2021-07-09  7:20                                               ` Hongtao Liu
  0 siblings, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2021-07-07 23:54 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: Richard Biener, Jakub Jelinek, Richard Sandiford, liuhongt, GCC Patches

On Wed, Jul 07, 2021 at 11:32:59PM +0800, Hongtao Liu wrote:
> On Wed, Jul 7, 2021 at 10:54 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> > So, a "FAKE_CALL" is very much a *real* call, on the RTL level, which is
> > where we are here.  But you want it to be treated differently because it
> > will eventually be replaced by different insns.
> > It's a CALL_INSN at the RTL level, but it's just a normal instruction:
> > it doesn't have a call stack, and it doesn't affect the control
> > flow.

There is no such thing as "call stack" (whatever that may mean) to do
with the RTL "call" insn.  How the return address is stored (if at all)
is up to the target.  Many do not store the return address on the stack
(for example they have an RA or LR register for it).  Those that do
store it on a stack do not all change the stack pointer.

In RTL, it *does* change the control flow.  If you don't like that,
don't use a "call" insn.  You will have to update a *lot* more code
than you did, otherwise.

> > So because of this one thing (you need to insert partial clobbers) you
> > > force all kinds of unrelated code to have changes, namely, code that
> > > needs to do something with calls, but now you do not want to have that
> > > done on some calls because you promise that call will disappear
> > eventually, and it cannot cause any problems in the mean time?
> >
> > I am not convinced.  This is not design, this is a terrible hack, this
> > is the opposite direction we should go in.
> 
> Quote from  https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570634.html
> 
> > Also I grepped for CALL_P or CALL_INSN in the GCC sources; there are many
> > places which hold the assumption that CALL_P/CALL_INSN is a real call.
> > Considering that vzeroupper is used a lot on the i386 backend, I'm a
> > bit worried that this implementation solution will be a bottomless
> > pit.
> 
> Maybe, but I think the same is true for CLOBBER_HIGH.  If we have
> a third alternative then we should consider it, but I think the
> > call approach is still going to be less problematic than CLOBBER_HIGH.
> 
> The main advantage of the call approach is that the CALL_P handling
> is (mostly) conservatively correct and performance problems are just
> a one-line change.  The CLOBBER_HIGH approach instead requires
> changes to the way that passes track liveness information for
> non-call instructions (so is much more than a one-line change).
> Also, treating a CLOBBER_HIGH like a CLOBBER isn't conservatively
> correct, because other code might be relying on part of the register
> being preserved.

And this isn't a one-line change either, and it is only partial already,
and we don't know how deep the rabbit hole goes.


Segher


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-07 23:42                                             ` Segher Boessenkool
@ 2021-07-08  4:14                                               ` Hongtao Liu
  0 siblings, 0 replies; 45+ messages in thread
From: Hongtao Liu @ 2021-07-08  4:14 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Richard Biener, Jakub Jelinek, Richard Sandiford, liuhongt, GCC Patches

On Thu, Jul 8, 2021 at 7:44 AM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Wed, Jul 07, 2021 at 11:23:48PM +0800, Hongtao Liu wrote:
> > On Wed, Jul 7, 2021 at 10:54 PM Segher Boessenkool
> > <segher@kernel.crashing.org> wrote:
>
> [ snip some old stuff ]
>
> > > Yeah.  This stuff needs a rethink.
> > >
> > > What is wrong with just using an unspec and clobbers?
> > >
> > It's a partial and **potential** clobber;
>
> All RTL "clobber" is always a potential clobber; it never guarantees the
> existing value does not survive.  You can pass it through some unspec to
> make this more explicit.  You will have to add some hook that CSE can
TARGET_INSN_CALLEE_ABI is the hook designed for this.
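Roughly, from the patch (a sketch; the review asked to rename the
constants to ABI_*, so the final names may differ):

/* Sketch of the i386 hook: a call_insn matching vzeroupper_pattern
   gets the special vzeroupper callee ABI, everything else gets the
   default ABI.  */
const predefined_function_abi &
ix86_insn_callee_abi (const rtx_insn *insn)
{
  unsigned int abi_id = I386_DEFAULT;
  rtx pat = PATTERN (insn);
  if (vzeroupper_pattern (pat, VOIDmode))
    abi_id = I386_VZEROUPPER;
  return function_abis[abi_id];
}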


-- 
BR,
Hongtao


* Re: [PATCH 1/2] CALL_INSN may not be a real function call.
  2021-07-07 23:54                                             ` Segher Boessenkool
@ 2021-07-09  7:20                                               ` Hongtao Liu
  0 siblings, 0 replies; 45+ messages in thread
From: Hongtao Liu @ 2021-07-09  7:20 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Richard Biener, Jakub Jelinek, Richard Sandiford, liuhongt, GCC Patches

On Thu, Jul 8, 2021 at 7:56 AM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Wed, Jul 07, 2021 at 11:32:59PM +0800, Hongtao Liu wrote:
> > On Wed, Jul 7, 2021 at 10:54 PM Segher Boessenkool
> > <segher@kernel.crashing.org> wrote:
> > > So, a "FAKE_CALL" is very much a *real* call, on the RTL level, which is
> > > where we are here.  But you want it to be treated differently because it
> > > will eventually be replaced by different insns.
> > It's CALL_INSN on the rtl level,  but it's just a normal instruction
> > that it doesn't have a call stack, and it doesn't affect the control
> > flow
>
> There is no such thing as "call stack" (whatever that may mean) to do
> with the RTL "call" insn.  How the return address is stored (if at all)
> is up to the target.  Many do not store the return address on the stack
> (for example they have an RA or LR register for it).  Those that do
> store it on a stack do not all change the stack pointer.
>
> In RTL, it *does* change the control flow.  If you don't like that,
> don't use a "call" insn.  You will have to update a *lot* more code
> than you did, otherwise.
>
> > > So because of this one thing (you need to insert partial clobbers) you
> > > force all kinds of unrelated code to have changes, namely, code thatt
> > > needs to do something with calls, but now you do not want to have that
> > > doone on some calls because you promise that call will disappear
> > > eventually, and it cannot cause any problems in the mean time?
> > >
> > > I am not convinced.  This is not design, this is a terrible hack, this
> > > is the opposite direction we should go in.
> >
> > Quote from  https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570634.html
> >
> > > Also, I grepped for CALL_P and CALL_INSN in the GCC sources; there are
> > > many places which hold the assumption that CALL_P/CALL_INSN is a real call.
> > > Considering that vzeroupper is used a lot on the i386 backend, I'm a
> > > bit worried that this implementation solution will be a bottomless
> > > pit.
> >
> > Maybe, but I think the same is true for CLOBBER_HIGH.  If we have
> > a third alternative then we should consider it, but I think the
> > call approach is still going to be less problematic than CLOBBER_HIGH.
> >
> > The main advantage of the call approach is that the CALL_P handling
> > is (mostly) conservatively correct and performance problems are just
> > a one-line change.  The CLOBBER_HIGH approach instead requires
> > changes to the way that passes track liveness information for
> > non-call instructions (so is much more than a one-line change).
> > Also, treating a CLOBBER_HIGH like a CLOBBER isn't conservatively
> > correct, because other code might be relying on part of the register
> > being preserved.
>
> And this isn't a one-line change either, and it is only partial already,
> and we don't know how deep the rabbit hole goes.
Maybe, and if there's existing infrastructure to solve the vzeroupper
issue, I'm ok to change my patch.
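
For reference, the shape my patch gives the fake call is roughly the
following (simplified from the sse.md change; details may differ from
the final version):

;; vzeroupper as a call whose partial SSE clobber is described by a
;; dedicated callee ABI instead of explicit CLOBBERs in the pattern.
(define_expand "avx_vzeroupper"
  [(parallel [(call (mem:QI (const_int 0)) (const_int 0))
              (unspec [(const_int ABI_VZEROUPPER)] UNSPEC_CALLEE_ABI)])]
  "TARGET_AVX")

and every place that must only handle real calls is supposed to check
the new flag on the CALL_INSN, e.g.

  if (CALL_P (insn) && !FAKE_CALL_P (insn))
    /* Real call: control flow, EH, return address, etc.  */

which is admittedly the part that needs auditing throughout the tree.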
>
>
> Segher



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2021-07-09  7:15 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-13  9:23 [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735] Hongtao Liu
2021-05-13  9:40 ` Uros Bizjak
2021-05-13  9:43   ` Uros Bizjak
2021-05-13  9:54     ` Jakub Jelinek
2021-05-13 11:32       ` Richard Sandiford
2021-05-13 11:37         ` Jakub Jelinek
2021-05-13 11:52           ` Richard Sandiford
2021-05-14  2:27             ` Hongtao Liu
2021-05-17  8:44               ` Hongtao Liu
2021-05-17  9:56                 ` Richard Sandiford
2021-05-18 13:12                   ` Hongtao Liu
2021-05-18 15:18                     ` Richard Sandiford
2021-05-25  6:04                       ` Hongtao Liu
2021-05-25  6:30                         ` Hongtao Liu
2021-05-27  5:07                           ` Hongtao Liu
2021-05-27  7:05                             ` Uros Bizjak
2021-06-01  2:24                               ` Hongtao Liu
2021-06-03  6:54                               ` [PATCH 1/2] CALL_INSN may not be a real function call liuhongt
2021-06-03  6:54                                 ` [PATCH 2/2] Fix _mm256_zeroupper by representing the instructions as call_insns in which the call has a special vzeroupper ABI liuhongt
2021-06-04  2:56                                   ` Hongtao Liu
2021-06-04  6:26                                   ` Uros Bizjak
2021-06-04  6:34                                     ` Hongtao Liu
2021-06-07 19:04                                       ` [PATCH] x86: Don't compile pr82735-[345].c for x32 H.J. Lu
2021-06-04  2:55                                 ` [PATCH 1/2] CALL_INSN may not be a real function call Hongtao Liu
2021-06-04  7:50                                 ` Jakub Jelinek
2021-07-05 23:30                                 ` Segher Boessenkool
2021-07-06  0:03                                   ` Jeff Law
2021-07-06  1:49                                     ` Hongtao Liu
2021-07-07 14:55                                     ` Segher Boessenkool
2021-07-07 17:56                                       ` Jeff Law
2021-07-06  1:37                                   ` Hongtao Liu
2021-07-07  2:44                                     ` Hongtao Liu
2021-07-07  8:15                                       ` Richard Biener
2021-07-07 14:52                                         ` Segher Boessenkool
2021-07-07 15:23                                           ` Hongtao Liu
2021-07-07 23:42                                             ` Segher Boessenkool
2021-07-08  4:14                                               ` Hongtao Liu
2021-07-07 15:32                                           ` Hongtao Liu
2021-07-07 23:54                                             ` Segher Boessenkool
2021-07-09  7:20                                               ` Hongtao Liu
2021-07-07 15:52                                         ` Hongtao Liu
2021-05-27  7:20                             ` [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735] Jakub Jelinek
2021-05-27 10:50                               ` Richard Sandiford
2021-06-01  2:22                                 ` Hongtao Liu
2021-06-01  2:25                                   ` Hongtao Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).