* [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190]
@ 2020-02-04  9:39 Jakub Jelinek
  2020-02-04 10:16 ` Uros Bizjak
  0 siblings, 1 reply; 29+ messages in thread

From: Jakub Jelinek @ 2020-02-04  9:39 UTC (permalink / raw)
To: Uros Bizjak, Richard Sandiford; +Cc: gcc-patches

Hi!

As mentioned in the PR, the CLOBBERs in vzeroupper are added there even for
registers that aren't ever live in the function before and break the
prologue/epilogue expansion with ms ABI (normal ABIs are fine, as they
consider all [xyz]mm registers call clobbered, but the ms ABI considers
xmm0-15 call used but the bits above low 128 ones call clobbered).
The following patch fixes it by not adding the clobbers during vzeroupper
pass (before pro_and_epilogue), but adding them for -fipa-ra purposes only
during the final output.  Perhaps we could add some CLOBBERs early (say for
df_regs_ever_live_p regs that aren't live in the live_regs bitmap, or
depending on the ABI either add all of them immediately, or for ms ABI add
CLOBBERs for xmm0-xmm5 if they don't have a SET) and add the rest later.
And the addition could be perhaps done at other spots, e.g. in an
epilogue_completed guarded splitter.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
Or any of the above mentioned variants?

A separate issue is CALL_USED_REGISTERS for msabi on xmm16-xmm31.

And yet another issue is that the presence of a vzeroupper instruction does
(and the patch doesn't change) prevent callers from trying to preserve
something in the xmm0-xmm15 registers if the callee doesn't use them.
Would be nice to be able to say that the vzeroupper will preserve the low
128 bits, so it is ok to hold a 128-bit value across the call, but not
256-bit or larger.  Not sure if the RA has a way to express that (and
IPA-RA certainly ATM doesn't).

2020-02-04  Jakub Jelinek  <jakub@redhat.com>

	PR target/92190
	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only
	include sets and not clobbers in the vzeroupper pattern.
	* config/i386/sse.md (*avx_vzeroupper): Add the clobbers during
	pattern output.

	* gcc.target/i386/pr92190.c: New test.

--- gcc/config/i386/i386-features.c.jj	2020-02-03 13:17:15.919723150 +0100
+++ gcc/config/i386/i386-features.c	2020-02-03 18:30:08.274865983 +0100
@@ -1764,29 +1764,32 @@ convert_scalars_to_vector (bool timode_p
 
      (set (reg:V2DF R) (reg:V2DF R))
 
-   which preserves the low 128 bits but clobbers the upper bits.
-   For a dead register we just use:
-
-     (clobber (reg:V2DF R))
-
-   which invalidates any previous contents of R and stops R from becoming
-   live across the vzeroupper in future.  */
+   which preserves the low 128 bits but clobbers the upper bits.  */
 
 static void
 ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
 {
   rtx pattern = PATTERN (insn);
   unsigned int nregs = TARGET_64BIT ? 16 : 8;
-  rtvec vec = rtvec_alloc (nregs + 1);
-  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
+  unsigned int npats = nregs;
   for (unsigned int i = 0; i < nregs; ++i)
     {
       unsigned int regno = GET_SSE_REGNO (i);
+      if (!bitmap_bit_p (live_regs, regno))
+	npats--;
+    }
+  if (npats == 0)
+    return;
+  rtvec vec = rtvec_alloc (npats + 1);
+  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
+  for (unsigned int i = 0, j = 0; i < nregs; ++i)
+    {
+      unsigned int regno = GET_SSE_REGNO (i);
+      if (!bitmap_bit_p (live_regs, regno))
+	continue;
       rtx reg = gen_rtx_REG (V2DImode, regno);
-      if (bitmap_bit_p (live_regs, regno))
-	RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg);
-      else
-	RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+      ++j;
+      RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg);
     }
   XVEC (pattern, 0) = vec;
   df_insn_rescan (insn);
--- gcc/config/i386/sse.md.jj	2020-02-03 13:17:15.957722589 +0100
+++ gcc/config/i386/sse.md	2020-02-03 18:30:08.280865894 +0100
@@ -19819,7 +19819,34 @@ (define_insn "*avx_vzeroupper"
   [(match_parallel 0 "vzeroupper_pattern"
      [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
   "TARGET_AVX"
-  "vzeroupper"
+{
+  if (flag_ipa_ra && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1)
+    {
+      /* For IPA-RA purposes, make it clear the instruction clobbers
+	 even XMM registers not mentioned explicitly in the pattern.  */
+      unsigned int nregs = TARGET_64BIT ? 16 : 8;
+      unsigned int npats = XVECLEN (operands[0], 0);
+      rtvec vec = rtvec_alloc (nregs + 1);
+      RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0);
+      for (unsigned int i = 0, j = 1; i < nregs; ++i)
+	{
+	  unsigned int regno = GET_SSE_REGNO (i);
+	  if (j < npats
+	      && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno)
+	    {
+	      RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j);
+	      j++;
+	    }
+	  else
+	    {
+	      rtx reg = gen_rtx_REG (V2DImode, regno);
+	      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+	    }
+	}
+      XVEC (operands[0], 0) = vec;
+    }
+  return "vzeroupper";
+}
  [(set_attr "type" "sse")
   (set_attr "modrm" "0")
   (set_attr "memory" "none")
--- gcc/testsuite/gcc.target/i386/pr92190.c.jj	2020-02-03 18:29:30.924415644 +0100
+++ gcc/testsuite/gcc.target/i386/pr92190.c	2020-02-03 18:29:15.992635389 +0100
@@ -0,0 +1,19 @@
+/* PR target/92190 */
+/* { dg-do compile { target { *-*-linux* && lp64 } } } */
+/* { dg-options "-mabi=ms -O2 -mavx512f" } */
+
+typedef char VC __attribute__((vector_size (16)));
+typedef int VI __attribute__((vector_size (16 * sizeof 0)));
+VC a;
+VI b;
+void bar (VI);
+void baz (VC);
+
+void
+foo (void)
+{
+  VC k = a;
+  VI n = b;
+  bar (n);
+  baz (k);
+}

	Jakub

^ permalink raw reply	[flat|nested] 29+ messages in thread
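As a standalone illustration of the two-pass rewrite in ix86_add_reg_usage_to_vzeroupper above — first count the live registers, then allocate a vector with exactly one SET slot per live register plus slot 0 for the unspec, and fill it — here is a minimal C analogue. Plain int arrays stand in for the bitmap and rtvec, and the function name, sentinel value and liveness data are made up for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Analogue of the patched ix86_add_reg_usage_to_vzeroupper:
   pass 1 counts live registers, pass 2 fills a compact vector.
   live[i] nonzero means register i is live across the vzeroupper.
   out[0] gets -1 (standing in for the UNSPECV_VZEROUPPER element);
   out[1..npats] get the live register numbers (the SETs).
   Returns the number of elements written, or 0 if nothing to add.  */
static size_t
compact_live (const int *live, size_t nregs, int *out, size_t outsz)
{
  size_t npats = 0;
  for (size_t i = 0; i < nregs; ++i)       /* pass 1: count */
    if (live[i])
      npats++;
  if (npats == 0 || npats + 1 > outsz)
    return 0;                              /* nothing to add, or no room */
  out[0] = -1;                             /* slot 0: the unspec */
  for (size_t i = 0, j = 0; i < nregs; ++i) /* pass 2: fill */
    if (live[i])
      out[++j] = (int) i;                  /* a (set (reg i) (reg i)) */
  return npats + 1;
}
```

With 8 registers of which 0, 3 and 4 are live, this yields a 4-element vector: the unspec slot plus three SETs, exactly the compact PARALLEL the patch now builds.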
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190]
  2020-02-04  9:39 [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] Jakub Jelinek
@ 2020-02-04 10:16 ` Uros Bizjak
  2020-02-04 11:05   ` Jakub Jelinek
  2020-02-04 11:42   ` [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI Jakub Jelinek
  0 siblings, 2 replies; 29+ messages in thread

From: Uros Bizjak @ 2020-02-04 10:16 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: Richard Sandiford, gcc-patches

On Tue, Feb 4, 2020 at 10:39 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> Hi!
>
> As mentioned in the PR, the CLOBBERs in vzeroupper are added there even for
> registers that aren't ever live in the function before and break the
> prologue/epilogue expansion with ms ABI (normal ABIs are fine, as they
> consider all [xyz]mm registers call clobbered, but the ms ABI considers
> xmm0-15 call used but the bits above low 128 ones call clobbered).
> The following patch fixes it by not adding the clobbers during vzeroupper
> pass (before pro_and_epilogue), but adding them for -fipa-ra purposes only
> during the final output.  Perhaps we could add some CLOBBERs early (say for
> df_regs_ever_live_p regs that aren't live in the live_regs bitmap, or
> depending on the ABI either add all of them immediately, or for ms ABI add
> CLOBBERs for xmm0-xmm5 if they don't have a SET) and add the rest later.
> And the addition could be perhaps done at other spots, e.g. in an
> epilogue_completed guarded splitter.

If it works OK, I'd rather see this functionality implemented as an
epilogue_completed guarded splitter.  In the .md files, there are
already cases where we split at this point, and where it is assumed
that allocated registers won't change anymore.  Also, please don't make
the functionality conditional on flag_split_ra.  This way, we would
always get new patterns in the debug dumps, so in case something goes
wrong, one could at least clearly see the full pattern.

> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> Or any of the above mentioned variants?
>
> A separate issue is CALL_USED_REGISTERS for msabi on xmm16-xmm31.

I guess that the Comment #9 patch from the PR should be trivially
correct, but although it looks obvious, I don't want to propose the
patch since I have no means of testing it.

Uros.

> And yet another issue is that the presence of a vzeroupper instruction does
> (and the patch doesn't change) prevent callers from trying to preserve
> something in the xmm0-xmm15 registers if the callee doesn't use them.
> Would be nice to be able to say that the vzeroupper will preserve the low
> 128 bits, so it is ok to hold a 128-bit value across the call, but not
> 256-bit or larger.  Not sure if the RA has a way to express that (and
> IPA-RA certainly ATM doesn't).
>
> [patch quoted in full, snipped]
>
> 	Jakub
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190]
  2020-02-04 10:16 ` Uros Bizjak
@ 2020-02-04 11:05   ` Jakub Jelinek
  2020-02-04 11:13     ` Uros Bizjak
  2020-02-04 11:42   ` [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI Jakub Jelinek
  1 sibling, 1 reply; 29+ messages in thread

From: Jakub Jelinek @ 2020-02-04 11:05 UTC (permalink / raw)
To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches

On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote:
> If it works OK, I'd rather see this functionality implemented as an
> epilogue_completed guarded splitter.  In the .md files, there are
> already cases where we split at this point, and where it is assumed
> that allocated registers won't change anymore.  Also, please don't make
> the functionality conditional on flag_split_ra.  This way, we would
> always get new patterns in the debug dumps, so in case something goes
> wrong, one could at least clearly see the full pattern.

The following seems to work on the testcase, will bootstrap/regtest it soon.

2020-02-04  Jakub Jelinek  <jakub@redhat.com>

	PR target/92190
	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only
	include sets and not clobbers in the vzeroupper pattern.
	* config/i386/sse.md (*avx_vzeroupper): Change from define_insn to
	define_insn_and_split, split if epilogue_completed and not all
	xmm0-xmm15 registers are mentioned in the pattern and add clobbers
	for the missing registers at that point.

	* gcc.target/i386/pr92190.c: New test.

--- gcc/config/i386/i386-features.c.jj	2020-02-04 11:40:58.755611428 +0100
+++ gcc/config/i386/i386-features.c	2020-02-04 11:51:33.602148491 +0100
@@ -1764,29 +1764,32 @@ convert_scalars_to_vector (bool timode_p
 
      (set (reg:V2DF R) (reg:V2DF R))
 
-   which preserves the low 128 bits but clobbers the upper bits.
-   For a dead register we just use:
-
-     (clobber (reg:V2DF R))
-
-   which invalidates any previous contents of R and stops R from becoming
-   live across the vzeroupper in future.  */
+   which preserves the low 128 bits but clobbers the upper bits.  */
 
 static void
 ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
 {
   rtx pattern = PATTERN (insn);
   unsigned int nregs = TARGET_64BIT ? 16 : 8;
-  rtvec vec = rtvec_alloc (nregs + 1);
-  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
+  unsigned int npats = nregs;
   for (unsigned int i = 0; i < nregs; ++i)
     {
       unsigned int regno = GET_SSE_REGNO (i);
+      if (!bitmap_bit_p (live_regs, regno))
+	npats--;
+    }
+  if (npats == 0)
+    return;
+  rtvec vec = rtvec_alloc (npats + 1);
+  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
+  for (unsigned int i = 0, j = 0; i < nregs; ++i)
+    {
+      unsigned int regno = GET_SSE_REGNO (i);
+      if (!bitmap_bit_p (live_regs, regno))
+	continue;
       rtx reg = gen_rtx_REG (V2DImode, regno);
-      if (bitmap_bit_p (live_regs, regno))
-	RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg);
-      else
-	RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+      ++j;
+      RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg);
     }
   XVEC (pattern, 0) = vec;
   df_insn_rescan (insn);
--- gcc/config/i386/sse.md.jj	2020-02-04 11:40:58.813610563 +0100
+++ gcc/config/i386/sse.md	2020-02-04 11:58:31.544909659 +0100
@@ -19815,11 +19815,43 @@ (define_expand "avx_vzeroupper"
   [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
   "TARGET_AVX")
 
-(define_insn "*avx_vzeroupper"
+(define_insn_and_split "*avx_vzeroupper"
   [(match_parallel 0 "vzeroupper_pattern"
      [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
   "TARGET_AVX"
-  "vzeroupper"
+{
+  if (XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1)
+    return "#";
+  else
+    return "vzeroupper";
+}
+  "epilogue_completed
+   && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1"
+  [(match_dup 0)]
+{
+  /* For IPA-RA purposes, make it clear the instruction clobbers
+     even XMM registers not mentioned explicitly in the pattern.  */
+  unsigned int nregs = TARGET_64BIT ? 16 : 8;
+  unsigned int npats = XVECLEN (operands[0], 0);
+  rtvec vec = rtvec_alloc (nregs + 1);
+  RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0);
+  for (unsigned int i = 0, j = 1; i < nregs; ++i)
+    {
+      unsigned int regno = GET_SSE_REGNO (i);
+      if (j < npats
+	  && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno)
+	{
+	  RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j);
+	  j++;
+	}
+      else
+	{
+	  rtx reg = gen_rtx_REG (V2DImode, regno);
+	  RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+	}
    }
+  XVEC (operands[0], 0) = vec;
+}
  [(set_attr "type" "sse")
   (set_attr "modrm" "0")
   (set_attr "memory" "none")
--- gcc/testsuite/gcc.target/i386/pr92190.c.jj	2020-02-04 11:51:33.608148402 +0100
+++ gcc/testsuite/gcc.target/i386/pr92190.c	2020-02-04 11:51:33.608148402 +0100
@@ -0,0 +1,19 @@
+/* PR target/92190 */
+/* { dg-do compile { target { *-*-linux* && lp64 } } } */
+/* { dg-options "-mabi=ms -O2 -mavx512f" } */
+
+typedef char VC __attribute__((vector_size (16)));
+typedef int VI __attribute__((vector_size (16 * sizeof 0)));
+VC a;
+VI b;
+void bar (VI);
+void baz (VC);
+
+void
+foo (void)
+{
+  VC k = a;
+  VI n = b;
+  bar (n);
+  baz (k);
+}

	Jakub
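The merge step in the new splitter — walk every XMM register number in order, reuse the SET that was kept for each live register, and synthesize a CLOBBER for every other one — can be sketched in isolation like this. Arrays stand in for rtvecs, and the function name and sample data are invented for the illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Analogue of the splitter body: "pats" lists, in ascending order, the
   register numbers that kept a (set (reg) (reg)) in the compact pattern
   (as built by the vzeroupper pass earlier).  For each register in
   0..nregs-1, out[i] records 1 if its SET is reused and 0 if a CLOBBER
   is added instead, mirroring the full PARALLEL the split produces.  */
static void
expand_with_clobbers (const int *pats, size_t npats, size_t nregs, int *out)
{
  for (size_t i = 0, j = 0; i < nregs; ++i)
    {
      if (j < npats && (size_t) pats[j] == i)
	{
	  out[i] = 1;	/* reuse the existing (set (reg i) (reg i)) */
	  j++;
	}
      else
	out[i] = 0;	/* add (clobber (reg i)) for IPA-RA */
    }
}
```

Starting from the 3-register compact pattern of the earlier sketch (registers 0, 3 and 4 live), this expands back to the full 8-register vector: three reused SETs and five fresh CLOBBERs.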
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190]
  2020-02-04 11:05   ` Jakub Jelinek
@ 2020-02-04 11:13     ` Uros Bizjak
  2020-02-04 11:24       ` Uros Bizjak
  0 siblings, 1 reply; 29+ messages in thread

From: Uros Bizjak @ 2020-02-04 11:13 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: Richard Sandiford, gcc-patches

On Tue, Feb 4, 2020 at 12:05 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote:
> > If it works OK, I'd rather see this functionality implemented as an
> > epilogue_completed guarded splitter.  In the .md files, there are
> > already cases where we split at this point, and where it is assumed
> > that allocated registers won't change anymore.  Also, please don't make
> > the functionality conditional on flag_split_ra.  This way, we would
> > always get new patterns in the debug dumps, so in case something goes
> > wrong, one could at least clearly see the full pattern.
>
> The following seems to work on the testcase, will bootstrap/regtest it soon.
>
> 2020-02-04  Jakub Jelinek  <jakub@redhat.com>
>
> 	PR target/92190
> 	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only
> 	include sets and not clobbers in the vzeroupper pattern.
> 	* config/i386/sse.md (*avx_vzeroupper): Change from define_insn to
> 	define_insn_and_split, split if epilogue_completed and not all
> 	xmm0-xmm15 registers are mentioned in the pattern and add clobbers
> 	for the missing registers at that point.
>
> 	* gcc.target/i386/pr92190.c: New test.

A && is missing in the split condition to inherit TARGET_AVX.

OK with this change.

Thanks,
Uros.

> --- gcc/config/i386/i386-features.c.jj	2020-02-04 11:40:58.755611428 +0100
> +++ gcc/config/i386/i386-features.c	2020-02-04 11:51:33.602148491 +0100
> [hunk quoted in full, snipped]
> --- gcc/config/i386/sse.md.jj	2020-02-04 11:40:58.813610563 +0100
> +++ gcc/config/i386/sse.md	2020-02-04 11:58:31.544909659 +0100
> @@ -19815,11 +19815,43 @@ (define_expand "avx_vzeroupper"
>    [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
>    "TARGET_AVX")
>
> -(define_insn "*avx_vzeroupper"
> +(define_insn_and_split "*avx_vzeroupper"
>    [(match_parallel 0 "vzeroupper_pattern"
>      [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
>    "TARGET_AVX"
> -  "vzeroupper"
> +{
> +  if (XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1)
> +    return "#";
> +  else
> +    return "vzeroupper";
> +}
> +  "epilogue_completed

&& epilogue_completed

> +   && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1"
> +  [(match_dup 0)]
> +{
> [rest of patch quoted in full, snipped]
>
> 	Jakub
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190]
  2020-02-04 11:13     ` Uros Bizjak
@ 2020-02-04 11:24       ` Uros Bizjak
  2020-02-04 12:06         ` Richard Sandiford
  2020-02-04 12:31         ` Jakub Jelinek
  0 siblings, 2 replies; 29+ messages in thread

From: Uros Bizjak @ 2020-02-04 11:24 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: Richard Sandiford, gcc-patches

On Tue, Feb 4, 2020 at 12:13 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Tue, Feb 4, 2020 at 12:05 PM Jakub Jelinek <jakub@redhat.com> wrote:
> >
> > On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote:
> > > If it works OK, I'd rather see this functionality implemented as an
> > > epilogue_completed guarded splitter.  In the .md files, there are
> > > already cases where we split at this point, and where it is assumed
> > > that allocated registers won't change anymore.  Also, please don't make
> > > the functionality conditional on flag_split_ra.  This way, we would
> > > always get new patterns in the debug dumps, so in case something goes
> > > wrong, one could at least clearly see the full pattern.
> >
> > The following seems to work on the testcase, will bootstrap/regtest it soon.
> >
> > 2020-02-04  Jakub Jelinek  <jakub@redhat.com>
> >
> > 	PR target/92190
> > 	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only
> > 	include sets and not clobbers in the vzeroupper pattern.
> > 	* config/i386/sse.md (*avx_vzeroupper): Change from define_insn to
> > 	define_insn_and_split, split if epilogue_completed and not all
> > 	xmm0-xmm15 registers are mentioned in the pattern and add clobbers
> > 	for the missing registers at that point.
> >
> > 	* gcc.target/i386/pr92190.c: New test.
>
> A && is missing in the split condition to inherit TARGET_AVX.

Also, you don't need to emit "#" in output template.  This is just for
extra safety, we can live without.  Please see e.g. "*tzcnt<mode>_1".

> OK with this change.
>
> Thanks,
> Uros.

> > [patch quoted in full, snipped]
> >
> > 	Jakub
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190]
  2020-02-04 11:24       ` Uros Bizjak
@ 2020-02-04 12:06         ` Richard Sandiford
  2020-02-04 12:23           ` Uros Bizjak
  2020-02-08  1:50           ` Segher Boessenkool
  1 sibling, 2 replies; 29+ messages in thread

From: Richard Sandiford @ 2020-02-04 12:06 UTC (permalink / raw)
To: Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches

Uros Bizjak <ubizjak@gmail.com> writes:
> On Tue, Feb 4, 2020 at 12:13 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>>
>> On Tue, Feb 4, 2020 at 12:05 PM Jakub Jelinek <jakub@redhat.com> wrote:
>> >
>> > On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote:
>> > > If it works OK, I'd rather see this functionality implemented as an
>> > > epilogue_completed guarded splitter.  In the .md files, there are
>> > > already cases where we split at this point, and where it is assumed
>> > > that allocated registers won't change anymore.  Also, please don't make
>> > > the functionality conditional on flag_split_ra.  This way, we would
>> > > always get new patterns in the debug dumps, so in case something goes
>> > > wrong, one could at least clearly see the full pattern.
>> >
>> > The following seems to work on the testcase, will bootstrap/regtest it soon.
>> >
>> > 2020-02-04  Jakub Jelinek  <jakub@redhat.com>
>> >
>> > 	PR target/92190
>> > 	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only
>> > 	include sets and not clobbers in the vzeroupper pattern.
>> > 	* config/i386/sse.md (*avx_vzeroupper): Change from define_insn to
>> > 	define_insn_and_split, split if epilogue_completed and not all
>> > 	xmm0-xmm15 registers are mentioned in the pattern and add clobbers
>> > 	for the missing registers at that point.
>> >
>> > 	* gcc.target/i386/pr92190.c: New test.
>>
>> A && is missing in the split condition to inherit TARGET_AVX.
>
> Also, you don't need to emit "#" in output template.  This is just for
> extra safety, we can live without.  Please see e.g. "*tzcnt<mode>_1".

IMO it's better to keep it here, where we're relying on the split
happening for correctness.  There's no theoretical guarantee that
a split ever happens otherwise.

Thanks,
Richard

>> OK with this change.
>>
>> Thanks,
>> Uros.
>>
>> > [patch quoted in full, snipped]
>> >
>> > 	Jakub
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-04 12:06 ` Richard Sandiford @ 2020-02-04 12:23 ` Uros Bizjak 2020-02-08 1:50 ` Segher Boessenkool 1 sibling, 0 replies; 29+ messages in thread From: Uros Bizjak @ 2020-02-04 12:23 UTC (permalink / raw) To: Uros Bizjak, Jakub Jelinek, gcc-patches, Richard Sandiford On Tue, Feb 4, 2020 at 1:06 PM Richard Sandiford <richard.sandiford@arm.com> wrote: > > Uros Bizjak <ubizjak@gmail.com> writes: > > On Tue, Feb 4, 2020 at 12:13 PM Uros Bizjak <ubizjak@gmail.com> wrote: > >> > >> On Tue, Feb 4, 2020 at 12:05 PM Jakub Jelinek <jakub@redhat.com> wrote: > >> > > >> > On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote: > >> > > If it works OK, I'd rather see this functionality implemented as an > >> > > epilogue_completed guarded splitter. In the .md files, there are > >> > > already cases where we split at this point, and where it is assumed > >> > > that allocated registers won't change anymore. Also, please don't make > >> > > the functionality conditional on flag_split_ra. This way, we would > >> > > always get new patterns in the debug dumps, so in case something goes > >> > > wrong, one could at least clearly see the full pattern. > >> > > >> > The following seems to work on the testcase, will bootstrap/regtest it soon. > >> > > >> > 2020-02-04 Jakub Jelinek <jakub@redhat.com> > >> > > >> > PR target/92190 > >> > * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only > >> > include sets and not clobbers in the vzeroupper pattern. > >> > * config/i386/sse.md (*avx_vzeroupper): Change from define_insn to > >> > define_insn_and_split, split if epilogue_completed and not all xmm0-xmm15 > >> > registers are mentioned in the pattern and add clobbers for the missing > >> > registers at that point. > >> > > >> > * gcc.target/i386/pr92190.c: New test. > >> > >> A && is missing in the split condition to inherit TARGET_AVX. 
> > > > Also, you don't need to emit "#" in output template. This is just for > > extra safety, we can live without. Please see e.g. "*tzcnt<mode>_1". > > IMO it's better to keep it here, where we're relying on the split > happening for correctness. There's no theoretical guarantee that > a split ever happens otherwise. In this case, I think it would be better to have two patterns, one (define_insn_and_split) with "#" and the other "real" (define_insn). Uros. > Thanks, > Richard > >> OK with this change. > >> > >> Thanks, > >> Uros. > >> > >> > > >> > --- gcc/config/i386/i386-features.c.jj 2020-02-04 11:40:58.755611428 +0100 > >> > +++ gcc/config/i386/i386-features.c 2020-02-04 11:51:33.602148491 +0100 > >> > @@ -1764,29 +1764,32 @@ convert_scalars_to_vector (bool timode_p > >> > > >> > (set (reg:V2DF R) (reg:V2DF R)) > >> > > >> > - which preserves the low 128 bits but clobbers the upper bits. > >> > - For a dead register we just use: > >> > - > >> > - (clobber (reg:V2DF R)) > >> > - > >> > - which invalidates any previous contents of R and stops R from becoming > >> > - live across the vzeroupper in future. */ > >> > + which preserves the low 128 bits but clobbers the upper bits. */ > >> > > >> > static void > >> > ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs) > >> > { > >> > rtx pattern = PATTERN (insn); > >> > unsigned int nregs = TARGET_64BIT ? 
16 : 8; > >> > - rtvec vec = rtvec_alloc (nregs + 1); > >> > - RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); > >> > + unsigned int npats = nregs; > >> > for (unsigned int i = 0; i < nregs; ++i) > >> > { > >> > unsigned int regno = GET_SSE_REGNO (i); > >> > + if (!bitmap_bit_p (live_regs, regno)) > >> > + npats--; > >> > + } > >> > + if (npats == 0) > >> > + return; > >> > + rtvec vec = rtvec_alloc (npats + 1); > >> > + RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); > >> > + for (unsigned int i = 0, j = 0; i < nregs; ++i) > >> > + { > >> > + unsigned int regno = GET_SSE_REGNO (i); > >> > + if (!bitmap_bit_p (live_regs, regno)) > >> > + continue; > >> > rtx reg = gen_rtx_REG (V2DImode, regno); > >> > - if (bitmap_bit_p (live_regs, regno)) > >> > - RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg); > >> > - else > >> > - RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); > >> > + ++j; > >> > + RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg); > >> > } > >> > XVEC (pattern, 0) = vec; > >> > df_insn_rescan (insn); > >> > --- gcc/config/i386/sse.md.jj 2020-02-04 11:40:58.813610563 +0100 > >> > +++ gcc/config/i386/sse.md 2020-02-04 11:58:31.544909659 +0100 > >> > @@ -19815,11 +19815,43 @@ (define_expand "avx_vzeroupper" > >> > [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > >> > "TARGET_AVX") > >> > > >> > -(define_insn "*avx_vzeroupper" > >> > +(define_insn_and_split "*avx_vzeroupper" > >> > [(match_parallel 0 "vzeroupper_pattern" > >> > [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > >> > "TARGET_AVX" > >> > - "vzeroupper" > >> > +{ > >> > + if (XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1) > >> > + return "#"; > >> > + else > >> > + return "vzeroupper"; > >> > +} > >> > + "epilogue_completed > >> > >> && epilogue_completed > >> > >> > + && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 
16 : 8) + 1" > >> > + [(match_dup 0)] > >> > +{ > >> > + /* For IPA-RA purposes, make it clear the instruction clobbers > >> > + even XMM registers not mentioned explicitly in the pattern. */ > >> > + unsigned int nregs = TARGET_64BIT ? 16 : 8; > >> > + unsigned int npats = XVECLEN (operands[0], 0); > >> > + rtvec vec = rtvec_alloc (nregs + 1); > >> > + RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0); > >> > + for (unsigned int i = 0, j = 1; i < nregs; ++i) > >> > + { > >> > + unsigned int regno = GET_SSE_REGNO (i); > >> > + if (j < npats > >> > + && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno) > >> > + { > >> > + RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j); > >> > + j++; > >> > + } > >> > + else > >> > + { > >> > + rtx reg = gen_rtx_REG (V2DImode, regno); > >> > + RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); > >> > + } > >> > + } > >> > + XVEC (operands[0], 0) = vec; > >> > +} > >> > [(set_attr "type" "sse") > >> > (set_attr "modrm" "0") > >> > (set_attr "memory" "none") > >> > --- gcc/testsuite/gcc.target/i386/pr92190.c.jj 2020-02-04 11:51:33.608148402 +0100 > >> > +++ gcc/testsuite/gcc.target/i386/pr92190.c 2020-02-04 11:51:33.608148402 +0100 > >> > @@ -0,0 +1,19 @@ > >> > +/* PR target/92190 */ > >> > +/* { dg-do compile { target { *-*-linux* && lp64 } } } */ > >> > +/* { dg-options "-mabi=ms -O2 -mavx512f" } */ > >> > + > >> > +typedef char VC __attribute__((vector_size (16))); > >> > +typedef int VI __attribute__((vector_size (16 * sizeof 0))); > >> > +VC a; > >> > +VI b; > >> > +void bar (VI); > >> > +void baz (VC); > >> > + > >> > +void > >> > +foo (void) > >> > +{ > >> > + VC k = a; > >> > + VI n = b; > >> > + bar (n); > >> > + baz (k); > >> > +} > >> > > >> > > >> > Jakub > >> > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190]
  2020-02-04 12:06 ` Richard Sandiford
  2020-02-04 12:23   ` Uros Bizjak
@ 2020-02-08  1:50   ` Segher Boessenkool
  1 sibling, 0 replies; 29+ messages in thread
From: Segher Boessenkool @ 2020-02-08 1:50 UTC (permalink / raw)
To: Uros Bizjak, Jakub Jelinek, gcc-patches, richard.sandiford

On Tue, Feb 04, 2020 at 12:06:54PM +0000, Richard Sandiford wrote:
> > Also, you don't need to emit "#" in output template. This is just for
> > extra safety, we can live without. Please see e.g. "*tzcnt<mode>_1".
>
> IMO it's better to keep it here, where we're relying on the split
> happening for correctness. There's no theoretical guarantee that
> a split ever happens otherwise.

A split will always happen if you have the "length" attribute and no
register stacks (split_for_shorten_branches, "split5", almost directly
before final). This has been true for so long that lots of things are
bound to rely on it.

"#" only has real meaning in final, of course (but is nice documentation
for everything else).

Maybe we should just always run split5, and not do it in final at all
anymore? Or at least always for targets with the "length" attribute?


Segher
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-04 11:24 ` Uros Bizjak 2020-02-04 12:06 ` Richard Sandiford @ 2020-02-04 12:31 ` Jakub Jelinek 2020-02-04 12:39 ` Uros Bizjak 1 sibling, 1 reply; 29+ messages in thread From: Jakub Jelinek @ 2020-02-04 12:31 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches On Tue, Feb 04, 2020 at 12:24:10PM +0100, Uros Bizjak wrote: > > A && is missing in the split condition to inherit TARGET_AVX. > > Also, you don't need to emit "#" in output template. This is just for > extra safety, we can live without. Please see e.g. "*tzcnt<mode>_1". I was worried that -fipa-ra then would give wrong answers. -O2 -mabi=ms -mavx512f -fno-schedule-insns2 I was worried about isn't a problem though, because while the split4 pass is disabled, the split3 pass is scheduled instead before regstack. I can do -O2 -mabi=ms -mavx512f -fdisable-rtl-split4 or -O2 -mabi=ms -mavx512f -fno-schedule-insns2 -fdisable-rtl-split3 though and then we ICE, with the patch as is with just && epilogue_completed ... 
the ICE is cc1: note: disable pass rtl-split4 for functions in the range of [0, 4294967295] pr92190.c: In function ‘foo’: pr92190.c:19:1: error: could not split insn 19 | } | ^ (insn:TI 23 18 10 2 (parallel [ (unspec_volatile [ (const_int 0 [0]) ] UNSPECV_VZEROUPPER) (clobber (reg:V2DI 20 xmm0)) (clobber (reg:V2DI 21 xmm1)) (clobber (reg:V2DI 22 xmm2)) (clobber (reg:V2DI 23 xmm3)) (clobber (reg:V2DI 24 xmm4)) (clobber (reg:V2DI 25 xmm5)) (set (reg:V2DI 26 xmm6) (reg:V2DI 26 xmm6)) (clobber (reg:V2DI 27 xmm7)) (clobber (reg:V2DI 44 xmm8)) (clobber (reg:V2DI 45 xmm9)) (clobber (reg:V2DI 46 xmm10)) (clobber (reg:V2DI 47 xmm11)) (clobber (reg:V2DI 48 xmm12)) (clobber (reg:V2DI 49 xmm13)) (clobber (reg:V2DI 50 xmm14)) (clobber (reg:V2DI 51 xmm15)) ]) "pr92190.c":17:3 4895 {*avx_vzeroupper} (nil)) because as the splitter is written, it modifies the insn in place and so later in try_split if (INSN_P (insn_last) && rtx_equal_p (PATTERN (insn_last), pat)) return trial; rtx_equal_p is true because of the modification in place. Now, if I slightly modify the patch, so that it does operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); instead of XVEC (operands[0], 0) = vec; and thus doesn't modify in place, it will ICE in a different spot: pr92190.c: In function ‘foo’: pr92190.c:19:1: internal compiler error: in final_scan_insn_1, at final.c:3078 19 | } | ^ i.e. on /* If we have a length attribute, this instruction should have been split in shorten_branches, to ensure that we would have valid length info for the splitees. */ gcc_assert (!HAVE_ATTR_length); Though, I guess that is not really specific to this patch, by -fdisable-rtl-split{3,4,5} we effectively disable all possibilities of splitting before final, so anything that isn't split before will ICE that way. So, if we wanted to remove that class of ICEs, we should just fatal_error if somebody disables the last splitter pass before expansion and HAVE_ATTR_length is defined. 
Here is the patch with && before epilogue_completed, no "#" and that avoids modification of the insn in place. We don't ICE, just could generate wrong code with -fipa-ra, but user asked for that with -fdisable-rtl-split{3,4,5}, so guess that is ok. 2020-02-04 Jakub Jelinek <jakub@redhat.com> PR target/92190 * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only include sets and not clobbers in the vzeroupper pattern. * config/i386/sse.md (*avx_vzeroupper): Change from define_insn to define_insn_and_split, split if epilogue_completed and not all xmm0-xmm15 registers are mentioned in the pattern and add clobbers for the missing registers at that point. * gcc.target/i386/pr92190.c: New test. --- gcc/config/i386/i386-features.c.jj 2020-02-04 11:40:58.755611428 +0100 +++ gcc/config/i386/i386-features.c 2020-02-04 11:51:33.602148491 +0100 @@ -1764,29 +1764,32 @@ convert_scalars_to_vector (bool timode_p (set (reg:V2DF R) (reg:V2DF R)) - which preserves the low 128 bits but clobbers the upper bits. - For a dead register we just use: - - (clobber (reg:V2DF R)) - - which invalidates any previous contents of R and stops R from becoming - live across the vzeroupper in future. */ + which preserves the low 128 bits but clobbers the upper bits. */ static void ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs) { rtx pattern = PATTERN (insn); unsigned int nregs = TARGET_64BIT ? 
16 : 8; - rtvec vec = rtvec_alloc (nregs + 1); - RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); + unsigned int npats = nregs; for (unsigned int i = 0; i < nregs; ++i) { unsigned int regno = GET_SSE_REGNO (i); + if (!bitmap_bit_p (live_regs, regno)) + npats--; + } + if (npats == 0) + return; + rtvec vec = rtvec_alloc (npats + 1); + RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); + for (unsigned int i = 0, j = 0; i < nregs; ++i) + { + unsigned int regno = GET_SSE_REGNO (i); + if (!bitmap_bit_p (live_regs, regno)) + continue; rtx reg = gen_rtx_REG (V2DImode, regno); - if (bitmap_bit_p (live_regs, regno)) - RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg); - else - RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); + ++j; + RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg); } XVEC (pattern, 0) = vec; df_insn_rescan (insn); --- gcc/config/i386/sse.md.jj 2020-02-04 11:40:58.813610563 +0100 +++ gcc/config/i386/sse.md 2020-02-04 13:21:50.088340875 +0100 @@ -19815,11 +19815,38 @@ (define_expand "avx_vzeroupper" [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] "TARGET_AVX") -(define_insn "*avx_vzeroupper" +(define_insn_and_split "*avx_vzeroupper" [(match_parallel 0 "vzeroupper_pattern" [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] "TARGET_AVX" "vzeroupper" + "&& epilogue_completed + && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1" + [(match_dup 0)] +{ + /* For IPA-RA purposes, make it clear the instruction clobbers + even XMM registers not mentioned explicitly in the pattern. */ + unsigned int nregs = TARGET_64BIT ? 
16 : 8; + unsigned int npats = XVECLEN (operands[0], 0); + rtvec vec = rtvec_alloc (nregs + 1); + RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0); + for (unsigned int i = 0, j = 1; i < nregs; ++i) + { + unsigned int regno = GET_SSE_REGNO (i); + if (j < npats + && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno) + { + RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j); + j++; + } + else + { + rtx reg = gen_rtx_REG (V2DImode, regno); + RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); + } + } + operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); +} [(set_attr "type" "sse") (set_attr "modrm" "0") (set_attr "memory" "none") --- gcc/testsuite/gcc.target/i386/pr92190.c.jj 2020-02-04 11:51:33.608148402 +0100 +++ gcc/testsuite/gcc.target/i386/pr92190.c 2020-02-04 11:51:33.608148402 +0100 @@ -0,0 +1,19 @@ +/* PR target/92190 */ +/* { dg-do compile { target { *-*-linux* && lp64 } } } */ +/* { dg-options "-mabi=ms -O2 -mavx512f" } */ + +typedef char VC __attribute__((vector_size (16))); +typedef int VI __attribute__((vector_size (16 * sizeof 0))); +VC a; +VI b; +void bar (VI); +void baz (VC); + +void +foo (void) +{ + VC k = a; + VI n = b; + bar (n); + baz (k); +} Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-04 12:31 ` Jakub Jelinek @ 2020-02-04 12:39 ` Uros Bizjak 2020-02-04 13:13 ` Jakub Jelinek 0 siblings, 1 reply; 29+ messages in thread From: Uros Bizjak @ 2020-02-04 12:39 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Richard Sandiford, gcc-patches On Tue, Feb 4, 2020 at 1:30 PM Jakub Jelinek <jakub@redhat.com> wrote: > > On Tue, Feb 04, 2020 at 12:24:10PM +0100, Uros Bizjak wrote: > > > A && is missing in the split condition to inherit TARGET_AVX. > > > > Also, you don't need to emit "#" in output template. This is just for > > extra safety, we can live without. Please see e.g. "*tzcnt<mode>_1". > > I was worried that -fipa-ra then would give wrong answers. > -O2 -mabi=ms -mavx512f -fno-schedule-insns2 I was worried about isn't a problem though, > because while the split4 pass is disabled, the split3 pass is scheduled instead before > regstack. > I can do > -O2 -mabi=ms -mavx512f -fdisable-rtl-split4 > or > -O2 -mabi=ms -mavx512f -fno-schedule-insns2 -fdisable-rtl-split3 > though and then we ICE, with the patch as is with just && epilogue_completed ... 
> the ICE is > cc1: note: disable pass rtl-split4 for functions in the range of [0, 4294967295] > pr92190.c: In function ‘foo’: > pr92190.c:19:1: error: could not split insn > 19 | } > | ^ > (insn:TI 23 18 10 2 (parallel [ > (unspec_volatile [ > (const_int 0 [0]) > ] UNSPECV_VZEROUPPER) > (clobber (reg:V2DI 20 xmm0)) > (clobber (reg:V2DI 21 xmm1)) > (clobber (reg:V2DI 22 xmm2)) > (clobber (reg:V2DI 23 xmm3)) > (clobber (reg:V2DI 24 xmm4)) > (clobber (reg:V2DI 25 xmm5)) > (set (reg:V2DI 26 xmm6) > (reg:V2DI 26 xmm6)) > (clobber (reg:V2DI 27 xmm7)) > (clobber (reg:V2DI 44 xmm8)) > (clobber (reg:V2DI 45 xmm9)) > (clobber (reg:V2DI 46 xmm10)) > (clobber (reg:V2DI 47 xmm11)) > (clobber (reg:V2DI 48 xmm12)) > (clobber (reg:V2DI 49 xmm13)) > (clobber (reg:V2DI 50 xmm14)) > (clobber (reg:V2DI 51 xmm15)) > ]) "pr92190.c":17:3 4895 {*avx_vzeroupper} > (nil)) > because as the splitter is written, it modifies the insn in place and so > later in try_split > if (INSN_P (insn_last) > && rtx_equal_p (PATTERN (insn_last), pat)) > return trial; > rtx_equal_p is true because of the modification in place. > Now, if I slightly modify the patch, so that it does > operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); > instead of > XVEC (operands[0], 0) = vec; > and thus doesn't modify in place, it will ICE in a different spot: > pr92190.c: In function ‘foo’: > pr92190.c:19:1: internal compiler error: in final_scan_insn_1, at final.c:3078 > 19 | } > | ^ > i.e. on > /* If we have a length attribute, this instruction should have > been split in shorten_branches, to ensure that we would have > valid length info for the splitees. */ > gcc_assert (!HAVE_ATTR_length); > Though, I guess that is not really specific to this patch, by > -fdisable-rtl-split{3,4,5} we effectively disable all possibilities of > splitting before final, so anything that isn't split before will ICE that > way. 
> So, if we wanted to remove that class of ICEs, we should just fatal_error if > somebody disables the last splitter pass before expansion and > HAVE_ATTR_length is defined. > > Here is the patch with && before epilogue_completed, no "#" and that avoids > modification of the insn in place. We don't ICE, just could generate wrong > code with -fipa-ra, but user asked for that with -fdisable-rtl-split{3,4,5}, > so guess that is ok. > > 2020-02-04 Jakub Jelinek <jakub@redhat.com> > > PR target/92190 > * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only > include sets and not clobbers in the vzeroupper pattern. > * config/i386/sse.md (*avx_vzeroupper): Change from define_insn to > define_insn_and_split, split if epilogue_completed and not all xmm0-xmm15 > registers are mentioned in the pattern and add clobbers for the missing > registers at that point. > > * gcc.target/i386/pr92190.c: New test. As Richard advised, let's put this safety stuff back. Usually, in i386.md, these kind of splitters are implemented as two patterns, one (define_insn_and_split) having "#", and the other (define_insn) with a real insn. My opinion is, that this separation avoids confusion as much as possible. Uros. > --- gcc/config/i386/i386-features.c.jj 2020-02-04 11:40:58.755611428 +0100 > +++ gcc/config/i386/i386-features.c 2020-02-04 11:51:33.602148491 +0100 > @@ -1764,29 +1764,32 @@ convert_scalars_to_vector (bool timode_p > > (set (reg:V2DF R) (reg:V2DF R)) > > - which preserves the low 128 bits but clobbers the upper bits. > - For a dead register we just use: > - > - (clobber (reg:V2DF R)) > - > - which invalidates any previous contents of R and stops R from becoming > - live across the vzeroupper in future. */ > + which preserves the low 128 bits but clobbers the upper bits. */ > > static void > ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs) > { > rtx pattern = PATTERN (insn); > unsigned int nregs = TARGET_64BIT ? 
16 : 8; > - rtvec vec = rtvec_alloc (nregs + 1); > - RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); > + unsigned int npats = nregs; > for (unsigned int i = 0; i < nregs; ++i) > { > unsigned int regno = GET_SSE_REGNO (i); > + if (!bitmap_bit_p (live_regs, regno)) > + npats--; > + } > + if (npats == 0) > + return; > + rtvec vec = rtvec_alloc (npats + 1); > + RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); > + for (unsigned int i = 0, j = 0; i < nregs; ++i) > + { > + unsigned int regno = GET_SSE_REGNO (i); > + if (!bitmap_bit_p (live_regs, regno)) > + continue; > rtx reg = gen_rtx_REG (V2DImode, regno); > - if (bitmap_bit_p (live_regs, regno)) > - RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg); > - else > - RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); > + ++j; > + RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg); > } > XVEC (pattern, 0) = vec; > df_insn_rescan (insn); > --- gcc/config/i386/sse.md.jj 2020-02-04 11:40:58.813610563 +0100 > +++ gcc/config/i386/sse.md 2020-02-04 13:21:50.088340875 +0100 > @@ -19815,11 +19815,38 @@ (define_expand "avx_vzeroupper" > [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > "TARGET_AVX") > > -(define_insn "*avx_vzeroupper" > +(define_insn_and_split "*avx_vzeroupper" > [(match_parallel 0 "vzeroupper_pattern" > [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > "TARGET_AVX" > "vzeroupper" > + "&& epilogue_completed > + && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1" > + [(match_dup 0)] > +{ > + /* For IPA-RA purposes, make it clear the instruction clobbers > + even XMM registers not mentioned explicitly in the pattern. */ > + unsigned int nregs = TARGET_64BIT ? 
16 : 8; > + unsigned int npats = XVECLEN (operands[0], 0); > + rtvec vec = rtvec_alloc (nregs + 1); > + RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0); > + for (unsigned int i = 0, j = 1; i < nregs; ++i) > + { > + unsigned int regno = GET_SSE_REGNO (i); > + if (j < npats > + && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno) > + { > + RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j); > + j++; > + } > + else > + { > + rtx reg = gen_rtx_REG (V2DImode, regno); > + RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); > + } > + } > + operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); > +} > [(set_attr "type" "sse") > (set_attr "modrm" "0") > (set_attr "memory" "none") > --- gcc/testsuite/gcc.target/i386/pr92190.c.jj 2020-02-04 11:51:33.608148402 +0100 > +++ gcc/testsuite/gcc.target/i386/pr92190.c 2020-02-04 11:51:33.608148402 +0100 > @@ -0,0 +1,19 @@ > +/* PR target/92190 */ > +/* { dg-do compile { target { *-*-linux* && lp64 } } } */ > +/* { dg-options "-mabi=ms -O2 -mavx512f" } */ > + > +typedef char VC __attribute__((vector_size (16))); > +typedef int VI __attribute__((vector_size (16 * sizeof 0))); > +VC a; > +VI b; > +void bar (VI); > +void baz (VC); > + > +void > +foo (void) > +{ > + VC k = a; > + VI n = b; > + bar (n); > + baz (k); > +} > > > Jakub > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-04 12:39 ` Uros Bizjak @ 2020-02-04 13:13 ` Jakub Jelinek 2020-02-04 13:15 ` Uros Bizjak 0 siblings, 1 reply; 29+ messages in thread From: Jakub Jelinek @ 2020-02-04 13:13 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches On Tue, Feb 04, 2020 at 01:38:51PM +0100, Uros Bizjak wrote: > As Richard advised, let's put this safety stuff back. Usually, in > i386.md, these kind of splitters are implemented as two patterns, one > (define_insn_and_split) having "#", and the other (define_insn) with a > real insn. My opinion is, that this separation avoids confusion as > much as possible. Okay. So like this if it passes bootstrap/regtest then? 2020-02-04 Jakub Jelinek <jakub@redhat.com> PR target/92190 * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only include sets and not clobbers in the vzeroupper pattern. * config/i386/sse.md (*avx_vzeroupper): Require in insn condition that the parallel has 17 (64-bit) or 9 (32-bit) elts. (*avx_vzeroupper_1): New define_insn_and_split. * gcc.target/i386/pr92190.c: New test. --- gcc/config/i386/i386-features.c.jj 2020-02-04 13:33:32.713885386 +0100 +++ gcc/config/i386/i386-features.c 2020-02-04 13:55:44.358058104 +0100 @@ -1764,29 +1764,32 @@ convert_scalars_to_vector (bool timode_p (set (reg:V2DF R) (reg:V2DF R)) - which preserves the low 128 bits but clobbers the upper bits. - For a dead register we just use: - - (clobber (reg:V2DF R)) - - which invalidates any previous contents of R and stops R from becoming - live across the vzeroupper in future. */ + which preserves the low 128 bits but clobbers the upper bits. */ static void ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs) { rtx pattern = PATTERN (insn); unsigned int nregs = TARGET_64BIT ? 
16 : 8; - rtvec vec = rtvec_alloc (nregs + 1); - RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); + unsigned int npats = nregs; for (unsigned int i = 0; i < nregs; ++i) { unsigned int regno = GET_SSE_REGNO (i); + if (!bitmap_bit_p (live_regs, regno)) + npats--; + } + if (npats == 0) + return; + rtvec vec = rtvec_alloc (npats + 1); + RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); + for (unsigned int i = 0, j = 0; i < nregs; ++i) + { + unsigned int regno = GET_SSE_REGNO (i); + if (!bitmap_bit_p (live_regs, regno)) + continue; rtx reg = gen_rtx_REG (V2DImode, regno); - if (bitmap_bit_p (live_regs, regno)) - RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg); - else - RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); + ++j; + RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg); } XVEC (pattern, 0) = vec; df_insn_rescan (insn); --- gcc/config/i386/sse.md.jj 2020-02-04 13:33:32.733885088 +0100 +++ gcc/config/i386/sse.md 2020-02-04 13:57:38.995349722 +0100 @@ -19818,11 +19818,49 @@ (define_expand "avx_vzeroupper" (define_insn "*avx_vzeroupper" [(match_parallel 0 "vzeroupper_pattern" [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] - "TARGET_AVX" + "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 16 : 8) + 1" "vzeroupper" [(set_attr "type" "sse") (set_attr "modrm" "0") (set_attr "memory" "none") + (set_attr "prefix" "vex") + (set_attr "btver2_decode" "vector") + (set_attr "mode" "OI")]) + +(define_insn_and_split "*avx_vzeroupper_1" + [(match_parallel 0 "vzeroupper_pattern" + [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] + "TARGET_AVX && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1" + "#" + "&& epilogue_completed" + [(match_dup 0)] +{ + /* For IPA-RA purposes, make it clear the instruction clobbers + even XMM registers not mentioned explicitly in the pattern. */ + unsigned int nregs = TARGET_64BIT ? 
16 : 8; + unsigned int npats = XVECLEN (operands[0], 0); + rtvec vec = rtvec_alloc (nregs + 1); + RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0); + for (unsigned int i = 0, j = 1; i < nregs; ++i) + { + unsigned int regno = GET_SSE_REGNO (i); + if (j < npats + && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno) + { + RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j); + j++; + } + else + { + rtx reg = gen_rtx_REG (V2DImode, regno); + RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); + } + } + operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); +} + [(set_attr "type" "sse") + (set_attr "modrm" "0") + (set_attr "memory" "none") (set_attr "prefix" "vex") (set_attr "btver2_decode" "vector") (set_attr "mode" "OI")]) --- gcc/testsuite/gcc.target/i386/pr92190.c.jj 2020-02-04 13:55:44.364058015 +0100 +++ gcc/testsuite/gcc.target/i386/pr92190.c 2020-02-04 13:55:44.364058015 +0100 @@ -0,0 +1,19 @@ +/* PR target/92190 */ +/* { dg-do compile { target { *-*-linux* && lp64 } } } */ +/* { dg-options "-mabi=ms -O2 -mavx512f" } */ + +typedef char VC __attribute__((vector_size (16))); +typedef int VI __attribute__((vector_size (16 * sizeof 0))); +VC a; +VI b; +void bar (VI); +void baz (VC); + +void +foo (void) +{ + VC k = a; + VI n = b; + bar (n); + baz (k); +} Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-04 13:13 ` Jakub Jelinek @ 2020-02-04 13:15 ` Uros Bizjak 2020-02-05 10:05 ` Jakub Jelinek 0 siblings, 1 reply; 29+ messages in thread From: Uros Bizjak @ 2020-02-04 13:15 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Richard Sandiford, gcc-patches On Tue, Feb 4, 2020 at 2:13 PM Jakub Jelinek <jakub@redhat.com> wrote: > > On Tue, Feb 04, 2020 at 01:38:51PM +0100, Uros Bizjak wrote: > > As Richard advised, let's put this safety stuff back. Usually, in > > i386.md, these kind of splitters are implemented as two patterns, one > > (define_insn_and_split) having "#", and the other (define_insn) with a > > real insn. My opinion is, that this separation avoids confusion as > > much as possible. > > Okay. So like this if it passes bootstrap/regtest then? Yes. > 2020-02-04 Jakub Jelinek <jakub@redhat.com> > > PR target/92190 > * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only > include sets and not clobbers in the vzeroupper pattern. > * config/i386/sse.md (*avx_vzeroupper): Require in insn condition that > the parallel has 17 (64-bit) or 9 (32-bit) elts. > (*avx_vzeroupper_1): New define_insn_and_split. > > * gcc.target/i386/pr92190.c: New test. OK. Thanks, Uros. > --- gcc/config/i386/i386-features.c.jj 2020-02-04 13:33:32.713885386 +0100 > +++ gcc/config/i386/i386-features.c 2020-02-04 13:55:44.358058104 +0100 > @@ -1764,29 +1764,32 @@ convert_scalars_to_vector (bool timode_p > > (set (reg:V2DF R) (reg:V2DF R)) > > - which preserves the low 128 bits but clobbers the upper bits. > - For a dead register we just use: > - > - (clobber (reg:V2DF R)) > - > - which invalidates any previous contents of R and stops R from becoming > - live across the vzeroupper in future. */ > + which preserves the low 128 bits but clobbers the upper bits. 
*/ > > static void > ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs) > { > rtx pattern = PATTERN (insn); > unsigned int nregs = TARGET_64BIT ? 16 : 8; > - rtvec vec = rtvec_alloc (nregs + 1); > - RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); > + unsigned int npats = nregs; > for (unsigned int i = 0; i < nregs; ++i) > { > unsigned int regno = GET_SSE_REGNO (i); > + if (!bitmap_bit_p (live_regs, regno)) > + npats--; > + } > + if (npats == 0) > + return; > + rtvec vec = rtvec_alloc (npats + 1); > + RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); > + for (unsigned int i = 0, j = 0; i < nregs; ++i) > + { > + unsigned int regno = GET_SSE_REGNO (i); > + if (!bitmap_bit_p (live_regs, regno)) > + continue; > rtx reg = gen_rtx_REG (V2DImode, regno); > - if (bitmap_bit_p (live_regs, regno)) > - RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg); > - else > - RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); > + ++j; > + RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg); > } > XVEC (pattern, 0) = vec; > df_insn_rescan (insn); > --- gcc/config/i386/sse.md.jj 2020-02-04 13:33:32.733885088 +0100 > +++ gcc/config/i386/sse.md 2020-02-04 13:57:38.995349722 +0100 > @@ -19818,11 +19818,49 @@ (define_expand "avx_vzeroupper" > (define_insn "*avx_vzeroupper" > [(match_parallel 0 "vzeroupper_pattern" > [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > - "TARGET_AVX" > + "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 16 : 8) + 1" > "vzeroupper" > [(set_attr "type" "sse") > (set_attr "modrm" "0") > (set_attr "memory" "none") > + (set_attr "prefix" "vex") > + (set_attr "btver2_decode" "vector") > + (set_attr "mode" "OI")]) > + > +(define_insn_and_split "*avx_vzeroupper_1" > + [(match_parallel 0 "vzeroupper_pattern" > + [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > + "TARGET_AVX && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 
16 : 8) + 1" > + "#" > + "&& epilogue_completed" > + [(match_dup 0)] > +{ > + /* For IPA-RA purposes, make it clear the instruction clobbers > + even XMM registers not mentioned explicitly in the pattern. */ > + unsigned int nregs = TARGET_64BIT ? 16 : 8; > + unsigned int npats = XVECLEN (operands[0], 0); > + rtvec vec = rtvec_alloc (nregs + 1); > + RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0); > + for (unsigned int i = 0, j = 1; i < nregs; ++i) > + { > + unsigned int regno = GET_SSE_REGNO (i); > + if (j < npats > + && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno) > + { > + RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j); > + j++; > + } > + else > + { > + rtx reg = gen_rtx_REG (V2DImode, regno); > + RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); > + } > + } > + operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); > +} > + [(set_attr "type" "sse") > + (set_attr "modrm" "0") > + (set_attr "memory" "none") > (set_attr "prefix" "vex") > (set_attr "btver2_decode" "vector") > (set_attr "mode" "OI")]) > --- gcc/testsuite/gcc.target/i386/pr92190.c.jj 2020-02-04 13:55:44.364058015 +0100 > +++ gcc/testsuite/gcc.target/i386/pr92190.c 2020-02-04 13:55:44.364058015 +0100 > @@ -0,0 +1,19 @@ > +/* PR target/92190 */ > +/* { dg-do compile { target { *-*-linux* && lp64 } } } */ > +/* { dg-options "-mabi=ms -O2 -mavx512f" } */ > + > +typedef char VC __attribute__((vector_size (16))); > +typedef int VI __attribute__((vector_size (16 * sizeof 0))); > +VC a; > +VI b; > +void bar (VI); > +void baz (VC); > + > +void > +foo (void) > +{ > + VC k = a; > + VI n = b; > + bar (n); > + baz (k); > +} > > > Jakub > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-04 13:15 ` Uros Bizjak @ 2020-02-05 10:05 ` Jakub Jelinek 2020-02-05 10:12 ` Jakub Jelinek 2020-02-05 10:47 ` Uros Bizjak 0 siblings, 2 replies; 29+ messages in thread From: Jakub Jelinek @ 2020-02-05 10:05 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches [-- Attachment #1: Type: text/plain, Size: 4493 bytes --] On Tue, Feb 04, 2020 at 02:15:04PM +0100, Uros Bizjak wrote: > On Tue, Feb 4, 2020 at 2:13 PM Jakub Jelinek <jakub@redhat.com> wrote: > > > > On Tue, Feb 04, 2020 at 01:38:51PM +0100, Uros Bizjak wrote: > > > As Richard advised, let's put this safety stuff back. Usually, in > > > i386.md, these kind of splitters are implemented as two patterns, one > > > (define_insn_and_split) having "#", and the other (define_insn) with a > > > real insn. My opinion is, that this separation avoids confusion as > > > much as possible. > > > > Okay. So like this if it passes bootstrap/regtest then? > > Yes. > > > 2020-02-04 Jakub Jelinek <jakub@redhat.com> > > > > PR target/92190 > > * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only > > include sets and not clobbers in the vzeroupper pattern. > > * config/i386/sse.md (*avx_vzeroupper): Require in insn condition that > > the parallel has 17 (64-bit) or 9 (32-bit) elts. > > (*avx_vzeroupper_1): New define_insn_and_split. > > > > * gcc.target/i386/pr92190.c: New test. > > OK. 
Unfortunately, it breaks +FAIL: gcc.target/i386/avx-2.c (internal compiler error) +FAIL: gcc.target/i386/avx-2.c (test for excess errors) +FAIL: gcc.target/i386/avx-vzeroupper-10.c (internal compiler error) +FAIL: gcc.target/i386/avx-vzeroupper-10.c (test for excess errors) +UNRESOLVED: gcc.target/i386/avx-vzeroupper-10.c scan-assembler-times avx_vzeroupper 3 +FAIL: gcc.target/i386/avx-vzeroupper-11.c (internal compiler error) +FAIL: gcc.target/i386/avx-vzeroupper-11.c (test for excess errors) +UNRESOLVED: gcc.target/i386/avx-vzeroupper-11.c scan-assembler-times \\\\*avx_vzeroall 1 +UNRESOLVED: gcc.target/i386/avx-vzeroupper-11.c scan-assembler-times avx_vzeroupper 3 +FAIL: gcc.target/i386/avx-vzeroupper-12.c (internal compiler error) +FAIL: gcc.target/i386/avx-vzeroupper-12.c (test for excess errors) +UNRESOLVED: gcc.target/i386/avx-vzeroupper-12.c scan-assembler-times \\\\*avx_vzeroall 1 +UNRESOLVED: gcc.target/i386/avx-vzeroupper-12.c scan-assembler-times avx_vzeroupper 4 +FAIL: gcc.target/i386/avx-vzeroupper-5.c (internal compiler error) +FAIL: gcc.target/i386/avx-vzeroupper-5.c (test for excess errors) +UNRESOLVED: gcc.target/i386/avx-vzeroupper-5.c scan-assembler-times avx_vzeroupper 1 +FAIL: gcc.target/i386/avx-vzeroupper-7.c (internal compiler error) +FAIL: gcc.target/i386/avx-vzeroupper-7.c (test for excess errors) +UNRESOLVED: gcc.target/i386/avx-vzeroupper-7.c scan-assembler-times avx_vzeroupper 1 +FAIL: gcc.target/i386/avx-vzeroupper-8.c (internal compiler error) +FAIL: gcc.target/i386/avx-vzeroupper-8.c (test for excess errors) +UNRESOLVED: gcc.target/i386/avx-vzeroupper-8.c scan-assembler-times avx_vzeroupper 1 +FAIL: gcc.target/i386/avx-vzeroupper-9.c (internal compiler error) +FAIL: gcc.target/i386/avx-vzeroupper-9.c (test for excess errors) +UNRESOLVED: gcc.target/i386/avx-vzeroupper-9.c scan-assembler-times avx_vzeroupper 4 +FAIL: gcc.target/i386/sse-14.c (internal compiler error) +FAIL: gcc.target/i386/sse-14.c (test for excess errors) +FAIL: 
gcc.target/i386/sse-22.c (internal compiler error) +FAIL: gcc.target/i386/sse-22.c (test for excess errors) +FAIL: gcc.target/i386/sse-22a.c (internal compiler error) +FAIL: gcc.target/i386/sse-22a.c (test for excess errors) The problem is that x86 is the only target that defines HAVE_ATTR_length and doesn't schedule any splitting pass at -O0 after pro_and_epilogue. So, either we go back to handling this during vzeroupper output (unconditionally, rather than flag_ipa_ra guarded), or we need to tweak the split* passes for x86. Seems there are 5 split passes, split1 is run unconditionally before reload, split2 is run for optimize > 0 or STACK_REGS (x86) after ra but before epilogue_completed, split3 is run before regstack for STACK_REGS and optimize and -fno-schedule-insns2, split4 is run before sched2 if sched2 is run and split5 is run before shorten_branches if HAVE_ATTR_length and not STACK_REGS. Attached are 3 possible incremental patches for recog.c, all of them fix all the above regressions, but haven't fully bootstrapped/regtested any of them yet. My preference would be the last one, which for -O0 and x86 disables split2 and enables split3, as it doesn't add any extra passes. The first one just enables split3 for -O0 on x86, the second one enables split5 for -O0 on x86. Jakub [-- Attachment #2: 12 --] [-- Type: text/plain, Size: 690 bytes --] 2020-02-05 Jakub Jelinek <jakub@redhat.com> PR target/92190 * recog.c (pass_split_before_regstack::gate): For STACK_REGS targets, run even when !optimize. --- gcc/recog.c.jj 2020-01-12 11:54:36.918405790 +0100 +++ gcc/recog.c 2020-02-05 10:13:20.236989567 +0100 @@ -3991,12 +3991,12 @@ pass_split_before_regstack::gate (functi split until final which doesn't allow splitting if HAVE_ATTR_length. 
*/ # ifdef INSN_SCHEDULING - return (optimize && !flag_schedule_insns_after_reload); + return !optimize || !flag_schedule_insns_after_reload; # else - return (optimize); + return true; # endif #else - return 0; + return false; #endif } [-- Attachment #3: 13 --] [-- Type: text/plain, Size: 653 bytes --] 2020-02-05 Jakub Jelinek <jakub@redhat.com> PR target/92190 * recog.c (pass_split_for_shorten_branches::gate): For STACK_REGS targets, return !optimize. --- gcc/recog.c.jj 2020-01-12 11:54:36.918405790 +0100 +++ gcc/recog.c 2020-02-05 10:24:34.506773724 +0100 @@ -4091,8 +4091,12 @@ public: { /* The placement of the splitting that we do for shorten_branches depends on whether regstack is used by the target or not. */ -#if HAVE_ATTR_length && !defined (STACK_REGS) +#if HAVE_ATTR_length +# ifndef STACK_REGS return true; +# else + return !optimize; +# endif #else return false; #endif [-- Attachment #4: 14 --] [-- Type: text/plain, Size: 1128 bytes --] 2020-02-05 Jakub Jelinek <jakub@redhat.com> PR target/92190 * recog.c (pass_split_after_reload::gate): For STACK_REGS targets, don't run when !optimize. (pass_split_before_regstack::gate): For STACK_REGS targets, run even when !optimize. --- gcc/recog.c.jj 2020-01-12 11:54:36.918405790 +0100 +++ gcc/recog.c 2020-02-05 10:45:19.125938154 +0100 @@ -3924,14 +3924,7 @@ public: virtual bool gate (function *) { /* If optimizing, then go ahead and split insns now. */ - if (optimize > 0) - return true; - -#ifdef STACK_REGS - return true; -#else - return false; -#endif + return optimize > 0; } virtual unsigned int execute (function *) @@ -3991,12 +3984,12 @@ pass_split_before_regstack::gate (functi split until final which doesn't allow splitting if HAVE_ATTR_length. 
*/ # ifdef INSN_SCHEDULING - return (optimize && !flag_schedule_insns_after_reload); + return !optimize || !flag_schedule_insns_after_reload; # else - return (optimize); + return true; # endif #else - return 0; + return false; #endif } ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-05 10:05 ` Jakub Jelinek @ 2020-02-05 10:12 ` Jakub Jelinek 2020-02-05 10:47 ` Uros Bizjak 1 sibling, 0 replies; 29+ messages in thread From: Jakub Jelinek @ 2020-02-05 10:12 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches On Wed, Feb 05, 2020 at 10:58:04AM +0100, Jakub Jelinek wrote: > Attached are 3 possible incremental patches for recog.c, all of them fix > all the above regressions, but haven't fully bootstrapped/regtested any of > them yet. My preference would be the last one, which for -O0 and x86 > disables split2 and enables split3, as it doesn't add any extra passes. > The first one just enables split3 for -O0 on x86, the second one enables > split5 for -O0 on x86. The -da -O0 changes of the last patch are: test.c.275r.split1 test.c.277r.dfinit test.c.278r.mode_sw test.c.279r.asmcons test.c.284r.ira test.c.285r.reload -test.c.289r.split2 test.c.292r.pro_and_epilogue test.c.295r.jump2 +test.c.306r.split3 test.c.307r.stack test.c.308r.alignments test.c.310r.mach test.c.311r.barriers test.c.316r.shorten test.c.317r.nothrow test.c.318r.dwarf2 test.c.319r.final test.c.320r.dfinish while the first one just adds test.c.306r.split3 and doesn't remove test.c.289r.split2 and the second one just adds test.c.313r.split5 before test.c.316r.shorten. Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-05 10:05 ` Jakub Jelinek 2020-02-05 10:12 ` Jakub Jelinek @ 2020-02-05 10:47 ` Uros Bizjak 2020-02-05 11:03 ` Jakub Jelinek 1 sibling, 1 reply; 29+ messages in thread From: Uros Bizjak @ 2020-02-05 10:47 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Richard Sandiford, gcc-patches On Wed, Feb 5, 2020 at 11:05 AM Jakub Jelinek <jakub@redhat.com> wrote: > > On Tue, Feb 04, 2020 at 02:15:04PM +0100, Uros Bizjak wrote: > > On Tue, Feb 4, 2020 at 2:13 PM Jakub Jelinek <jakub@redhat.com> wrote: > > > > > > On Tue, Feb 04, 2020 at 01:38:51PM +0100, Uros Bizjak wrote: > > > > As Richard advised, let's put this safety stuff back. Usually, in > > > > i386.md, these kind of splitters are implemented as two patterns, one > > > > (define_insn_and_split) having "#", and the other (define_insn) with a > > > > real insn. My opinion is, that this separation avoids confusion as > > > > much as possible. > > > > > > Okay. So like this if it passes bootstrap/regtest then? > > > > Yes. > > > > > 2020-02-04 Jakub Jelinek <jakub@redhat.com> > > > > > > PR target/92190 > > > * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only > > > include sets and not clobbers in the vzeroupper pattern. > > > * config/i386/sse.md (*avx_vzeroupper): Require in insn condition that > > > the parallel has 17 (64-bit) or 9 (32-bit) elts. > > > (*avx_vzeroupper_1): New define_insn_and_split. > > > > > > * gcc.target/i386/pr92190.c: New test. > > > > OK. 
> > Unfortunately, it breaks > +FAIL: gcc.target/i386/avx-2.c (internal compiler error) > +FAIL: gcc.target/i386/avx-2.c (test for excess errors) > +FAIL: gcc.target/i386/avx-vzeroupper-10.c (internal compiler error) > +FAIL: gcc.target/i386/avx-vzeroupper-10.c (test for excess errors) > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-10.c scan-assembler-times avx_vzeroupper 3 > +FAIL: gcc.target/i386/avx-vzeroupper-11.c (internal compiler error) > +FAIL: gcc.target/i386/avx-vzeroupper-11.c (test for excess errors) > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-11.c scan-assembler-times \\\\*avx_vzeroall 1 > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-11.c scan-assembler-times avx_vzeroupper 3 > +FAIL: gcc.target/i386/avx-vzeroupper-12.c (internal compiler error) > +FAIL: gcc.target/i386/avx-vzeroupper-12.c (test for excess errors) > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-12.c scan-assembler-times \\\\*avx_vzeroall 1 > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-12.c scan-assembler-times avx_vzeroupper 4 > +FAIL: gcc.target/i386/avx-vzeroupper-5.c (internal compiler error) > +FAIL: gcc.target/i386/avx-vzeroupper-5.c (test for excess errors) > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-5.c scan-assembler-times avx_vzeroupper 1 > +FAIL: gcc.target/i386/avx-vzeroupper-7.c (internal compiler error) > +FAIL: gcc.target/i386/avx-vzeroupper-7.c (test for excess errors) > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-7.c scan-assembler-times avx_vzeroupper 1 > +FAIL: gcc.target/i386/avx-vzeroupper-8.c (internal compiler error) > +FAIL: gcc.target/i386/avx-vzeroupper-8.c (test for excess errors) > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-8.c scan-assembler-times avx_vzeroupper 1 > +FAIL: gcc.target/i386/avx-vzeroupper-9.c (internal compiler error) > +FAIL: gcc.target/i386/avx-vzeroupper-9.c (test for excess errors) > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-9.c scan-assembler-times avx_vzeroupper 4 > +FAIL: gcc.target/i386/sse-14.c (internal compiler error) > +FAIL: 
gcc.target/i386/sse-14.c (test for excess errors) > +FAIL: gcc.target/i386/sse-22.c (internal compiler error) > +FAIL: gcc.target/i386/sse-22.c (test for excess errors) > +FAIL: gcc.target/i386/sse-22a.c (internal compiler error) > +FAIL: gcc.target/i386/sse-22a.c (test for excess errors) > > The problem is that x86 is the only target that defines HAVE_ATTR_length and > doesn't schedule any splitting pass at -O0 after pro_and_epilogue. > > So, either we go back to handling this during vzeroupper output > (unconditionally, rather than flag_ipa_ra guarded), or we need to tweak the > split* passes for x86. > > Seems there are 5 split passes, split1 is run unconditionally before reload, > split2 is run for optimize > 0 or STACK_REGS (x86) after ra but before > epilogue_completed, split3 is run before regstack for STACK_REGS and > optimize and -fno-schedule-insns2, split4 is run before sched2 if sched2 is > run and split5 is run before shorten_branches if HAVE_ATTR_length and not > STACK_REGS. > > Attached are 3 possible incremental patches for recog.c, all of them fix > all the above regressions, but haven't fully bootstrapped/regtested any of > them yet. My preference would be the last one, which for -O0 and x86 > disables split2 and enables split3, as it doesn't add any extra passes. > The first one just enables split3 for -O0 on x86, the second one enables > split5 for -O0 on x86. Please note that in i386.md we expect that the check for "epilogue_completed" means split4 point. There are some places with: "TARGET_64BIT && ((optimize > 0 && flag_peephole2) ? epilogue_completed : reload_completed) so for flag_peephole2, we split after peephole2 pass was performed. Apparently, we already hit this problem in the past, so the check for "optimize > 0" was added to solve -O0 -fpeephole2 combo. Another one is with && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed where we assume that allocated registers won't change anymore when breaking SSE partial register stall. 
I think we should just enable split4 also for -O0. This would also allow us to remove the "optimize > 0" check above and allow us to generate a bit more optimal code even with -O0 for TARGET_SSE_PARTIAL_REG_DEPENDENCY and TARGET_AVOID_FALSE_DEP_FOR_BMI. Uros. > > Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-05 10:47 ` Uros Bizjak @ 2020-02-05 11:03 ` Jakub Jelinek 2020-02-05 11:11 ` Uros Bizjak 0 siblings, 1 reply; 29+ messages in thread From: Jakub Jelinek @ 2020-02-05 11:03 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches On Wed, Feb 05, 2020 at 11:46:51AM +0100, Uros Bizjak wrote: > I think we should just enable split4 also for -O0. This would also > allow us to remove the "optimize > 0" check above and allow us to > generate a bit more optimal code even with -O0 for > TARGET_SSE_PARTIAL_REG_DEPENDENCY and TARGET_AVOID_FALSE_DEP_FOR_BMI. The -O0 passes are: test.c.284r.ira test.c.285r.reload # placement of split2 test.c.292r.pro_and_epilogue test.c.295r.jump2 # placement of split4 # placement of split3 test.c.307r.stack test.c.308r.alignments test.c.310r.mach test.c.311r.barriers # placement of split5 test.c.316r.shorten test.c.317r.nothrow test.c.318r.dwarf2 test.c.319r.final It really doesn't matter if for -O0 STACK_REGS we enable split4 or split3, it is the same location, just split3 is better named (split before reg-stack, which is run even at -O0, rather than split before sched2, which is not run at -O0). I think the primary question is, do we for -O0 need to enable split2 (as in, will anything in pro_and_epilogue or jump2 passes benefit from the splitting and worth running two -O0 post-RA split passes), then we could go with the first patch I've posted, or nothing benefits from it in the two passes and it is ok to split only before stack, in that case we can go with the third patch. Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-05 11:03 ` Jakub Jelinek @ 2020-02-05 11:11 ` Uros Bizjak 2020-02-05 11:24 ` Jakub Jelinek 0 siblings, 1 reply; 29+ messages in thread From: Uros Bizjak @ 2020-02-05 11:11 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Richard Sandiford, gcc-patches On Wed, Feb 5, 2020 at 12:03 PM Jakub Jelinek <jakub@redhat.com> wrote: > > On Wed, Feb 05, 2020 at 11:46:51AM +0100, Uros Bizjak wrote: > > I think we should just enable split4 also for -O0. This would also > > allow us to remove the "optimize > 0" check above and allow us to > > generate a bit more optimal code even with -O0 for > > TARGET_SSE_PARTIAL_REG_DEPENDENCY and TARGET_AVOID_FALSE_DEP_FOR_BMI. > > The -O0 passes are: > test.c.284r.ira > test.c.285r.reload > # placement of split2 > test.c.292r.pro_and_epilogue > test.c.295r.jump2 > # placement of split4 > # placement of split3 > test.c.307r.stack > test.c.308r.alignments > test.c.310r.mach > test.c.311r.barriers > # placement of split5 > test.c.316r.shorten > test.c.317r.nothrow > test.c.318r.dwarf2 > test.c.319r.final > > It really doesn't matter if for -O0 STACK_REGS we enable split4 or > split3, it is the same location, just split3 is better named (split before > reg-stack, which is run even at -O0, rather than split before sched2, which > is not run at -O0). > > I think the primary question is, do we for -O0 need to enable split2 (as in, > will anything in pro_and_epilogue or jump2 passes benefit from the > splitting and worth running two -O0 post-RA split passes), then we could go > with the first patch I've posted, or nothing benefits from it in the two > passes and it is ok to split only before stack, in that case we can go with > the third patch. Oh, I see. No passes get scheduled between split3 and split4. Looks to me that your third patch is then the way to go. Uros. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-05 11:11 ` Uros Bizjak @ 2020-02-05 11:24 ` Jakub Jelinek 0 siblings, 0 replies; 29+ messages in thread From: Jakub Jelinek @ 2020-02-05 11:24 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches On Wed, Feb 05, 2020 at 12:11:15PM +0100, Uros Bizjak wrote: > > I think the primary question is, do we for -O0 need to enable split2 (as in, > > will anything in pro_and_epilogue or jump2 passes benefit from the > > splitting and worth running two -O0 post-RA split passes), then we could go > > with the first patch I've posted, or nothing benefits from it in the two > > passes and it is ok to split only before stack, in that case we can go with > > the third patch. > > Oh, I see. No passes get scheduled between split3 and split4. Looks to > me that your third patch is then the way to go. Ok, will bootstrap/regtest it soon. Thanks. Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI 2020-02-04 10:16 ` Uros Bizjak 2020-02-04 11:05 ` Jakub Jelinek @ 2020-02-04 11:42 ` Jakub Jelinek 2020-02-06 1:00 ` JonY 1 sibling, 1 reply; 29+ messages in thread From: Jakub Jelinek @ 2020-02-04 11:42 UTC (permalink / raw) To: Jonathan Yong; +Cc: Uros Bizjak, gcc-patches Hi! On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote: > I guess that Comment #9 patch form the PR should be trivially correct, > but althouhg it looks obvious, I don't want to propose the patch since > I have no means of testing it. I don't have means of testing it either. https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019 is quite explicit that [xyz]mm16-31 are call clobbered and only xmm6-15 (low 128-bits only) are call preserved. Jonathan, could you please test this if it is sufficient to just change CALL_USED_REGISTERS or if e.g. something in the pro/epilogue needs tweaking too? Thanks. We are talking e.g. about /* { dg-options "-O2 -mabi=ms -mavx512vl" } */ typedef double V __attribute__((vector_size (16))); void foo (void); V bar (void); void baz (V); void qux (void) { V c; { register V a __asm ("xmm18"); V b = bar (); asm ("" : "=x" (a) : "0" (b)); c = a; } foo (); { register V d __asm ("xmm18"); V e; d = c; asm ("" : "=x" (e) : "0" (d)); baz (e); } } where according to the MSDN doc gcc incorrectly holds the c value in xmm18 register across the foo call; if foo is compiled by some Microsoft compiler (or LLVM), then it could clobber %xmm18. If all xmm18 occurrences are changed to say xmm15, then it is valid to hold the 128-bit value across the foo call (though, surprisingly, LLVM saves it into stack anyway). 2020-02-04 Uroš Bizjak <ubizjak@gmail.com> * config/i386/config/i386/i386.h (CALL_USED_REGISTERS): Make xmm16-xmm31 call-used even in 64-bit ms-abi. 
--- gcc/config/i386/config/i386/i386.h.jj 2020-01-22 10:19:24.199221986 +0100 +++ gcc/config/i386/config/i386/i386.h 2020-02-04 12:09:12.338341003 +0100 @@ -1128,9 +1128,9 @@ extern const char *host_detect_local_cpu /*xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15*/ \ 6, 6, 6, 6, 6, 6, 6, 6, \ /*xmm16,xmm17,xmm18,xmm19,xmm20,xmm21,xmm22,xmm23*/ \ - 6, 6, 6, 6, 6, 6, 6, 6, \ + 1, 1, 1, 1, 1, 1, 1, 1, \ /*xmm24,xmm25,xmm26,xmm27,xmm28,xmm29,xmm30,xmm31*/ \ - 6, 6, 6, 6, 6, 6, 6, 6, \ + 1, 1, 1, 1, 1, 1, 1, 1, \ /* k0, k1, k2, k3, k4, k5, k6, k7*/ \ 1, 1, 1, 1, 1, 1, 1, 1 } Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI 2020-02-04 11:42 ` [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI Jakub Jelinek @ 2020-02-06 1:00 ` JonY 2020-02-06 6:07 ` Jakub Jelinek 0 siblings, 1 reply; 29+ messages in thread From: JonY @ 2020-02-06 1:00 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Uros Bizjak, gcc-patches [-- Attachment #1.1: Type: text/plain, Size: 800 bytes --] On 2/4/20 11:42 AM, Jakub Jelinek wrote: > Hi! > > On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote: >> I guess that Comment #9 patch form the PR should be trivially correct, >> but althouhg it looks obvious, I don't want to propose the patch since >> I have no means of testing it. > > I don't have means of testing it either. > https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019 > is quite explicit that [xyz]mm16-31 are call clobbered and only xmm6-15 (low > 128-bits only) are call preserved. > > Jonathan, could you please test this if it is sufficient to just change > CALL_USED_REGISTERS or if e.g. something in the pro/epilogue needs tweaking > too? Thanks. Is this patch testing still required? I just got back from traveling. [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI 2020-02-06 1:00 ` JonY @ 2020-02-06 6:07 ` Jakub Jelinek 2020-02-07 10:57 ` JonY 0 siblings, 1 reply; 29+ messages in thread From: Jakub Jelinek @ 2020-02-06 6:07 UTC (permalink / raw) To: JonY; +Cc: Uros Bizjak, gcc-patches On Thu, Feb 06, 2020 at 01:00:36AM +0000, JonY wrote: > On 2/4/20 11:42 AM, Jakub Jelinek wrote: > > Hi! > > > > On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote: > >> I guess that Comment #9 patch form the PR should be trivially correct, > >> but althouhg it looks obvious, I don't want to propose the patch since > >> I have no means of testing it. > > > > I don't have means of testing it either. > > https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019 > > is quite explicit that [xyz]mm16-31 are call clobbered and only xmm6-15 (low > > 128-bits only) are call preserved. > > > > Jonathan, could you please test this if it is sufficient to just change > > CALL_USED_REGISTERS or if e.g. something in the pro/epilogue needs tweaking > > too? Thanks. > > Is this patch testing still required? I just got back from traveling. Yes, our reading of the MS ABI docs show that xmm16-31 are to be call used (not preserved over calls), while in gcc they are currently handled as preserved across the calls. Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI 2020-02-06 6:07 ` Jakub Jelinek @ 2020-02-07 10:57 ` JonY 2020-02-07 11:28 ` Jakub Jelinek 0 siblings, 1 reply; 29+ messages in thread From: JonY @ 2020-02-07 10:57 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Uros Bizjak, gcc-patches [-- Attachment #1.1: Type: text/plain, Size: 2421 bytes --] On 2/6/20 6:07 AM, Jakub Jelinek wrote: > On Thu, Feb 06, 2020 at 01:00:36AM +0000, JonY wrote: >> On 2/4/20 11:42 AM, Jakub Jelinek wrote: >>> Hi! >>> >>> On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote: >>>> I guess that Comment #9 patch form the PR should be trivially correct, >>>> but althouhg it looks obvious, I don't want to propose the patch since >>>> I have no means of testing it. >>> >>> I don't have means of testing it either. >>> https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019 >>> is quite explicit that [xyz]mm16-31 are call clobbered and only xmm6-15 (low >>> 128-bits only) are call preserved. >>> >>> Jonathan, could you please test this if it is sufficient to just change >>> CALL_USED_REGISTERS or if e.g. something in the pro/epilogue needs tweaking >>> too? Thanks. >> >> Is this patch testing still required? I just got back from traveling. > > Yes, our reading of the MS ABI docs show that xmm16-31 are to be call used > (not preserved over calls), while in gcc they are currently handled as > preserved across the calls. 
> > Jakub > --- original.s 2020-02-06 09:00:02.014638069 +0000 +++ new.s 2020-02-07 10:28:55.678317667 +0000 @@ -7,23 +7,23 @@ qux: subq $72, %rsp .seh_stackalloc 72 - vmovaps %xmm18, 48(%rsp) - .seh_savexmm %xmm18, 48 + vmovaps %xmm6, 48(%rsp) + .seh_savexmm %xmm6, 48 .seh_endprologue call bar vmovapd %xmm0, %xmm1 - vmovapd %xmm1, %xmm18 + vmovapd %xmm1, %xmm6 call foo leaq 32(%rsp), %rcx - vmovapd %xmm18, %xmm0 - vmovaps %xmm0, 32(%rsp) + vmovapd %xmm6, %xmm0 + vmovapd %xmm0, 32(%rsp) call baz nop - vmovaps 48(%rsp), %xmm18 + vmovaps 48(%rsp), %xmm6 addq $72, %rsp ret .seh_endproc - .ident "GCC: (GNU) 10.0.0 20191024 (experimental)" + .ident "GCC: (GNU) 10.0.1 20200206 (experimental)" .def bar; .scl 2; .type 32; .endef .def foo; .scl 2; .type 32; .endef .def baz; .scl 2; .type 32; .endef GCC with the patch now seems to put the variables in xmm6, unfortunately I don't know enough of AVX or stack setups to know if that's all that is needed. [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI 2020-02-07 10:57 ` JonY @ 2020-02-07 11:28 ` Jakub Jelinek 2020-02-08 8:24 ` JonY 0 siblings, 1 reply; 29+ messages in thread From: Jakub Jelinek @ 2020-02-07 11:28 UTC (permalink / raw) To: JonY; +Cc: Uros Bizjak, gcc-patches On Fri, Feb 07, 2020 at 10:57:22AM +0000, JonY wrote: > >> Is this patch testing still required? I just got back from traveling. > > > > Yes, our reading of the MS ABI docs show that xmm16-31 are to be call used > > (not preserved over calls), while in gcc they are currently handled as > > preserved across the calls. The other parts are I guess mainly about SEH. Consider e.g. void foo (void) { register double x __asm ("xmm14"); register double y __asm ("xmm18"); asm ("" : "=x" (x)); asm ("" : "=v" (y)); x += y; y += x; asm ("" : : "x" (x)); asm ("" : : "v" (y)); } looking at cross-compiler output, with -O2 -mavx512f this emits .file "abcdeq.c" .text .align 16 .globl foo .def foo; .scl 2; .type 32; .endef .seh_proc foo foo: subq $40, %rsp .seh_stackalloc 40 vmovaps %xmm14, (%rsp) .seh_savexmm %xmm14, 0 vmovaps %xmm18, 16(%rsp) .seh_savexmm %xmm18, 16 .seh_endprologue vaddsd %xmm18, %xmm14, %xmm14 vaddsd %xmm18, %xmm14, %xmm18 vmovaps (%rsp), %xmm14 vmovaps 16(%rsp), %xmm18 addq $40, %rsp ret .seh_endproc .ident "GCC: (GNU) 10.0.1 20200207 (experimental)" Does whatever assembler mingw64 uses even assemble this (I mean the .seh_savexmm %xmm16, 16 could be problematic)? I can find e.g. https://stackoverflow.com/questions/43152633/invalid-register-for-seh-savexmm-in-cygwin/43210527 which then links to https://gcc.gnu.org/PR65782 So, I'd say we want to add PR target/65782 to the ChangeLog entry in the patch, and likely add a testcase like the above, so like below? Have you tested the earlier version of the patch on mingw64 or cygwin? If yes, can you just test the testcase from the following patch, without and with the i386.h part and verify it FAILs without and PASSes with it? 
Just make check-gcc RUNTESTFLAGS=i386.exp=pr65782.c should do? If not, can you please test the whole patch? 2020-02-07 Uroš Bizjak <ubizjak@gmail.com> Jakub Jelinek <jakub@redhat.com> PR target/65782 * config/i386/config/i386/i386.h (CALL_USED_REGISTERS): Make xmm16-xmm31 call-used even in 64-bit ms-abi. * gcc.target/i386/pr65782.c: New test. --- gcc/config/i386/config/i386/i386.h.jj 2020-01-22 10:19:24.199221986 +0100 +++ gcc/config/i386/config/i386/i386.h 2020-02-04 12:09:12.338341003 +0100 @@ -1128,9 +1128,9 @@ extern const char *host_detect_local_cpu /*xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15*/ \ 6, 6, 6, 6, 6, 6, 6, 6, \ /*xmm16,xmm17,xmm18,xmm19,xmm20,xmm21,xmm22,xmm23*/ \ - 6, 6, 6, 6, 6, 6, 6, 6, \ + 1, 1, 1, 1, 1, 1, 1, 1, \ /*xmm24,xmm25,xmm26,xmm27,xmm28,xmm29,xmm30,xmm31*/ \ - 6, 6, 6, 6, 6, 6, 6, 6, \ + 1, 1, 1, 1, 1, 1, 1, 1, \ /* k0, k1, k2, k3, k4, k5, k6, k7*/ \ 1, 1, 1, 1, 1, 1, 1, 1 } --- gcc/testsuite/gcc.target/i386/pr65782.c.jj 2020-02-07 12:21:09.472819018 +0100 +++ gcc/testsuite/gcc.target/i386/pr65782.c 2020-02-07 12:24:06.820154495 +0100 @@ -0,0 +1,16 @@ +/* PR target/65782 */ +/* { dg-do assemble { target { avx512vl && { ! ia32 } } } } */ +/* { dg-options "-O2 -mavx512vl" } */ + +void +foo (void) +{ + register double x __asm ("xmm14"); + register double y __asm ("xmm18"); + asm ("" : "=x" (x)); + asm ("" : "=v" (y)); + x += y; + y += x; + asm ("" : : "x" (x)); + asm ("" : : "v" (y)); +} Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI
  2020-02-07 11:28         ` Jakub Jelinek
@ 2020-02-08  8:24           ` JonY
  2020-02-08 10:05             ` Jakub Jelinek
  0 siblings, 1 reply; 29+ messages in thread
From: JonY @ 2020-02-08 8:24 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Uros Bizjak, gcc-patches

On 2/7/20 11:28 AM, Jakub Jelinek wrote:
> On Fri, Feb 07, 2020 at 10:57:22AM +0000, JonY wrote:
>>>> Is this patch testing still required? I just got back from traveling.
>>>
>>> Yes, our reading of the MS ABI docs show that xmm16-31 are to be call used
>>> (not preserved over calls), while in gcc they are currently handled as
>>> preserved across the calls.
>
> The other parts are I guess mainly about SEH.  Consider e.g.
> void
> foo (void)
> {
>   register double x __asm ("xmm14");
>   register double y __asm ("xmm18");
>   asm ("" : "=x" (x));
>   asm ("" : "=v" (y));
>   x += y;
>   y += x;
>   asm ("" : : "x" (x));
>   asm ("" : : "v" (y));
> }
> looking at cross-compiler output, with -O2 -mavx512f this emits
> 	.file	"abcdeq.c"
> 	.text
> 	.align 16
> 	.globl	foo
> 	.def	foo;	.scl	2;	.type	32;	.endef
> 	.seh_proc	foo
> foo:
> 	subq	$40, %rsp
> 	.seh_stackalloc	40
> 	vmovaps	%xmm14, (%rsp)
> 	.seh_savexmm	%xmm14, 0
> 	vmovaps	%xmm18, 16(%rsp)
> 	.seh_savexmm	%xmm18, 16
> 	.seh_endprologue
> 	vaddsd	%xmm18, %xmm14, %xmm14
> 	vaddsd	%xmm18, %xmm14, %xmm18
> 	vmovaps	(%rsp), %xmm14
> 	vmovaps	16(%rsp), %xmm18
> 	addq	$40, %rsp
> 	ret
> 	.seh_endproc
> 	.ident	"GCC: (GNU) 10.0.1 20200207 (experimental)"
> Does whatever assembler mingw64 uses even assemble this (I mean the
> .seh_savexmm %xmm16, 16 could be problematic)?

It does not, I just checked with the master branch of binutils.

> I can find e.g.
> https://stackoverflow.com/questions/43152633/invalid-register-for-seh-savexmm-in-cygwin/43210527
> which then links to
> https://gcc.gnu.org/PR65782
>
> So, I'd say we want to add PR target/65782 to the ChangeLog entry in the
> patch, and likely add a testcase like the above, so like below?
>
> Have you tested the earlier version of the patch on mingw64 or cygwin?
> If yes, can you just test the testcase from the following patch, without and
> with the i386.h part and verify it FAILs without and PASSes with it?
> Just make check-gcc RUNTESTFLAGS=i386.exp=pr65782.c should do?
> If not, can you please test the whole patch?

I did a -c test build with an older toolchain, it fails to compile
(invalid register for .seh_savexmm) while the latest gcc is passing,
both are using the same binutils version.

I guess that covers it?

> 2020-02-07  Uroš Bizjak  <ubizjak@gmail.com>
> 	    Jakub Jelinek  <jakub@redhat.com>
>
> 	PR target/65782
> 	* config/i386/config/i386/i386.h (CALL_USED_REGISTERS): Make
> 	xmm16-xmm31 call-used even in 64-bit ms-abi.
>
> 	* gcc.target/i386/pr65782.c: New test.
>
> --- gcc/config/i386/config/i386/i386.h.jj	2020-01-22 10:19:24.199221986 +0100
> +++ gcc/config/i386/config/i386/i386.h	2020-02-04 12:09:12.338341003 +0100
> @@ -1128,9 +1128,9 @@ extern const char *host_detect_local_cpu
>  /*xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15*/		\
>       6,   6,    6,    6,    6,    6,    6,    6,		\
>  /*xmm16,xmm17,xmm18,xmm19,xmm20,xmm21,xmm22,xmm23*/		\
> -     6,    6,    6,    6,    6,    6,    6,    6,		\
> +     1,    1,    1,    1,    1,    1,    1,    1,		\
>  /*xmm24,xmm25,xmm26,xmm27,xmm28,xmm29,xmm30,xmm31*/		\
> -     6,    6,    6,    6,    6,    6,    6,    6,		\
> +     1,    1,    1,    1,    1,    1,    1,    1,		\
>  /* k0,  k1, k2, k3, k4, k5, k6, k7*/				\
>       1,   1,  1,  1,  1,  1,  1,  1 }
>
> --- gcc/testsuite/gcc.target/i386/pr65782.c.jj	2020-02-07 12:21:09.472819018 +0100
> +++ gcc/testsuite/gcc.target/i386/pr65782.c	2020-02-07 12:24:06.820154495 +0100
> @@ -0,0 +1,16 @@
> +/* PR target/65782 */
> +/* { dg-do assemble { target { avx512vl && { !
> ia32 } } } } */
> +/* { dg-options "-O2 -mavx512vl" } */
> +
> +void
> +foo (void)
> +{
> +  register double x __asm ("xmm14");
> +  register double y __asm ("xmm18");
> +  asm ("" : "=x" (x));
> +  asm ("" : "=v" (y));
> +  x += y;
> +  y += x;
> +  asm ("" : : "x" (x));
> +  asm ("" : : "v" (y));
> +}
>
> 	Jakub
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI
  2020-02-08  8:24           ` JonY
@ 2020-02-08 10:05             ` Jakub Jelinek
  2020-02-08 10:32               ` Uros Bizjak
  0 siblings, 1 reply; 29+ messages in thread
From: Jakub Jelinek @ 2020-02-08 10:05 UTC (permalink / raw)
  To: JonY; +Cc: Uros Bizjak, gcc-patches

On Sat, Feb 08, 2020 at 08:24:38AM +0000, JonY wrote:
> It does not, I just checked with the master branch of binutils.
...
> I did a -c test build with an older toolchain, it fails to compile
> (invalid register for .seh_savexmm) while the latest gcc is passing,
> both are using the same binutils version.
>
> I guess that covers it?

I went ahead and committed the patch (with the double config/i386/ in it
fixed obviously).

> > 2020-02-07  Uroš Bizjak  <ubizjak@gmail.com>
> > 	    Jakub Jelinek  <jakub@redhat.com>
> >
> > 	PR target/65782
> > 	* config/i386/config/i386/i386.h (CALL_USED_REGISTERS): Make
> > 	xmm16-xmm31 call-used even in 64-bit ms-abi.
> >
> > 	* gcc.target/i386/pr65782.c: New test.

	Jakub
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI
  2020-02-08 10:05             ` Jakub Jelinek
@ 2020-02-08 10:32               ` Uros Bizjak
  2020-02-08 10:34                 ` Jakub Jelinek
  2020-02-08 10:52                 ` Jakub Jelinek
  0 siblings, 2 replies; 29+ messages in thread
From: Uros Bizjak @ 2020-02-08 10:32 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: JonY, gcc-patches

On Sat, Feb 8, 2020 at 11:05 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Sat, Feb 08, 2020 at 08:24:38AM +0000, JonY wrote:
> > It does not, I just checked with the master branch of binutils.
> ...
> > I did a -c test build with an older toolchain, it fails to compile
> > (invalid register for .seh_savexmm) while the latest gcc is passing,
> > both are using the same binutils version.
> >
> > I guess that covers it?
>
> I went ahead and committed the patch (with the double config/i386/ in it
> fixed obviously).

I think that the patch should also be backported to gcc-9 branch. The
change is backward compatible, since the new code will save and
restore zmm16+ registers at the caller site, and the old code (e.g.
existing libraries) will then unnecessarily save and restore regs in
the callee.

HJ fixed some other MSABI issue there.

Uros.

> > > 2020-02-07  Uroš Bizjak  <ubizjak@gmail.com>
> > > 	    Jakub Jelinek  <jakub@redhat.com>
> > >
> > > 	PR target/65782
> > > 	* config/i386/config/i386/i386.h (CALL_USED_REGISTERS): Make
> > > 	xmm16-xmm31 call-used even in 64-bit ms-abi.
> > >
> > > 	* gcc.target/i386/pr65782.c: New test.
>
> 	Jakub
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI
  2020-02-08 10:32               ` Uros Bizjak
@ 2020-02-08 10:34                 ` Jakub Jelinek
  0 siblings, 0 replies; 29+ messages in thread
From: Jakub Jelinek @ 2020-02-08 10:34 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: JonY, gcc-patches

On Sat, Feb 08, 2020 at 11:32:40AM +0100, Uros Bizjak wrote:
> On Sat, Feb 8, 2020 at 11:05 AM Jakub Jelinek <jakub@redhat.com> wrote:
> >
> > On Sat, Feb 08, 2020 at 08:24:38AM +0000, JonY wrote:
> > > It does not, I just checked with the master branch of binutils.
> > ...
> > > I did a -c test build with an older toolchain, it fails to compile
> > > (invalid register for .seh_savexmm) while the latest gcc is passing,
> > > both are using the same binutils version.
> > >
> > > I guess that covers it?
> >
> > I went ahead and committed the patch (with the double config/i386/ in it
> > fixed obviously).
>
> I think that the patch should also be backported to gcc-9 branch. The
> change is backward compatible, since the new code will save and
> restore zmm16+ registers at the caller site, and the old code (e.g.
> existing libraries) will then unnecessarily save and restore regs in
> the callee.

Agreed, will do together with other backports soon.

	Jakub
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI
  2020-02-08 10:32               ` Uros Bizjak
  2020-02-08 10:34                 ` Jakub Jelinek
@ 2020-02-08 10:52                 ` Jakub Jelinek
  2020-02-08 12:52                   ` Uros Bizjak
  1 sibling, 1 reply; 29+ messages in thread
From: Jakub Jelinek @ 2020-02-08 10:52 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: JonY, gcc-patches

On Sat, Feb 08, 2020 at 11:32:40AM +0100, Uros Bizjak wrote:
> I think that the patch should also be backported to gcc-9 branch. The
> change is backward compatible, since the new code will save and
> restore zmm16+ registers at the caller site, and the old code (e.g.
> existing libraries) will then unnecessarily save and restore regs in
> the callee.

Actually, it isn't backward compatible.
If both caller and callee touch xmm16+ (sure, unlikely), then if caller
is compiled by new gcc and callee by old gcc it is the case you described,
caller doesn't try to preserve values across the call and callee uselessly
saves them anyway (unless it fails to assemble due to SEH).
But if caller is compiled by old gcc and callee by new gcc, caller assumes
it can preserve values across the call in those registers and callee doesn't
save them and clobbers them.

	Jakub
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI
  2020-02-08 10:52                 ` Jakub Jelinek
@ 2020-02-08 12:52                   ` Uros Bizjak
  0 siblings, 0 replies; 29+ messages in thread
From: Uros Bizjak @ 2020-02-08 12:52 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: JonY, gcc-patches

On Sat, Feb 8, 2020 at 11:52 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Sat, Feb 08, 2020 at 11:32:40AM +0100, Uros Bizjak wrote:
> > I think that the patch should also be backported to gcc-9 branch. The
> > change is backward compatible, since the new code will save and
> > restore zmm16+ registers at the caller site, and the old code (e.g.
> > existing libraries) will then unnecessarily save and restore regs in
> > the callee.
>
> Actually, it isn't backward compatible.
> If both caller and callee touch xmm16+ (sure, unlikely), then if caller
> is compiled by new gcc and callee by old gcc it is the case you described,
> caller doesn't try to preserve values across the call and callee uselessly
> saves them anyway (unless it fails to assemble due to SEH).
> But if caller is compiled by old gcc and callee by new gcc, caller assumes
> it can preserve values across the call in those registers and callee doesn't
> save them and clobbers them.

I was considering only the most likely scenario where the compiler is
upgraded without recompiling existing system libraries. So, with the
new compiler installed, the new compiled code calls either new or old
libraries. *Assuming* that old libraries never call new code, the
situation is OK. As you point out, when the old code calls new code,
registers get clobbered.

By not backporting the patch, we can consider gcc-9 and gcc-10 ABI
incompatible, otherwise the forward incompatibility moves in between
gcc-9.2 and gcc-9.3. I agree that the former is somehow more
convenient. IMO, this issue should be pointed out in the compiler
release notes in any case.

Uros.
end of thread, other threads:[~2020-02-08 12:52 UTC | newest]

Thread overview: 29+ messages:
2020-02-04  9:39 [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] Jakub Jelinek
2020-02-04 10:16 ` Uros Bizjak
2020-02-04 11:05   ` Jakub Jelinek
2020-02-04 11:13     ` Uros Bizjak
2020-02-04 11:24       ` Uros Bizjak
2020-02-04 12:06         ` Richard Sandiford
2020-02-04 12:23           ` Uros Bizjak
2020-02-08  1:50             ` Segher Boessenkool
2020-02-04 12:31         ` Jakub Jelinek
2020-02-04 12:39           ` Uros Bizjak
2020-02-04 13:13             ` Jakub Jelinek
2020-02-04 13:15               ` Uros Bizjak
2020-02-05 10:05                 ` Jakub Jelinek
2020-02-05 10:12                   ` Jakub Jelinek
2020-02-05 10:47                     ` Uros Bizjak
2020-02-05 11:03                       ` Jakub Jelinek
2020-02-05 11:11                         ` Uros Bizjak
2020-02-05 11:24                           ` Jakub Jelinek
2020-02-04 11:42 ` [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI Jakub Jelinek
2020-02-06  1:00   ` JonY
2020-02-06  6:07     ` Jakub Jelinek
2020-02-07 10:57       ` JonY
2020-02-07 11:28         ` Jakub Jelinek
2020-02-08  8:24           ` JonY
2020-02-08 10:05             ` Jakub Jelinek
2020-02-08 10:32               ` Uros Bizjak
2020-02-08 10:34                 ` Jakub Jelinek
2020-02-08 10:52                 ` Jakub Jelinek
2020-02-08 12:52                   ` Uros Bizjak