* [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190]
@ 2020-02-04  9:39 Jakub Jelinek
  2020-02-04 10:16 ` Uros Bizjak
  0 siblings, 1 reply; 29+ messages in thread

From: Jakub Jelinek @ 2020-02-04  9:39 UTC (permalink / raw)
To: Uros Bizjak, Richard Sandiford; +Cc: gcc-patches

Hi!

As mentioned in the PR, the CLOBBERs in vzeroupper are added there even for
registers that aren't ever live in the function before and break the
prologue/epilogue expansion with ms ABI (normal ABIs are fine, as they
consider all [xyz]mm registers call clobbered, but the ms ABI considers
xmm0-15 call used but the bits above low 128 ones call clobbered).
The following patch fixes it by not adding the clobbers during vzeroupper
pass (before pro_and_epilogue), but adding them for -fipa-ra purposes only
during the final output.  Perhaps we could add some CLOBBERs early (say for
df_regs_ever_live_p regs that aren't live in the live_regs bitmap, or
depending on the ABI either add all of them immediately, or for ms ABI add
CLOBBERs for xmm0-xmm5 if they don't have a SET) and add the rest later.
And the addition could be perhaps done at other spots, e.g. in an
epilogue_completed guarded splitter.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
Or any of the above mentioned variants?

A separate issue is CALL_USED_REGISTERS for msabi on xmm16-xmm31.

And yet another issue is that the presence of a vzeroupper instruction does
(and the patch doesn't change) prevent callers from trying to preserve
something in the xmm0-xmm15 registers if the callee doesn't use them.
Would be nice to be able to say that the vzeroupper will preserve the low
128 bits, so it is ok to hold a 128-bit value across the call, but not
256-bit or larger.  Not sure if the RA has a way to express that (and
IPA-RA certainly ATM doesn't).

2020-02-04  Jakub Jelinek  <jakub@redhat.com>

	PR target/92190
	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only
	include sets and not clobbers in the vzeroupper pattern.
	* config/i386/sse.md (*avx_vzeroupper): Add the clobbers during
	pattern output.

	* gcc.target/i386/pr92190.c: New test.

--- gcc/config/i386/i386-features.c.jj	2020-02-03 13:17:15.919723150 +0100
+++ gcc/config/i386/i386-features.c	2020-02-03 18:30:08.274865983 +0100
@@ -1764,29 +1764,32 @@ convert_scalars_to_vector (bool timode_p
 
      (set (reg:V2DF R) (reg:V2DF R))
 
-   which preserves the low 128 bits but clobbers the upper bits.
-   For a dead register we just use:
-
-     (clobber (reg:V2DF R))
-
-   which invalidates any previous contents of R and stops R from becoming
-   live across the vzeroupper in future.  */
+   which preserves the low 128 bits but clobbers the upper bits.  */
 
 static void
 ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
 {
   rtx pattern = PATTERN (insn);
   unsigned int nregs = TARGET_64BIT ? 16 : 8;
-  rtvec vec = rtvec_alloc (nregs + 1);
-  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
+  unsigned int npats = nregs;
   for (unsigned int i = 0; i < nregs; ++i)
     {
       unsigned int regno = GET_SSE_REGNO (i);
+      if (!bitmap_bit_p (live_regs, regno))
+	npats--;
+    }
+  if (npats == 0)
+    return;
+  rtvec vec = rtvec_alloc (npats + 1);
+  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
+  for (unsigned int i = 0, j = 0; i < nregs; ++i)
+    {
+      unsigned int regno = GET_SSE_REGNO (i);
+      if (!bitmap_bit_p (live_regs, regno))
+	continue;
       rtx reg = gen_rtx_REG (V2DImode, regno);
-      if (bitmap_bit_p (live_regs, regno))
-	RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg);
-      else
-	RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+      ++j;
+      RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg);
     }
   XVEC (pattern, 0) = vec;
   df_insn_rescan (insn);
--- gcc/config/i386/sse.md.jj	2020-02-03 13:17:15.957722589 +0100
+++ gcc/config/i386/sse.md	2020-02-03 18:30:08.280865894 +0100
@@ -19819,7 +19819,34 @@ (define_insn "*avx_vzeroupper"
   [(match_parallel 0 "vzeroupper_pattern"
      [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
   "TARGET_AVX"
-  "vzeroupper"
+{
+  if (flag_ipa_ra && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1)
+    {
+      /* For IPA-RA purposes, make it clear the instruction clobbers
+	 even XMM registers not mentioned explicitly in the pattern.  */
+      unsigned int nregs = TARGET_64BIT ? 16 : 8;
+      unsigned int npats = XVECLEN (operands[0], 0);
+      rtvec vec = rtvec_alloc (nregs + 1);
+      RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0);
+      for (unsigned int i = 0, j = 1; i < nregs; ++i)
+	{
+	  unsigned int regno = GET_SSE_REGNO (i);
+	  if (j < npats
+	      && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno)
+	    {
+	      RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j);
+	      j++;
+	    }
+	  else
+	    {
+	      rtx reg = gen_rtx_REG (V2DImode, regno);
+	      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+	    }
+	}
+      XVEC (operands[0], 0) = vec;
+    }
+  return "vzeroupper";
+}
  [(set_attr "type" "sse")
   (set_attr "modrm" "0")
   (set_attr "memory" "none")
--- gcc/testsuite/gcc.target/i386/pr92190.c.jj	2020-02-03 18:29:30.924415644 +0100
+++ gcc/testsuite/gcc.target/i386/pr92190.c	2020-02-03 18:29:15.992635389 +0100
@@ -0,0 +1,19 @@
+/* PR target/92190 */
+/* { dg-do compile { target { *-*-linux* && lp64 } } } */
+/* { dg-options "-mabi=ms -O2 -mavx512f" } */
+
+typedef char VC __attribute__((vector_size (16)));
+typedef int VI __attribute__((vector_size (16 * sizeof 0)));
+VC a;
+VI b;
+void bar (VI);
+void baz (VC);
+
+void
+foo (void)
+{
+  VC k = a;
+  VI n = b;
+  bar (n);
+  baz (k);
+}

	Jakub

^ permalink raw reply	[flat|nested] 29+ messages in thread
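As a standalone illustration of the two-pass rewrite in ix86_add_reg_usage_to_vzeroupper above — first count the live registers, then allocate a vector with exactly one SET slot per live register plus slot 0 for the unspec, and fill it — here is a minimal C analogue. Plain int arrays stand in for the bitmap and rtvec, and the function name, sentinel value and liveness data are made up for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Analogue of the patched ix86_add_reg_usage_to_vzeroupper:
   pass 1 counts live registers, pass 2 fills a compact vector.
   live[i] nonzero means register i is live across the vzeroupper.
   out[0] gets -1 (standing in for the UNSPECV_VZEROUPPER element);
   out[1..npats] get the live register numbers (the SETs).
   Returns the number of elements written, or 0 if nothing to add.  */
static size_t
compact_live (const int *live, size_t nregs, int *out, size_t outsz)
{
  size_t npats = 0;
  for (size_t i = 0; i < nregs; ++i)       /* pass 1: count */
    if (live[i])
      npats++;
  if (npats == 0 || npats + 1 > outsz)
    return 0;                              /* nothing to add, or no room */
  out[0] = -1;                             /* slot 0: the unspec */
  for (size_t i = 0, j = 0; i < nregs; ++i) /* pass 2: fill */
    if (live[i])
      out[++j] = (int) i;                  /* a (set (reg i) (reg i)) */
  return npats + 1;
}
```

With 8 registers of which 0, 3 and 4 are live, this yields a 4-element vector: the unspec slot plus three SETs, exactly the compact PARALLEL the patch now builds.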
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190]
  2020-02-04  9:39 [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] Jakub Jelinek
@ 2020-02-04 10:16 ` Uros Bizjak
  2020-02-04 11:05   ` Jakub Jelinek
  2020-02-04 11:42   ` [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI Jakub Jelinek
  0 siblings, 2 replies; 29+ messages in thread

From: Uros Bizjak @ 2020-02-04 10:16 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: Richard Sandiford, gcc-patches

On Tue, Feb 4, 2020 at 10:39 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> Hi!
>
> As mentioned in the PR, the CLOBBERs in vzeroupper are added there even for
> registers that aren't ever live in the function before and break the
> prologue/epilogue expansion with ms ABI (normal ABIs are fine, as they
> consider all [xyz]mm registers call clobbered, but the ms ABI considers
> xmm0-15 call used but the bits above low 128 ones call clobbered).
> The following patch fixes it by not adding the clobbers during vzeroupper
> pass (before pro_and_epilogue), but adding them for -fipa-ra purposes only
> during the final output.  Perhaps we could add some CLOBBERs early (say for
> df_regs_ever_live_p regs that aren't live in the live_regs bitmap, or
> depending on the ABI either add all of them immediately, or for ms ABI add
> CLOBBERs for xmm0-xmm5 if they don't have a SET) and add the rest later.
> And the addition could be perhaps done at other spots, e.g. in an
> epilogue_completed guarded splitter.

If it works OK, I'd rather see this functionality implemented as an
epilogue_completed guarded splitter.  In the .md files, there are
already cases where we split at this point, and where it is assumed
that allocated registers won't change anymore.  Also, please don't make
the functionality conditional on flag_split_ra.  This way, we would
always get new patterns in the debug dumps, so in case something goes
wrong, one could at least clearly see the full pattern.

> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> Or any of the above mentioned variants?
>
> A separate issue is CALL_USED_REGISTERS for msabi on xmm16-xmm31.

I guess that the Comment #9 patch from the PR should be trivially
correct, but although it looks obvious, I don't want to propose the
patch since I have no means of testing it.

Uros.

> And yet another issue is that the presence of a vzeroupper instruction does
> (and the patch doesn't change) prevent callers from trying to preserve
> something in the xmm0-xmm15 registers if the callee doesn't use them.
> Would be nice to be able to say that the vzeroupper will preserve the low
> 128 bits, so it is ok to hold a 128-bit value across the call, but not
> 256-bit or larger.  Not sure if the RA has a way to express that (and
> IPA-RA certainly ATM doesn't).
>
> [patch quoted in full, snipped]
>
> 	Jakub
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190]
  2020-02-04 10:16 ` Uros Bizjak
@ 2020-02-04 11:05   ` Jakub Jelinek
  2020-02-04 11:13     ` Uros Bizjak
  2020-02-04 11:42   ` [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI Jakub Jelinek
  1 sibling, 1 reply; 29+ messages in thread

From: Jakub Jelinek @ 2020-02-04 11:05 UTC (permalink / raw)
To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches

On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote:
> If it works OK, I'd rather see this functionality implemented as an
> epilogue_completed guarded splitter.  In the .md files, there are
> already cases where we split at this point, and where it is assumed
> that allocated registers won't change anymore.  Also, please don't make
> the functionality conditional on flag_split_ra.  This way, we would
> always get new patterns in the debug dumps, so in case something goes
> wrong, one could at least clearly see the full pattern.

The following seems to work on the testcase, will bootstrap/regtest it soon.

2020-02-04  Jakub Jelinek  <jakub@redhat.com>

	PR target/92190
	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only
	include sets and not clobbers in the vzeroupper pattern.
	* config/i386/sse.md (*avx_vzeroupper): Change from define_insn to
	define_insn_and_split, split if epilogue_completed and not all
	xmm0-xmm15 registers are mentioned in the pattern and add clobbers
	for the missing registers at that point.

	* gcc.target/i386/pr92190.c: New test.

--- gcc/config/i386/i386-features.c.jj	2020-02-04 11:40:58.755611428 +0100
+++ gcc/config/i386/i386-features.c	2020-02-04 11:51:33.602148491 +0100
@@ -1764,29 +1764,32 @@ convert_scalars_to_vector (bool timode_p
 
      (set (reg:V2DF R) (reg:V2DF R))
 
-   which preserves the low 128 bits but clobbers the upper bits.
-   For a dead register we just use:
-
-     (clobber (reg:V2DF R))
-
-   which invalidates any previous contents of R and stops R from becoming
-   live across the vzeroupper in future.  */
+   which preserves the low 128 bits but clobbers the upper bits.  */
 
 static void
 ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
 {
   rtx pattern = PATTERN (insn);
   unsigned int nregs = TARGET_64BIT ? 16 : 8;
-  rtvec vec = rtvec_alloc (nregs + 1);
-  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
+  unsigned int npats = nregs;
   for (unsigned int i = 0; i < nregs; ++i)
     {
       unsigned int regno = GET_SSE_REGNO (i);
+      if (!bitmap_bit_p (live_regs, regno))
+	npats--;
+    }
+  if (npats == 0)
+    return;
+  rtvec vec = rtvec_alloc (npats + 1);
+  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
+  for (unsigned int i = 0, j = 0; i < nregs; ++i)
+    {
+      unsigned int regno = GET_SSE_REGNO (i);
+      if (!bitmap_bit_p (live_regs, regno))
+	continue;
       rtx reg = gen_rtx_REG (V2DImode, regno);
-      if (bitmap_bit_p (live_regs, regno))
-	RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg);
-      else
-	RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+      ++j;
+      RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg);
     }
   XVEC (pattern, 0) = vec;
   df_insn_rescan (insn);
--- gcc/config/i386/sse.md.jj	2020-02-04 11:40:58.813610563 +0100
+++ gcc/config/i386/sse.md	2020-02-04 11:58:31.544909659 +0100
@@ -19815,11 +19815,43 @@ (define_expand "avx_vzeroupper"
   [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
   "TARGET_AVX")
 
-(define_insn "*avx_vzeroupper"
+(define_insn_and_split "*avx_vzeroupper"
   [(match_parallel 0 "vzeroupper_pattern"
      [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
   "TARGET_AVX"
-  "vzeroupper"
+{
+  if (XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1)
+    return "#";
+  else
+    return "vzeroupper";
+}
+  "epilogue_completed
+   && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1"
+  [(match_dup 0)]
+{
+  /* For IPA-RA purposes, make it clear the instruction clobbers
+     even XMM registers not mentioned explicitly in the pattern.  */
+  unsigned int nregs = TARGET_64BIT ? 16 : 8;
+  unsigned int npats = XVECLEN (operands[0], 0);
+  rtvec vec = rtvec_alloc (nregs + 1);
+  RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0);
+  for (unsigned int i = 0, j = 1; i < nregs; ++i)
+    {
+      unsigned int regno = GET_SSE_REGNO (i);
+      if (j < npats
+	  && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno)
+	{
+	  RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j);
+	  j++;
+	}
+      else
+	{
+	  rtx reg = gen_rtx_REG (V2DImode, regno);
+	  RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+	}
    }
+  XVEC (operands[0], 0) = vec;
+}
  [(set_attr "type" "sse")
   (set_attr "modrm" "0")
   (set_attr "memory" "none")
--- gcc/testsuite/gcc.target/i386/pr92190.c.jj	2020-02-04 11:51:33.608148402 +0100
+++ gcc/testsuite/gcc.target/i386/pr92190.c	2020-02-04 11:51:33.608148402 +0100
@@ -0,0 +1,19 @@
+/* PR target/92190 */
+/* { dg-do compile { target { *-*-linux* && lp64 } } } */
+/* { dg-options "-mabi=ms -O2 -mavx512f" } */
+
+typedef char VC __attribute__((vector_size (16)));
+typedef int VI __attribute__((vector_size (16 * sizeof 0)));
+VC a;
+VI b;
+void bar (VI);
+void baz (VC);
+
+void
+foo (void)
+{
+  VC k = a;
+  VI n = b;
+  bar (n);
+  baz (k);
+}

	Jakub
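The merge step in the new splitter — walk every XMM register number in order, reuse the SET that was kept for each live register, and synthesize a CLOBBER for every other one — can be sketched in isolation like this. Arrays stand in for rtvecs, and the function name and sample data are invented for the illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Analogue of the splitter body: "pats" lists, in ascending order, the
   register numbers that kept a (set (reg) (reg)) in the compact pattern
   (as built by the vzeroupper pass earlier).  For each register in
   0..nregs-1, out[i] records 1 if its SET is reused and 0 if a CLOBBER
   is added instead, mirroring the full PARALLEL the split produces.  */
static void
expand_with_clobbers (const int *pats, size_t npats, size_t nregs, int *out)
{
  for (size_t i = 0, j = 0; i < nregs; ++i)
    {
      if (j < npats && (size_t) pats[j] == i)
	{
	  out[i] = 1;	/* reuse the existing (set (reg i) (reg i)) */
	  j++;
	}
      else
	out[i] = 0;	/* add (clobber (reg i)) for IPA-RA */
    }
}
```

Starting from the 3-register compact pattern of the earlier sketch (registers 0, 3 and 4 live), this expands back to the full 8-register vector: three reused SETs and five fresh CLOBBERs.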
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190]
  2020-02-04 11:05   ` Jakub Jelinek
@ 2020-02-04 11:13     ` Uros Bizjak
  2020-02-04 11:24       ` Uros Bizjak
  0 siblings, 1 reply; 29+ messages in thread

From: Uros Bizjak @ 2020-02-04 11:13 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: Richard Sandiford, gcc-patches

On Tue, Feb 4, 2020 at 12:05 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote:
> > If it works OK, I'd rather see this functionality implemented as an
> > epilogue_completed guarded splitter.  In the .md files, there are
> > already cases where we split at this point, and where it is assumed
> > that allocated registers won't change anymore.  Also, please don't make
> > the functionality conditional on flag_split_ra.  This way, we would
> > always get new patterns in the debug dumps, so in case something goes
> > wrong, one could at least clearly see the full pattern.
>
> The following seems to work on the testcase, will bootstrap/regtest it soon.
>
> 2020-02-04  Jakub Jelinek  <jakub@redhat.com>
>
> 	PR target/92190
> 	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only
> 	include sets and not clobbers in the vzeroupper pattern.
> 	* config/i386/sse.md (*avx_vzeroupper): Change from define_insn to
> 	define_insn_and_split, split if epilogue_completed and not all
> 	xmm0-xmm15 registers are mentioned in the pattern and add clobbers
> 	for the missing registers at that point.
>
> 	* gcc.target/i386/pr92190.c: New test.

A && is missing in the split condition to inherit TARGET_AVX.

OK with this change.

Thanks,
Uros.

> --- gcc/config/i386/i386-features.c.jj	2020-02-04 11:40:58.755611428 +0100
> +++ gcc/config/i386/i386-features.c	2020-02-04 11:51:33.602148491 +0100
> [hunk quoted in full, snipped]
> --- gcc/config/i386/sse.md.jj	2020-02-04 11:40:58.813610563 +0100
> +++ gcc/config/i386/sse.md	2020-02-04 11:58:31.544909659 +0100
> @@ -19815,11 +19815,43 @@ (define_expand "avx_vzeroupper"
>    [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
>    "TARGET_AVX")
>
> -(define_insn "*avx_vzeroupper"
> +(define_insn_and_split "*avx_vzeroupper"
>    [(match_parallel 0 "vzeroupper_pattern"
>      [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
>    "TARGET_AVX"
> -  "vzeroupper"
> +{
> +  if (XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1)
> +    return "#";
> +  else
> +    return "vzeroupper";
> +}
> +  "epilogue_completed

&& epilogue_completed

> +   && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1"
> +  [(match_dup 0)]
> +{
> [rest of patch quoted in full, snipped]
>
> 	Jakub
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190]
  2020-02-04 11:13     ` Uros Bizjak
@ 2020-02-04 11:24       ` Uros Bizjak
  2020-02-04 12:06         ` Richard Sandiford
  2020-02-04 12:31         ` Jakub Jelinek
  0 siblings, 2 replies; 29+ messages in thread

From: Uros Bizjak @ 2020-02-04 11:24 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: Richard Sandiford, gcc-patches

On Tue, Feb 4, 2020 at 12:13 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Tue, Feb 4, 2020 at 12:05 PM Jakub Jelinek <jakub@redhat.com> wrote:
> >
> > On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote:
> > > If it works OK, I'd rather see this functionality implemented as an
> > > epilogue_completed guarded splitter.  In the .md files, there are
> > > already cases where we split at this point, and where it is assumed
> > > that allocated registers won't change anymore.  Also, please don't make
> > > the functionality conditional on flag_split_ra.  This way, we would
> > > always get new patterns in the debug dumps, so in case something goes
> > > wrong, one could at least clearly see the full pattern.
> >
> > The following seems to work on the testcase, will bootstrap/regtest it soon.
> >
> > 2020-02-04  Jakub Jelinek  <jakub@redhat.com>
> >
> > 	PR target/92190
> > 	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only
> > 	include sets and not clobbers in the vzeroupper pattern.
> > 	* config/i386/sse.md (*avx_vzeroupper): Change from define_insn to
> > 	define_insn_and_split, split if epilogue_completed and not all
> > 	xmm0-xmm15 registers are mentioned in the pattern and add clobbers
> > 	for the missing registers at that point.
> >
> > 	* gcc.target/i386/pr92190.c: New test.
>
> A && is missing in the split condition to inherit TARGET_AVX.

Also, you don't need to emit "#" in output template.  This is just for
extra safety, we can live without.  Please see e.g. "*tzcnt<mode>_1".

> OK with this change.
>
> Thanks,
> Uros.

> > [patch quoted in full, snipped]
> >
> > 	Jakub
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190]
  2020-02-04 11:24       ` Uros Bizjak
@ 2020-02-04 12:06         ` Richard Sandiford
  2020-02-04 12:23           ` Uros Bizjak
  2020-02-08  1:50           ` Segher Boessenkool
  1 sibling, 2 replies; 29+ messages in thread

From: Richard Sandiford @ 2020-02-04 12:06 UTC (permalink / raw)
To: Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches

Uros Bizjak <ubizjak@gmail.com> writes:
> On Tue, Feb 4, 2020 at 12:13 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>>
>> On Tue, Feb 4, 2020 at 12:05 PM Jakub Jelinek <jakub@redhat.com> wrote:
>> >
>> > On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote:
>> > > If it works OK, I'd rather see this functionality implemented as an
>> > > epilogue_completed guarded splitter.  In the .md files, there are
>> > > already cases where we split at this point, and where it is assumed
>> > > that allocated registers won't change anymore.  Also, please don't make
>> > > the functionality conditional on flag_split_ra.  This way, we would
>> > > always get new patterns in the debug dumps, so in case something goes
>> > > wrong, one could at least clearly see the full pattern.
>> >
>> > The following seems to work on the testcase, will bootstrap/regtest it soon.
>> >
>> > 2020-02-04  Jakub Jelinek  <jakub@redhat.com>
>> >
>> > 	PR target/92190
>> > 	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only
>> > 	include sets and not clobbers in the vzeroupper pattern.
>> > 	* config/i386/sse.md (*avx_vzeroupper): Change from define_insn to
>> > 	define_insn_and_split, split if epilogue_completed and not all
>> > 	xmm0-xmm15 registers are mentioned in the pattern and add clobbers
>> > 	for the missing registers at that point.
>> >
>> > 	* gcc.target/i386/pr92190.c: New test.
>>
>> A && is missing in the split condition to inherit TARGET_AVX.
>
> Also, you don't need to emit "#" in output template.  This is just for
> extra safety, we can live without.  Please see e.g. "*tzcnt<mode>_1".

IMO it's better to keep it here, where we're relying on the split
happening for correctness.  There's no theoretical guarantee that
a split ever happens otherwise.

Thanks,
Richard

>> OK with this change.
>>
>> Thanks,
>> Uros.
>>
>> > [patch quoted in full, snipped]
>> >
>> > 	Jakub
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-04 12:06 ` Richard Sandiford @ 2020-02-04 12:23 ` Uros Bizjak 2020-02-08 1:50 ` Segher Boessenkool 1 sibling, 0 replies; 29+ messages in thread From: Uros Bizjak @ 2020-02-04 12:23 UTC (permalink / raw) To: Uros Bizjak, Jakub Jelinek, gcc-patches, Richard Sandiford On Tue, Feb 4, 2020 at 1:06 PM Richard Sandiford <richard.sandiford@arm.com> wrote: > > Uros Bizjak <ubizjak@gmail.com> writes: > > On Tue, Feb 4, 2020 at 12:13 PM Uros Bizjak <ubizjak@gmail.com> wrote: > >> > >> On Tue, Feb 4, 2020 at 12:05 PM Jakub Jelinek <jakub@redhat.com> wrote: > >> > > >> > On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote: > >> > > If it works OK, I'd rather see this functionality implemented as an > >> > > epilogue_completed guarded splitter. In the .md files, there are > >> > > already cases where we split at this point, and where it is assumed > >> > > that allocated registers won't change anymore. Also, please don't make > >> > > the functionality conditional on flag_split_ra. This way, we would > >> > > always get new patterns in the debug dumps, so in case something goes > >> > > wrong, one could at least clearly see the full pattern. > >> > > >> > The following seems to work on the testcase, will bootstrap/regtest it soon. > >> > > >> > 2020-02-04 Jakub Jelinek <jakub@redhat.com> > >> > > >> > PR target/92190 > >> > * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only > >> > include sets and not clobbers in the vzeroupper pattern. > >> > * config/i386/sse.md (*avx_vzeroupper): Change from define_insn to > >> > define_insn_and_split, split if epilogue_completed and not all xmm0-xmm15 > >> > registers are mentioned in the pattern and add clobbers for the missing > >> > registers at that point. > >> > > >> > * gcc.target/i386/pr92190.c: New test. > >> > >> A && is missing in the split condition to inherit TARGET_AVX. 
> > > > Also, you don't need to emit "#" in output template. This is just for > > extra safety, we can live without. Please see e.g. "*tzcnt<mode>_1". > > IMO it's better to keep it here, where we're relying on the split > happening for correctness. There's no theoretical guarantee that > a split ever happens otherwise. In this case, I think it would be better to have two patterns, one (define_insn_and_split) with "#" and the other "real" (define_insn). Uros. > Thanks, > Richard > >> OK with this change. > >> > >> Thanks, > >> Uros. > >> > >> > > >> > --- gcc/config/i386/i386-features.c.jj 2020-02-04 11:40:58.755611428 +0100 > >> > +++ gcc/config/i386/i386-features.c 2020-02-04 11:51:33.602148491 +0100 > >> > @@ -1764,29 +1764,32 @@ convert_scalars_to_vector (bool timode_p > >> > > >> > (set (reg:V2DF R) (reg:V2DF R)) > >> > > >> > - which preserves the low 128 bits but clobbers the upper bits. > >> > - For a dead register we just use: > >> > - > >> > - (clobber (reg:V2DF R)) > >> > - > >> > - which invalidates any previous contents of R and stops R from becoming > >> > - live across the vzeroupper in future. */ > >> > + which preserves the low 128 bits but clobbers the upper bits. */ > >> > > >> > static void > >> > ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs) > >> > { > >> > rtx pattern = PATTERN (insn); > >> > unsigned int nregs = TARGET_64BIT ? 
16 : 8; > >> > - rtvec vec = rtvec_alloc (nregs + 1); > >> > - RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); > >> > + unsigned int npats = nregs; > >> > for (unsigned int i = 0; i < nregs; ++i) > >> > { > >> > unsigned int regno = GET_SSE_REGNO (i); > >> > + if (!bitmap_bit_p (live_regs, regno)) > >> > + npats--; > >> > + } > >> > + if (npats == 0) > >> > + return; > >> > + rtvec vec = rtvec_alloc (npats + 1); > >> > + RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); > >> > + for (unsigned int i = 0, j = 0; i < nregs; ++i) > >> > + { > >> > + unsigned int regno = GET_SSE_REGNO (i); > >> > + if (!bitmap_bit_p (live_regs, regno)) > >> > + continue; > >> > rtx reg = gen_rtx_REG (V2DImode, regno); > >> > - if (bitmap_bit_p (live_regs, regno)) > >> > - RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg); > >> > - else > >> > - RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); > >> > + ++j; > >> > + RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg); > >> > } > >> > XVEC (pattern, 0) = vec; > >> > df_insn_rescan (insn); > >> > --- gcc/config/i386/sse.md.jj 2020-02-04 11:40:58.813610563 +0100 > >> > +++ gcc/config/i386/sse.md 2020-02-04 11:58:31.544909659 +0100 > >> > @@ -19815,11 +19815,43 @@ (define_expand "avx_vzeroupper" > >> > [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > >> > "TARGET_AVX") > >> > > >> > -(define_insn "*avx_vzeroupper" > >> > +(define_insn_and_split "*avx_vzeroupper" > >> > [(match_parallel 0 "vzeroupper_pattern" > >> > [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > >> > "TARGET_AVX" > >> > - "vzeroupper" > >> > +{ > >> > + if (XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1) > >> > + return "#"; > >> > + else > >> > + return "vzeroupper"; > >> > +} > >> > + "epilogue_completed > >> > >> && epilogue_completed > >> > >> > + && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 
16 : 8) + 1" > >> > + [(match_dup 0)] > >> > +{ > >> > + /* For IPA-RA purposes, make it clear the instruction clobbers > >> > + even XMM registers not mentioned explicitly in the pattern. */ > >> > + unsigned int nregs = TARGET_64BIT ? 16 : 8; > >> > + unsigned int npats = XVECLEN (operands[0], 0); > >> > + rtvec vec = rtvec_alloc (nregs + 1); > >> > + RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0); > >> > + for (unsigned int i = 0, j = 1; i < nregs; ++i) > >> > + { > >> > + unsigned int regno = GET_SSE_REGNO (i); > >> > + if (j < npats > >> > + && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno) > >> > + { > >> > + RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j); > >> > + j++; > >> > + } > >> > + else > >> > + { > >> > + rtx reg = gen_rtx_REG (V2DImode, regno); > >> > + RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); > >> > + } > >> > + } > >> > + XVEC (operands[0], 0) = vec; > >> > +} > >> > [(set_attr "type" "sse") > >> > (set_attr "modrm" "0") > >> > (set_attr "memory" "none") > >> > --- gcc/testsuite/gcc.target/i386/pr92190.c.jj 2020-02-04 11:51:33.608148402 +0100 > >> > +++ gcc/testsuite/gcc.target/i386/pr92190.c 2020-02-04 11:51:33.608148402 +0100 > >> > @@ -0,0 +1,19 @@ > >> > +/* PR target/92190 */ > >> > +/* { dg-do compile { target { *-*-linux* && lp64 } } } */ > >> > +/* { dg-options "-mabi=ms -O2 -mavx512f" } */ > >> > + > >> > +typedef char VC __attribute__((vector_size (16))); > >> > +typedef int VI __attribute__((vector_size (16 * sizeof 0))); > >> > +VC a; > >> > +VI b; > >> > +void bar (VI); > >> > +void baz (VC); > >> > + > >> > +void > >> > +foo (void) > >> > +{ > >> > + VC k = a; > >> > + VI n = b; > >> > + bar (n); > >> > + baz (k); > >> > +} > >> > > >> > > >> > Jakub > >> > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190]
  2020-02-04 12:06 ` Richard Sandiford
  2020-02-04 12:23   ` Uros Bizjak
@ 2020-02-08  1:50   ` Segher Boessenkool
  1 sibling, 0 replies; 29+ messages in thread
From: Segher Boessenkool @ 2020-02-08 1:50 UTC (permalink / raw)
To: Uros Bizjak, Jakub Jelinek, gcc-patches, richard.sandiford

On Tue, Feb 04, 2020 at 12:06:54PM +0000, Richard Sandiford wrote:
> > Also, you don't need to emit "#" in output template. This is just for
> > extra safety, we can live without. Please see e.g. "*tzcnt<mode>_1".
>
> IMO it's better to keep it here, where we're relying on the split
> happening for correctness. There's no theoretical guarantee that
> a split ever happens otherwise.

A split will always happen if you have the "length" attribute and no
register stacks (split_for_shorten_branches, "split5", almost directly
before final). This has been true for so long that lots of things are
bound to rely on it.

"#" only has real meaning in final, of course (but is nice documentation
for everything else).

Maybe we should just always run split5, and not do it in final at all
anymore? Or at least always for targets with the "length" attribute?


Segher
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-04 11:24 ` Uros Bizjak 2020-02-04 12:06 ` Richard Sandiford @ 2020-02-04 12:31 ` Jakub Jelinek 2020-02-04 12:39 ` Uros Bizjak 1 sibling, 1 reply; 29+ messages in thread From: Jakub Jelinek @ 2020-02-04 12:31 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches On Tue, Feb 04, 2020 at 12:24:10PM +0100, Uros Bizjak wrote: > > A && is missing in the split condition to inherit TARGET_AVX. > > Also, you don't need to emit "#" in output template. This is just for > extra safety, we can live without. Please see e.g. "*tzcnt<mode>_1". I was worried that -fipa-ra then would give wrong answers. -O2 -mabi=ms -mavx512f -fno-schedule-insns2 I was worried about isn't a problem though, because while the split4 pass is disabled, the split3 pass is scheduled instead before regstack. I can do -O2 -mabi=ms -mavx512f -fdisable-rtl-split4 or -O2 -mabi=ms -mavx512f -fno-schedule-insns2 -fdisable-rtl-split3 though and then we ICE, with the patch as is with just && epilogue_completed ... 
the ICE is cc1: note: disable pass rtl-split4 for functions in the range of [0, 4294967295] pr92190.c: In function ‘foo’: pr92190.c:19:1: error: could not split insn 19 | } | ^ (insn:TI 23 18 10 2 (parallel [ (unspec_volatile [ (const_int 0 [0]) ] UNSPECV_VZEROUPPER) (clobber (reg:V2DI 20 xmm0)) (clobber (reg:V2DI 21 xmm1)) (clobber (reg:V2DI 22 xmm2)) (clobber (reg:V2DI 23 xmm3)) (clobber (reg:V2DI 24 xmm4)) (clobber (reg:V2DI 25 xmm5)) (set (reg:V2DI 26 xmm6) (reg:V2DI 26 xmm6)) (clobber (reg:V2DI 27 xmm7)) (clobber (reg:V2DI 44 xmm8)) (clobber (reg:V2DI 45 xmm9)) (clobber (reg:V2DI 46 xmm10)) (clobber (reg:V2DI 47 xmm11)) (clobber (reg:V2DI 48 xmm12)) (clobber (reg:V2DI 49 xmm13)) (clobber (reg:V2DI 50 xmm14)) (clobber (reg:V2DI 51 xmm15)) ]) "pr92190.c":17:3 4895 {*avx_vzeroupper} (nil)) because as the splitter is written, it modifies the insn in place and so later in try_split if (INSN_P (insn_last) && rtx_equal_p (PATTERN (insn_last), pat)) return trial; rtx_equal_p is true because of the modification in place. Now, if I slightly modify the patch, so that it does operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); instead of XVEC (operands[0], 0) = vec; and thus doesn't modify in place, it will ICE in a different spot: pr92190.c: In function ‘foo’: pr92190.c:19:1: internal compiler error: in final_scan_insn_1, at final.c:3078 19 | } | ^ i.e. on /* If we have a length attribute, this instruction should have been split in shorten_branches, to ensure that we would have valid length info for the splitees. */ gcc_assert (!HAVE_ATTR_length); Though, I guess that is not really specific to this patch, by -fdisable-rtl-split{3,4,5} we effectively disable all possibilities of splitting before final, so anything that isn't split before will ICE that way. So, if we wanted to remove that class of ICEs, we should just fatal_error if somebody disables the last splitter pass before expansion and HAVE_ATTR_length is defined. 
Here is the patch with && before epilogue_completed, no "#" and that avoids modification of the insn in place. We don't ICE, just could generate wrong code with -fipa-ra, but user asked for that with -fdisable-rtl-split{3,4,5}, so guess that is ok. 2020-02-04 Jakub Jelinek <jakub@redhat.com> PR target/92190 * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only include sets and not clobbers in the vzeroupper pattern. * config/i386/sse.md (*avx_vzeroupper): Change from define_insn to define_insn_and_split, split if epilogue_completed and not all xmm0-xmm15 registers are mentioned in the pattern and add clobbers for the missing registers at that point. * gcc.target/i386/pr92190.c: New test. --- gcc/config/i386/i386-features.c.jj 2020-02-04 11:40:58.755611428 +0100 +++ gcc/config/i386/i386-features.c 2020-02-04 11:51:33.602148491 +0100 @@ -1764,29 +1764,32 @@ convert_scalars_to_vector (bool timode_p (set (reg:V2DF R) (reg:V2DF R)) - which preserves the low 128 bits but clobbers the upper bits. - For a dead register we just use: - - (clobber (reg:V2DF R)) - - which invalidates any previous contents of R and stops R from becoming - live across the vzeroupper in future. */ + which preserves the low 128 bits but clobbers the upper bits. */ static void ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs) { rtx pattern = PATTERN (insn); unsigned int nregs = TARGET_64BIT ? 
16 : 8; - rtvec vec = rtvec_alloc (nregs + 1); - RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); + unsigned int npats = nregs; for (unsigned int i = 0; i < nregs; ++i) { unsigned int regno = GET_SSE_REGNO (i); + if (!bitmap_bit_p (live_regs, regno)) + npats--; + } + if (npats == 0) + return; + rtvec vec = rtvec_alloc (npats + 1); + RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); + for (unsigned int i = 0, j = 0; i < nregs; ++i) + { + unsigned int regno = GET_SSE_REGNO (i); + if (!bitmap_bit_p (live_regs, regno)) + continue; rtx reg = gen_rtx_REG (V2DImode, regno); - if (bitmap_bit_p (live_regs, regno)) - RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg); - else - RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); + ++j; + RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg); } XVEC (pattern, 0) = vec; df_insn_rescan (insn); --- gcc/config/i386/sse.md.jj 2020-02-04 11:40:58.813610563 +0100 +++ gcc/config/i386/sse.md 2020-02-04 13:21:50.088340875 +0100 @@ -19815,11 +19815,38 @@ (define_expand "avx_vzeroupper" [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] "TARGET_AVX") -(define_insn "*avx_vzeroupper" +(define_insn_and_split "*avx_vzeroupper" [(match_parallel 0 "vzeroupper_pattern" [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] "TARGET_AVX" "vzeroupper" + "&& epilogue_completed + && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1" + [(match_dup 0)] +{ + /* For IPA-RA purposes, make it clear the instruction clobbers + even XMM registers not mentioned explicitly in the pattern. */ + unsigned int nregs = TARGET_64BIT ? 
16 : 8; + unsigned int npats = XVECLEN (operands[0], 0); + rtvec vec = rtvec_alloc (nregs + 1); + RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0); + for (unsigned int i = 0, j = 1; i < nregs; ++i) + { + unsigned int regno = GET_SSE_REGNO (i); + if (j < npats + && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno) + { + RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j); + j++; + } + else + { + rtx reg = gen_rtx_REG (V2DImode, regno); + RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); + } + } + operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); +} [(set_attr "type" "sse") (set_attr "modrm" "0") (set_attr "memory" "none") --- gcc/testsuite/gcc.target/i386/pr92190.c.jj 2020-02-04 11:51:33.608148402 +0100 +++ gcc/testsuite/gcc.target/i386/pr92190.c 2020-02-04 11:51:33.608148402 +0100 @@ -0,0 +1,19 @@ +/* PR target/92190 */ +/* { dg-do compile { target { *-*-linux* && lp64 } } } */ +/* { dg-options "-mabi=ms -O2 -mavx512f" } */ + +typedef char VC __attribute__((vector_size (16))); +typedef int VI __attribute__((vector_size (16 * sizeof 0))); +VC a; +VI b; +void bar (VI); +void baz (VC); + +void +foo (void) +{ + VC k = a; + VI n = b; + bar (n); + baz (k); +} Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-04 12:31 ` Jakub Jelinek @ 2020-02-04 12:39 ` Uros Bizjak 2020-02-04 13:13 ` Jakub Jelinek 0 siblings, 1 reply; 29+ messages in thread From: Uros Bizjak @ 2020-02-04 12:39 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Richard Sandiford, gcc-patches On Tue, Feb 4, 2020 at 1:30 PM Jakub Jelinek <jakub@redhat.com> wrote: > > On Tue, Feb 04, 2020 at 12:24:10PM +0100, Uros Bizjak wrote: > > > A && is missing in the split condition to inherit TARGET_AVX. > > > > Also, you don't need to emit "#" in output template. This is just for > > extra safety, we can live without. Please see e.g. "*tzcnt<mode>_1". > > I was worried that -fipa-ra then would give wrong answers. > -O2 -mabi=ms -mavx512f -fno-schedule-insns2 I was worried about isn't a problem though, > because while the split4 pass is disabled, the split3 pass is scheduled instead before > regstack. > I can do > -O2 -mabi=ms -mavx512f -fdisable-rtl-split4 > or > -O2 -mabi=ms -mavx512f -fno-schedule-insns2 -fdisable-rtl-split3 > though and then we ICE, with the patch as is with just && epilogue_completed ... 
> the ICE is > cc1: note: disable pass rtl-split4 for functions in the range of [0, 4294967295] > pr92190.c: In function ‘foo’: > pr92190.c:19:1: error: could not split insn > 19 | } > | ^ > (insn:TI 23 18 10 2 (parallel [ > (unspec_volatile [ > (const_int 0 [0]) > ] UNSPECV_VZEROUPPER) > (clobber (reg:V2DI 20 xmm0)) > (clobber (reg:V2DI 21 xmm1)) > (clobber (reg:V2DI 22 xmm2)) > (clobber (reg:V2DI 23 xmm3)) > (clobber (reg:V2DI 24 xmm4)) > (clobber (reg:V2DI 25 xmm5)) > (set (reg:V2DI 26 xmm6) > (reg:V2DI 26 xmm6)) > (clobber (reg:V2DI 27 xmm7)) > (clobber (reg:V2DI 44 xmm8)) > (clobber (reg:V2DI 45 xmm9)) > (clobber (reg:V2DI 46 xmm10)) > (clobber (reg:V2DI 47 xmm11)) > (clobber (reg:V2DI 48 xmm12)) > (clobber (reg:V2DI 49 xmm13)) > (clobber (reg:V2DI 50 xmm14)) > (clobber (reg:V2DI 51 xmm15)) > ]) "pr92190.c":17:3 4895 {*avx_vzeroupper} > (nil)) > because as the splitter is written, it modifies the insn in place and so > later in try_split > if (INSN_P (insn_last) > && rtx_equal_p (PATTERN (insn_last), pat)) > return trial; > rtx_equal_p is true because of the modification in place. > Now, if I slightly modify the patch, so that it does > operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); > instead of > XVEC (operands[0], 0) = vec; > and thus doesn't modify in place, it will ICE in a different spot: > pr92190.c: In function ‘foo’: > pr92190.c:19:1: internal compiler error: in final_scan_insn_1, at final.c:3078 > 19 | } > | ^ > i.e. on > /* If we have a length attribute, this instruction should have > been split in shorten_branches, to ensure that we would have > valid length info for the splitees. */ > gcc_assert (!HAVE_ATTR_length); > Though, I guess that is not really specific to this patch, by > -fdisable-rtl-split{3,4,5} we effectively disable all possibilities of > splitting before final, so anything that isn't split before will ICE that > way. 
> So, if we wanted to remove that class of ICEs, we should just fatal_error if > somebody disables the last splitter pass before expansion and > HAVE_ATTR_length is defined. > > Here is the patch with && before epilogue_completed, no "#" and that avoids > modification of the insn in place. We don't ICE, just could generate wrong > code with -fipa-ra, but user asked for that with -fdisable-rtl-split{3,4,5}, > so guess that is ok. > > 2020-02-04 Jakub Jelinek <jakub@redhat.com> > > PR target/92190 > * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only > include sets and not clobbers in the vzeroupper pattern. > * config/i386/sse.md (*avx_vzeroupper): Change from define_insn to > define_insn_and_split, split if epilogue_completed and not all xmm0-xmm15 > registers are mentioned in the pattern and add clobbers for the missing > registers at that point. > > * gcc.target/i386/pr92190.c: New test. As Richard advised, let's put this safety stuff back. Usually, in i386.md, these kind of splitters are implemented as two patterns, one (define_insn_and_split) having "#", and the other (define_insn) with a real insn. My opinion is, that this separation avoids confusion as much as possible. Uros. > --- gcc/config/i386/i386-features.c.jj 2020-02-04 11:40:58.755611428 +0100 > +++ gcc/config/i386/i386-features.c 2020-02-04 11:51:33.602148491 +0100 > @@ -1764,29 +1764,32 @@ convert_scalars_to_vector (bool timode_p > > (set (reg:V2DF R) (reg:V2DF R)) > > - which preserves the low 128 bits but clobbers the upper bits. > - For a dead register we just use: > - > - (clobber (reg:V2DF R)) > - > - which invalidates any previous contents of R and stops R from becoming > - live across the vzeroupper in future. */ > + which preserves the low 128 bits but clobbers the upper bits. */ > > static void > ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs) > { > rtx pattern = PATTERN (insn); > unsigned int nregs = TARGET_64BIT ? 
16 : 8; > - rtvec vec = rtvec_alloc (nregs + 1); > - RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); > + unsigned int npats = nregs; > for (unsigned int i = 0; i < nregs; ++i) > { > unsigned int regno = GET_SSE_REGNO (i); > + if (!bitmap_bit_p (live_regs, regno)) > + npats--; > + } > + if (npats == 0) > + return; > + rtvec vec = rtvec_alloc (npats + 1); > + RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); > + for (unsigned int i = 0, j = 0; i < nregs; ++i) > + { > + unsigned int regno = GET_SSE_REGNO (i); > + if (!bitmap_bit_p (live_regs, regno)) > + continue; > rtx reg = gen_rtx_REG (V2DImode, regno); > - if (bitmap_bit_p (live_regs, regno)) > - RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg); > - else > - RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); > + ++j; > + RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg); > } > XVEC (pattern, 0) = vec; > df_insn_rescan (insn); > --- gcc/config/i386/sse.md.jj 2020-02-04 11:40:58.813610563 +0100 > +++ gcc/config/i386/sse.md 2020-02-04 13:21:50.088340875 +0100 > @@ -19815,11 +19815,38 @@ (define_expand "avx_vzeroupper" > [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > "TARGET_AVX") > > -(define_insn "*avx_vzeroupper" > +(define_insn_and_split "*avx_vzeroupper" > [(match_parallel 0 "vzeroupper_pattern" > [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > "TARGET_AVX" > "vzeroupper" > + "&& epilogue_completed > + && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1" > + [(match_dup 0)] > +{ > + /* For IPA-RA purposes, make it clear the instruction clobbers > + even XMM registers not mentioned explicitly in the pattern. */ > + unsigned int nregs = TARGET_64BIT ? 
16 : 8; > + unsigned int npats = XVECLEN (operands[0], 0); > + rtvec vec = rtvec_alloc (nregs + 1); > + RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0); > + for (unsigned int i = 0, j = 1; i < nregs; ++i) > + { > + unsigned int regno = GET_SSE_REGNO (i); > + if (j < npats > + && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno) > + { > + RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j); > + j++; > + } > + else > + { > + rtx reg = gen_rtx_REG (V2DImode, regno); > + RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); > + } > + } > + operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); > +} > [(set_attr "type" "sse") > (set_attr "modrm" "0") > (set_attr "memory" "none") > --- gcc/testsuite/gcc.target/i386/pr92190.c.jj 2020-02-04 11:51:33.608148402 +0100 > +++ gcc/testsuite/gcc.target/i386/pr92190.c 2020-02-04 11:51:33.608148402 +0100 > @@ -0,0 +1,19 @@ > +/* PR target/92190 */ > +/* { dg-do compile { target { *-*-linux* && lp64 } } } */ > +/* { dg-options "-mabi=ms -O2 -mavx512f" } */ > + > +typedef char VC __attribute__((vector_size (16))); > +typedef int VI __attribute__((vector_size (16 * sizeof 0))); > +VC a; > +VI b; > +void bar (VI); > +void baz (VC); > + > +void > +foo (void) > +{ > + VC k = a; > + VI n = b; > + bar (n); > + baz (k); > +} > > > Jakub > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-04 12:39 ` Uros Bizjak @ 2020-02-04 13:13 ` Jakub Jelinek 2020-02-04 13:15 ` Uros Bizjak 0 siblings, 1 reply; 29+ messages in thread From: Jakub Jelinek @ 2020-02-04 13:13 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches On Tue, Feb 04, 2020 at 01:38:51PM +0100, Uros Bizjak wrote: > As Richard advised, let's put this safety stuff back. Usually, in > i386.md, these kind of splitters are implemented as two patterns, one > (define_insn_and_split) having "#", and the other (define_insn) with a > real insn. My opinion is, that this separation avoids confusion as > much as possible. Okay. So like this if it passes bootstrap/regtest then? 2020-02-04 Jakub Jelinek <jakub@redhat.com> PR target/92190 * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only include sets and not clobbers in the vzeroupper pattern. * config/i386/sse.md (*avx_vzeroupper): Require in insn condition that the parallel has 17 (64-bit) or 9 (32-bit) elts. (*avx_vzeroupper_1): New define_insn_and_split. * gcc.target/i386/pr92190.c: New test. --- gcc/config/i386/i386-features.c.jj 2020-02-04 13:33:32.713885386 +0100 +++ gcc/config/i386/i386-features.c 2020-02-04 13:55:44.358058104 +0100 @@ -1764,29 +1764,32 @@ convert_scalars_to_vector (bool timode_p (set (reg:V2DF R) (reg:V2DF R)) - which preserves the low 128 bits but clobbers the upper bits. - For a dead register we just use: - - (clobber (reg:V2DF R)) - - which invalidates any previous contents of R and stops R from becoming - live across the vzeroupper in future. */ + which preserves the low 128 bits but clobbers the upper bits. */ static void ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs) { rtx pattern = PATTERN (insn); unsigned int nregs = TARGET_64BIT ? 
16 : 8; - rtvec vec = rtvec_alloc (nregs + 1); - RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); + unsigned int npats = nregs; for (unsigned int i = 0; i < nregs; ++i) { unsigned int regno = GET_SSE_REGNO (i); + if (!bitmap_bit_p (live_regs, regno)) + npats--; + } + if (npats == 0) + return; + rtvec vec = rtvec_alloc (npats + 1); + RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); + for (unsigned int i = 0, j = 0; i < nregs; ++i) + { + unsigned int regno = GET_SSE_REGNO (i); + if (!bitmap_bit_p (live_regs, regno)) + continue; rtx reg = gen_rtx_REG (V2DImode, regno); - if (bitmap_bit_p (live_regs, regno)) - RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg); - else - RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); + ++j; + RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg); } XVEC (pattern, 0) = vec; df_insn_rescan (insn); --- gcc/config/i386/sse.md.jj 2020-02-04 13:33:32.733885088 +0100 +++ gcc/config/i386/sse.md 2020-02-04 13:57:38.995349722 +0100 @@ -19818,11 +19818,49 @@ (define_expand "avx_vzeroupper" (define_insn "*avx_vzeroupper" [(match_parallel 0 "vzeroupper_pattern" [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] - "TARGET_AVX" + "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 16 : 8) + 1" "vzeroupper" [(set_attr "type" "sse") (set_attr "modrm" "0") (set_attr "memory" "none") + (set_attr "prefix" "vex") + (set_attr "btver2_decode" "vector") + (set_attr "mode" "OI")]) + +(define_insn_and_split "*avx_vzeroupper_1" + [(match_parallel 0 "vzeroupper_pattern" + [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] + "TARGET_AVX && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 16 : 8) + 1" + "#" + "&& epilogue_completed" + [(match_dup 0)] +{ + /* For IPA-RA purposes, make it clear the instruction clobbers + even XMM registers not mentioned explicitly in the pattern. */ + unsigned int nregs = TARGET_64BIT ? 
16 : 8; + unsigned int npats = XVECLEN (operands[0], 0); + rtvec vec = rtvec_alloc (nregs + 1); + RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0); + for (unsigned int i = 0, j = 1; i < nregs; ++i) + { + unsigned int regno = GET_SSE_REGNO (i); + if (j < npats + && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno) + { + RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j); + j++; + } + else + { + rtx reg = gen_rtx_REG (V2DImode, regno); + RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); + } + } + operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); +} + [(set_attr "type" "sse") + (set_attr "modrm" "0") + (set_attr "memory" "none") (set_attr "prefix" "vex") (set_attr "btver2_decode" "vector") (set_attr "mode" "OI")]) --- gcc/testsuite/gcc.target/i386/pr92190.c.jj 2020-02-04 13:55:44.364058015 +0100 +++ gcc/testsuite/gcc.target/i386/pr92190.c 2020-02-04 13:55:44.364058015 +0100 @@ -0,0 +1,19 @@ +/* PR target/92190 */ +/* { dg-do compile { target { *-*-linux* && lp64 } } } */ +/* { dg-options "-mabi=ms -O2 -mavx512f" } */ + +typedef char VC __attribute__((vector_size (16))); +typedef int VI __attribute__((vector_size (16 * sizeof 0))); +VC a; +VI b; +void bar (VI); +void baz (VC); + +void +foo (void) +{ + VC k = a; + VI n = b; + bar (n); + baz (k); +} Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-04 13:13 ` Jakub Jelinek @ 2020-02-04 13:15 ` Uros Bizjak 2020-02-05 10:05 ` Jakub Jelinek 0 siblings, 1 reply; 29+ messages in thread From: Uros Bizjak @ 2020-02-04 13:15 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Richard Sandiford, gcc-patches On Tue, Feb 4, 2020 at 2:13 PM Jakub Jelinek <jakub@redhat.com> wrote: > > On Tue, Feb 04, 2020 at 01:38:51PM +0100, Uros Bizjak wrote: > > As Richard advised, let's put this safety stuff back. Usually, in > > i386.md, these kind of splitters are implemented as two patterns, one > > (define_insn_and_split) having "#", and the other (define_insn) with a > > real insn. My opinion is, that this separation avoids confusion as > > much as possible. > > Okay. So like this if it passes bootstrap/regtest then? Yes. > 2020-02-04 Jakub Jelinek <jakub@redhat.com> > > PR target/92190 > * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only > include sets and not clobbers in the vzeroupper pattern. > * config/i386/sse.md (*avx_vzeroupper): Require in insn condition that > the parallel has 17 (64-bit) or 9 (32-bit) elts. > (*avx_vzeroupper_1): New define_insn_and_split. > > * gcc.target/i386/pr92190.c: New test. OK. Thanks, Uros. > --- gcc/config/i386/i386-features.c.jj 2020-02-04 13:33:32.713885386 +0100 > +++ gcc/config/i386/i386-features.c 2020-02-04 13:55:44.358058104 +0100 > @@ -1764,29 +1764,32 @@ convert_scalars_to_vector (bool timode_p > > (set (reg:V2DF R) (reg:V2DF R)) > > - which preserves the low 128 bits but clobbers the upper bits. > - For a dead register we just use: > - > - (clobber (reg:V2DF R)) > - > - which invalidates any previous contents of R and stops R from becoming > - live across the vzeroupper in future. */ > + which preserves the low 128 bits but clobbers the upper bits. 
*/ > > static void > ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs) > { > rtx pattern = PATTERN (insn); > unsigned int nregs = TARGET_64BIT ? 16 : 8; > - rtvec vec = rtvec_alloc (nregs + 1); > - RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); > + unsigned int npats = nregs; > for (unsigned int i = 0; i < nregs; ++i) > { > unsigned int regno = GET_SSE_REGNO (i); > + if (!bitmap_bit_p (live_regs, regno)) > + npats--; > + } > + if (npats == 0) > + return; > + rtvec vec = rtvec_alloc (npats + 1); > + RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); > + for (unsigned int i = 0, j = 0; i < nregs; ++i) > + { > + unsigned int regno = GET_SSE_REGNO (i); > + if (!bitmap_bit_p (live_regs, regno)) > + continue; > rtx reg = gen_rtx_REG (V2DImode, regno); > - if (bitmap_bit_p (live_regs, regno)) > - RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg); > - else > - RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); > + ++j; > + RTVEC_ELT (vec, j) = gen_rtx_SET (reg, reg); > } > XVEC (pattern, 0) = vec; > df_insn_rescan (insn); > --- gcc/config/i386/sse.md.jj 2020-02-04 13:33:32.733885088 +0100 > +++ gcc/config/i386/sse.md 2020-02-04 13:57:38.995349722 +0100 > @@ -19818,11 +19818,49 @@ (define_expand "avx_vzeroupper" > (define_insn "*avx_vzeroupper" > [(match_parallel 0 "vzeroupper_pattern" > [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > - "TARGET_AVX" > + "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 16 : 8) + 1" > "vzeroupper" > [(set_attr "type" "sse") > (set_attr "modrm" "0") > (set_attr "memory" "none") > + (set_attr "prefix" "vex") > + (set_attr "btver2_decode" "vector") > + (set_attr "mode" "OI")]) > + > +(define_insn_and_split "*avx_vzeroupper_1" > + [(match_parallel 0 "vzeroupper_pattern" > + [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > + "TARGET_AVX && XVECLEN (operands[0], 0) != (TARGET_64BIT ? 
16 : 8) + 1" > + "#" > + "&& epilogue_completed" > + [(match_dup 0)] > +{ > + /* For IPA-RA purposes, make it clear the instruction clobbers > + even XMM registers not mentioned explicitly in the pattern. */ > + unsigned int nregs = TARGET_64BIT ? 16 : 8; > + unsigned int npats = XVECLEN (operands[0], 0); > + rtvec vec = rtvec_alloc (nregs + 1); > + RTVEC_ELT (vec, 0) = XVECEXP (operands[0], 0, 0); > + for (unsigned int i = 0, j = 1; i < nregs; ++i) > + { > + unsigned int regno = GET_SSE_REGNO (i); > + if (j < npats > + && REGNO (SET_DEST (XVECEXP (operands[0], 0, j))) == regno) > + { > + RTVEC_ELT (vec, i + 1) = XVECEXP (operands[0], 0, j); > + j++; > + } > + else > + { > + rtx reg = gen_rtx_REG (V2DImode, regno); > + RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); > + } > + } > + operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); > +} > + [(set_attr "type" "sse") > + (set_attr "modrm" "0") > + (set_attr "memory" "none") > (set_attr "prefix" "vex") > (set_attr "btver2_decode" "vector") > (set_attr "mode" "OI")]) > --- gcc/testsuite/gcc.target/i386/pr92190.c.jj 2020-02-04 13:55:44.364058015 +0100 > +++ gcc/testsuite/gcc.target/i386/pr92190.c 2020-02-04 13:55:44.364058015 +0100 > @@ -0,0 +1,19 @@ > +/* PR target/92190 */ > +/* { dg-do compile { target { *-*-linux* && lp64 } } } */ > +/* { dg-options "-mabi=ms -O2 -mavx512f" } */ > + > +typedef char VC __attribute__((vector_size (16))); > +typedef int VI __attribute__((vector_size (16 * sizeof 0))); > +VC a; > +VI b; > +void bar (VI); > +void baz (VC); > + > +void > +foo (void) > +{ > + VC k = a; > + VI n = b; > + bar (n); > + baz (k); > +} > > > Jakub > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-04 13:15 ` Uros Bizjak @ 2020-02-05 10:05 ` Jakub Jelinek 2020-02-05 10:12 ` Jakub Jelinek 2020-02-05 10:47 ` Uros Bizjak 0 siblings, 2 replies; 29+ messages in thread From: Jakub Jelinek @ 2020-02-05 10:05 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches [-- Attachment #1: Type: text/plain, Size: 4493 bytes --] On Tue, Feb 04, 2020 at 02:15:04PM +0100, Uros Bizjak wrote: > On Tue, Feb 4, 2020 at 2:13 PM Jakub Jelinek <jakub@redhat.com> wrote: > > > > On Tue, Feb 04, 2020 at 01:38:51PM +0100, Uros Bizjak wrote: > > > As Richard advised, let's put this safety stuff back. Usually, in > > > i386.md, these kind of splitters are implemented as two patterns, one > > > (define_insn_and_split) having "#", and the other (define_insn) with a > > > real insn. My opinion is, that this separation avoids confusion as > > > much as possible. > > > > Okay. So like this if it passes bootstrap/regtest then? > > Yes. > > > 2020-02-04 Jakub Jelinek <jakub@redhat.com> > > > > PR target/92190 > > * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only > > include sets and not clobbers in the vzeroupper pattern. > > * config/i386/sse.md (*avx_vzeroupper): Require in insn condition that > > the parallel has 17 (64-bit) or 9 (32-bit) elts. > > (*avx_vzeroupper_1): New define_insn_and_split. > > > > * gcc.target/i386/pr92190.c: New test. > > OK. 
Unfortunately, it breaks +FAIL: gcc.target/i386/avx-2.c (internal compiler error) +FAIL: gcc.target/i386/avx-2.c (test for excess errors) +FAIL: gcc.target/i386/avx-vzeroupper-10.c (internal compiler error) +FAIL: gcc.target/i386/avx-vzeroupper-10.c (test for excess errors) +UNRESOLVED: gcc.target/i386/avx-vzeroupper-10.c scan-assembler-times avx_vzeroupper 3 +FAIL: gcc.target/i386/avx-vzeroupper-11.c (internal compiler error) +FAIL: gcc.target/i386/avx-vzeroupper-11.c (test for excess errors) +UNRESOLVED: gcc.target/i386/avx-vzeroupper-11.c scan-assembler-times \\\\*avx_vzeroall 1 +UNRESOLVED: gcc.target/i386/avx-vzeroupper-11.c scan-assembler-times avx_vzeroupper 3 +FAIL: gcc.target/i386/avx-vzeroupper-12.c (internal compiler error) +FAIL: gcc.target/i386/avx-vzeroupper-12.c (test for excess errors) +UNRESOLVED: gcc.target/i386/avx-vzeroupper-12.c scan-assembler-times \\\\*avx_vzeroall 1 +UNRESOLVED: gcc.target/i386/avx-vzeroupper-12.c scan-assembler-times avx_vzeroupper 4 +FAIL: gcc.target/i386/avx-vzeroupper-5.c (internal compiler error) +FAIL: gcc.target/i386/avx-vzeroupper-5.c (test for excess errors) +UNRESOLVED: gcc.target/i386/avx-vzeroupper-5.c scan-assembler-times avx_vzeroupper 1 +FAIL: gcc.target/i386/avx-vzeroupper-7.c (internal compiler error) +FAIL: gcc.target/i386/avx-vzeroupper-7.c (test for excess errors) +UNRESOLVED: gcc.target/i386/avx-vzeroupper-7.c scan-assembler-times avx_vzeroupper 1 +FAIL: gcc.target/i386/avx-vzeroupper-8.c (internal compiler error) +FAIL: gcc.target/i386/avx-vzeroupper-8.c (test for excess errors) +UNRESOLVED: gcc.target/i386/avx-vzeroupper-8.c scan-assembler-times avx_vzeroupper 1 +FAIL: gcc.target/i386/avx-vzeroupper-9.c (internal compiler error) +FAIL: gcc.target/i386/avx-vzeroupper-9.c (test for excess errors) +UNRESOLVED: gcc.target/i386/avx-vzeroupper-9.c scan-assembler-times avx_vzeroupper 4 +FAIL: gcc.target/i386/sse-14.c (internal compiler error) +FAIL: gcc.target/i386/sse-14.c (test for excess errors) +FAIL: 
gcc.target/i386/sse-22.c (internal compiler error) +FAIL: gcc.target/i386/sse-22.c (test for excess errors) +FAIL: gcc.target/i386/sse-22a.c (internal compiler error) +FAIL: gcc.target/i386/sse-22a.c (test for excess errors) The problem is that x86 is the only target that defines HAVE_ATTR_length and doesn't schedule any splitting pass at -O0 after pro_and_epilogue. So, either we go back to handling this during vzeroupper output (unconditionally, rather than flag_ipa_ra guarded), or we need to tweak the split* passes for x86. Seems there are 5 split passes, split1 is run unconditionally before reload, split2 is run for optimize > 0 or STACK_REGS (x86) after ra but before epilogue_completed, split3 is run before regstack for STACK_REGS and optimize and -fno-schedule-insns2, split4 is run before sched2 if sched2 is run and split5 is run before shorten_branches if HAVE_ATTR_length and not STACK_REGS. Attached are 3 possible incremental patches for recog.c, all of them fix all the above regressions, but haven't fully bootstrapped/regtested any of them yet. My preference would be the last one, which for -O0 and x86 disables split2 and enables split3, as it doesn't add any extra passes. The first one just enables split3 for -O0 on x86, the second one enables split5 for -O0 on x86. Jakub [-- Attachment #2: 12 --] [-- Type: text/plain, Size: 690 bytes --] 2020-02-05 Jakub Jelinek <jakub@redhat.com> PR target/92190 * recog.c (pass_split_before_regstack::gate): For STACK_REGS targets, run even when !optimize. --- gcc/recog.c.jj 2020-01-12 11:54:36.918405790 +0100 +++ gcc/recog.c 2020-02-05 10:13:20.236989567 +0100 @@ -3991,12 +3991,12 @@ pass_split_before_regstack::gate (functi split until final which doesn't allow splitting if HAVE_ATTR_length. 
*/ # ifdef INSN_SCHEDULING - return (optimize && !flag_schedule_insns_after_reload); + return !optimize || !flag_schedule_insns_after_reload; # else - return (optimize); + return true; # endif #else - return 0; + return false; #endif } [-- Attachment #3: 13 --] [-- Type: text/plain, Size: 653 bytes --] 2020-02-05 Jakub Jelinek <jakub@redhat.com> PR target/92190 * recog.c (pass_split_for_shorten_branches::gate): For STACK_REGS targets, return !optimize. --- gcc/recog.c.jj 2020-01-12 11:54:36.918405790 +0100 +++ gcc/recog.c 2020-02-05 10:24:34.506773724 +0100 @@ -4091,8 +4091,12 @@ public: { /* The placement of the splitting that we do for shorten_branches depends on whether regstack is used by the target or not. */ -#if HAVE_ATTR_length && !defined (STACK_REGS) +#if HAVE_ATTR_length +# ifndef STACK_REGS return true; +# else + return !optimize; +# endif #else return false; #endif [-- Attachment #4: 14 --] [-- Type: text/plain, Size: 1128 bytes --] 2020-02-05 Jakub Jelinek <jakub@redhat.com> PR target/92190 * recog.c (pass_split_after_reload::gate): For STACK_REGS targets, don't run when !optimize. (pass_split_before_regstack::gate): For STACK_REGS targets, run even when !optimize. --- gcc/recog.c.jj 2020-01-12 11:54:36.918405790 +0100 +++ gcc/recog.c 2020-02-05 10:45:19.125938154 +0100 @@ -3924,14 +3924,7 @@ public: virtual bool gate (function *) { /* If optimizing, then go ahead and split insns now. */ - if (optimize > 0) - return true; - -#ifdef STACK_REGS - return true; -#else - return false; -#endif + return optimize > 0; } virtual unsigned int execute (function *) @@ -3991,12 +3984,12 @@ pass_split_before_regstack::gate (functi split until final which doesn't allow splitting if HAVE_ATTR_length. 
*/ # ifdef INSN_SCHEDULING - return (optimize && !flag_schedule_insns_after_reload); + return !optimize || !flag_schedule_insns_after_reload; # else - return (optimize); + return true; # endif #else - return 0; + return false; #endif } ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-05 10:05 ` Jakub Jelinek @ 2020-02-05 10:12 ` Jakub Jelinek 2020-02-05 10:47 ` Uros Bizjak 1 sibling, 0 replies; 29+ messages in thread From: Jakub Jelinek @ 2020-02-05 10:12 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches On Wed, Feb 05, 2020 at 10:58:04AM +0100, Jakub Jelinek wrote: > Attached are 3 possible incremental patches for recog.c, all of them fix > all the above regressions, but haven't fully bootstrapped/regtested any of > them yet. My preference would be the last one, which for -O0 and x86 > disables split2 and enables split3, as it doesn't add any extra passes. > The first one just enables split3 for -O0 on x86, the second one enables > split5 for -O0 on x86. The -da -O0 changes of the last patch are: test.c.275r.split1 test.c.277r.dfinit test.c.278r.mode_sw test.c.279r.asmcons test.c.284r.ira test.c.285r.reload -test.c.289r.split2 test.c.292r.pro_and_epilogue test.c.295r.jump2 +test.c.306r.split3 test.c.307r.stack test.c.308r.alignments test.c.310r.mach test.c.311r.barriers test.c.316r.shorten test.c.317r.nothrow test.c.318r.dwarf2 test.c.319r.final test.c.320r.dfinish while the first one just adds test.c.306r.split3 and doesn't remove test.c.289r.split2 and the second one just adds test.c.313r.split5 before test.c.316r.shorten. Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-05 10:05 ` Jakub Jelinek 2020-02-05 10:12 ` Jakub Jelinek @ 2020-02-05 10:47 ` Uros Bizjak 2020-02-05 11:03 ` Jakub Jelinek 1 sibling, 1 reply; 29+ messages in thread From: Uros Bizjak @ 2020-02-05 10:47 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Richard Sandiford, gcc-patches On Wed, Feb 5, 2020 at 11:05 AM Jakub Jelinek <jakub@redhat.com> wrote: > > On Tue, Feb 04, 2020 at 02:15:04PM +0100, Uros Bizjak wrote: > > On Tue, Feb 4, 2020 at 2:13 PM Jakub Jelinek <jakub@redhat.com> wrote: > > > > > > On Tue, Feb 04, 2020 at 01:38:51PM +0100, Uros Bizjak wrote: > > > > As Richard advised, let's put this safety stuff back. Usually, in > > > > i386.md, these kind of splitters are implemented as two patterns, one > > > > (define_insn_and_split) having "#", and the other (define_insn) with a > > > > real insn. My opinion is, that this separation avoids confusion as > > > > much as possible. > > > > > > Okay. So like this if it passes bootstrap/regtest then? > > > > Yes. > > > > > 2020-02-04 Jakub Jelinek <jakub@redhat.com> > > > > > > PR target/92190 > > > * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper): Only > > > include sets and not clobbers in the vzeroupper pattern. > > > * config/i386/sse.md (*avx_vzeroupper): Require in insn condition that > > > the parallel has 17 (64-bit) or 9 (32-bit) elts. > > > (*avx_vzeroupper_1): New define_insn_and_split. > > > > > > * gcc.target/i386/pr92190.c: New test. > > > > OK. 
> > Unfortunately, it breaks > +FAIL: gcc.target/i386/avx-2.c (internal compiler error) > +FAIL: gcc.target/i386/avx-2.c (test for excess errors) > +FAIL: gcc.target/i386/avx-vzeroupper-10.c (internal compiler error) > +FAIL: gcc.target/i386/avx-vzeroupper-10.c (test for excess errors) > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-10.c scan-assembler-times avx_vzeroupper 3 > +FAIL: gcc.target/i386/avx-vzeroupper-11.c (internal compiler error) > +FAIL: gcc.target/i386/avx-vzeroupper-11.c (test for excess errors) > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-11.c scan-assembler-times \\\\*avx_vzeroall 1 > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-11.c scan-assembler-times avx_vzeroupper 3 > +FAIL: gcc.target/i386/avx-vzeroupper-12.c (internal compiler error) > +FAIL: gcc.target/i386/avx-vzeroupper-12.c (test for excess errors) > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-12.c scan-assembler-times \\\\*avx_vzeroall 1 > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-12.c scan-assembler-times avx_vzeroupper 4 > +FAIL: gcc.target/i386/avx-vzeroupper-5.c (internal compiler error) > +FAIL: gcc.target/i386/avx-vzeroupper-5.c (test for excess errors) > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-5.c scan-assembler-times avx_vzeroupper 1 > +FAIL: gcc.target/i386/avx-vzeroupper-7.c (internal compiler error) > +FAIL: gcc.target/i386/avx-vzeroupper-7.c (test for excess errors) > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-7.c scan-assembler-times avx_vzeroupper 1 > +FAIL: gcc.target/i386/avx-vzeroupper-8.c (internal compiler error) > +FAIL: gcc.target/i386/avx-vzeroupper-8.c (test for excess errors) > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-8.c scan-assembler-times avx_vzeroupper 1 > +FAIL: gcc.target/i386/avx-vzeroupper-9.c (internal compiler error) > +FAIL: gcc.target/i386/avx-vzeroupper-9.c (test for excess errors) > +UNRESOLVED: gcc.target/i386/avx-vzeroupper-9.c scan-assembler-times avx_vzeroupper 4 > +FAIL: gcc.target/i386/sse-14.c (internal compiler error) > +FAIL: 
gcc.target/i386/sse-14.c (test for excess errors) > +FAIL: gcc.target/i386/sse-22.c (internal compiler error) > +FAIL: gcc.target/i386/sse-22.c (test for excess errors) > +FAIL: gcc.target/i386/sse-22a.c (internal compiler error) > +FAIL: gcc.target/i386/sse-22a.c (test for excess errors) > > The problem is that x86 is the only target that defines HAVE_ATTR_length and > doesn't schedule any splitting pass at -O0 after pro_and_epilogue. > > So, either we go back to handling this during vzeroupper output > (unconditionally, rather than flag_ipa_ra guarded), or we need to tweak the > split* passes for x86. > > Seems there are 5 split passes, split1 is run unconditionally before reload, > split2 is run for optimize > 0 or STACK_REGS (x86) after ra but before > epilogue_completed, split3 is run before regstack for STACK_REGS and > optimize and -fno-schedule-insns2, split4 is run before sched2 if sched2 is > run and split5 is run before shorten_branches if HAVE_ATTR_length and not > STACK_REGS. > > Attached are 3 possible incremental patches for recog.c, all of them fix > all the above regressions, but haven't fully bootstrapped/regtested any of > them yet. My preference would be the last one, which for -O0 and x86 > disables split2 and enables split3, as it doesn't add any extra passes. > The first one just enables split3 for -O0 on x86, the second one enables > split5 for -O0 on x86. Please note that in i386.md we expect that the check for "epilogue_completed" means split4 point. There are some places with: "TARGET_64BIT && ((optimize > 0 && flag_peephole2) ? epilogue_completed : reload_completed) so for flag_peephole2, we split after peephole2 pass was performed. Apparently, we already hit this problem in the past, so the check for "optimize > 0" was added to solve -O0 -fpeephole2 combo. Another one is with && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed where we assume that allocated registers won't change anymore when breaking SSE partial register stall. 
I think we should just enable split4 also for -O0. This would also allow us to remove the "optimize > 0" check above and allow us to generate a bit more optimal code even with -O0 for TARGET_SSE_PARTIAL_REG_DEPENDENCY and TARGET_AVOID_FALSE_DEP_FOR_BMI. Uros. > > Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-05 10:47 ` Uros Bizjak @ 2020-02-05 11:03 ` Jakub Jelinek 2020-02-05 11:11 ` Uros Bizjak 0 siblings, 1 reply; 29+ messages in thread From: Jakub Jelinek @ 2020-02-05 11:03 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches On Wed, Feb 05, 2020 at 11:46:51AM +0100, Uros Bizjak wrote: > I think we should just enable split4 also for -O0. This would also > allow us to remove the "optimize > 0" check above and allow us to > generate a bit more optimal code even with -O0 for > TARGET_SSE_PARTIAL_REG_DEPENDENCY and TARGET_AVOID_FALSE_DEP_FOR_BMI. The -O0 passes are: test.c.284r.ira test.c.285r.reload # placement of split2 test.c.292r.pro_and_epilogue test.c.295r.jump2 # placement of split4 # placement of split3 test.c.307r.stack test.c.308r.alignments test.c.310r.mach test.c.311r.barriers # placement of split5 test.c.316r.shorten test.c.317r.nothrow test.c.318r.dwarf2 test.c.319r.final It really doesn't matter if for -O0 STACK_REGS we enable split4 or split3, it is the same location, just split3 is better named (split before reg-stack, which is run even at -O0, rather than split before sched2, which is not run at -O0). I think the primary question is, do we for -O0 need to enable split2 (as in, will anything in pro_and_epilogue or jump2 passes benefit from the splitting and worth running two -O0 post-RA split passes), then we could go with the first patch I've posted, or nothing benefits from it in the two passes and it is ok to split only before stack, in that case we can go with the third patch. Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-05 11:03 ` Jakub Jelinek @ 2020-02-05 11:11 ` Uros Bizjak 2020-02-05 11:24 ` Jakub Jelinek 0 siblings, 1 reply; 29+ messages in thread From: Uros Bizjak @ 2020-02-05 11:11 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Richard Sandiford, gcc-patches On Wed, Feb 5, 2020 at 12:03 PM Jakub Jelinek <jakub@redhat.com> wrote: > > On Wed, Feb 05, 2020 at 11:46:51AM +0100, Uros Bizjak wrote: > > I think we should just enable split4 also for -O0. This would also > > allow us to remove the "optimize > 0" check above and allow us to > > generate a bit more optimal code even with -O0 for > > TARGET_SSE_PARTIAL_REG_DEPENDENCY and TARGET_AVOID_FALSE_DEP_FOR_BMI. > > The -O0 passes are: > test.c.284r.ira > test.c.285r.reload > # placement of split2 > test.c.292r.pro_and_epilogue > test.c.295r.jump2 > # placement of split4 > # placement of split3 > test.c.307r.stack > test.c.308r.alignments > test.c.310r.mach > test.c.311r.barriers > # placement of split5 > test.c.316r.shorten > test.c.317r.nothrow > test.c.318r.dwarf2 > test.c.319r.final > > It really doesn't matter if for -O0 STACK_REGS we enable split4 or > split3, it is the same location, just split3 is better named (split before > reg-stack, which is run even at -O0, rather than split before sched2, which > is not run at -O0). > > I think the primary question is, do we for -O0 need to enable split2 (as in, > will anything in pro_and_epilogue or jump2 passes benefit from the > splitting and worth running two -O0 post-RA split passes), then we could go > with the first patch I've posted, or nothing benefits from it in the two > passes and it is ok to split only before stack, in that case we can go with > the third patch. Oh, I see. No passes get scheduled between split3 and split4. Looks to me that your third patch is then the way to go. Uros. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] 2020-02-05 11:11 ` Uros Bizjak @ 2020-02-05 11:24 ` Jakub Jelinek 0 siblings, 0 replies; 29+ messages in thread From: Jakub Jelinek @ 2020-02-05 11:24 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches On Wed, Feb 05, 2020 at 12:11:15PM +0100, Uros Bizjak wrote: > > I think the primary question is, do we for -O0 need to enable split2 (as in, > > will anything in pro_and_epilogue or jump2 passes benefit from the > > splitting and worth running two -O0 post-RA split passes), then we could go > > with the first patch I've posted, or nothing benefits from it in the two > > passes and it is ok to split only before stack, in that case we can go with > > the third patch. > > Oh, I see. No passes get scheduled between split3 and split4. Looks to > me that your third patch is then the way to go. Ok, will bootstrap/regtest it soon. Thanks. Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI 2020-02-04 10:16 ` Uros Bizjak 2020-02-04 11:05 ` Jakub Jelinek @ 2020-02-04 11:42 ` Jakub Jelinek 2020-02-06 1:00 ` JonY 1 sibling, 1 reply; 29+ messages in thread From: Jakub Jelinek @ 2020-02-04 11:42 UTC (permalink / raw) To: Jonathan Yong; +Cc: Uros Bizjak, gcc-patches Hi! On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote: > I guess that Comment #9 patch form the PR should be trivially correct, > but althouhg it looks obvious, I don't want to propose the patch since > I have no means of testing it. I don't have means of testing it either. https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019 is quite explicit that [xyz]mm16-31 are call clobbered and only xmm6-15 (low 128-bits only) are call preserved. Jonathan, could you please test this if it is sufficient to just change CALL_USED_REGISTERS or if e.g. something in the pro/epilogue needs tweaking too? Thanks. We are talking e.g. about /* { dg-options "-O2 -mabi=ms -mavx512vl" } */ typedef double V __attribute__((vector_size (16))); void foo (void); V bar (void); void baz (V); void qux (void) { V c; { register V a __asm ("xmm18"); V b = bar (); asm ("" : "=x" (a) : "0" (b)); c = a; } foo (); { register V d __asm ("xmm18"); V e; d = c; asm ("" : "=x" (e) : "0" (d)); baz (e); } } where according to the MSDN doc gcc incorrectly holds the c value in xmm18 register across the foo call; if foo is compiled by some Microsoft compiler (or LLVM), then it could clobber %xmm18. If all xmm18 occurrences are changed to say xmm15, then it is valid to hold the 128-bit value across the foo call (though, surprisingly, LLVM saves it into stack anyway). 2020-02-04 Uroš Bizjak <ubizjak@gmail.com> * config/i386/config/i386/i386.h (CALL_USED_REGISTERS): Make xmm16-xmm31 call-used even in 64-bit ms-abi. 
--- gcc/config/i386/config/i386/i386.h.jj 2020-01-22 10:19:24.199221986 +0100 +++ gcc/config/i386/config/i386/i386.h 2020-02-04 12:09:12.338341003 +0100 @@ -1128,9 +1128,9 @@ extern const char *host_detect_local_cpu /*xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15*/ \ 6, 6, 6, 6, 6, 6, 6, 6, \ /*xmm16,xmm17,xmm18,xmm19,xmm20,xmm21,xmm22,xmm23*/ \ - 6, 6, 6, 6, 6, 6, 6, 6, \ + 1, 1, 1, 1, 1, 1, 1, 1, \ /*xmm24,xmm25,xmm26,xmm27,xmm28,xmm29,xmm30,xmm31*/ \ - 6, 6, 6, 6, 6, 6, 6, 6, \ + 1, 1, 1, 1, 1, 1, 1, 1, \ /* k0, k1, k2, k3, k4, k5, k6, k7*/ \ 1, 1, 1, 1, 1, 1, 1, 1 } Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI 2020-02-04 11:42 ` [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI Jakub Jelinek @ 2020-02-06 1:00 ` JonY 2020-02-06 6:07 ` Jakub Jelinek 0 siblings, 1 reply; 29+ messages in thread From: JonY @ 2020-02-06 1:00 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Uros Bizjak, gcc-patches [-- Attachment #1.1: Type: text/plain, Size: 800 bytes --] On 2/4/20 11:42 AM, Jakub Jelinek wrote: > Hi! > > On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote: >> I guess that Comment #9 patch form the PR should be trivially correct, >> but althouhg it looks obvious, I don't want to propose the patch since >> I have no means of testing it. > > I don't have means of testing it either. > https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019 > is quite explicit that [xyz]mm16-31 are call clobbered and only xmm6-15 (low > 128-bits only) are call preserved. > > Jonathan, could you please test this if it is sufficient to just change > CALL_USED_REGISTERS or if e.g. something in the pro/epilogue needs tweaking > too? Thanks. Is this patch testing still required? I just got back from traveling. [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI 2020-02-06 1:00 ` JonY @ 2020-02-06 6:07 ` Jakub Jelinek 2020-02-07 10:57 ` JonY 0 siblings, 1 reply; 29+ messages in thread From: Jakub Jelinek @ 2020-02-06 6:07 UTC (permalink / raw) To: JonY; +Cc: Uros Bizjak, gcc-patches On Thu, Feb 06, 2020 at 01:00:36AM +0000, JonY wrote: > On 2/4/20 11:42 AM, Jakub Jelinek wrote: > > Hi! > > > > On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote: > >> I guess that Comment #9 patch form the PR should be trivially correct, > >> but althouhg it looks obvious, I don't want to propose the patch since > >> I have no means of testing it. > > > > I don't have means of testing it either. > > https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019 > > is quite explicit that [xyz]mm16-31 are call clobbered and only xmm6-15 (low > > 128-bits only) are call preserved. > > > > Jonathan, could you please test this if it is sufficient to just change > > CALL_USED_REGISTERS or if e.g. something in the pro/epilogue needs tweaking > > too? Thanks. > > Is this patch testing still required? I just got back from traveling. Yes, our reading of the MS ABI docs show that xmm16-31 are to be call used (not preserved over calls), while in gcc they are currently handled as preserved across the calls. Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI 2020-02-06 6:07 ` Jakub Jelinek @ 2020-02-07 10:57 ` JonY 2020-02-07 11:28 ` Jakub Jelinek 0 siblings, 1 reply; 29+ messages in thread From: JonY @ 2020-02-07 10:57 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Uros Bizjak, gcc-patches [-- Attachment #1.1: Type: text/plain, Size: 2421 bytes --] On 2/6/20 6:07 AM, Jakub Jelinek wrote: > On Thu, Feb 06, 2020 at 01:00:36AM +0000, JonY wrote: >> On 2/4/20 11:42 AM, Jakub Jelinek wrote: >>> Hi! >>> >>> On Tue, Feb 04, 2020 at 11:16:06AM +0100, Uros Bizjak wrote: >>>> I guess that Comment #9 patch form the PR should be trivially correct, >>>> but althouhg it looks obvious, I don't want to propose the patch since >>>> I have no means of testing it. >>> >>> I don't have means of testing it either. >>> https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019 >>> is quite explicit that [xyz]mm16-31 are call clobbered and only xmm6-15 (low >>> 128-bits only) are call preserved. >>> >>> Jonathan, could you please test this if it is sufficient to just change >>> CALL_USED_REGISTERS or if e.g. something in the pro/epilogue needs tweaking >>> too? Thanks. >> >> Is this patch testing still required? I just got back from traveling. > > Yes, our reading of the MS ABI docs show that xmm16-31 are to be call used > (not preserved over calls), while in gcc they are currently handled as > preserved across the calls. 
> > Jakub > --- original.s 2020-02-06 09:00:02.014638069 +0000 +++ new.s 2020-02-07 10:28:55.678317667 +0000 @@ -7,23 +7,23 @@ qux: subq $72, %rsp .seh_stackalloc 72 - vmovaps %xmm18, 48(%rsp) - .seh_savexmm %xmm18, 48 + vmovaps %xmm6, 48(%rsp) + .seh_savexmm %xmm6, 48 .seh_endprologue call bar vmovapd %xmm0, %xmm1 - vmovapd %xmm1, %xmm18 + vmovapd %xmm1, %xmm6 call foo leaq 32(%rsp), %rcx - vmovapd %xmm18, %xmm0 - vmovaps %xmm0, 32(%rsp) + vmovapd %xmm6, %xmm0 + vmovapd %xmm0, 32(%rsp) call baz nop - vmovaps 48(%rsp), %xmm18 + vmovaps 48(%rsp), %xmm6 addq $72, %rsp ret .seh_endproc - .ident "GCC: (GNU) 10.0.0 20191024 (experimental)" + .ident "GCC: (GNU) 10.0.1 20200206 (experimental)" .def bar; .scl 2; .type 32; .endef .def foo; .scl 2; .type 32; .endef .def baz; .scl 2; .type 32; .endef GCC with the patch now seems to put the variables in xmm6, unfortunately I don't know enough of AVX or stack setups to know if that's all that is needed. [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI 2020-02-07 10:57 ` JonY @ 2020-02-07 11:28 ` Jakub Jelinek 2020-02-08 8:24 ` JonY 0 siblings, 1 reply; 29+ messages in thread From: Jakub Jelinek @ 2020-02-07 11:28 UTC (permalink / raw) To: JonY; +Cc: Uros Bizjak, gcc-patches On Fri, Feb 07, 2020 at 10:57:22AM +0000, JonY wrote: > >> Is this patch testing still required? I just got back from traveling. > > > > Yes, our reading of the MS ABI docs show that xmm16-31 are to be call used > > (not preserved over calls), while in gcc they are currently handled as > > preserved across the calls. The other parts are I guess mainly about SEH. Consider e.g. void foo (void) { register double x __asm ("xmm14"); register double y __asm ("xmm18"); asm ("" : "=x" (x)); asm ("" : "=v" (y)); x += y; y += x; asm ("" : : "x" (x)); asm ("" : : "v" (y)); } looking at cross-compiler output, with -O2 -mavx512f this emits .file "abcdeq.c" .text .align 16 .globl foo .def foo; .scl 2; .type 32; .endef .seh_proc foo foo: subq $40, %rsp .seh_stackalloc 40 vmovaps %xmm14, (%rsp) .seh_savexmm %xmm14, 0 vmovaps %xmm18, 16(%rsp) .seh_savexmm %xmm18, 16 .seh_endprologue vaddsd %xmm18, %xmm14, %xmm14 vaddsd %xmm18, %xmm14, %xmm18 vmovaps (%rsp), %xmm14 vmovaps 16(%rsp), %xmm18 addq $40, %rsp ret .seh_endproc .ident "GCC: (GNU) 10.0.1 20200207 (experimental)" Does whatever assembler mingw64 uses even assemble this (I mean the .seh_savexmm %xmm16, 16 could be problematic)? I can find e.g. https://stackoverflow.com/questions/43152633/invalid-register-for-seh-savexmm-in-cygwin/43210527 which then links to https://gcc.gnu.org/PR65782 So, I'd say we want to add PR target/65782 to the ChangeLog entry in the patch, and likely add a testcase like the above, so like below? Have you tested the earlier version of the patch on mingw64 or cygwin? If yes, can you just test the testcase from the following patch, without and with the i386.h part and verify it FAILs without and PASSes with it? 
Just make check-gcc RUNTESTFLAGS=i386.exp=pr65782.c should do? If not, can you please test the whole patch? 2020-02-07 Uroš Bizjak <ubizjak@gmail.com> Jakub Jelinek <jakub@redhat.com> PR target/65782 * config/i386/config/i386/i386.h (CALL_USED_REGISTERS): Make xmm16-xmm31 call-used even in 64-bit ms-abi. * gcc.target/i386/pr65782.c: New test. --- gcc/config/i386/config/i386/i386.h.jj 2020-01-22 10:19:24.199221986 +0100 +++ gcc/config/i386/config/i386/i386.h 2020-02-04 12:09:12.338341003 +0100 @@ -1128,9 +1128,9 @@ extern const char *host_detect_local_cpu /*xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15*/ \ 6, 6, 6, 6, 6, 6, 6, 6, \ /*xmm16,xmm17,xmm18,xmm19,xmm20,xmm21,xmm22,xmm23*/ \ - 6, 6, 6, 6, 6, 6, 6, 6, \ + 1, 1, 1, 1, 1, 1, 1, 1, \ /*xmm24,xmm25,xmm26,xmm27,xmm28,xmm29,xmm30,xmm31*/ \ - 6, 6, 6, 6, 6, 6, 6, 6, \ + 1, 1, 1, 1, 1, 1, 1, 1, \ /* k0, k1, k2, k3, k4, k5, k6, k7*/ \ 1, 1, 1, 1, 1, 1, 1, 1 } --- gcc/testsuite/gcc.target/i386/pr65782.c.jj 2020-02-07 12:21:09.472819018 +0100 +++ gcc/testsuite/gcc.target/i386/pr65782.c 2020-02-07 12:24:06.820154495 +0100 @@ -0,0 +1,16 @@ +/* PR target/65782 */ +/* { dg-do assemble { target { avx512vl && { ! ia32 } } } } */ +/* { dg-options "-O2 -mavx512vl" } */ + +void +foo (void) +{ + register double x __asm ("xmm14"); + register double y __asm ("xmm18"); + asm ("" : "=x" (x)); + asm ("" : "=v" (y)); + x += y; + y += x; + asm ("" : : "x" (x)); + asm ("" : : "v" (y)); +} Jakub ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI
  2020-02-07 11:28         ` Jakub Jelinek
@ 2020-02-08  8:24           ` JonY
  2020-02-08 10:05             ` Jakub Jelinek
  0 siblings, 1 reply; 29+ messages in thread
From: JonY @ 2020-02-08 8:24 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Uros Bizjak, gcc-patches

On 2/7/20 11:28 AM, Jakub Jelinek wrote:
> On Fri, Feb 07, 2020 at 10:57:22AM +0000, JonY wrote:
>>>> Is this patch testing still required? I just got back from traveling.
>>>
>>> Yes, our reading of the MS ABI docs show that xmm16-31 are to be call used
>>> (not preserved over calls), while in gcc they are currently handled as
>>> preserved across the calls.
>
> The other parts are I guess mainly about SEH.  Consider e.g.
> void
> foo (void)
> {
>   register double x __asm ("xmm14");
>   register double y __asm ("xmm18");
>   asm ("" : "=x" (x));
>   asm ("" : "=v" (y));
>   x += y;
>   y += x;
>   asm ("" : : "x" (x));
>   asm ("" : : "v" (y));
> }
> looking at cross-compiler output, with -O2 -mavx512f this emits
> 	.file	"abcdeq.c"
> 	.text
> 	.align 16
> 	.globl	foo
> 	.def	foo;	.scl	2;	.type	32;	.endef
> 	.seh_proc	foo
> foo:
> 	subq	$40, %rsp
> 	.seh_stackalloc	40
> 	vmovaps	%xmm14, (%rsp)
> 	.seh_savexmm	%xmm14, 0
> 	vmovaps	%xmm18, 16(%rsp)
> 	.seh_savexmm	%xmm18, 16
> 	.seh_endprologue
> 	vaddsd	%xmm18, %xmm14, %xmm14
> 	vaddsd	%xmm18, %xmm14, %xmm18
> 	vmovaps	(%rsp), %xmm14
> 	vmovaps	16(%rsp), %xmm18
> 	addq	$40, %rsp
> 	ret
> 	.seh_endproc
> 	.ident	"GCC: (GNU) 10.0.1 20200207 (experimental)"
> Does whatever assembler mingw64 uses even assemble this (I mean the
> .seh_savexmm %xmm16, 16 could be problematic)?

It does not, I just checked with the master branch of binutils.

> I can find e.g.
> https://stackoverflow.com/questions/43152633/invalid-register-for-seh-savexmm-in-cygwin/43210527
> which then links to
> https://gcc.gnu.org/PR65782
>
> So, I'd say we want to add PR target/65782 to the ChangeLog entry in the
> patch, and likely add a testcase like the above, so like below?
>
> Have you tested the earlier version of the patch on mingw64 or cygwin?
> If yes, can you just test the testcase from the following patch, without and
> with the i386.h part and verify it FAILs without and PASSes with it?
> Just make check-gcc RUNTESTFLAGS=i386.exp=pr65782.c should do?
> If not, can you please test the whole patch?

I did a -c test build with an older toolchain, it fails to compile
(invalid register for .seh_savexmm) while the latest gcc is passing,
both are using the same binutils version.

I guess that covers it?

> 2020-02-07  Uroš Bizjak  <ubizjak@gmail.com>
> 	    Jakub Jelinek  <jakub@redhat.com>
>
> 	PR target/65782
> 	* config/i386/config/i386/i386.h (CALL_USED_REGISTERS): Make
> 	xmm16-xmm31 call-used even in 64-bit ms-abi.
>
> 	* gcc.target/i386/pr65782.c: New test.
>
> --- gcc/config/i386/config/i386/i386.h.jj	2020-01-22 10:19:24.199221986 +0100
> +++ gcc/config/i386/config/i386/i386.h	2020-02-04 12:09:12.338341003 +0100
> @@ -1128,9 +1128,9 @@ extern const char *host_detect_local_cpu
>  /*xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15*/		\
>       6,   6,    6,    6,    6,    6,    6,    6,		\
>  /*xmm16,xmm17,xmm18,xmm19,xmm20,xmm21,xmm22,xmm23*/		\
> -     6,    6,    6,    6,    6,    6,    6,    6,		\
> +     1,    1,    1,    1,    1,    1,    1,    1,		\
>  /*xmm24,xmm25,xmm26,xmm27,xmm28,xmm29,xmm30,xmm31*/		\
> -     6,    6,    6,    6,    6,    6,    6,    6,		\
> +     1,    1,    1,    1,    1,    1,    1,    1,		\
>  /* k0,  k1, k2, k3, k4, k5, k6, k7*/				\
>       1,   1,  1,  1,  1,  1,  1,  1 }
>
> --- gcc/testsuite/gcc.target/i386/pr65782.c.jj	2020-02-07 12:21:09.472819018 +0100
> +++ gcc/testsuite/gcc.target/i386/pr65782.c	2020-02-07 12:24:06.820154495 +0100
> @@ -0,0 +1,16 @@
> +/* PR target/65782 */
> +/* { dg-do assemble { target { avx512vl && { !
> ia32 } } } } */
> +/* { dg-options "-O2 -mavx512vl" } */
> +
> +void
> +foo (void)
> +{
> +  register double x __asm ("xmm14");
> +  register double y __asm ("xmm18");
> +  asm ("" : "=x" (x));
> +  asm ("" : "=v" (y));
> +  x += y;
> +  y += x;
> +  asm ("" : : "x" (x));
> +  asm ("" : : "v" (y));
> +}
>
> 	Jakub
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI
  2020-02-08  8:24           ` JonY
@ 2020-02-08 10:05             ` Jakub Jelinek
  2020-02-08 10:32               ` Uros Bizjak
  0 siblings, 1 reply; 29+ messages in thread
From: Jakub Jelinek @ 2020-02-08 10:05 UTC (permalink / raw)
  To: JonY; +Cc: Uros Bizjak, gcc-patches

On Sat, Feb 08, 2020 at 08:24:38AM +0000, JonY wrote:
> It does not, I just checked with the master branch of binutils.
...
> I did a -c test build with an older toolchain, it fails to compile
> (invalid register for .seh_savexmm) while the latest gcc is passing,
> both are using the same binutils version.
>
> I guess that covers it?

I went ahead and committed the patch (with the double config/i386/ in it
fixed obviously).

> > 2020-02-07  Uroš Bizjak  <ubizjak@gmail.com>
> > 	    Jakub Jelinek  <jakub@redhat.com>
> >
> > 	PR target/65782
> > 	* config/i386/config/i386/i386.h (CALL_USED_REGISTERS): Make
> > 	xmm16-xmm31 call-used even in 64-bit ms-abi.
> >
> > 	* gcc.target/i386/pr65782.c: New test.

	Jakub
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI
  2020-02-08 10:05             ` Jakub Jelinek
@ 2020-02-08 10:32               ` Uros Bizjak
  2020-02-08 10:34                 ` Jakub Jelinek
  2020-02-08 10:52                 ` Jakub Jelinek
  0 siblings, 2 replies; 29+ messages in thread
From: Uros Bizjak @ 2020-02-08 10:32 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: JonY, gcc-patches

On Sat, Feb 8, 2020 at 11:05 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Sat, Feb 08, 2020 at 08:24:38AM +0000, JonY wrote:
> > It does not, I just checked with the master branch of binutils.
> ...
> > I did a -c test build with an older toolchain, it fails to compile
> > (invalid register for .seh_savexmm) while the latest gcc is passing,
> > both are using the same binutils version.
> >
> > I guess that covers it?
>
> I went ahead and committed the patch (with the double config/i386/ in it
> fixed obviously).

I think that the patch should also be backported to gcc-9 branch. The
change is backward compatible, since the new code will save and
restore zmm16+ registers at the caller site, and the old code (e.g.
existing libraries) will then unnecessarily save and restore regs in
the callee.

HJ fixed some other MSABI issue there.

Uros.

> > > 2020-02-07  Uroš Bizjak  <ubizjak@gmail.com>
> > > 	    Jakub Jelinek  <jakub@redhat.com>
> > >
> > > 	PR target/65782
> > > 	* config/i386/config/i386/i386.h (CALL_USED_REGISTERS): Make
> > > 	xmm16-xmm31 call-used even in 64-bit ms-abi.
> > >
> > > 	* gcc.target/i386/pr65782.c: New test.
>
> 	Jakub
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI
  2020-02-08 10:32               ` Uros Bizjak
@ 2020-02-08 10:34                 ` Jakub Jelinek
  0 siblings, 0 replies; 29+ messages in thread
From: Jakub Jelinek @ 2020-02-08 10:34 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: JonY, gcc-patches

On Sat, Feb 08, 2020 at 11:32:40AM +0100, Uros Bizjak wrote:
> On Sat, Feb 8, 2020 at 11:05 AM Jakub Jelinek <jakub@redhat.com> wrote:
> >
> > On Sat, Feb 08, 2020 at 08:24:38AM +0000, JonY wrote:
> > > It does not, I just checked with the master branch of binutils.
> > ...
> > > I did a -c test build with an older toolchain, it fails to compile
> > > (invalid register for .seh_savexmm) while the latest gcc is passing,
> > > both are using the same binutils version.
> > >
> > > I guess that covers it?
> >
> > I went ahead and committed the patch (with the double config/i386/ in it
> > fixed obviously).
>
> I think that the patch should also be backported to gcc-9 branch. The
> change is backward compatible, since the new code will save and
> restore zmm16+ registers at the caller site, and the old code (e.g.
> existing libraries) will then unnecessarily save and restore regs in
> the callee.

Agreed, will do together with other backports soon.

	Jakub
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI
  2020-02-08 10:32               ` Uros Bizjak
  2020-02-08 10:34                 ` Jakub Jelinek
@ 2020-02-08 10:52                 ` Jakub Jelinek
  2020-02-08 12:52                   ` Uros Bizjak
  1 sibling, 1 reply; 29+ messages in thread
From: Jakub Jelinek @ 2020-02-08 10:52 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: JonY, gcc-patches

On Sat, Feb 08, 2020 at 11:32:40AM +0100, Uros Bizjak wrote:
> I think that the patch should also be backported to gcc-9 branch. The
> change is backward compatible, since the new code will save and
> restore zmm16+ registers at the caller site, and the old code (e.g.
> existing libraries) will then unnecessarily save and restore regs in
> the callee.

Actually, it isn't backward compatible.
If both caller and callee touch xmm16+ (sure, unlikely), then if caller
is compiled by new gcc and callee by old gcc it is the case you described,
caller doesn't try to preserve values across the call and callee uselessly
saves them anyway (unless it fails to assemble due to SEH).
But if caller is compiled by old gcc and callee by new gcc, caller assumes
it can preserve values across the call in those registers and callee doesn't
save them and clobbers them.

	Jakub
* Re: [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI
  2020-02-08 10:52                 ` Jakub Jelinek
@ 2020-02-08 12:52                   ` Uros Bizjak
  0 siblings, 0 replies; 29+ messages in thread
From: Uros Bizjak @ 2020-02-08 12:52 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: JonY, gcc-patches

On Sat, Feb 8, 2020 at 11:52 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Sat, Feb 08, 2020 at 11:32:40AM +0100, Uros Bizjak wrote:
> > I think that the patch should also be backported to gcc-9 branch. The
> > change is backward compatible, since the new code will save and
> > restore zmm16+ registers at the caller site, and the old code (e.g.
> > existing libraries) will then unnecessarily save and restore regs in
> > the callee.
>
> Actually, it isn't backward compatible.
> If both caller and callee touch xmm16+ (sure, unlikely), then if caller
> is compiled by new gcc and callee by old gcc it is the case you described,
> caller doesn't try to preserve values across the call and callee uselessly
> saves them anyway (unless it fails to assemble due to SEH).
> But if caller is compiled by old gcc and callee by new gcc, caller assumes
> it can preserve values across the call in those registers and callee doesn't
> save them and clobbers them.

I was considering only the most likely scenario where the compiler is
upgraded without recompiling existing system libraries. So, with the
new compiler installed, the new compiled code calls either new or old
libraries. *Assuming* that old libraries never call new code, the
situation is OK. As you point out, when the old code calls new code,
registers get clobbered.

By not backporting the patch, we can consider gcc-9 and gcc-10 ABI
incompatible, otherwise the forward incompatibility moves in between
gcc-9.2 and gcc-9.3. I agree that the former is somehow more
convenient. IMO, this issue should be pointed out in the compiler
release notes in any case.

Uros.
end of thread, other threads:[~2020-02-08 12:52 UTC | newest]

Thread overview: 29+ messages:
2020-02-04  9:39 [PATCH] i386: Omit clobbers from vzeroupper until final [PR92190] Jakub Jelinek
2020-02-04 10:16 ` Uros Bizjak
2020-02-04 11:05   ` Jakub Jelinek
2020-02-04 11:13     ` Uros Bizjak
2020-02-04 11:24       ` Uros Bizjak
2020-02-04 12:06         ` Richard Sandiford
2020-02-04 12:23           ` Uros Bizjak
2020-02-08  1:50             ` Segher Boessenkool
2020-02-04 12:31         ` Jakub Jelinek
2020-02-04 12:39           ` Uros Bizjak
2020-02-04 13:13             ` Jakub Jelinek
2020-02-04 13:15               ` Uros Bizjak
2020-02-05 10:05                 ` Jakub Jelinek
2020-02-05 10:12                   ` Jakub Jelinek
2020-02-05 10:47                     ` Uros Bizjak
2020-02-05 11:03                       ` Jakub Jelinek
2020-02-05 11:11                         ` Uros Bizjak
2020-02-05 11:24                           ` Jakub Jelinek
2020-02-04 11:42 ` [PATCH] i386: Make xmm16-xmm31 call used even in ms ABI Jakub Jelinek
2020-02-06  1:00   ` JonY
2020-02-06  6:07     ` Jakub Jelinek
2020-02-07 10:57       ` JonY
2020-02-07 11:28         ` Jakub Jelinek
2020-02-08  8:24           ` JonY
2020-02-08 10:05             ` Jakub Jelinek
2020-02-08 10:32               ` Uros Bizjak
2020-02-08 10:34                 ` Jakub Jelinek
2020-02-08 10:52                 ` Jakub Jelinek
2020-02-08 12:52                   ` Uros Bizjak