public inbox for gcc-patches@gcc.gnu.org
* [PATCH] [x86]Don't optimize cmp mem, 0 to load mem, reg + test reg, reg
@ 2022-09-16  1:06 liuhongt
  2022-09-16  1:13 ` Hongtao Liu
  2022-09-16  1:31 ` Jeff Law
  0 siblings, 2 replies; 6+ messages in thread
From: liuhongt @ 2022-09-16  1:06 UTC (permalink / raw)
  To: gcc-patches; +Cc: ubizjak, hjl.tools

There's a peephole2, submitted in the 1990s, which splits cmp mem, 0 into
load mem, reg + test reg, reg. I don't know the exact reason why gcc does this.

For the latest x86 processors, keeping the memory operand (ciscization)
should help the processor frontend and also code size; for the processor
backend, the two forms should be the same (they decode to the same uops).

So the patch deletes the peephole2, and also modifies another splitter to
generate more cmp mem, 0 for 32-bit targets.

It will help instruction fetch.
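
As an illustration (a sketch only, not taken from the testsuite; the
register names and branch layout in the comment are assumptions and
depend on options), the two sequences for a compare of memory against
zero look like:

/* Illustrative example; assumes -O2 on x86-64.  */
extern int x;
extern int bar (void);

int
foo (void)
{
  /* With the old peephole2 the compare was split into load + test:
	movl	x(%rip), %eax
	testl	%eax, %eax
	je	.L1
     without it, the memory operand is kept:
	cmpl	$0, x(%rip)
	je	.L1  */
  if (x)
    return bar ();
  return 0;	/* .L1 */
}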

For minmax-1.c, minmax-2.c, minmax-10.c and pr96861.c, the tests are
supposed to check that there is no comparison against 1 or -1, so adjust
them: under a 32-bit target we now generate cmp mem, 0 instead of
load + test.

pr78035.c is adjusted similarly, counting either cmp or test.

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
No performance impact for SPEC2017 on ICX/Znver3.

Ok for trunk?

gcc/ChangeLog:

	* config/i386/i386.md (*<code><mode>3_1): Replace
	register_operand with nonimmediate_operand for operand 1. Also
	force_reg it when mode is QImode.
	(define_peephole2): Delete the related peephole2.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/minmax-1.c: Use scan-assembler-not for cmp
	against 1 or -1; don't scan-assembler for test under ia32.
	* gcc.target/i386/minmax-10.c: Ditto.
	* gcc.target/i386/minmax-2.c: Ditto.
	* gcc.target/i386/pr78035.c: Scan for either cmp or test 3 times.
	* gcc.target/i386/pr96861.c: Adjust as for minmax-1.c.
---
 gcc/config/i386/i386.md                   | 18 +++++-------------
 gcc/testsuite/gcc.target/i386/minmax-1.c  |  4 ++--
 gcc/testsuite/gcc.target/i386/minmax-10.c |  4 ++--
 gcc/testsuite/gcc.target/i386/minmax-2.c  |  4 ++--
 gcc/testsuite/gcc.target/i386/pr78035.c   |  2 +-
 gcc/testsuite/gcc.target/i386/pr96861.c   |  4 ++--
 6 files changed, 14 insertions(+), 22 deletions(-)

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 1be9b669909..93b905beb72 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -21871,7 +21871,7 @@ (define_insn_and_split "*<code><dwi>3_doubleword"
 (define_insn_and_split "*<code><mode>3_1"
   [(set (match_operand:SWI 0 "register_operand")
 	(maxmin:SWI
-	  (match_operand:SWI 1 "register_operand")
+	  (match_operand:SWI 1 "nonimmediate_operand")
 	  (match_operand:SWI 2 "general_operand")))
    (clobber (reg:CC FLAGS_REG))]
   "TARGET_CMOVE
@@ -21886,9 +21886,12 @@ (define_insn_and_split "*<code><mode>3_1"
 {
   machine_mode mode = <MODE>mode;
   rtx cmp_op = operands[2];
-
   operands[2] = force_reg (mode, cmp_op);
 
+  /* movqicc_noc only supports register_operand for op1.  */
+  if (mode == QImode)
+    operands[1] = force_reg (mode, operands[1]);
+
   enum rtx_code code = <maxmin_rel>;
 
   if (cmp_op == const1_rtx)
@@ -22482,17 +22485,6 @@ (define_peephole2
   [(set (match_dup 2) (match_dup 1))
    (set (match_dup 0) (match_dup 2))])
 
-;; Don't compare memory with zero, load and use a test instead.
-(define_peephole2
-  [(set (match_operand 0 "flags_reg_operand")
- 	(match_operator 1 "compare_operator"
-	  [(match_operand:SI 2 "memory_operand")
-	   (const_int 0)]))
-   (match_scratch:SI 3 "r")]
-  "optimize_insn_for_speed_p () && ix86_match_ccmode (insn, CCNOmode)"
-  [(set (match_dup 3) (match_dup 2))
-   (set (match_dup 0) (match_op_dup 1 [(match_dup 3) (const_int 0)]))])
-
 ;; NOT is not pairable on Pentium, while XOR is, but one byte longer.
 ;; Don't split NOTs with a displacement operand, because resulting XOR
 ;; will not be pairable anyway.
diff --git a/gcc/testsuite/gcc.target/i386/minmax-1.c b/gcc/testsuite/gcc.target/i386/minmax-1.c
index 0ec35b1c5a1..840b32c5414 100644
--- a/gcc/testsuite/gcc.target/i386/minmax-1.c
+++ b/gcc/testsuite/gcc.target/i386/minmax-1.c
@@ -1,7 +1,7 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -march=opteron -mno-stv" } */
-/* { dg-final { scan-assembler "test" } } */
-/* { dg-final { scan-assembler-not "cmp" } } */
+/* { dg-final { scan-assembler "test" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-not {(?n)cmp.*[$]+1} } } */
 #define max(a,b) (((a) > (b))? (a) : (b))
 int
 t(int a)
diff --git a/gcc/testsuite/gcc.target/i386/minmax-10.c b/gcc/testsuite/gcc.target/i386/minmax-10.c
index b044462c5a9..1dd2eedf435 100644
--- a/gcc/testsuite/gcc.target/i386/minmax-10.c
+++ b/gcc/testsuite/gcc.target/i386/minmax-10.c
@@ -34,5 +34,5 @@ unsigned int umin1(unsigned int x)
   return min(x,1);
 }
 
-/* { dg-final { scan-assembler-times "test" 6 } } */
-/* { dg-final { scan-assembler-not "cmp" } } */
+/* { dg-final { scan-assembler-times "test" 6 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-not {(?n)cmp.*1} } } */
diff --git a/gcc/testsuite/gcc.target/i386/minmax-2.c b/gcc/testsuite/gcc.target/i386/minmax-2.c
index af9baeaaf7c..2c82f6cecb9 100644
--- a/gcc/testsuite/gcc.target/i386/minmax-2.c
+++ b/gcc/testsuite/gcc.target/i386/minmax-2.c
@@ -1,7 +1,7 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mno-stv" } */
-/* { dg-final { scan-assembler "test" } } */
-/* { dg-final { scan-assembler-not "cmp" } } */
+/* { dg-final { scan-assembler "test" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-not {(?n)cmp.*[$]1} } } */
 #define max(a,b) (((a) > (b))? (a) : (b))
 unsigned int
 t(unsigned int a)
diff --git a/gcc/testsuite/gcc.target/i386/pr78035.c b/gcc/testsuite/gcc.target/i386/pr78035.c
index 7d3a983b218..d543d3f1d38 100644
--- a/gcc/testsuite/gcc.target/i386/pr78035.c
+++ b/gcc/testsuite/gcc.target/i386/pr78035.c
@@ -22,4 +22,4 @@ int bar ()
 }
 
 /* We should not optimize away either comparison.  */
-/* { dg-final { scan-assembler-times "cmp" 2 } } */
+/* { dg-final { scan-assembler-times "(?:cmp|test)" 3 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr96861.c b/gcc/testsuite/gcc.target/i386/pr96861.c
index 7b7aeccb83c..8c0f0841f7d 100644
--- a/gcc/testsuite/gcc.target/i386/pr96861.c
+++ b/gcc/testsuite/gcc.target/i386/pr96861.c
@@ -34,5 +34,5 @@ unsigned int umin1(unsigned int x)
   return min(x,1);
 }
 
-/* { dg-final { scan-assembler-times "test" 6 } } */
-/* { dg-final { scan-assembler-not "cmp" } } */
+/* { dg-final { scan-assembler-times "test" 6 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-not {(?n)cmp.*[$]+1} } } */
-- 
2.18.1



* Re: [PATCH] [x86]Don't optimize cmp mem, 0 to load mem, reg + test reg, reg
  2022-09-16  1:06 [PATCH] [x86]Don't optimize cmp mem, 0 to load mem, reg + test reg, reg liuhongt
@ 2022-09-16  1:13 ` Hongtao Liu
  2022-09-16  1:31 ` Jeff Law
  1 sibling, 0 replies; 6+ messages in thread
From: Hongtao Liu @ 2022-09-16  1:13 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches, H. J. Lu

On Fri, Sep 16, 2022 at 9:09 AM liuhongt via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> There's peephole2 submit in 1990s which split cmp mem, 0 to load mem,
> reg + test reg, reg. I don't know exact reason why gcc do this.
>
> For latest x86 processors, ciscization should help processor frontend
> also codesize, for processor backend, they should be the same(has same
> uops).
>
> So the patch deleted the peephole2, and also modify another splitter to
> generate more cmp mem, 0 for 32-bit target.
>
> It will help instruction fetch.
>
> for minmax-1.c minmax-2.c minmax-10, pr96891.c, it's supposed to scan there's no
> comparison to 1 or -1, so adjust the testcase since under 32-bit
> target, we now generate cmp mem, 0 instead of load + test.
>
> Similar for pr78035.c.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
> No performance impact for SPEC2017 on ICX/Znver3.
>
> Ok for trunk?
>
> gcc/ChangeLog:
>
>         * config/i386/i386.md (*<code><mode>3_1): Replace
>         register_operand with nonimmediate_operand for operand 1. Also
>         force_reg it when mode is QImode.
>         (define_peephole2): Deleted related peephole2.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/minmax-1.c: Scan-assemble-not for cmp with 1
>         or -1, also don't scan-assembler test for ia32.
>         * gcc.target/i386/minmax-10.c: Ditto.
>         * gcc.target/i386/minmax-2.c: Ditto.
>         * gcc.target/i386/pr78035.c: Ditto.
>         * gcc.target/i386/pr96861.c: Scan either cmp or test 3 times.
> ---
>  gcc/config/i386/i386.md                   | 18 +++++-------------
>  gcc/testsuite/gcc.target/i386/minmax-1.c  |  4 ++--
>  gcc/testsuite/gcc.target/i386/minmax-10.c |  4 ++--
>  gcc/testsuite/gcc.target/i386/minmax-2.c  |  4 ++--
>  gcc/testsuite/gcc.target/i386/pr78035.c   |  2 +-
>  gcc/testsuite/gcc.target/i386/pr96861.c   |  4 ++--
>  6 files changed, 14 insertions(+), 22 deletions(-)
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 1be9b669909..93b905beb72 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -21871,7 +21871,7 @@ (define_insn_and_split "*<code><dwi>3_doubleword"
>  (define_insn_and_split "*<code><mode>3_1"
>    [(set (match_operand:SWI 0 "register_operand")
>         (maxmin:SWI
> -         (match_operand:SWI 1 "register_operand")
> +         (match_operand:SWI 1 "nonimmediate_operand")
>           (match_operand:SWI 2 "general_operand")))
>     (clobber (reg:CC FLAGS_REG))]
>    "TARGET_CMOVE
> @@ -21886,9 +21886,12 @@ (define_insn_and_split "*<code><mode>3_1"
>  {
>    machine_mode mode = <MODE>mode;
>    rtx cmp_op = operands[2];
> -
>    operands[2] = force_reg (mode, cmp_op);
>
> +  /* movqicc_noc only support register_operand for op1.  */
> +  if (mode == QImode)
> +    operands[1] = force_reg (mode, operands[1]);
> +
>    enum rtx_code code = <maxmin_rel>;
>
>    if (cmp_op == const1_rtx)
> @@ -22482,17 +22485,6 @@ (define_peephole2
>    [(set (match_dup 2) (match_dup 1))
>     (set (match_dup 0) (match_dup 2))])
>
> -;; Don't compare memory with zero, load and use a test instead.
> -(define_peephole2
> -  [(set (match_operand 0 "flags_reg_operand")
> -       (match_operator 1 "compare_operator"
> -         [(match_operand:SI 2 "memory_operand")
> -          (const_int 0)]))
> -   (match_scratch:SI 3 "r")]
> -  "optimize_insn_for_speed_p () && ix86_match_ccmode (insn, CCNOmode)"
> -  [(set (match_dup 3) (match_dup 2))
> -   (set (match_dup 0) (match_op_dup 1 [(match_dup 3) (const_int 0)]))])
> -
>  ;; NOT is not pairable on Pentium, while XOR is, but one byte longer.
>  ;; Don't split NOTs with a displacement operand, because resulting XOR
>  ;; will not be pairable anyway.
> diff --git a/gcc/testsuite/gcc.target/i386/minmax-1.c b/gcc/testsuite/gcc.target/i386/minmax-1.c
> index 0ec35b1c5a1..840b32c5414 100644
> --- a/gcc/testsuite/gcc.target/i386/minmax-1.c
> +++ b/gcc/testsuite/gcc.target/i386/minmax-1.c
> @@ -1,7 +1,7 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -march=opteron -mno-stv" } */
> -/* { dg-final { scan-assembler "test" } } */
> -/* { dg-final { scan-assembler-not "cmp" } } */
> +/* { dg-final { scan-assembler "test" { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-not {(?n)cmp.*[$]+1} } } */
>  #define max(a,b) (((a) > (b))? (a) : (b))
>  int
>  t(int a)
> diff --git a/gcc/testsuite/gcc.target/i386/minmax-10.c b/gcc/testsuite/gcc.target/i386/minmax-10.c
> index b044462c5a9..1dd2eedf435 100644
> --- a/gcc/testsuite/gcc.target/i386/minmax-10.c
> +++ b/gcc/testsuite/gcc.target/i386/minmax-10.c
> @@ -34,5 +34,5 @@ unsigned int umin1(unsigned int x)
>    return min(x,1);
>  }
>
> -/* { dg-final { scan-assembler-times "test" 6 } } */
> -/* { dg-final { scan-assembler-not "cmp" } } */
> +/* { dg-final { scan-assembler-times "test" 6 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-not {(?n)cmp.*1} } } */
> diff --git a/gcc/testsuite/gcc.target/i386/minmax-2.c b/gcc/testsuite/gcc.target/i386/minmax-2.c
> index af9baeaaf7c..2c82f6cecb9 100644
> --- a/gcc/testsuite/gcc.target/i386/minmax-2.c
> +++ b/gcc/testsuite/gcc.target/i386/minmax-2.c
> @@ -1,7 +1,7 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -mno-stv" } */
> -/* { dg-final { scan-assembler "test" } } */
> -/* { dg-final { scan-assembler-not "cmp" } } */
> +/* { dg-final { scan-assembler "test" { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-not {(?n)cmp.*[$]1} } } */
>  #define max(a,b) (((a) > (b))? (a) : (b))
>  unsigned int
>  t(unsigned int a)
> diff --git a/gcc/testsuite/gcc.target/i386/pr78035.c b/gcc/testsuite/gcc.target/i386/pr78035.c
> index 7d3a983b218..d543d3f1d38 100644
> --- a/gcc/testsuite/gcc.target/i386/pr78035.c
> +++ b/gcc/testsuite/gcc.target/i386/pr78035.c
> @@ -22,4 +22,4 @@ int bar ()
>  }
>
>  /* We should not optimize away either comparison.  */
> -/* { dg-final { scan-assembler-times "cmp" 2 } } */
> +/* { dg-final { scan-assembler-times "(?:cmp|test)" 3 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr96861.c b/gcc/testsuite/gcc.target/i386/pr96861.c
> index 7b7aeccb83c..8c0f0841f7d 100644
> --- a/gcc/testsuite/gcc.target/i386/pr96861.c
> +++ b/gcc/testsuite/gcc.target/i386/pr96861.c
> @@ -34,5 +34,5 @@ unsigned int umin1(unsigned int x)
>    return min(x,1);
>  }
>
> -/* { dg-final { scan-assembler-times "test" 6 } } */
> -/* { dg-final { scan-assembler-not "cmp" } } */
> +/* { dg-final { scan-assembler-times "test" 6 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-not {(?n)cmp.*[$]+1} } } */
> --
> 2.18.1
>


-- 
BR,
Hongtao


* Re: [PATCH] [x86]Don't optimize cmp mem, 0 to load mem, reg + test reg, reg
  2022-09-16  1:06 [PATCH] [x86]Don't optimize cmp mem, 0 to load mem, reg + test reg, reg liuhongt
  2022-09-16  1:13 ` Hongtao Liu
@ 2022-09-16  1:31 ` Jeff Law
  2022-09-16  6:47   ` Uros Bizjak
  1 sibling, 1 reply; 6+ messages in thread
From: Jeff Law @ 2022-09-16  1:31 UTC (permalink / raw)
  To: gcc-patches


On 9/15/22 19:06, liuhongt via Gcc-patches wrote:
> There's peephole2 submit in 1990s which split cmp mem, 0 to load mem,
> reg + test reg, reg. I don't know exact reason why gcc do this.
>
> For latest x86 processors, ciscization should help processor frontend
> also codesize, for processor backend, they should be the same(has same
> uops).
>
> So the patch deleted the peephole2, and also modify another splitter to
> generate more cmp mem, 0 for 32-bit target.
>
> It will help instruction fetch.
>
> for minmax-1.c minmax-2.c minmax-10, pr96891.c, it's supposed to scan there's no
> comparison to 1 or -1, so adjust the testcase since under 32-bit
> target, we now generate cmp mem, 0 instead of load + test.
>
> Similar for pr78035.c.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
> No performance impact for SPEC2017 on ICX/Znver3.
>
> Ok for trunk?
>
> gcc/ChangeLog:
>
> 	* config/i386/i386.md (*<code><mode>3_1): Replace
> 	register_operand with nonimmediate_operand for operand 1. Also
> 	force_reg it when mode is QImode.
> 	(define_peephole2): Deleted related peephole2.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/i386/minmax-1.c: Scan-assemble-not for cmp with 1
> 	or -1, also don't scan-assembler test for ia32.
> 	* gcc.target/i386/minmax-10.c: Ditto.
> 	* gcc.target/i386/minmax-2.c: Ditto.
> 	* gcc.target/i386/pr78035.c: Ditto.
> 	* gcc.target/i386/pr96861.c: Scan either cmp or test 3 times.

It was almost certainly for PPro/P2 given it was rth's work from
1999.  Probably should have been conditionalized on PPro/P2 at the
time.  No worries losing it now...


Jeff




* Re: [PATCH] [x86]Don't optimize cmp mem, 0 to load mem, reg + test reg, reg
  2022-09-16  1:31 ` Jeff Law
@ 2022-09-16  6:47   ` Uros Bizjak
  2022-09-16 13:38     ` Alexander Monakov
  0 siblings, 1 reply; 6+ messages in thread
From: Uros Bizjak @ 2022-09-16  6:47 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches

On Fri, Sep 16, 2022 at 3:32 AM Jeff Law via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
>
> On 9/15/22 19:06, liuhongt via Gcc-patches wrote:
> > There's peephole2 submit in 1990s which split cmp mem, 0 to load mem,
> > reg + test reg, reg. I don't know exact reason why gcc do this.
> >
> > For latest x86 processors, ciscization should help processor frontend
> > also codesize, for processor backend, they should be the same(has same
> > uops).
> >
> > So the patch deleted the peephole2, and also modify another splitter to
> > generate more cmp mem, 0 for 32-bit target.
> >
> > It will help instruction fetch.
> >
> > for minmax-1.c minmax-2.c minmax-10, pr96891.c, it's supposed to scan there's no
> > comparison to 1 or -1, so adjust the testcase since under 32-bit
> > target, we now generate cmp mem, 0 instead of load + test.
> >
> > Similar for pr78035.c.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
> > No performance impact for SPEC2017 on ICX/Znver3.
> >
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> >       * config/i386/i386.md (*<code><mode>3_1): Replace
> >       register_operand with nonimmediate_operand for operand 1. Also
> >       force_reg it when mode is QImode.
> >       (define_peephole2): Deleted related peephole2.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.target/i386/minmax-1.c: Scan-assemble-not for cmp with 1
> >       or -1, also don't scan-assembler test for ia32.
> >       * gcc.target/i386/minmax-10.c: Ditto.
> >       * gcc.target/i386/minmax-2.c: Ditto.
> >       * gcc.target/i386/pr78035.c: Ditto.
> >       * gcc.target/i386/pr96861.c: Scan either cmp or test 3 times.
>
> It was almost certainly for PPro/P2 given it was rth's work from
> 1999.    Probably should have been conditionalized on PPro/P2 at the
> time.   No worries losing it now...

Please add a tune flag in x86-tune.def under "Historical relics" and
use it in the relevant peephole2 instead of deleting it.
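
For reference, a rough sketch of what that could look like (the flag
name, the tuning mask and the placement are assumptions, not a final
patch):

/* gcc/config/i386/x86-tune.def, under "Historical relics":  */
DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CMP, "avoid_mem_opnd_for_cmp",
	  m_PPRO)

/* gcc/config/i386/i386.h:  */
#define TARGET_AVOID_MEM_OPND_FOR_CMP \
  ix86_tune_features[X86_TUNE_AVOID_MEM_OPND_FOR_CMP]

The peephole2's condition would then also test
TARGET_AVOID_MEM_OPND_FOR_CMP instead of the pattern being deleted.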

Uros.


* Re: [PATCH] [x86]Don't optimize cmp mem, 0 to load mem, reg + test reg, reg
  2022-09-16  6:47   ` Uros Bizjak
@ 2022-09-16 13:38     ` Alexander Monakov
  2022-09-20  2:27       ` Hongtao Liu
  0 siblings, 1 reply; 6+ messages in thread
From: Alexander Monakov @ 2022-09-16 13:38 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Jeff Law, gcc-patches

On Fri, 16 Sep 2022, Uros Bizjak via Gcc-patches wrote:

> On Fri, Sep 16, 2022 at 3:32 AM Jeff Law via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> >
> > On 9/15/22 19:06, liuhongt via Gcc-patches wrote:
> > > There's peephole2 submit in 1990s which split cmp mem, 0 to load mem,
> > > reg + test reg, reg. I don't know exact reason why gcc do this.
> > >
> > > For latest x86 processors, ciscization should help processor frontend
> > > also codesize, for processor backend, they should be the same(has same
> > > uops).
> > >
> > > So the patch deleted the peephole2, and also modify another splitter to
> > > generate more cmp mem, 0 for 32-bit target.
> > >
> > > It will help instruction fetch.
> > >
> > > for minmax-1.c minmax-2.c minmax-10, pr96891.c, it's supposed to scan there's no
> > > comparison to 1 or -1, so adjust the testcase since under 32-bit
> > > target, we now generate cmp mem, 0 instead of load + test.
> > >
> > > Similar for pr78035.c.
> > >
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
> > > No performance impact for SPEC2017 on ICX/Znver3.
> > >
> > It was almost certainly for PPro/P2 given it was rth's work from
> > 1999.    Probably should have been conditionalized on PPro/P2 at the
> > time.   No worries losing it now...
> 
> Please add a tune flag in x86-tune.def under "Historical relics" and
> use it in the relevant peephole2 instead of deleting it.

When the next instruction after 'load mem; test reg, reg' is a conditional
branch, this disables macro-op fusion because Intel CPUs do not macro-fuse
'cmp mem, imm; jcc'.

It would be nice to rephrase the commit message to acknowledge this (the
statement 'has same uops' is not always true with this considered).

AMD CPUs can fuse some 'cmp mem, imm; jcc' under some conditions, so this
should be beneficial for AMD.
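
A minimal example of the kind of spot where this matters (illustrative
only; the assembly in the comment is an assumption about typical -O2
output):

extern int flag;
extern void work (void);

void
f (void)
{
  /* The split form
	movl	flag(%rip), %eax
	testl	%eax, %eax
	jne	.L4
     lets test+jne macro-fuse on Intel CPUs, while the CISC form
	cmpl	$0, flag(%rip)
	jne	.L4
     does not fuse there (some AMD CPUs can fuse it).  */
  if (flag)
    work ();
}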

Alexander


* Re: [PATCH] [x86]Don't optimize cmp mem, 0 to load mem, reg + test reg, reg
  2022-09-16 13:38     ` Alexander Monakov
@ 2022-09-20  2:27       ` Hongtao Liu
  0 siblings, 0 replies; 6+ messages in thread
From: Hongtao Liu @ 2022-09-20  2:27 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Uros Bizjak, gcc-patches

On Fri, Sep 16, 2022 at 9:38 PM Alexander Monakov via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Fri, 16 Sep 2022, Uros Bizjak via Gcc-patches wrote:
>
> > On Fri, Sep 16, 2022 at 3:32 AM Jeff Law via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > >
> > > On 9/15/22 19:06, liuhongt via Gcc-patches wrote:
> > > > There's peephole2 submit in 1990s which split cmp mem, 0 to load mem,
> > > > reg + test reg, reg. I don't know exact reason why gcc do this.
> > > >
> > > > For latest x86 processors, ciscization should help processor frontend
> > > > also codesize, for processor backend, they should be the same(has same
> > > > uops).
> > > >
> > > > So the patch deleted the peephole2, and also modify another splitter to
> > > > generate more cmp mem, 0 for 32-bit target.
> > > >
> > > > It will help instruction fetch.
> > > >
> > > > for minmax-1.c minmax-2.c minmax-10, pr96891.c, it's supposed to scan there's no
> > > > comparison to 1 or -1, so adjust the testcase since under 32-bit
> > > > target, we now generate cmp mem, 0 instead of load + test.
> > > >
> > > > Similar for pr78035.c.
> > > >
> > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
> > > > No performance impact for SPEC2017 on ICX/Znver3.
> > > >
> > > It was almost certainly for PPro/P2 given it was rth's work from
> > > 1999.    Probably should have been conditionalized on PPro/P2 at the
> > > time.   No worries losing it now...
> >
> > Please add a tune flag in x86-tune.def under "Historical relics" and
> > use it in the relevant peephole2 instead of deleting it.
>
> When the next instruction after 'load mem; test reg, reg' is a conditional
> branch, this disables macro-op fusion because Intel CPUs do not macro-fuse
> 'cmp mem, imm; jcc'.
>
Oh, I didn't realize that, thanks for your reply.
I'll hold off on the patch pending more investigation.
> It would be nice to rephrase the commit message to acknowledge this (the
> statement 'has same uops' is not always true with this considered).
>
> AMD CPUs can fuse some 'cmp mem, imm; jcc' under some conditions, so this
> should be beneficial for AMD.
>
> Alexander



-- 
BR,
Hongtao

