public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] Optimize away x86 mem stores of what the mem contains already (PR rtl-optimization/79593)
@ 2019-01-07 22:51 Jakub Jelinek
  2019-01-08  7:29 ` Uros Bizjak
  2019-01-11 19:03 ` Jeff Law
  0 siblings, 2 replies; 7+ messages in thread
From: Jakub Jelinek @ 2019-01-07 22:51 UTC (permalink / raw)
  To: Uros Bizjak, Jeff Law; +Cc: gcc-patches

Hi!

As mentioned in that PR, we have a SI->DImode zero extension and RA happens
to choose to zero extend from a SImode memory slot which is the low part of
the DImode memory slot into which the zero extension is to be stored. 
Unfortunately, the RTL DSE part really doesn't have infrastructure to
remember and, if needed, invalidate loads, it just remembers stores, so
handling this generically is quite unlikely at least for GCC9.

This patch just handles that through a peephole2 (other option would be to
handle it in the define_split for the zero extension, but the peephole2 is
likely to catch more things).

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2019-01-07  Jakub Jelinek  <jakub@redhat.com>

	PR rtl-optimization/79593
	* config/i386/i386.md (reg = mem; mem = reg): New define_peephole2.

--- gcc/config/i386/i386.md.jj	2019-01-01 12:37:31.564738571 +0100
+++ gcc/config/i386/i386.md	2019-01-07 17:11:21.056392168 +0100
@@ -18740,6 +18740,21 @@ (define_peephole2
 		       const0_rtx);
 })
 
+;; Attempt to optimize away memory stores of values the memory already
+;; has.  See PR79593.
+(define_peephole2
+  [(set (match_operand 0 "register_operand")
+        (match_operand 1 "memory_operand"))
+   (set (match_dup 1) (match_dup 0))]
+  "REG_P (operands[0])
+   && !STACK_REGNO_P (operands[0])
+   && !MEM_VOLATILE_P (operands[1])"
+  [(set (match_dup 0) (match_dup 1))]
+{
+  if (peep2_reg_dead_p (1, operands[0]))
+    DONE;
+})
+
 ;; Attempt to always use XOR for zeroing registers (including FP modes).
 (define_peephole2
   [(set (match_operand 0 "general_reg_operand")

	Jakub

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Optimize away x86 mem stores of what the mem contains already (PR rtl-optimization/79593)
  2019-01-07 22:51 [PATCH] Optimize away x86 mem stores of what the mem contains already (PR rtl-optimization/79593) Jakub Jelinek
@ 2019-01-08  7:29 ` Uros Bizjak
  2019-01-08  9:27   ` Jakub Jelinek
  2019-01-11 19:03 ` Jeff Law
  1 sibling, 1 reply; 7+ messages in thread
From: Uros Bizjak @ 2019-01-08  7:29 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Jeff Law, gcc-patches

On Mon, Jan 7, 2019 at 11:51 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> Hi!
>
> As mentioned in that PR, we have a SI->DImode zero extension and RA happens
> to choose to zero extend from a SImode memory slot which is the low part of
> the DImode memory slot into which the zero extension is to be stored.
> Unfortunately, the RTL DSE part really doesn't have infrastructure to
> remember and, if needed, invalidate loads, it just remembers stores, so
> handling this generically is quite unlikely at least for GCC9.
>
> This patch just handles that through a peephole2 (other option would be to
> handle it in the define_split for the zero extension, but the peephole2 is
> likely to catch more things).
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Is there a reason stack registers are excluded? Before stackreg pass,
these registers are just like other hard registers.

Other that that, there is no need for REG_P predicate; after reload we
don't have subregs and register_operand will match only hard regs.
Also, please put peep2_reg_dead_p predicate in the pattern predicate.

Uros.

> 2019-01-07  Jakub Jelinek  <jakub@redhat.com>
>
>         PR rtl-optimization/79593
>         * config/i386/i386.md (reg = mem; mem = reg): New define_peephole2.
>
> --- gcc/config/i386/i386.md.jj  2019-01-01 12:37:31.564738571 +0100
> +++ gcc/config/i386/i386.md     2019-01-07 17:11:21.056392168 +0100
> @@ -18740,6 +18740,21 @@ (define_peephole2
>                        const0_rtx);
>  })
>
> +;; Attempt to optimize away memory stores of values the memory already
> +;; has.  See PR79593.
> +(define_peephole2
> +  [(set (match_operand 0 "register_operand")
> +        (match_operand 1 "memory_operand"))
> +   (set (match_dup 1) (match_dup 0))]
> +  "REG_P (operands[0])
> +   && !STACK_REGNO_P (operands[0])
> +   && !MEM_VOLATILE_P (operands[1])"
> +  [(set (match_dup 0) (match_dup 1))]
> +{
> +  if (peep2_reg_dead_p (1, operands[0]))
> +    DONE;
> +})
> +
>  ;; Attempt to always use XOR for zeroing registers (including FP modes).
>  (define_peephole2
>    [(set (match_operand 0 "general_reg_operand")
>
>         Jakub

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Optimize away x86 mem stores of what the mem contains already (PR rtl-optimization/79593)
  2019-01-08  7:29 ` Uros Bizjak
@ 2019-01-08  9:27   ` Jakub Jelinek
  2019-01-08 10:49     ` Uros Bizjak
  0 siblings, 1 reply; 7+ messages in thread
From: Jakub Jelinek @ 2019-01-08  9:27 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Jeff Law, gcc-patches

On Tue, Jan 08, 2019 at 08:29:03AM +0100, Uros Bizjak wrote:
> Is there a reason stack registers are excluded? Before stackreg pass,
> these registers are just like other hard registers.

I was just afraid of those, after stack pass removing a store but not
load would result in the stack getting out of sync, and I wasn't sure if the
stack pass will tolerate a RA chosen stack reg not being touched anymore.

Also, doesn't loading a float/double/long double into a stack reg and
storing again canonicalize it in any way?  Say NaNs?  Or unnormal or
pseudo-denormal long doubles etc.?  Tried that now with a sNaN and the
load/store didn't change it, but haven't tried with exceptions enabled to
see if it would raise one.

If that is fine, I'll leave that out.

Looking around, I was using a wrong macro anyway and am surprised nothing
complained, it should have been STACK_REG_P rather than STACK_REGNO_P.

> Other that that, there is no need for REG_P predicate; after reload we
> don't have subregs and register_operand will match only hard regs.

Ok, I wasn't sure because "register_operand" allows (subreg (reg)) even if
reload_completed, just disallows subregs of mem after that.

> Also, please put peep2_reg_dead_p predicate in the pattern predicate.

I don't see how, that would mean I'd have to write two peephole2s instead of
one.  It tries to deal with two different cases, one is where the temporary
reg is dead, in that case we can optimize away both the load or store, the
second case is where the temporary reg isn't dead, in that case we can
optimize away the store, but not the load.  With the optimizing away of both
load and store I was just trying to do a cheap DCE there.

Looking around more, I actually think I need to replace
(match_dup 1) with (match_operand 2 "memory_operand"), add rtx_equal_p
(operands[1], operands[2]) and !MEM_VOLATILE_P (operands[2]), because
apparently rtx_equal_p doesn't check the MEM_VOLATILE_P bit.

> > 2019-01-07  Jakub Jelinek  <jakub@redhat.com>
> >
> >         PR rtl-optimization/79593
> >         * config/i386/i386.md (reg = mem; mem = reg): New define_peephole2.
> >
> > --- gcc/config/i386/i386.md.jj  2019-01-01 12:37:31.564738571 +0100
> > +++ gcc/config/i386/i386.md     2019-01-07 17:11:21.056392168 +0100
> > @@ -18740,6 +18740,21 @@ (define_peephole2
> >                        const0_rtx);
> >  })
> >
> > +;; Attempt to optimize away memory stores of values the memory already
> > +;; has.  See PR79593.
> > +(define_peephole2
> > +  [(set (match_operand 0 "register_operand")
> > +        (match_operand 1 "memory_operand"))
> > +   (set (match_dup 1) (match_dup 0))]
> > +  "REG_P (operands[0])
> > +   && !STACK_REGNO_P (operands[0])
> > +   && !MEM_VOLATILE_P (operands[1])"
> > +  [(set (match_dup 0) (match_dup 1))]
> > +{
> > +  if (peep2_reg_dead_p (1, operands[0]))
> > +    DONE;
> > +})
> > +
> >  ;; Attempt to always use XOR for zeroing registers (including FP modes).
> >  (define_peephole2
> >    [(set (match_operand 0 "general_reg_operand")

	Jakub

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Optimize away x86 mem stores of what the mem contains already (PR rtl-optimization/79593)
  2019-01-08  9:27   ` Jakub Jelinek
@ 2019-01-08 10:49     ` Uros Bizjak
  2019-01-08 11:43       ` Jakub Jelinek
  0 siblings, 1 reply; 7+ messages in thread
From: Uros Bizjak @ 2019-01-08 10:49 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Jeff Law, gcc-patches

On Tue, Jan 8, 2019 at 10:27 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Tue, Jan 08, 2019 at 08:29:03AM +0100, Uros Bizjak wrote:
> > Is there a reason stack registers are excluded? Before stackreg pass,
> > these registers are just like other hard registers.
>
> I was just afraid of those, after stack pass removing a store but not
> load would result in the stack getting out of sync, and I wasn't sure if the
> stack pass will tolerate a RA chosen stack reg not being touched anymore.

It does. There is a stack compensating code that handles these situations.

> Also, doesn't loading a float/double/long double into a stack reg and
> storing again canonicalize it in any way?  Say NaNs?  Or unnormal or
> pseudo-denormal long doubles etc.?  Tried that now with a sNaN and the
> load/store didn't change it, but haven't tried with exceptions enabled to
> see if it would raise one.

FLD from memory in SF and DFmode is considered a conversion, and
converts sNaN to NaN (and emits #IA exception). But sNaN handling is
already busted in the compiler as RA is free to spill the register in
non-XFmode. IMO, the peephole2 pattern is no worse than the current
situation.

> If that is fine, I'll leave that out.

I think we are good to do so.

> Looking around, I was using a wrong macro anyway and am surprised nothing
> complained, it should have been STACK_REG_P rather than STACK_REGNO_P.
>
> > Other that that, there is no need for REG_P predicate; after reload we
> > don't have subregs and register_operand will match only hard regs.
>
> Ok, I wasn't sure because "register_operand" allows (subreg (reg)) even if
> reload_completed, just disallows subregs of mem after that.

At least for x86, there are no SUBREGs after reload, otherwise other
parts of the compiler would break.

> > Also, please put peep2_reg_dead_p predicate in the pattern predicate.
>
> I don't see how, that would mean I'd have to write two peephole2s instead of
> one.  It tries to deal with two different cases, one is where the temporary
> reg is dead, in that case we can optimize away both the load or store, the
> second case is where the temporary reg isn't dead, in that case we can
> optimize away the store, but not the load.  With the optimizing away of both
> load and store I was just trying to do a cheap DCE there.

I didn't realize this is an optimization, a comment would be welcome here.

Uros.

> Looking around more, I actually think I need to replace
> (match_dup 1) with (match_operand 2 "memory_operand"), add rtx_equal_p
> (operands[1], operands[2]) and !MEM_VOLATILE_P (operands[2]), because
> apparently rtx_equal_p doesn't check the MEM_VOLATILE_P bit.
>
> > > 2019-01-07  Jakub Jelinek  <jakub@redhat.com>
> > >
> > >         PR rtl-optimization/79593
> > >         * config/i386/i386.md (reg = mem; mem = reg): New define_peephole2.
> > >
> > > --- gcc/config/i386/i386.md.jj  2019-01-01 12:37:31.564738571 +0100
> > > +++ gcc/config/i386/i386.md     2019-01-07 17:11:21.056392168 +0100
> > > @@ -18740,6 +18740,21 @@ (define_peephole2
> > >                        const0_rtx);
> > >  })
> > >
> > > +;; Attempt to optimize away memory stores of values the memory already
> > > +;; has.  See PR79593.
> > > +(define_peephole2
> > > +  [(set (match_operand 0 "register_operand")
> > > +        (match_operand 1 "memory_operand"))
> > > +   (set (match_dup 1) (match_dup 0))]
> > > +  "REG_P (operands[0])
> > > +   && !STACK_REGNO_P (operands[0])
> > > +   && !MEM_VOLATILE_P (operands[1])"
> > > +  [(set (match_dup 0) (match_dup 1))]
> > > +{
> > > +  if (peep2_reg_dead_p (1, operands[0]))
> > > +    DONE;
> > > +})
> > > +
> > >  ;; Attempt to always use XOR for zeroing registers (including FP modes).
> > >  (define_peephole2
> > >    [(set (match_operand 0 "general_reg_operand")
>
>         Jakub

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Optimize away x86 mem stores of what the mem contains already (PR rtl-optimization/79593)
  2019-01-08 10:49     ` Uros Bizjak
@ 2019-01-08 11:43       ` Jakub Jelinek
  2019-01-08 17:40         ` Uros Bizjak
  0 siblings, 1 reply; 7+ messages in thread
From: Jakub Jelinek @ 2019-01-08 11:43 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Jeff Law, gcc-patches

On Tue, Jan 08, 2019 at 11:49:10AM +0100, Uros Bizjak wrote:
> FLD from memory in SF and DFmode is considered a conversion, and
> converts sNaN to NaN (and emits #IA exception). But sNaN handling is
> already busted in the compiler as RA is free to spill the register in
> non-XFmode. IMO, the peephole2 pattern is no worse than the current
> situation.

Ok.

> At least for x86, there are no SUBREGs after reload, otherwise other
> parts of the compiler would break.

The new patch would really handle even a SUBREG there...

> > I don't see how, that would mean I'd have to write two peephole2s instead of
> > one.  It tries to deal with two different cases, one is where the temporary
> > reg is dead, in that case we can optimize away both the load or store, the
> > second case is where the temporary reg isn't dead, in that case we can
> > optimize away the store, but not the load.  With the optimizing away of both
> > load and store I was just trying to do a cheap DCE there.
> 
> I didn't realize this is an optimization, a comment would be welcome here.

Ugh, except that it doesn't work.  peep2_reg_dead_p (1, operands[0])
is not what I meant, that is always false, as the register must be live in
between the first and second instruction.  I meant
peep2_reg_dead_p (2, operands[0]), the register dead at the end of the
second instruction, except we don't really support
define_split/define_peephole2 splitting into zero instructions, DONE; in
that case returns NULL like FAIL; does.  So, let's just wait for DCE to
finish it up.

Here is what I'll bootstrap/regtest then.  Added also
reg_overlap_mentioned_p, in case there is e.g.
  movl (%eax,%edx), %eax
  movl %eax, (%eax,%edx)
or similar and as I said earlier, explicit match_operand so that I can
check MEM_VOLATILE_P on both MEMs.

2019-01-08  Jakub Jelinek  <jakub@redhat.com>

	PR rtl-optimization/79593
	* config/i386/i386.md (reg = mem; mem = reg): New define_peephole2.

--- gcc/config/i386/i386.md.jj	2019-01-07 23:54:54.494800693 +0100
+++ gcc/config/i386/i386.md	2019-01-08 12:34:18.916832780 +0100
@@ -18740,6 +18740,18 @@ (define_peephole2
 		       const0_rtx);
 })
 
+;; Attempt to optimize away memory stores of values the memory already
+;; has.  See PR79593.
+(define_peephole2
+  [(set (match_operand 0 "register_operand")
+        (match_operand 1 "memory_operand"))
+   (set (match_operand 2 "memory_operand") (match_dup 0))]
+  "!MEM_VOLATILE_P (operands[1])
+   && !MEM_VOLATILE_P (operands[2])
+   && rtx_equal_p (operands[1], operands[2])
+   && !reg_overlap_mentioned_p (operands[0], operands[2])"
+  [(set (match_dup 0) (match_dup 1))])
+
 ;; Attempt to always use XOR for zeroing registers (including FP modes).
 (define_peephole2
   [(set (match_operand 0 "general_reg_operand")


	Jakub

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Optimize away x86 mem stores of what the mem contains already (PR rtl-optimization/79593)
  2019-01-08 11:43       ` Jakub Jelinek
@ 2019-01-08 17:40         ` Uros Bizjak
  0 siblings, 0 replies; 7+ messages in thread
From: Uros Bizjak @ 2019-01-08 17:40 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Jeff Law, gcc-patches

On Tue, Jan 8, 2019 at 12:43 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Tue, Jan 08, 2019 at 11:49:10AM +0100, Uros Bizjak wrote:
> > FLD from memory in SF and DFmode is considered a conversion, and
> > converts sNaN to NaN (and emits #IA exception). But sNaN handling is
> > already busted in the compiler as RA is free to spill the register in
> > non-XFmode. IMO, the peephole2 pattern is no worse than the current
> > situation.
>
> Ok.
>
> > At least for x86, there are no SUBREGs after reload, otherwise other
> > parts of the compiler would break.
>
> The new patch would really handle even a SUBREG there...
>
> > > I don't see how, that would mean I'd have to write two peephole2s instead of
> > > one.  It tries to deal with two different cases, one is where the temporary
> > > reg is dead, in that case we can optimize away both the load or store, the
> > > second case is where the temporary reg isn't dead, in that case we can
> > > optimize away the store, but not the load.  With the optimizing away of both
> > > load and store I was just trying to do a cheap DCE there.
> >
> > I didn't realize this is an optimization, a comment would be welcome here.
>
> Ugh, except that it doesn't work.  peep2_reg_dead_p (1, operands[0])
> is not what I meant, that is always false, as the register must be live in
> between the first and second instruction.  I meant
> peep2_reg_dead_p (2, operands[0]), the register dead at the end of the
> second instruction, except we don't really support
> define_split/define_peephole2 splitting into zero instructions, DONE; in
> that case returns NULL like FAIL; does.  So, let's just wait for DCE to
> finish it up.
>
> Here is what I'll bootstrap/regtest then.  Added also
> reg_overlap_mentioned_p, in case there is e.g.
>   movl (%eax,%edx), %eax
>   movl %eax, (%eax,%edx)

I doubt this would *ever* happen, but ... OK.

> or similar and as I said earlier, explicit match_operand so that I can
> check MEM_VOLATILE_P on both MEMs.
>
> 2019-01-08  Jakub Jelinek  <jakub@redhat.com>
>
>         PR rtl-optimization/79593
>         * config/i386/i386.md (reg = mem; mem = reg): New define_peephole2.

OK for mainline.

Thanks,
Uros.

> --- gcc/config/i386/i386.md.jj  2019-01-07 23:54:54.494800693 +0100
> +++ gcc/config/i386/i386.md     2019-01-08 12:34:18.916832780 +0100
> @@ -18740,6 +18740,18 @@ (define_peephole2
>                        const0_rtx);
>  })
>
> +;; Attempt to optimize away memory stores of values the memory already
> +;; has.  See PR79593.
> +(define_peephole2
> +  [(set (match_operand 0 "register_operand")
> +        (match_operand 1 "memory_operand"))
> +   (set (match_operand 2 "memory_operand") (match_dup 0))]
> +  "!MEM_VOLATILE_P (operands[1])
> +   && !MEM_VOLATILE_P (operands[2])
> +   && rtx_equal_p (operands[1], operands[2])
> +   && !reg_overlap_mentioned_p (operands[0], operands[2])"
> +  [(set (match_dup 0) (match_dup 1))])
> +
>  ;; Attempt to always use XOR for zeroing registers (including FP modes).
>  (define_peephole2
>    [(set (match_operand 0 "general_reg_operand")
>
>
>         Jakub

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Optimize away x86 mem stores of what the mem contains already (PR rtl-optimization/79593)
  2019-01-07 22:51 [PATCH] Optimize away x86 mem stores of what the mem contains already (PR rtl-optimization/79593) Jakub Jelinek
  2019-01-08  7:29 ` Uros Bizjak
@ 2019-01-11 19:03 ` Jeff Law
  1 sibling, 0 replies; 7+ messages in thread
From: Jeff Law @ 2019-01-11 19:03 UTC (permalink / raw)
  To: Jakub Jelinek, Uros Bizjak; +Cc: gcc-patches

On 1/7/19 3:51 PM, Jakub Jelinek wrote:
> Hi!
> 
> As mentioned in that PR, we have a SI->DImode zero extension and RA happens
> to choose to zero extend from a SImode memory slot which is the low part of
> the DImode memory slot into which the zero extension is to be stored. 
> Unfortunately, the RTL DSE part really doesn't have infrastructure to
> remember and, if needed, invalidate loads, it just remembers stores, so
> handling this generically is quite unlikely at least for GCC9.
> 
> This patch just handles that through a peephole2 (other option would be to
> handle it in the define_split for the zero extension, but the peephole2 is
> likely to catch more things).
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> 
> 2019-01-07  Jakub Jelinek  <jakub@redhat.com>
> 
> 	PR rtl-optimization/79593
> 	* config/i386/i386.md (reg = mem; mem = reg): New define_peephole2.
Note I'd hacked up a POC in DSE and it caught something like 10
instances across a bootstrap.  In the end I don't think this matters
much in practice (peephole vs optimizing at the split point).

Jeff

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-01-11 19:03 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-07 22:51 [PATCH] Optimize away x86 mem stores of what the mem contains already (PR rtl-optimization/79593) Jakub Jelinek
2019-01-08  7:29 ` Uros Bizjak
2019-01-08  9:27   ` Jakub Jelinek
2019-01-08 10:49     ` Uros Bizjak
2019-01-08 11:43       ` Jakub Jelinek
2019-01-08 17:40         ` Uros Bizjak
2019-01-11 19:03 ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).