Different ASM for ReLU function between GCC11 and GCC12

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Different ASM for ReLU function between GCC11 and GCC12
@ 2023-06-19 19:10 André Günther
  2023-06-19 19:44 ` Jakub Jelinek
  2023-06-19 21:56 ` Marc Glisse
  0 siblings, 2 replies; 7+ messages in thread
From: André Günther @ 2023-06-19 19:10 UTC (permalink / raw)
  To: gcc, André Günther

[-- Attachment #1: Type: text/plain, Size: 717 bytes --]

Hi,
I noticed that a simple function like
auto relu( float x ) {
    return x > 0.f ? x : 0.f;
}
compiles to different ASM using GCC11 (or lower) and GCC12 (or higher). On
-O3 -mavx2 the former compiles above function to

relu(float):
    vmaxss xmm0, xmm0, DWORD PTR .LC0[rip]
    ret
.LC0:
    .long 0

which is what I would naively expect and what also clang essentially does
(clang actually uses an xor before the maxss to get the zero). The latter,
however, compiles the function to

relu(float):
    vxorps xmm1, xmm1, xmm1
    vcmpltss xmm2, xmm1, xmm0
    vblendvps xmm0, xmm1, xmm0, xmm2
    ret

which looks like a missed optimisation. Does anyone know if there's a
reason for the changed behaviour?

Andre

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Different ASM for ReLU function between GCC11 and GCC12
  2023-06-19 19:10 Different ASM for ReLU function between GCC11 and GCC12 André Günther
@ 2023-06-19 19:44 ` Jakub Jelinek
  2023-06-20  8:15   ` Richard Biener
  2023-06-19 21:56 ` Marc Glisse
  1 sibling, 1 reply; 7+ messages in thread
From: Jakub Jelinek @ 2023-06-19 19:44 UTC (permalink / raw)
  To: André Günther, Roger Sayle; +Cc: gcc, André Günther

On Mon, Jun 19, 2023 at 09:10:53PM +0200, André Günther via Gcc wrote:
> I noticed that a simple function like
> auto relu( float x ) {
>     return x > 0.f ? x : 0.f;
> }
> compiles to different ASM using GCC11 (or lower) and GCC12 (or higher). On
> -O3 -mavx2 the former compiles above function to

Such reports should go into gcc.gnu.org/bugzilla/, not to the mailing list,
if you are convinced that loading the constant from memory is faster.
Another possibility is
	vxorps xmm1, xmm1, xmm1
	vmaxss xmm0, xmm0, xmm1
	ret
which doesn't need to wait for the memory.
This changed with https://gcc.gnu.org/r12-7693

> 
> relu(float):
>     vmaxss xmm0, xmm0, DWORD PTR .LC0[rip]
>     ret
> .LC0:
>     .long 0
> 
> which is what I would naively expect and what also clang essentially does
> (clang actually uses an xor before the maxss to get the zero). The latter,
> however, compiles the function to
> 
> relu(float):
>     vxorps xmm1, xmm1, xmm1
>     vcmpltss xmm2, xmm1, xmm0
>     vblendvps xmm0, xmm1, xmm0, xmm2
>     ret
> 
> which looks like a missed optimisation. Does anyone know if there's a
> reason for the changed behaviour?

	Jakub


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Different ASM for ReLU function between GCC11 and GCC12
  2023-06-19 19:10 Different ASM for ReLU function between GCC11 and GCC12 André Günther
  2023-06-19 19:44 ` Jakub Jelinek
@ 2023-06-19 21:56 ` Marc Glisse
  1 sibling, 0 replies; 7+ messages in thread
From: Marc Glisse @ 2023-06-19 21:56 UTC (permalink / raw)
  To: André Günther; +Cc: gcc, André Günther

On Mon, 19 Jun 2023, André Günther via Gcc wrote:

> I noticed that a simple function like
> auto relu( float x ) {
>    return x > 0.f ? x : 0.f;
> }
> compiles to different ASM using GCC11 (or lower) and GCC12 (or higher). On
> -O3 -mavx2 the former compiles above function to
>
> relu(float):
>    vmaxss xmm0, xmm0, DWORD PTR .LC0[rip]
>    ret
> .LC0:
>    .long 0
>
> which is what I would naively expect and what also clang essentially does
> (clang actually uses an xor before the maxss to get the zero). The latter,
> however, compiles the function to
>
> relu(float):
>    vxorps xmm1, xmm1, xmm1
>    vcmpltss xmm2, xmm1, xmm0
>    vblendvps xmm0, xmm1, xmm0, xmm2
>    ret
>
> which looks like a missed optimisation. Does anyone know if there's a
> reason for the changed behaviour?

With -fno-signed-zeros -ffinite-math-only, gcc-12 still uses max instead 
of cmp+blend. So the first thing to check would be if both versions give 
the same result on negative 0 and NaN.

-- 
Marc Glisse

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Different ASM for ReLU function between GCC11 and GCC12
  2023-06-19 19:44 ` Jakub Jelinek
@ 2023-06-20  8:15   ` Richard Biener
  2023-06-20 13:06     ` Jakub Jelinek
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Biener @ 2023-06-20  8:15 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: André Günther, Roger Sayle, gcc, André Günther

On Mon, Jun 19, 2023 at 9:45 PM Jakub Jelinek via Gcc <gcc@gcc.gnu.org> wrote:
>
> On Mon, Jun 19, 2023 at 09:10:53PM +0200, André Günther via Gcc wrote:
> > I noticed that a simple function like
> > auto relu( float x ) {
> >     return x > 0.f ? x : 0.f;
> > }
> > compiles to different ASM using GCC11 (or lower) and GCC12 (or higher). On
> > -O3 -mavx2 the former compiles above function to
>
> Such reports should go into gcc.gnu.org/bugzilla/, not to the mailing list,
> if you are convinced that loading the constant from memory is faster.
> Another possibility is
>         vxorps xmm1, xmm1, xmm1
>         vmaxss xmm0, xmm0, xmm1
>         ret
> which doesn't need to wait for the memory.
> This changed with https://gcc.gnu.org/r12-7693

I guess we previously were able to see that one operand of
the comparison was not NaN.  Maybe some REG_EQUAL
note can come to the rescue here?

> >
> > relu(float):
> >     vmaxss xmm0, xmm0, DWORD PTR .LC0[rip]
> >     ret
> > .LC0:
> >     .long 0
> >
> > which is what I would naively expect and what also clang essentially does
> > (clang actually uses an xor before the maxss to get the zero). The latter,
> > however, compiles the function to
> >
> > relu(float):
> >     vxorps xmm1, xmm1, xmm1
> >     vcmpltss xmm2, xmm1, xmm0
> >     vblendvps xmm0, xmm1, xmm0, xmm2
> >     ret
> >
> > which looks like a missed optimisation. Does anyone know if there's a
> > reason for the changed behaviour?
>
>         Jakub
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Different ASM for ReLU function between GCC11 and GCC12
  2023-06-20  8:15   ` Richard Biener
@ 2023-06-20 13:06     ` Jakub Jelinek
  2023-06-20 15:03       ` Michael Matz
  0 siblings, 1 reply; 7+ messages in thread
From: Jakub Jelinek @ 2023-06-20 13:06 UTC (permalink / raw)
  To: Richard Biener, Uros Bizjak
  Cc: André Günther, Roger Sayle, gcc, André Günther

On Tue, Jun 20, 2023 at 10:15:37AM +0200, Richard Biener wrote:
> On Mon, Jun 19, 2023 at 9:45 PM Jakub Jelinek via Gcc <gcc@gcc.gnu.org> wrote:
> >
> > On Mon, Jun 19, 2023 at 09:10:53PM +0200, André Günther via Gcc wrote:
> > > I noticed that a simple function like
> > > auto relu( float x ) {
> > >     return x > 0.f ? x : 0.f;
> > > }
> > > compiles to different ASM using GCC11 (or lower) and GCC12 (or higher). On
> > > -O3 -mavx2 the former compiles above function to
> >
> > Such reports should go into gcc.gnu.org/bugzilla/, not to the mailing list,
> > if you are convinced that loading the constant from memory is faster.
> > Another possibility is
> >         vxorps xmm1, xmm1, xmm1
> >         vmaxss xmm0, xmm0, xmm1
> >         ret
> > which doesn't need to wait for the memory.
> > This changed with https://gcc.gnu.org/r12-7693
> 
> I guess we previously were able to see that one operand of
> the comparison was not NaN.  Maybe some REG_EQUAL
> note can come to the rescue here?

ce1 pass results in emit_conditional_move with
(gt (reg/v:SF 83 [ x ]) (reg:SF 84)), (reg/v:SF 83 [ x ]), (reg:SF 84)
operands in the GCC 11 case and so is successfully matched by
ix86_expand_fp_movcc as ix86_expand_sse_fp_minmax.
But, in GCC 12+, emit_conditional_move is called with
(gt (reg/v:SF 83 [ x ]) (reg:SF 84)), (reg/v:SF 83 [ x ]), (const_double:SF 0.0 [0x0.0p+0])
instead (reg:SF 84 in both cases contains the (const_double:SF 0.0 [0x0.0p+0])
value, in the GCC 11 case loaded from memory, in the GCC 12+ case set
directly in a move.  The reason for the difference is exactly that
because cheap SSE constant can be moved directly into a reg, it is done so
instead of reusing a pseudo that contains that value already.
In the latter case ix86_expand_fp_movcc is called even not with the
const_double because the expander doesn't allow immediates, but with
it forced into some other register, so it can't really find out it is
actually a minmax.  Even if it allowed the cheap SSE constants,
it wouldn't know that r84 is also zero (unless the expander checks that
it is a pseudo with a single setter and verifies it or something similar).

	Jakub

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Different ASM for ReLU function between GCC11 and GCC12
  2023-06-20 13:06     ` Jakub Jelinek
@ 2023-06-20 15:03       ` Michael Matz
  2023-06-20 15:15         ` Jakub Jelinek
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Matz @ 2023-06-20 15:03 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Biener, Uros Bizjak, André Günther,
	Roger Sayle, gcc, André Günther

Hello,

On Tue, 20 Jun 2023, Jakub Jelinek via Gcc wrote:

> ce1 pass results in emit_conditional_move with
> (gt (reg/v:SF 83 [ x ]) (reg:SF 84)), (reg/v:SF 83 [ x ]), (reg:SF 84)
> operands in the GCC 11 case and so is successfully matched by
> ix86_expand_fp_movcc as ix86_expand_sse_fp_minmax.
> But, in GCC 12+, emit_conditional_move is called with
> (gt (reg/v:SF 83 [ x ]) (reg:SF 84)), (reg/v:SF 83 [ x ]), (const_double:SF 0.0 [0x0.0p+0])
> instead (reg:SF 84 in both cases contains the (const_double:SF 0.0 [0x0.0p+0])
> value, in the GCC 11 case loaded from memory, in the GCC 12+ case set
> directly in a move.  The reason for the difference is exactly that
> because cheap SSE constant can be moved directly into a reg, it is done so
> instead of reusing a pseudo that contains that value already.

But reg84 is _also_ used as operand of emit_conditional_move, so there's 
no reason to not also use it as third operand.  It seems more canonical to 
call a function with

  X-containing-B, A, B

than with

  X-containing-B, A, something-equal-to-B-but-not-B

so either the (const_double) RTL should be used in both, or reg84 should, 
but not a mix.  Exactly so to ...

> actually a minmax.  Even if it allowed the cheap SSE constants,
> it wouldn't know that r84 is also zero (unless the expander checks that
> it is a pseudo with a single setter and verifies it or something similar).

... not have this problem.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Different ASM for ReLU function between GCC11 and GCC12
  2023-06-20 15:03       ` Michael Matz
@ 2023-06-20 15:15         ` Jakub Jelinek
  0 siblings, 0 replies; 7+ messages in thread
From: Jakub Jelinek @ 2023-06-20 15:15 UTC (permalink / raw)
  To: Michael Matz
  Cc: Richard Biener, Uros Bizjak, André Günther,
	Roger Sayle, gcc, André Günther

On Tue, Jun 20, 2023 at 03:03:19PM +0000, Michael Matz via Gcc wrote:
> Hello,
> 
> On Tue, 20 Jun 2023, Jakub Jelinek via Gcc wrote:
> 
> > ce1 pass results in emit_conditional_move with
> > (gt (reg/v:SF 83 [ x ]) (reg:SF 84)), (reg/v:SF 83 [ x ]), (reg:SF 84)
> > operands in the GCC 11 case and so is successfully matched by
> > ix86_expand_fp_movcc as ix86_expand_sse_fp_minmax.
> > But, in GCC 12+, emit_conditional_move is called with
> > (gt (reg/v:SF 83 [ x ]) (reg:SF 84)), (reg/v:SF 83 [ x ]), (const_double:SF 0.0 [0x0.0p+0])
> > instead (reg:SF 84 in both cases contains the (const_double:SF 0.0 [0x0.0p+0])
> > value, in the GCC 11 case loaded from memory, in the GCC 12+ case set
> > directly in a move.  The reason for the difference is exactly that
> > because cheap SSE constant can be moved directly into a reg, it is done so
> > instead of reusing a pseudo that contains that value already.
> 
> But reg84 is _also_ used as operand of emit_conditional_move, so there's 
> no reason to not also use it as third operand.  It seems more canonical to 
> call a function with
> 
>   X-containing-B, A, B
> 
> than with
> 
>   X-containing-B, A, something-equal-to-B-but-not-B
> 
> so either the (const_double) RTL should be used in both, or reg84 should, 
> but not a mix.  Exactly so to ...

If ifcvt.cc knows that, then sure, but it can't be (easily) worked around on the
backend side...

	Jakub


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-06-20 15:19 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-19 19:10 Different ASM for ReLU function between GCC11 and GCC12 André Günther
2023-06-19 19:44 ` Jakub Jelinek
2023-06-20  8:15   ` Richard Biener
2023-06-20 13:06     ` Jakub Jelinek
2023-06-20 15:03       ` Michael Matz
2023-06-20 15:15         ` Jakub Jelinek
2023-06-19 21:56 ` Marc Glisse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).