* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 12:43 ` Stefan Kanthak
@ 2021-08-06 12:59 ` Richard Biener
2021-08-06 13:20 ` Gabriel Paubert
2021-08-06 13:31 ` Michael Matz
2 siblings, 0 replies; 19+ messages in thread
From: Richard Biener @ 2021-08-06 12:59 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: Gabriel Paubert, GCC Development
On Fri, Aug 6, 2021 at 2:47 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>
> Gabriel Paubert <paubert@iram.es> wrote:
>
> > Hi,
> >
> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> >> Gabriel Paubert <paubert@iram.es> wrote:
> >>
> >>
> >> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>
> >> >> .intel_syntax
> >> >> .text
> >> >> 0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument)
> >> >> 5: 48 f7 d8 neg rax
> >> >> # jz .L0 # argument zero?
> >> >> 8: 70 16 jo .L0 # argument indefinite?
> >> >> # argument overflows 64-bit integer?
> >> >> a: 48 f7 d8 neg rax
> >> >> d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
> >> >> 12: 66 0f 73 d0 3f psrlq xmm0, 63
> >> >> 17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> >> 1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> >> >> 20: c3 .L0: ret
> >> >> .end
> >> >
> >> > There is one important difference, namely setting the invalid exception
> >> > flag when the parameter can't be represented in a signed integer.
> >>
> >> Right, I overlooked this fault. Thanks for pointing it out.
> >>
> >> > So using your code may require some option (-ffast-math comes to mind),
> >> > or you need at least a check on the exponent before cvttsd2si.
> >>
> >> The whole idea behind these implementations is to get rid of loading
> >> floating-point constants to perform comparisons.
> >
> > Indeed, but what I had in mind was something along the following lines:
> >
> > movq rax,xmm0 # and copy rax to say rcx, if needed later
> > shrq rax,52 # move sign and exponent to 12 LSBs
> > andl eax,0x7ff # mask the sign
> > cmpl eax,0x434 # value to be checked
> > ja return # exponent too large, we're done (what about NaNs?)
> > cvttsd2si rax,xmm0 # safe after exponent check
> > cvtsi2sd xmm0,rax # conversion done
> >
> > and a bit more to handle the corner cases (essentially preserve the
> > sign to be correct between -1 and -0.0).
>
> The sign of -0.0 is the only corner case and is already handled in my code.
> Both SNaN and QNaN (which have an exponent of 0x7ff) are handled and
> preserved, both in the code GCC generates and in mine.
>
> > But the CPU can (speculatively) start the conversions early, so the
> > dependency chain is rather short.
>
> Correct.
>
> > I don't know if it's faster than your new code,
>
> It should be faster.
>
> > I'm almost sure that it's shorter.
>
> "neg rax; jo ...; neg rax" is 3+2+3 = 8 bytes, while the above sequence
> needs 5+4+5+5+2 = 21 bytes.
>
> JFTR: better use "add rax,rax; shr rax,53" instead of
> "shr rax,52; and eax,0x7ff" and save 2 bytes.
>
> Complete properly optimized code for __builtin_trunc is then as follows
> (11 instructions, 44 bytes):
>
> .code64
> .intel_syntax
> .equ BIAS, 1023
> .text
> movq rax, xmm0 # rax = argument
> add rax, rax
> shr rax, 53 # rax = exponent of |argument|
> cmp eax, BIAS + 53
> jae .Lexit # argument indefinite?
> # |argument| >= 0x1.0p53?
> cvttsd2si rax, xmm0 # rax = trunc(argument)
> cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
> psrlq xmm0, 63
> psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> orpd xmm0, xmm1 # xmm0 = trunc(argument)
> .L0: ret
> .end
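For reference, the exponent-check sequence above corresponds roughly to the following C sketch (the function name and the memcpy-based bit access are illustrative, not from the thread):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the assembly above: extract the biased exponent with a
   shift pair, bail out early for values that are already integral
   (or NaN/Inf), then round-trip through int64_t and re-attach the
   sign bit so trunc(-0.5) yields -0.0. */
static double trunc_sketch(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);              /* movq rax, xmm0          */
    uint64_t exp = (bits << 1) >> 53;            /* add rax,rax; shr rax,53 */
    if (exp >= 1023 + 53)                        /* cmp eax, BIAS+53; jae   */
        return x;                                /* |x| >= 2**53, NaN, Inf  */
    double t = (double)(int64_t)x;               /* cvttsd2si; cvtsi2sd     */
    uint64_t tbits;                              /* psrlq/psllq + orpd:     */
    memcpy(&tbits, &t, sizeof tbits);
    tbits |= bits & 0x8000000000000000u;         /* copy the sign bit       */
    memcpy(&t, &tbits, sizeof t);
    return t;
}
```

Note that the early exit also covers |x| in [2**52, 2**53) harmlessly falling through: those values convert exactly, so the threshold of BIAS+53 matches the assembly.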
>
> @Richard Biener (et al.):
>
> 1. Is a primitive for "floating-point > 2**x", which generates such
> an "integer" code sequence, already available, at least for
> float/binary32 and double/binary64?
Not that I know, but it should be possible to craft that.
> 2. The procedural code generator for __builtin_trunc() etc. uses
> __builtin_fabs() and __builtin_copysign() as building blocks.
> These would need to (and of course should) be modified to generate
> psllq/psrlq pairs instead of andpd/andnpd referencing a memory
> location with either -0.0 or ~(-0.0).
>
> For -ffast-math, where the sign of -0.0 is not handled and the spurious
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> it boils down to:
>
> .code64
> .intel_syntax
> .equ BIAS, 1023
> .text
> cvttsd2si rax, xmm0 # rax = trunc(argument)
> jo .Lexit # argument indefinite?
> # |argument| > 0x1.0p63?
> cvtsi2sd xmm0, rax # xmm0 = trunc(argument)
> .L0: ret
> .end
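The -ffast-math variant is essentially a bare integer round-trip; in C it amounts to something like this (a sketch; note the cast itself is undefined behaviour in C for |x| >= 2**63, which is the caveat the assembly comments point at):

```c
/* Fast-math trunc sketch: one conversion each way, no sign or
   overflow handling.  For |x| >= 2**63 the cast is undefined in C
   (the hardware conversion yields the "integer indefinite" value). */
static double trunc_fast(double x)
{
    return (double)(long long)x;
}
```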
>
> [...]
>
> >> Right, the conversions dominate both the original and the code I posted.
> >> It's easy to get rid of them, with still slightly shorter and faster
> >> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
> >> 57 + 32 = 89 bytes):
> >>
> >> .code64
> >> .intel_syntax noprefix
> >> .text
> >> 0: 48 b8 00 00 00 00 00 00 30 43 mov rax, 0x4330000000000000
> >> a: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = 0x1.0p52 = 4503599627370496.0
> >> f: 48 b8 00 00 00 00 00 00 f0 3f mov rax, 0x3FF0000000000000
> >> 19: f2 0f 10 c8 movsd xmm1, xmm0 # xmm1 = argument
> >> 1d: 66 0f 73 f0 01 psllq xmm0, 1
> >> 22: 66 0f 73 d0 01 psrlq xmm0, 1 # xmm0 = |argument|
> >> 27: 66 0f 73 d1 3f psrlq xmm1, 63
> >> 2c: 66 0f 73 f1 3f psllq xmm1, 63 # xmm1 = (argument & -0.0) ? -0.0 : +0.0
> >> 31: f2 0f 10 d8 movsd xmm3, xmm0
> >> 35: f2 0f 58 c2 addsd xmm0, xmm2 # xmm0 = |argument| + 0x1.0p52
> >> 39: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = |argument| - 0x1.0p52
> >> # = rint(|argument|)
> >> 3d: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = -0x1.0p0 = -1.0
> >
> > Huh? I see +1.0, -1 would be 0xBFF0000000000000.
>
> Spurious error in the comment.
> I modified code which uses -1.0 and performs (a commutative) "addsd xmm2, xmm2"
> instead of "subsd xmm0, xmm2" to save a "movsd" instruction.
>
> >> 42: f2 0f c2 d8 01 cmpltsd xmm3, xmm0 # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
> >> 47: 66 0f 54 d3 andpd xmm2, xmm3 # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >> 4b: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = rint(|argument|)
> >> # - (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >> # = trunc(|argument|)
> >> 4f: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> >> 53: c3 ret
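The magic-constant trick in this block can be sketched in C as follows (illustrative only; it assumes round-to-nearest SSE2-style arithmetic without extended precision, and a guard for |x| >= 2**52 is added here because the add/sub round trip can double-round odd integers just above 2**52):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the 0x1.0p52 add/sub trick: |x| + 2**52 - 2**52 computes
   rint(|x|) under round-to-nearest; subtracting 1.0 when the result
   exceeds |x| turns rint into trunc; the sign bit is OR'd back in. */
static double trunc_magic(double x)
{
    const double two52 = 0x1.0p52;               /* 4503599627370496.0      */
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint64_t abits = bits & 0x7fffffffffffffffu; /* psllq 1; psrlq 1        */
    if (abits >= 0x4330000000000000u)            /* |x| >= 2**52, NaN, Inf  */
        return x;                                /* already integral        */
    double ax;
    memcpy(&ax, &abits, sizeof ax);
    double r = (ax + two52) - two52;             /* rint(|x|)               */
    r -= (double)(ax < r);                       /* cmpltsd/andpd/subsd     */
    uint64_t rbits;
    memcpy(&rbits, &r, sizeof rbits);
    rbits |= bits & 0x8000000000000000u;         /* orpd: restore the sign  */
    memcpy(&r, &rbits, sizeof r);
    return r;
}
```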
>
> regards
> Stefan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 12:43 ` Stefan Kanthak
2021-08-06 12:59 ` Richard Biener
@ 2021-08-06 13:20 ` Gabriel Paubert
2021-08-06 14:37 ` Stefan Kanthak
2021-08-06 13:31 ` Michael Matz
2 siblings, 1 reply; 19+ messages in thread
From: Gabriel Paubert @ 2021-08-06 13:20 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: gcc
On Fri, Aug 06, 2021 at 02:43:34PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert <paubert@iram.es> wrote:
>
> > Hi,
> >
> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> >> Gabriel Paubert <paubert@iram.es> wrote:
> >>
> >>
> >> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>
> >> >> .intel_syntax
> >> >> .text
> >> >> 0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument)
> >> >> 5: 48 f7 d8 neg rax
> >> >> # jz .L0 # argument zero?
> >> >> 8: 70 16 jo .L0 # argument indefinite?
> >> >> # argument overflows 64-bit integer?
> >> >> a: 48 f7 d8 neg rax
> >> >> d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
> >> >> 12: 66 0f 73 d0 3f psrlq xmm0, 63
> >> >> 17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> >> 1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> >> >> 20: c3 .L0: ret
> >> >> .end
> >> >
> >> > There is one important difference, namely setting the invalid exception
> >> > flag when the parameter can't be represented in a signed integer.
> >>
> >> Right, I overlooked this fault. Thanks for pointing it out.
> >>
> >> > So using your code may require some option (-ffast-math comes to mind),
> >> > or you need at least a check on the exponent before cvttsd2si.
> >>
> >> The whole idea behind these implementations is to get rid of loading
> >> floating-point constants to perform comparisons.
> >
> > Indeed, but what I had in mind was something along the following lines:
> >
> > movq rax,xmm0 # and copy rax to say rcx, if needed later
> > shrq rax,52 # move sign and exponent to 12 LSBs
> > andl eax,0x7ff # mask the sign
> > cmpl eax,0x434 # value to be checked
> > ja return # exponent too large, we're done (what about NaNs?)
> > cvttsd2si rax,xmm0 # safe after exponent check
> > cvtsi2sd xmm0,rax # conversion done
> >
> > and a bit more to handle the corner cases (essentially preserve the
> > sign to be correct between -1 and -0.0).
>
> The sign of -0.0 is the only corner case and already handled in my code.
> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
> preserved, as in the code GCC generates as well as my code.
I don't know what the standard says about NaNs in this case; I seem to
remember that arithmetic instructions typically produce a QNaN when one of
the inputs is a NaN, whether signaling or not.
>
> > But the CPU can (speculatively) start the conversions early, so the
> > dependency chain is rather short.
>
> Correct.
>
> > I don't know if it's faster than your new code,
>
> It should be faster.
>
> > I'm almost sure that it's shorter.
>
> "neg rax; jo ...; neg rax" is 3+2+3 = 8 bytes, while the above sequence
> needs 5+4+5+5+2 = 21 bytes.
>
> JFTR: better use "add rax,rax; shr rax,53" instead of
> "shr rax,52; and eax,0x7ff" and save 2 bytes.
Indeed, I don't have the exact size of instructions in my head,
especially since I've not written x86 assembly since the mid 90s.
In any case, with your last improvement, the code is now down to a
single 32 bit immediate constant. And I don't see how to eliminate it...
>
> Complete properly optimized code for __builtin_trunc is then as follows
> (11 instructions, 44 bytes):
>
> .code64
> .intel_syntax
> .equ BIAS, 1023
> .text
> movq rax, xmm0 # rax = argument
> add rax, rax
> shr rax, 53 # rax = exponent of |argument|
> cmp eax, BIAS + 53
> jae .Lexit # argument indefinite?
Maybe s/.Lexit/.L0/
> # |argument| >= 0x1.0p53?
> cvttsd2si rax, xmm0 # rax = trunc(argument)
> cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
> psrlq xmm0, 63
> psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> orpd xmm0, xmm1 # xmm0 = trunc(argument)
> .L0: ret
> .end
>
This looks nice.
> @Richard Biener (et al.):
>
> 1. Is a primitive for "floating-point > 2**x", which generates such
> an "integer" code sequence, already available, at least for
> float/binary32 and double/binary64?
>
> 2. The procedural code generator for __builtin_trunc() etc. uses
> __builtin_fabs() and __builtin_copysign() as building blocks.
> These would need to (and of course should) be modified to generate
> psllq/psrlq pairs instead of andpd/andnpd referencing a memory
> location with either -0.0 or ~(-0.0).
>
> For -ffast-math, where the sign of -0.0 is not handled and the spurious
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> it boils down to:
>
> .code64
> .intel_syntax
> .equ BIAS, 1023
> .text
> cvttsd2si rax, xmm0 # rax = trunc(argument)
> jo .Lexit # argument indefinite?
> # |argument| > 0x1.0p63?
> cvtsi2sd xmm0, rax # xmm0 = trunc(argument)
> .L0: ret
> .end
>
> [...]
>
> >> Right, the conversions dominate both the original and the code I posted.
> >> It's easy to get rid of them, with still slightly shorter and faster
> >> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
> >> 57 + 32 = 89 bytes):
> >>
> >> .code64
> >> .intel_syntax noprefix
> >> .text
> >> 0: 48 b8 00 00 00 00 00 00 30 43 mov rax, 0x4330000000000000
> >> a: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = 0x1.0p52 = 4503599627370496.0
> >> f: 48 b8 00 00 00 00 00 00 f0 3f mov rax, 0x3FF0000000000000
> >> 19: f2 0f 10 c8 movsd xmm1, xmm0 # xmm1 = argument
> >> 1d: 66 0f 73 f0 01 psllq xmm0, 1
> >> 22: 66 0f 73 d0 01 psrlq xmm0, 1 # xmm0 = |argument|
> >> 27: 66 0f 73 d1 3f psrlq xmm1, 63
> >> 2c: 66 0f 73 f1 3f psllq xmm1, 63 # xmm1 = (argument & -0.0) ? -0.0 : +0.0
> >> 31: f2 0f 10 d8 movsd xmm3, xmm0
> >> 35: f2 0f 58 c2 addsd xmm0, xmm2 # xmm0 = |argument| + 0x1.0p52
> >> 39: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = |argument| - 0x1.0p52
> >> # = rint(|argument|)
> >> 3d: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = -0x1.0p0 = -1.0
> >
> > Huh? I see +1.0, -1 would be 0xBFF0000000000000.
>
> Spurious error in the comment.
> I modified code which uses -1.0 and performs (a commutative) "addsd xmm2, xmm2"
> instead of "subsd xmm0, xmm2" to save a "movsd" instruction.
>
> >> 42: f2 0f c2 d8 01 cmpltsd xmm3, xmm0 # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
> >> 47: 66 0f 54 d3 andpd xmm2, xmm3 # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >> 4b: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = rint(|argument|)
> >> # - (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >> # = trunc(|argument|)
> >> 4f: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> >> 53: c3 ret
>
> regards
> Stefan
Regards,
Gabriel
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 13:20 ` Gabriel Paubert
@ 2021-08-06 14:37 ` Stefan Kanthak
2021-08-06 17:44 ` Joseph Myers
0 siblings, 1 reply; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-06 14:37 UTC (permalink / raw)
To: Gabriel Paubert; +Cc: gcc
Gabriel Paubert <paubert@iram.es> wrote:
> On Fri, Aug 06, 2021 at 02:43:34PM +0200, Stefan Kanthak wrote:
>> Gabriel Paubert <paubert@iram.es> wrote:
>>
>> > Hi,
>> >
>> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
[...]
>> >> The whole idea behind these implementations is to get rid of loading
>> >> floating-point constants to perform comparisons.
>> >
>> > Indeed, but what I had in mind was something along the following lines:
>> >
>> > movq rax,xmm0 # and copy rax to say rcx, if needed later
>> > shrq rax,52 # move sign and exponent to 12 LSBs
>> > andl eax,0x7ff # mask the sign
>> > cmpl eax,0x434 # value to be checked
>> > ja return # exponent too large, we're done (what about NaNs?)
>> > cvttsd2si rax,xmm0 # safe after exponent check
>> > cvtsi2sd xmm0,rax # conversion done
>> >
>> > and a bit more to handle the corner cases (essentially preserve the
>> > sign to be correct between -1 and -0.0).
>>
>> The sign of -0.0 is the only corner case and already handled in my code.
>> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
>> preserved, as in the code GCC generates as well as my code.
>
> I don't know what the standard says about NaNs in this case, I seem to
> remember that arithmetic instructions typically produce QNaN when one of
> the inputs is a NaN, whether signaling or not.
<https://pubs.opengroup.org/onlinepubs/9699919799/functions/trunc.html>
and its cousins as well as the C standard say
| If x is NaN, a NaN shall be returned.
That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.
>> > But the CPU can (speculatively) start the conversions early, so the
>> > dependency chain is rather short.
>>
>> Correct.
>>
>> > I don't know if it's faster than your new code,
>>
>> It should be faster.
>>
>> > I'm almost sure that it's shorter.
>>
>> "neg rax; jo ...; neg rax" is 3+2+3 = 8 bytes, while the above sequence
>> needs 5+4+5+5+2 = 21 bytes.
>>
>> JFTR: better use "add rax,rax; shr rax,53" instead of
>> "shr rax,52; and eax,0x7ff" and save 2 bytes.
>
> Indeed, I don't have the exact size of instructions in my head,
> especially since I've not written x86 assembly since the mid 90s.
>
> In any case, with your last improvement, the code is now down to a
> single 32 bit immediate constant. And I don't see how to eliminate it...
>
>>
>> Complete properly optimized code for __builtin_trunc is then as follows
>> (11 instructions, 44 bytes):
>>
>> .code64
>> .intel_syntax
>> .equ BIAS, 1023
>> .text
>> movq rax, xmm0 # rax = argument
>> add rax, rax
>> shr rax, 53 # rax = exponent of |argument|
>> cmp eax, BIAS + 53
>> jae .Lexit # argument indefinite?
>
> Maybe s/.Lexit/.L0/
Surely!
>> # |argument| >= 0x1.0p53?
>> cvttsd2si rax, xmm0 # rax = trunc(argument)
>> cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
>> psrlq xmm0, 63
>> psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> orpd xmm0, xmm1 # xmm0 = trunc(argument)
>> .L0: ret
>> .end
>>
>
> This looks nice.
Let's see how to convince GCC to generate such code sequences...
Stefan
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 14:37 ` Stefan Kanthak
@ 2021-08-06 17:44 ` Joseph Myers
2021-08-07 12:32 ` Stefan Kanthak
0 siblings, 1 reply; 19+ messages in thread
From: Joseph Myers @ 2021-08-06 17:44 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: Gabriel Paubert, gcc
On Fri, 6 Aug 2021, Stefan Kanthak wrote:
> > I don't know what the standard says about NaNs in this case, I seem to
> > remember that arithmetic instructions typically produce QNaN when one of
> > the inputs is a NaN, whether signaling or not.
>
> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/trunc.html>
> and its cousins as well as the C standard say
>
> | If x is NaN, a NaN shall be returned.
>
> That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.
You should be looking at TS 18661-3 / C2x Annex F for sNaN handling; the
POSIX attempts to deal with signaling NaNs aren't well thought out. (It may
also be appropriate to add flag_signaling_nans conditions to disable, under
-fsignaling-nans, any expansion of these to-integer operations that doesn't
produce a qNaN with INVALID raised from sNaN input.) Though in C2x mode,
these SSE2 code sequences won't be used by default anyway, except for rint
(C2x implies -fno-fp-int-builtin-inexact).
--
Joseph S. Myers
joseph@codesourcery.com
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 17:44 ` Joseph Myers
@ 2021-08-07 12:32 ` Stefan Kanthak
2021-08-08 22:58 ` Vincent Lefevre
2021-08-09 17:19 ` Joseph Myers
0 siblings, 2 replies; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-07 12:32 UTC (permalink / raw)
To: Joseph Myers; +Cc: Gabriel Paubert, gcc
Joseph Myers <joseph@codesourcery.com> wrote:
> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
PLEASE DON'T STRIP ATTRIBUTION LINES: I did not write the following paragraph!
>> > I don't know what the standard says about NaNs in this case, I seem to
>> > remember that arithmetic instructions typically produce QNaN when one of
>> > the inputs is a NaN, whether signaling or not.
>>
>> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/trunc.html>
>> and its cousins as well as the C standard say
>>
>> | If x is NaN, a NaN shall be returned.
>>
>> That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.
>
> You should be looking at TS 18661-3 / C2x Annex F for sNaN handling;
I'll do so as soon as GCC drops support for all C dialects before C2x!
Unless you use a time machine and fix the POSIX and ISO C standards
written in the past, you CAN'T neglect all the software that was written
before C2x modified sNaN handling and that relies on the behaviour
documented at the time it was written.
> the POSIX attempts to deal with signaling NaNs aren't well thought out.
[...]
> Though in C2x mode, these SSE2 code sequences won't be used by default
> anyway, except for rint (C2x implies -fno-fp-int-builtin-inexact).
regards
Stefan
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-07 12:32 ` Stefan Kanthak
@ 2021-08-08 22:58 ` Vincent Lefevre
2021-08-09 17:19 ` Joseph Myers
1 sibling, 0 replies; 19+ messages in thread
From: Vincent Lefevre @ 2021-08-08 22:58 UTC (permalink / raw)
To: gcc
On 2021-08-07 14:32:32 +0200, Stefan Kanthak wrote:
> Joseph Myers <joseph@codesourcery.com> wrote:
> > On Fri, 6 Aug 2021, Stefan Kanthak wrote:
>
> PLEASE DON'T STRIP ATTRIBUTION LINES: I did not write the following paragraph!
>
> >> > I don't know what the standard says about NaNs in this case, I seem to
> >> > remember that arithmetic instructions typically produce QNaN when one of
> >> > the inputs is a NaN, whether signaling or not.
> >>
> >> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/trunc.html>
> >> and its cousins as well as the C standard say
> >>
> >> | If x is NaN, a NaN shall be returned.
> >>
> >> That's why I mentioned that the code GCC generates also doesn't
> >> quiet SNaNs.
> >
> > You should be looking at TS 18661-3 / C2x Annex F for sNaN handling;
>
> I'll do so as soon as GCC drops support for all C dialects before C2x!
>
> Unless you use a time machine and fix the POSIX and ISO C standards
> written in the past, you CAN'T neglect all the software that was written
> before C2x modified sNaN handling and that relies on the behaviour
> documented at the time it was written.
Before C2x:
This specification does not define the behavior of signaling NaNs.365)
It generally uses the term NaN to denote quiet NaNs.
365) Since NaNs created by IEC 60559 operations are always quiet,
quiet NaNs (along with infinities) are sufficient for closure of
the arithmetic.
(in Annex F).
You can't create signaling NaNs with C operations, but you may get
them when reading data from memory. So, IMHO, they should really be
supported in practice, at least in some sense. I would expect that
when a sNaN occurs as an input, it is handled either like a sNaN
(see IEC 60559 / IEEE 754) or like a qNaN. Propagating the
signaling status (forbidden in IEEE 754 for almost all operations)
could be acceptable (this means that an implementation may ignore
whether a NaN is quiet or signaling), but should be avoided.
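To make the quiet/signaling distinction concrete: in the common IEEE 754 binary64 NaN encoding (x86, ARM, RISC-V), the most significant mantissa bit is the "quiet" bit. The following sketch (names and encoding assumptions mine, not from the thread) builds an sNaN by hand and classifies NaNs by that bit; running such a value through any run-time arithmetic operation is what quiets it and raises INVALID, while pure bit operations like fabs() must propagate it unchanged.

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Classify a double: returns 1 only for quiet NaNs (exponent all
   ones, non-zero mantissa, quiet bit set).  Assumes the common
   "MSB of mantissa = quiet" NaN encoding. */
static int is_quiet_nan(double d)
{
    uint64_t b;
    memcpy(&b, &d, sizeof b);
    return ((b & 0x7ff0000000000000u) == 0x7ff0000000000000u) /* NaN or Inf  */
        && (b & 0x000fffffffffffffu) != 0                     /* NaN, not Inf */
        && (b & 0x0008000000000000u) != 0;                    /* quiet bit    */
}

/* Build a signaling NaN by hand: all-ones exponent, quiet bit clear,
   non-zero payload (bit pattern is an assumption of this encoding). */
static double make_snan(void)
{
    uint64_t b = 0x7ff0000000000001u;
    double d;
    memcpy(&d, &b, sizeof d);
    return d;
}
```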
--
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-07 12:32 ` Stefan Kanthak
2021-08-08 22:58 ` Vincent Lefevre
@ 2021-08-09 17:19 ` Joseph Myers
1 sibling, 0 replies; 19+ messages in thread
From: Joseph Myers @ 2021-08-09 17:19 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: gcc
On Sat, 7 Aug 2021, Stefan Kanthak wrote:
> Joseph Myers <joseph@codesourcery.com> wrote:
> > You should be looking at TS 18661-3 / C2x Annex F for sNaN handling;
>
> I'll do so as soon as GCC drops support for all C dialects before C2x!
>
> Unless you use a time machine and fix the POSIX and ISO C standards
> written in the past you CAN'T neglect all software written before C2x
> modified sNaN handling that relies on the documented behaviour at the
> time it was written.
Pre-C2x versions of C don't cover signaling NaNs at all; they use "NaN" to
mean "quiet NaN" (so signaling NaNs are trap representations). Software
written before C2x thus can't rely on any particular sNaN handling.
The POSIX description of signaling NaNs ("On implementations that support
the IEC 60559:1989 standard floating point, functions with signaling NaN
argument(s) shall be treated as if the function were called with an
argument that is a required domain error and shall return a quiet NaN
result, except where stated otherwise.") is consistent with C2x as regards
trunc (sNaN) needing to return a quiet NaN with INVALID raised. The
problems are (a) POSIX fails to "state otherwise" for the cases (e.g.
fabs, copysign) where a signaling NaN argument should not result in a
quiet NaN with INVALID raised (as per IEEE semantics for those operations)
and (b) the POSIX rule about setting errno to EDOM when (math_errhandling
& MATH_ERRNO) is nonzero is inappropriate for sNaN arguments (incompatible
with the normal approach of generating INVALID and a quiet NaN by passing
NaN arguments through arithmetic) and the C2x approach of being
implementation-defined whether an sNaN input is a domain error is more
appropriate.
--
Joseph S. Myers
joseph@codesourcery.com
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 12:43 ` Stefan Kanthak
2021-08-06 12:59 ` Richard Biener
2021-08-06 13:20 ` Gabriel Paubert
@ 2021-08-06 13:31 ` Michael Matz
2021-08-06 14:32 ` Stefan Kanthak
2 siblings, 1 reply; 19+ messages in thread
From: Michael Matz @ 2021-08-06 13:31 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: Gabriel Paubert, gcc
Hello,
On Fri, 6 Aug 2021, Stefan Kanthak wrote:
> For -ffast-math, where the sign of -0.0 is not handled and the spurious
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
This claim would need to be proven in the wild. Arguments with
|argument| > 2**52 are already integral, and shouldn't generate a spurious
exception from the various to-int conversions, not even in fast-math mode,
for some relevant set of applications (at least SPECcpu).
Btw, have you made speed measurements with your improvements? The size
improvements are obvious, but speed changes can be fairly unintuitive,
e.g. there were old K8 CPUs where the memory loads for constants were
actually faster than the equivalent sequence of shifting and masking for
the >= compares. That's an irrelevant CPU now, but it shows that
intuition about speed consequences can be wrong.
Ciao,
Michael.
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 13:31 ` Michael Matz
@ 2021-08-06 14:32 ` Stefan Kanthak
2021-08-06 15:04 ` Michael Matz
2021-08-06 15:16 ` Richard Biener
0 siblings, 2 replies; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-06 14:32 UTC (permalink / raw)
To: Michael Matz; +Cc: Gabriel Paubert, gcc
Michael Matz <matz@suse.de> wrote:
> Hello,
>
> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
>
>> For -ffast-math, where the sign of -0.0 is not handled and the spurious
>> invalid floating-point exception for |argument| >= 2**63 is acceptable,
>
> This claim would need to be proven in the wild.
I should have left the "when" after the "and" which I originally had
written...
> |argument| > 2**52 are already integer, and shouldn't generate a spurious
> exception from the various to-int conversions, not even in fast-math mode
> for some relevant set of applications (at least SPECcpu).
>
> Btw, have you made speed measurements with your improvements?
No.
> The size improvements are obvious, but speed changes can be fairly
> unintuitive, e.g. there were old K8 CPUs where the memory loads for
> constants are actually faster than the equivalent sequence of shifting
> and masking for the >= compares. That's an irrelevant CPU now, but it
> shows that intuition about speed consequences can be wrong.
I know. I also know of CPUs that can't load a 16-byte-wide XMM register
in one go and have to split the load into two 8-byte loads.
If the constant happens to be present in L1 cache, it MAY load as fast
as an immediate.
BUT: on current CPUs, the code GCC generates
movsd .LC1(%rip), %xmm2
movsd .LC0(%rip), %xmm4
movapd %xmm0, %xmm3
movapd %xmm0, %xmm1
andpd %xmm2, %xmm3
ucomisd %xmm3, %xmm4
jbe 38 <_trunc+0x38>
needs
- 4 cycles if the movsd are executed in parallel and the movapd are
handled by the register renamer,
- 5 cycles if the movsd and the movapd are executed in parallel,
- 7 cycles else,
plus an unknown number of cycles if the constants are not in L1.
The proposed
movq rax, xmm0
add rax, rax
shr rax, 53
cmp eax, 53+1023
jae return
needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
register renamer).
Stefan
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 14:32 ` Stefan Kanthak
@ 2021-08-06 15:04 ` Michael Matz
2021-08-06 15:16 ` Richard Biener
1 sibling, 0 replies; 19+ messages in thread
From: Michael Matz @ 2021-08-06 15:04 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: Gabriel Paubert, gcc
Hello,
On Fri, 6 Aug 2021, Stefan Kanthak wrote:
> >> For -ffast-math, where the sign of -0.0 is not handled and the
> >> spurious invalid floating-point exception for |argument| >= 2**63 is
> >> acceptable,
> >
> > This claim would need to be proven in the wild.
>
> I should have left the "when" after the "and" which I originally had
> written...
>
> > |argument| > 2**52 are already integer, and shouldn't generate a
> > spurious exception from the various to-int conversions, not even in
> > fast-math mode for some relevant set of applications (at least
> > SPECcpu).
> >
> > Btw, have you made speed measurements with your improvements?
>
> No.
>
> > The size improvements are obvious, but speed changes can be fairly
> > unintuitive, e.g. there were old K8 CPUs where the memory loads for
> > constants are actually faster than the equivalent sequence of shifting
> > and masking for the >= compares. That's an irrelevant CPU now, but it
> > shows that intuition about speed consequences can be wrong.
>
> I know. I also know of CPUs that can't load a 16-byte-wide XMM register
> in one go and have to split the load into two 8-byte loads.
>
> If the constant happens to be present in L1 cache, it MAY load as fast
> as an immediate.
> BUT: on current CPUs, the code GCC generates
>
> movsd .LC1(%rip), %xmm2
> movsd .LC0(%rip), %xmm4
> movapd %xmm0, %xmm3
> movapd %xmm0, %xmm1
> andpd %xmm2, %xmm3
> ucomisd %xmm3, %xmm4
> jbe 38 <_trunc+0x38>
>
> needs
> - 4 cycles if the movsd are executed in parallel and the movapd are
> handled by the register renamer,
> - 5 cycles if the movsd and the movapd are executed in parallel,
> - 7 cycles else,
> plus an unknown number of cycles if the constants are not in L1.
You also need to consider the case where the to-int converters are called
in a loop (which is ultimately the only interesting case for performance),
where it's possible to load the constants before the loop and keep them in
registers (at the expense of two registers of pressure, of course),
effectively removing the loads from cost considerations. It's all tough
choices, which is why stuff needs to be measured in some contexts :-)
(I do like your sequences btw, it's just not 100% clearcut that they are
always a speed improvement).
Ciao,
Michael.
> The proposed
>
> movq rax, xmm0
> add rax, rax
> shr rax, 53
> cmp eax, 53+1023
> jae return
>
> needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
> register renamer).
>
> Stefan
>
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 14:32 ` Stefan Kanthak
2021-08-06 15:04 ` Michael Matz
@ 2021-08-06 15:16 ` Richard Biener
2021-08-06 16:57 ` Stefan Kanthak
1 sibling, 1 reply; 19+ messages in thread
From: Richard Biener @ 2021-08-06 15:16 UTC (permalink / raw)
To: gcc, Stefan Kanthak, Michael Matz
On August 6, 2021 4:32:48 PM GMT+02:00, Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>Michael Matz <matz@suse.de> wrote:
>
>
>> Hello,
>>
>> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
>>
>>> For -ffast-math, where the sign of -0.0 is not handled and the spurious
>>> invalid floating-point exception for |argument| >= 2**63 is acceptable,
>>
>> This claim would need to be proven in the wild.
>
>I should have left the "when" after the "and" which I originally had
>written...
>
>> |argument| > 2**52 are already integer, and shouldn't generate a spurious
>> exception from the various to-int conversions, not even in fast-math mode
>> for some relevant set of applications (at least SPECcpu).
>>
>> Btw, have you made speed measurements with your improvements?
>
>No.
>
>> The size improvements are obvious, but speed changes can be fairly
>> unintuitive, e.g. there were old K8 CPUs where the memory loads for
>> constants are actually faster than the equivalent sequence of shifting
>> and masking for the >= compares. That's an irrelevant CPU now, but it
>> shows that intuition about speed consequences can be wrong.
>
>I know. I also know of CPUs that can't load a 16-byte-wide XMM register
>in one go and have to split the load into two 8-byte loads.
>
>If the constant happens to be present in L1 cache, it MAY load as fast
>as an immediate.
>BUT: on current CPUs, the code GCC generates
>
> movsd .LC1(%rip), %xmm2
> movsd .LC0(%rip), %xmm4
> movapd %xmm0, %xmm3
> movapd %xmm0, %xmm1
> andpd %xmm2, %xmm3
> ucomisd %xmm3, %xmm4
> jbe 38 <_trunc+0x38>
>
>needs
>- 4 cycles if the movsd are executed in parallel and the movapd are
> handled by the register renamer,
>- 5 cycles if the movsd and the movapd are executed in parallel,
>- 7 cycles else,
>plus an unknown number of cycles if the constants are not in L1.
>The proposed
>
> movq rax, xmm0
The xmm to GPR move costs you an extra cycle in latency. Shifts also tend to be port-constrained. The original sequences are also somewhat straightforward to vectorize.
> add rax, rax
> shr rax, 53
> cmp eax, 53+1023
> jae return
>
>needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
>register renamer).
>
>Stefan
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 15:16 ` Richard Biener
@ 2021-08-06 16:57 ` Stefan Kanthak
0 siblings, 0 replies; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-06 16:57 UTC (permalink / raw)
To: Richard Biener, gcc, Michael Matz
Richard Biener <richard.guenther@gmail.com> wrote:
> On August 6, 2021 4:32:48 PM GMT+02:00, Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>Michael Matz <matz@suse.de> wrote:
>>> Btw, have you made speed measurements with your improvements?
>>
>>No.
[...]
>>If the constant happens to be present in L1 cache, it MAY load as fast
>>as an immediate.
>>BUT: on current CPUs, the code GCC generates
>>
>> movsd .LC1(%rip), %xmm2
>> movsd .LC0(%rip), %xmm4
>> movapd %xmm0, %xmm3
>> movapd %xmm0, %xmm1
>> andpd %xmm2, %xmm3
>> ucomisd %xmm3, %xmm4
>> jbe 38 <_trunc+0x38>
>>
>>needs
>>- 4 cycles if the movsd are executed in parallel and the movapd are
>> handled by the register renamer,
>>- 5 cycles if the movsd and the movapd are executed in parallel,
>>- 7 cycles else,
>>plus an unknown number of cycles if the constants are not in L1.
>>The proposed
>>
>> movq rax, xmm0
>
> The xmm to GPR move costs you an extra cycle in latency. Shifts also
> tend to be port-constrained. The original sequences are also somewhat
> straightforward to vectorize.
Please show how GCC vectorizes CVT[T]SD2SI and CVTSI2SD!
These are the bottlenecks in the current code.
If you want the code for trunc() and cousins to be vectorizable you
should stay with the alternative code I presented some posts before,
which GCC should be (able to) generate from its other procedural
variant.
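As an aside on the vectorization point: the constant-based variant really is mechanically vectorizable with plain SSE2. A sketch with intrinsics (the function name and lane-blend strategy are assumptions, not from the thread; lanes with |x| >= 2**52, NaN or Inf pass through unchanged):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* trunc() on two packed doubles using only SSE2: the 0x1.0p52
   add/sub trick computes rint(|x|), a compare+and+sub corrects it
   down toward zero, and a mask blend keeps already-integral (or
   NaN/Inf) lanes untouched. */
static __m128d trunc2_sse2(__m128d x)
{
    const __m128d two52   = _mm_set1_pd(0x1.0p52);
    const __m128d signbit = _mm_set1_pd(-0.0);
    const __m128d one     = _mm_set1_pd(1.0);

    __m128d sign = _mm_and_pd(x, signbit);
    __m128d ax   = _mm_andnot_pd(signbit, x);                 /* |x|           */
    __m128d big  = _mm_cmpnlt_pd(ax, two52);                  /* |x|>=2**52, NaN */
    __m128d r    = _mm_sub_pd(_mm_add_pd(ax, two52), two52);  /* rint(|x|)     */
    __m128d corr = _mm_and_pd(one, _mm_cmplt_pd(ax, r));      /* 1.0 if rounded up */
    r = _mm_or_pd(_mm_sub_pd(r, corr), sign);                 /* trunc, signed */
    return _mm_or_pd(_mm_and_pd(big, x),                      /* keep big lanes */
                     _mm_andnot_pd(big, r));
}
```

The cvt[t]sd2si-based variants, by contrast, have no packed-integer64 counterpart before AVX-512DQ, which is Stefan's point about where the bottleneck sits.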
Stefan