* Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1
@ 2021-08-05 7:25 Stefan Kanthak
2021-08-05 9:42 ` Gabriel Paubert
0 siblings, 1 reply; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-05 7:25 UTC (permalink / raw)
To: gcc
Hi,
targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
following code (13 instructions using 57 bytes, plus 4 quadwords
using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
.text
0: f2 0f 10 15 10 00 00 00 movsd .LC1(%rip), %xmm2
4: R_X86_64_PC32 .rdata
8: f2 0f 10 25 00 00 00 00 movsd .LC0(%rip), %xmm4
c: R_X86_64_PC32 .rdata
10: 66 0f 28 d8 movapd %xmm0, %xmm3
14: 66 0f 28 c8 movapd %xmm0, %xmm1
18: 66 0f 54 da andpd %xmm2, %xmm3
1c: 66 0f 2e e3 ucomisd %xmm3, %xmm4
20: 76 16 jbe 38 <_trunc+0x38>
22: f2 48 0f 2c c0 cvttsd2si %xmm0, %rax
27: 66 0f ef c0 pxor %xmm0, %xmm0
2b: 66 0f 55 d1 andnpd %xmm1, %xmm2
2f: f2 48 0f 2a c0 cvtsi2sd %rax, %xmm0
34: 66 0f 56 c2 orpd %xmm2, %xmm0
38: c3 retq
.rdata
.align 8
0: 00 00 00 00 .LC0: .quad 0x1.0p52
00 00 30 43
00 00 00 00
00 00 00 00
.align 16
10: ff ff ff ff .LC1: .quad ~(-0.0)
ff ff ff 7f
18: 00 00 00 00 .quad 0.0
00 00 00 00
.end
JFTR: in the best case, the memory accesses cost several cycles,
while in the worst case they yield a page fault!
Properly optimized, shorter and faster code, using only 9 instructions
in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
and saving at least 16 + 32 bytes, follows:
.intel_syntax
.text
0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument)
5: 48 f7 d8 neg rax
# jz .L0 # argument zero?
8: 70 16 jo .L0 # argument indefinite?
# argument overflows 64-bit integer?
a: 48 f7 d8 neg rax
d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
12: 66 0f 73 d0 3f psrlq xmm0, 63
17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
20: c3 .L0: ret
.end
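In C terms, the 9-instruction sequence corresponds roughly to the sketch
below (my illustration, not the exact instruction semantics: the explicit
range guard stands in for the neg/jo test of cvttsd2si's "integer
indefinite" result, and trunc_via_int64 is a hypothetical name):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the proposed sequence: convert to a 64-bit integer and
   back, returning the argument unchanged when the conversion would
   overflow (NaN, Inf, |x| >= 2^63), and re-attaching the sign bit so
   that trunc(-0.5) yields -0.0. */
static double trunc_via_int64(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint64_t sign = bits & 0x8000000000000000ull;  /* psrlq/psllq by 63 */

    /* cvttsd2si returns the "integer indefinite" value on overflow;
       the asm detects this with neg/jo and returns the input as-is. */
    if (!(fabs(x) < 0x1.0p63))
        return x;

    int64_t i = (int64_t)x;                        /* cvttsd2si */
    double t = (double)i;                          /* cvtsi2sd  */

    uint64_t tbits;
    memcpy(&tbits, &t, sizeof tbits);
    tbits |= sign;                                 /* orpd: restore -0.0 */
    memcpy(&t, &tbits, sizeof tbits);
    return t;
}
```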
regards
Stefan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1
2021-08-05 7:25 Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1 Stefan Kanthak
@ 2021-08-05 9:42 ` Gabriel Paubert
2021-08-05 10:19 ` Richard Biener
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Gabriel Paubert @ 2021-08-05 9:42 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: gcc
On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> Hi,
>
> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> following code (13 instructions using 57 bytes, plus 4 quadwords
> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
>
> .text
> 0: f2 0f 10 15 10 00 00 00 movsd .LC1(%rip), %xmm2
> 4: R_X86_64_PC32 .rdata
> 8: f2 0f 10 25 00 00 00 00 movsd .LC0(%rip), %xmm4
> c: R_X86_64_PC32 .rdata
> 10: 66 0f 28 d8 movapd %xmm0, %xmm3
> 14: 66 0f 28 c8 movapd %xmm0, %xmm1
> 18: 66 0f 54 da andpd %xmm2, %xmm3
> 1c: 66 0f 2e e3 ucomisd %xmm3, %xmm4
> 20: 76 16 jbe 38 <_trunc+0x38>
> 22: f2 48 0f 2c c0 cvttsd2si %xmm0, %rax
> 27: 66 0f ef c0 pxor %xmm0, %xmm0
> 2b: 66 0f 55 d1 andnpd %xmm1, %xmm2
> 2f: f2 48 0f 2a c0 cvtsi2sd %rax, %xmm0
> 34: 66 0f 56 c2 orpd %xmm2, %xmm0
> 38: c3 retq
>
> .rdata
> .align 8
> 0: 00 00 00 00 .LC0: .quad 0x1.0p52
> 00 00 30 43
> 00 00 00 00
> 00 00 00 00
> .align 16
> 10: ff ff ff ff .LC1: .quad ~(-0.0)
> ff ff ff 7f
> 18: 00 00 00 00 .quad 0.0
> 00 00 00 00
> .end
>
> JFTR: in the best case, the memory accesses cost several cycles,
> while in the worst case they yield a page fault!
>
>
> Properly optimized, shorter and faster code, using but only 9 instructions
> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
> and saving at least 16 + 32 bytes, follows:
>
> .intel_syntax
> .text
> 0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument)
> 5: 48 f7 d8 neg rax
> # jz .L0 # argument zero?
> 8: 70 16 jo .L0 # argument indefinite?
> # argument overflows 64-bit integer?
> a: 48 f7 d8 neg rax
> d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
> 12: 66 0f 73 d0 3f psrlq xmm0, 63
> 17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> 1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> 20: c3 .L0: ret
> .end
There is one important difference, namely setting the invalid exception
flag when the parameter can't be represented in a signed integer. So
using your code may require some option (-ffast-math comes to mind), or
you need at least a check on the exponent before cvttsd2si.
The last part of your code then takes into account the special case
of -0.0, which I most often don't care about (I'd like to have a
-fdont-split-hairs-about-the-sign-of-zero option).
Potentially generating spurious invalid operation and then carefully
taking into account the sign of zero does not seem very consistent.
Apart from this, in your code, after cvttsd2si I'd rather use:
mov rcx,rax # make a second copy to a scratch register
neg rcx
jo .L0
cvtsi2sd xmm1,rax
The reason is latency: in an OoO engine, splitting the two paths is
almost always a win.
With your patch:
cvttsd2si-->neg-?->neg-->cvtsi2sd
where the ? means that the following instructions are speculated.
With an auxiliary register there are two dependency chains:
cvttsd2si-?->cvtsi2sd
|->mov->neg->jump
Actually, some OoO cores simply eliminate register copies through the
register-renaming mechanism. But even this is probably completely
irrelevant in this case, where the latency is dominated by the two
conversion instructions.
Regards,
Gabriel
>
> regards
> Stefan
* Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1
2021-08-05 9:42 ` Gabriel Paubert
@ 2021-08-05 10:19 ` Richard Biener
2021-08-05 11:58 ` Stefan Kanthak
2021-08-05 13:18 ` Gabriel Ravier
2 siblings, 0 replies; 19+ messages in thread
From: Richard Biener @ 2021-08-05 10:19 UTC (permalink / raw)
To: Gabriel Paubert; +Cc: Stefan Kanthak, GCC Development
On Thu, Aug 5, 2021 at 11:44 AM Gabriel Paubert <paubert@iram.es> wrote:
>
> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> > Hi,
> >
> > targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> > following code (13 instructions using 57 bytes, plus 4 quadwords
> > using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> >
> > .text
> > 0: f2 0f 10 15 10 00 00 00 movsd .LC1(%rip), %xmm2
> > 4: R_X86_64_PC32 .rdata
> > 8: f2 0f 10 25 00 00 00 00 movsd .LC0(%rip), %xmm4
> > c: R_X86_64_PC32 .rdata
> > 10: 66 0f 28 d8 movapd %xmm0, %xmm3
> > 14: 66 0f 28 c8 movapd %xmm0, %xmm1
> > 18: 66 0f 54 da andpd %xmm2, %xmm3
> > 1c: 66 0f 2e e3 ucomisd %xmm3, %xmm4
> > 20: 76 16 jbe 38 <_trunc+0x38>
> > 22: f2 48 0f 2c c0 cvttsd2si %xmm0, %rax
> > 27: 66 0f ef c0 pxor %xmm0, %xmm0
> > 2b: 66 0f 55 d1 andnpd %xmm1, %xmm2
> > 2f: f2 48 0f 2a c0 cvtsi2sd %rax, %xmm0
> > 34: 66 0f 56 c2 orpd %xmm2, %xmm0
> > 38: c3 retq
> >
> > .rdata
> > .align 8
> > 0: 00 00 00 00 .LC0: .quad 0x1.0p52
> > 00 00 30 43
> > 00 00 00 00
> > 00 00 00 00
> > .align 16
> > 10: ff ff ff ff .LC1: .quad ~(-0.0)
> > ff ff ff 7f
> > 18: 00 00 00 00 .quad 0.0
> > 00 00 00 00
> > .end
> >
> > JFTR: in the best case, the memory accesses cost several cycles,
> > while in the worst case they yield a page fault!
> >
> >
> > Properly optimized, shorter and faster code, using but only 9 instructions
> > in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
> > and saving at least 16 + 32 bytes, follows:
> >
> > .intel_syntax
> > .text
> > 0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument)
> > 5: 48 f7 d8 neg rax
> > # jz .L0 # argument zero?
> > 8: 70 16 jo .L0 # argument indefinite?
> > # argument overflows 64-bit integer?
> > a: 48 f7 d8 neg rax
> > d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
> > 12: 66 0f 73 d0 3f psrlq xmm0, 63
> > 17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> > 1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> > 20: c3 .L0: ret
> > .end
>
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer. So
> using your code may require some option (-fast-math comes to mind), or
> you need at least a check on the exponent before cvttsd2si.
>
> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).
>
> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
>
> Apart from this, in your code, after cvttsd2si I'd rather use:
> mov rcx,rax # make a second copy to a scratch register
> neg rcx
> jo .L0
> cvtsi2sd xmm1,rax
>
> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
>
> With your patch:
>
> cvttsd2si-->neg-?->neg-->cvtsi2sd
>
> where the ? means that the following instructions are speculated.
>
> With an auxiliary register there are two dependency chains:
>
> cvttsd2si-?->cvtsi2sd
> |->mov->neg->jump
>
> Actually some OoO cores just eliminate register copies using register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.
Btw, the code to emit these sequences is in
gcc/config/i386/i386-expand.c:ix86_expand_trunc
and friends.
Richard.
> Regards,
> Gabriel
>
>
>
> >
> > regards
> > Stefan
>
>
* Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1
2021-08-05 9:42 ` Gabriel Paubert
2021-08-05 10:19 ` Richard Biener
@ 2021-08-05 11:58 ` Stefan Kanthak
2021-08-05 13:59 ` Gabriel Paubert
2021-08-05 13:18 ` Gabriel Ravier
2 siblings, 1 reply; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-05 11:58 UTC (permalink / raw)
To: Gabriel Paubert; +Cc: gcc
Gabriel Paubert <paubert@iram.es> wrote:
> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>> Hi,
>>
>> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
>> following code (13 instructions using 57 bytes, plus 4 quadwords
>> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
>>
>> .text
>> 0: f2 0f 10 15 10 00 00 00 movsd .LC1(%rip), %xmm2
>> 4: R_X86_64_PC32 .rdata
>> 8: f2 0f 10 25 00 00 00 00 movsd .LC0(%rip), %xmm4
>> c: R_X86_64_PC32 .rdata
>> 10: 66 0f 28 d8 movapd %xmm0, %xmm3
>> 14: 66 0f 28 c8 movapd %xmm0, %xmm1
>> 18: 66 0f 54 da andpd %xmm2, %xmm3
>> 1c: 66 0f 2e e3 ucomisd %xmm3, %xmm4
>> 20: 76 16 jbe 38 <_trunc+0x38>
>> 22: f2 48 0f 2c c0 cvttsd2si %xmm0, %rax
>> 27: 66 0f ef c0 pxor %xmm0, %xmm0
>> 2b: 66 0f 55 d1 andnpd %xmm1, %xmm2
>> 2f: f2 48 0f 2a c0 cvtsi2sd %rax, %xmm0
>> 34: 66 0f 56 c2 orpd %xmm2, %xmm0
>> 38: c3 retq
>>
>> .rdata
>> .align 8
>> 0: 00 00 00 00 .LC0: .quad 0x1.0p52
>> 00 00 30 43
>> 00 00 00 00
>> 00 00 00 00
>> .align 16
>> 10: ff ff ff ff .LC1: .quad ~(-0.0)
>> ff ff ff 7f
>> 18: 00 00 00 00 .quad 0.0
>> 00 00 00 00
>> .end
>>
>> JFTR: in the best case, the memory accesses cost several cycles,
>> while in the worst case they yield a page fault!
>>
>>
>> Properly optimized, shorter and faster code, using but only 9 instructions
>> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
>> and saving at least 16 + 32 bytes, follows:
>>
>> .intel_syntax
>> .text
>> 0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument)
>> 5: 48 f7 d8 neg rax
>> # jz .L0 # argument zero?
>> 8: 70 16 jo .L0 # argument indefinite?
>> # argument overflows 64-bit integer?
>> a: 48 f7 d8 neg rax
>> d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
>> 12: 66 0f 73 d0 3f psrlq xmm0, 63
>> 17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> 1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
>> 20: c3 .L0: ret
>> .end
>
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer.
Right, I overlooked this fault. Thanks for pointing it out.
> So using your code may require some option (-fast-math comes to mind),
> or you need at least a check on the exponent before cvttsd2si.
The whole idea behind these implementations is to get rid of loading
floating-point constants to perform comparisons.
> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).
Preserving the sign of -0.0 is explicitly specified in the standard,
and is cheap, as shown in my code.
> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
>
> Apart from this, in your code, after cvttsd2si I'd rather use:
> mov rcx,rax # make a second copy to a scratch register
> neg rcx
> jo .L0
> cvtsi2sd xmm1,rax
I don't know how GCC generates the code for builtins, nor what kind of
templates it uses; the second goal was to minimize register usage.
> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
>
> With your patch:
>
> cvttsd2si-->neg-?->neg-->cvtsi2sd
>
> where the ? means that the following instructions are speculated.
>
> With an auxiliary register there are two dependency chains:
>
> cvttsd2si-?->cvtsi2sd
> |->mov->neg->jump
Correct; see above: I expect the template(s) for builtins to give
the register allocator some freedom to split code paths and resolve
dependency chains.
> Actually some OoO cores just eliminate register copies using register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.
Right, the conversions dominate both the original and the code I posted.
It's easy to get rid of them, with still slightly shorter and faster
branchless code (17 instructions, 84 bytes, instead of 13 instructions,
57 + 32 = 89 bytes):
.code64
.intel_syntax noprefix
.text
0: 48 b8 00 00 00 00 00 00 30 43 mov rax, 0x4330000000000000
a: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = 0x1.0p52 = 4503599627370496.0
f: 48 b8 00 00 00 00 00 00 f0 3f mov rax, 0x3FF0000000000000
19: f2 0f 10 c8 movsd xmm1, xmm0 # xmm1 = argument
1d: 66 0f 73 f0 01 psllq xmm0, 1
22: 66 0f 73 d0 01 psrlq xmm0, 1 # xmm0 = |argument|
27: 66 0f 73 d1 3f psrlq xmm1, 63
2c: 66 0f 73 f1 3f psllq xmm1, 63 # xmm1 = (argument & -0.0) ? -0.0 : +0.0
31: f2 0f 10 d8 movsd xmm3, xmm0
35: f2 0f 58 c2 addsd xmm0, xmm2 # xmm0 = |argument| + 0x1.0p52
39: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = |argument| - 0x1.0p52
# = rint(|argument|)
3d: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = -0x1.0p0 = -1.0
42: f2 0f c2 d8 01 cmpltsd xmm3, xmm0 # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
47: 66 0f 54 d3 andpd xmm2, xmm3 # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
4b: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = rint(|argument|)
# - (|argument| < rint(|argument|)) ? 1.0 : 0.0
# = trunc(|argument|)
4f: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
53: c3 ret
.end
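A C model of this branchless variant follows (my sketch, with a
hypothetical name; it assumes the default round-to-nearest mode, and
the explicit |x| < 2^52 guard stands in for the observation that
doubles of larger magnitude are already integral):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Branchless trunc for |x| < 2^52 via the classic 2^52 add/sub trick,
   mirroring the asm: r = (|x| + 2^52) - 2^52 rounds |x| to an integer
   in the current (round-to-nearest) mode; subtract 1.0 when it rounded
   up, then restore the sign bit. */
static double trunc_twop52(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint64_t sign = bits & 0x8000000000000000ull;  /* psrlq/psllq by 63 */

    double ax = fabs(x);                           /* psllq/psrlq by 1 */
    if (!(ax < 0x1.0p52))     /* large values (and NaN) stay unchanged */
        return x;

    double r = (ax + 0x1.0p52) - 0x1.0p52;         /* rint(|x|) */
    r -= (ax < r) ? 1.0 : 0.0;                     /* cmpltsd/andpd/subsd */

    memcpy(&bits, &r, sizeof bits);
    bits |= sign;                                  /* orpd */
    memcpy(&r, &bits, sizeof bits);
    return r;
}
```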
regards
Stefan
* Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1
2021-08-05 9:42 ` Gabriel Paubert
2021-08-05 10:19 ` Richard Biener
2021-08-05 11:58 ` Stefan Kanthak
@ 2021-08-05 13:18 ` Gabriel Ravier
2 siblings, 0 replies; 19+ messages in thread
From: Gabriel Ravier @ 2021-08-05 13:18 UTC (permalink / raw)
To: Gabriel Paubert, Stefan Kanthak; +Cc: gcc
On 8/5/21 11:42 AM, Gabriel Paubert wrote:
> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>> Hi,
>>
>> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
>> following code (13 instructions using 57 bytes, plus 4 quadwords
>> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
>>
>> .text
>> 0: f2 0f 10 15 10 00 00 00 movsd .LC1(%rip), %xmm2
>> 4: R_X86_64_PC32 .rdata
>> 8: f2 0f 10 25 00 00 00 00 movsd .LC0(%rip), %xmm4
>> c: R_X86_64_PC32 .rdata
>> 10: 66 0f 28 d8 movapd %xmm0, %xmm3
>> 14: 66 0f 28 c8 movapd %xmm0, %xmm1
>> 18: 66 0f 54 da andpd %xmm2, %xmm3
>> 1c: 66 0f 2e e3 ucomisd %xmm3, %xmm4
>> 20: 76 16 jbe 38 <_trunc+0x38>
>> 22: f2 48 0f 2c c0 cvttsd2si %xmm0, %rax
>> 27: 66 0f ef c0 pxor %xmm0, %xmm0
>> 2b: 66 0f 55 d1 andnpd %xmm1, %xmm2
>> 2f: f2 48 0f 2a c0 cvtsi2sd %rax, %xmm0
>> 34: 66 0f 56 c2 orpd %xmm2, %xmm0
>> 38: c3 retq
>>
>> .rdata
>> .align 8
>> 0: 00 00 00 00 .LC0: .quad 0x1.0p52
>> 00 00 30 43
>> 00 00 00 00
>> 00 00 00 00
>> .align 16
>> 10: ff ff ff ff .LC1: .quad ~(-0.0)
>> ff ff ff 7f
>> 18: 00 00 00 00 .quad 0.0
>> 00 00 00 00
>> .end
>>
>> JFTR: in the best case, the memory accesses cost several cycles,
>> while in the worst case they yield a page fault!
>>
>>
>> Properly optimized, shorter and faster code, using but only 9 instructions
>> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
>> and saving at least 16 + 32 bytes, follows:
>>
>> .intel_syntax
>> .text
>> 0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument)
>> 5: 48 f7 d8 neg rax
>> # jz .L0 # argument zero?
>> 8: 70 16 jo .L0 # argument indefinite?
>> # argument overflows 64-bit integer?
>> a: 48 f7 d8 neg rax
>> d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
>> 12: 66 0f 73 d0 3f psrlq xmm0, 63
>> 17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> 1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
>> 20: c3 .L0: ret
>> .end
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer. So
> using your code may require some option (-fast-math comes to mind), or
> you need at least a check on the exponent before cvttsd2si.
>
> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).
`-fno-signed-zeros` does that, if you need it
>
> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
>
> Apart from this, in your code, after cvttsd2si I'd rather use:
> mov rcx,rax # make a second copy to a scratch register
> neg rcx
> jo .L0
> cvtsi2sd xmm1,rax
>
> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
>
> With your patch:
>
> cvttsd2si-->neg-?->neg-->cvtsi2sd
>
> where the ? means that the following instructions are speculated.
>
> With an auxiliary register there are two dependency chains:
>
> cvttsd2si-?->cvtsi2sd
> |->mov->neg->jump
>
> Actually some OoO cores just eliminate register copies using register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.
>
> Regards,
> Gabriel
>
>
>
>> regards
>> Stefan
>
>
--
_________________________
Gabriel RAVIER
First year student at Epitech
+33 6 36 46 16 43
gabriel.ravier@epitech.eu
11 Quai Finkwiller
67000 STRASBOURG
* Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1
2021-08-05 11:58 ` Stefan Kanthak
@ 2021-08-05 13:59 ` Gabriel Paubert
2021-08-06 12:43 ` Stefan Kanthak
0 siblings, 1 reply; 19+ messages in thread
From: Gabriel Paubert @ 2021-08-05 13:59 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: gcc
Hi,
On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert <paubert@iram.es> wrote:
>
>
> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> >> Hi,
> >>
> >> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> >> following code (13 instructions using 57 bytes, plus 4 quadwords
> >> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> >>
> >> .text
> >> 0: f2 0f 10 15 10 00 00 00 movsd .LC1(%rip), %xmm2
> >> 4: R_X86_64_PC32 .rdata
> >> 8: f2 0f 10 25 00 00 00 00 movsd .LC0(%rip), %xmm4
> >> c: R_X86_64_PC32 .rdata
> >> 10: 66 0f 28 d8 movapd %xmm0, %xmm3
> >> 14: 66 0f 28 c8 movapd %xmm0, %xmm1
> >> 18: 66 0f 54 da andpd %xmm2, %xmm3
> >> 1c: 66 0f 2e e3 ucomisd %xmm3, %xmm4
> >> 20: 76 16 jbe 38 <_trunc+0x38>
> >> 22: f2 48 0f 2c c0 cvttsd2si %xmm0, %rax
> >> 27: 66 0f ef c0 pxor %xmm0, %xmm0
> >> 2b: 66 0f 55 d1 andnpd %xmm1, %xmm2
> >> 2f: f2 48 0f 2a c0 cvtsi2sd %rax, %xmm0
> >> 34: 66 0f 56 c2 orpd %xmm2, %xmm0
> >> 38: c3 retq
> >>
> >> .rdata
> >> .align 8
> >> 0: 00 00 00 00 .LC0: .quad 0x1.0p52
> >> 00 00 30 43
> >> 00 00 00 00
> >> 00 00 00 00
> >> .align 16
> >> 10: ff ff ff ff .LC1: .quad ~(-0.0)
> >> ff ff ff 7f
> >> 18: 00 00 00 00 .quad 0.0
> >> 00 00 00 00
> >> .end
> >>
> >> JFTR: in the best case, the memory accesses cost several cycles,
> >> while in the worst case they yield a page fault!
> >>
> >>
> >> Properly optimized, shorter and faster code, using but only 9 instructions
> >> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
> >> and saving at least 16 + 32 bytes, follows:
> >>
> >> .intel_syntax
> >> .text
> >> 0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument)
> >> 5: 48 f7 d8 neg rax
> >> # jz .L0 # argument zero?
> >> 8: 70 16 jo .L0 # argument indefinite?
> >> # argument overflows 64-bit integer?
> >> a: 48 f7 d8 neg rax
> >> d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
> >> 12: 66 0f 73 d0 3f psrlq xmm0, 63
> >> 17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> 1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> >> 20: c3 .L0: ret
> >> .end
> >
> > There is one important difference, namely setting the invalid exception
> > flag when the parameter can't be represented in a signed integer.
>
> Right, I overlooked this fault. Thanks for pointing out.
>
> > So using your code may require some option (-fast-math comes to mind),
> > or you need at least a check on the exponent before cvttsd2si.
>
> The whole idea behind these implementations is to get rid of loading
> floating-point constants to perform comparisions.
Indeed, but what I had in mind was something along the following lines:
movq rax,xmm0 # and copy rax to say rcx, if needed later
shrq rax,52 # move sign and exponent to 12 LSBs
andl eax,0x7ff # mask the sign
cmpl eax,0x434 # value to be checked
ja return # exponent too large, we're done (what about NaNs?)
cvttsd2si rax,xmm0 # safe after exponent check
cvtsi2sd xmm0,rax # conversion done
and a bit more to handle the corner cases (essentially preserve the
sign to be correct between -1 and -0.0). But the CPU can (speculatively)
start the conversions early, so the dependency chain is rather short.
I don't know if it's faster than your new code, I'm almost sure that
it's shorter. Your new code also has a fairly long dependency chain.
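That exponent-check idea can be rendered in C roughly as follows (my
sketch with a hypothetical name; I use the slightly tighter bound
1023 + 52, above which every finite double is already integral, and
Inf/NaN fall into the same early-return branch):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Model of the exponent pre-check: extract the biased exponent from
   the raw bits; if it is >= 1023 + 52 the value is already integral
   (or Inf/NaN), so return it unchanged and never raise a spurious
   invalid exception in cvttsd2si. */
static double trunc_expcheck(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    unsigned exp = (unsigned)(bits >> 52) & 0x7ffu;  /* shr 52; and 0x7ff */

    if (exp >= 1023 + 52)             /* cmp; ja return */
        return x;                     /* integral, Inf, or NaN */

    double t = (double)(int64_t)x;    /* cvttsd2si + cvtsi2sd, now safe */

    /* restore the sign bit so trunc(-0.5) == -0.0 */
    uint64_t tb;
    memcpy(&tb, &t, sizeof tb);
    tb |= bits & 0x8000000000000000ull;
    memcpy(&t, &tb, sizeof tb);
    return t;
}
```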
>
> > The last part of your code then goes to take into account the special
> > case of -0.0, which I most often don't care about (I'd like to have a
> > -fdont-split-hairs-about-the-sign-of-zero option).
>
> Preserving the sign of -0.0 is explicitly specified in the standard,
> and is cheap, as shown in my code.
>
> > Potentially generating spurious invalid operation and then carefully
> > taking into account the sign of zero does not seem very consistent.
> >
> > Apart from this, in your code, after cvttsd2si I'd rather use:
> > mov rcx,rax # make a second copy to a scratch register
> > neg rcx
> > jo .L0
> > cvtsi2sd xmm1,rax
>
> I don't know how GCC generates the code for builtins, and what kind of
> templates it uses: the second goal was to minimize register usage.
>
Ok, but on 64 bit using two GPRs would still be reasonable.
> > The reason is latency, in an OoO engine, splitting the two paths is
> > almost always a win.
> >
> > With your patch:
> >
> > cvttsd2si-->neg-?->neg-->cvtsi2sd
> >
> > where the ? means that the following instructions are speculated.
> >
> > With an auxiliary register there are two dependency chains:
> >
> > cvttsd2si-?->cvtsi2sd
> > |->mov->neg->jump
>
> Correct; see above: I expect the template(s) for builtins to give
> the register allocator some freedom to split code paths and resolve
> dependency chains.
>
> > Actually some OoO cores just eliminate register copies using register
> > renaming mechanism. But even this is probably completely irrelevant in
> > this case where the latency is dominated by the two conversion
> > instructions.
>
> Right, the conversions dominate both the original and the code I posted.
> It's easy to get rid of them, with still slightly shorter and faster
> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
> 57 + 32 = 89 bytes):
>
> .code64
> .intel_syntax noprefix
> .text
> 0: 48 b8 00 00 00 00 00 00 30 43 mov rax, 0x4330000000000000
> a: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = 0x1.0p52 = 4503599627370496.0
> f: 48 b8 00 00 00 00 00 00 f0 3f mov rax, 0x3FF0000000000000
> 19: f2 0f 10 c8 movsd xmm1, xmm0 # xmm1 = argument
> 1d: 66 0f 73 f0 01 psllq xmm0, 1
> 22: 66 0f 73 d0 01 psrlq xmm0, 1 # xmm0 = |argument|
> 27: 66 0f 73 d1 3f psrlq xmm1, 63
> 2c: 66 0f 73 f1 3f psllq xmm1, 63 # xmm1 = (argument & -0.0) ? -0.0 : +0.0
> 31: f2 0f 10 d8 movsd xmm3, xmm0
> 35: f2 0f 58 c2 addsd xmm0, xmm2 # xmm0 = |argument| + 0x1.0p52
> 39: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = |argument| - 0x1.0p52
> # = rint(|argument|)
> 3d: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = -0x1.0p0 = -1.0
Huh? I see +1.0, -1 would be 0xBFF0000000000000.
> 42: f2 0f c2 d8 01 cmpltsd xmm3, xmm0 # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
> 47: 66 0f 54 d3 andpd xmm2, xmm3 # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
> 4b: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = rint(|argument|)
> # - (|argument| < rint(|argument|)) ? 1.0 : 0.0
> # = trunc(|argument|)
> 4f: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> 53: c3 ret
> .end
>
> regards
> Stefan
Regards,
Gabriel
* Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1
2021-08-05 13:59 ` Gabriel Paubert
@ 2021-08-06 12:43 ` Stefan Kanthak
2021-08-06 12:59 ` Richard Biener
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-06 12:43 UTC (permalink / raw)
To: Gabriel Paubert; +Cc: gcc
Gabriel Paubert <paubert@iram.es> wrote:
> Hi,
>
> On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
>> Gabriel Paubert <paubert@iram.es> wrote:
>>
>>
>> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>> >> .intel_syntax
>> >> .text
>> >> 0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument)
>> >> 5: 48 f7 d8 neg rax
>> >> # jz .L0 # argument zero?
>> >> 8: 70 16 jo .L0 # argument indefinite?
>> >> # argument overflows 64-bit integer?
>> >> a: 48 f7 d8 neg rax
>> >> d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
>> >> 12: 66 0f 73 d0 3f psrlq xmm0, 63
>> >> 17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> >> 1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
>> >> 20: c3 .L0: ret
>> >> .end
>> >
>> > There is one important difference, namely setting the invalid exception
>> > flag when the parameter can't be represented in a signed integer.
>>
>> Right, I overlooked this fault. Thanks for pointing out.
>>
>> > So using your code may require some option (-fast-math comes to mind),
>> > or you need at least a check on the exponent before cvttsd2si.
>>
>> The whole idea behind these implementations is to get rid of loading
>> floating-point constants to perform comparisions.
>
> Indeed, but what I had in mind was something along the following lines:
>
> movq rax,xmm0 # and copy rax to say rcx, if needed later
> shrq rax,52 # move sign and exponent to 12 LSBs
> andl eax,0x7ff # mask the sign
> cmpl eax,0x434 # value to be checked
> ja return # exponent too large, we're done (what about NaNs?)
> cvttsd2si rax,xmm0 # safe after exponent check
> cvtsi2sd xmm0,rax # conversion done
>
> and a bit more to handle the corner cases (essentially preserve the
> sign to be correct between -1 and -0.0).
The sign of -0.0 is the only corner case and already handled in my code.
Both SNaN and QNaN (which have exponent 0x7ff) are handled and
preserved, both in the code GCC generates and in my code.
> But the CPU can (speculatively) start the conversions early, so the
> dependency chain is rather short.
Correct.
> I don't know if it's faster than your new code,
It should be faster.
> I'm almost sure that it's shorter.
"neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
5+4+5+5+2=21 bytes.
JFTR: better use "add rax,rax; shr rax,53" instead of
"shr rax,52; and eax,0x7ff" and save 2 bytes.
Complete properly optimized code for __builtin_trunc is then as follows
(11 instructions, 44 bytes):
.code64
.intel_syntax
.equ BIAS, 1023
.text
movq rax, xmm0 # rax = argument
add rax, rax
shr rax, 53 # rax = exponent of |argument|
cmp eax, BIAS + 53
jae .L0 # argument indefinite?
# |argument| >= 0x1.0p53?
cvttsd2si rax, xmm0 # rax = trunc(argument)
cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
psrlq xmm0, 63
psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
orpd xmm0, xmm1 # xmm0 = trunc(argument)
.L0: ret
.end
@Richard Biener (et al.):
1. Is a primitive for "floating-point > 2**x", which generates such
an "integer" code sequence, already available, at least for
float/binary32 and double/binary64?
2. the procedural code generator for __builtin_trunc() etc. uses
__builtin_fabs() and __builtin_copysign() as building blocks.
These would need to (and of course should) be modified to generate
psllq/psrlq pairs instead of andpd/andnpd referencing a memory
location with either -0.0 or ~(-0.0).
For -ffast-math, where the sign of -0.0 is not handled and the spurious
invalid floating-point exception for |argument| >= 2**63 is acceptable,
it boils down to:
.code64
.intel_syntax
.equ BIAS, 1023
.text
cvttsd2si rax, xmm0 # rax = trunc(argument)
jo .L0 # argument indefinite?
# |argument| > 0x1.0p63?
cvtsi2sd xmm0, rax # xmm0 = trunc(argument)
.L0: ret
.end
[...]
>> Right, the conversions dominate both the original and the code I posted.
>> It's easy to get rid of them, with still slightly shorter and faster
>> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
>> 57 + 32 = 89 bytes):
>>
>> .code64
>> .intel_syntax noprefix
>> .text
>> 0: 48 b8 00 00 00 00 00 00 30 43 mov rax, 0x4330000000000000
>> a: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = 0x1.0p52 = 4503599627370496.0
>> f: 48 b8 00 00 00 00 00 00 f0 3f mov rax, 0x3FF0000000000000
>> 19: f2 0f 10 c8 movsd xmm1, xmm0 # xmm1 = argument
>> 1d: 66 0f 73 f0 01 psllq xmm0, 1
>> 22: 66 0f 73 d0 01 psrlq xmm0, 1 # xmm0 = |argument|
>> 27: 66 0f 73 d1 3f psrlq xmm1, 63
>> 2c: 66 0f 73 f1 3f psllq xmm1, 63 # xmm1 = (argument & -0.0) ? -0.0 : +0.0
>> 31: f2 0f 10 d8 movsd xmm3, xmm0
>> 35: f2 0f 58 c2 addsd xmm0, xmm2 # xmm0 = |argument| + 0x1.0p52
>> 39: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = |argument| - 0x1.0p52
>> # = rint(|argument|)
>> 3d: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = -0x1.0p0 = -1.0
>
> Huh? I see +1.0, -1 would be 0xBFF0000000000000.
Spurious error in the comment.
I modified code which uses -1.0 and performs (a commutative) "addsd xmm2, xmm2"
instead of "subsd xmm0, xmm2" to save a "movsd" instruction.
>> 42: f2 0f c2 d8 01 cmpltsd xmm3, xmm0 # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
>> 47: 66 0f 54 d3 andpd xmm2, xmm3 # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
>> 4b: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = rint(|argument|)
>> # - (|argument| < rint(|argument|)) ? 1.0 : 0.0
>> # = trunc(|argument|)
>> 4f: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
>> 53: c3 ret
regards
Stefan
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 12:43 ` Stefan Kanthak
@ 2021-08-06 12:59 ` Richard Biener
2021-08-06 13:20 ` Gabriel Paubert
2021-08-06 13:31 ` Michael Matz
2 siblings, 0 replies; 19+ messages in thread
From: Richard Biener @ 2021-08-06 12:59 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: Gabriel Paubert, GCC Development
On Fri, Aug 6, 2021 at 2:47 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>
> Gabriel Paubert <paubert@iram.es> wrote:
>
> > Hi,
> >
> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> >> Gabriel Paubert <paubert@iram.es> wrote:
> >>
> >>
> >> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>
> >> >> .intel_syntax
> >> >> .text
> >> >> 0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument)
> >> >> 5: 48 f7 d8 neg rax
> >> >> # jz .L0 # argument zero?
> >> >> 8: 70 16 jo .L0 # argument indefinite?
> >> >> # argument overflows 64-bit integer?
> >> >> a: 48 f7 d8 neg rax
> >> >> d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
> >> >> 12: 66 0f 73 d0 3f psrlq xmm0, 63
> >> >> 17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> >> 1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> >> >> 20: c3 .L0: ret
> >> >> .end
> >> >
> >> > There is one important difference, namely setting the invalid exception
> >> > flag when the parameter can't be represented in a signed integer.
> >>
> >> Right, I overlooked this fault. Thanks for pointing that out.
> >>
> > So using your code may require some option (-ffast-math comes to mind),
> >> > or you need at least a check on the exponent before cvttsd2si.
> >>
> >> The whole idea behind these implementations is to get rid of loading
> >> floating-point constants to perform comparisons.
> >
> > Indeed, but what I had in mind was something along the following lines:
> >
> > movq rax,xmm0 # and copy rax to say rcx, if needed later
> > shrq rax,52 # move sign and exponent to 12 LSBs
> > andl eax,0x7ff # mask the sign
> > cmpl eax,0x434 # value to be checked
> > ja return # exponent too large, we're done (what about NaNs?)
> > cvttsd2si rax,xmm0 # safe after exponent check
> > cvtsi2sd xmm0,rax # conversion done
> >
> > and a bit more to handle the corner cases (essentially preserve the
> > sign to be correct between -1 and -0.0).
>
> The sign of -0.0 is the only corner case and already handled in my code.
> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
> preserved, as in the code GCC generates as well as my code.
>
> > But the CPU can (speculatively) start the conversions early, so the
> > dependency chain is rather short.
>
> Correct.
>
> > I don't know if it's faster than your new code,
>
> It should be faster.
>
> > I'm almost sure that it's shorter.
>
> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, while the above sequence
> needs 5+4+5+5+2=21 bytes.
>
> JFTR: better use "add rax,rax; shr rax,53" instead of
> "shr rax,52; and eax,0x7ff" and save 2 bytes.
>
> Complete properly optimized code for __builtin_trunc is then as follows
> (11 instructions, 44 bytes):
>
> .code64
> .intel_syntax
> .equ BIAS, 1023
> .text
> movq rax, xmm0 # rax = argument
> add rax, rax
> shr rax, 53 # rax = exponent of |argument|
> cmp eax, BIAS + 53
> jae .Lexit # argument indefinite?
> # |argument| >= 0x1.0p53?
> cvttsd2si rax, xmm0 # rax = trunc(argument)
> cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
> psrlq xmm0, 63
> psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> orpd xmm0, xmm1 # xmm0 = trunc(argument)
> .L0: ret
> .end
>
> @Richard Biener (et al.):
>
> 1. Is a primitive for "floating-point > 2**x", which generates such
> an "integer" code sequence, already available, at least for
> float/binary32 and double/binary64?
Not that I know, but it should be possible to craft that.
> 2. the procedural code generator for __builtin_trunc() etc. uses
> __builtin_fabs() and __builtin_copysign() as building blocks.
> These would need to (and of course should) be modified to generate
> psllq/psrlq pairs instead of andpd/andnpd referencing a memory
> location with either -0.0 or ~(-0.0).
>
> For -ffast-math, where the sign of -0.0 is not handled and the spurious
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> it boils down to:
>
> .code64
> .intel_syntax
> .equ BIAS, 1023
> .text
> cvttsd2si rax, xmm0 # rax = trunc(argument)
> jo .Lexit # argument indefinite?
> # |argument| > 0x1.0p63?
> cvtsi2sd xmm0, rax # xmm0 = trunc(argument)
> .L0: ret
> .end
>
> [...]
>
> >> Right, the conversions dominate both the original and the code I posted.
> >> It's easy to get rid of them, with still slightly shorter and faster
> >> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
> >> 57 + 32 = 89 bytes):
> >>
> >> .code64
> >> .intel_syntax noprefix
> >> .text
> >> 0: 48 b8 00 00 00 00 00 00 30 43 mov rax, 0x4330000000000000
> >> a: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = 0x1.0p52 = 4503599627370496.0
> >> f: 48 b8 00 00 00 00 00 00 f0 3f mov rax, 0x3FF0000000000000
> >> 19: f2 0f 10 c8 movsd xmm1, xmm0 # xmm1 = argument
> >> 1d: 66 0f 73 f0 01 psllq xmm0, 1
> >> 22: 66 0f 73 d0 01 psrlq xmm0, 1 # xmm0 = |argument|
> >> 27: 66 0f 73 d1 3f psrlq xmm1, 63
> >> 2c: 66 0f 73 f1 3f psllq xmm1, 63 # xmm1 = (argument & -0.0) ? -0.0 : +0.0
> >> 31: f2 0f 10 d8 movsd xmm3, xmm0
> >> 35: f2 0f 58 c2 addsd xmm0, xmm2 # xmm0 = |argument| + 0x1.0p52
> >> 39: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = |argument| - 0x1.0p52
> >> # = rint(|argument|)
> >> 3d: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = -0x1.0p0 = -1.0
> >
> > Huh? I see +1.0, -1 would be 0xBFF0000000000000.
>
> Spurious error in the comment.
> I modified code which uses -1.0 and performs (a commutative) "addsd xmm2, xmm2"
> instead of "subsd xmm0, xmm2" to save a "movsd" instruction.
>
> >> 42: f2 0f c2 d8 01 cmpltsd xmm3, xmm0 # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
> >> 47: 66 0f 54 d3 andpd xmm2, xmm3 # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >> 4b: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = rint(|argument|)
> >> # - (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >> # = trunc(|argument|)
> >> 4f: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> >> 53: c3 ret
>
> regards
> Stefan
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 12:43 ` Stefan Kanthak
2021-08-06 12:59 ` Richard Biener
@ 2021-08-06 13:20 ` Gabriel Paubert
2021-08-06 14:37 ` Stefan Kanthak
2021-08-06 13:31 ` Michael Matz
2 siblings, 1 reply; 19+ messages in thread
From: Gabriel Paubert @ 2021-08-06 13:20 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: gcc
On Fri, Aug 06, 2021 at 02:43:34PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert <paubert@iram.es> wrote:
>
> > Hi,
> >
> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> >> Gabriel Paubert <paubert@iram.es> wrote:
> >>
> >>
> >> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>
> >> >> .intel_syntax
> >> >> .text
> >> >> 0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument)
> >> >> 5: 48 f7 d8 neg rax
> >> >> # jz .L0 # argument zero?
> >> >> 8: 70 16 jo .L0 # argument indefinite?
> >> >> # argument overflows 64-bit integer?
> >> >> a: 48 f7 d8 neg rax
> >> >> d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
> >> >> 12: 66 0f 73 d0 3f psrlq xmm0, 63
> >> >> 17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> >> 1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> >> >> 20: c3 .L0: ret
> >> >> .end
> >> >
> >> > There is one important difference, namely setting the invalid exception
> >> > flag when the parameter can't be represented in a signed integer.
> >>
> >> Right, I overlooked this fault. Thanks for pointing that out.
> >>
> > So using your code may require some option (-ffast-math comes to mind),
> >> > or you need at least a check on the exponent before cvttsd2si.
> >>
> >> The whole idea behind these implementations is to get rid of loading
> >> floating-point constants to perform comparisons.
> >
> > Indeed, but what I had in mind was something along the following lines:
> >
> > movq rax,xmm0 # and copy rax to say rcx, if needed later
> > shrq rax,52 # move sign and exponent to 12 LSBs
> > andl eax,0x7ff # mask the sign
> > cmpl eax,0x434 # value to be checked
> > ja return # exponent too large, we're done (what about NaNs?)
> > cvttsd2si rax,xmm0 # safe after exponent check
> > cvtsi2sd xmm0,rax # conversion done
> >
> > and a bit more to handle the corner cases (essentially preserve the
> > sign to be correct between -1 and -0.0).
>
> The sign of -0.0 is the only corner case and already handled in my code.
> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
> preserved, as in the code GCC generates as well as my code.
I don't know what the standard says about NaNs in this case, I seem to
remember that arithmetic instructions typically produce QNaN when one of
the inputs is a NaN, whether signaling or not.
>
> > But the CPU can (speculatively) start the conversions early, so the
> > dependency chain is rather short.
>
> Correct.
>
> > I don't know if it's faster than your new code,
>
> It should be faster.
>
> > I'm almost sure that it's shorter.
>
> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, while the above sequence
> needs 5+4+5+5+2=21 bytes.
>
> JFTR: better use "add rax,rax; shr rax,53" instead of
> "shr rax,52; and eax,0x7ff" and save 2 bytes.
Indeed, I don't have the exact size of instructions in my head,
especially since I've not written x86 assembly since the mid 90s.
In any case, with your last improvement, the code is now down to a
single 32 bit immediate constant. And I don't see how to eliminate it...
>
> Complete properly optimized code for __builtin_trunc is then as follows
> (11 instructions, 44 bytes):
>
> .code64
> .intel_syntax
> .equ BIAS, 1023
> .text
> movq rax, xmm0 # rax = argument
> add rax, rax
> shr rax, 53 # rax = exponent of |argument|
> cmp eax, BIAS + 53
> jae .Lexit # argument indefinite?
Maybe s/.Lexit/.L0/
> # |argument| >= 0x1.0p53?
> cvttsd2si rax, xmm0 # rax = trunc(argument)
> cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
> psrlq xmm0, 63
> psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> orpd xmm0, xmm1 # xmm0 = trunc(argument)
> .L0: ret
> .end
>
This looks nice.
> @Richard Biener (et al.):
>
> 1. Is a primitive for "floating-point > 2**x", which generates such
> an "integer" code sequence, already available, at least for
> float/binary32 and double/binary64?
>
> 2. the procedural code generator for __builtin_trunc() etc. uses
> __builtin_fabs() and __builtin_copysign() as building blocks.
> These would need to (and of course should) be modified to generate
> psllq/psrlq pairs instead of andpd/andnpd referencing a memory
> location with either -0.0 or ~(-0.0).
>
> For -ffast-math, where the sign of -0.0 is not handled and the spurious
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> it boils down to:
>
> .code64
> .intel_syntax
> .equ BIAS, 1023
> .text
> cvttsd2si rax, xmm0 # rax = trunc(argument)
> jo .Lexit # argument indefinite?
> # |argument| > 0x1.0p63?
> cvtsi2sd xmm0, rax # xmm0 = trunc(argument)
> .L0: ret
> .end
>
> [...]
>
> >> Right, the conversions dominate both the original and the code I posted.
> >> It's easy to get rid of them, with still slightly shorter and faster
> >> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
> >> 57 + 32 = 89 bytes):
> >>
> >> .code64
> >> .intel_syntax noprefix
> >> .text
> >> 0: 48 b8 00 00 00 00 00 00 30 43 mov rax, 0x4330000000000000
> >> a: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = 0x1.0p52 = 4503599627370496.0
> >> f: 48 b8 00 00 00 00 00 00 f0 3f mov rax, 0x3FF0000000000000
> >> 19: f2 0f 10 c8 movsd xmm1, xmm0 # xmm1 = argument
> >> 1d: 66 0f 73 f0 01 psllq xmm0, 1
> >> 22: 66 0f 73 d0 01 psrlq xmm0, 1 # xmm0 = |argument|
> >> 27: 66 0f 73 d1 3f psrlq xmm1, 63
> >> 2c: 66 0f 73 f1 3f psllq xmm1, 63 # xmm1 = (argument & -0.0) ? -0.0 : +0.0
> >> 31: f2 0f 10 d8 movsd xmm3, xmm0
> >> 35: f2 0f 58 c2 addsd xmm0, xmm2 # xmm0 = |argument| + 0x1.0p52
> >> 39: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = |argument| - 0x1.0p52
> >> # = rint(|argument|)
> >> 3d: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = -0x1.0p0 = -1.0
> >
> > Huh? I see +1.0, -1 would be 0xBFF0000000000000.
>
> Spurious error in the comment.
> I modified code which uses -1.0 and performs (a commutative) "addsd xmm2, xmm2"
> instead of "subsd xmm0, xmm2" to save a "movsd" instruction.
>
> >> 42: f2 0f c2 d8 01 cmpltsd xmm3, xmm0 # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
> >> 47: 66 0f 54 d3 andpd xmm2, xmm3 # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >> 4b: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = rint(|argument|)
> >> # - (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >> # = trunc(|argument|)
> >> 4f: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> >> 53: c3 ret
>
> regards
> Stefan
Regards,
Gabriel
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 12:43 ` Stefan Kanthak
2021-08-06 12:59 ` Richard Biener
2021-08-06 13:20 ` Gabriel Paubert
@ 2021-08-06 13:31 ` Michael Matz
2021-08-06 14:32 ` Stefan Kanthak
2 siblings, 1 reply; 19+ messages in thread
From: Michael Matz @ 2021-08-06 13:31 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: Gabriel Paubert, gcc
Hello,
On Fri, 6 Aug 2021, Stefan Kanthak wrote:
> For -ffast-math, where the sign of -0.0 is not handled and the spurious
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
This claim would need to be proven in the wild. |argument| > 2**52 are
already integer, and shouldn't generate a spurious exception from the
various to-int conversions, not even in fast-math mode for some relevant
set of applications (at least SPECcpu).
Btw, have you made speed measurements with your improvements? The size
improvements are obvious, but speed changes can be fairly unintuitive,
e.g. there were old K8 CPUs where the memory loads for constants are
actually faster than the equivalent sequence of shifting and masking for
the >= compares. That's an irrelevant CPU now, but it shows that
intuition about speed consequences can be wrong.
Ciao,
Michael.
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 13:31 ` Michael Matz
@ 2021-08-06 14:32 ` Stefan Kanthak
2021-08-06 15:04 ` Michael Matz
2021-08-06 15:16 ` Richard Biener
0 siblings, 2 replies; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-06 14:32 UTC (permalink / raw)
To: Michael Matz; +Cc: Gabriel Paubert, gcc
Michael Matz <matz@suse.de> wrote:
> Hello,
>
> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
>
>> For -ffast-math, where the sign of -0.0 is not handled and the spurious
>> invalid floating-point exception for |argument| >= 2**63 is acceptable,
>
> This claim would need to be proven in the wild.
I should have left the "when" after the "and" which I originally had
written...
> |argument| > 2**52 are already integer, and shouldn't generate a spurious
> exception from the various to-int conversions, not even in fast-math mode
> for some relevant set of applications (at least SPECcpu).
>
> Btw, have you made speed measurements with your improvements?
No.
> The size improvements are obvious, but speed changes can be fairly
> unintuitive, e.g. there were old K8 CPUs where the memory loads for
> constants are actually faster than the equivalent sequence of shifting
> and masking for the >= compares. That's an irrelevant CPU now, but it
> shows that intuition about speed consequences can be wrong.
I know. I also know of CPUs that couldn't load a 16-byte wide XMM register
in one go and had to split the load into two 8-byte loads.
If the constant happens to be present in L1 cache, it MAY load as fast
as an immediate.
BUT: on current CPUs, the code GCC generates
movsd .LC1(%rip), %xmm2
movsd .LC0(%rip), %xmm4
movapd %xmm0, %xmm3
movapd %xmm0, %xmm1
andpd %xmm2, %xmm3
ucomisd %xmm3, %xmm4
jbe 38 <_trunc+0x38>
needs
- 4 cycles if the movsd are executed in parallel and the movapd are
handled by the register renamer,
- 5 cycles if the movsd and the movapd are executed in parallel,
- 7 cycles else,
plus an unknown number of cycles if the constants are not in L1.
The proposed
movq rax, xmm0
add rax, rax
shr rax, 53
cmp eax, 53+1023
jae return
needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
register renamer).
Stefan
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 13:20 ` Gabriel Paubert
@ 2021-08-06 14:37 ` Stefan Kanthak
2021-08-06 17:44 ` Joseph Myers
0 siblings, 1 reply; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-06 14:37 UTC (permalink / raw)
To: Gabriel Paubert; +Cc: gcc
Gabriel Paubert <paubert@iram.es> wrote:
> On Fri, Aug 06, 2021 at 02:43:34PM +0200, Stefan Kanthak wrote:
>> Gabriel Paubert <paubert@iram.es> wrote:
>>
>> > Hi,
>> >
>> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
[...]
>> >> The whole idea behind these implementations is to get rid of loading
>> >> floating-point constants to perform comparisons.
>> >
>> > Indeed, but what I had in mind was something along the following lines:
>> >
>> > movq rax,xmm0 # and copy rax to say rcx, if needed later
>> > shrq rax,52 # move sign and exponent to 12 LSBs
>> > andl eax,0x7ff # mask the sign
>> > cmpl eax,0x434 # value to be checked
>> > ja return # exponent too large, we're done (what about NaNs?)
>> > cvttsd2si rax,xmm0 # safe after exponent check
>> > cvtsi2sd xmm0,rax # conversion done
>> >
>> > and a bit more to handle the corner cases (essentially preserve the
>> > sign to be correct between -1 and -0.0).
>>
>> The sign of -0.0 is the only corner case and already handled in my code.
>> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
>> preserved, as in the code GCC generates as well as my code.
>
> I don't know what the standard says about NaNs in this case, I seem to
> remember that arithmetic instructions typically produce QNaN when one of
> the inputs is a NaN, whether signaling or not.
<https://pubs.opengroup.org/onlinepubs/9699919799/functions/trunc.html>
and its cousins as well as the C standard say
| If x is NaN, a NaN shall be returned.
That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.
>> > But the CPU can (speculatively) start the conversions early, so the
>> > dependency chain is rather short.
>>
>> Correct.
>>
>> > I don't know if it's faster than your new code,
>>
>> It should be faster.
>>
>> > I'm almost sure that it's shorter.
>>
>> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, while the above sequence
>> needs 5+4+5+5+2=21 bytes.
>>
>> JFTR: better use "add rax,rax; shr rax,53" instead of
>> "shr rax,52; and eax,0x7ff" and save 2 bytes.
>
> Indeed, I don't have the exact size of instructions in my head,
> especially since I've not written x86 assembly since the mid 90s.
>
> In any case, with your last improvement, the code is now down to a
> single 32 bit immediate constant. And I don't see how to eliminate it...
>
>>
>> Complete properly optimized code for __builtin_trunc is then as follows
>> (11 instructions, 44 bytes):
>>
>> .code64
>> .intel_syntax
>> .equ BIAS, 1023
>> .text
>> movq rax, xmm0 # rax = argument
>> add rax, rax
>> shr rax, 53 # rax = exponent of |argument|
>> cmp eax, BIAS + 53
>> jae .Lexit # argument indefinite?
>
> Maybe s/.Lexit/.L0/
Surely!
>> # |argument| >= 0x1.0p53?
>> cvttsd2si rax, xmm0 # rax = trunc(argument)
>> cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
>> psrlq xmm0, 63
>> psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> orpd xmm0, xmm1 # xmm0 = trunc(argument)
>> .L0: ret
>> .end
>>
>
> This looks nice.
Let's see how to convince GCC to generate such code sequences...
Stefan
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 14:32 ` Stefan Kanthak
@ 2021-08-06 15:04 ` Michael Matz
2021-08-06 15:16 ` Richard Biener
1 sibling, 0 replies; 19+ messages in thread
From: Michael Matz @ 2021-08-06 15:04 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: Gabriel Paubert, gcc
Hello,
On Fri, 6 Aug 2021, Stefan Kanthak wrote:
> >> For -ffast-math, where the sign of -0.0 is not handled and the
> >> spurious invalid floating-point exception for |argument| >= 2**63 is
> >> acceptable,
> >
> > This claim would need to be proven in the wild.
>
> I should have left the "when" after the "and" which I originally had
> written...
>
> > |argument| > 2**52 are already integer, and shouldn't generate a
> > spurious exception from the various to-int conversions, not even in
> > fast-math mode for some relevant set of applications (at least
> > SPECcpu).
> >
> > Btw, have you made speed measurements with your improvements?
>
> No.
>
> > The size improvements are obvious, but speed changes can be fairly
> > unintuitive, e.g. there were old K8 CPUs where the memory loads for
> > constants are actually faster than the equivalent sequence of shifting
> > and masking for the >= compares. That's an irrelevant CPU now, but it
> > shows that intuition about speed consequences can be wrong.
>
> I know. I also know of CPUs that couldn't load a 16-byte wide XMM register
> in one go and had to split the load into two 8-byte loads.
>
> If the constant happens to be present in L1 cache, it MAY load as fast
> as an immediate.
> BUT: on current CPUs, the code GCC generates
>
> movsd .LC1(%rip), %xmm2
> movsd .LC0(%rip), %xmm4
> movapd %xmm0, %xmm3
> movapd %xmm0, %xmm1
> andpd %xmm2, %xmm3
> ucomisd %xmm3, %xmm4
> jbe 38 <_trunc+0x38>
>
> needs
> - 4 cycles if the movsd are executed in parallel and the movapd are
> handled by the register renamer,
> - 5 cycles if the movsd and the movapd are executed in parallel,
> - 7 cycles else,
> plus an unknown number of cycles if the constants are not in L1.
You also need to consider the case that the to-int converters are called
in a loop (which ultimately are the only interesting cases for
performance), where it's possible to load the constants before the loop
and keep them in registers (at the expense of some register pressure, of
course), effectively removing the loads from cost considerations. It's all
tough choices, which is why stuff needs to be measured in some contexts
:-)
(I do like your sequences btw, it's just not 100% clearcut that they are
always a speed improvement).
Ciao,
Michael.
> The proposed
>
> movq rax, xmm0
> add rax, rax
> shr rax, 53
> cmp eax, 53+1023
> jae return
>
> needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
> register renamer).
>
> Stefan
>
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 14:32 ` Stefan Kanthak
2021-08-06 15:04 ` Michael Matz
@ 2021-08-06 15:16 ` Richard Biener
2021-08-06 16:57 ` Stefan Kanthak
1 sibling, 1 reply; 19+ messages in thread
From: Richard Biener @ 2021-08-06 15:16 UTC (permalink / raw)
To: gcc, Stefan Kanthak, Michael Matz
On August 6, 2021 4:32:48 PM GMT+02:00, Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>Michael Matz <matz@suse.de> wrote:
>
>
>> Hello,
>>
>> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
>>
>>> For -ffast-math, where the sign of -0.0 is not handled and the spurious
>>> invalid floating-point exception for |argument| >= 2**63 is acceptable,
>>
>> This claim would need to be proven in the wild.
>
>I should have left the "when" after the "and" which I originally had
>written...
>
>> |argument| > 2**52 are already integer, and shouldn't generate a spurious
>> exception from the various to-int conversions, not even in fast-math mode
>> for some relevant set of applications (at least SPECcpu).
>>
>> Btw, have you made speed measurements with your improvements?
>
>No.
>
>> The size improvements are obvious, but speed changes can be fairly
>> unintuitive, e.g. there were old K8 CPUs where the memory loads for
>> constants are actually faster than the equivalent sequence of shifting
>> and masking for the >= compares. That's an irrelevant CPU now, but it
>> shows that intuition about speed consequences can be wrong.
>
>I know. I also know of CPUs that couldn't load a 16-byte wide XMM register
>in one go and had to split the load into two 8-byte loads.
>
>If the constant happens to be present in L1 cache, it MAY load as fast
>as an immediate.
>BUT: on current CPUs, the code GCC generates
>
> movsd .LC1(%rip), %xmm2
> movsd .LC0(%rip), %xmm4
> movapd %xmm0, %xmm3
> movapd %xmm0, %xmm1
> andpd %xmm2, %xmm3
> ucomisd %xmm3, %xmm4
> jbe 38 <_trunc+0x38>
>
>needs
>- 4 cycles if the movsd are executed in parallel and the movapd are
> handled by the register renamer,
>- 5 cycles if the movsd and the movapd are executed in parallel,
>- 7 cycles else,
>plus an unknown number of cycles if the constants are not in L1.
>The proposed
>
> movq rax, xmm0
The XMM-to-GPR move costs you an extra cycle in latency. Shifts also tend to be port-constrained. The original sequences are also somewhat straightforward to vectorize.
> add rax, rax
> shr rax, 53
> cmp eax, 53+1023
> jae return
>
>needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
>register renamer).
>
>Stefan
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 15:16 ` Richard Biener
@ 2021-08-06 16:57 ` Stefan Kanthak
0 siblings, 0 replies; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-06 16:57 UTC (permalink / raw)
To: Richard Biener, gcc, Michael Matz
Richard Biener <richard.guenther@gmail.com> wrote:
> On August 6, 2021 4:32:48 PM GMT+02:00, Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>Michael Matz <matz@suse.de> wrote:
>>> Btw, have you made speed measurements with your improvements?
>>
>>No.
[...]
>>If the constant happens to be present in L1 cache, it MAY load as fast
>>as an immediate.
>>BUT: on current CPUs, the code GCC generates
>>
>> movsd .LC1(%rip), %xmm2
>> movsd .LC0(%rip), %xmm4
>> movapd %xmm0, %xmm3
>> movapd %xmm0, %xmm1
>> andpd %xmm2, %xmm3
>> ucomisd %xmm3, %xmm4
>> jbe 38 <_trunc+0x38>
>>
>>needs
>>- 4 cycles if the movsd are executed in parallel and the movapd are
>> handled by the register renamer,
>>- 5 cycles if the movsd and the movapd are executed in parallel,
>>- 7 cycles else,
>>plus an unknown number of cycles if the constants are not in L1.
>>The proposed
>>
>> movq rax, xmm0
>
> The xmm to GPR move costs you an extra cycle in latency. Shifts also
> tend to be port-constrained. The original sequences are also somewhat
> straightforward to vectorize.
Please show how GCC vectorizes CVT[T]SD2SI and CVTSI2SD!
These are the bottlenecks in the current code.
If you want the code for trunc() and cousins to be vectorizable you
should stay with the alternative code I presented some posts before,
which GCC should be (able to) generate from its other procedural
variant.
Stefan
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 14:37 ` Stefan Kanthak
@ 2021-08-06 17:44 ` Joseph Myers
2021-08-07 12:32 ` Stefan Kanthak
0 siblings, 1 reply; 19+ messages in thread
From: Joseph Myers @ 2021-08-06 17:44 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: Gabriel Paubert, gcc
On Fri, 6 Aug 2021, Stefan Kanthak wrote:
> > I don't know what the standard says about NaNs in this case, I seem to
> > remember that arithmetic instructions typically produce QNaN when one of
> > the inputs is a NaN, whether signaling or not.
>
> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/trunc.html>
> and its cousins as well as the C standard say
>
> | If x is NaN, a NaN shall be returned.
>
> That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.
You should be looking at TS 18661-3 / C2x Annex F for sNaN handling; the
POSIX attempts to deal with signaling NaNs aren't well thought out. (And
possibly adding flag_signaling_nans conditions as appropriate to disable
for -fsignaling-nans anything for these to-integer operations that doesn't
produce a qNaN with INVALID raised from sNaN input.) Though in C2x mode,
these SSE2 code sequences won't be used by default anyway, except for rint
(C2x implies -fno-fp-int-builtin-inexact).
--
Joseph S. Myers
joseph@codesourcery.com
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-06 17:44 ` Joseph Myers
@ 2021-08-07 12:32 ` Stefan Kanthak
2021-08-08 22:58 ` Vincent Lefevre
2021-08-09 17:19 ` Joseph Myers
0 siblings, 2 replies; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-07 12:32 UTC (permalink / raw)
To: Joseph Myers; +Cc: Gabriel Paubert, gcc
Joseph Myers <joseph@codesourcery.com> wrote:
> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
PLEASE DON'T STRIP ATTRIBUTION LINES: I did not write the following paragraph!
>> > I don't know what the standard says about NaNs in this case, I seem to
>> > remember that arithmetic instructions typically produce QNaN when one of
>> > the inputs is a NaN, whether signaling or not.
>>
>> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/trunc.html>
>> and its cousins as well as the C standard say
>>
>> | If x is NaN, a NaN shall be returned.
>>
>> That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.
>
> You should be looking at TS 18661-3 / C2x Annex F for sNaN handling;
I'll do so as soon as GCC drops support for all C dialects before C2x!
Unless you use a time machine and fix the POSIX and ISO C standards
written in the past you CAN'T neglect all software written before C2x
modified sNaN handling that relies on the documented behaviour at the
time it was written.
> the POSIX attempts to deal with signaling NaNs aren't well thought out.
[...]
> Though in C2x mode, these SSE2 code sequences won't be used by default
> anyway, except for rint (C2x implies -fno-fp-int-builtin-inexact).
regards
Stefan
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-07 12:32 ` Stefan Kanthak
@ 2021-08-08 22:58 ` Vincent Lefevre
2021-08-09 17:19 ` Joseph Myers
1 sibling, 0 replies; 19+ messages in thread
From: Vincent Lefevre @ 2021-08-08 22:58 UTC (permalink / raw)
To: gcc
On 2021-08-07 14:32:32 +0200, Stefan Kanthak wrote:
> Joseph Myers <joseph@codesourcery.com> wrote:
> > On Fri, 6 Aug 2021, Stefan Kanthak wrote:
>
> PLEASE DON'T STRIP ATTRIBUTION LINES: I did not write the following paragraph!
>
> >> > I don't know what the standard says about NaNs in this case, I seem to
> >> > remember that arithmetic instructions typically produce QNaN when one of
> >> > the inputs is a NaN, whether signaling or not.
> >>
> >> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/trunc.html>
> >> and its cousins as well as the C standard say
> >>
> >> | If x is NaN, a NaN shall be returned.
> >>
> >> That's why I mentioned that the code GCC generates also doesn't
> >> quiet SNaNs.
> >
> > You should be looking at TS 18661-3 / C2x Annex F for sNaN handling;
>
> I'll do so as soon as GCC drops support for all C dialects before C2x!
>
> Unless you use a time machine and fix the POSIX and ISO C standards
> written in the past, you CAN'T neglect all the software, written before
> C2x modified sNaN handling, that relies on the behaviour documented at
> the time it was written.
Before C2x (in Annex F):
| This specification does not define the behavior of signaling NaNs.365)
| It generally uses the term NaN to denote quiet NaNs.
|
| 365) Since NaNs created by IEC 60559 operations are always quiet,
| quiet NaNs (along with infinities) are sufficient for closure of
| the arithmetic.
You can't create signaling NaNs with C operations, but you may get
them when reading data from memory. So, IMHO, they should really be
supported in practice, at least in some sense. I would expect that
when an sNaN occurs as an input, it is handled either like an sNaN
(see IEC 60559 / IEEE 754) or like a qNaN. Propagating the
signaling status (forbidden in IEEE 754 for almost all operations)
could be acceptable (this means that an implementation may ignore
whether a NaN is quiet or signaling), but should be avoided.
--
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
2021-08-07 12:32 ` Stefan Kanthak
2021-08-08 22:58 ` Vincent Lefevre
@ 2021-08-09 17:19 ` Joseph Myers
1 sibling, 0 replies; 19+ messages in thread
From: Joseph Myers @ 2021-08-09 17:19 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: gcc
On Sat, 7 Aug 2021, Stefan Kanthak wrote:
> Joseph Myers <joseph@codesourcery.com> wrote:
> > You should be looking at TS 18661-3 / C2x Annex F for sNaN handling;
>
> I'll do so as soon as GCC drops support for all C dialects before C2x!
>
> Unless you use a time machine to fix the POSIX and ISO C standards
> written in the past, you CAN'T neglect all the software written before
> C2x modified sNaN handling, software that relies on the documented
> behaviour at the time it was written.
Pre-C2x versions of C don't cover signaling NaNs at all; they use "NaN" to
mean "quiet NaN" (so signaling NaNs are trap representations). Software
written before C2x thus can't rely on any particular sNaN handling.
The POSIX description of signaling NaNs ("On implementations that support
the IEC 60559:1989 standard floating point, functions with signaling NaN
argument(s) shall be treated as if the function were called with an
argument that is a required domain error and shall return a quiet NaN
result, except where stated otherwise.") is consistent with C2x as regards
trunc (sNaN) needing to return a quiet NaN with INVALID raised. The
problems are:
(a) POSIX fails to "state otherwise" for the cases (e.g. fabs,
copysign) where a signaling NaN argument should not result in a quiet
NaN with INVALID raised (as per IEEE semantics for those operations);
(b) the POSIX rule about setting errno to EDOM when (math_errhandling
& MATH_ERRNO) is nonzero is inappropriate for sNaN arguments
(incompatible with the normal approach of generating INVALID and a
quiet NaN by passing NaN arguments through arithmetic); the C2x
approach, under which it is implementation-defined whether an sNaN
input is a domain error, is more appropriate.
--
Joseph S. Myers
joseph@codesourcery.com
end of thread, other threads:[~2021-08-09 17:19 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-05 7:25 Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1 Stefan Kanthak
2021-08-05 9:42 ` Gabriel Paubert
2021-08-05 10:19 ` Richard Biener
2021-08-05 11:58 ` Stefan Kanthak
2021-08-05 13:59 ` Gabriel Paubert
2021-08-06 12:43 ` Stefan Kanthak
2021-08-06 12:59 ` Richard Biener
2021-08-06 13:20 ` Gabriel Paubert
2021-08-06 14:37 ` Stefan Kanthak
2021-08-06 17:44 ` Joseph Myers
2021-08-07 12:32 ` Stefan Kanthak
2021-08-08 22:58 ` Vincent Lefevre
2021-08-09 17:19 ` Joseph Myers
2021-08-06 13:31 ` Michael Matz
2021-08-06 14:32 ` Stefan Kanthak
2021-08-06 15:04 ` Michael Matz
2021-08-06 15:16 ` Richard Biener
2021-08-06 16:57 ` Stefan Kanthak
2021-08-05 13:18 ` Gabriel Ravier