Suboptimal code generated for __buitlin_trunc on AMD64 without SS4

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
@ 2021-08-05  7:25 Stefan Kanthak
  2021-08-05  9:42 ` Gabriel Paubert
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-05  7:25 UTC (permalink / raw)
  To: gcc

Hi,

targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
following code (13 instructions using 57 bytes, plus 4 quadwords
using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:

                                .text
   0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
                        4: R_X86_64_PC32        .rdata
   8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
                        c: R_X86_64_PC32        .rdata
  10:   66 0f 28 d8             movapd %xmm0, %xmm3
  14:   66 0f 28 c8             movapd %xmm0, %xmm1
  18:   66 0f 54 da             andpd  %xmm2, %xmm3
  1c:   66 0f 2e e3             ucomisd %xmm3, %xmm4
  20:   76 16                   jbe    38 <_trunc+0x38>
  22:   f2 48 0f 2c c0          cvttsd2si %xmm0, %rax
  27:   66 0f ef c0             pxor   %xmm0, %xmm0
  2b:   66 0f 55 d1             andnpd %xmm1, %xmm2
  2f:   f2 48 0f 2a c0          cvtsi2sd %rax, %xmm0
  34:   66 0f 56 c2             orpd   %xmm2, %xmm0
  38:   c3                      retq

                                .rdata
                                .align 8
   0:   00 00 00 00     .LC0:   .quad  0x1.0p52
        00 00 30 43
        00 00 00 00
        00 00 00 00
                                .align 16
  10:   ff ff ff ff     .LC1:   .quad  ~(-0.0)
        ff ff ff 7f
  18:   00 00 00 00             .quad  0.0
        00 00 00 00
                                .end

JFTR: in the best case, the memory accesses cost several cycles,
      while in the worst case they yield a page fault!


Properly optimized, shorter and faster code, using but only 9 instructions
in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
and saving at least 16 + 32 bytes, follows:

                              .intel_syntax
                              .text
   0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
   5:   48 f7 d8              neg     rax
                        #     jz      .L0          # argument zero?
   8:   70 16                 jo      .L0          # argument indefinite?
                                                   # argument overflows 64-bit integer?
   a:   48 f7 d8              neg     rax
   d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
  12:   66 0f 73 d0 3f        psrlq   xmm0, 63
  17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
  1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
  20:   c3              .L0:  ret
                              .end

regards
Stefan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-05  7:25 Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1 Stefan Kanthak
@ 2021-08-05  9:42 ` Gabriel Paubert
  2021-08-05 10:19   ` Richard Biener
                     ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Gabriel Paubert @ 2021-08-05  9:42 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: gcc

On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> Hi,
> 
> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> following code (13 instructions using 57 bytes, plus 4 quadwords
> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> 
>                                 .text
>    0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
>                         4: R_X86_64_PC32        .rdata
>    8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
>                         c: R_X86_64_PC32        .rdata
>   10:   66 0f 28 d8             movapd %xmm0, %xmm3
>   14:   66 0f 28 c8             movapd %xmm0, %xmm1
>   18:   66 0f 54 da             andpd  %xmm2, %xmm3
>   1c:   66 0f 2e e3             ucomisd %xmm3, %xmm4
>   20:   76 16                   jbe    38 <_trunc+0x38>
>   22:   f2 48 0f 2c c0          cvttsd2si %xmm0, %rax
>   27:   66 0f ef c0             pxor   %xmm0, %xmm0
>   2b:   66 0f 55 d1             andnpd %xmm1, %xmm2
>   2f:   f2 48 0f 2a c0          cvtsi2sd %rax, %xmm0
>   34:   66 0f 56 c2             orpd   %xmm2, %xmm0
>   38:   c3                      retq
> 
>                                 .rdata
>                                 .align 8
>    0:   00 00 00 00     .LC0:   .quad  0x1.0p52
>         00 00 30 43
>         00 00 00 00
>         00 00 00 00
>                                 .align 16
>   10:   ff ff ff ff     .LC1:   .quad  ~(-0.0)
>         ff ff ff 7f
>   18:   00 00 00 00             .quad  0.0
>         00 00 00 00
>                                 .end
> 
> JFTR: in the best case, the memory accesses cost several cycles,
>       while in the worst case they yield a page fault!
> 
> 
> Properly optimized, shorter and faster code, using but only 9 instructions
> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
> and saving at least 16 + 32 bytes, follows:
> 
>                               .intel_syntax
>                               .text
>    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
>    5:   48 f7 d8              neg     rax
>                         #     jz      .L0          # argument zero?
>    8:   70 16                 jo      .L0          # argument indefinite?
>                                                    # argument overflows 64-bit integer?
>    a:   48 f7 d8              neg     rax
>    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
>   17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>   1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
>   20:   c3              .L0:  ret
>                               .end

There is one important difference, namely setting the invalid exception
flag when the parameter can't be represented in a signed integer.  So
using your code may require some option (-fast-math comes to mind), or
you need at least a check on the exponent before cvttsd2si.

The last part of your code then goes to take into account the special
case of -0.0, which I most often don't care about (I'd like to have a
-fdont-split-hairs-about-the-sign-of-zero option).

Potentially generating spurious invalid operation and then carefully
taking into account the sign of zero does not seem very consistent.

Apart from this, in your code, after cvttsd2si I'd rather use:
	mov rcx,rax # make a second copy to a scratch register
	neg rcx
	jo .L0
	cvtsi2sd xmm1,rax

The reason is latency, in an OoO engine, splitting the two paths is
almost always a win.

With your patch:

cvttsd2si-->neg-?->neg-->cvtsi2sd
              
where the ? means that the following instructions are speculated.  

With an auxiliary register there are two dependency chains:

cvttsd2si-?->cvtsi2sd
         |->mov->neg->jump

Actually some OoO cores just eliminate register copies using register
renaming mechanism. But even this is probably completely irrelevant in
this case where the latency is dominated by the two conversion
instructions.

	Regards,
	Gabriel



> 
> regards
> Stefan
 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-05  9:42 ` Gabriel Paubert
@ 2021-08-05 10:19   ` Richard Biener
  2021-08-05 11:58   ` Stefan Kanthak
  2021-08-05 13:18   ` Gabriel Ravier
  2 siblings, 0 replies; 19+ messages in thread
From: Richard Biener @ 2021-08-05 10:19 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: Stefan Kanthak, GCC Development

On Thu, Aug 5, 2021 at 11:44 AM Gabriel Paubert <paubert@iram.es> wrote:
>
> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> > Hi,
> >
> > targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> > following code (13 instructions using 57 bytes, plus 4 quadwords
> > using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> >
> >                                 .text
> >    0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
> >                         4: R_X86_64_PC32        .rdata
> >    8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
> >                         c: R_X86_64_PC32        .rdata
> >   10:   66 0f 28 d8             movapd %xmm0, %xmm3
> >   14:   66 0f 28 c8             movapd %xmm0, %xmm1
> >   18:   66 0f 54 da             andpd  %xmm2, %xmm3
> >   1c:   66 0f 2e e3             ucomisd %xmm3, %xmm4
> >   20:   76 16                   jbe    38 <_trunc+0x38>
> >   22:   f2 48 0f 2c c0          cvttsd2si %xmm0, %rax
> >   27:   66 0f ef c0             pxor   %xmm0, %xmm0
> >   2b:   66 0f 55 d1             andnpd %xmm1, %xmm2
> >   2f:   f2 48 0f 2a c0          cvtsi2sd %rax, %xmm0
> >   34:   66 0f 56 c2             orpd   %xmm2, %xmm0
> >   38:   c3                      retq
> >
> >                                 .rdata
> >                                 .align 8
> >    0:   00 00 00 00     .LC0:   .quad  0x1.0p52
> >         00 00 30 43
> >         00 00 00 00
> >         00 00 00 00
> >                                 .align 16
> >   10:   ff ff ff ff     .LC1:   .quad  ~(-0.0)
> >         ff ff ff 7f
> >   18:   00 00 00 00             .quad  0.0
> >         00 00 00 00
> >                                 .end
> >
> > JFTR: in the best case, the memory accesses cost several cycles,
> >       while in the worst case they yield a page fault!
> >
> >
> > Properly optimized, shorter and faster code, using but only 9 instructions
> > in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
> > and saving at least 16 + 32 bytes, follows:
> >
> >                               .intel_syntax
> >                               .text
> >    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
> >    5:   48 f7 d8              neg     rax
> >                         #     jz      .L0          # argument zero?
> >    8:   70 16                 jo      .L0          # argument indefinite?
> >                                                    # argument overflows 64-bit integer?
> >    a:   48 f7 d8              neg     rax
> >    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> >   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
> >   17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >   1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> >   20:   c3              .L0:  ret
> >                               .end
>
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer.  So
> using your code may require some option (-fast-math comes to mind), or
> you need at least a check on the exponent before cvttsd2si.
>
> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).
>
> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
>
> Apart from this, in your code, after cvttsd2si I'd rather use:
>         mov rcx,rax # make a second copy to a scratch register
>         neg rcx
>         jo .L0
>         cvtsi2sd xmm1,rax
>
> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
>
> With your patch:
>
> cvttsd2si-->neg-?->neg-->cvtsi2sd
>
> where the ? means that the following instructions are speculated.
>
> With an auxiliary register there are two dependency chains:
>
> cvttsd2si-?->cvtsi2sd
>          |->mov->neg->jump
>
> Actually some OoO cores just eliminate register copies using register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.

Btw, the code to emit these sequences is in
gcc/config/i386/i386-expand.c:ix86_expand_trunc
and friends.

Richard.

>         Regards,
>         Gabriel
>
>
>
> >
> > regards
> > Stefan
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-05  9:42 ` Gabriel Paubert
  2021-08-05 10:19   ` Richard Biener
@ 2021-08-05 11:58   ` Stefan Kanthak
  2021-08-05 13:59     ` Gabriel Paubert
  2021-08-05 13:18   ` Gabriel Ravier
  2 siblings, 1 reply; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-05 11:58 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: gcc

Gabriel Paubert <paubert@iram.es> wrote:


> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>> Hi,
>> 
>> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
>> following code (13 instructions using 57 bytes, plus 4 quadwords
>> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
>> 
>>                                 .text
>>    0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
>>                         4: R_X86_64_PC32        .rdata
>>    8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
>>                         c: R_X86_64_PC32        .rdata
>>   10:   66 0f 28 d8             movapd %xmm0, %xmm3
>>   14:   66 0f 28 c8             movapd %xmm0, %xmm1
>>   18:   66 0f 54 da             andpd  %xmm2, %xmm3
>>   1c:   66 0f 2e e3             ucomisd %xmm3, %xmm4
>>   20:   76 16                   jbe    38 <_trunc+0x38>
>>   22:   f2 48 0f 2c c0          cvttsd2si %xmm0, %rax
>>   27:   66 0f ef c0             pxor   %xmm0, %xmm0
>>   2b:   66 0f 55 d1             andnpd %xmm1, %xmm2
>>   2f:   f2 48 0f 2a c0          cvtsi2sd %rax, %xmm0
>>   34:   66 0f 56 c2             orpd   %xmm2, %xmm0
>>   38:   c3                      retq
>> 
>>                                 .rdata
>>                                 .align 8
>>    0:   00 00 00 00     .LC0:   .quad  0x1.0p52
>>         00 00 30 43
>>         00 00 00 00
>>         00 00 00 00
>>                                 .align 16
>>   10:   ff ff ff ff     .LC1:   .quad  ~(-0.0)
>>         ff ff ff 7f
>>   18:   00 00 00 00             .quad  0.0
>>         00 00 00 00
>>                                 .end
>> 
>> JFTR: in the best case, the memory accesses cost several cycles,
>>       while in the worst case they yield a page fault!
>> 
>> 
>> Properly optimized, shorter and faster code, using but only 9 instructions
>> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
>> and saving at least 16 + 32 bytes, follows:
>> 
>>                               .intel_syntax
>>                               .text
>>    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
>>    5:   48 f7 d8              neg     rax
>>                         #     jz      .L0          # argument zero?
>>    8:   70 16                 jo      .L0          # argument indefinite?
>>                                                    # argument overflows 64-bit integer?
>>    a:   48 f7 d8              neg     rax
>>    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>>   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
>>   17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>>   1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
>>   20:   c3              .L0:  ret
>>                               .end
> 
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer.

Right, I overlooked this fault. Thanks for pointing out.

> So using your code may require some option (-fast-math comes to mind),
> or you need at least a check on the exponent before cvttsd2si.

The whole idea behind these implementations is to get rid of loading
floating-point constants to perform comparisions.

> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).

Preserving the sign of -0.0 is explicitly specified in the standard,
and is cheap, as shown in my code.

> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
> 
> Apart from this, in your code, after cvttsd2si I'd rather use:
> mov rcx,rax # make a second copy to a scratch register
> neg rcx
> jo .L0
> cvtsi2sd xmm1,rax

I don't know how GCC generates the code for builtins, and what kind of
templates it uses: the second goal was to minimize register usage.

> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
> 
> With your patch:
> 
> cvttsd2si-->neg-?->neg-->cvtsi2sd
>              
> where the ? means that the following instructions are speculated.  
> 
> With an auxiliary register there are two dependency chains:
> 
> cvttsd2si-?->cvtsi2sd
>         |->mov->neg->jump

Correct; see above: I expect the template(s) for builtins to give
the register allocator some freedom to split code paths and resolve
dependency chains.

> Actually some OoO cores just eliminate register copies using register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.

Right, the conversions dominate both the original and the code I posted.
It's easy to get rid of them, with still slightly shorter and faster
branchless code (17 instructions, 84 bytes, instead of 13 instructions,
57 + 32 = 89 bytes):

                                        .code64
                                        .intel_syntax noprefix
                                        .text
   0:   48 b8 00 00 00 00 00 00 30 43   mov     rax, 0x4330000000000000
   a:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = 0x1.0p52 = 4503599627370496.0
   f:   48 b8 00 00 00 00 00 00 f0 3f   mov     rax, 0x3FF0000000000000
  19:   f2 0f 10 c8                     movsd   xmm1, xmm0       # xmm1 = argument
  1d:   66 0f 73 f0 01                  psllq   xmm0, 1
  22:   66 0f 73 d0 01                  psrlq   xmm0, 1          # xmm0 = |argument|
  27:   66 0f 73 d1 3f                  psrlq   xmm1, 63
  2c:   66 0f 73 f1 3f                  psllq   xmm1, 63         # xmm1 = (argument & -0.0) ? -0.0 : +0.0
  31:   f2 0f 10 d8                     movsd   xmm3, xmm0
  35:   f2 0f 58 c2                     addsd   xmm0, xmm2       # xmm0 = |argument| + 0x1.0p52
  39:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = |argument| - 0x1.0p52
                                                                 #      = rint(|argument|)
  3d:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = -0x1.0p0 = -1.0
  42:   f2 0f c2 d8 01                  cmpltsd xmm3, xmm0       # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
  47:   66 0f 54 d3                     andpd   xmm2, xmm3       # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
  4b:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = rint(|argument|)
                                                                 #      - (|argument| < rint(|argument|)) ? 1.0 : 0.0
                                                                 #      = trunc(|argument|)
  4f:   66 0f 56 c1                     orpd    xmm0, xmm1       # xmm0 = trunc(argument)
  53:   c3                              ret
                                        .end

regards
Stefan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-05 11:58   ` Stefan Kanthak
@ 2021-08-05 13:59     ` Gabriel Paubert
  2021-08-06 12:43       ` Stefan Kanthak
  0 siblings, 1 reply; 19+ messages in thread
From: Gabriel Paubert @ 2021-08-05 13:59 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: gcc

	Hi,

On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert <paubert@iram.es> wrote:
> 
> 
> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> >> Hi,
> >> 
> >> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> >> following code (13 instructions using 57 bytes, plus 4 quadwords
> >> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> >> 
> >>                                 .text
> >>    0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
> >>                         4: R_X86_64_PC32        .rdata
> >>    8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
> >>                         c: R_X86_64_PC32        .rdata
> >>   10:   66 0f 28 d8             movapd %xmm0, %xmm3
> >>   14:   66 0f 28 c8             movapd %xmm0, %xmm1
> >>   18:   66 0f 54 da             andpd  %xmm2, %xmm3
> >>   1c:   66 0f 2e e3             ucomisd %xmm3, %xmm4
> >>   20:   76 16                   jbe    38 <_trunc+0x38>
> >>   22:   f2 48 0f 2c c0          cvttsd2si %xmm0, %rax
> >>   27:   66 0f ef c0             pxor   %xmm0, %xmm0
> >>   2b:   66 0f 55 d1             andnpd %xmm1, %xmm2
> >>   2f:   f2 48 0f 2a c0          cvtsi2sd %rax, %xmm0
> >>   34:   66 0f 56 c2             orpd   %xmm2, %xmm0
> >>   38:   c3                      retq
> >> 
> >>                                 .rdata
> >>                                 .align 8
> >>    0:   00 00 00 00     .LC0:   .quad  0x1.0p52
> >>         00 00 30 43
> >>         00 00 00 00
> >>         00 00 00 00
> >>                                 .align 16
> >>   10:   ff ff ff ff     .LC1:   .quad  ~(-0.0)
> >>         ff ff ff 7f
> >>   18:   00 00 00 00             .quad  0.0
> >>         00 00 00 00
> >>                                 .end
> >> 
> >> JFTR: in the best case, the memory accesses cost several cycles,
> >>       while in the worst case they yield a page fault!
> >> 
> >> 
> >> Properly optimized, shorter and faster code, using but only 9 instructions
> >> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
> >> and saving at least 16 + 32 bytes, follows:
> >> 
> >>                               .intel_syntax
> >>                               .text
> >>    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
> >>    5:   48 f7 d8              neg     rax
> >>                         #     jz      .L0          # argument zero?
> >>    8:   70 16                 jo      .L0          # argument indefinite?
> >>                                                    # argument overflows 64-bit integer?
> >>    a:   48 f7 d8              neg     rax
> >>    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> >>   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
> >>   17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >>   1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> >>   20:   c3              .L0:  ret
> >>                               .end
> > 
> > There is one important difference, namely setting the invalid exception
> > flag when the parameter can't be represented in a signed integer.
> 
> Right, I overlooked this fault. Thanks for pointing out.
> 
> > So using your code may require some option (-fast-math comes to mind),
> > or you need at least a check on the exponent before cvttsd2si.
> 
> The whole idea behind these implementations is to get rid of loading
> floating-point constants to perform comparisions.

Indeed, but what I had in mind was something along the following lines:

	movq rax,xmm0   # and copy rax to say rcx, if needed later
	shrq rax,52     # move sign and exponent to 12 LSBs 
	andl eax,0x7ff  # mask the sign
	cmpl eax,0x434  # value to be checked
	ja return       # exponent too large, we're done (what about NaNs?)
	cvttsd2si rax,xmm0 # safe after exponent check
	cvtsi2sd xmm0,rax  # conversion done

and a bit more to handle the corner cases (essentially preserve the
sign to be correct between -1 and -0.0). But the CPU can (speculatively) 
start the conversions early, so the dependency chain is rather short.

I don't know if it's faster than your new code, I'm almost sure that
it's shorter. Your new code also has a fairly long dependency chain.

> 
> > The last part of your code then goes to take into account the special
> > case of -0.0, which I most often don't care about (I'd like to have a
> > -fdont-split-hairs-about-the-sign-of-zero option).
> 
> Preserving the sign of -0.0 is explicitly specified in the standard,
> and is cheap, as shown in my code.
> 
> > Potentially generating spurious invalid operation and then carefully
> > taking into account the sign of zero does not seem very consistent.
> > 
> > Apart from this, in your code, after cvttsd2si I'd rather use:
> > mov rcx,rax # make a second copy to a scratch register
> > neg rcx
> > jo .L0
> > cvtsi2sd xmm1,rax
> 
> I don't know how GCC generates the code for builtins, and what kind of
> templates it uses: the second goal was to minimize register usage.
> 

Ok, but on 64 bit using two GPRs would still be reasonable.

> > The reason is latency, in an OoO engine, splitting the two paths is
> > almost always a win.
> > 
> > With your patch:
> > 
> > cvttsd2si-->neg-?->neg-->cvtsi2sd
> >              
> > where the ? means that the following instructions are speculated.  
> > 
> > With an auxiliary register there are two dependency chains:
> > 
> > cvttsd2si-?->cvtsi2sd
> >         |->mov->neg->jump
> 
> Correct; see above: I expect the template(s) for builtins to give
> the register allocator some freedom to split code paths and resolve
> dependency chains.
> 
> > Actually some OoO cores just eliminate register copies using register
> > renaming mechanism. But even this is probably completely irrelevant in
> > this case where the latency is dominated by the two conversion
> > instructions.
> 
> Right, the conversions dominate both the original and the code I posted.
> It's easy to get rid of them, with still slightly shorter and faster
> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
> 57 + 32 = 89 bytes):
> 
>                                         .code64
>                                         .intel_syntax noprefix
>                                         .text
>    0:   48 b8 00 00 00 00 00 00 30 43   mov     rax, 0x4330000000000000
>    a:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = 0x1.0p52 = 4503599627370496.0
>    f:   48 b8 00 00 00 00 00 00 f0 3f   mov     rax, 0x3FF0000000000000
>   19:   f2 0f 10 c8                     movsd   xmm1, xmm0       # xmm1 = argument
>   1d:   66 0f 73 f0 01                  psllq   xmm0, 1
>   22:   66 0f 73 d0 01                  psrlq   xmm0, 1          # xmm0 = |argument|
>   27:   66 0f 73 d1 3f                  psrlq   xmm1, 63
>   2c:   66 0f 73 f1 3f                  psllq   xmm1, 63         # xmm1 = (argument & -0.0) ? -0.0 : +0.0
>   31:   f2 0f 10 d8                     movsd   xmm3, xmm0
>   35:   f2 0f 58 c2                     addsd   xmm0, xmm2       # xmm0 = |argument| + 0x1.0p52
>   39:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = |argument| - 0x1.0p52
>                                                                  #      = rint(|argument|)
>   3d:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = -0x1.0p0 = -1.0

Huh? I see +1.0, -1 would be 0xBFF0000000000000.

>   42:   f2 0f c2 d8 01                  cmpltsd xmm3, xmm0       # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
>   47:   66 0f 54 d3                     andpd   xmm2, xmm3       # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
>   4b:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = rint(|argument|)
>                                                                  #      - (|argument| < rint(|argument|)) ? 1.0 : 0.0
>                                                                  #      = trunc(|argument|)
>   4f:   66 0f 56 c1                     orpd    xmm0, xmm1       # xmm0 = trunc(argument)
>   53:   c3                              ret
>                                         .end
> 
> regards
> Stefan

	Regards,
	Gabriel
 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-05 13:59     ` Gabriel Paubert
@ 2021-08-06 12:43       ` Stefan Kanthak
  2021-08-06 12:59         ` Richard Biener
                           ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-06 12:43 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: gcc

Gabriel Paubert <paubert@iram.es> wrote:

> Hi,
> 
> On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
>> Gabriel Paubert <paubert@iram.es> wrote:
>> 
>> 
>> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:

>> >>                               .intel_syntax
>> >>                               .text
>> >>    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
>> >>    5:   48 f7 d8              neg     rax
>> >>                         #     jz      .L0          # argument zero?
>> >>    8:   70 16                 jo      .L0          # argument indefinite?
>> >>                                                    # argument overflows 64-bit integer?
>> >>    a:   48 f7 d8              neg     rax
>> >>    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>> >>   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
>> >>   17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> >>   1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
>> >>   20:   c3              .L0:  ret
>> >>                               .end
>> > 
>> > There is one important difference, namely setting the invalid exception
>> > flag when the parameter can't be represented in a signed integer.
>> 
>> Right, I overlooked this fault. Thanks for pointing out.
>> 
>> > So using your code may require some option (-fast-math comes to mind),
>> > or you need at least a check on the exponent before cvttsd2si.
>> 
>> The whole idea behind these implementations is to get rid of loading
>> floating-point constants to perform comparisions.
> 
> Indeed, but what I had in mind was something along the following lines:
> 
> movq rax,xmm0   # and copy rax to say rcx, if needed later
> shrq rax,52     # move sign and exponent to 12 LSBs 
> andl eax,0x7ff  # mask the sign
> cmpl eax,0x434  # value to be checked
> ja return       # exponent too large, we're done (what about NaNs?)
> cvttsd2si rax,xmm0 # safe after exponent check
> cvtsi2sd xmm0,rax  # conversion done
> 
> and a bit more to handle the corner cases (essentially preserve the
> sign to be correct between -1 and -0.0).

The sign of -0.0 is the only corner case and already handled in my code.
Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
preserved, as in the code GCC generates as well as my code.

> But the CPU can (speculatively) start the conversions early, so the
> dependency chain is rather short.

Correct.
 
> I don't know if it's faster than your new code,

It should be faster.

> I'm almost sure that it's shorter.

"neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
5+4+5+5+2=21 bytes.

JFTR: better use "add rax,rax; shr rax,53" instead of
      "shr rax,52; and eax,0x7ff" and save 2 bytes.

Complete properly optimized code for __builtin_trunc is then as follows
(11 instructions, 44 bytes):

.code64
.intel_syntax
.equ    BIAS, 1023
.text
        movq    rax, xmm0    # rax = argument
        add     rax, rax
        shr     rax, 53      # rax = exponent of |argument|
        cmp     eax, BIAS + 53
        jae     .Lexit       # argument indefinite?
                             # |argument| >= 0x1.0p53?
        cvttsd2si rax, xmm0  # rax = trunc(argument)
        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
        psrlq   xmm0, 63
        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
        orpd    xmm0, xmm1   # xmm0 = trunc(argument)
.L0:    ret
.end

@Richard Biener (et. al.):

1. Is a primitive for "floating-point > 2**x", which generates such
   an "integer" code sequence, already available, at least for
   float/binary32 and double/binary64?

2. the procedural code generator for __builtin_trunc() etc.  uses
   __builtin_fabs() and __builtin_copysign() as building blocks.
   These would need to (and of course should) be modified to generate
   psllq/psrlq pairs instead of andpd/andnpd referencing a memory
   location with either -0.0 oder ~(-0.0).

For -ffast-math, where the sign of -0.0 is not handled and the spurios
invalid floating-point exception for |argument| >= 2**63 is acceptable,
it boils down to:

.code64
.intel_syntax
.equ    BIAS, 1023
.text
        cvttsd2si rax, xmm0  # rax = trunc(argument)
        jo      .Lexit       # argument indefinite?
                             # |argument| > 0x1.0p63?
        cvtsi2sd xmm0, rax   # xmm1 = trunc(argument)
.L0:    ret
.end

[...]

>> Right, the conversions dominate both the original and the code I posted.
>> It's easy to get rid of them, with still slightly shorter and faster
>> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
>> 57 + 32 = 89 bytes):
>> 
>>                                         .code64
>>                                         .intel_syntax noprefix
>>                                         .text
>>    0:   48 b8 00 00 00 00 00 00 30 43   mov     rax, 0x4330000000000000
>>    a:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = 0x1.0p52 = 4503599627370496.0
>>    f:   48 b8 00 00 00 00 00 00 f0 3f   mov     rax, 0x3FF0000000000000
>>   19:   f2 0f 10 c8                     movsd   xmm1, xmm0       # xmm1 = argument
>>   1d:   66 0f 73 f0 01                  psllq   xmm0, 1
>>   22:   66 0f 73 d0 01                  psrlq   xmm0, 1          # xmm0 = |argument|
>>   27:   66 0f 73 d1 3f                  psrlq   xmm1, 63
>>   2c:   66 0f 73 f1 3f                  psllq   xmm1, 63         # xmm1 = (argument & -0.0) ? -0.0 : +0.0
>>   31:   f2 0f 10 d8                     movsd   xmm3, xmm0
>>   35:   f2 0f 58 c2                     addsd   xmm0, xmm2       # xmm0 = |argument| + 0x1.0p52
>>   39:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = |argument| - 0x1.0p52
>>                                                                  #      = rint(|argument|)
>>   3d:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = -0x1.0p0 = -1.0
> 
> Huh? I see +1.0, -1 would be 0xBFF0000000000000.

Spurious error in the comment.
I modified code which uses -1.0 and performs (a commutative) "addsd xmm2, xmm2"
instead of "subsd xmm0, xmm2" to save a "movsd" instruction.

>>   42:   f2 0f c2 d8 01                  cmpltsd xmm3, xmm0       # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
>>   47:   66 0f 54 d3                     andpd   xmm2, xmm3       # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
>>   4b:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = rint(|argument|)
>>                                                                  #      - (|argument| < rint(|argument|)) ? 1.0 : 0.0
>>                                                                  #      = trunc(|argument|)
>>   4f:   66 0f 56 c1                     orpd    xmm0, xmm1       # xmm0 = trunc(argument)
>>   53:   c3                              ret

regards
Stefan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-06 12:43       ` Stefan Kanthak
@ 2021-08-06 12:59         ` Richard Biener
  2021-08-06 13:20         ` Gabriel Paubert
  2021-08-06 13:31         ` Michael Matz
  2 siblings, 0 replies; 19+ messages in thread
From: Richard Biener @ 2021-08-06 12:59 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Gabriel Paubert, GCC Development

On Fri, Aug 6, 2021 at 2:47 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>
> Gabriel Paubert <paubert@iram.es> wrote:
>
> > Hi,
> >
> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> >> Gabriel Paubert <paubert@iram.es> wrote:
> >>
> >>
> >> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>
> >> >>                               .intel_syntax
> >> >>                               .text
> >> >>    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
> >> >>    5:   48 f7 d8              neg     rax
> >> >>                         #     jz      .L0          # argument zero?
> >> >>    8:   70 16                 jo      .L0          # argument indefinite?
> >> >>                                                    # argument overflows 64-bit integer?
> >> >>    a:   48 f7 d8              neg     rax
> >> >>    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> >> >>   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
> >> >>   17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> >>   1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> >> >>   20:   c3              .L0:  ret
> >> >>                               .end
> >> >
> >> > There is one important difference, namely setting the invalid exception
> >> > flag when the parameter can't be represented in a signed integer.
> >>
> >> Right, I overlooked this fault. Thanks for pointing out.
> >>
> >> > So using your code may require some option (-fast-math comes to mind),
> >> > or you need at least a check on the exponent before cvttsd2si.
> >>
> >> The whole idea behind these implementations is to get rid of loading
> >> floating-point constants to perform comparisions.
> >
> > Indeed, but what I had in mind was something along the following lines:
> >
> > movq rax,xmm0   # and copy rax to say rcx, if needed later
> > shrq rax,52     # move sign and exponent to 12 LSBs
> > andl eax,0x7ff  # mask the sign
> > cmpl eax,0x434  # value to be checked
> > ja return       # exponent too large, we're done (what about NaNs?)
> > cvttsd2si rax,xmm0 # safe after exponent check
> > cvtsi2sd xmm0,rax  # conversion done
> >
> > and a bit more to handle the corner cases (essentially preserve the
> > sign to be correct between -1 and -0.0).
>
> The sign of -0.0 is the only corner case and already handled in my code.
> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
> preserved, as in the code GCC generates as well as my code.
>
> > But the CPU can (speculatively) start the conversions early, so the
> > dependency chain is rather short.
>
> Correct.
>
> > I don't know if it's faster than your new code,
>
> It should be faster.
>
> > I'm almost sure that it's shorter.
>
> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
> 5+4+5+5+2=21 bytes.
>
> JFTR: better use "add rax,rax; shr rax,53" instead of
>       "shr rax,52; and eax,0x7ff" and save 2 bytes.
>
> Complete properly optimized code for __builtin_trunc is then as follows
> (11 instructions, 44 bytes):
>
> .code64
> .intel_syntax
> .equ    BIAS, 1023
> .text
>         movq    rax, xmm0    # rax = argument
>         add     rax, rax
>         shr     rax, 53      # rax = exponent of |argument|
>         cmp     eax, BIAS + 53
>         jae     .Lexit       # argument indefinite?
>                              # |argument| >= 0x1.0p53?
>         cvttsd2si rax, xmm0  # rax = trunc(argument)
>         cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>         psrlq   xmm0, 63
>         psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>         orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> .L0:    ret
> .end
>
> @Richard Biener (et. al.):
>
> 1. Is a primitive for "floating-point > 2**x", which generates such
>    an "integer" code sequence, already available, at least for
>    float/binary32 and double/binary64?

Not that I know, but it should be possible to craft that.

> 2. the procedural code generator for __builtin_trunc() etc.  uses
>    __builtin_fabs() and __builtin_copysign() as building blocks.
>    These would need to (and of course should) be modified to generate
>    psllq/psrlq pairs instead of andpd/andnpd referencing a memory
>    location with either -0.0 oder ~(-0.0).
>
> For -ffast-math, where the sign of -0.0 is not handled and the spurios
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> it boils down to:
>
> .code64
> .intel_syntax
> .equ    BIAS, 1023
> .text
>         cvttsd2si rax, xmm0  # rax = trunc(argument)
>         jo      .Lexit       # argument indefinite?
>                              # |argument| > 0x1.0p63?
>         cvtsi2sd xmm0, rax   # xmm1 = trunc(argument)
> .L0:    ret
> .end
>
> [...]
>
> >> Right, the conversions dominate both the original and the code I posted.
> >> It's easy to get rid of them, with still slightly shorter and faster
> >> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
> >> 57 + 32 = 89 bytes):
> >>
> >>                                         .code64
> >>                                         .intel_syntax noprefix
> >>                                         .text
> >>    0:   48 b8 00 00 00 00 00 00 30 43   mov     rax, 0x4330000000000000
> >>    a:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = 0x1.0p52 = 4503599627370496.0
> >>    f:   48 b8 00 00 00 00 00 00 f0 3f   mov     rax, 0x3FF0000000000000
> >>   19:   f2 0f 10 c8                     movsd   xmm1, xmm0       # xmm1 = argument
> >>   1d:   66 0f 73 f0 01                  psllq   xmm0, 1
> >>   22:   66 0f 73 d0 01                  psrlq   xmm0, 1          # xmm0 = |argument|
> >>   27:   66 0f 73 d1 3f                  psrlq   xmm1, 63
> >>   2c:   66 0f 73 f1 3f                  psllq   xmm1, 63         # xmm1 = (argument & -0.0) ? -0.0 : +0.0
> >>   31:   f2 0f 10 d8                     movsd   xmm3, xmm0
> >>   35:   f2 0f 58 c2                     addsd   xmm0, xmm2       # xmm0 = |argument| + 0x1.0p52
> >>   39:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = |argument| - 0x1.0p52
> >>                                                                  #      = rint(|argument|)
> >>   3d:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = -0x1.0p0 = -1.0
> >
> > Huh? I see +1.0, -1 would be 0xBFF0000000000000.
>
> Spurious error in the comment.
> I modified code which uses -1.0 and performs (a commutative) "addsd xmm2, xmm2"
> instead of "subsd xmm0, xmm2" to save a "movsd" instruction.
>
> >>   42:   f2 0f c2 d8 01                  cmpltsd xmm3, xmm0       # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
> >>   47:   66 0f 54 d3                     andpd   xmm2, xmm3       # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >>   4b:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = rint(|argument|)
> >>                                                                  #      - (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >>                                                                  #      = trunc(|argument|)
> >>   4f:   66 0f 56 c1                     orpd    xmm0, xmm1       # xmm0 = trunc(argument)
> >>   53:   c3                              ret
>
> regards
> Stefan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-06 12:43       ` Stefan Kanthak
  2021-08-06 12:59         ` Richard Biener
@ 2021-08-06 13:20         ` Gabriel Paubert
  2021-08-06 14:37           ` Stefan Kanthak
  2021-08-06 13:31         ` Michael Matz
  2 siblings, 1 reply; 19+ messages in thread
From: Gabriel Paubert @ 2021-08-06 13:20 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: gcc

On Fri, Aug 06, 2021 at 02:43:34PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert <paubert@iram.es> wrote:
> 
> > Hi,
> > 
> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> >> Gabriel Paubert <paubert@iram.es> wrote:
> >> 
> >> 
> >> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> 
> >> >>                               .intel_syntax
> >> >>                               .text
> >> >>    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
> >> >>    5:   48 f7 d8              neg     rax
> >> >>                         #     jz      .L0          # argument zero?
> >> >>    8:   70 16                 jo      .L0          # argument indefinite?
> >> >>                                                    # argument overflows 64-bit integer?
> >> >>    a:   48 f7 d8              neg     rax
> >> >>    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> >> >>   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
> >> >>   17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> >>   1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> >> >>   20:   c3              .L0:  ret
> >> >>                               .end
> >> > 
> >> > There is one important difference, namely setting the invalid exception
> >> > flag when the parameter can't be represented in a signed integer.
> >> 
> >> Right, I overlooked this fault. Thanks for pointing out.
> >> 
> >> > So using your code may require some option (-fast-math comes to mind),
> >> > or you need at least a check on the exponent before cvttsd2si.
> >> 
> >> The whole idea behind these implementations is to get rid of loading
> >> floating-point constants to perform comparisions.
> > 
> > Indeed, but what I had in mind was something along the following lines:
> > 
> > movq rax,xmm0   # and copy rax to say rcx, if needed later
> > shrq rax,52     # move sign and exponent to 12 LSBs 
> > andl eax,0x7ff  # mask the sign
> > cmpl eax,0x434  # value to be checked
> > ja return       # exponent too large, we're done (what about NaNs?)
> > cvttsd2si rax,xmm0 # safe after exponent check
> > cvtsi2sd xmm0,rax  # conversion done
> > 
> > and a bit more to handle the corner cases (essentially preserve the
> > sign to be correct between -1 and -0.0).
> 
> The sign of -0.0 is the only corner case and already handled in my code.
> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
> preserved, as in the code GCC generates as well as my code.

I don't know what the standard says about NaNs in this case, I seem to
remember that arithmetic instructions typically produce QNaN when one of
the inputs is a NaN, whether signaling or not. 

> 
> > But the CPU can (speculatively) start the conversions early, so the
> > dependency chain is rather short.
> 
> Correct.
>  
> > I don't know if it's faster than your new code,
> 
> It should be faster.
> 
> > I'm almost sure that it's shorter.
> 
> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
> 5+4+5+5+2=21 bytes.
> 
> JFTR: better use "add rax,rax; shr rax,53" instead of
>       "shr rax,52; and eax,0x7ff" and save 2 bytes.

Indeed, I don't have the exact size of instructions in my head,
especially since I've not written x86 assembly since the mid 90s.

In any case, with your last improvement, the code is now down to a
single 32 bit immediate constant. And I don't see how to eliminate it...

> 
> Complete properly optimized code for __builtin_trunc is then as follows
> (11 instructions, 44 bytes):
> 
> .code64
> .intel_syntax
> .equ    BIAS, 1023
> .text
>         movq    rax, xmm0    # rax = argument
>         add     rax, rax
>         shr     rax, 53      # rax = exponent of |argument|
>         cmp     eax, BIAS + 53
>         jae     .Lexit       # argument indefinite?

Maybe s/.Lexit/.L0/

>                              # |argument| >= 0x1.0p53?
>         cvttsd2si rax, xmm0  # rax = trunc(argument)
>         cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>         psrlq   xmm0, 63
>         psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>         orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> .L0:    ret
> .end
> 

This looks nice.

> @Richard Biener (et. al.):
> 
> 1. Is a primitive for "floating-point > 2**x", which generates such
>    an "integer" code sequence, already available, at least for
>    float/binary32 and double/binary64?
> 
> 2. the procedural code generator for __builtin_trunc() etc.  uses
>    __builtin_fabs() and __builtin_copysign() as building blocks.
>    These would need to (and of course should) be modified to generate
>    psllq/psrlq pairs instead of andpd/andnpd referencing a memory
>    location with either -0.0 oder ~(-0.0).
> 
> For -ffast-math, where the sign of -0.0 is not handled and the spurios
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> it boils down to:
> 
> .code64
> .intel_syntax
> .equ    BIAS, 1023
> .text
>         cvttsd2si rax, xmm0  # rax = trunc(argument)
>         jo      .Lexit       # argument indefinite?
>                              # |argument| > 0x1.0p63?
>         cvtsi2sd xmm0, rax   # xmm1 = trunc(argument)
> .L0:    ret
> .end
> 
> [...]
> 
> >> Right, the conversions dominate both the original and the code I posted.
> >> It's easy to get rid of them, with still slightly shorter and faster
> >> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
> >> 57 + 32 = 89 bytes):
> >> 
> >>                                         .code64
> >>                                         .intel_syntax noprefix
> >>                                         .text
> >>    0:   48 b8 00 00 00 00 00 00 30 43   mov     rax, 0x4330000000000000
> >>    a:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = 0x1.0p52 = 4503599627370496.0
> >>    f:   48 b8 00 00 00 00 00 00 f0 3f   mov     rax, 0x3FF0000000000000
> >>   19:   f2 0f 10 c8                     movsd   xmm1, xmm0       # xmm1 = argument
> >>   1d:   66 0f 73 f0 01                  psllq   xmm0, 1
> >>   22:   66 0f 73 d0 01                  psrlq   xmm0, 1          # xmm0 = |argument|
> >>   27:   66 0f 73 d1 3f                  psrlq   xmm1, 63
> >>   2c:   66 0f 73 f1 3f                  psllq   xmm1, 63         # xmm1 = (argument & -0.0) ? -0.0 : +0.0
> >>   31:   f2 0f 10 d8                     movsd   xmm3, xmm0
> >>   35:   f2 0f 58 c2                     addsd   xmm0, xmm2       # xmm0 = |argument| + 0x1.0p52
> >>   39:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = |argument| - 0x1.0p52
> >>                                                                  #      = rint(|argument|)
> >>   3d:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = -0x1.0p0 = -1.0
> > 
> > Huh? I see +1.0, -1 would be 0xBFF0000000000000.
> 
> Spurious error in the comment.
> I modified code which uses -1.0 and performs (a commutative) "addsd xmm2, xmm2"
> instead of "subsd xmm0, xmm2" to save a "movsd" instruction.
> 
> >>   42:   f2 0f c2 d8 01                  cmpltsd xmm3, xmm0       # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
> >>   47:   66 0f 54 d3                     andpd   xmm2, xmm3       # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >>   4b:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = rint(|argument|)
> >>                                                                  #      - (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >>                                                                  #      = trunc(|argument|)
> >>   4f:   66 0f 56 c1                     orpd    xmm0, xmm1       # xmm0 = trunc(argument)
> >>   53:   c3                              ret
> 
> regards
> Stefan

	Regards,
	Gabriel
 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-06 13:20         ` Gabriel Paubert
@ 2021-08-06 14:37           ` Stefan Kanthak
  2021-08-06 17:44             ` Joseph Myers
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-06 14:37 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: gcc

Gabriel Paubert <paubert@iram.es> wrote:


> On Fri, Aug 06, 2021 at 02:43:34PM +0200, Stefan Kanthak wrote:
>> Gabriel Paubert <paubert@iram.es> wrote:
>> 
>> > Hi,
>> > 
>> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:

[...]

>> >> The whole idea behind these implementations is to get rid of loading
>> >> floating-point constants to perform comparisions.
>> > 
>> > Indeed, but what I had in mind was something along the following lines:
>> > 
>> > movq rax,xmm0   # and copy rax to say rcx, if needed later
>> > shrq rax,52     # move sign and exponent to 12 LSBs 
>> > andl eax,0x7ff  # mask the sign
>> > cmpl eax,0x434  # value to be checked
>> > ja return       # exponent too large, we're done (what about NaNs?)
>> > cvttsd2si rax,xmm0 # safe after exponent check
>> > cvtsi2sd xmm0,rax  # conversion done
>> > 
>> > and a bit more to handle the corner cases (essentially preserve the
>> > sign to be correct between -1 and -0.0).
>> 
>> The sign of -0.0 is the only corner case and already handled in my code.
>> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
>> preserved, as in the code GCC generates as well as my code.
> 
> I don't know what the standard says about NaNs in this case, I seem to
> remember that arithmetic instructions typically produce QNaN when one of
> the inputs is a NaN, whether signaling or not. 

<https://pubs.opengroup.org/onlinepubs/9699919799/functions/trunc.html>
and its cousins as well as the C standard say

| If x is NaN, a NaN shall be returned.

That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.

>> > But the CPU can (speculatively) start the conversions early, so the
>> > dependency chain is rather short.
>> 
>> Correct.
>>  
>> > I don't know if it's faster than your new code,
>> 
>> It should be faster.
>> 
>> > I'm almost sure that it's shorter.
>> 
>> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
>> 5+4+5+5+2=21 bytes.
>> 
>> JFTR: better use "add rax,rax; shr rax,53" instead of
>>       "shr rax,52; and eax,0x7ff" and save 2 bytes.
> 
> Indeed, I don't have the exact size of instructions in my head,
> especially since I've not written x86 assembly since the mid 90s.
> 
> In any case, with your last improvement, the code is now down to a
> single 32 bit immediate constant. And I don't see how to eliminate it...
> 
>> 
>> Complete properly optimized code for __builtin_trunc is then as follows
>> (11 instructions, 44 bytes):
>> 
>> .code64
>> .intel_syntax
>> .equ    BIAS, 1023
>> .text
>>         movq    rax, xmm0    # rax = argument
>>         add     rax, rax
>>         shr     rax, 53      # rax = exponent of |argument|
>>         cmp     eax, BIAS + 53
>>         jae     .Lexit       # argument indefinite?
> 
> Maybe s/.Lexit/.L0/

Surely!

>>                              # |argument| >= 0x1.0p53?
>>         cvttsd2si rax, xmm0  # rax = trunc(argument)
>>         cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>>         psrlq   xmm0, 63
>>         psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>>         orpd    xmm0, xmm1   # xmm0 = trunc(argument)
>> .L0:    ret
>> .end
>> 
> 
> This looks nice.

Let's see how to convince GCC to generate such code sequences...

Stefan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-06 14:37           ` Stefan Kanthak
@ 2021-08-06 17:44             ` Joseph Myers
  2021-08-07 12:32               ` Stefan Kanthak
  0 siblings, 1 reply; 19+ messages in thread
From: Joseph Myers @ 2021-08-06 17:44 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Gabriel Paubert, gcc

On Fri, 6 Aug 2021, Stefan Kanthak wrote:

> > I don't know what the standard says about NaNs in this case, I seem to
> > remember that arithmetic instructions typically produce QNaN when one of
> > the inputs is a NaN, whether signaling or not. 
> 
> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/trunc.html>
> and its cousins as well as the C standard say
> 
> | If x is NaN, a NaN shall be returned.
> 
> That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.

You should be looking at TS 18661-3 / C2x Annex F for sNaN handling; the 
POSIX attempts to deal with signaling NaNs aren't well thought out.  (And 
possibly adding flag_signaling_nans conditions as appropriate to disable 
for -fsignaling-nans anything for these to-integer operations that doesn't 
produce a qNaN with INVALID raised from sNaN input.)  Though in C2x mode, 
these SSE2 code sequences won't be used by default anyway, except for rint 
(C2x implies -fno-fp-int-builtin-inexact).

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-06 17:44             ` Joseph Myers
@ 2021-08-07 12:32               ` Stefan Kanthak
  2021-08-08 22:58                 ` Vincent Lefevre
  2021-08-09 17:19                 ` Joseph Myers
  0 siblings, 2 replies; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-07 12:32 UTC (permalink / raw)
  To: Joseph Myers; +Cc: Gabriel Paubert, gcc

Joseph Myers <joseph@codesourcery.com> wrote:


> On Fri, 6 Aug 2021, Stefan Kanthak wrote:

PLEASE DON'T STRIP ATTRIBUTION LINES: I did not write the following paragraph!

>> > I don't know what the standard says about NaNs in this case, I seem to
>> > remember that arithmetic instructions typically produce QNaN when one of
>> > the inputs is a NaN, whether signaling or not. 
>> 
>> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/trunc.html>
>> and its cousins as well as the C standard say
>> 
>> | If x is NaN, a NaN shall be returned.
>> 
>> That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.
> 
> You should be looking at TS 18661-3 / C2x Annex F for sNaN handling;

I'll do so as soon as GCC drops support for all C dialects before C2x!

Unless you use a time machine and fix the POSIX and ISO C standards
written in the past you CAN'T neglect all software written before C2x
modified sNaN handling that relies on the documented behaviour at the
time it was written.

> the POSIX attempts to deal with signaling NaNs aren't well thought out.

[...]

> Though in C2x mode, these SSE2 code sequences won't be used by default
> anyway, except for rint (C2x implies -fno-fp-int-builtin-inexact).

regards
Stefan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-07 12:32               ` Stefan Kanthak
@ 2021-08-08 22:58                 ` Vincent Lefevre
  2021-08-09 17:19                 ` Joseph Myers
  1 sibling, 0 replies; 19+ messages in thread
From: Vincent Lefevre @ 2021-08-08 22:58 UTC (permalink / raw)
  To: gcc

On 2021-08-07 14:32:32 +0200, Stefan Kanthak wrote:
> Joseph Myers <joseph@codesourcery.com> wrote:
> > On Fri, 6 Aug 2021, Stefan Kanthak wrote:
> 
> PLEASE DON'T STRIP ATTRIBUTION LINES: I did not write the following paragraph!
> 
> >> > I don't know what the standard says about NaNs in this case, I seem to
> >> > remember that arithmetic instructions typically produce QNaN when one of
> >> > the inputs is a NaN, whether signaling or not. 
> >> 
> >> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/trunc.html>
> >> and its cousins as well as the C standard say
> >> 
> >> | If x is NaN, a NaN shall be returned.
> >> 
> >> That's why I mentioned that the code GCC generates also doesn't
> >> quiet SNaNs.
> > 
> > You should be looking at TS 18661-3 / C2x Annex F for sNaN handling;
> 
> I'll do so as soon as GCC drops support for all C dialects before C2x!
> 
> Unless you use a time machine and fix the POSIX and ISO C standards
> written in the past you CAN'T neglect all software written before C2x
> modified sNaN handling that relies on the documented behaviour at the
> time it was written.

Before C2x:

  This specification does not define the behavior of signaling NaNs.365)
  It generally uses the term NaN to denote quiet NaNs.

  365) Since NaNs created by IEC 60559 operations are always quiet,
  quiet NaNs (along with infinities) are sufficient for closure of
  the arithmetic.

(in Annex F).

You can't create signaling NaNs with C operations, but you may get
them when reading data from memory. So, IMHO, they should really be
supported in practice, at least in some sense. I would expect that
when a sNaN occurs as an input, it is handled either like a sNaN
(see IEC 60559 / IEEE 754) or like a qNaN. Propagating the
signaling status (forbidden in IEEE 754 for almost all operations)
could be acceptable (this means that an implementation may ignore
whether a NaN is quiet or signaling), but should be avoided.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-07 12:32               ` Stefan Kanthak
  2021-08-08 22:58                 ` Vincent Lefevre
@ 2021-08-09 17:19                 ` Joseph Myers
  1 sibling, 0 replies; 19+ messages in thread
From: Joseph Myers @ 2021-08-09 17:19 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: gcc

On Sat, 7 Aug 2021, Stefan Kanthak wrote:

> Joseph Myers <joseph@codesourcery.com> wrote:
> > You should be looking at TS 18661-3 / C2x Annex F for sNaN handling;
> 
> I'll do so as soon as GCC drops support for all C dialects before C2x!
> 
> Unless you use a time machine and fix the POSIX and ISO C standards
> written in the past you CAN'T neglect all software written before C2x
> modified sNaN handling that relies on the documented behaviour at the
> time it was written.

Pre-C2x versions of C don't cover signaling NaNs at all; they use "NaN" to 
mean "quiet NaN" (so signaling NaNs are trap representations).  Software 
written before C2x thus can't rely on any particular sNaN handling.

The POSIX description of signaling NaNs ("On implementations that support 
the IEC 60559:1989 standard floating point, functions with signaling NaN 
argument(s) shall be treated as if the function were called with an 
argument that is a required domain error and shall return a quiet NaN 
result, except where stated otherwise.") is consistent with C2x as regards 
trunc (sNaN) needing to return a quiet NaN with INVALID raised.  The 
problems are (a) POSIX fails to "state otherwise" for the cases (e.g. 
fabs, copysign) where a signaling NaN argument should not result in a 
quiet NaN with INVALID raised (as per IEEE semantics for those operations) 
and (b) the POSIX rule about setting errno to EDOM when (math_errhandling 
& MATH_ERRNO) is nonzero is inappropriate for sNaN arguments (incompatible 
with the normal approach of generating INVALID and a quiet NaN by passing 
NaN arguments through arithmetic) and the C2x approach of being 
implementation-defined whether an sNaN input is a domain error is more 
appropriate.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-06 12:43       ` Stefan Kanthak
  2021-08-06 12:59         ` Richard Biener
  2021-08-06 13:20         ` Gabriel Paubert
@ 2021-08-06 13:31         ` Michael Matz
  2021-08-06 14:32           ` Stefan Kanthak
  2 siblings, 1 reply; 19+ messages in thread
From: Michael Matz @ 2021-08-06 13:31 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Gabriel Paubert, gcc

Hello,

On Fri, 6 Aug 2021, Stefan Kanthak wrote:

> For -ffast-math, where the sign of -0.0 is not handled and the spurios
> invalid floating-point exception for |argument| >= 2**63 is acceptable,

This claim would need to be proven in the wild.  |argument| > 2**52 are 
already integer, and shouldn't generate a spurious exception from the 
various to-int conversions, not even in fast-math mode for some relevant 
set of applications (at least SPECcpu).

Btw, have you made speed measurements with your improvements?  The size 
improvements are obvious, but speed changes can be fairly unintuitive, 
e.g. there were old K8 CPUs where the memory loads for constants are 
actually faster than the equivalent sequence of shifting and masking for 
the >= compares.  That's an irrelevant CPU now, but it shows that 
intuition about speed consequences can be wrong.

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-06 13:31         ` Michael Matz
@ 2021-08-06 14:32           ` Stefan Kanthak
  2021-08-06 15:04             ` Michael Matz
  2021-08-06 15:16             ` Richard Biener
  0 siblings, 2 replies; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-06 14:32 UTC (permalink / raw)
  To: Michael Matz; +Cc: Gabriel Paubert, gcc

Michael Matz <matz@suse.de> wrote:


> Hello,
> 
> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
> 
>> For -ffast-math, where the sign of -0.0 is not handled and the spurios
>> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> 
> This claim would need to be proven in the wild.

I should have left the "when" after the "and" which I originally had
written...

> |argument| > 2**52 are already integer, and shouldn't generate a spurious
> exception from the various to-int conversions, not even in fast-math mode
> for some relevant set of applications (at least SPECcpu).
> 
> Btw, have you made speed measurements with your improvements?

No.

> The size improvements are obvious, but speed changes can be fairly
> unintuitive, e.g. there were old K8 CPUs where the memory loads for
> constants are actually faster than the equivalent sequence of shifting
> and masking for the >= compares.  That's an irrelevant CPU now, but it
> shows that intuition about speed consequences can be wrong.

I know. I also know of CPUs that can't load a 16-byte wide XMM register
in one go, but had to split the load into 2 8-byte loads.

If the constant happens to be present in L1 cache, it MAY load as fast
as an immediate.
BUT: on current CPUs, the code GCC generates

        movsd  .LC1(%rip), %xmm2
        movsd  .LC0(%rip), %xmm4
        movapd %xmm0, %xmm3
        movapd %xmm0, %xmm1
        andpd  %xmm2, %xmm3
        ucomisd %xmm3, %xmm4
        jbe    38 <_trunc+0x38>
 
needs
- 4 cycles if the movsd are executed in parallel and the movapd are
  handled by the register renamer,
- 5 cycles if the movsd and the movapd are executed in parallel,
- 7 cycles else,
plus an unknown number of cycles if the constants are not in L1.
The proposed

        movq   rax, xmm0
        add    rax, rax
        shr    rax, 53
        cmp    eax, 53+1023
        jae    return

needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
register renamer).

Stefan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-06 14:32           ` Stefan Kanthak
@ 2021-08-06 15:04             ` Michael Matz
  2021-08-06 15:16             ` Richard Biener
  1 sibling, 0 replies; 19+ messages in thread
From: Michael Matz @ 2021-08-06 15:04 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Gabriel Paubert, gcc

Hello,

On Fri, 6 Aug 2021, Stefan Kanthak wrote:

> >> For -ffast-math, where the sign of -0.0 is not handled and the 
> >> spurios invalid floating-point exception for |argument| >= 2**63 is 
> >> acceptable,
> > 
> > This claim would need to be proven in the wild.
> 
> I should have left the "when" after the "and" which I originally had 
> written...
> 
> > |argument| > 2**52 are already integer, and shouldn't generate a 
> > spurious exception from the various to-int conversions, not even in 
> > fast-math mode for some relevant set of applications (at least 
> > SPECcpu).
> > 
> > Btw, have you made speed measurements with your improvements?
> 
> No.
> 
> > The size improvements are obvious, but speed changes can be fairly 
> > unintuitive, e.g. there were old K8 CPUs where the memory loads for 
> > constants are actually faster than the equivalent sequence of shifting 
> > and masking for the >= compares.  That's an irrelevant CPU now, but it 
> > shows that intuition about speed consequences can be wrong.
> 
> I know. I also know of CPUs that can't load a 16-byte wide XMM register 
> in one go, but had to split the load into 2 8-byte loads.
> 
> If the constant happens to be present in L1 cache, it MAY load as fast
> as an immediate.
> BUT: on current CPUs, the code GCC generates
> 
>         movsd  .LC1(%rip), %xmm2
>         movsd  .LC0(%rip), %xmm4
>         movapd %xmm0, %xmm3
>         movapd %xmm0, %xmm1
>         andpd  %xmm2, %xmm3
>         ucomisd %xmm3, %xmm4
>         jbe    38 <_trunc+0x38>
>  
> needs
> - 4 cycles if the movsd are executed in parallel and the movapd are
>   handled by the register renamer,
> - 5 cycles if the movsd and the movapd are executed in parallel,
> - 7 cycles else,
> plus an unknown number of cycles if the constants are not in L1.

You also need to consider the case that the to-int converters are called 
in a loop (which ultimately are the only interesting cases for 
performance), where it's possible to load the constants before the loop 
and keep them in registers (at the expense of two register pressure of 
course) effectively removing the loads from cost considerations.  It's all 
tough choices, which is why stuff needs to be measured in some contexts 
:-)

(I do like your sequences btw, it's just not 100% clearcut that they are 
always a speed improvement).


Ciao,
Michael.

> The proposed
> 
>         movq   rax, xmm0
>         add    rax, rax
>         shr    rax, 53
>         cmp    eax, 53+1023
>         jae    return
> 
> needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
> register renamer).
> 
> Stefan
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-06 14:32           ` Stefan Kanthak
  2021-08-06 15:04             ` Michael Matz
@ 2021-08-06 15:16             ` Richard Biener
  2021-08-06 16:57               ` Stefan Kanthak
  1 sibling, 1 reply; 19+ messages in thread
From: Richard Biener @ 2021-08-06 15:16 UTC (permalink / raw)
  To: gcc, Stefan Kanthak, Michael Matz

On August 6, 2021 4:32:48 PM GMT+02:00, Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>Michael Matz <matz@suse.de> wrote:
>
>
>> Hello,
>> 
>> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
>> 
>>> For -ffast-math, where the sign of -0.0 is not handled and the spurios
>>> invalid floating-point exception for |argument| >= 2**63 is acceptable,
>> 
>> This claim would need to be proven in the wild.
>
>I should have left the "when" after the "and" which I originally had
>written...
>
>> |argument| > 2**52 are already integer, and shouldn't generate a spurious
>> exception from the various to-int conversions, not even in fast-math mode
>> for some relevant set of applications (at least SPECcpu).
>> 
>> Btw, have you made speed measurements with your improvements?
>
>No.
>
>> The size improvements are obvious, but speed changes can be fairly
>> unintuitive, e.g. there were old K8 CPUs where the memory loads for
>> constants are actually faster than the equivalent sequence of shifting
>> and masking for the >= compares.  That's an irrelevant CPU now, but it
>> shows that intuition about speed consequences can be wrong.
>
>I know. I also know of CPUs that can't load a 16-byte wide XMM register
>in one go, but had to split the load into 2 8-byte loads.
>
>If the constant happens to be present in L1 cache, it MAY load as fast
>as an immediate.
>BUT: on current CPUs, the code GCC generates
>
>        movsd  .LC1(%rip), %xmm2
>        movsd  .LC0(%rip), %xmm4
>        movapd %xmm0, %xmm3
>        movapd %xmm0, %xmm1
>        andpd  %xmm2, %xmm3
>        ucomisd %xmm3, %xmm4
>        jbe    38 <_trunc+0x38>
> 
>needs
>- 4 cycles if the movsd are executed in parallel and the movapd are
>  handled by the register renamer,
>- 5 cycles if the movsd and the movapd are executed in parallel,
>- 7 cycles else,
>plus an unknown number of cycles if the constants are not in L1.
>The proposed
>
>        movq   rax, xmm0

The xmm to GPR move costs you an extra cycle in latency. Shifts also tend to be port constrained. The original sequences are also somewhat straight forward to vectorize. 

>        add    rax, rax
>        shr    rax, 53
>        cmp    eax, 53+1023
>        jae    return
>
>needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
>register renamer).
>
>Stefan


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-06 15:16             ` Richard Biener
@ 2021-08-06 16:57               ` Stefan Kanthak
  0 siblings, 0 replies; 19+ messages in thread
From: Stefan Kanthak @ 2021-08-06 16:57 UTC (permalink / raw)
  To: Richard Biener, gcc, Michael Matz

Richard Biener <richard.guenther@gmail.com> wrote:

> On August 6, 2021 4:32:48 PM GMT+02:00, Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>Michael Matz <matz@suse.de> wrote:

>>> Btw, have you made speed measurements with your improvements?
>>
>>No.
[...]
>>If the constant happens to be present in L1 cache, it MAY load as fast
>>as an immediate.
>>BUT: on current CPUs, the code GCC generates
>>
>>        movsd  .LC1(%rip), %xmm2
>>        movsd  .LC0(%rip), %xmm4
>>        movapd %xmm0, %xmm3
>>        movapd %xmm0, %xmm1
>>        andpd  %xmm2, %xmm3
>>        ucomisd %xmm3, %xmm4
>>        jbe    38 <_trunc+0x38>
>> 
>>needs
>>- 4 cycles if the movsd are executed in parallel and the movapd are
>>  handled by the register renamer,
>>- 5 cycles if the movsd and the movapd are executed in parallel,
>>- 7 cycles else,
>>plus an unknown number of cycles if the constants are not in L1.
>>The proposed
>>
>>        movq   rax, xmm0
>
> The xmm to GPR move costs you an extra cycle in latency. Shifts also
> tend to be port constrained. The original sequences are also somewhat
> straight forward to vectorize. 

Please show how GCC vectorizes CVT[T]SD2SI and CVTSI2SD!
These are the bottlenecks in the current code.

If you want the code for trunc() and cousins to be vectorizable you
should stay with the alternative code I presented some posts before,
which GCC should be (able to) generate from its other procedural
variant.

Stefan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
  2021-08-05  9:42 ` Gabriel Paubert
  2021-08-05 10:19   ` Richard Biener
  2021-08-05 11:58   ` Stefan Kanthak
@ 2021-08-05 13:18   ` Gabriel Ravier
  2 siblings, 0 replies; 19+ messages in thread
From: Gabriel Ravier @ 2021-08-05 13:18 UTC (permalink / raw)
  To: Gabriel Paubert, Stefan Kanthak; +Cc: gcc


On 8/5/21 11:42 AM, Gabriel Paubert wrote:
> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>> Hi,
>>
>> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
>> following code (13 instructions using 57 bytes, plus 4 quadwords
>> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
>>
>>                                  .text
>>     0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
>>                          4: R_X86_64_PC32        .rdata
>>     8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
>>                          c: R_X86_64_PC32        .rdata
>>    10:   66 0f 28 d8             movapd %xmm0, %xmm3
>>    14:   66 0f 28 c8             movapd %xmm0, %xmm1
>>    18:   66 0f 54 da             andpd  %xmm2, %xmm3
>>    1c:   66 0f 2e e3             ucomisd %xmm3, %xmm4
>>    20:   76 16                   jbe    38 <_trunc+0x38>
>>    22:   f2 48 0f 2c c0          cvttsd2si %xmm0, %rax
>>    27:   66 0f ef c0             pxor   %xmm0, %xmm0
>>    2b:   66 0f 55 d1             andnpd %xmm1, %xmm2
>>    2f:   f2 48 0f 2a c0          cvtsi2sd %rax, %xmm0
>>    34:   66 0f 56 c2             orpd   %xmm2, %xmm0
>>    38:   c3                      retq
>>
>>                                  .rdata
>>                                  .align 8
>>     0:   00 00 00 00     .LC0:   .quad  0x1.0p52
>>          00 00 30 43
>>          00 00 00 00
>>          00 00 00 00
>>                                  .align 16
>>    10:   ff ff ff ff     .LC1:   .quad  ~(-0.0)
>>          ff ff ff 7f
>>    18:   00 00 00 00             .quad  0.0
>>          00 00 00 00
>>                                  .end
>>
>> JFTR: in the best case, the memory accesses cost several cycles,
>>        while in the worst case they yield a page fault!
>>
>>
>> Properly optimized, shorter and faster code, using but only 9 instructions
>> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
>> and saving at least 16 + 32 bytes, follows:
>>
>>                                .intel_syntax
>>                                .text
>>     0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
>>     5:   48 f7 d8              neg     rax
>>                          #     jz      .L0          # argument zero?
>>     8:   70 16                 jo      .L0          # argument indefinite?
>>                                                     # argument overflows 64-bit integer?
>>     a:   48 f7 d8              neg     rax
>>     d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>>    12:   66 0f 73 d0 3f        psrlq   xmm0, 63
>>    17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>>    1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
>>    20:   c3              .L0:  ret
>>                                .end
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer.  So
> using your code may require some option (-fast-math comes to mind), or
> you need at least a check on the exponent before cvttsd2si.
>
> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).
`-fno-signed-zeros` does that, if you need it
>
> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
>
> Apart from this, in your code, after cvttsd2si I'd rather use:
> 	mov rcx,rax # make a second copy to a scratch register
> 	neg rcx
> 	jo .L0
> 	cvtsi2sd xmm1,rax
>
> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
>
> With your patch:
>
> cvttsd2si-->neg-?->neg-->cvtsi2sd
>                
> where the ? means that the following instructions are speculated.
>
> With an auxiliary register there are two dependency chains:
>
> cvttsd2si-?->cvtsi2sd
>           |->mov->neg->jump
>
> Actually some OoO cores just eliminate register copies using register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.
>
> 	Regards,
> 	Gabriel
>
>
>
>> regards
>> Stefan
>   
>
-- 
_________________________
Gabriel RAVIER
First year student at Epitech
+33 6 36 46 16 43
gabriel.ravier@epitech.eu
11 Quai Finkwiller
67000 STRASBOURG


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-08-09 17:19 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-05  7:25 Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1 Stefan Kanthak
2021-08-05  9:42 ` Gabriel Paubert
2021-08-05 10:19   ` Richard Biener
2021-08-05 11:58   ` Stefan Kanthak
2021-08-05 13:59     ` Gabriel Paubert
2021-08-06 12:43       ` Stefan Kanthak
2021-08-06 12:59         ` Richard Biener
2021-08-06 13:20         ` Gabriel Paubert
2021-08-06 14:37           ` Stefan Kanthak
2021-08-06 17:44             ` Joseph Myers
2021-08-07 12:32               ` Stefan Kanthak
2021-08-08 22:58                 ` Vincent Lefevre
2021-08-09 17:19                 ` Joseph Myers
2021-08-06 13:31         ` Michael Matz
2021-08-06 14:32           ` Stefan Kanthak
2021-08-06 15:04             ` Michael Matz
2021-08-06 15:16             ` Richard Biener
2021-08-06 16:57               ` Stefan Kanthak
2021-08-05 13:18   ` Gabriel Ravier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).