From: Gabriel Paubert <paubert@iram.es>
To: Stefan Kanthak <stefan.kanthak@nexgo.de>
Cc: gcc@gcc.gnu.org
Subject: Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
Date: Thu, 5 Aug 2021 15:59:08 +0200 [thread overview]
Message-ID: <20210805135908.GA10666@lt-gp.iram.es> (raw)
In-Reply-To: <4C9973413BE546A0A847447291A2977F@H270>
Hi,
On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert <paubert@iram.es> wrote:
>
>
> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> >> Hi,
> >>
> >> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> >> following code (13 instructions using 57 bytes, plus 4 quadwords
> >> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> >>
> >> .text
> >> 0: f2 0f 10 15 10 00 00 00 movsd .LC1(%rip), %xmm2
> >> 4: R_X86_64_PC32 .rdata
> >> 8: f2 0f 10 25 00 00 00 00 movsd .LC0(%rip), %xmm4
> >> c: R_X86_64_PC32 .rdata
> >> 10: 66 0f 28 d8 movapd %xmm0, %xmm3
> >> 14: 66 0f 28 c8 movapd %xmm0, %xmm1
> >> 18: 66 0f 54 da andpd %xmm2, %xmm3
> >> 1c: 66 0f 2e e3 ucomisd %xmm3, %xmm4
> >> 20: 76 16 jbe 38 <_trunc+0x38>
> >> 22: f2 48 0f 2c c0 cvttsd2si %xmm0, %rax
> >> 27: 66 0f ef c0 pxor %xmm0, %xmm0
> >> 2b: 66 0f 55 d1 andnpd %xmm1, %xmm2
> >> 2f: f2 48 0f 2a c0 cvtsi2sd %rax, %xmm0
> >> 34: 66 0f 56 c2 orpd %xmm2, %xmm0
> >> 38: c3 retq
> >>
> >> .rdata
> >> .align 8
> >> 0: 00 00 00 00 .LC0: .quad 0x1.0p52
> >> 00 00 30 43
> >> 00 00 00 00
> >> 00 00 00 00
> >> .align 16
> >> 10: ff ff ff ff .LC1: .quad ~(-0.0)
> >> ff ff ff 7f
> >> 18: 00 00 00 00 .quad 0.0
> >> 00 00 00 00
> >> .end
> >>
> >> JFTR: in the best case, the memory accesses cost several cycles,
> >> while in the worst case they yield a page fault!
> >>
> >>
> >> Properly optimized, shorter and faster code, using but only 9 instructions
> >> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
> >> and saving at least 16 + 32 bytes, follows:
> >>
> >> .intel_syntax
> >> .text
> >> 0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument)
> >> 5: 48 f7 d8 neg rax
> >> # jz .L0 # argument zero?
> >> 8: 70 16 jo .L0 # argument indefinite?
> >> # argument overflows 64-bit integer?
> >> a: 48 f7 d8 neg rax
> >> d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
> >> 12: 66 0f 73 d0 3f psrlq xmm0, 63
> >> 17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> 1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> >> 20: c3 .L0: ret
> >> .end
> >
> > There is one important difference, namely setting the invalid exception
> > flag when the parameter can't be represented in a signed integer.
>
> Right, I overlooked this fault. Thanks for pointing out.
>
> > So using your code may require some option (-fast-math comes to mind),
> > or you need at least a check on the exponent before cvttsd2si.
>
> The whole idea behind these implementations is to get rid of loading
> floating-point constants to perform comparisions.
Indeed, but what I had in mind was something along the following lines:
movq rax,xmm0 # and copy rax to say rcx, if needed later
shrq rax,52 # move sign and exponent to 12 LSBs
andl eax,0x7ff # mask the sign
cmpl eax,0x434 # value to be checked
ja return # exponent too large, we're done (what about NaNs?)
cvttsd2si rax,xmm0 # safe after exponent check
cvtsi2sd xmm0,rax # conversion done
and a bit more to handle the corner cases (essentially preserve the
sign to be correct between -1 and -0.0). But the CPU can (speculatively)
start the conversions early, so the dependency chain is rather short.
I don't know if it's faster than your new code, I'm almost sure that
it's shorter. Your new code also has a fairly long dependency chain.
>
> > The last part of your code then goes to take into account the special
> > case of -0.0, which I most often don't care about (I'd like to have a
> > -fdont-split-hairs-about-the-sign-of-zero option).
>
> Preserving the sign of -0.0 is explicitly specified in the standard,
> and is cheap, as shown in my code.
>
> > Potentially generating spurious invalid operation and then carefully
> > taking into account the sign of zero does not seem very consistent.
> >
> > Apart from this, in your code, after cvttsd2si I'd rather use:
> > mov rcx,rax # make a second copy to a scratch register
> > neg rcx
> > jo .L0
> > cvtsi2sd xmm1,rax
>
> I don't know how GCC generates the code for builtins, and what kind of
> templates it uses: the second goal was to minimize register usage.
>
Ok, but on 64 bit using two GPRs would still be reasonable.
> > The reason is latency, in an OoO engine, splitting the two paths is
> > almost always a win.
> >
> > With your patch:
> >
> > cvttsd2si-->neg-?->neg-->cvtsi2sd
> >
> > where the ? means that the following instructions are speculated.
> >
> > With an auxiliary register there are two dependency chains:
> >
> > cvttsd2si-?->cvtsi2sd
> > |->mov->neg->jump
>
> Correct; see above: I expect the template(s) for builtins to give
> the register allocator some freedom to split code paths and resolve
> dependency chains.
>
> > Actually some OoO cores just eliminate register copies using register
> > renaming mechanism. But even this is probably completely irrelevant in
> > this case where the latency is dominated by the two conversion
> > instructions.
>
> Right, the conversions dominate both the original and the code I posted.
> It's easy to get rid of them, with still slightly shorter and faster
> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
> 57 + 32 = 89 bytes):
>
> .code64
> .intel_syntax noprefix
> .text
> 0: 48 b8 00 00 00 00 00 00 30 43 mov rax, 0x4330000000000000
> a: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = 0x1.0p52 = 4503599627370496.0
> f: 48 b8 00 00 00 00 00 00 f0 3f mov rax, 0x3FF0000000000000
> 19: f2 0f 10 c8 movsd xmm1, xmm0 # xmm1 = argument
> 1d: 66 0f 73 f0 01 psllq xmm0, 1
> 22: 66 0f 73 d0 01 psrlq xmm0, 1 # xmm0 = |argument|
> 27: 66 0f 73 d1 3f psrlq xmm1, 63
> 2c: 66 0f 73 f1 3f psllq xmm1, 63 # xmm1 = (argument & -0.0) ? -0.0 : +0.0
> 31: f2 0f 10 d8 movsd xmm3, xmm0
> 35: f2 0f 58 c2 addsd xmm0, xmm2 # xmm0 = |argument| + 0x1.0p52
> 39: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = |argument| - 0x1.0p52
> # = rint(|argument|)
> 3d: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = -0x1.0p0 = -1.0
Huh? I see +1.0, -1 would be 0xBFF0000000000000.
> 42: f2 0f c2 d8 01 cmpltsd xmm3, xmm0 # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
> 47: 66 0f 54 d3 andpd xmm2, xmm3 # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
> 4b: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = rint(|argument|)
> # - (|argument| < rint(|argument|)) ? 1.0 : 0.0
> # = trunc(|argument|)
> 4f: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> 53: c3 ret
> .end
>
> regards
> Stefan
Regards,
Gabriel
next prev parent reply other threads:[~2021-08-05 13:59 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-08-05 7:25 Stefan Kanthak
2021-08-05 9:42 ` Gabriel Paubert
2021-08-05 10:19 ` Richard Biener
2021-08-05 11:58 ` Stefan Kanthak
2021-08-05 13:59 ` Gabriel Paubert [this message]
2021-08-06 12:43 ` Stefan Kanthak
2021-08-06 12:59 ` Richard Biener
2021-08-06 13:20 ` Gabriel Paubert
2021-08-06 14:37 ` Stefan Kanthak
2021-08-06 17:44 ` Joseph Myers
2021-08-07 12:32 ` Stefan Kanthak
2021-08-08 22:58 ` Vincent Lefevre
2021-08-09 17:19 ` Joseph Myers
2021-08-06 13:31 ` Michael Matz
2021-08-06 14:32 ` Stefan Kanthak
2021-08-06 15:04 ` Michael Matz
2021-08-06 15:16 ` Richard Biener
2021-08-06 16:57 ` Stefan Kanthak
2021-08-05 13:18 ` Gabriel Ravier
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210805135908.GA10666@lt-gp.iram.es \
--to=paubert@iram.es \
--cc=gcc@gcc.gnu.org \
--cc=stefan.kanthak@nexgo.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).