From: Richard Biener <richard.guenther@gmail.com>
To: Gabriel Paubert <paubert@iram.es>
Cc: Stefan Kanthak <stefan.kanthak@nexgo.de>,
GCC Development <gcc@gcc.gnu.org>
Subject: Re: Suboptimal code generated for __buitlin_trunc on AMD64 without SS4_4.1
Date: Thu, 5 Aug 2021 12:19:41 +0200 [thread overview]
Message-ID: <CAFiYyc3t58ohnRw+tEn8epkttj7vG-yH8cxnUzXKdbD8a2CrdQ@mail.gmail.com> (raw)
In-Reply-To: <20210805094243.GA340@lt-gp.iram.es>
On Thu, Aug 5, 2021 at 11:44 AM Gabriel Paubert <paubert@iram.es> wrote:
>
> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> > Hi,
> >
> > targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> > following code (13 instructions using 57 bytes, plus 4 quadwords
> > using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> >
> > .text
> > 0: f2 0f 10 15 10 00 00 00 movsd .LC1(%rip), %xmm2
> > 4: R_X86_64_PC32 .rdata
> > 8: f2 0f 10 25 00 00 00 00 movsd .LC0(%rip), %xmm4
> > c: R_X86_64_PC32 .rdata
> > 10: 66 0f 28 d8 movapd %xmm0, %xmm3
> > 14: 66 0f 28 c8 movapd %xmm0, %xmm1
> > 18: 66 0f 54 da andpd %xmm2, %xmm3
> > 1c: 66 0f 2e e3 ucomisd %xmm3, %xmm4
> > 20: 76 16 jbe 38 <_trunc+0x38>
> > 22: f2 48 0f 2c c0 cvttsd2si %xmm0, %rax
> > 27: 66 0f ef c0 pxor %xmm0, %xmm0
> > 2b: 66 0f 55 d1 andnpd %xmm1, %xmm2
> > 2f: f2 48 0f 2a c0 cvtsi2sd %rax, %xmm0
> > 34: 66 0f 56 c2 orpd %xmm2, %xmm0
> > 38: c3 retq
> >
> > .rdata
> > .align 8
> > 0: 00 00 00 00 .LC0: .quad 0x1.0p52
> > 00 00 30 43
> > 00 00 00 00
> > 00 00 00 00
> > .align 16
> > 10: ff ff ff ff .LC1: .quad ~(-0.0)
> > ff ff ff 7f
> > 18: 00 00 00 00 .quad 0.0
> > 00 00 00 00
> > .end
> >
> > JFTR: in the best case, the memory accesses cost several cycles,
> > while in the worst case they yield a page fault!
> >
> >
> > Properly optimized, shorter and faster code, using but only 9 instructions
> > in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
> > and saving at least 16 + 32 bytes, follows:
> >
> > .intel_syntax
> > .text
> > 0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument)
> > 5: 48 f7 d8 neg rax
> > # jz .L0 # argument zero?
> > 8: 70 16 jo .L0 # argument indefinite?
> > # argument overflows 64-bit integer?
> > a: 48 f7 d8 neg rax
> > d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
> > 12: 66 0f 73 d0 3f psrlq xmm0, 63
> > 17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> > 1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
> > 20: c3 .L0: ret
> > .end
>
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer. So
> using your code may require some option (-fast-math comes to mind), or
> you need at least a check on the exponent before cvttsd2si.
>
> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).
>
> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
>
> Apart from this, in your code, after cvttsd2si I'd rather use:
> mov rcx,rax # make a second copy to a scratch register
> neg rcx
> jo .L0
> cvtsi2sd xmm1,rax
>
> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
>
> With your patch:
>
> cvttsd2si-->neg-?->neg-->cvtsi2sd
>
> where the ? means that the following instructions are speculated.
>
> With an auxiliary register there are two dependency chains:
>
> cvttsd2si-?->cvtsi2sd
> |->mov->neg->jump
>
> Actually some OoO cores just eliminate register copies using register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.
Btw, the code to emit these sequences is in
gcc/config/i386/i386-expand.c:ix86_expand_trunc
and friends.
Richard.
> Regards,
> Gabriel
>
>
>
> >
> > regards
> > Stefan
>
>
next prev parent reply other threads:[~2021-08-05 10:19 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-08-05 7:25 Stefan Kanthak
2021-08-05 9:42 ` Gabriel Paubert
2021-08-05 10:19 ` Richard Biener [this message]
2021-08-05 11:58 ` Stefan Kanthak
2021-08-05 13:59 ` Gabriel Paubert
2021-08-06 12:43 ` Stefan Kanthak
2021-08-06 12:59 ` Richard Biener
2021-08-06 13:20 ` Gabriel Paubert
2021-08-06 14:37 ` Stefan Kanthak
2021-08-06 17:44 ` Joseph Myers
2021-08-07 12:32 ` Stefan Kanthak
2021-08-08 22:58 ` Vincent Lefevre
2021-08-09 17:19 ` Joseph Myers
2021-08-06 13:31 ` Michael Matz
2021-08-06 14:32 ` Stefan Kanthak
2021-08-06 15:04 ` Michael Matz
2021-08-06 15:16 ` Richard Biener
2021-08-06 16:57 ` Stefan Kanthak
2021-08-05 13:18 ` Gabriel Ravier
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAFiYyc3t58ohnRw+tEn8epkttj7vG-yH8cxnUzXKdbD8a2CrdQ@mail.gmail.com \
--to=richard.guenther@gmail.com \
--cc=gcc@gcc.gnu.org \
--cc=paubert@iram.es \
--cc=stefan.kanthak@nexgo.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).