From: "Stéphane Glondu" <stephane.glondu@inria.fr>
To: Alexander Monakov <amonakov@ispras.ru>, gcc-help@gcc.gnu.org
Cc: sibid@uvic.ca, Paul Zimmermann <Paul.Zimmermann@inria.fr>
Subject: Re: slowdown with -std=gnu18 with respect to -std=c99
Date: Thu, 5 May 2022 10:57:07 +0200 [thread overview]
Message-ID: <c8517377-e695-06fe-0be4-b7e409d471b9@inria.fr> (raw)
In-Reply-To: <9f7e3aa9-8d46-1fbb-75b-1c8ad9a667f@ispras.ru>
On 03/05/2022 at 11:09, Alexander Monakov wrote:
>> Does anyone have a clue?
>
> I can reproduce a difference, but in my case it's simply because in -std=gnuXX
> mode (as opposed to -std=cXX) GCC enables FMA contraction, enabling the last few
> steps in the benchmarked function to use fma instead of separate mul/add
> instructions.
>
> (regarding __builtin_expect, it also makes a small difference in my case,
> it seems GCC generates some redundant code without it, but the difference is
> 10x smaller than what presence/absence of FMA gives)
>
> I think you might be able to figure it out on your end if you run both variants
> under 'perf stat', note how cycle count and instruction counts change, and then
> look at disassembly to see what changed. You can use 'perf record' and 'perf
> report' to easily see the hot code path; if you do that, I'd recommend to run
> it with the same sampling period in both cases, e.g. like this:
>
> perf record -e instructions:P -c 500000 ./perf ...
I did that. The hot code path corresponds to (from exp10f.c):
double a = iln2h*z, ia = __builtin_floor(a), h = (a - ia) + iln2l*z;
long i = ia, j = i&0xf, e = i - j;
e >>= 4;
double s = tb[j];
b64u64_u su = {.u = (e + 0x3fful)<<52};
s *= su.f;
double h2 = h*h;
double c0 = c[0] + h*c[1];
double c2 = c[2] + h*c[3];
double c4 = c[4] + h*c[5];
c0 += h2*(c2 + h2*c4);
double w = s*h;
return s + w*c0;
With -std=c99, where the overall performance is 22 cycles, I get:
4,03 │ 3a: vcvtss2sd -0x4(%rsp),%xmm0,%xmm0
0,01 │ vmulsd ir.4+0x38,%xmm0,%xmm1
│ vmulsd ir.4+0x40,%xmm0,%xmm0
0,01 │ lea tb.1,%rdx
3,06 │ vroundsd $0x9,%xmm1,%xmm1,%xmm2
0,03 │ vsubsd %xmm2,%xmm1,%xmm1
10,42 │ vcvttsd2si %xmm2,%rax
0,01 │ vaddsd %xmm0,%xmm1,%xmm1
│ mov %rax,%rcx
0,02 │ vmulsd ir.4+0x58,%xmm1,%xmm0
0,38 │ vmulsd %xmm1,%xmm1,%xmm5
0,00 │ vmulsd ir.4+0x68,%xmm1,%xmm4
│ sar $0x4,%rax
0,00 │ add $0x3ff,%rax
1,17 │ vaddsd ir.4+0x60,%xmm0,%xmm0
│ shl $0x34,%rax
0,02 │ vaddsd ir.4+0x70,%xmm4,%xmm4
0,10 │ vmulsd %xmm5,%xmm0,%xmm0
0,85 │ vmulsd ir.4+0x48,%xmm1,%xmm3
0,00 │ and $0xf,%ecx
│ vmovq %rax,%xmm6
1,20 │ vmulsd (%rdx,%rcx,8),%xmm6,%xmm2
0,65 │ vaddsd %xmm4,%xmm0,%xmm0
0,00 │ vaddsd ir.4+0x50,%xmm3,%xmm3
3,49 │ vmulsd %xmm5,%xmm0,%xmm0
15,59 │ vmulsd %xmm2,%xmm1,%xmm1
4,61 │ vaddsd %xmm3,%xmm0,%xmm0
10,24 │ vmulsd %xmm1,%xmm0,%xmm0
11,31 │ vaddsd %xmm2,%xmm0,%xmm0
23,21 │ vcvtsd2ss %xmm0,%xmm0,%xmm0
0,00 │ ← ret
With -std=gnu18, where the overall performance is 36 cycles, I get:
0,02 │ 3a: vcvtss2sd -0x4(%rsp),%xmm1,%xmm1
0,01 │ vmulsd ir.4+0x40,%xmm1,%xmm0
│ vmovsd ir.4+0x60,%xmm5
│ vmovsd ir.4+0x50,%xmm4
│ lea tb.1,%rdx
0,13 │ vroundsd $0x9,%xmm0,%xmm0,%xmm2
0,83 │ vsubsd %xmm2,%xmm0,%xmm0
28,99 │ vcvttsd2si %xmm2,%rax
63,49 │ vfmadd132sd 0x961(%rip),%xmm0,%xmm1
│ vmovsd ir.4+0x70,%xmm0
│ mov %rax,%rcx
│ sar $0x4,%rax
2,73 │ add $0x3ff,%rax
1,99 │ vmulsd %xmm1,%xmm1,%xmm3
0,00 │ vfmadd213sd 0x95f(%rip),%xmm1,%xmm5
0,00 │ vfmadd213sd 0x966(%rip),%xmm1,%xmm0
│ shl $0x34,%rax
│ and $0xf,%ecx
│ vmovq %rax,%xmm6
0,17 │ vmulsd (%rdx,%rcx,8),%xmm6,%xmm2
│ vfmadd213sd 0x92c(%rip),%xmm1,%xmm4
0,04 │ vfmadd132sd %xmm3,%xmm5,%xmm0
0,64 │ vmulsd %xmm2,%xmm1,%xmm1
0,01 │ vfmadd132sd %xmm3,%xmm4,%xmm0
0,46 │ vfmadd132sd %xmm1,%xmm2,%xmm0
0,27 │ vcvtsd2ss %xmm0,%xmm0,%xmm0
│ ← ret
The distribution of time is very different between the two cases: in the
first case, most of the time is spent at the end (computing w and the
return value, I suppose), whereas in the second case, most of the time is
spent in the first multiply-and-add (computing h). I do not understand
this change of behaviour.
Cheers,
--
Stéphane
Thread overview: 13+ messages
2022-05-03 8:28 Paul Zimmermann
2022-05-03 9:09 ` Alexander Monakov
2022-05-03 11:45 ` Paul Zimmermann
2022-05-03 12:12 ` Alexander Monakov
2022-05-05 8:57 ` Stéphane Glondu [this message]
2022-05-05 14:31 ` Stéphane Glondu
2022-05-05 14:41 ` Marc Glisse
2022-05-05 14:56 ` Alexander Monakov
2022-05-06 7:46 ` Paul Zimmermann
2022-05-06 9:27 ` Alexander Monakov
2022-05-07 6:11 ` Paul Zimmermann
2022-05-11 13:26 ` Alexander Monakov
2022-05-05 17:50 ` Paul Zimmermann