public inbox for gcc-help@gcc.gnu.org
* slowdown with -std=gnu18 with respect to -std=c99
@ 2022-05-03  8:28 Paul Zimmermann
  2022-05-03  9:09 ` Alexander Monakov
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Zimmermann @ 2022-05-03  8:28 UTC (permalink / raw)
  To: gcc-help; +Cc: sibid, stephane.glondu

       Hi,

I observe a slowdown of some code compiled with gcc when I use -std=gnu18
instead of -std=c99.

My computer is a i5-4590, and I use gcc version 11.3.0 (Debian 11.3.0-1).

To reproduce:

$ git clone https://gitlab.inria.fr/core-math/core-math.git
$ cd core-math
$ CORE_MATH_PERF_MODE=rdtsc CFLAGS="-O3 -march=native -ffinite-math-only -std=gnu18" ./perf.sh exp10f
GNU libc version: 2.33
GNU libc release: release
31.746
11.780
$ CORE_MATH_PERF_MODE=rdtsc CFLAGS="-O3 -march=native -ffinite-math-only -std=c99" ./perf.sh exp10f
GNU libc version: 2.33
GNU libc release: release
21.514
11.751

The difference is in the first figure of each run (31.746 vs 21.514), which
gives the average number of cycles of the exp10f function from the core-math
library.

The code is very simple (a few dozen lines):

https://gitlab.inria.fr/core-math/core-math/-/blob/master/src/binary32/exp10/exp10f.c

Some more remarks:

* this slowdown does not happen on all machines; for example, it does not
  appear on an AMD EPYC 7282 with gcc version 10.2.1 (Debian 10.2.1-6).

* this slowdown disappears when I replace __builtin_expect(ex>(127+6), 0)
  by ex>(127+6) at line 45 of the code; however, that branch is never taken
  in the above experiment. A minimal sketch of the two variants is given below.
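
For reference, here is a minimal sketch of the two variants (my own
simplified reconstruction, not the actual exp10f.c code; only the
presence of the branch hint differs):

    /* variant as in the current source: the special-case branch is
       hinted as unlikely */
    static int with_hint (int ex)
    {
      if (__builtin_expect (ex > (127 + 6), 0))
        return 1;  /* special-case path, never taken in the benchmark */
      return 0;
    }

    /* variant without the hint: same test, no __builtin_expect */
    static int without_hint (int ex)
    {
      if (ex > (127 + 6))
        return 1;
      return 0;
    }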

Does anyone have a clue?

Best regards,
Paul Zimmermann

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: slowdown with -std=gnu18 with respect to -std=c99
  2022-05-03  8:28 slowdown with -std=gnu18 with respect to -std=c99 Paul Zimmermann
@ 2022-05-03  9:09 ` Alexander Monakov
  2022-05-03 11:45   ` Paul Zimmermann
  2022-05-05  8:57   ` Stéphane Glondu
  0 siblings, 2 replies; 13+ messages in thread
From: Alexander Monakov @ 2022-05-03  9:09 UTC (permalink / raw)
  To: Paul Zimmermann; +Cc: gcc-help, stephane.glondu, sibid

On Tue, 3 May 2022, Paul Zimmermann via Gcc-help wrote:

> Does anyone have a clue?

I can reproduce a difference, but in my case it's simply because in -std=gnuXX
mode (as opposed to -std=cXX) GCC enables FMA contraction, enabling the last few
steps in the benchmarked function to use fma instead of separate mul/add
instructions.
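
As a minimal illustration (my own toy example, not the core-math code;
if I remember correctly, the strict ISO modes imply -ffp-contract=off
while the gnu modes default to -ffp-contract=fast):

    /* toy.c -- compare the assembly of
         gcc -O3 -march=native -std=c99   -S toy.c
       and
         gcc -O3 -march=native -std=gnu18 -S toy.c
       In the gnu mode the expression below may be contracted into a
       single vfmadd instruction; in the strict mode it stays as
       separate vmulsd/vaddsd. */
    double mul_add (double a, double b, double c)
    {
      return a * b + c;
    }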

(regarding __builtin_expect, it also makes a small difference in my case,
it seems GCC generates some redundant code without it, but the difference is
10x smaller than what presence/absence of FMA gives)

I think you might be able to figure it out on your end if you run both variants
under 'perf stat', note how the cycle and instruction counts change, and then
look at the disassembly to see what changed. You can use 'perf record' and 'perf
report' to easily see the hot code path; if you do that, I'd recommend running
it with the same sampling period in both cases, e.g. like this:

    perf record -e instructions:P -c 500000 ./perf ...

Alexander

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: slowdown with -std=gnu18 with respect to -std=c99
  2022-05-03  9:09 ` Alexander Monakov
@ 2022-05-03 11:45   ` Paul Zimmermann
  2022-05-03 12:12     ` Alexander Monakov
  2022-05-05  8:57   ` Stéphane Glondu
  1 sibling, 1 reply; 13+ messages in thread
From: Paul Zimmermann @ 2022-05-03 11:45 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-help, stephane.glondu, sibid

thank you very much Alexander.

> Date: Tue, 3 May 2022 12:09:32 +0300 (MSK)
> From: Alexander Monakov <amonakov@ispras.ru>
> cc: gcc-help@gcc.gnu.org, stephane.glondu@inria.fr, sibid@uvic.ca
> 
> On Tue, 3 May 2022, Paul Zimmermann via Gcc-help wrote:
> 
> > Does anyone have a clue?
> 
> I can reproduce a difference, but in my case it's simply because in -std=gnuXX
> mode (as opposed to -std=cXX) GCC enables FMA contraction, enabling the last few
> steps in the benchmarked function to use fma instead of separate mul/add
> instructions.

but then shouldn't we get better (i.e. smaller) timings with -std=gnuXX than
with -std=cXX, instead of the worse timings we observe?

> (regarding __builtin_expect, it also makes a small difference in my case,
> it seems GCC generates some redundant code without it, but the difference is
> 10x smaller than what presence/absence of FMA gives)
> 
> I think you might be able to figure it out on your end if you run both variants
> under 'perf stat', note how cycle count and instruction counts change, and then
> look at disassembly to see what changed. You can use 'perf record' and 'perf
> report' to easily see the hot code path; if you do that, I'd recommend to run
> it with the same sampling period in both cases, e.g. like this:
> 
>     perf record -e instructions:P -c 500000 ./perf ...

thank you, we'll investigate that.

Best regards,
Paul

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: slowdown with -std=gnu18 with respect to -std=c99
  2022-05-03 11:45   ` Paul Zimmermann
@ 2022-05-03 12:12     ` Alexander Monakov
  0 siblings, 0 replies; 13+ messages in thread
From: Alexander Monakov @ 2022-05-03 12:12 UTC (permalink / raw)
  To: Paul Zimmermann; +Cc: gcc-help, stephane.glondu, sibid

On Tue, 3 May 2022, Paul Zimmermann via Gcc-help wrote:
> > I can reproduce a difference, but in my case it's simply because in -std=gnuXX
> > mode (as opposed to -std=cXX) GCC enables FMA contraction, enabling the last few
> > steps in the benchmarked function to use fma instead of separate mul/add
> > instructions.
> 
> but then shouldn't we get better (i.e. smaller) timings with -std=gnuXX than
> with -std=cXX, instead of the worse timings we observe?

Right, for me -std=gnuXX is faster. But for you it's slower by almost 1.5x;
that's quite a lot and should be easy to spot in a 'perf report' profile.

> >     perf record -e instructions:P -c 500000 ./perf ...
> 
> thank you, we'll investigate that.

Good luck! I'm curious what you'll find; please let me know.

Alexander

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: slowdown with -std=gnu18 with respect to -std=c99
  2022-05-03  9:09 ` Alexander Monakov
  2022-05-03 11:45   ` Paul Zimmermann
@ 2022-05-05  8:57   ` Stéphane Glondu
  2022-05-05 14:31     ` Stéphane Glondu
  1 sibling, 1 reply; 13+ messages in thread
From: Stéphane Glondu @ 2022-05-05  8:57 UTC (permalink / raw)
  To: Alexander Monakov, gcc-help; +Cc: sibid, Paul Zimmermann

Le 03/05/2022 à 11:09, Alexander Monakov a écrit :
>> Does anyone have a clue?
> 
> I can reproduce a difference, but in my case it's simply because in -std=gnuXX
> mode (as opposed to -std=cXX) GCC enables FMA contraction, enabling the last few
> steps in the benchmarked function to use fma instead of separate mul/add
> instructions.
> 
> (regarding __builtin_expect, it also makes a small difference in my case,
> it seems GCC generates some redundant code without it, but the difference is
> 10x smaller than what presence/absence of FMA gives)
> 
> I think you might be able to figure it out on your end if you run both variants
> under 'perf stat', note how cycle count and instruction counts change, and then
> look at disassembly to see what changed. You can use 'perf record' and 'perf
> report' to easily see the hot code path; if you do that, I'd recommend to run
> it with the same sampling period in both cases, e.g. like this:
> 
>     perf record -e instructions:P -c 500000 ./perf ...

I did that. The hot code path corresponds to this part of exp10f.c:

    double a = iln2h*z, ia = __builtin_floor(a), h = (a - ia) + iln2l*z;
    long i = ia, j = i&0xf, e = i - j;
    e >>= 4;
    double s = tb[j];
    b64u64_u su = {.u = (e + 0x3fful)<<52};
    s *= su.f;
    double h2 = h*h;
    double c0 = c[0] + h*c[1];
    double c2 = c[2] + h*c[3];
    double c4 = c[4] + h*c[5];
    c0 += h2*(c2 + h2*c4);
    double w = s*h;
    return s + w*c0;

With -std=c99, where the overall performance is 22 cycles, I get:

  4,03 │ 3a:   vcvtss2sd  -0x4(%rsp),%xmm0,%xmm0
  0,01 │       vmulsd     ir.4+0x38,%xmm0,%xmm1
       │       vmulsd     ir.4+0x40,%xmm0,%xmm0
  0,01 │       lea        tb.1,%rdx
  3,06 │       vroundsd   $0x9,%xmm1,%xmm1,%xmm2
  0,03 │       vsubsd     %xmm2,%xmm1,%xmm1
 10,42 │       vcvttsd2si %xmm2,%rax
  0,01 │       vaddsd     %xmm0,%xmm1,%xmm1
       │       mov        %rax,%rcx
  0,02 │       vmulsd     ir.4+0x58,%xmm1,%xmm0
  0,38 │       vmulsd     %xmm1,%xmm1,%xmm5
  0,00 │       vmulsd     ir.4+0x68,%xmm1,%xmm4
       │       sar        $0x4,%rax
  0,00 │       add        $0x3ff,%rax
  1,17 │       vaddsd     ir.4+0x60,%xmm0,%xmm0
       │       shl        $0x34,%rax
  0,02 │       vaddsd     ir.4+0x70,%xmm4,%xmm4
  0,10 │       vmulsd     %xmm5,%xmm0,%xmm0
  0,85 │       vmulsd     ir.4+0x48,%xmm1,%xmm3
  0,00 │       and        $0xf,%ecx
       │       vmovq      %rax,%xmm6
  1,20 │       vmulsd     (%rdx,%rcx,8),%xmm6,%xmm2
  0,65 │       vaddsd     %xmm4,%xmm0,%xmm0
  0,00 │       vaddsd     ir.4+0x50,%xmm3,%xmm3
  3,49 │       vmulsd     %xmm5,%xmm0,%xmm0
 15,59 │       vmulsd     %xmm2,%xmm1,%xmm1
  4,61 │       vaddsd     %xmm3,%xmm0,%xmm0
 10,24 │       vmulsd     %xmm1,%xmm0,%xmm0
 11,31 │       vaddsd     %xmm2,%xmm0,%xmm0
 23,21 │       vcvtsd2ss  %xmm0,%xmm0,%xmm0
  0,00 │     ← ret

With -std=gnu18, where the overall performance is 36 cycles, I get:

  0,02 │ 3a:   vcvtss2sd   -0x4(%rsp),%xmm1,%xmm1
  0,01 │       vmulsd      ir.4+0x40,%xmm1,%xmm0
       │       vmovsd      ir.4+0x60,%xmm5
       │       vmovsd      ir.4+0x50,%xmm4
       │       lea         tb.1,%rdx
  0,13 │       vroundsd    $0x9,%xmm0,%xmm0,%xmm2
  0,83 │       vsubsd      %xmm2,%xmm0,%xmm0
 28,99 │       vcvttsd2si  %xmm2,%rax
 63,49 │       vfmadd132sd 0x961(%rip),%xmm0,%xmm1
       │       vmovsd      ir.4+0x70,%xmm0
       │       mov         %rax,%rcx
       │       sar         $0x4,%rax
  2,73 │       add         $0x3ff,%rax
  1,99 │       vmulsd      %xmm1,%xmm1,%xmm3
  0,00 │       vfmadd213sd 0x95f(%rip),%xmm1,%xmm5
  0,00 │       vfmadd213sd 0x966(%rip),%xmm1,%xmm0
       │       shl         $0x34,%rax
       │       and         $0xf,%ecx
       │       vmovq       %rax,%xmm6
  0,17 │       vmulsd      (%rdx,%rcx,8),%xmm6,%xmm2
       │       vfmadd213sd 0x92c(%rip),%xmm1,%xmm4
  0,04 │       vfmadd132sd %xmm3,%xmm5,%xmm0
  0,64 │       vmulsd      %xmm2,%xmm1,%xmm1
  0,01 │       vfmadd132sd %xmm3,%xmm4,%xmm0
  0,46 │       vfmadd132sd %xmm1,%xmm2,%xmm0
  0,27 │       vcvtsd2ss   %xmm0,%xmm0,%xmm0
       │     ← ret

The distribution of time is very different in the two cases: in the first
case, most of the time is spent at the end (computing w and the return value,
I suppose), whereas in the second case, most of the time is spent in the
first multiply-and-add (computing h). I do not understand this change of
behaviour.


Cheers,

-- 
Stéphane

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: slowdown with -std=gnu18 with respect to -std=c99
  2022-05-05  8:57   ` Stéphane Glondu
@ 2022-05-05 14:31     ` Stéphane Glondu
  2022-05-05 14:41       ` Marc Glisse
  0 siblings, 1 reply; 13+ messages in thread
From: Stéphane Glondu @ 2022-05-05 14:31 UTC (permalink / raw)
  To: Alexander Monakov, gcc-help; +Cc: sibid, Paul Zimmermann

As additional data points, here is the performance with several versions of gcc
(as packaged in Debian testing/unstable):

             | gcc-9 | gcc-10 | gcc-11 | gcc-12 |
 ------------|-------|--------|--------|--------|
  -std=c99   | 24    | 23.5   | 23     | 23     |
  -std=gnu18 | 43    | 16.8   | 38     | 38     |

One can see that the performance stays relatively constant with
-std=c99, but varies significantly with -std=gnu18.

-- 
Stéphane

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: slowdown with -std=gnu18 with respect to -std=c99
  2022-05-05 14:31     ` Stéphane Glondu
@ 2022-05-05 14:41       ` Marc Glisse
  2022-05-05 14:56         ` Alexander Monakov
  2022-05-05 17:50         ` Paul Zimmermann
  0 siblings, 2 replies; 13+ messages in thread
From: Marc Glisse @ 2022-05-05 14:41 UTC (permalink / raw)
  To: Stéphane Glondu; +Cc: Alexander Monakov, gcc-help, sibid, Paul Zimmermann

On Thu, 5 May 2022, Stéphane Glondu via Gcc-help wrote:

> As additional data points, the performance with several versions of gcc
> (as packaged in Debian testing/unstable):
>
>             | gcc-9 | gcc-10 | gcc-11 | gcc-12 |
> ------------|-------|--------|--------|--------|
>  -std=c99   | 24    | 23.5   | 23     | 23     |
>  -std=gnu18 | 43    | 16.8   | 38     | 38     |
>
> One can see that the performance stays relatively constant with
> -std=c99, but varies significantly with -std=gnu18.

Could you compare with c18 or gnu99, to determine if the issue is with c 
vs gnu (most likely since fma seems important) or 99 vs 18?

-- 
Marc Glisse

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: slowdown with -std=gnu18 with respect to -std=c99
  2022-05-05 14:41       ` Marc Glisse
@ 2022-05-05 14:56         ` Alexander Monakov
  2022-05-06  7:46           ` Paul Zimmermann
  2022-05-05 17:50         ` Paul Zimmermann
  1 sibling, 1 reply; 13+ messages in thread
From: Alexander Monakov @ 2022-05-05 14:56 UTC (permalink / raw)
  To: Marc Glisse via Gcc-help
  Cc: Stéphane Glondu, Marc Glisse, sibid, Paul Zimmermann

On Thu, 5 May 2022, Marc Glisse via Gcc-help wrote:

> On Thu, 5 May 2022, Stéphane Glondu via Gcc-help wrote:
> 
> > As additional data points, the performance with several versions of gcc
> > (as packaged in Debian testing/unstable):
> >
> >             | gcc-9 | gcc-10 | gcc-11 | gcc-12 |
> > ------------|-------|--------|--------|--------|
> >  -std=c99   | 24    | 23.5   | 23     | 23     |
> >  -std=gnu18 | 43    | 16.8   | 38     | 38     |
> >
> > One can see that the performance stays relatively constant with
> > -std=c99, but varies significantly with -std=gnu18.
> 
> Could you compare with c18 or gnu99, to determine if the issue is with c vs
> gnu (most likely since fma seems important) or 99 vs 18?

Good point. Also, could you please add latency metrics? I see that your testing
framework already exposes the '--latency' flag.

I could reproduce a similar, though less dramatic, slowdown and am investigating.

Alexander

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: slowdown with -std=gnu18 with respect to -std=c99
  2022-05-05 14:41       ` Marc Glisse
  2022-05-05 14:56         ` Alexander Monakov
@ 2022-05-05 17:50         ` Paul Zimmermann
  1 sibling, 0 replies; 13+ messages in thread
From: Paul Zimmermann @ 2022-05-05 17:50 UTC (permalink / raw)
  To: gcc-help; +Cc: stephane.glondu, amonakov, gcc-help, sibid

> Date: Thu, 5 May 2022 16:41:28 +0200 (CEST)
> From: Marc Glisse <marc.glisse@inria.fr>
> 
> On Thu, 5 May 2022, Stéphane Glondu via Gcc-help wrote:
> 
> > As additional data points, the performance with several versions of gcc
> > (as packaged in Debian testing/unstable):
> >
> >             | gcc-9 | gcc-10 | gcc-11 | gcc-12 |
> > ------------|-------|--------|--------|--------|
> >  -std=c99   | 24    | 23.5   | 23     | 23     |
> >  -std=gnu18 | 43    | 16.8   | 38     | 38     |
> >
> > One can see that the performance stays relatively constant with
> > -std=c99, but varies significantly with -std=gnu18.
> 
> Could you compare with c18 or gnu99, to determine if the issue is with c 
> vs gnu (most likely since fma seems important) or 99 vs 18?

yes, that is easy. On another i5:

            | gcc-9 | gcc-10 | gcc-11 |
------------|-------|--------|--------|
 -std=c99   | 24.3  | 23.8   | 23.8   |
 -std=c18   | 24.4  | 23.8   | 23.9   |
 -std=gnu99 | 42.9  | 19.2   | 35.0   |
 -std=gnu18 | 42.9  | 19.2   | 35.0   |

Thus the issue is definitely c vs gnu.

Paul



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: slowdown with -std=gnu18 with respect to -std=c99
  2022-05-05 14:56         ` Alexander Monakov
@ 2022-05-06  7:46           ` Paul Zimmermann
  2022-05-06  9:27             ` Alexander Monakov
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Zimmermann @ 2022-05-06  7:46 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-help, stephane.glondu, marc.glisse, sibid

       Dear Alexander,

> Good point. Also could you please add latency metrics, I see that your testing
> framework already exposes the '--latency' flag.

here are latency metrics (still on i5-4590):

            | gcc-9 | gcc-10 | gcc-11 |
------------|-------|--------|--------|
 -std=c99   | 70.8  | 70.3   | 70.2   |
 -std=gnu18 | 59.5  | 59.5   | 59.5   |

It thus seems the issue only appears for the reciprocal throughput.

Best regards,
Paul



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: slowdown with -std=gnu18 with respect to -std=c99
  2022-05-06  7:46           ` Paul Zimmermann
@ 2022-05-06  9:27             ` Alexander Monakov
  2022-05-07  6:11               ` Paul Zimmermann
  2022-05-11 13:26               ` Alexander Monakov
  0 siblings, 2 replies; 13+ messages in thread
From: Alexander Monakov @ 2022-05-06  9:27 UTC (permalink / raw)
  To: Paul Zimmermann; +Cc: gcc-help, stephane.glondu, marc.glisse, sibid

On Fri, 6 May 2022, Paul Zimmermann via Gcc-help wrote:

> here are latency metrics (still on i5-4590):
> 
>             | gcc-9 | gcc-10 | gcc-11 |
> ------------|-------|--------|--------|
>  -std=c99   | 70.8  | 70.3   | 70.2   |
>  -std=gnu18 | 59.5  | 59.5   | 59.5   |
> 
> It thus seems the issue only appears for the reciprocal throughput.

Thanks.

The primary issue here is a false dependency on the vcvtss2sd instruction. In
the snippet shown in Stéphane's email, the slower variant begins with

    vcvtss2sd   -0x4(%rsp),%xmm1,%xmm1

The cvtss2sd instruction is specified to pass the upper bits of the source SSE
register through unmodified, so here it merges the high bits of xmm1 with the
result of the float->double conversion (in the low bits) into the new xmm1.
Unless the CPU can track dependencies separately for vector register components
(AMD Zen 2 is an example of a microarchitecture that apparently can), it has to
delay this instruction until the previous computation that modified xmm1 has
completed.
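
To spell out the merge, here is a rough C model of the instruction's
effect (illustrative only, ignoring the bits above 127):

    #include <stdint.h>
    #include <string.h>

    typedef struct { uint64_t lo, hi; } xmm;

    /* model of  vcvtss2sd mem, %xmm1, %xmm1 : the low 64 bits receive
       the converted double, the high 64 bits are copied from the old
       xmm1 -- that read of xmm1 is the false dependency.              */
    static xmm cvtss2sd_model (xmm old_xmm1, float mem)
    {
      xmm dst;
      double d = (double) mem;           /* the useful work             */
      memcpy (&dst.lo, &d, sizeof d);    /* low half: conversion result */
      dst.hi = old_xmm1.hi;              /* high half: merged from xmm1 */
      return dst;
    }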

This limits the degree to which separate cr_exp10f calls can overlap, affecting
throughput. In the latency measurements, the calls are already serialized by the
dependency over xmm0, so the additional false dependency does not matter.
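
Schematically, the two measurement modes boil down to something like
this (my own sketch, not the actual perf.sh harness):

    float cr_exp10f (float);  /* function under test */

    /* throughput-style loop: the inputs are independent, so
       consecutive calls can overlap in the pipeline -- unless a false
       dependency (here on xmm1) partially serializes them.  */
    float bench_throughput (const float *x, int n)
    {
      float acc = 0.0f;
      for (int i = 0; i < n; i++)
        acc += cr_exp10f (x[i]);
      return acc;
    }

    /* latency-style loop: each call consumes the previous result, so
       the calls are serialized through xmm0 anyway and the extra
       false dependency is hidden.  */
    float bench_latency (float x, int n)
    {
      for (int i = 0; i < n; i++)
        x = cr_exp10f (x);
      return x;
    }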

(so fma is a "red herring"; it's just that, depending on compiler version and
flags, register allocation places the last assignment into xmm1 differently)

If you want to experiment, you can hand-edit assembly to replace the problematic
instruction with variants that avoid the false dependency, such as

    vcvtss2sd %xmm0, %xmm0, %xmm1

or

    vpxor %xmm1, %xmm1, %xmm1
    vcvtss2sd   -0x4(%rsp),%xmm1,%xmm1

GCC has code to do this automatically, but for some reason it doesn't work for
your function. I have reported it in Bugzilla:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105504

Alexander

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: slowdown with -std=gnu18 with respect to -std=c99
  2022-05-06  9:27             ` Alexander Monakov
@ 2022-05-07  6:11               ` Paul Zimmermann
  2022-05-11 13:26               ` Alexander Monakov
  1 sibling, 0 replies; 13+ messages in thread
From: Paul Zimmermann @ 2022-05-07  6:11 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-help, stephane.glondu, marc.glisse, sibid

thank you very much Alexander for your analysis and the bugzilla report!

Paul

> Date: Fri, 6 May 2022 12:27:39 +0300 (MSK)
> From: Alexander Monakov <amonakov@ispras.ru>
> 
> On Fri, 6 May 2022, Paul Zimmermann via Gcc-help wrote:
> 
> > here are latency metrics (still on i5-4590):
> > 
> >             | gcc-9 | gcc-10 | gcc-11 |
> > ------------|-------|--------|--------|
> >  -std=c99   | 70.8  | 70.3   | 70.2   |
> >  -std=gnu18 | 59.5  | 59.5   | 59.5   |
> > 
> > It thus seems the issue only appears for the reciprocal throughput.
> 
> Thanks.
> 
> The primary issue here is false dependency on vcvtss2sd instruction. In the
> snippet shown in Stéphane's email, the slower variant begins with
> 
>     vcvtss2sd   -0x4(%rsp),%xmm1,%xmm1
> 
> The cvtss2sd instruction is specified to take the upper bits of SSE register
> unmodified, so here it merges high bits of xmm1 with results of float->double
> conversion (in low bits) into new xmm1. Unless the CPU can track dependencies
> separately for vector register components, it has to delay this instruction
> until the previous computation that modified xmm1 has completed (AMD Zen2 is
> an example of a microarchitecture that apparently can).
> 
> This limits the degree to which separate cr_exp10f calls can overlap, affecting
> throughput. In latency measurements, the calls are already serialized by
> dependency over xmm0, so the additional false dependency does not matter.
> 
> (so fma is a "red herring", it's just that depending on compiler version and
> flags, register allocation will place last assignment into xmm1 differently)
> 
> If you want to experiment, you can hand-edit assembly to replace the problematic
> instruction with variants that avoid the false dependency, such as
> 
>     vcvtss2sd %xmm0, %xmm0, %xmm1
> 
> or
> 
>     vpxor %xmm1, %xmm1, %xmm1
>     vcvtss2sd   -0x4(%rsp),%xmm1,%xmm1
> 
> GCC has code to do this automatically, but for some reason it doesn't work for
> your function. I have reported in to the Bugzilla:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105504
> 
> Alexander

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: slowdown with -std=gnu18 with respect to -std=c99
  2022-05-06  9:27             ` Alexander Monakov
  2022-05-07  6:11               ` Paul Zimmermann
@ 2022-05-11 13:26               ` Alexander Monakov
  1 sibling, 0 replies; 13+ messages in thread
From: Alexander Monakov @ 2022-05-11 13:26 UTC (permalink / raw)
  To: Paul Zimmermann; +Cc: gcc-help, stephane.glondu, marc.glisse, sibid

On Fri, 6 May 2022, Alexander Monakov wrote:

> The primary issue here is false dependency on vcvtss2sd instruction. In the
> snippet shown in Stéphane's email, the slower variant begins with
> 
>     vcvtss2sd   -0x4(%rsp),%xmm1,%xmm1
> 
> The cvtss2sd instruction is specified to take the upper bits of SSE register
> unmodified, so here it merges high bits of xmm1 with results of float->double
> conversion (in low bits) into new xmm1. Unless the CPU can track dependencies
> separately for vector register components, it has to delay this instruction
> until the previous computation that modified xmm1 has completed (AMD Zen2 is
> an example of a microarchitecture that apparently can).

For future reference, my statement in parentheses was a bit inaccurate: Zen 2
avoids the false dependency provided that xmm1 carries all-zeroes in its high
bits after being idiomatically zeroed (i.e. via pxor). Thanks to Andreas Abel
for pointing out that there is a limitation.

(nevertheless, the "blessed" state seemingly survives context switches, so
it's quite useful, including for this testcase)

Alexander

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-05-11 13:26 UTC | newest]

Thread overview: 13+ messages
2022-05-03  8:28 slowdown with -std=gnu18 with respect to -std=c99 Paul Zimmermann
2022-05-03  9:09 ` Alexander Monakov
2022-05-03 11:45   ` Paul Zimmermann
2022-05-03 12:12     ` Alexander Monakov
2022-05-05  8:57   ` Stéphane Glondu
2022-05-05 14:31     ` Stéphane Glondu
2022-05-05 14:41       ` Marc Glisse
2022-05-05 14:56         ` Alexander Monakov
2022-05-06  7:46           ` Paul Zimmermann
2022-05-06  9:27             ` Alexander Monakov
2022-05-07  6:11               ` Paul Zimmermann
2022-05-11 13:26               ` Alexander Monakov
2022-05-05 17:50         ` Paul Zimmermann
