* [Bug c/114987] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
2024-05-08 14:44 [Bug c/114987] New: floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms colin.king at intel dot com
@ 2024-05-08 14:45 ` colin.king at intel dot com
2024-05-08 14:45 ` colin.king at intel dot com
` (5 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: colin.king at intel dot com @ 2024-05-08 14:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987
--- Comment #1 from Colin Ian King <colin.king at intel dot com> ---
Created attachment 58127
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58127&action=edit
gcc-13 disassembly
* [Bug c/114987] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
From: colin.king at intel dot com @ 2024-05-08 14:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987
--- Comment #2 from Colin Ian King <colin.king at intel dot com> ---
Created attachment 58128
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58128&action=edit
gcc-14 disassembly
* [Bug c/114987] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
From: colin.king at intel dot com @ 2024-05-08 15:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987
--- Comment #3 from Colin Ian King <colin.king at intel dot com> ---
perf report from gcc-13 of stress_vecfp_float_add_16.avx of compute loop:
 57.93 │200:   vaddps   0xc0(%rsp),%ymm3,%ymm5
 11.11 │       vaddps   0xe0(%rsp),%ymm2,%ymm6
  0.02 │       vmovaps  %ymm5,0x60(%rsp)
  2.92 │       mov      0x60(%rsp),%rax
       │       mov      0x68(%rsp),%rdx
  0.37 │       vmovaps  %ymm6,0x40(%rsp)
       │       vmovaps  %ymm5,0x80(%rsp)
  6.30 │       vmovq    %rax,%xmm1
  4.11 │       mov      0x40(%rsp),%rax
       │       vmovdqa  0x90(%rsp),%xmm5
       │       vmovaps  %ymm6,0xa0(%rsp)
  3.27 │       vpinsrq  $0x1,%rdx,%xmm1,%xmm1
       │       mov      0x48(%rsp),%rdx
       │       vmovdqa  0xb0(%rsp),%xmm6
  3.22 │       vmovdqa  %xmm1,0xc0(%rsp)
  0.42 │       vmovq    %rax,%xmm0
       │       vmovdqa  %xmm5,0xd0(%rsp)
  6.80 │       vpinsrq  $0x1,%rdx,%xmm0,%xmm0
  3.52 │       vmovdqa  %xmm0,0xe0(%rsp)
       │       vmovdqa  %xmm6,0xf0(%rsp)
       │       sub      $0x1,%ecx
       │     ↑ jne      200
perf report from gcc-14 of stress_vecfp_float_add_16.avx of compute loop:
 65.79 │200:   vaddps   0xc0(%rsp),%ymm3,%ymm5
  3.26 │       vaddps   0xe0(%rsp),%ymm2,%ymm6
  0.00 │       vmovaps  %ymm5,0x60(%rsp)
  9.25 │       mov      0x60(%rsp),%rax
  0.00 │       mov      0x68(%rsp),%rdx
       │       vmovaps  %ymm6,0x40(%rsp)
       │       vmovaps  %ymm5,0x80(%rsp)
  6.49 │       vmovq    %rax,%xmm1
  0.00 │       mov      0x40(%rsp),%rax
  0.00 │       vmovaps  %ymm6,0xa0(%rsp)
  3.02 │       vpinsrq  $0x1,%rdx,%xmm1,%xmm1
       │       mov      0x48(%rsp),%rdx
  0.35 │       vmovdqa  %xmm1,0xc0(%rsp)
  0.68 │       vmovq    %rax,%xmm0
  0.00 │       vmovdqa  0x90(%rsp),%xmm1
  5.18 │       vpinsrq  $0x1,%rdx,%xmm0,%xmm0
  3.00 │       vmovdqa  %xmm0,0xe0(%rsp)
       │       vmovdqa  0xb0(%rsp),%xmm0
       │       vmovdqa  %xmm1,0xd0(%rsp)
       │       vmovdqa  %xmm0,0xf0(%rsp)
       │       sub      $0x1,%ecx
  2.94 │     ↑ jne      200
* [Bug target/114987] [14/15 Regression] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
From: rguenth at gcc dot gnu.org @ 2024-05-10 7:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
            Summary|[14/15 regression] floating |[14/15 Regression] floating
                   |point vector regression,    |point vector regression,
                   |x86, between gcc 14 and     |x86, between gcc 14 and
                   |gcc-13 using -O3 and target |gcc-13 using -O3 and target
                   |clones on skylake platforms |clones on skylake platforms
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-05-10
             Target|x86_64                      |x86_64-*-*
   Target Milestone|---                         |14.2
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can't reproduce a slowdown on a Zen2 CPU. The difference seems to be merely
instruction scheduling. I do note we're not doing a good job in handling
  for (i = 0; i < LOOPS_PER_CALL; i++) {
    r.v = r.v + add.v;
  }
where r.v and add.v are AVX512 sized vectors when emulating them with AVX
vectors. We end up with
  r_v_lsm.48_48 = r.v;
  _11 = add.v;

  <bb 3> [local count: 1063004408]:
  # r_v_lsm.48_50 = PHI <_12(3), r_v_lsm.48_48(2)>
  # ivtmp_56 = PHI <ivtmp_55(3), 65536(2)>
  _16 = BIT_FIELD_REF <_11, 256, 0>;
  _37 = BIT_FIELD_REF <r_v_lsm.48_50, 256, 0>;
  _29 = _16 + _37;
  _387 = BIT_FIELD_REF <_11, 256, 256>;
  _375 = BIT_FIELD_REF <r_v_lsm.48_50, 256, 256>;
  _363 = _387 + _375;
  _12 = {_29, _363};
  ivtmp_55 = ivtmp_56 - 1;
  if (ivtmp_55 != 0)
    goto <bb 3>; [98.99%]
  else
    goto <bb 4>; [1.01%]

  <bb 4> [local count: 10737416]:
after lowering from 512-bit to 256-bit vectors, and there's no pass that
would demote the 512-bit reduction value to two 256-bit ones.
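
To make the missing demotion concrete, it can be done by hand in the source. The following is only a sketch, assuming GCC's generic vector extensions; the `v8sf` typedef and the function name `add_split` are hypothetical and do not appear in the bug report, and this is not something any GCC pass is claimed to do. It keeps two 256-bit accumulators live across the loop instead of one 512-bit value that is split and re-assembled on every iteration.

```c
#include <string.h>

typedef float v16sf __attribute__((vector_size(sizeof(float) * 16)));
typedef float v8sf __attribute__((vector_size(sizeof(float) * 8)));

/* Hypothetical helper: add *a to *r n times, carrying the running sum
   as two independent 256-bit halves rather than one 512-bit vector. */
void add_split(v16sf *r, const v16sf *a, int n)
{
    v8sf r_lo, r_hi, a_lo, a_hi;

    /* Split both 512-bit operands into low and high 256-bit halves once,
       before the loop. */
    memcpy(&r_lo, (const char *)r, sizeof r_lo);
    memcpy(&r_hi, (const char *)r + sizeof r_lo, sizeof r_hi);
    memcpy(&a_lo, (const char *)a, sizeof a_lo);
    memcpy(&a_hi, (const char *)a + sizeof a_lo, sizeof a_hi);

    /* The loop-carried values are now plain 256-bit vectors. */
    for (int i = 0; i < n; ++i) {
        r_lo = r_lo + a_lo;
        r_hi = r_hi + a_hi;
    }

    /* Re-assemble the 512-bit result once, after the loop. */
    memcpy((char *)r, &r_lo, sizeof r_lo);
    memcpy((char *)r + sizeof r_lo, &r_hi, sizeof r_hi);
}
```

Whether a given GCC version then keeps both halves in ymm registers for the whole loop would have to be checked in the generated assembly; the sketch only shows the shape of the transformation.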
There are also weird things going on in the target/on RTL. A smaller testcase
illustrating the code generation issue is
typedef float v16sf __attribute__((vector_size(sizeof(float)*16)));

void foo (v16sf * __restrict r, v16sf *a, int n)
{
  for (int i = 0; i < n; ++i)
    *r = *r + *a;
}
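
For comparing compilers on this reduced testcase, a small timing wrapper can help. This is a sketch under assumptions: `time_foo` is a hypothetical helper that is not part of the bug report, `foo` is marked `noinline` here (an addition to the testcase) so that the loop stays in its own function, and POSIX `clock_gettime` is assumed to be available.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <time.h>

typedef float v16sf __attribute__((vector_size(sizeof(float) * 16)));

/* Reduced testcase from the comment above; noinline is an addition so
   the loop is not merged into the timing code. */
__attribute__((noinline))
void foo(v16sf *__restrict r, v16sf *a, int n)
{
    for (int i = 0; i < n; ++i)
        *r = *r + *a;
}

/* Hypothetical helper: wall-clock seconds spent in `calls` calls of foo. */
double time_foo(v16sf *r, v16sf *a, int n, int calls)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int c = 0; c < calls; ++c)
        foo(r, a, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

Building the same file with each compiler, for example with `-O3 -mavx` so that the 512-bit vectors are emulated with 256-bit ones, and comparing the returned times on the affected machines would be one way to check whether the reduced loop shows the same difference as the full test.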
So confirmed for non-optimal code but I don't see how it's a regression.
* [Bug target/114987] [14/15 Regression] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
From: haochen.jiang at intel dot com @ 2024-05-10 8:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987
--- Comment #5 from Haochen Jiang <haochen.jiang at intel dot com> ---
What I have found is that the binary built with GCC14 regresses against the one
built with GCC13 on Cascadelake and Skylake.
But when I copied the binaries to Icelake, it does not. It seems Icelake might
fix this with micro-tuning.
I tried to move "vmovdqa %xmm1,0xd0(%rsp)" before "vmovdqa %xmm0,0xe0(%rsp)"
and rebuilt the binary and it will save half the regression.
* [Bug target/114987] [14/15 Regression] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
From: liuhongt at gcc dot gnu.org @ 2024-05-10 8:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987
--- Comment #6 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> I tried to move "vmovdqa %xmm1,0xd0(%rsp)" before "vmovdqa %xmm0,0xe0(%rsp)"
> and rebuilt the binary and it will save half the regression.
 57.93 │200:   vaddps   0xc0(%rsp),%ymm3,%ymm5
 11.11 │       vaddps   0xe0(%rsp),%ymm2,%ymm6
 ...
  3.22 │       vmovdqa  %xmm1,0xc0(%rsp)
       │       vmovdqa  %xmm5,0xd0(%rsp)
  3.52 │       vmovdqa  %xmm0,0xe0(%rsp)
       │       vmovdqa  %xmm6,0xf0(%rsp)
I guess there are specific patterns in the SKX microarchitecture for STLF
(store-to-load forwarding); the main difference is the instruction order of
those xmm stores.
From the compiler side, the worthwhile thing to do is PR107916.
* [Bug target/114987] [14/15 Regression] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
From: haochen.jiang at intel dot com @ 2024-05-10 8:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987
--- Comment #7 from Haochen Jiang <haochen.jiang at intel dot com> ---
Furthermore, when I build with GCC11, the codegen is much better:
  vaddps   0xc0(%rsp),%ymm5,%ymm2
  vaddps   0xe0(%rsp),%ymm4,%ymm1
  vmovaps  %ymm2,0x80(%rsp)
  vmovdqa  0x90(%rsp),%xmm6
  vmovaps  %ymm1,0xa0(%rsp)
  vmovdqa  0xb0(%rsp),%xmm7
  vmovdqa  %xmm2,0xc0(%rsp)
  vmovdqa  %xmm6,0xd0(%rsp)
  vmovdqa  %xmm1,0xe0(%rsp)
  vmovdqa  %xmm7,0xf0(%rsp)
  sub      $0x1,%eax
  jne      401e00 <stress_vecfp_float_add_16.avx.1+0x1e0>
It seems we might have two separate issues behind this regression.