[Bug c/114987] New: floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms

* [Bug c/114987] New: floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
@ 2024-05-08 14:44 colin.king at intel dot com
  2024-05-08 14:45 ` [Bug c/114987] " colin.king at intel dot com
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: colin.king at intel dot com @ 2024-05-08 14:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

            Bug ID: 114987
           Summary: floating point vector regression, x86, between gcc 14
                    and gcc-13 using -O3 and target clones on skylake
                    platforms
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: colin.king at intel dot com
  Target Milestone: ---

Created attachment 58126
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58126&action=edit
reproducer.c source code

I'm seeing a ~10% performance regression in gcc-14 compared to gcc-13, using
gcc on Ubuntu 24.04:

Versions:
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4) 
gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f]
(Ubuntu 14-20240412-0ubuntu1) 

cking@skylake:~$ CFLAGS="" gcc-13 reproducer.c; ./a.out
4.92 secs duration, 2130.379 Mfp-ops/sec
cking@skylake:~$ CFLAGS="" gcc-14 reproducer.c; ./a.out  
5.46 secs duration, 1921.799 Mfp-ops/sec
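
For reference, the arithmetic behind the "~10%" figure can be sketched as below. This is a minimal illustration, not part of the reproducer; the function name percent_drop is hypothetical.

```c
/* Hypothetical helper: percentage drop in throughput when going from
   `before` (gcc-13, Mfp-ops/sec) to `after` (gcc-14, Mfp-ops/sec). */
double percent_drop(double before, double after)
{
    return (before - after) / before * 100.0;
}
```

With the figures quoted above, percent_drop(2130.379, 1921.799) comes out at roughly 9.8, which matches the ~10% regression described.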

The original issue appeared while regression testing the stress-ng vecfp stressor
[1] with its 16-element floating point vector add method. I've extracted the
attached reproducer (reproducer.c) from the original code.

Salient points to focus on:

1. The issue is dependent on the OPTIMIZE3 macro in the reproducer being
defined as __attribute__((optimize("-O3")))
2. The issue is also dependent on the TARGET_CLONES macro being defined as
__attribute__((target_clones("mmx,avx,default"))); the avx target clone
seems to be the one needed to reproduce this problem.
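
A minimal sketch of how the two macros might be applied is shown below. It is not the attached reproducer: the function name vecfp_add_16, the vec16_t union and the platform guard are assumptions made for illustration. Only the two attribute strings and the LOOPS_PER_CALL add loop (65536 iterations, as seen later in this thread) come from the report.

```c
/* Assumption: the guard is added only so the sketch builds on other
   compilers and targets; the report uses the attributes unconditionally. */
#if defined(__GNUC__) && !defined(__clang__) && \
    defined(__x86_64__) && defined(__linux__)
#define OPTIMIZE3     __attribute__((optimize("-O3")))
#define TARGET_CLONES __attribute__((target_clones("mmx,avx,default")))
#else
#define OPTIMIZE3
#define TARGET_CLONES
#endif

#define LOOPS_PER_CALL 65536

typedef float v16sf __attribute__((vector_size(sizeof(float) * 16)));

/* Hypothetical container giving both vector and per-element access. */
typedef union {
    v16sf v;
    float f[16];
} vec16_t;

/* Hypothetical stand-in for stress_vecfp_float_add_16: GCC emits one
   clone per listed target plus a resolver that picks one at load time. */
TARGET_CLONES OPTIMIZE3
void vecfp_add_16(vec16_t *result, const vec16_t *addend)
{
    vec16_t r = *result;
    vec16_t add = *addend;

    for (int i = 0; i < LOOPS_PER_CALL; i++)
        r.v = r.v + add.v;

    *result = r;
}
```

On an AVX-capable machine the resolver would be expected to select the .avx clone, which is the one examined in the disassembly.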

Attached are the reproducer.c C source and disassembled object code. The
stress_vecfp_float_add_16.avx from gcc-13 is significantly different from the
gcc-14 code.

References: [1]
https://github.com/ColinIanKing/stress-ng/blob/master/stress-vecfp.c

* [Bug c/114987] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
From: colin.king at intel dot com @ 2024-05-08 14:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

--- Comment #1 from Colin Ian King <colin.king at intel dot com> ---
Created attachment 58127
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58127&action=edit
gcc-13 disassembly

* [Bug c/114987] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
From: colin.king at intel dot com @ 2024-05-08 14:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

--- Comment #2 from Colin Ian King <colin.king at intel dot com> ---
Created attachment 58128
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58128&action=edit
gcc-14 disassembly

* [Bug c/114987] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
From: colin.king at intel dot com @ 2024-05-08 15:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

--- Comment #3 from Colin Ian King <colin.king at intel dot com> ---
perf report for the compute loop of stress_vecfp_float_add_16.avx built with gcc-13:

 57.93 │200:   vaddps       0xc0(%rsp),%ymm3,%ymm5                        
 11.11 │       vaddps       0xe0(%rsp),%ymm2,%ymm6                        
  0.02 │       vmovaps      %ymm5,0x60(%rsp)                              
  2.92 │       mov          0x60(%rsp),%rax                               
       │       mov          0x68(%rsp),%rdx                               
  0.37 │       vmovaps      %ymm6,0x40(%rsp)                              
       │       vmovaps      %ymm5,0x80(%rsp)                              
  6.30 │       vmovq        %rax,%xmm1                                    
  4.11 │       mov          0x40(%rsp),%rax                               
       │       vmovdqa      0x90(%rsp),%xmm5                              
       │       vmovaps      %ymm6,0xa0(%rsp)                              
  3.27 │       vpinsrq      $0x1,%rdx,%xmm1,%xmm1                         
       │       mov          0x48(%rsp),%rdx                               
       │       vmovdqa      0xb0(%rsp),%xmm6                              
  3.22 │       vmovdqa      %xmm1,0xc0(%rsp)                              
  0.42 │       vmovq        %rax,%xmm0                                    
       │       vmovdqa      %xmm5,0xd0(%rsp)                              
  6.80 │       vpinsrq      $0x1,%rdx,%xmm0,%xmm0                         
  3.52 │       vmovdqa      %xmm0,0xe0(%rsp)                              
       │       vmovdqa      %xmm6,0xf0(%rsp)                              
       │       sub          $0x1,%ecx                                     
       │     ↑ jne          200    

perf report for the compute loop of stress_vecfp_float_add_16.avx built with gcc-14:

 65.79 │200:   vaddps       0xc0(%rsp),%ymm3,%ymm5                        
  3.26 │       vaddps       0xe0(%rsp),%ymm2,%ymm6                        
  0.00 │       vmovaps      %ymm5,0x60(%rsp)                              
  9.25 │       mov          0x60(%rsp),%rax                               
  0.00 │       mov          0x68(%rsp),%rdx                               
       │       vmovaps      %ymm6,0x40(%rsp)                              
       │       vmovaps      %ymm5,0x80(%rsp)                              
  6.49 │       vmovq        %rax,%xmm1                                    
  0.00 │       mov          0x40(%rsp),%rax                               
  0.00 │       vmovaps      %ymm6,0xa0(%rsp)                              
  3.02 │       vpinsrq      $0x1,%rdx,%xmm1,%xmm1                         
       │       mov          0x48(%rsp),%rdx                               
  0.35 │       vmovdqa      %xmm1,0xc0(%rsp)                              
  0.68 │       vmovq        %rax,%xmm0                                    
  0.00 │       vmovdqa      0x90(%rsp),%xmm1                              
  5.18 │       vpinsrq      $0x1,%rdx,%xmm0,%xmm0                         
  3.00 │       vmovdqa      %xmm0,0xe0(%rsp)                              
       │       vmovdqa      0xb0(%rsp),%xmm0                              
       │       vmovdqa      %xmm1,0xd0(%rsp)                              
       │       vmovdqa      %xmm0,0xf0(%rsp)                              
       │       sub          $0x1,%ecx                                     
  2.94 │     ↑ jne          200

* [Bug target/114987] [14/15 Regression] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
From: rguenth at gcc dot gnu.org @ 2024-05-10  7:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
            Summary|[14/15 regression] floating |[14/15 Regression] floating
                   |point vector regression,    |point vector regression,
                   |x86, between gcc 14 and     |x86, between gcc 14 and
                   |gcc-13 using -O3 and target |gcc-13 using -O3 and target
                   |clones on skylake platforms |clones on skylake platforms
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-05-10
             Target|x86_64                      |x86_64-*-*
   Target Milestone|---                         |14.2

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can't reproduce a slowdown on a Zen2 CPU.  The difference seems to be merely
instruction scheduling.  I do note we're not doing a good job in handling

        for (i = 0; i < LOOPS_PER_CALL; i++) {
                r.v = r.v + add.v;
        }

where r.v and add.v are AVX512 sized vectors when emulating them with AVX
vectors.  We end up with

  r_v_lsm.48_48 = r.v;
  _11 = add.v;

  <bb 3> [local count: 1063004408]:
  # r_v_lsm.48_50 = PHI <_12(3), r_v_lsm.48_48(2)>
  # ivtmp_56 = PHI <ivtmp_55(3), 65536(2)>
  _16 = BIT_FIELD_REF <_11, 256, 0>;
  _37 = BIT_FIELD_REF <r_v_lsm.48_50, 256, 0>;
  _29 = _16 + _37;
  _387 = BIT_FIELD_REF <_11, 256, 256>;
  _375 = BIT_FIELD_REF <r_v_lsm.48_50, 256, 256>;
  _363 = _387 + _375;
  _12 = {_29, _363};
  ivtmp_55 = ivtmp_56 - 1;
  if (ivtmp_55 != 0)
    goto <bb 3>; [98.99%]
  else
    goto <bb 4>; [1.01%]

  <bb 4> [local count: 10737416]:

after lowering from 512bit to 256bit vectors and there's no pass that
would demote the 512bit reduction value to two 256bit ones.

There are also weird things going on in the target/on RTL.  A smaller testcase
illustrating the code generation issue is

typedef float v16sf __attribute__((vector_size(sizeof(float)*16)));

void foo (v16sf * __restrict r, v16sf *a, int n)
{
  for (int i = 0; i < n; ++i)
    *r = *r + *a;
}

So confirmed for non-optimal code but I don't see how it's a regression.
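
To illustrate what "demoting the 512bit reduction value to two 256bit ones" would amount to, here is a source-level sketch using GCC vector extensions. It is a hand-written approximation of the transformation, not something GCC does today and not a proposed patch; the names add_wide and add_split are hypothetical.

```c
#include <string.h>

typedef float v16sf __attribute__((vector_size(sizeof(float) * 16)));
typedef float v8sf  __attribute__((vector_size(sizeof(float) * 8)));

/* Same shape as the smaller testcase: a single 512-bit accumulator. */
void add_wide(v16sf *__restrict r, const v16sf *a, int n)
{
    for (int i = 0; i < n; ++i)
        *r = *r + *a;
}

/* Hand-split form: the accumulator is carried as two independent
   256-bit values, so nothing has to be reassembled into 512 bits
   inside the loop. */
void add_split(v16sf *__restrict r, const v16sf *a, int n)
{
    v8sf r_lo, r_hi, a_lo, a_hi;

    /* Split both operands into low and high halves once, up front. */
    memcpy(&r_lo, r, sizeof r_lo);
    memcpy(&r_hi, (const char *)r + sizeof r_lo, sizeof r_hi);
    memcpy(&a_lo, a, sizeof a_lo);
    memcpy(&a_hi, (const char *)a + sizeof a_lo, sizeof a_hi);

    for (int i = 0; i < n; ++i) {
        r_lo = r_lo + a_lo;
        r_hi = r_hi + a_hi;
    }

    /* Recombine the halves only after the loop. */
    memcpy(r, &r_lo, sizeof r_lo);
    memcpy((char *)r + sizeof r_lo, &r_hi, sizeof r_hi);
}
```

Both functions perform the same element-wise additions in the same order, so they should give identical results; the difference is only in what the loop-carried value looks like to the compiler.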

* [Bug target/114987] [14/15 Regression] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
From: haochen.jiang at intel dot com @ 2024-05-10  8:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

--- Comment #5 from Haochen Jiang <haochen.jiang at intel dot com> ---
What I have found is that the regression between the binaries built with GCC13
and GCC14 shows up on Cascadelake and Skylake.

But when I copied the same binaries to Icelake, it does not. It seems Icelake
might have fixed this with micro-tuning.

I tried to move "vmovdqa %xmm1,0xd0(%rsp)" before "vmovdqa %xmm0,0xe0(%rsp)"
and rebuilt the binary and it will save half the regression.

* [Bug target/114987] [14/15 Regression] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
From: liuhongt at gcc dot gnu.org @ 2024-05-10  8:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

--- Comment #6 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> I tried to move "vmovdqa %xmm1,0xd0(%rsp)" before "vmovdqa %xmm0,0xe0(%rsp)"
> and rebuilt the binary and it will save half the regression.

 57.93 │200:   vaddps       0xc0(%rsp),%ymm3,%ymm5                        
 11.11 │       vaddps       0xe0(%rsp),%ymm2,%ymm6
        ...
  3.22 │       vmovdqa      %xmm1,0xc0(%rsp)                                    
       │       vmovdqa      %xmm5,0xd0(%rsp)                                    
  3.52 │       vmovdqa      %xmm0,0xe0(%rsp)                              
       │       vmovdqa      %xmm6,0xf0(%rsp)   

I guess there are specific patterns in the SKX microarchitecture for STLF
(store-to-load forwarding); the main difference is the instruction order of
those xmm stores.

From the compiler side, the worthwhile thing to do is PR107916.
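
The access pattern in question, as it reads from the listings, is a 32-byte load (vaddps from 0xc0(%rsp)) whose data was written by two separate 16-byte stores (to 0xc0 and 0xd0). A rough C sketch of that shape follows; it only illustrates the pattern, the name halves_then_whole is hypothetical, and whether a compiler emits exactly these loads and stores (rather than merging them) is not guaranteed.

```c
#include <string.h>

typedef float v4sf __attribute__((vector_size(16)));
typedef float v8sf __attribute__((vector_size(32)));

/* GCC/Clang extension: stops the compiler from keeping *slot in a
   register across iterations, so each pass goes through memory. */
#define COMPILER_BARRIER() __asm__ volatile("" ::: "memory")

/* slot and addend each point at 8 floats. */
void halves_then_whole(float *slot, const float *addend, int n)
{
    v8sf add;
    memcpy(&add, addend, sizeof add);

    for (int i = 0; i < n; i++) {
        v8sf sum;
        v4sf lo, hi;

        memcpy(&sum, slot, sizeof sum);      /* one 32-byte load */
        sum = sum + add;

        memcpy(&lo, &sum, sizeof lo);
        memcpy(&hi, (const char *)&sum + sizeof lo, sizeof hi);

        memcpy(slot, &lo, sizeof lo);        /* 16-byte store, low half */
        memcpy(slot + 4, &hi, sizeof hi);    /* 16-byte store, high half */
        COMPILER_BARRIER();
    }
}
```

If the wide load cannot be satisfied from a single in-flight store, it has to wait for both narrow stores, which is one plausible reason the relative order of the xmm stores matters here.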

* [Bug target/114987] [14/15 Regression] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms
From: haochen.jiang at intel dot com @ 2024-05-10  8:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

--- Comment #7 from Haochen Jiang <haochen.jiang at intel dot com> ---
Furthermore, when I build with GCC11, the codegen is much better:

        vaddps       0xc0(%rsp),%ymm5,%ymm2
        vaddps       0xe0(%rsp),%ymm4,%ymm1
        vmovaps      %ymm2,0x80(%rsp)
        vmovdqa      0x90(%rsp),%xmm6
        vmovaps      %ymm1,0xa0(%rsp)
        vmovdqa      0xb0(%rsp),%xmm7
        vmovdqa      %xmm2,0xc0(%rsp)
        vmovdqa      %xmm6,0xd0(%rsp)
        vmovdqa      %xmm1,0xe0(%rsp)
        vmovdqa      %xmm7,0xf0(%rsp)
        sub          $0x1,%eax
        jne          401e00 <stress_vecfp_float_add_16.avx.1+0x1e0>

It seems we might have two separate issues behind this regression.
