From: "hubicka at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/110586] [14 Regression] 10% fatigue2 regression on zen since r14-2369-g3a61ca1b925653 (bad LRA&scheduling)
Date: Tue, 18 Jul 2023 10:27:01 +0000
Message-ID: <bug-110586-4-gHTL5AsXEm@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-110586-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110586

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[14 Regression] 10%         |[14 Regression] 10%
                   |fatigue2 regression on zen  |fatigue2 regression on zen
                   |since                       |since
                   |r14-2369-g3a61ca1b925653    |r14-2369-g3a61ca1b925653
                   |                            |(bad LRA&scheduling)

--- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Aha, sphinx3 is indeed the same patch.
The patch corrects the profile here.  It is the LRA/scheduler interaction that
causes the difference.
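
(For reference, the numbers below are plain perf stat runs.  A sketch of the
setup, assuming the usual Polyhedron fatigue2 source file name -- the exact
options on my box may differ:

  gfortran -Ofast -march=native fatigue2.f90 -o b.out   # older trunk
  gfortran -Ofast -march=native fatigue2.f90 -o a.out   # current trunk
  perf stat ./b.out
  perf stat ./a.out
)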

With the older trunk I get:
 Performance counter stats for './b.out':

         28,536.75 msec task-clock:u                     #    1.000 CPUs utilized
                  0      context-switches:u               #    0.000 /sec
                  0      cpu-migrations:u                 #    0.000 /sec
                138      page-faults:u                    #    4.836 /sec
    134,747,380,473      cycles:u                         #    4.722 GHz                         (83.33%)
        714,193,718      stalled-cycles-frontend:u        #    0.53% frontend cycles idle        (83.33%)
          3,510,378      stalled-cycles-backend:u         #    0.00% backend cycles idle         (83.33%)
    243,176,910,654      instructions:u                   #    1.80  insn per cycle
                                                          #    0.00  stalled cycles per insn     (83.33%)
     13,541,807,472      branches:u                       #  474.539 M/sec                       (83.33%)
         13,829,858      branch-misses:u                  #    0.10% of all branches             (83.33%)

      28.537620889 seconds time elapsed

      28.536941000 seconds user
       0.000000000 seconds sys

and with the current trunk:
 Performance counter stats for './a.out':

          31933.51 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               138      page-faults:u                    #    4.321 /sec
      150448312691      cycles:u                         #    4.711 GHz                         (83.33%)
         760763745      stalled-cycles-frontend:u        #    0.51% frontend cycles idle        (83.33%)
           1918238      stalled-cycles-backend:u         #    0.00% backend cycles idle         (83.33%)
      242823668283      instructions:u                   #    1.61  insn per cycle
                                                         #    0.00  stalled cycles per insn     (83.34%)
       13541981288      branches:u                       #  424.068 M/sec                       (83.34%)
          14583703      branch-misses:u                  #    0.11% of all branches             (83.33%)

      31.933986770 seconds time elapsed

      31.933701000 seconds user
       0.000000000 seconds sys
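
Cross-checking IPC from the two runs (instructions / cycles):

  before: 243,176,910,654 / 134,747,380,473 ~= 1.80 insn per cycle
  after:  242,823,668,283 / 150,448,312,691 ~= 1.61 insn per cycle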

So the instruction and branch counts are essentially the same, but the code
executes more slowly: IPC drops from 1.80 to 1.61.  Perf attributes the
difference to __perdida_m_MOD_generalized_hookes_law.constprop.0:

  27.45%  b.out    b.out        [.] MAIN__
  27.07%  a.out    a.out        [.] MAIN__
  21.72%  a.out    a.out        [.] __perdida_m_MOD_generalized_hookes_law.constprop.0.
  16.60%  b.out    b.out        [.] __perdida_m_MOD_generalized_hookes_law.constprop.0.
   2.22%  a.out    a.out        [.] __perdida_m_MOD_generalized_hookes_law.constprop.1.
   1.64%  b.out    b.out        [.] __perdida_m_MOD_generalized_hookes_law.constprop.1.
   1.55%  b.out    libc.so.6    [.] __memset_avx2_unaligned_erms
   1.54%  a.out    libc.so.6    [.] __memset_avx2_unaligned_erms
   0.06%  a.out    libm.so.6    [.] __sincos_fma
   0.04%  b.out    libm.so.6    [.] __sincos_fma

b.out is before the patch and a.out is after.  The difference seems to be a
relocated spill store (the write to 0x20(%rsp) below).  Before the patch:

Percent│     0000000000401860 <__perdida_m_MOD_generalized_hookes_
       │     __perdida_m_MOD_generalized_hookes_law.constprop.0.is
  0.10 │       push         %rbp
  0.02 │       mov          %r8,%rax
       │       vmovddup     %xmm0,%xmm5
       │       mov          %rsp,%rbp
  1.22 │       push         %r15
  0.04 │       push         %r14
  0.03 │       push         %r13
  0.09 │       push         %r12
  0.05 │       push         %rbx
  0.03 │       not          %rax
  0.00 │       mov          %rdi,%rbx
       │       and          $0xffffffffffffffe0,%rsp
  1.11 │       mov          %rdx,%r12
       │       sub          $0x180,%rsp
  0.04 │       vmovapd      %xmm5,0x20(%rsp)
               ^^^^ this spill store
  0.04 │       mov          %rax,0x30(%rsp)
  0.02 │       test         %rsi,%rsi
       │     ↓ je           210
       │       mov          %rsi,%rax
       │       mov          %rsi,%r13
  1.16 │       lea          (%rsi,%rsi,1),%r10
  0.01 │       mov          %rsi,%r15
       │       shl          $0x4,%rax
  0.06 │       neg          %r13
       │       lea          (%r10,%rsi,1),%r14
  0.03 │       mov          %rax,0x18(%rsp)
  0.02 │       lea          0x0(,%rsi,8),%rax
       │       mov          %rax,0x10(%rsp)
  1.23 │ 66:   mov          $0x120,%edx
  0.01 │       xor          %esi,%esi
       │       lea          0x60(%rsp),%rdi
  0.07 │       vmovsd       %xmm1,0x38(%rsp)
  0.03 │       vmovsd       %xmm0,0x40(%rsp)
  0.12 │       mov          %r8,0x48(%rsp)
  0.05 │       mov          %rcx,0x50(%rsp)
  0.06 │       sub          %r12,%r13
  1.16 │       mov          %r10,0x58(%rsp)
  0.04 │     → call         memset@plt

and after the patch:

       │     0000000000401870 <__perdida_m_MOD_generalized_hookes_
       │     __perdida_m_MOD_generalized_hookes_law.constprop.0.is
  0.07 │       push         %rbp
  0.01 │       mov          %r8,%rax
       │       vmovddup     %xmm0,%xmm3
       │       mov          %rsp,%rbp
  0.87 │       push         %r15
  0.04 │       push         %r14
  0.02 │       push         %r13
  0.07 │       push         %r12
  0.02 │       push         %rbx
  0.02 │       not          %rax
  0.00 │       mov          %rdi,%rbx
       │       and          $0xffffffffffffffe0,%rsp
  0.87 │       mov          %rdx,%r12
       │       sub          $0x180,%rsp
  0.04 │       mov          %rax,0x58(%rsp)
  0.03 │       test         %rsi,%rsi
       │       je           210
       │       mov          %rsi,%rax
  0.00 │       mov          %rsi,%r13
       │       lea          (%rsi,%rsi,1),%r10
  0.95 │       mov          %rsi,%r15
  0.01 │       shl          $0x4,%rax
       │       neg          %r13
  0.04 │       lea          (%r10,%rsi,1),%r14
  0.04 │       mov          %rax,0x18(%rsp)
  0.01 │       lea          0x0(,%rsi,8),%rax
       │       mov          %rax,0x10(%rsp)
  0.02 │ 60:   mov          $0x120,%edx
  0.89 │       xor          %esi,%esi
  0.01 │       lea          0x60(%rsp),%rdi
       │       vmovsd       %xmm1,0x20(%rsp)
               ^^^^ the store to 0x20(%rsp) is now here
  0.08 │       vmovsd       %xmm0,0x28(%rsp)
  0.04 │       mov          %r8,0x30(%rsp)
  0.01 │       mov          %rcx,0x38(%rsp)
  0.05 │       sub          %r12,%r13
       │       mov          %r10,0x50(%rsp)
  1.04 │       vmovapd      %xmm3,0x40(%rsp)

And later there is somewhat different scheduling.  Before the patch:
  0.12 │       vmovsd       %xmm1,0x108(%rsp)
  1.22 │       vmovsd       %xmm0,0x70(%rsp)
  0.38 │       vmovapd      %xmm4,0xc0(%rsp)
  1.27 │       vmovsd       %xmm0,0xa0(%rsp)
  0.20 │       vmovsd       %xmm1,0x140(%rsp)
  2.41 │       vmovsd       %xmm1,0x178(%rsp)
  2.05 │       vbroadcastsd 0x10(%rcx,%rax,8),%ymm1
  0.10 │       vunpcklpd    %xmm0,%xmm2,%xmm3
       │       vmovsd       %xmm2,0xd0(%rsp)
  0.34 │       vmovapd      %xmm3,0x60(%rsp)
  2.25 │       vunpcklpd    %xmm2,%xmm0,%xmm3
       │       vbroadcastsd -0x8(%rcx,%rdx,8),%ymm2
  0.01 │       vmovapd      %xmm3,0x90(%rsp)
  0.28 │       vbroadcastsd (%rcx,%rdx,8),%ymm3
  0.01 │       vmulpd       0xc0(%rsp),%ymm3,%ymm3
 52.87 │       vmulpd       0xf0(%rsp),%ymm2,%ymm2
  0.06 │       vbroadcastsd (%rcx),%ymm0
       │       vfmadd132pd  0x90(%rsp),%ymm3,%ymm1
  1.77 │       vfmadd132pd  0x60(%rsp),%ymm2,%ymm0
  0.10 │       vmovddup     0x8(%rcx,%rax,8),%xmm2
       │       lea          0x0(%r13,%r12,2),%rax

After:

  0.28 │       vmovsd       %xmm1,0x108(%rsp)
  0.98 │       vmovsd       %xmm0,0x70(%rsp)
  0.04 │       vmovapd      %xmm3,0xc0(%rsp)
  0.99 │       vmovsd       %xmm0,0xa0(%rsp)
  0.26 │       vmovsd       %xmm1,0x140(%rsp)
  1.80 │       vmovsd       %xmm1,0x178(%rsp)
  0.91 │       vbroadcastsd (%rcx,%rdx,8),%ymm3
  0.08 │       vbroadcastsd 0x10(%rcx,%rax,8),%ymm1
  0.07 │       vunpcklpd    %xmm0,%xmm2,%xmm4
  0.02 │       vmovsd       %xmm2,0xd0(%rsp)
  0.93 │       vmulpd       0xc0(%rsp),%ymm3,%ymm3
 42.18 │       vmovapd      %xmm4,0x60(%rsp)
       │       vunpcklpd    %xmm2,%xmm0,%xmm4
       │       vbroadcastsd -0x8(%rcx,%rdx,8),%ymm2
       │       vmulpd       0xf0(%rsp),%ymm2,%ymm2
  0.09 │       vmovapd      %xmm4,0x90(%rsp)
       │       vbroadcastsd (%rcx),%ymm0
       │       vfmadd132pd  0x90(%rsp),%ymm3,%ymm1
 23.48 │       vfmadd132pd  0x60(%rsp),%ymm2,%ymm0
  0.77 │       vmovddup     0x8(%rcx,%rax,8),%xmm2
       │       lea          0x0(%r13,%r12,2),%rax

Perdida is loopless, with only 3 BBs in the optimized dump.  With the old build we get:

  <bb 2> [local count: 25581901]:
  _60 = {ISRA.929_118(D), ISRA.929_118(D)};
  offset.162_6 = ~ISRA.928_112(D);
  if (ISRA.925_113(D) != 0)
    goto <bb 3>; [50.00%]
  else
    goto <bb 4>; [50.00%]

  <bb 3> [local count: 12790951]:
  _226 = -ISRA.925_113(D);
  _228 = ISRA.925_113(D) * 2;
  _230 = ISRA.925_113(D) * 3;
  _232 = ISRA.925_113(D) * 16;
  _234 = (sizetype) _232;
  _236 = ISRA.925_113(D) * 8;
  _238 = (sizetype) _236;

  <bb 4> [local count: 51163802]:
  # iftmp.499_11 = PHI <ISRA.925_113(D)(3), 1(2)>
  # prephitmp_227 = PHI <_226(3), -1(2)>
  # prephitmp_229 = PHI <_228(3), 2(2)>
  # prephitmp_231 = PHI <_230(3), 3(2)>
  # prephitmp_235 = PHI <_234(3), 16(2)>
  # prephitmp_239 = PHI <_238(3), 8(2)>
  offset.166_13 = prephitmp_227 - ISRA.926_115(D);
  generalized_constitutive_tensor = {};
  _17 = .FMA (ISRA.930_119(D), 2.0e+0, ISRA.929_118(D));
  _157 = {ISRA.929_118(D), _17};
  _177 = {_17, ISRA.929_118(D)};

The count of BB 4 should be the same as the count of BB 2, but it is twice as
large.  This originally comes from the vectorizer emitting a vectorized
epilogue that never iterates while still giving it a 50% chance of iterating.
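(Concretely: BB 2 has count 25581901, so BB 4 should too, but the dump above
gives BB 4 count 51163802 = 2 * 25581901, exactly double.)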
After the patch this is corrected:

  <bb 2> [local count: 25581901]:
  _60 = {ISRA.929_118(D), ISRA.929_118(D)};
  offset.162_6 = ~ISRA.928_112(D);
  if (ISRA.925_113(D) != 0)
    goto <bb 3>; [50.00%]
  else
    goto <bb 4>; [50.00%]

  <bb 3> [local count: 12790951]:
  _226 = -ISRA.925_113(D);
  _228 = ISRA.925_113(D) * 2;
  _230 = ISRA.925_113(D) * 3;
  _232 = ISRA.925_113(D) * 16;
  _234 = (sizetype) _232;
  _236 = ISRA.925_113(D) * 8;
  _238 = (sizetype) _236; 

  <bb 4> [local count: 25581901]:
  # iftmp.499_11 = PHI <ISRA.925_113(D)(3), 1(2)>
  # prephitmp_227 = PHI <_226(3), -1(2)>
  # prephitmp_229 = PHI <_228(3), 2(2)>
  # prephitmp_231 = PHI <_230(3), 3(2)>
  # prephitmp_235 = PHI <_234(3), 16(2)>
  # prephitmp_239 = PHI <_238(3), 8(2)>
  offset.166_13 = prephitmp_227 - ISRA.926_115(D);
  generalized_constitutive_tensor = {};
  _17 = .FMA (ISRA.930_119(D), 2.0e+0, ISRA.929_118(D));
  _157 = {ISRA.929_118(D), _17};
  _177 = {_17, ISRA.929_118(D)};
  MEM <vector(2) real(kind=8)> [(real(kind=8) *)&generalized_constitutive_tensor] = _177;
  MEM <vector(2) real(kind=8)> [(real(kind=8) *)&generalized_constitutive_tensor + 48B] = _157;
  MEM <vector(2) real(kind=8)> [(real(kind=8) *)&generalized_constitutive_tensor + 96B] = _60;

So it seems the RTL backend ends up with a worse schedule because of the
different memory allocations.
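
(The follow-up experiment below just appends -minline-all-stringops to the
same build; a sketch, under the same assumptions as the command lines above:

  gfortran -Ofast -march=native -minline-all-stringops fatigue2.f90 -o b.out
)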

The memset is a bit unfortunate here since it requires a lot of spilling.  With
-minline-all-stringops I get, before the patch:

 Performance counter stats for './b.out':

         27,928.16 msec task-clock:u                     #    1.000 CPUs utilized
                  0      context-switches:u               #    0.000 /sec
                  0      cpu-migrations:u                 #    0.000 /sec
                138      page-faults:u                    #    4.941 /sec
    133,992,554,723      cycles:u                         #    4.798 GHz                         (83.33%)
         17,113,198      stalled-cycles-frontend:u        #    0.01% frontend cycles idle        (83.33%)
         10,144,634      stalled-cycles-backend:u         #    0.01% backend cycles idle         (83.33%)
    205,237,551,965      instructions:u                   #    1.53  insn per cycle
                                                          #    0.00  stalled cycles per insn     (83.33%)
      7,665,052,125      branches:u                       #  274.456 M/sec                       (83.34%)
         13,596,346      branch-misses:u                  #    0.18% of all branches             (83.34%)

      27.933007797 seconds time elapsed

      27.928356000 seconds user
       0.000000000 seconds sys


and after the patch:

          30791.26 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               138      page-faults:u                    #    4.482 /sec
      148093969122      cycles:u                         #    4.810 GHz                         (83.33%)
          13660157      stalled-cycles-frontend:u        #    0.01% frontend cycles idle        (83.33%)
            411233      stalled-cycles-backend:u         #    0.00% backend cycles idle         (83.33%)
      204951193376      instructions:u                   #    1.38  insn per cycle
                                                         #    0.00  stalled cycles per insn     (83.33%)
        7664856101      branches:u                       #  248.930 M/sec                       (83.33%)
          12960525      branch-misses:u                  #    0.17% of all branches             (83.34%)

      30.791579163 seconds time elapsed

      30.791441000 seconds user
       0.000000000 seconds sys

So this may be

