public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well
@ 2021-12-06 16:59 hubicka at gcc dot gnu.org
  2021-12-06 17:15 ` [Bug tree-optimization/103585] " hubicka at gcc dot gnu.org
                   ` (14 more replies)
  0 siblings, 15 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-06 16:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

            Bug ID: 103585
           Summary: fatigue2 requires inlining of peridida to work well
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

fatigue2 regresses by 35% when comparing -Ofast -march=native (zen) to -Ofast
-march=native -fno-inline-functions-called-once.

In general, inlining functions called once is not intended to be an important
performance heuristic.  Here we seem to miss some propagation that supports
better vectorization.

The important inline is:
-Inlining perdida.constprop/105 size 703.
- Called once from iztaccihuatl/17 2570 insns.
-  Parm map:  -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4
-Updated mod-ref summary for iztaccihuatl/17
-  loads:
-    Every base
-  stores:
-      Base 0: alias set 6
-        Ref 0: alias set 6
-          Every access
-  Side effects
-  Global memory read
- Inlined into iztaccihuatl/17 which now has 3148 size
-Overall time estimate: 270429.808105 weighted by profile: 0.000000
+Overall time estimate: 93562.854248 weighted by profile: 0.000000

So we estimate perdida at 703 instructions, which is not small, and the function
it is inlined into is already quite large (2570 instructions before inlining,
3148 after).

Before the vectorizer, perdida is still 800 lines of gimple dump (w/o
declarations), but it then simplifies to 630 at the last rerun of dse.

iztaccihuatl is 3323 lines of gimple dump in .optimized when built with perdida
inlined and 2832 lines when built without it, a difference of 491 lines.  So it
does look like something propagates that we do not get from ipa-cp/modref
propagation.

* [Bug tree-optimization/103585] fatigue2 requires inlining of peridida to work well
  2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
@ 2021-12-06 17:15 ` hubicka at gcc dot gnu.org
  2021-12-07  9:31 ` [Bug ipa/103585] " hubicka at gcc dot gnu.org
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-06 17:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mjambor at suse dot cz

--- Comment #1 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
It seems to me that we miss some ipa-cp propagation here.  In particular there
are array descriptors:

  parm.326.span = 8;                                                            
  MEM <c_char[8]> [(struct dtype_type *)&parm.326 + 24B] = {};                  
  parm.326.dtype.elem_len = 8;                                                  
  parm.326.dtype.rank = 2;                                                      
  parm.326.dtype.type = 3;                                                      
  parm.326.dim[0].lbound = 1;                                                   
  parm.326.dim[0].ubound = 3;                                                   
  parm.326.dim[0].stride = 1;                                                   
  parm.326.dim[1].lbound = 1;                                                   
  parm.326.dim[1].ubound = 3;                                                   
  parm.326.dim[1].stride = 3;                                                   
  _245 = _243 * 9;                                                              
  _246 = &MEM <real(kind=8)[0:]> [(real(kind=8)[0:] *)_44][_245];               
  parm.326.data = _246;                                                         
  parm.326.offset = -4;                                                         

and the resulting jump function

        ref offset 64: -4 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
        ref offset 128: 8 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
        ref offset 224: 2 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
        ref offset 232: 3 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
        ref offset 256: 8 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
        ref offset 320: 1 [loc_time: 248, loc_size: 697, prop_time: 248, prop_size: 697]
        ref offset 384: 1 [loc_time: 1, loc_size: 702, prop_time: 1, prop_size: 702]
        ref offset 448: 3 [loc_time: 1, loc_size: 702, prop_time: 1, prop_size: 702]
        ref offset 512: 3 [loc_time: 245, loc_size: 700, prop_time: 245, prop_size: 700]
        ref offset 576: 1 [loc_time: 1, loc_size: 702, prop_time: 1, prop_size: 702]
        ref offset 640: 3 [loc_time: 1, loc_size: 702, prop_time: 1, prop_size: 702]

Here there are 11 constants, while there are 12 constant stores above.
I guess the MEM one is missing.
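
To see which descriptor fields the 11 offsets above correspond to, here is a rough sketch that models the rank-2 array descriptor with Python's ctypes and prints the bit offset of each field.  The field order (data, offset, dtype, span, dim[]) and the members of dtype (elem_len, version, rank, type, attribute) are my assumption about the gfortran descriptor layout on a 64-bit target; the class and helper names are made up for illustration.  Under that assumption the two fields that never show up in the jump function are exactly the ones that are only written by the 8-byte MEM store at +24B.

```python
import ctypes

# Assumed layout of the gfortran array descriptor on a 64-bit target.
# Class names are hypothetical; field names follow the GIMPLE dump above.
class DType(ctypes.Structure):
    _fields_ = [
        ("elem_len", ctypes.c_size_t),
        ("version", ctypes.c_int),      # assumed member, not visible in the dump
        ("rank", ctypes.c_byte),
        ("type", ctypes.c_byte),
        ("attribute", ctypes.c_short),  # assumed member, not visible in the dump
    ]

class Dim(ctypes.Structure):
    _fields_ = [
        ("stride", ctypes.c_ssize_t),
        ("lbound", ctypes.c_ssize_t),
        ("ubound", ctypes.c_ssize_t),
    ]

class Descriptor2(ctypes.Structure):
    _fields_ = [
        ("data", ctypes.c_void_p),
        ("offset", ctypes.c_ssize_t),
        ("dtype", DType),
        ("span", ctypes.c_ssize_t),
        ("dim", Dim * 2),               # rank 2, as in parm.326
    ]

def bit_offsets(struct_type, base=0, prefix=""):
    """Map each scalar field path to its bit offset.

    Recurses into nested structures and into arrays of structures
    (arrays of scalars are not handled; the descriptor has none).
    """
    result = {}
    for name, ftype in struct_type._fields_:
        start = base + getattr(struct_type, name).offset  # byte offset
        path = prefix + name
        if issubclass(ftype, ctypes.Structure):
            result.update(bit_offsets(ftype, start, path + "."))
        elif issubclass(ftype, ctypes.Array):
            elem = ftype._type_
            for i in range(ftype._length_):
                elem_start = start + i * ctypes.sizeof(elem)
                result.update(bit_offsets(elem, elem_start, f"{path}[{i}]."))
        else:
            result[path] = start * 8
    return result

offsets = bit_offsets(Descriptor2)
by_bit = {bit: path for path, bit in offsets.items()}

# Offsets reported in the jump function dump above.
reported = [64, 128, 224, 232, 256, 320, 384, 448, 512, 576, 640]
for bit in reported:
    print(f"ref offset {bit}: {by_bit[bit]}")

# Fields with no entry in the jump function ("data" is not a constant).
missing = sorted(p for p, b in offsets.items()
                 if b not in reported and p != "data")
print("not in jump function:", missing)
```

If the assumed layout is right, bit 224 and 232 are dtype.rank and dtype.type, and the fields without an entry are dtype.version (bit 192) and dtype.attribute (bit 240), both inside the bytes covered by the zeroing MEM store.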

Call is:
                 call perdida (dt, lambda, mu, yield_stress, R_infinity, b, X_infinity,     &
                               gamma, eta, plastic_strain_threshold, stress_tensor(:,:,n),  &
                               strain_tensor(:,:,n), plastic_strain_tensor(:,:,n),          &
                               strain_rate_tensor(:,:,n), accumulated_plastic_strain(n),    &
                               back_stress_tensor(:,:,n), isotropic_hardening_stress(n),    &
                               damage(n), failure_threshold, crack_closure_parameter)

A similar situation repeats itself multiple times.

* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
  2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
  2021-12-06 17:15 ` [Bug tree-optimization/103585] " hubicka at gcc dot gnu.org
@ 2021-12-07  9:31 ` hubicka at gcc dot gnu.org
  2021-12-07  9:49 ` hubicka at gcc dot gnu.org
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-07  9:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Some basic perf profiling shows that perdida accounts for about 36% of the
runtime of the benchmark and has the interesting property that most of the time
it returns without looping (which probably explains PR103592).  It is not clear
to me why one runs faster than the other - we do a lot of spilling in both paths.

The comparison below uses -fno-tree-vectorize -fno-tree-slp-vectorize, since I
find the resulting code easier to parse:

  36.91%  a.out-slow  a.out-slow            [.] __perdida_m_MOD_perdida.constprop.0
  34.15%  a.out-fast  a.out-fast            [.] MAIN__
  17.96%  a.out-slow  a.out-slow            [.] MAIN__
   7.37%  a.out-slow  a.out-slow            [.] __perdida_m_MOD_generalized_hookes_law.constprop.0.isra.0
   2.43%  a.out-fast  a.out-fast            [.] __perdida_m_MOD_generalized_hookes_law.constprop.0.isra.0
   0.24%  a.out-slow  a.out-slow            [.] __perdida_m_MOD_generalized_hookes_law.constprop.1.isra.0
   0.21%  a.out-fast  a.out-fast            [.] __perdida_m_MOD_generalized_hookes_law.constprop.1.isra.0
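
The report above appears to mix samples from both binaries, so each percentage is a share of the combined profile rather than of one run.  As a rough aid for reading it, the sketch below just sums the listed rows per command; the numbers are copied from the report, the helper name is made up, and rows not shown in the report are of course not counted.

```python
# Rows copied from the perf report above: (percent, command, symbol).
rows = [
    (36.91, "a.out-slow", "__perdida_m_MOD_perdida.constprop.0"),
    (34.15, "a.out-fast", "MAIN__"),
    (17.96, "a.out-slow", "MAIN__"),
    (7.37, "a.out-slow", "__perdida_m_MOD_generalized_hookes_law.constprop.0.isra.0"),
    (2.43, "a.out-fast", "__perdida_m_MOD_generalized_hookes_law.constprop.0.isra.0"),
    (0.24, "a.out-slow", "__perdida_m_MOD_generalized_hookes_law.constprop.1.isra.0"),
    (0.21, "a.out-fast", "__perdida_m_MOD_generalized_hookes_law.constprop.1.isra.0"),
]

def share_per_command(rows):
    """Sum the listed sample percentages for each command (binary)."""
    totals = {}
    for percent, command, _symbol in rows:
        totals[command] = totals.get(command, 0.0) + percent
    return totals

totals = share_per_command(rows)
for command, percent in sorted(totals.items()):
    print(f"{command}: {percent:.2f}% of the combined samples")
```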

perdida:
       │                                                             ◆
       │     0000000000402500 <__perdida_m_MOD_perdida.constprop.0>: ▒
       │     __perdida_m_MOD_perdida.constprop.0():                  ▒
  1.37 │       push         %rbp                                     ▒
  0.00 │       vxorpd       %xmm13,%xmm13,%xmm13                     ▒
       │       mov          %rsp,%rbp                                ▒
  0.00 │       push         %r15                                     ▒
  2.07 │       push         %r14                                     ▒
  0.01 │       mov          %rdx,%r15                                ▒
  0.00 │       push         %r13                                     ▒
  0.00 │       push         %r12                                     ▒
  0.69 │       push         %rbx                                     ▒
  0.01 │       sub          $0x3c8,%rsp                              ▒
  0.01 │       mov          0x30(%rbp),%rax                          ▒
  0.02 │       vmovsd       (%rsi),%xmm6                             ▒
  0.01 │       mov          %rcx,-0x300(%rbp)                        ▒
  2.01 │       vmovsd       (%rdx),%xmm1                             ▒
  0.00 │       mov          %rdi,-0x380(%rbp)                        ▒
  0.01 │       mov          %r8,-0x388(%rbp)                         ▒
  0.04 │       lea          -0x240(%rbp),%rdi                        ▒
  0.00 │       mov          %r9,-0x390(%rbp)                         ▒
  0.66 │       mov          %rsi,-0x308(%rbp)                        ▒
  0.03 │       lea          -0x2a0(%rbp),%rsi                        ▒
  0.00 │       movq         $0x1,-0x218(%rbp)                        ▒
  0.01 │       movq         $0x3,-0x200(%rbp)                        ▒
  0.66 │       movq         $0x3,-0x260(%rbp)                        ▒
  0.02 │       mov          (%rax),%rbx                              ▒
  0.06 │       mov          0x38(%rbp),%rax                          ▒
  0.00 │       vmovsd       %xmm6,-0x3d8(%rbp)                       ▒
  0.08 │       vmovsd       %xmm1,-0x3e0(%rbp)                       ▒
  0.65 │       mov          %rbx,-0x240(%rbp)                        ▒
  0.74 │       mov          (%rax),%r14                              ▒
  0.67 │       mov          0x40(%rbp),%rax                          ▒
  0.01 │       mov          (%rax),%r13                              ▒
  0.74 │       mov          0x58(%rbp),%rax                          ▒
  0.00 │       mov          (%rax),%r12                              ▒
  0.70 │       mov          0x48(%rbp),%rax                          ▒
  0.00 │       mov          (%rax),%rax                              ▒
  0.69 │       mov          %rax,-0x310(%rbp)                        ▒
  0.00 │       mov          %rsp,%rax                                ▒
  0.00 │       sub          $0x50,%rsp                               ▒
       │       mov          %rsp,%rcx                                ▒
  0.00 │       mov          %rax,-0x2f8(%rbp)                        ▒
  0.01 │       mov          %rcx,-0x2a0(%rbp)                        ▒
  0.01 │       vmovsd       (%r14),%xmm0                             ▒
  0.68 │       vsubsd       0x0(%r13),%xmm0,%xmm0                    ▒
  2.85 │       vmovsd       %xmm0,(%rsp)                             ▒
  2.09 │       vmovsd       0x8(%r14),%xmm0                          ▒
  0.00 │       vsubsd       0x8(%r13),%xmm0,%xmm0                    ▒
  0.01 │       vmovsd       %xmm0,0x8(%rcx)                          ▒
  0.70 │       vmovsd       0x10(%r14),%xmm0                         ▒
  0.01 │       vsubsd       0x10(%r13),%xmm0,%xmm0                   ▒
  0.02 │       vmovsd       %xmm0,0x10(%rcx)                         ▒
  0.69 │       vmovsd       0x18(%r14),%xmm0                         ▒
  0.00 │       vsubsd       0x18(%r13),%xmm0,%xmm0                   ▒
  0.59 │       vmovsd       %xmm0,0x18(%rcx)                         ▒
  1.83 │       vmovsd       0x20(%r14),%xmm0                         ▒
  0.00 │       vsubsd       0x20(%r13),%xmm0,%xmm0                   ▒
  0.02 │       vmovsd       %xmm0,0x20(%rcx)                         ▒
  0.72 │       vmovsd       0x28(%r14),%xmm0                         ◆
  0.00 │       vsubsd       0x28(%r13),%xmm0,%xmm0                   ▒
  0.01 │       vmovsd       %xmm0,0x28(%rcx)                         ▒
  0.13 │       vmovsd       0x30(%r14),%xmm0                         ▒
  0.01 │       vsubsd       0x30(%r13),%xmm0,%xmm0                   ▒
  0.01 │       vmovsd       %xmm0,0x30(%rcx)                         ▒
  0.70 │       vmovsd       0x38(%r14),%xmm0                         ▒
  0.00 │       vsubsd       0x38(%r13),%xmm0,%xmm0                   ▒
  0.02 │       vmovsd       %xmm0,0x38(%rcx)                         ▒
  0.69 │       vmovsd       0x40(%r14),%xmm0                         ▒
  0.00 │       vsubsd       0x40(%r13),%xmm0,%xmm0                   ▒
  0.01 │       vmovsd       %xmm0,0x40(%rcx)                         ▒
  0.70 │       vmovsd       %xmm6,%xmm6,%xmm0                        ▒
  0.00 │     → call         __perdida_m_MOD_generalized_hookes_law.co▒
  0.00 │       mov          0x68(%rbp),%rax                          ▒
  0.00 │       vmovsd       delim.4+0xc,%xmm1                        ▒
  0.00 │       mov          -0x2f8(%rbp),%rsp                        ▒
  0.01 │       mov          -0x308(%rbp),%r10                        ▒
  0.01 │       vmovsd       (%rax),%xmm4                             ▒
  0.01 │       vsubsd       %xmm4,%xmm1,%xmm8                        ▒
  0.04 │       vmulsd       (%rbx),%xmm8,%xmm2                       ▒
  1.46 │       vmulsd       0x20(%rbx),%xmm8,%xmm1                   ▒
  3.40 │       vcomisd      %xmm13,%xmm4                             ▒
  0.00 │       vmulsd       0x40(%rbx),%xmm8,%xmm0                   ▒
  0.75 │       vmulsd       0x8(%rbx),%xmm8,%xmm9                    ▒
  0.11 │       vmulsd       0x10(%rbx),%xmm8,%xmm6                   ▒
  0.13 │       vmulsd       0x18(%rbx),%xmm8,%xmm11                  ▒
  0.00 │       vaddsd       %xmm1,%xmm2,%xmm3                        ▒
  0.01 │       vmovsd       %xmm2,(%rbx)                             ▒
  0.01 │       vmulsd       0x28(%rbx),%xmm8,%xmm5                   ▒
  0.02 │       vmovsd       %xmm1,0x20(%rbx)                         ▒
  0.00 │       vmulsd       0x30(%rbx),%xmm8,%xmm10                  ▒
  0.60 │       vaddsd       %xmm0,%xmm3,%xmm3                        ▒
  1.50 │       vmovsd       %xmm9,0x8(%rbx)                          ▒
  0.01 │       vmulsd       0x38(%rbx),%xmm8,%xmm7                   ▒
  0.00 │       vmovsd       %xmm0,0x40(%rbx)                         ▒
       │       vmovsd       %xmm6,0x10(%rbx)                         ▒
  0.00 │       vmovsd       %xmm11,0x18(%rbx)                        ▒
  0.01 │       vmovsd       %xmm5,0x28(%rbx)                         ▒
  0.02 │       vmovsd       %xmm10,0x30(%rbx)                        ▒
  0.02 │       vmovsd       %xmm7,0x38(%rbx)                         ▒
  0.69 │     ↓ jbe          207                                      ▒
  0.00 │       vcomisd      0xd6e2f(%rip),%xmm3        # 4d9530 <deli▒
  0.86 │     ↓ jb           cf3                                      ▒
  0.24 │207:   vmovsd       delim.4+0x1c,%xmm7                       ▒
  0.01 │       mov          delim.4+0xc,%rax                         ▒
  0.00 │       mov          %rax,-0x330(%rbp)                        ▒
  0.00 │       vmulsd       %xmm7,%xmm3,%xmm6                        ▒
  1.78 │       vmovsd       %xmm6,-0x2f8(%rbp)                       ▒
  0.45 │229:   vmovsd       %xmm7,-0x3a8(%rbp)                       ▒
  0.00 │       vmovsd       -0x2f8(%rbp),%xmm7                       ▒
  5.12 │       vmovsd       0x8(%rbx),%xmm5                          ▒
  0.02 │       vmovsd       0x10(%rbx),%xmm4                         ▒
  0.04 │       vmovsd       0x18(%rbx),%xmm3                         ▒
  0.01 │       vmovsd       0x28(%rbx),%xmm15                        ▒
  0.03 │       vmovsd       0x38(%rbx),%xmm9                         ◆
  0.04 │       vmovsd       0x30(%rbx),%xmm14                        ▒
  0.03 │       vsubsd       %xmm7,%xmm2,%xmm6                        ▒
  2.00 │       vmovsd       delim.4+0xc,%xmm2                        ▒
  0.00 │       vsubsd       %xmm7,%xmm1,%xmm1                        ▒
  0.69 │       vsubsd       %xmm7,%xmm0,%xmm7                        ▒
  0.69 │       vmovsd       %xmm5,-0x350(%rbp)                       ▒
  0.00 │       vmovsd       %xmm4,-0x358(%rbp)                       ▒
  0.01 │       vmovsd       %xmm3,-0x360(%rbp)                       ▒
  0.01 │       vmovsd       %xmm6,-0x338(%rbp)                       ▒
  0.68 │       vmovsd       %xmm1,-0x340(%rbp)                       ▒
  0.69 │       vmovsd       %xmm9,-0x378(%rbp)                       ▒
  0.01 │       vmovsd       %xmm15,-0x368(%rbp)                      ▒
  0.02 │       vmovsd       %xmm14,-0x370(%rbp)                      ▒
  0.01 │       vmovsd       %xmm7,-0x348(%rbp)                       ▒
  0.68 │       vdivsd       %xmm8,%xmm2,%xmm8                        ▒
  0.03 │       vmulsd       %xmm5,%xmm8,%xmm11                       ▒
  0.03 │       vsubsd       0x8(%r12),%xmm11,%xmm0                   ▒
  0.01 │       vmulsd       %xmm4,%xmm8,%xmm5                        ▒
  0.00 │       vmulsd       %xmm6,%xmm8,%xmm6                        ▒
  0.04 │       vsubsd       (%r12),%xmm6,%xmm4                       ▒
  0.72 │       vmulsd       %xmm3,%xmm8,%xmm12                       ▒
  0.00 │       vmovsd       %xmm5,-0x318(%rbp)                       ▒
  0.02 │       vmulsd       %xmm1,%xmm8,%xmm10                       ▒
  0.00 │       vsubsd       0x18(%r12),%xmm12,%xmm1                  ▒
  0.01 │       vmulsd       %xmm15,%xmm8,%xmm3                       ▒
  0.00 │       vmovsd       %xmm6,-0x320(%rbp)                       ▒
  0.02 │       vmulsd       %xmm0,%xmm0,%xmm0                        ▒
  0.10 │       vmulsd       %xmm9,%xmm8,%xmm2                        ▒
  0.64 │       vsubsd       0x28(%r12),%xmm3,%xmm9                   ▒
  0.01 │       vmovsd       %xmm3,-0x308(%rbp)                       ▒
  0.01 │       vfmadd132sd  %xmm4,%xmm0,%xmm4                        ▒
  2.70 │       vsubsd       0x10(%r12),%xmm5,%xmm0                   ▒
  0.00 │       vmulsd       %xmm14,%xmm8,%xmm15                      ▒
  0.00 │       vmovsd       %xmm2,-0x3a0(%rbp)                       ▒
  0.00 │       vmulsd       %xmm7,%xmm8,%xmm8                        ▒
  0.00 │       vmulsd       %xmm1,%xmm1,%xmm1                        ▒
  0.01 │       vmulsd       %xmm9,%xmm9,%xmm9                        ▒
  0.01 │       vfmadd132sd  %xmm0,%xmm1,%xmm0                        ▒
  0.06 │       vsubsd       0x20(%r12),%xmm10,%xmm1                  ▒
  0.66 │       vfmadd132sd  %xmm1,%xmm9,%xmm1                        ▒
  0.06 │       vsubsd       0x30(%r12),%xmm15,%xmm9                  ▒
  0.00 │       vaddsd       %xmm0,%xmm1,%xmm1                        ▒
  2.07 │       vsubsd       0x38(%r12),%xmm2,%xmm0                   ▒
  0.00 │       vmulsd       %xmm0,%xmm0,%xmm0                        ▒
  0.01 │       vfmadd231sd  %xmm9,%xmm9,%xmm0                        ▒
  0.08 │       vsubsd       0x40(%r12),%xmm8,%xmm9                   ▒
  0.00 │       vfmadd132sd  %xmm9,%xmm4,%xmm9                        ▒
  0.65 │       vaddsd       %xmm9,%xmm0,%xmm0                        ▒
  2.04 │       vaddsd       %xmm1,%xmm0,%xmm0                        ▒
  2.09 │       vmulsd       delim.4+0x24,%xmm0,%xmm14                ▒
  3.47 │       mov          -0x300(%rbp),%rax                        ▒
  0.00 │       vmovsd       (%rax),%xmm9                             ▒
  0.00 │       mov          0x60(%rbp),%rax                          ▒
  0.00 │       vaddsd       (%rax),%xmm9,%xmm4                       ▒
  0.01 │       vsqrtsd      %xmm14,%xmm14,%xmm14                     ▒
 13.94 │       vsubsd       %xmm4,%xmm14,%xmm4                       ▒
  0.71 │       vcomisd      %xmm4,%xmm13                             ▒
  2.88 │       vmovsd       %xmm4,-0x398(%rbp)                       ▒
  0.00 │     ↓ ja           3b0                                      ▒
  0.16 │       vcomisd      0xd6c9a(%rip),%xmm0        # 4d9548 <deli◆
  0.04 │     ↓ jae          3bf                                      ▒
  1.30 │3b0:   lea          -0x28(%rbp),%rsp                         ▒
  0.00 │       pop          %rbx                                     ▒
  0.00 │       pop          %r12                                     ▒
  0.00 │       pop          %r13                                     ▒
  0.00 │       pop          %r14                                     ▒
  0.00 │       pop          %r15                                     ▒
  0.02 │       pop          %rbp                                     ▒
  0.01 │     ← ret                                                   ▒

inlined variant:
Percent│        vmovsd       (%r14),%xmm0                            ▒
  0.01 │        mov          %rsp,%rax                               ▒
  0.01 │        sub          $0x50,%rsp                              ▒
  0.00 │        vmovsd       -0x448(%rbp),%xmm5                      ▒
  0.06 │        vsubsd       0x0(%r13),%xmm0,%xmm0                   ▒
  0.71 │        mov          %rsp,%rdx                               ▒
  0.02 │        vmovsd       -0x440(%rbp),%xmm1                      ▒
  0.00 │        mov          %rcx,-0x508(%rbp)                       ▒
  0.00 │        mov          %r10,-0x4e8(%rbp)                       ▒
  0.01 │        mov          %rax,-0x4e0(%rbp)                       ◆
  0.00 │        lea          -0x250(%rbp),%rsi                       ▒
       │        lea          -0x2b0(%rbp),%rdi                       ▒
  0.05 │        mov          %rdx,-0x250(%rbp)                       ▒
  0.67 │        movq         $0x1,-0x288(%rbp)                       ▒
  0.02 │        vxorpd       %xmm13,%xmm13,%xmm13                    ▒
  0.00 │        movq         $0x3,-0x270(%rbp)                       ▒
  0.00 │        mov          %r12,-0x2b0(%rbp)                       ▒
  0.01 │        movq         $0x3,-0x210(%rbp)                       ▒
  0.00 │        vmovsd       %xmm5,-0x618(%rbp)                      ▒
  0.00 │        vmovsd       %xmm1,-0x620(%rbp)                      ▒
  0.06 │        vmovsd       %xmm0,(%rsp)                            ▒
  0.69 │        vmovsd       0x8(%r14),%xmm0                         ▒
  0.03 │        vsubsd       0x8(%r13),%xmm0,%xmm0                   ▒
  0.01 │        vmovsd       %xmm0,0x8(%rdx)                         ▒
  0.01 │        vmovsd       0x10(%r14),%xmm0                        ▒
  0.02 │        vsubsd       0x10(%r13),%xmm0,%xmm0                  ▒
  0.00 │        vmovsd       %xmm0,0x10(%rdx)                        ▒
  0.01 │        vmovsd       0x18(%r14),%xmm0                        ▒
  0.06 │        vsubsd       0x18(%r13),%xmm0,%xmm0                  ▒
  0.69 │        vmovsd       %xmm0,0x18(%rdx)                        ▒
  0.03 │        vmovsd       0x20(%r14),%xmm0                        ▒
  0.01 │        vsubsd       0x20(%r13),%xmm0,%xmm0                  ▒
  0.01 │        vmovsd       %xmm0,0x20(%rdx)                        ▒
  0.02 │        vmovsd       0x28(%r14),%xmm0                        ▒
  0.00 │        vsubsd       0x28(%r13),%xmm0,%xmm0                  ▒
  0.00 │        vmovsd       %xmm0,0x28(%rdx)                        ▒
  0.06 │        vmovsd       0x30(%r14),%xmm0                        ▒
  0.67 │        vsubsd       0x30(%r13),%xmm0,%xmm0                  ▒
  0.03 │        vmovsd       %xmm0,0x30(%rdx)                        ▒
  0.01 │        vmovsd       0x38(%r14),%xmm0                        ▒
  0.00 │        vsubsd       0x38(%r13),%xmm0,%xmm0                  ▒
  0.02 │        vmovsd       %xmm0,0x38(%rdx)                        ▒
  0.01 │        vmovsd       0x40(%r14),%xmm0                        ▒
  0.00 │        vsubsd       0x40(%r13),%xmm0,%xmm0                  ▒
  0.06 │        vmovsd       %xmm0,0x40(%rdx)                        ▒
  0.68 │        vmovsd       %xmm5,%xmm5,%xmm0                       ▒
  0.01 │      → call         __perdida_m_MOD_generalized_hookes_law.c▒
  0.00 │        mov          -0x510(%rbp),%rax                       ▒
  0.00 │        vmovsd       delim.4+0x1c,%xmm6                      ▒
  0.00 │        mov          -0x4e0(%rbp),%rsp                       ▒
  0.05 │        mov          -0x4e8(%rbp),%r10                       ▒
  0.01 │        mov          -0x508(%rbp),%rcx                       ▒
  0.07 │        vmovsd       -0x10(%rax,%r15,8),%xmm4                ▒
  0.64 │        vsubsd       %xmm4,%xmm6,%xmm8                       ▒
  0.06 │        vmulsd       0x8(%r12),%xmm8,%xmm6                   ▒
  0.40 │        vmulsd       0x10(%r12),%xmm8,%xmm7                  ▒
  0.25 │        vcomisd      %xmm13,%xmm4                            ▒
  0.02 │        vmovsd       %xmm4,-0x570(%rbp)                      ▒
  0.00 │        vmulsd       0x18(%r12),%xmm8,%xmm5                  ▒
  0.10 │        vmulsd       (%r12),%xmm8,%xmm2                      ▒
  0.54 │        vmulsd       0x20(%r12),%xmm8,%xmm1                  ▒
  0.85 │        vmulsd       0x40(%r12),%xmm8,%xmm0                  ▒
  0.44 │        vmovsd       %xmm6,-0x4f0(%rbp)                      ▒
  0.04 │        vmovsd       %xmm6,0x8(%r12)                         ▒
  0.10 │        vmovsd       %xmm7,-0x4f8(%rbp)                      ▒
  0.01 │        vmovsd       %xmm7,0x10(%r12)                        ◆
  0.14 │        vmulsd       0x28(%r12),%xmm8,%xmm6                  ▒
  0.19 │        vmulsd       0x38(%r12),%xmm8,%xmm7                  ▒
  0.25 │        vmovsd       %xmm5,-0x500(%rbp)                      ▒
  0.03 │        vmovsd       %xmm5,0x18(%r12)                        ▒
  0.37 │        vmovsd       %xmm2,(%r12)                            ▒
  0.03 │        vmulsd       0x30(%r12),%xmm8,%xmm5                  ▒
  0.09 │        vaddsd       %xmm1,%xmm2,%xmm3                       ▒
  0.00 │        vmovsd       %xmm1,0x20(%r12)                        ▒
  0.19 │        vmovsd       %xmm0,0x40(%r12)                        ▒
  0.16 │        vaddsd       %xmm0,%xmm3,%xmm3                       ▒
  0.76 │        vmovsd       %xmm6,0x28(%r12)                        ▒
  0.09 │        vmovsd       %xmm7,0x38(%r12)                        ▒
  0.10 │        vmovsd       %xmm5,0x30(%r12)                        ▒
  0.10 │      ↓ jbe          b51                                     ▒
  0.01 │        vcomisd      0xd17ed(%rip),%xmm3        # 4d8578 <del▒
  0.55 │      ↓ jb           1a11                                    ▒
  0.23 │ b51:   vmovsd       delim.4+0x64,%xmm4                      ▒
  0.01 │        vmulsd       %xmm4,%xmm3,%xmm3                       ▒
  1.38 │        vmovsd       %xmm4,-0x5d8(%rbp)                      ▒
  0.01 │        vmovsd       %xmm3,-0x508(%rbp)                      ▒
  0.48 │ b6d:   vsubsd       %xmm3,%xmm2,%xmm4                       ▒
  0.02 │        vsubsd       %xmm3,%xmm1,%xmm2                       ▒
  0.57 │        vmovsd       %xmm5,-0x600(%rbp)                      ▒
  0.15 │        vmovsd       %xmm6,-0x608(%rbp)                      ▒
  0.18 │        vsubsd       %xmm3,%xmm0,%xmm1                       ▒
  0.52 │        vmovsd       %xmm7,-0x5f8(%rbp)                      ▒
  0.30 │        vmovsd       delim.4+0x1c,%xmm3                      ▒
  0.09 │        vmovsd       %xmm4,-0x578(%rbp)                      ▒
  0.95 │        vmovsd       %xmm2,-0x580(%rbp)                      ▒
  0.73 │        vmovsd       %xmm1,-0x588(%rbp)                      ▒
  0.55 │        vdivsd       %xmm8,%xmm3,%xmm0                       ▒
  5.95 │        vmulsd       -0x4f0(%rbp),%xmm0,%xmm13               ▒
  2.14 │        vmulsd       -0x4f8(%rbp),%xmm0,%xmm12               ▒
  0.65 │        vmulsd       %xmm0,%xmm5,%xmm10                      ▒
  0.09 │        vmulsd       %xmm6,%xmm0,%xmm3                       ▒
  0.01 │        vmovsd       0x8(%rbx),%xmm5                         ▒
  0.01 │        vmulsd       %xmm7,%xmm0,%xmm6                       ▒
  0.68 │        vmulsd       %xmm1,%xmm0,%xmm1                       ▒
  0.15 │        vmovsd       (%rbx),%xmm7                            ▒
  0.00 │        vmovsd       %xmm3,-0x5e0(%rbp)                      ▒
  0.01 │        vmulsd       %xmm4,%xmm0,%xmm9                       ▒
  0.69 │        vmulsd       %xmm2,%xmm0,%xmm14                      ▒
  0.10 │        vmovsd       %xmm1,-0x4e8(%rbp)                      ▒
  0.05 │        vmovsd       %xmm6,-0x4e0(%rbp)                      ▒
  0.62 │        vmovsd       %xmm5,-0x520(%rbp)                      ▒
  0.05 │        vmovsd       0x10(%rbx),%xmm6                        ▒
  0.00 │        vmulsd       -0x500(%rbp),%xmm0,%xmm4                ▒
  0.11 │        vsubsd       %xmm5,%xmm13,%xmm0                      ▒
  0.01 │        vsubsd       %xmm7,%xmm9,%xmm2                       ▒
  0.09 │        vmovsd       %xmm9,-0x5c0(%rbp)                      ▒
  0.70 │        vmovsd       %xmm7,-0x518(%rbp)                      ▒
  0.06 │        vmovsd       %xmm14,-0x5c8(%rbp)                     ▒
  0.64 │        vmulsd       %xmm0,%xmm0,%xmm1                       ▒
  0.06 │        vmovsd       %xmm0,-0x5a8(%rbp)                      ▒
  0.71 │        vmovsd       %xmm2,-0x5a0(%rbp)                      ▒
  0.72 │        vfmadd231sd  %xmm2,%xmm2,%xmm1                       ◆
  0.81 │        vmovsd       0x18(%rbx),%xmm2                        ▒
  0.01 │        vsubsd       %xmm6,%xmm12,%xmm15                     ▒
  0.00 │        vmovsd       %xmm6,-0x528(%rbp)                      ▒
  0.05 │        vmovsd       %xmm15,-0x5b0(%rbp)                     ▒
  0.07 │        vmovsd       %xmm4,-0x5f0(%rbp)                      ▒
  0.63 │        vmovsd       %xmm2,-0x530(%rbp)                      ▒
  0.07 │        vsubsd       %xmm2,%xmm4,%xmm2                       ▒
  0.01 │        vmovsd       0x20(%rbx),%xmm4                        ▒
  0.10 │        vmulsd       %xmm2,%xmm2,%xmm0                       ▒
  0.06 │        vmovsd       %xmm2,-0x5e8(%rbp)                      ▒
  0.62 │        vfmadd231sd  %xmm15,%xmm15,%xmm0                     ▒
  1.14 │        vmovsd       %xmm4,-0x538(%rbp)                      ▒
  0.07 │        vsubsd       %xmm4,%xmm14,%xmm14                     ▒
  0.00 │        vaddsd       %xmm0,%xmm1,%xmm1                       ▒
  2.07 │        vmovsd       0x28(%rbx),%xmm0                        ▒
  0.00 │        vsubsd       %xmm0,%xmm3,%xmm15                      ▒
  0.00 │        vmovsd       %xmm0,-0x540(%rbp)                      ▒
  0.04 │        vmovsd       0x30(%rbx),%xmm3                        ▒
  0.00 │        vmovsd       -0x4e8(%rbp),%xmm4                      ▒
  0.01 │        vmovsd       %xmm10,-0x5d0(%rbp)                     ▒
  0.05 │        vmovsd       0x38(%rbx),%xmm9                        ▒
  0.00 │        vmovsd       -0x4e0(%rbp),%xmm7                      ▒
  0.69 │        vmulsd       %xmm15,%xmm15,%xmm0                     ▒
  0.00 │        mov          -0x568(%rbp),%rax                       ▒
  0.00 │        vfmadd231sd  %xmm14,%xmm14,%xmm0                     ▒
  0.02 │        vsubsd       %xmm3,%xmm10,%xmm11                     ▒
  0.00 │        vmovsd       0x40(%rbx),%xmm10                       ▒
  0.00 │        vmovsd       %xmm3,-0x548(%rbp)                      ▒
  0.05 │        vmovsd       %xmm9,-0x550(%rbp)                      ▒
  0.06 │        vsubsd       %xmm9,%xmm7,%xmm3                       ▒
  0.68 │        vfmadd231sd  %xmm11,%xmm11,%xmm0                     ▒
  0.19 │        vmovsd       %xmm3,-0x5b8(%rbp)                      ▒
  0.06 │        vmovsd       %xmm10,-0x558(%rbp)                     ▒
  0.06 │        vsubsd       %xmm10,%xmm4,%xmm10                     ▒
  0.00 │        vxorpd       %xmm4,%xmm4,%xmm4                       ▒
  0.00 │        vaddsd       %xmm0,%xmm1,%xmm1                       ▒
  0.91 │        vmulsd       %xmm10,%xmm10,%xmm0                     ▒
  0.02 │        vfmadd231sd  %xmm3,%xmm3,%xmm0                       ▒
  0.78 │        vaddsd       %xmm1,%xmm0,%xmm0                       ▒
  2.21 │        vmovsd       -0x10(%rax,%r15,8),%xmm1                ▒
  0.01 │        vmulsd       delim.4+0x6c,%xmm0,%xmm9                ▒
  3.77 │        vmovsd       %xmm1,-0x590(%rbp)                      ▒
  0.00 │        vaddsd       -0x3f8(%rbp),%xmm1,%xmm1                ▒
  0.00 │        vsqrtsd      %xmm9,%xmm9,%xmm9                       ▒
 15.05 │        vsubsd       %xmm1,%xmm9,%xmm5                       ▒
  0.81 │        vmovsd       %xmm1,-0x560(%rbp)                      ▒
  0.00 │        vcomisd      %xmm5,%xmm4                             ▒
  3.48 │        vmovsd       %xmm5,-0x598(%rbp)                      ▒
  0.00 │      ↑ ja           910                                     ▒
  0.17 │        vcomisd      0xd15d4(%rip),%xmm0        # 4d8590 <del▒
  0.07 │      ↑ jb           910                                     ▒

This looks very similar to me, except for the bookkeeping moves.
With perf stat I get:

 Performance counter stats for './a.out-slow':

         37,676.01 msec task-clock                #    1.000 CPUs utilized      
               113      context-switches          #    2.999 /sec               
                24      cpu-migrations            #    0.637 /sec               
                70      page-faults               #    1.858 /sec               
   142,922,291,843      cycles                    #    3.793 GHz                
       128,124,341      stalled-cycles-frontend   #    0.09% frontend cycles idle
   110,415,979,659      stalled-cycles-backend    #   77.26% backend cycles idle
   260,215,601,013      instructions              #    1.82  insn per cycle
                                                  #    0.42  stalled cycles per insn
     9,809,351,864      branches                  #  260.361 M/sec              
        23,585,751      branch-misses             #    0.24% of all branches    

      37.682154164 seconds time elapsed

      37.677998000 seconds user
       0.000000000 seconds sys


 Performance counter stats for './a.out-fast':



         23,892.07 msec task-clock                #    1.000 CPUs utilized      
                36      context-switches          #    1.507 /sec               
                13      cpu-migrations            #    0.544 /sec               
                70      page-faults               #    2.930 /sec               
    84,630,483,861      cycles                    #    3.542 GHz                
       105,428,600      stalled-cycles-frontend   #    0.12% frontend cycles idle
    59,095,901,330      stalled-cycles-backend    #   69.83% backend cycles idle
   191,156,427,437      instructions              #    2.26  insn per cycle
                                                  #    0.31  stalled cycles per insn
     8,612,504,734      branches                  #  360.475 M/sec              
        23,436,092      branch-misses             #    0.27% of all branches    

      23.902460280 seconds time elapsed

      23.885982000 seconds user
       0.008000000 seconds sys

So the slow binary executes about 36% more instructions (260.2G vs. 191.2G),
which I could imagine to be mostly the moves.
Perhaps perdida is simply very expensive to call given the job it does:

  0.00 │16d0:   mov          0x38(%rsp),%rax                         ▒
  0.01 │        mov          %r10,0x18(%rsp)                         ▒
  0.00 │        mov          %r11,0x2f8(%rsp)                        ▒
  0.06 │        mov          %r11,0x358(%rsp)                        ▒
  0.03 │        mov          %r11,0x3b8(%rsp)                        ▒
  1.37 │        movq         $0xfffffffffffffffc,0x2e8(%rsp)         ▒
  0.01 │        movq         $0x8,0x2f0(%rsp)                        ▒
  0.01 │        movq         $0x8,0x300(%rsp)                        ◆
  0.01 │        movq         $0x1,0x308(%rsp)                        ▒
  0.02 │        add          %rbp,%rax                               ▒
  0.00 │        movq         $0x1,0x310(%rsp)                        ▒
  0.05 │        movq         $0x3,0x318(%rsp)                        ▒
  0.03 │        mov          %rax,0x2e0(%rsp)                        ▒
  1.34 │        movq         $0x3,0x320(%rsp)                        ▒
  0.03 │        movq         $0x1,0x328(%rsp)                        ▒
  0.01 │        movq         $0x3,0x330(%rsp)                        ▒
  0.03 │        movq         $0xfffffffffffffffc,0x348(%rsp)         ▒
  0.01 │        movq         $0x8,0x350(%rsp)                        ▒
  0.02 │        movq         $0x8,0x360(%rsp)                        ▒
  0.05 │        movq         $0x1,0x368(%rsp)                        ▒
  0.04 │        movq         $0x1,0x370(%rsp)                        ▒
  1.35 │        movq         $0x3,0x378(%rsp)                        ▒
  0.03 │        movq         $0x3,0x380(%rsp)                        ▒
  0.01 │        movq         $0x1,0x388(%rsp)                        ▒
  0.26 │        movq         $0x3,0x390(%rsp)                        ▒
  0.02 │        movq         $0xfffffffffffffffc,0x3a8(%rsp)         ▒
  0.26 │        movq         $0x8,0x3b0(%rsp)                        ▒
  0.05 │        movq         $0x8,0x3c0(%rsp)                        ▒
  0.09 │        movq         $0x1,0x3c8(%rsp)                        ▒
  1.34 │        mov          0x28(%rsp),%rax                         ▒
  0.02 │        add          %rbp,%rax                               ▒
  0.01 │        mov          %rax,0x340(%rsp)                        ▒
  0.02 │        mov          0x40(%rsp),%rax                         ▒
  0.01 │        add          %rbp,%rax                               ▒
  0.00 │        mov          %rax,0x3a0(%rsp)                        ▒
  0.09 │        movq         $0x1,0x3d0(%rsp)                        ▒
  0.04 │        mov          0x30(%rsp),%rax                         ▒
  1.35 │        mov          %r11,0x418(%rsp)                        ▒
  0.03 │        mov          %r11,0x478(%rsp)                        ▒
  0.02 │        movq         $0x3,0x3d8(%rsp)                        ▒
  0.04 │        movq         $0x3,0x3e0(%rsp)                        ▒
  0.01 │        movq         $0x1,0x3e8(%rsp)                        ▒
  0.07 │        movq         $0x3,0x3f0(%rsp)                        ▒
  0.06 │        add          %rbp,%rax                               ▒
  0.02 │        movq         $0xfffffffffffffffc,0x408(%rsp)         ▒
  1.36 │        movq         $0x8,0x410(%rsp)                        ▒
  0.48 │        mov          %rax,0x400(%rsp)                        ▒
  0.01 │        movq         $0x8,0x420(%rsp)                        ▒
  0.24 │        movq         $0x1,0x428(%rsp)                        ▒
  0.01 │        movq         $0x1,0x430(%rsp)                        ▒
  0.24 │        movq         $0x3,0x438(%rsp)                        ▒
  0.01 │        movq         $0x3,0x440(%rsp)                        ▒
  0.28 │        movq         $0x1,0x448(%rsp)                        ▒
  1.32 │        movq         $0x3,0x450(%rsp)                        ▒
 13.32 │        movq         $0xfffffffffffffffc,0x468(%rsp)         ▒
  0.01 │        movq         $0x8,0x470(%rsp)                        ▒
  5.68 │        movq         $0x8,0x480(%rsp)                        ▒
  0.01 │        movq         $0x1,0x488(%rsp)                        ▒
  2.87 │        movq         $0x1,0x490(%rsp)                        ▒
  0.06 │        movq         $0x3,0x498(%rsp)                        ▒
  2.82 │        movq         $0x3,0x4a0(%rsp)                        ▒
  0.02 │        movq         $0x1,0x4a8(%rsp)                        ▒
  2.87 │        movq         $0x3,0x4b0(%rsp)                        ▒
  0.02 │        mov          0x48(%rsp),%rax                         ◆
  1.41 │        add          %rbp,%rax                               ▒
  0.02 │        add          $0x48,%rbp                              ▒
  0.00 │        mov          %rax,0x460(%rsp)                        ▒
  1.43 │        lea          0x170(%rsp),%rax                        ▒
  0.00 │        push         %rax                                    ▒
  1.42 │        lea          0x190(%rsp),%rax                        ▒
  0.02 │        push         %rax                                    ▒
  4.28 │        push         %r13                                    ▒
  0.01 │        add          $0x8,%r13                               ▒
  0.00 │        push         %r14                                    ▒
  4.27 │        add          $0x8,%r14                               ▒
       │        lea          0x480(%rsp),%rax                        ▒
  0.01 │        push         %rax                                    ▒
  0.03 │        push         %r12                                    ▒
  0.00 │        add          $0x8,%r12                               ▒
  0.00 │        lea          0x430(%rsp),%rax                        ▒
  1.42 │        push         %rax                                    ▒
  2.77 │        lea          0x3d8(%rsp),%rax                        ▒
  0.01 │        push         %rax                                    ▒
  1.43 │        lea          0x380(%rsp),%rax                        ▒
  0.01 │        push         %rax                                    ▒
  1.47 │        lea          0x328(%rsp),%rax                        ▒
  0.00 │        push         %rax                                    ▒
  4.16 │        lea          0x1f8(%rsp),%rax                        ▒
  0.01 │        push         %rax                                    ▒
  2.86 │        lea          0x1d8(%rsp),%rax                        ▒
  0.00 │        push         %rax                                    ▒
  1.56 │        lea          0x1f0(%rsp),%rax                        ▒
  0.01 │        push         %rax                                    ▒
  5.42 │        lea          0x248(%rsp),%rax                        ▒
  0.01 │        push         %rax                                    ▒
  1.43 │        lea          0x1c8(%rsp),%r9                         ▒
  0.00 │        lea          0x220(%rsp),%r8                         ▒
  0.00 │        lea          0x258(%rsp),%rcx                        ▒
  0.00 │        lea          0x210(%rsp),%rdx                        ▒
  0.00 │        lea          0x208(%rsp),%rsi                        ▒
  0.01 │        lea          0x1e8(%rsp),%rdi                        ▒
  0.03 │      → call         __perdida_m_MOD_perdida.constprop.0     ▒

We do constant-propagate the array descriptors (up to the one variable entry),
but we do not optimize out the code initializing them.

* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
  2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
  2021-12-06 17:15 ` [Bug tree-optimization/103585] " hubicka at gcc dot gnu.org
  2021-12-07  9:31 ` [Bug ipa/103585] " hubicka at gcc dot gnu.org
@ 2021-12-07  9:49 ` hubicka at gcc dot gnu.org
  2021-12-07 10:10 ` hubicka at gcc dot gnu.org
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-07  9:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-12-07
             Status|UNCONFIRMED                 |NEW

--- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
perdida modref summary is:
  loads:                                                                        
    Every base                                                                  
  stores:                                                                       
    Every base                                                                  
  Side effects                                                                  
  Nondeterministic                                                              
  Global memory read                                                            
  Global memory written                                                         
  parm 0 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 1 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 2 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 3 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 4 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 5 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 6 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 7 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 8 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 9 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape no_indirect_read
  parm 10 flags: no_direct_clobber no_direct_escape no_indirect_escape
  parm 11 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 12 flags: no_direct_clobber no_direct_escape no_indirect_escape
  parm 13 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 14 flags: no_direct_escape no_indirect_escape
  parm 15 flags: no_direct_clobber no_direct_escape no_indirect_escape
  parm 16 flags: no_direct_escape no_indirect_escape
  parm 17 flags: no_direct_escape no_indirect_escape
  parm 18 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape no_indirect_read
  parm 19 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
__attribute__((fn spec (". r r r r r r r r r r w r w r w w w w r r ")))         

So we cannot use it to optimize out the dead parameters. The reason is that
perdida calls
  dt_parm.237.common.filename = &"fatigue2.f90"[1]{lb: 1 sz: 1};                
  dt_parm.237.common.line = 1010;                                               
  MEM <vector(2) integer(kind=4)> [(integer(kind=4) *)&dt_parm.237] = { 128, 6 };
  _gfortran_st_write (&dt_parm.237);
  _gfortran_transfer_character_write (&dt_parm.237, &"discriminant is negative in perdida, abort."[1]{lb: 1 sz: 1}, 43);
  _gfortran_st_write_done (&dt_parm.237);                                       
  dt_parm.237 ={v} {CLOBBER};                                                   
  _gfortran_stop_string (0B, 0, 0);                                             

and Fortran I/O has no useful fnspec on it, so we think it accesses everything.

To optimize out the initializer we could (I guess)
 1) teach ipa-cp to turn the one remaining field into a scalar; we would then
detect the descriptor as non-escaping from MAIN and optimize it out
 2) make the PTA summary of ipa-modref expressive enough for PTA to work
this out.

-fipa-pta does not help (but I think it could if it worked well). We get:

  <bb 69> [local count: 109362591]:                                             
  _1895 = (unsigned long) _44;                                                  
  _1894 = _1895 + ivtmp.1050_1911;                                              
  _246 = (real(kind=8) *) _1894;                                                
  parm.326.data = _246;                                                         
  parm.326.offset = -4;                                                         
  parm.326.dtype.elem_len = 8;                                                  
  MEM <unsigned long> [(void *)&parm.326 + 24B] = 3307124817920;                
  parm.326.span = 8;                                                            
  parm.326.dim[0].stride = 1;                                                   
  parm.326.dim[0].lbound = 1;                                                   
  parm.326.dim[0].ubound = 3;                                                   
  parm.326.dim[1].stride = 3;                                                   
  parm.326.dim[1].lbound = 1;                                                   
  parm.326.dim[1].ubound = 3;                                                   
  _1883 = (unsigned long) _33;                                                  
  _1882 = _1883 + ivtmp.1050_1911;                                              
  _247 = (real(kind=8) *) _1882;                                                
  parm.327.data = _247;                                                         
  parm.327.offset = -4;                                                         
  parm.327.dtype.elem_len = 8;                                                  
  MEM <unsigned long> [(void *)&parm.327 + 24B] = 3307124817920;                
  parm.327.span = 8;                                                            
  parm.327.dim[0].stride = 1;                                                   
  parm.327.dim[0].lbound = 1;                                                   
  parm.327.dim[0].ubound = 3;                                                   
  parm.327.dim[1].stride = 3;                                                   
  parm.327.dim[1].lbound = 1;                                                   
  parm.327.dim[1].ubound = 3;                                                   
  _1881 = (unsigned long) _56;                                                  
  _1880 = _1881 + ivtmp.1050_1911;                                              
  _248 = (real(kind=8) *) _1880;                                                
  parm.328.data = _248;                                                         
  parm.328.offset = -4;                                                         
  parm.328.dtype.elem_len = 8;                                                  
  MEM <unsigned long> [(void *)&parm.328 + 24B] = 3307124817920;                
  parm.328.span = 8;                                                            
  parm.328.dim[0].stride = 1;                                                   
  parm.328.dim[0].lbound = 1;                                                   
  parm.328.dim[0].ubound = 3;                                                   
  parm.328.dim[1].stride = 3;                                                   
  parm.328.dim[1].lbound = 1;                                                   
  parm.328.dim[1].ubound = 3;                                                   
  _1879 = (unsigned long) _20;                                                  
  _1878 = _1879 + ivtmp.1050_1911;                                              
  _249 = (real(kind=8) *) _1878;                                                
  parm.329.data = _249;                                                         
  parm.329.offset = -4;                                                         
  parm.329.dtype.elem_len = 8;                                                  
  MEM <unsigned long> [(void *)&parm.329 + 24B] = 3307124817920;                
  parm.329.span = 8;                                                            
  parm.329.dim[0].stride = 1;                                                   
  parm.329.dim[0].lbound = 1;                                                   
  parm.329.dim[0].ubound = 3;                                                   
  parm.329.dim[1].stride = 3;                                                   
  parm.329.dim[1].lbound = 1;                                                   
  parm.329.dim[1].ubound = 3;                                                   
  _1867 = (unsigned long) _67;                                                  
  _1866 = _1867 + ivtmp.1050_1911;                                              
  _250 = (real(kind=8) *) _1866;                                                
  parm.330.data = _250;                                                         
  parm.330.offset = -4;                                                         
  parm.330.dtype.elem_len = 8;                                                  
  MEM <unsigned long> [(void *)&parm.330 + 24B] = 3307124817920;                
  parm.330.span = 8;                                                            
  parm.330.dim[0].stride = 1;                                                   
  parm.330.dim[0].lbound = 1;                                                   
  parm.330.dim[0].ubound = 3;                                                   
  parm.330.dim[1].stride = 3;                                                   
  parm.330.dim[1].lbound = 1;                                                   
  parm.330.dim[1].ubound = 3;                                                   
  _251 = (real(kind=8) *) ivtmp.1049_1915;                                      
  _252 = (real(kind=8) *) ivtmp.1048_1925;                                      
  _253 = (real(kind=8) *) ivtmp.1047_1936;                                      
  perdida.constprop (&dt, &lambda, &mu, &yield_stress, &r_infinity, &b,
&x_infinity, &gamma, &eta, &plastic_strain_threshold, &parm.326, &parm.327,
&parm.328, &parm.329, _253, &parm.330, _252, _251, &failure_threshold,
&crack_closure_parameter);
  parm.326 ={v} {CLOBBER};                                                      
  parm.327 ={v} {CLOBBER};                                                      
  parm.328 ={v} {CLOBBER};                                                      
  parm.329 ={v} {CLOBBER};                                                      
  parm.330 ={v} {CLOBBER};
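
The constant 3307124817920 written at offset 24 of every descriptor above can be unpacked to see what those stores encode. The following is only a sketch: it assumes a little-endian target and a dtype word laid out as a 32-bit version, an 8-bit rank, an 8-bit type code and a 16-bit attribute field. That layout is an assumption about libgfortran's array descriptor and is not stated in this thread; the helper name `decode_dtype_word` is hypothetical.

```python
import struct

def decode_dtype_word(word):
    """Split the 8-byte value stored at descriptor offset 24 into fields.

    Assumed little-endian layout (an assumption, not taken from the thread):
    int32 version, int8 rank, int8 type, int16 attribute.
    """
    raw = struct.pack("<Q", word)  # the 8 bytes as they would sit in memory
    version, rank, type_code, attribute = struct.unpack("<ibbh", raw)
    return {"version": version, "rank": rank,
            "type": type_code, "attribute": attribute}

# Constant taken from the GIMPLE dump above.
fields = decode_dtype_word(3307124817920)
print(fields)
```

Under these assumptions the rank field decodes to 2, which agrees with the two `dim[]` entries each descriptor initializes. Every field except `data` is the same constant on each iteration, which is why the stores are candidates for removal.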

* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
  2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-12-07  9:49 ` hubicka at gcc dot gnu.org
@ 2021-12-07 10:10 ` hubicka at gcc dot gnu.org
  2021-12-07 10:17 ` hubicka at gcc dot gnu.org
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-07 10:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 51940
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51940&action=edit
gcov report.  perdida does have some loops in it but it seems we optimize them
out/make them relatively cheap.

* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
  2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2021-12-07 10:10 ` hubicka at gcc dot gnu.org
@ 2021-12-07 10:17 ` hubicka at gcc dot gnu.org
  2021-12-07 10:24 ` marxin at gcc dot gnu.org
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-07 10:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #5 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Quick counting from gcov data: there are 585898984 calls and, from the
disassembly, the call sequence is 101 instructions, so it accounts for
59175797384 instructions (585898984 * 101).
The perf reports show a difference of 69059173576 instructions, so at least
instruction-count-wise most of it is the call overhead of perdida.
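
The estimate can be re-run from the figures quoted in this thread. This is a minimal sketch: the variable and function names are illustrative, and it assumes that `./a.out-slow` is the build without inlining and `./a.out-fast` the build with perdida inlined.

```python
# Figures copied from the thread: perf stat instruction counts and gcov data.
INSNS_SLOW = 260_215_601_013   # ./a.out-slow (assumed: perdida not inlined)
INSNS_FAST = 191_156_427_437   # ./a.out-fast (assumed: perdida inlined)
CALLS = 585_898_984            # calls to perdida, from the gcov report
INSNS_PER_CALL = 101           # length of the call sequence in the disassembly

def call_overhead_share(insns_slow, insns_fast, calls, insns_per_call):
    """Return (extra instructions, instructions in call setup, their ratio)."""
    extra = insns_slow - insns_fast
    setup = calls * insns_per_call
    return extra, setup, setup / extra

extra, setup, share = call_overhead_share(
    INSNS_SLOW, INSNS_FAST, CALLS, INSNS_PER_CALL)
print(f"extra instructions:  {extra}")
print(f"call setup:          {setup}")
print(f"share of difference: {share:.1%}")
```

The ratio only covers instruction counts. It says nothing about where the cycles go, so it supports the "instruction count wise" claim and nothing stronger.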

* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
  2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2021-12-07 10:17 ` hubicka at gcc dot gnu.org
@ 2021-12-07 10:24 ` marxin at gcc dot gnu.org
  2021-12-08 18:24 ` hubicka at gcc dot gnu.org
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-12-07 10:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #6 from Martin Liška <marxin at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #4)
> Created attachment 51940 [details]
> gcov report.  perdida does have some loops in it but it seems we optimize
> them out/make htem relatively cheap.

HTML colored version of the report:
https://splichal.eu/tmp/fatique.gcov.html

* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
  2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2021-12-07 10:24 ` marxin at gcc dot gnu.org
@ 2021-12-08 18:24 ` hubicka at gcc dot gnu.org
  2021-12-08 18:28 ` hubicka at gcc dot gnu.org
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-08 18:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 51952
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51952&action=edit
Patch to teach modref about global memory

This patch extends modref so that we eliminate the dead part of the array
descriptors.  This is done by distinguishing accesses to global memory from
accesses to unknown memory.  The former is less evil since it cannot access
things that do not escape, and it is possible to arrange for unanalyzed
functions to access only global memory (since everything reaching them escapes).

With the patch I now get
 Performance counter stats for './a.out-badmod':

         27,448.82 msec task-clock                #    0.999 CPUs utilized      
               334      context-switches          #   12.168 /sec               
                10      cpu-migrations            #    0.364 /sec               
                70      page-faults               #    2.550 /sec               
   109,079,862,444      cycles                    #    3.974 GHz                
       130,234,675      stalled-cycles-frontend   #    0.12% frontend cycles idle
    80,559,978,185      stalled-cycles-backend    #   73.85% backend cycles idle
   230,337,342,608      instructions              #    2.11  insn per cycle
                                                  #    0.35  stalled cycles per insn
     9,808,852,327      branches                  #  357.351 M/sec              
        24,742,926      branch-misses             #    0.25% of all branches    

      27.468971377 seconds time elapsed

      27.445934000 seconds user
       0.003999000 seconds sys


So the 56% slowdown is now only 22%, which is quite a nice improvement.
The call of perdida now looks as follows:
  <bb 70> [local count: 109362591]:
  _818 = (unsigned long) _44;
  _817 = _818 + ivtmp.1050_1737;
  _246 = (real(kind=8) *) _817;
  parm.326.data = _246;
  _811 = (unsigned long) _33;
  _810 = _811 + ivtmp.1050_1737;
  _247 = (real(kind=8) *) _810;
  parm.327.data = _247;
  _712 = (unsigned long) _56;
  _711 = _712 + ivtmp.1050_1737;
  _248 = (real(kind=8) *) _711;
  parm.328.data = _248;
  _612 = (unsigned long) _20;
  _574 = _612 + ivtmp.1050_1737;
  _249 = (real(kind=8) *) _574;
  parm.329.data = _249;
  _573 = (unsigned long) _67;
  _572 = _573 + ivtmp.1050_1737;
  _250 = (real(kind=8) *) _572;
  parm.330.data = _250;
  _251 = (real(kind=8) *) ivtmp.1049_1740;
  _252 = (real(kind=8) *) ivtmp.1048_1753;
  _253 = (real(kind=8) *) ivtmp.1047_1762;
  perdida.constprop (&dt, &lambda, &mu, &yield_stress, &r_infinity, &b,
&x_infinity, &gamma, &eta, &plastic_strain_threshold, &parm.326, &parm.327,
&parm.328, &parm.329, _253, &parm.330, _252, _251, &failure_threshold,
&crack_closure_parameter);
  parm.326 ={v} {CLOBBER};
  parm.327 ={v} {CLOBBER};
  parm.328 ={v} {CLOBBER};
  parm.329 ={v} {CLOBBER};
  parm.330 ={v} {CLOBBER};

and compiles to:
        movq    48(%rsp), %rax
        movq    %r10, 24(%rsp)
        addq    %rbp, %rax
        movq    %rax, 720(%rsp)
        movq    32(%rsp), %rax
        addq    %rbp, %rax
        movq    %rax, 816(%rsp)
        movq    64(%rsp), %rax
        addq    %rbp, %rax
        movq    %rax, 912(%rsp)
        movq    40(%rsp), %rax
        addq    %rbp, %rax
        movq    %rax, 1008(%rsp)
        movq    72(%rsp), %rax
        addq    %rbp, %rax
        addq    $72, %rbp
        movq    %rax, 1104(%rsp)
        leaq    352(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1720
        leaq    384(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1728
        pushq   %r13
        .cfi_def_cfa_offset 1736
        addq    $8, %r13
        pushq   %r12
        .cfi_def_cfa_offset 1744
        addq    $8, %r12
        leaq    1136(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1752
        pushq   %r14
        .cfi_def_cfa_offset 1760
        addq    $8, %r14
        leaq    1056(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1768
        leaq    968(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1776
        leaq    880(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1784
        leaq    792(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1792
        leaq    488(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1800
        leaq    456(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1808
        leaq    480(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1816
        leaq    568(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1824
        leaq    440(%rsp), %r9
        leaq    528(%rsp), %r8
        leaq    584(%rsp), %rcx
        leaq    512(%rsp), %rdx
        leaq    504(%rsp), %rsi
        leaq    472(%rsp), %rdi
        call    __perdida_m_MOD_perdida.constprop.0
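
As a cross-check, the perf instruction counts can be combined with the call count quoted in comment #5 to estimate how many instructions per call the patch removes. This is a sketch under assumptions: the names are illustrative, `./a.out-badmod` is taken to be the patched build without inlining, and `./a.out-slow` the unpatched one.

```python
# Instruction counts from the three perf stat runs quoted in this thread.
INSNS_UNPATCHED = 260_215_601_013  # ./a.out-slow (assumed: no inlining, no patch)
INSNS_PATCHED = 230_337_342_608    # ./a.out-badmod (assumed: no inlining, patched)
INSNS_INLINED = 191_156_427_437    # ./a.out-fast (assumed: perdida inlined)
CALLS = 585_898_984                # calls to perdida, from comment #5

# Instructions per call that disappear with the modref patch.
saved_per_call = (INSNS_UNPATCHED - INSNS_PATCHED) / CALLS
# Instructions per call still separating the patched build from the inlined one.
left_per_call = (INSNS_PATCHED - INSNS_INLINED) / CALLS

print(f"saved per call:     {saved_per_call:.2f}")
print(f"remaining per call: {left_per_call:.2f}")
```

The saving comes out close to the 50 constant descriptor stores (10 per descriptor, 5 descriptors) that are present in the earlier GIMPLE dump and absent from the one above.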

* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
  2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2021-12-08 18:24 ` hubicka at gcc dot gnu.org
@ 2021-12-08 18:28 ` hubicka at gcc dot gnu.org
  2021-12-08 18:43 ` hubicka at gcc dot gnu.org
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-08 18:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #8 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Also ipa-sra fails with
! Disqualifying parameter number 0 - Would create a possibly illegal
dereference in a caller.
! Disqualifying parameter number 3 - Would create a possibly illegal
dereference in a caller.
! Disqualifying parameter number 4 - Would create a possibly illegal
dereference in a caller.
! Disqualifying parameter number 5 - Would create a possibly illegal
dereference in a caller.
! Disqualifying parameter number 6 - Would create a possibly illegal
dereference in a caller.
! Disqualifying parameter number 7 - Would create a possibly illegal
dereference in a caller.
! Disqualifying parameter number 8 - Would create a possibly illegal
dereference in a caller.
! Disqualifying parameter number 9 - Would create a possibly illegal
dereference in a caller.
! Disqualifying parameter number 10 - Would result into a too big set of
replacements.
! Disqualifying parameter number 11 - Would result into a too big set of
replacements.
! Disqualifying parameter number 12 - Would result into a too big set of
replacements.
! Disqualifying parameter number 13 - Would result into a too big set of
replacements.
! Disqualifying parameter number 15 - Would result into a too big set of
replacements.
! Disqualifying parameter number 18 - Would create a possibly illegal
dereference in a caller.
! Disqualifying parameter number 19 - Would create a possibly illegal
dereference in a caller.

I think it is overparanoid about values passed by reference.
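
A minimal sketch of what the "possibly illegal dereference" message guards
against, written in C++ with made-up names (nothing here comes from fatigue2 or
from ipa-sra itself).  Passing the pointed-to value instead of the pointer moves
the load into the caller; if the callee only loaded conditionally, the caller
would now load from a pointer the original program never touched.  A reference
must name a live object, so the same move is harmless there.

```cpp
// Hypothetical illustration only; these functions are not part of GCC or fatigue2.

// Original callee: *p is read only when `use` is true, so a caller may
// legally pass nullptr together with use == false.
double scale_ptr(const double *p, bool use) {
    return use ? *p * 2.0 : 0.0;
}

// Shape of the callee after the parameter is passed by value: the load
// has to happen in the caller, before the call.
double scale_val(double v, bool use) {
    return use ? v * 2.0 : 0.0;
}

// Callee taking a reference: the referent is required to exist.
double scale_ref(const double &r, bool use) {
    return use ? r * 2.0 : 0.0;
}

// Caller for the reference variant after the transform.  Reading `r`
// unconditionally is fine because a reference cannot be null.
// The pointer analogue, scale_val(*p, use), would trap for p == nullptr
// even though scale_ptr(nullptr, false) was well defined.
double call_hoisted_ref(const double &r, bool use) {
    return scale_val(r, use);
}
```

For example, `scale_ptr(nullptr, false)` is valid in the original program, which
is why a load through a plain pointer cannot simply be moved to the caller.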

* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2021-12-08 18:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #9 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Hacking around the logic in ipa-sra that disables the transform and adding
-fwhole-program I get down to:


 Performance counter stats for './a.out-bad3':

         24,946.66 msec task-clock                #    0.999 CPUs utilized      
             1,078      context-switches          #   43.212 /sec               
                42      cpu-migrations            #    1.684 /sec               
                71      page-faults               #    2.846 /sec               
    96,144,941,575      cycles                    #    3.854 GHz                
       151,439,200      stalled-cycles-frontend   #    0.16% frontend cycles idle
    68,072,941,085      stalled-cycles-backend    #   70.80% backend cycles idle
   210,675,636,303      instructions              #    2.19  insn per cycle     
                                                  #    0.32  stalled cycles per insn
     9,128,994,716      branches                  #  365.941 M/sec              
        24,781,891      branch-misses             #    0.27% of all branches    

      24.982117481 seconds time elapsed

      24.909903000 seconds user
       0.036031000 seconds sys

which is not bad.  I think we have two ipa-sra issues:
 1) ipa-sra is overly paranoid about not adding a dereference.  First, I think
it is safe for parameters that are REFERENCE_TYPE rather than POINTER_TYPE.
Second, it should do propagation from callers to callees: it is quite easy to
figure out that a given param contains data packed into a structure only to
make the callee happy.
 2) Since ipa-sra is run before ipa-cp, it won't simplify ipa-cp (or other)
clones even if they are static symbols.

Also I think ipa-sra may consider packing multiple structures together.  If
array descriptors are passed by reference and built in the caller, it seems
pointless to pass each as a separate struct.  Not sure what kind of benefits to
expect here though.
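
The derived columns in the perf output above follow directly from the raw
counters.  A small C++ sketch that recomputes them; the struct and function
names are hypothetical, only the counter values are taken from the output.

```cpp
#include <cstdint>

// Raw counter values copied from the perf output; the type and the
// helper names below are made up for this illustration.
struct PerfCounters {
    std::uint64_t cycles;
    std::uint64_t stalled_backend;
    std::uint64_t instructions;
    std::uint64_t branches;
    std::uint64_t branch_misses;
};

constexpr PerfCounters kBad3{96144941575ULL, 68072941085ULL,
                             210675636303ULL, 9128994716ULL, 24781891ULL};

// Instructions retired per cycle (perf prints this as "insn per cycle").
constexpr double insn_per_cycle(const PerfCounters &c) {
    return static_cast<double>(c.instructions) / static_cast<double>(c.cycles);
}

// Fraction of all cycles in which the backend was stalled.
constexpr double backend_stall_fraction(const PerfCounters &c) {
    return static_cast<double>(c.stalled_backend) / static_cast<double>(c.cycles);
}

// Fraction of branches that were mispredicted.
constexpr double branch_miss_fraction(const PerfCounters &c) {
    return static_cast<double>(c.branch_misses) / static_cast<double>(c.branches);
}
```

The backend stall fraction is the figure to watch in this bug: about 70% of the
cycles are still spent waiting on the backend.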

* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2021-12-12  8:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #10 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
https://gcc.gnu.org/pipermail/gcc-patches/2021-December/586644.html
improves modref so it is almost able to remove the unnecessary stores to array
descriptors.  Unfortunately, to eliminate them one needs to handle the case that
calling an unknown function is safe WRT all non-escaping values (since passing
them to the function makes them escape), and that is sadly broken by the Fortran
FE incorrectly annotating functions calling GOMP_parallel.

* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: cvs-commit at gcc dot gnu.org @ 2021-12-14 15:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #11 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:

https://gcc.gnu.org/g:3305135c29e1c3e988bd9bad40aefc01d138aaca

commit r12-5960-g3305135c29e1c3e988bd9bad40aefc01d138aaca
Author: Jan Hubicka <jh@suse.cz>
Date:   Tue Dec 14 16:50:27 2021 +0100

    Determine global memory accesses in ipa-modref

    As discussed in PR103585, fatigue2 is now the only benchmark from my usual
    testing set (SPEC2k6, SPEC2k17, CPP benchmarks, polyhedron, Firefox, clang)
    which sees an important regression when inlining functions called once is
    limited.  This prevents us from solving runtime issues in roms benchmarks
    and elsewhere.

    The problem is that there is a perdida function that takes many arguments,
    and some of them are array descriptors.  We constant propagate most of
    their fields but still keep their initialization.  Because perdida is quite
    fast, the call overhead dominates, since we need over 100 memory stores
    consuming about 35% of the overall benchmark runtime.

    The memory stores would be eliminated if perdida did not call Fortran I/O,
    which makes modref think that the array descriptors could be accessed.  We
    are quite close to discovering that they can't because they are
    non-escaping from the function.  This patch makes modref distinguish
    between global memory access (only things that escape) and unknown access
    (that may also access nonescaping things reaching the function).  This
    makes disambiguation for functions containing error handling better.

    Unfortunately the patch hits two semi-latent issues in the Fortran
    frontend.  First is wrong code in gfortran.dg/unlimited_polymorphic_3.f03.
    This can be turned into a wrong code testcase on both mainline and gcc11 if
    the runtime call is removed, so I filed PR 103662 for it.  There is a TBAA
    mismatch for a structure produced in the FE.

    Second is an issue with GOMP where Fortran marks certain parameters as
    non-escaping and then makes them escape via GOMP_parallel.  For this I
    disabled the use of escape info in verify_arg, which also disables the
    useful transform on perdida but still does useful work for e.g. GCC error
    handling.  I will work on this incrementally.

    Bootstrapped/regtested x86_64-linux, lto-bootstrapped and also tested with
    clang build.  I plan to commit this tomorrow if there are no complaints
    (the patch is not completely short but conceptually simple and handles a
    lot of common cases).

    gcc/ChangeLog:

    2021-12-12  Jan Hubicka  <hubicka@ucw.cz>

            PR ipa/103585
            * ipa-modref-tree.c (modref_access_node::range_info_useful_p):
            Handle MODREF_GLOBAL_MEMORY_PARM.
            (modref_access_node::dump): Likewise.
            (modref_access_node::get_call_arg): Likewise.
            * ipa-modref-tree.h (enum modref_special_parms): Add
            MODREF_GLOBAL_MEMORY_PARM.
            (modref_access_node::useful_for_kill): Handle
            MODREF_GLOBAL_MEMORY_PARM.
            (modref:tree::merge): Add promote_unknown_to_global.
            * ipa-modref.c (verify_arg): New function.
            (may_access_nonescaping_parm_p): New function.
            (modref_access_analysis::record_global_memory_load): New member
            function.
            (modref_access_analysis::record_global_memory_store): Likewise.
            (modref_access_analysis::process_fnspec): Distinguish global and
            local memory.
            (modref_access_analysis::analyze_call): Likewise.
            * tree-ssa-alias.c (ref_may_access_global_memory_p): New function.
            (modref_may_conflict): Use it.

    gcc/testsuite/ChangeLog:

    2021-12-12  Jan Hubicka  <hubicka@ucw.cz>

            * gcc.dg/analyzer/data-model-1.c: Disable ipa-modref.
            * gcc.dg/uninit-38.c: Likewise.
            * gcc.dg/uninit-pr98578.c: Likewise.
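
The situation described in the commit message can be sketched in C++ as
follows.  This is an illustration under assumptions, not code from fatigue2 or
GCC: `Descriptor`, `report_error`, `first_element` and `caller` are all
hypothetical, with `report_error` standing in for an opaque runtime routine
such as Fortran I/O.

```cpp
// Hypothetical stand-in for a Fortran array descriptor.
struct Descriptor {
    const double *data;
    long lbound;
    long ubound;
    long stride;
};

// Global state: the only memory the error path touches.
int error_count = 0;

// Stand-in for an opaque runtime call.  It accesses global memory only,
// so it cannot see a caller's local object whose address never escapes.
void report_error() { ++error_count; }

// Callee: reads d.data and nothing else from the descriptor, but has an
// error path that calls the opaque routine.
double first_element(const Descriptor &d) {
    if (d.data == nullptr) {
        report_error();
        return 0.0;
    }
    return d.data[0];
}

// Caller: builds the whole descriptor in a local.  The stores to lbound,
// ubound and stride are never read.  If the call inside first_element is
// treated as "may access anything", those stores must stay; if it is
// recorded as touching only global memory and `d` does not escape, they
// are candidates for removal.
double caller(const double *a, long n) {
    Descriptor d;
    d.data = a;
    d.lbound = 1;
    d.ubound = n;
    d.stride = 1;
    return first_element(d);
}
```

The point of the sketch is only the distinction itself: one opaque call on an
error path is enough to keep every descriptor store alive unless the analysis
can tell "global memory" apart from "anything reachable".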

* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2022-01-31 10:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #12 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 52317
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52317&action=edit
WIP patch

This is a patch I made while working on this problem.  It improves the escape
analysis of the first patch so it cooperates with PTA.  However, because the
Fortran FE sort of lies about escapes in the presence of OpenMP, I had to
disable it for the cases that are interesting for perdida.  So I will look into
it in the next stage 1.

* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: cvs-commit at gcc dot gnu.org @ 2022-12-14  0:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #13 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Martin Jambor <jamborm@gcc.gnu.org>:

https://gcc.gnu.org/g:10478270fe0c39c59eb0f35d19356a63bdf3a2ad

commit r13-4687-g10478270fe0c39c59eb0f35d19356a63bdf3a2ad
Author: Martin Jambor <mjambor@suse.cz>
Date:   Wed Dec 14 00:33:06 2022 +0100

    ipa-sra: Treat REFERENCE_TYPES as always dereferencable

    C++ and especially Fortran pass data by references which are not
    pointers potentially pointing anywhere and so can be assumed to be
    safely dereferencable.  This patch teaches IPA-SRA to treat them as
    such and avoid the dance we do to prove that we can move loads from
    them to the caller.

    When we do not know that a dereference will happen all the time, we
    need a heuristic so that we do not force memory accesses that normally
    happen only rarely.  The patch simply uses the (possibly guessed)
    profile and checks whether the (expected) number of loads is at least
    half of function invocations - the half is now configurable with a
    param as requested by Honza.

    gcc/ChangeLog:

    2022-12-13  Martin Jambor  <mjambor@suse.cz>

            PR ipa/103585
            * params.opt (ipa-sra-deref-prob-threshold): New parameter.
            * doc/invoke.texi (ipa-sra-deref-prob-threshold): Document it.
            * ipa-sra.cc (struct gensum_param_access): New field load_count.
            (struct gensum_param_desc): New field safe_ref, adjusted comments.
            (by_ref_count): Renamed to unsafe_by_ref_count, adjusted all uses.
            (dump_gensum_access): Dump the new field.
            (dump_gensum_param_descriptor): Likewise.
            (create_parameter_descriptors): Set safe_ref field, move setting
            by_ref forward.  Only increment unsafe_by_ref_count for unsafe
            by_ref parameters.
            (allocate_access): Initialize new field.
            (mark_param_dereference): Adjust indentation.  Only add data to
            bb_dereferences for unsafe by_ref parameters.
            (scan_expr_access): For loads, accumulate BB counts.
            (dereference_probable_p): New function.
            (check_gensum_access): Fix leading comment, add parameter FUN.
            Check cumulative counts of loads for safe by_ref accesses instead
            of dereferences.
            (process_scan_results): Do not propagate dereference distances for
            safe by_ref parameters.  Pass fun to check_gensum_access.  Safe
            by_ref params do not need the postdominance check.

    gcc/testsuite/ChangeLog:

    2022-11-11  Martin Jambor  <mjambor@suse.cz>

            * g++.dg/ipa/ipa-sra-5.C: New test
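
The heuristic from the commit message can be sketched as below.  This is not
the code of dereference_probable_p in ipa-sra.cc; `deref_probable` and its
parameters are hypothetical, and expressing the threshold as a percentage with
a default of 50 is an assumption based on "at least half" in the text.

```cpp
#include <cstdint>

// Hypothetical sketch of the profile check: treat a dereference as
// "probable" when the expected number of loads through the parameter is
// at least threshold_percent of the number of function invocations.
// Counts are converted to double to avoid overflow for very large
// profile counts.
bool deref_probable(std::uint64_t load_count,
                    std::uint64_t entry_count,
                    unsigned threshold_percent = 50) {
    return static_cast<double>(load_count) * 100.0 >=
           static_cast<double>(entry_count) * static_cast<double>(threshold_percent);
}
```

With the assumed default, a parameter loaded on 50 out of 100 invocations
qualifies and one loaded on 49 out of 100 does not; lowering the threshold
makes the transform more eager.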

* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: jamborm at gcc dot gnu.org @ 2023-01-19 13:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #14 from Martin Jambor <jamborm at gcc dot gnu.org> ---
Honza, what remains to be done here (if anything)?

* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2023-01-29  2:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #15 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
We get 47s runtime with -O2 -flto and 53s with -O2
-fno-inline-functions-called-once.

The call sequence is:
  <bb 81> [local count: 109362591]:
  _1656 = (unsigned long) _45;
  _1655 = _1656 + ivtmp.1182_2540;
  _229 = (double *) _1655;
  _1646 = (unsigned long) _35;
  _1645 = _1646 + ivtmp.1182_2540;
  _230 = (double *) _1645;
  _1636 = (unsigned long) _55;
  _1635 = _1636 + ivtmp.1182_2540;
  _231 = (double *) _1635;
  _1628 = (unsigned long) _17;
  _1627 = _1628 + ivtmp.1182_2540;
  _232 = (double *) _1627;
  _1618 = (unsigned long) _64;
  _1617 = _1618 + ivtmp.1182_2540;
  _233 = (double *) _1617;
  _234 = (double *) ivtmp.1181_2551;
  _235 = (double *) ivtmp.1180_2575;
  _236 = (double *) ivtmp.1178_2586;
  _2607 = yield_stress;
  perdida.constprop.isra (&dt, &lambda, &mu, _2607, &r_infinity, &b,
&x_infinity, &gamma, &eta, &plastic_strain_threshold, _229, _230, _231, _232,
_236, _233, _235, _234, &failure_threshold, &crack_closure_parameter);

It is not clear to me why lambda is not replaced.  However, for dt it seems to
be:
! Disqualifying parameter number 0 - Dereferences in callers would happen much
more frequently.

I think this is too early, since if we SRA all the way down to the original
caller we will avoid all dereferences completely.

Another place for improvement is non-LTO.  Here IPA-SRA disables itself since
it does not have a cost model for cloning (that could also be improved).

The situation could be improved by ipa-modref, which may optimize away unused
parts of the array descriptors.  However, ipa-modref gives up because perdida
has Fortran I/O, and it then gives up on tracking the descriptors even if the
descriptors never escape to the I/O.

For this I need to finish the non-escaping analysis, i.e. distinguish between
arguments that do not escape in the sense that once the function returns they
are not saved in global memory, and in the sense that they are never passed
down to a callee function.

We may also annotate Fortran I/O and understand what it does.

So there is still a lot to do.
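
The two senses of "non-escaping" mentioned above can be told apart with a small
C++ sketch.  All names are hypothetical and the code only illustrates the
distinction, not how ipa-modref represents it.

```cpp
// Global memory that can retain a pointer after a call returns.
const double *saved = nullptr;

// The argument escapes in the first sense: after `keep` returns, the
// pointer is still reachable through global memory.
void keep(const double *p) { saved = p; }

// A callee that reads through its argument.
double helper(const double *p) { return *p; }

// The argument is not saved anywhere once `pass_down` returns, but it is
// passed down to a callee, so it still escapes in the second sense: a
// call made inside this function may access the pointed-to memory.
double pass_down(const double *p) { return helper(p) + 1.0; }

// The argument escapes in neither sense: it is used only locally, so no
// call made by this function (there is none) could observe it.
double use_locally(const double *p) { return *p + 1.0; }
```

Only arguments of the last kind are ones for which an unrelated call inside the
function, such as I/O, is known not to touch the pointed-to data.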
