public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well
@ 2021-12-06 16:59 hubicka at gcc dot gnu.org
  2021-12-06 17:15 ` [Bug tree-optimization/103585] " hubicka at gcc dot gnu.org
                   ` (14 more replies)
  0 siblings, 15 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-06 16:59 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

            Bug ID: 103585
           Summary: fatigue2 requires inlining of peridida to work well
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

fatigue2 regresses by 35% when comparing -Ofast -march=native (zen) to
-Ofast -march=native -fno-inline-functions-called-once.

In general, inlining functions called once is not intended as an important
performance heuristic. Here we seem to miss some propagation that supports
better vectorization.

The important inline is:

-Inlining perdida.constprop/105 size 703.
- Called once from iztaccihuatl/17 2570 insns.
- Parm map: -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4
-Updated mod-ref summary for iztaccihuatl/17
- loads:
- Every base
- stores:
- Base 0: alias set 6
- Ref 0: alias set 6
- Every access
- Side effects
- Global memory read
- Inlined into iztaccihuatl/17 which now has 3148 size
-Overall time estimate: 270429.808105 weighted by profile: 0.000000
+Overall time estimate: 93562.854248 weighted by profile: 0.000000

So we estimate perdida at 703 instructions, which is not small, and the
function it is inlined into is already quite large (3148 instructions).

Before the vectorizer, perdida is still 800 lines of gimple dump (w/o
declarations), but it then simplifies to 630 at the last rerun of dse.

iztaccihuatl is 3323 lines of gimple dump in .optimized of the build with
perdida inlined and 2832 in the build without perdida inlined, so inlining
adds only 491 lines.

So it does look like something propagates that we do not get from
ipa-cp/modref propagation.
* [Bug tree-optimization/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2021-12-06 17:15 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mjambor at suse dot cz

--- Comment #1 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
It seems to me that we miss some ipa-cp propagation here. In particular,
there are array descriptors:

  parm.326.span = 8;
  MEM <c_char[8]> [(struct dtype_type *)&parm.326 + 24B] = {};
  parm.326.dtype.elem_len = 8;
  parm.326.dtype.rank = 2;
  parm.326.dtype.type = 3;
  parm.326.dim[0].lbound = 1;
  parm.326.dim[0].ubound = 3;
  parm.326.dim[0].stride = 1;
  parm.326.dim[1].lbound = 1;
  parm.326.dim[1].ubound = 3;
  parm.326.dim[1].stride = 3;
  _245 = _243 * 9;
  _246 = &MEM <real(kind=8)[0:]> [(real(kind=8)[0:] *)_44][_245];
  parm.326.data = _246;
  parm.326.offset = -4;

and the resulting jump function:

  ref offset 64: -4 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
  ref offset 128: 8 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
  ref offset 224: 2 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
  ref offset 232: 3 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
  ref offset 256: 8 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
  ref offset 320: 1 [loc_time: 248, loc_size: 697, prop_time: 248, prop_size: 697]
  ref offset 384: 1 [loc_time: 1, loc_size: 702, prop_time: 1, prop_size: 702]
  ref offset 448: 3 [loc_time: 1, loc_size: 702, prop_time: 1, prop_size: 702]
  ref offset 512: 3 [loc_time: 245, loc_size: 700, prop_time: 245, prop_size: 700]
  ref offset 576: 1 [loc_time: 1, loc_size: 702, prop_time: 1, prop_size: 702]
  ref offset 640: 3 [loc_time: 1, loc_size: 702, prop_time: 1, prop_size: 702]

There are 11 constants here, while there are 12 constant stores above. I
guess the MEM one is missing.

The call is:

  call perdida (dt, lambda, mu, yield_stress, R_infinity, b, X_infinity, &
                gamma, eta, plastic_strain_threshold, stress_tensor(:,:,n), &
                strain_tensor(:,:,n), plastic_strain_tensor(:,:,n), &
                strain_rate_tensor(:,:,n), accumulated_plastic_strain(n), &
                back_stress_tensor(:,:,n), isotropic_hardening_stress(n), &
                damage(n), failure_threshold, crack_closure_parameter)

A similar situation repeats itself multiple times.
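To make the "11 constants versus 12 stores" observation easier to check, here is a minimal sketch that maps the jump-function bit offsets from comment #1 back to descriptor fields. The structure layout below is an assumption: it models a rank-2 gfortran array descriptor on x86-64 using the field names visible in the GIMPLE dump (data, offset, dtype, span, dim), with a dtype sub-structure of elem_len, version, rank, type and attribute. The class and helper names are hypothetical and not part of GCC.

```python
import ctypes


class DType(ctypes.Structure):
    # Assumed layout of the 16-byte dtype sub-structure.
    _fields_ = [
        ("elem_len", ctypes.c_uint64),
        ("version", ctypes.c_int32),
        ("rank", ctypes.c_int8),
        ("type", ctypes.c_int8),
        ("attribute", ctypes.c_int16),
    ]


class Dim(ctypes.Structure):
    # One dimension triplet, each member a 64-bit index.
    _fields_ = [
        ("stride", ctypes.c_int64),
        ("lbound", ctypes.c_int64),
        ("ubound", ctypes.c_int64),
    ]


class Descriptor2D(ctypes.Structure):
    # Assumed rank-2 descriptor; the data pointer is modelled as a 64-bit integer.
    _fields_ = [
        ("data", ctypes.c_uint64),
        ("offset", ctypes.c_int64),
        ("dtype", DType),
        ("span", ctypes.c_int64),
        ("dim", Dim * 2),
    ]


def field_bit_offsets():
    """Return {bit offset: field name} for every scalar member of the descriptor."""
    fields = {}
    for name in ("data", "offset", "span"):
        fields[getattr(Descriptor2D, name).offset * 8] = name
    for name, _ in DType._fields_:
        bit = (Descriptor2D.dtype.offset + getattr(DType, name).offset) * 8
        fields[bit] = "dtype." + name
    for i in range(2):
        base = Descriptor2D.dim.offset + i * ctypes.sizeof(Dim)
        for name, _ in Dim._fields_:
            fields[(base + getattr(Dim, name).offset) * 8] = "dim[%d].%s" % (i, name)
    return fields


# Bit offsets and constants copied from the jump function in comment #1.
JUMP_FUNCTION = {64: -4, 128: 8, 224: 2, 232: 3, 256: 8, 320: 1,
                 384: 1, 448: 3, 512: 3, 576: 1, 640: 3}

FIELDS = field_bit_offsets()
COVERED = {FIELDS[bit]: value for bit, value in JUMP_FUNCTION.items()}
MISSING = [FIELDS[bit] for bit in sorted(FIELDS) if bit not in JUMP_FUNCTION]

for bit in sorted(JUMP_FUNCTION):
    print("offset %3d -> %-14s = %d" % (bit, FIELDS[bit], JUMP_FUNCTION[bit]))
print("not in jump function:", MISSING)
```

Under this assumed layout every jump-function entry lands on a field whose constant matches the corresponding GIMPLE store, and the only members left uncovered are the variable data pointer plus the two dtype members (version and attribute) that are written only by the 8-byte zeroing MEM store. That is consistent with the guess that the MEM store is the one missing.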
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2021-12-07 9:31 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Some basic profiling with perf shows that perdida accounts for about 36% of
the runtime of the benchmark and has the interesting property that most of
the time it returns without looping (which probably explains PR103592).

It is not clear to me why one variant runs faster than the other: we do a lot
of spilling in both paths. This comparison uses -fno-tree-vectorize
-fno-tree-slp-vectorize, since I find the output easier to parse.

  36.91%  a.out-slow  a.out-slow  [.] __perdida_m_MOD_perdida.constprop.0
  34.15%  a.out-fast  a.out-fast  [.] MAIN__
  17.96%  a.out-slow  a.out-slow  [.] MAIN__
   7.37%  a.out-slow  a.out-slow  [.] __perdida_m_MOD_generalized_hookes_law.constprop.0.isra.0
   2.43%  a.out-fast  a.out-fast  [.] __perdida_m_MOD_generalized_hookes_law.constprop.0.isra.0
   0.24%  a.out-slow  a.out-slow  [.] __perdida_m_MOD_generalized_hookes_law.constprop.1.isra.0
   0.21%  a.out-fast  a.out-fast  [.]
__perdida_m_MOD_generalized_hookes_law.constprop.1.isra.0 perdida: │ ◆ │ 0000000000402500 <__perdida_m_MOD_perdida.constprop.0>: ▒ │ __perdida_m_MOD_perdida.constprop.0(): ▒ 1.37 │ push %rbp ▒ 0.00 │ vxorpd %xmm13,%xmm13,%xmm13 ▒ │ mov %rsp,%rbp ▒ 0.00 │ push %r15 ▒ 2.07 │ push %r14 ▒ 0.01 │ mov %rdx,%r15 ▒ 0.00 │ push %r13 ▒ 0.00 │ push %r12 ▒ 0.69 │ push %rbx ▒ 0.01 │ sub $0x3c8,%rsp ▒ 0.01 │ mov 0x30(%rbp),%rax ▒ 0.02 │ vmovsd (%rsi),%xmm6 ▒ 0.01 │ mov %rcx,-0x300(%rbp) ▒ 2.01 │ vmovsd (%rdx),%xmm1 ▒ 0.00 │ mov %rdi,-0x380(%rbp) ▒ 0.01 │ mov %r8,-0x388(%rbp) ▒ 0.04 │ lea -0x240(%rbp),%rdi ▒ 0.00 │ mov %r9,-0x390(%rbp) ▒ 0.66 │ mov %rsi,-0x308(%rbp) ▒ 0.03 │ lea -0x2a0(%rbp),%rsi ▒ 0.00 │ movq $0x1,-0x218(%rbp) ▒ 0.01 │ movq $0x3,-0x200(%rbp) ▒ 0.66 │ movq $0x3,-0x260(%rbp) ▒ 0.02 │ mov (%rax),%rbx ▒ 0.06 │ mov 0x38(%rbp),%rax ▒ 0.00 │ vmovsd %xmm6,-0x3d8(%rbp) ▒ 0.08 │ vmovsd %xmm1,-0x3e0(%rbp) ▒ 0.65 │ mov %rbx,-0x240(%rbp) ▒ 0.74 │ mov (%rax),%r14 ▒ 0.67 │ mov 0x40(%rbp),%rax ▒ 0.01 │ mov (%rax),%r13 ▒ 0.74 │ mov 0x58(%rbp),%rax ▒ 0.00 │ mov (%rax),%r12 ▒ 0.70 │ mov 0x48(%rbp),%rax ▒ 0.00 │ mov (%rax),%rax ▒ 0.69 │ mov %rax,-0x310(%rbp) ▒ 0.00 │ mov %rsp,%rax ▒ 0.00 │ sub $0x50,%rsp ▒ │ mov %rsp,%rcx ▒ 0.00 │ mov %rax,-0x2f8(%rbp) ▒ 0.01 │ mov %rcx,-0x2a0(%rbp) ▒ 0.01 │ vmovsd (%r14),%xmm0 ▒ 0.68 │ vsubsd 0x0(%r13),%xmm0,%xmm0 ▒ 2.85 │ vmovsd %xmm0,(%rsp) ▒ 2.09 │ vmovsd 0x8(%r14),%xmm0 ▒ 0.00 │ vsubsd 0x8(%r13),%xmm0,%xmm0 ▒ 0.01 │ vmovsd %xmm0,0x8(%rcx) ▒ 0.70 │ vmovsd 0x10(%r14),%xmm0 ▒ 0.01 │ vsubsd 0x10(%r13),%xmm0,%xmm0 ▒ 0.02 │ vmovsd %xmm0,0x10(%rcx) ▒ 0.69 │ vmovsd 0x18(%r14),%xmm0 ▒ 0.00 │ vsubsd 0x18(%r13),%xmm0,%xmm0 ▒ 0.59 │ vmovsd %xmm0,0x18(%rcx) ▒ 1.83 │ vmovsd 0x20(%r14),%xmm0 ▒ 0.00 │ vsubsd 0x20(%r13),%xmm0,%xmm0 ▒ 0.02 │ vmovsd %xmm0,0x20(%rcx) ▒ 0.72 │ vmovsd 0x28(%r14),%xmm0 ◆ 0.00 │ vsubsd 0x28(%r13),%xmm0,%xmm0 ▒ 0.01 │ vmovsd %xmm0,0x28(%rcx) ▒ 0.13 │ vmovsd 0x30(%r14),%xmm0 ▒ 0.01 │ vsubsd 0x30(%r13),%xmm0,%xmm0 ▒ 0.01 │ vmovsd 
%xmm0,0x30(%rcx) ▒ 0.70 │ vmovsd 0x38(%r14),%xmm0 ▒ 0.00 │ vsubsd 0x38(%r13),%xmm0,%xmm0 ▒ 0.02 │ vmovsd %xmm0,0x38(%rcx) ▒ 0.69 │ vmovsd 0x40(%r14),%xmm0 ▒ 0.00 │ vsubsd 0x40(%r13),%xmm0,%xmm0 ▒ 0.01 │ vmovsd %xmm0,0x40(%rcx) ▒ 0.70 │ vmovsd %xmm6,%xmm6,%xmm0 ▒ 0.00 │ → call __perdida_m_MOD_generalized_hookes_law.co▒ 0.00 │ mov 0x68(%rbp),%rax ▒ 0.00 │ vmovsd delim.4+0xc,%xmm1 ▒ 0.00 │ mov -0x2f8(%rbp),%rsp ▒ 0.01 │ mov -0x308(%rbp),%r10 ▒ 0.01 │ vmovsd (%rax),%xmm4 ▒ 0.01 │ vsubsd %xmm4,%xmm1,%xmm8 ▒ 0.04 │ vmulsd (%rbx),%xmm8,%xmm2 ▒ 1.46 │ vmulsd 0x20(%rbx),%xmm8,%xmm1 ▒ 3.40 │ vcomisd %xmm13,%xmm4 ▒ 0.00 │ vmulsd 0x40(%rbx),%xmm8,%xmm0 ▒ 0.75 │ vmulsd 0x8(%rbx),%xmm8,%xmm9 ▒ 0.11 │ vmulsd 0x10(%rbx),%xmm8,%xmm6 ▒ 0.13 │ vmulsd 0x18(%rbx),%xmm8,%xmm11 ▒ 0.00 │ vaddsd %xmm1,%xmm2,%xmm3 ▒ 0.01 │ vmovsd %xmm2,(%rbx) ▒ 0.01 │ vmulsd 0x28(%rbx),%xmm8,%xmm5 ▒ 0.02 │ vmovsd %xmm1,0x20(%rbx) ▒ 0.00 │ vmulsd 0x30(%rbx),%xmm8,%xmm10 ▒ 0.60 │ vaddsd %xmm0,%xmm3,%xmm3 ▒ 1.50 │ vmovsd %xmm9,0x8(%rbx) ▒ 0.01 │ vmulsd 0x38(%rbx),%xmm8,%xmm7 ▒ 0.00 │ vmovsd %xmm0,0x40(%rbx) ▒ │ vmovsd %xmm6,0x10(%rbx) ▒ 0.00 │ vmovsd %xmm11,0x18(%rbx) ▒ 0.01 │ vmovsd %xmm5,0x28(%rbx) ▒ 0.02 │ vmovsd %xmm10,0x30(%rbx) ▒ 0.02 │ vmovsd %xmm7,0x38(%rbx) ▒ 0.69 │ ↓ jbe 207 ▒ 0.00 │ vcomisd 0xd6e2f(%rip),%xmm3 # 4d9530 <deli▒ 0.86 │ ↓ jb cf3 ▒ 0.24 │207: vmovsd delim.4+0x1c,%xmm7 ▒ 0.01 │ mov delim.4+0xc,%rax ▒ 0.00 │ mov %rax,-0x330(%rbp) ▒ 0.86 │ ↓ jb cf3 ▒ 0.24 │207: vmovsd delim.4+0x1c,%xmm7 ▒ 0.01 │ mov delim.4+0xc,%rax ▒ 0.00 │ mov %rax,-0x330(%rbp) ▒ 0.00 │ vmulsd %xmm7,%xmm3,%xmm6 ▒ 1.78 │ vmovsd %xmm6,-0x2f8(%rbp) ▒ 0.45 │229: vmovsd %xmm7,-0x3a8(%rbp) ▒ 0.00 │ vmovsd -0x2f8(%rbp),%xmm7 ▒ 5.12 │ vmovsd 0x8(%rbx),%xmm5 ▒ 0.02 │ vmovsd 0x10(%rbx),%xmm4 ▒ 0.04 │ vmovsd 0x18(%rbx),%xmm3 ▒ 0.01 │ vmovsd 0x28(%rbx),%xmm15 ▒ 0.03 │ vmovsd 0x38(%rbx),%xmm9 ◆ 0.04 │ vmovsd 0x30(%rbx),%xmm14 ▒ 0.03 │ vsubsd %xmm7,%xmm2,%xmm6 ▒ 2.00 │ vmovsd delim.4+0xc,%xmm2 ▒ 0.00 │ vsubsd %xmm7,%xmm1,%xmm1 ▒ 0.69 │ 
vsubsd %xmm7,%xmm0,%xmm7 ▒ 0.69 │ vmovsd %xmm5,-0x350(%rbp) ▒ 0.00 │ vmovsd %xmm4,-0x358(%rbp) ▒ 0.01 │ vmovsd %xmm3,-0x360(%rbp) ▒ 0.01 │ vmovsd %xmm6,-0x338(%rbp) ▒ 0.68 │ vmovsd %xmm1,-0x340(%rbp) ▒ 0.69 │ vmovsd %xmm9,-0x378(%rbp) ▒ 0.01 │ vmovsd %xmm15,-0x368(%rbp) ▒ 0.02 │ vmovsd %xmm14,-0x370(%rbp) ▒ 0.01 │ vmovsd %xmm7,-0x348(%rbp) ▒ 0.68 │ vdivsd %xmm8,%xmm2,%xmm8 ▒ 0.03 │ vmulsd %xmm5,%xmm8,%xmm11 ▒ 0.03 │ vsubsd 0x8(%r12),%xmm11,%xmm0 ▒ 0.01 │ vmulsd %xmm4,%xmm8,%xmm5 ▒ 0.00 │ vmulsd %xmm6,%xmm8,%xmm6 ▒ 0.04 │ vsubsd (%r12),%xmm6,%xmm4 ▒ 0.72 │ vmulsd %xmm3,%xmm8,%xmm12 ▒ 0.00 │ vmovsd %xmm5,-0x318(%rbp) ▒ 0.02 │ vmulsd %xmm1,%xmm8,%xmm10 ▒ 0.00 │ vsubsd 0x18(%r12),%xmm12,%xmm1 ▒ 0.01 │ vmulsd %xmm15,%xmm8,%xmm3 ▒ 0.00 │ vmovsd %xmm6,-0x320(%rbp) ▒ 0.02 │ vmulsd %xmm0,%xmm0,%xmm0 ▒ 0.10 │ vmulsd %xmm9,%xmm8,%xmm2 ▒ 0.64 │ vsubsd 0x28(%r12),%xmm3,%xmm9 ▒ 0.01 │ vmovsd %xmm3,-0x308(%rbp) ▒ 0.01 │ vfmadd132sd %xmm4,%xmm0,%xmm4 ▒ 2.70 │ vsubsd 0x10(%r12),%xmm5,%xmm0 ▒ 0.00 │ vmulsd %xmm14,%xmm8,%xmm15 ▒ 0.00 │ vmovsd %xmm2,-0x3a0(%rbp) ▒ 0.00 │ vmulsd %xmm7,%xmm8,%xmm8 ▒ 0.00 │ vmulsd %xmm1,%xmm1,%xmm1 ▒ 0.01 │ vmulsd %xmm9,%xmm9,%xmm9 ▒ 0.01 │ vfmadd132sd %xmm0,%xmm1,%xmm0 ▒ 0.06 │ vsubsd 0x20(%r12),%xmm10,%xmm1 ▒ 0.66 │ vfmadd132sd %xmm1,%xmm9,%xmm1 ▒ 0.06 │ vsubsd 0x30(%r12),%xmm15,%xmm9 ▒ 0.00 │ vaddsd %xmm0,%xmm1,%xmm1 ▒ 2.07 │ vsubsd 0x38(%r12),%xmm2,%xmm0 ▒ 0.00 │ vmulsd %xmm0,%xmm0,%xmm0 ▒ 0.01 │ vfmadd231sd %xmm9,%xmm9,%xmm0 ▒ 0.08 │ vsubsd 0x40(%r12),%xmm8,%xmm9 ▒ 0.00 │ vfmadd132sd %xmm9,%xmm4,%xmm9 ▒ 0.65 │ vaddsd %xmm9,%xmm0,%xmm0 ▒ 2.04 │ vaddsd %xmm1,%xmm0,%xmm0 ▒ 2.09 │ vmulsd delim.4+0x24,%xmm0,%xmm14 ▒ 3.47 │ mov -0x300(%rbp),%rax ▒ 0.00 │ vmovsd (%rax),%xmm9 ▒ 0.00 │ mov 0x60(%rbp),%rax ▒ 0.00 │ vaddsd (%rax),%xmm9,%xmm4 ▒ 0.01 │ vsqrtsd %xmm14,%xmm14,%xmm14 ▒ 13.94 │ vsubsd %xmm4,%xmm14,%xmm4 ▒ 0.71 │ vcomisd %xmm4,%xmm13 ▒ 2.88 │ vmovsd %xmm4,-0x398(%rbp) ▒ 0.00 │ ↓ ja 3b0 ▒ 0.16 │ vcomisd 0xd6c9a(%rip),%xmm0 # 4d9548 <deli◆ 0.04 │ ↓ jae 
3bf ▒ 1.30 │3b0: lea -0x28(%rbp),%rsp ▒ 0.00 │ pop %rbx ▒ 0.00 │ pop %r12 ▒ 0.00 │ pop %r13 ▒ 0.00 │ pop %r14 ▒ 0.00 │ pop %r15 ▒ 0.02 │ pop %rbp ▒ 0.01 │ ← ret ▒ inlined variant: Percent│ vmovsd (%r14),%xmm0 ▒ 0.01 │ mov %rsp,%rax ▒ 0.01 │ sub $0x50,%rsp ▒ 0.00 │ vmovsd -0x448(%rbp),%xmm5 ▒ 0.06 │ vsubsd 0x0(%r13),%xmm0,%xmm0 ▒ 0.71 │ mov %rsp,%rdx ▒ 0.02 │ vmovsd -0x440(%rbp),%xmm1 ▒ 0.00 │ mov %rcx,-0x508(%rbp) ▒ 0.00 │ mov %r10,-0x4e8(%rbp) ▒ 0.01 │ mov %rax,-0x4e0(%rbp) ◆ 0.00 │ lea -0x250(%rbp),%rsi ▒ │ lea -0x2b0(%rbp),%rdi ▒ 0.05 │ mov %rdx,-0x250(%rbp) ▒ 0.67 │ movq $0x1,-0x288(%rbp) ▒ 0.02 │ vxorpd %xmm13,%xmm13,%xmm13 ▒ 0.00 │ movq $0x3,-0x270(%rbp) ▒ 0.00 │ mov %r12,-0x2b0(%rbp) ▒ 0.01 │ movq $0x3,-0x210(%rbp) ▒ 0.00 │ vmovsd %xmm5,-0x618(%rbp) ▒ 0.00 │ vmovsd %xmm1,-0x620(%rbp) ▒ 0.06 │ vmovsd %xmm0,(%rsp) ▒ 0.69 │ vmovsd 0x8(%r14),%xmm0 ▒ 0.03 │ vsubsd 0x8(%r13),%xmm0,%xmm0 ▒ 0.01 │ vmovsd %xmm0,0x8(%rdx) ▒ 0.01 │ vmovsd 0x10(%r14),%xmm0 ▒ 0.02 │ vsubsd 0x10(%r13),%xmm0,%xmm0 ▒ 0.00 │ vmovsd %xmm0,0x10(%rdx) ▒ 0.01 │ vmovsd 0x18(%r14),%xmm0 ▒ 0.06 │ vsubsd 0x18(%r13),%xmm0,%xmm0 ▒ 0.69 │ vmovsd %xmm0,0x18(%rdx) ▒ 0.03 │ vmovsd 0x20(%r14),%xmm0 ▒ 0.01 │ vsubsd 0x20(%r13),%xmm0,%xmm0 ▒ 0.01 │ vmovsd %xmm0,0x20(%rdx) ▒ 0.02 │ vmovsd 0x28(%r14),%xmm0 ▒ 0.00 │ vsubsd 0x28(%r13),%xmm0,%xmm0 ▒ 0.00 │ vmovsd %xmm0,0x28(%rdx) ▒ 0.06 │ vmovsd 0x30(%r14),%xmm0 ▒ 0.67 │ vsubsd 0x30(%r13),%xmm0,%xmm0 ▒ 0.03 │ vmovsd %xmm0,0x30(%rdx) ▒ 0.01 │ vmovsd 0x38(%r14),%xmm0 ▒ 0.00 │ vsubsd 0x38(%r13),%xmm0,%xmm0 ▒ 0.02 │ vmovsd %xmm0,0x38(%rdx) ▒ 0.01 │ vmovsd 0x40(%r14),%xmm0 ▒ 0.00 │ vsubsd 0x40(%r13),%xmm0,%xmm0 ▒ 0.06 │ vmovsd %xmm0,0x40(%rdx) ▒ 0.68 │ vmovsd %xmm5,%xmm5,%xmm0 ▒ 0.01 │ → call __perdida_m_MOD_generalized_hookes_law.c▒ 0.00 │ mov -0x510(%rbp),%rax ▒ 0.00 │ vmovsd delim.4+0x1c,%xmm6 ▒ 0.00 │ mov -0x4e0(%rbp),%rsp ▒ 0.05 │ mov -0x4e8(%rbp),%r10 ▒ 0.01 │ mov -0x508(%rbp),%rcx ▒ 0.07 │ vmovsd -0x10(%rax,%r15,8),%xmm4 ▒ 0.64 │ vsubsd %xmm4,%xmm6,%xmm8 ▒ 0.06 
│ vmulsd 0x8(%r12),%xmm8,%xmm6 ▒ 0.40 │ vmulsd 0x10(%r12),%xmm8,%xmm7 ▒ 0.25 │ vcomisd %xmm13,%xmm4 ▒ 0.02 │ vmovsd %xmm4,-0x570(%rbp) ▒ 0.00 │ vmulsd 0x18(%r12),%xmm8,%xmm5 ▒ 0.10 │ vmulsd (%r12),%xmm8,%xmm2 ▒ 0.54 │ vmulsd 0x20(%r12),%xmm8,%xmm1 ▒ 0.85 │ vmulsd 0x40(%r12),%xmm8,%xmm0 ▒ 0.44 │ vmovsd %xmm6,-0x4f0(%rbp) ▒ 0.04 │ vmovsd %xmm6,0x8(%r12) ▒ 0.10 │ vmovsd %xmm7,-0x4f8(%rbp) ▒ 0.01 │ vmovsd %xmm7,0x10(%r12) ◆ 0.14 │ vmulsd 0x28(%r12),%xmm8,%xmm6 ▒ 0.19 │ vmulsd 0x38(%r12),%xmm8,%xmm7 ▒ 0.25 │ vmovsd %xmm5,-0x500(%rbp) ▒ 0.03 │ vmovsd %xmm5,0x18(%r12) ▒ 0.37 │ vmovsd %xmm2,(%r12) ▒ 0.03 │ vmulsd 0x30(%r12),%xmm8,%xmm5 ▒ 0.09 │ vaddsd %xmm1,%xmm2,%xmm3 ▒ 0.00 │ vmovsd %xmm1,0x20(%r12) ▒ 0.19 │ vmovsd %xmm0,0x40(%r12) ▒ 0.16 │ vaddsd %xmm0,%xmm3,%xmm3 ▒ 0.76 │ vmovsd %xmm6,0x28(%r12) ▒ 0.09 │ vmovsd %xmm7,0x38(%r12) ▒ 0.10 │ vmovsd %xmm5,0x30(%r12) ▒ 0.10 │ ↓ jbe b51 ▒ 0.01 │ vcomisd 0xd17ed(%rip),%xmm3 # 4d8578 <del▒ 0.55 │ ↓ jb 1a11 ▒ 0.23 │ b51: vmovsd delim.4+0x64,%xmm4 ▒ 0.01 │ vmulsd %xmm4,%xmm3,%xmm3 ▒ 1.38 │ vmovsd %xmm4,-0x5d8(%rbp) ▒ 0.01 │ vmovsd %xmm3,-0x508(%rbp) ▒ 0.48 │ b6d: vsubsd %xmm3,%xmm2,%xmm4 ▒ 0.02 │ vsubsd %xmm3,%xmm1,%xmm2 ▒ 0.57 │ vmovsd %xmm5,-0x600(%rbp) ▒ 0.15 │ vmovsd %xmm6,-0x608(%rbp) ▒ 0.18 │ vsubsd %xmm3,%xmm0,%xmm1 ▒ 0.52 │ vmovsd %xmm7,-0x5f8(%rbp) ▒ 0.30 │ vmovsd delim.4+0x1c,%xmm3 ▒ 0.09 │ vmovsd %xmm4,-0x578(%rbp) ▒ 0.95 │ vmovsd %xmm2,-0x580(%rbp) ▒ 0.73 │ vmovsd %xmm1,-0x588(%rbp) ▒ 0.55 │ vdivsd %xmm8,%xmm3,%xmm0 ▒ 5.95 │ vmulsd -0x4f0(%rbp),%xmm0,%xmm13 ▒ 2.14 │ vmulsd -0x4f8(%rbp),%xmm0,%xmm12 ▒ 0.65 │ vmulsd %xmm0,%xmm5,%xmm10 ▒ 0.09 │ vmulsd %xmm6,%xmm0,%xmm3 ▒ 0.01 │ vmovsd 0x8(%rbx),%xmm5 ▒ 0.01 │ vmulsd %xmm7,%xmm0,%xmm6 ▒ 0.68 │ vmulsd %xmm1,%xmm0,%xmm1 ▒ 0.15 │ vmovsd (%rbx),%xmm7 ▒ 0.00 │ vmovsd %xmm3,-0x5e0(%rbp) ▒ 0.01 │ vmulsd %xmm4,%xmm0,%xmm9 ▒ 0.69 │ vmulsd %xmm2,%xmm0,%xmm14 ▒ 0.10 │ vmovsd %xmm1,-0x4e8(%rbp) ▒ 0.05 │ vmovsd %xmm6,-0x4e0(%rbp) ▒ 0.62 │ vmovsd %xmm5,-0x520(%rbp) ▒ 0.05 │ vmovsd 
0x10(%rbx),%xmm6 ▒ 0.00 │ vmulsd -0x500(%rbp),%xmm0,%xmm4 ▒ 0.11 │ vsubsd %xmm5,%xmm13,%xmm0 ▒ 0.01 │ vsubsd %xmm7,%xmm9,%xmm2 ▒ 0.09 │ vmovsd %xmm9,-0x5c0(%rbp) ▒ 0.70 │ vmovsd %xmm7,-0x518(%rbp) ▒ 0.06 │ vmovsd %xmm14,-0x5c8(%rbp) ▒ 0.64 │ vmulsd %xmm0,%xmm0,%xmm1 ▒ 0.06 │ vmovsd %xmm0,-0x5a8(%rbp) ▒ 0.71 │ vmovsd %xmm2,-0x5a0(%rbp) ▒ 0.72 │ vfmadd231sd %xmm2,%xmm2,%xmm1 ◆ 0.81 │ vmovsd 0x18(%rbx),%xmm2 ▒ 0.01 │ vsubsd %xmm6,%xmm12,%xmm15 ▒ 0.00 │ vmovsd %xmm6,-0x528(%rbp) ▒ 0.05 │ vmovsd %xmm15,-0x5b0(%rbp) ▒ 0.07 │ vmovsd %xmm4,-0x5f0(%rbp) ▒ 0.63 │ vmovsd %xmm2,-0x530(%rbp) ▒ 0.07 │ vsubsd %xmm2,%xmm4,%xmm2 ▒ 0.01 │ vmovsd 0x20(%rbx),%xmm4 ▒ 0.10 │ vmulsd %xmm2,%xmm2,%xmm0 ▒ 0.06 │ vmovsd %xmm2,-0x5e8(%rbp) ▒ 0.62 │ vfmadd231sd %xmm15,%xmm15,%xmm0 ▒ 1.14 │ vmovsd %xmm4,-0x538(%rbp) ▒ 0.07 │ vsubsd %xmm4,%xmm14,%xmm14 ▒ 0.00 │ vaddsd %xmm0,%xmm1,%xmm1 ▒ 2.07 │ vmovsd 0x28(%rbx),%xmm0 ▒ 0.00 │ vsubsd %xmm0,%xmm3,%xmm15 ▒ 0.00 │ vmovsd %xmm0,-0x540(%rbp) ▒ 0.04 │ vmovsd 0x30(%rbx),%xmm3 ▒ 0.00 │ vmovsd -0x4e8(%rbp),%xmm4 ▒ 0.01 │ vmovsd %xmm10,-0x5d0(%rbp) ▒ 0.05 │ vmovsd 0x38(%rbx),%xmm9 ▒ 0.00 │ vmovsd -0x4e0(%rbp),%xmm7 ▒ 0.69 │ vmulsd %xmm15,%xmm15,%xmm0 ▒ 0.00 │ mov -0x568(%rbp),%rax ▒ 0.00 │ vfmadd231sd %xmm14,%xmm14,%xmm0 ▒ 0.02 │ vsubsd %xmm3,%xmm10,%xmm11 ▒ 0.00 │ vmovsd 0x40(%rbx),%xmm10 ▒ 0.00 │ vmovsd %xmm3,-0x548(%rbp) ▒ 0.05 │ vmovsd %xmm9,-0x550(%rbp) ▒ 0.06 │ vsubsd %xmm9,%xmm7,%xmm3 ▒ 0.68 │ vfmadd231sd %xmm11,%xmm11,%xmm0 ▒ 0.19 │ vmovsd %xmm3,-0x5b8(%rbp) ▒ 0.06 │ vmovsd %xmm10,-0x558(%rbp) ▒ 0.06 │ vsubsd %xmm10,%xmm4,%xmm10 ▒ 0.00 │ vxorpd %xmm4,%xmm4,%xmm4 ▒ 0.00 │ vaddsd %xmm0,%xmm1,%xmm1 ▒ 0.91 │ vmulsd %xmm10,%xmm10,%xmm0 ▒ 0.02 │ vfmadd231sd %xmm3,%xmm3,%xmm0 ▒ 0.78 │ vaddsd %xmm1,%xmm0,%xmm0 ▒ 2.21 │ vmovsd -0x10(%rax,%r15,8),%xmm1 ▒ 0.01 │ vmulsd delim.4+0x6c,%xmm0,%xmm9 ▒ 3.77 │ vmovsd %xmm1,-0x590(%rbp) ▒ 0.00 │ vaddsd -0x3f8(%rbp),%xmm1,%xmm1 ▒ 0.00 │ vsqrtsd %xmm9,%xmm9,%xmm9 ▒ 15.05 │ vsubsd %xmm1,%xmm9,%xmm5 ▒ 0.81 │ vmovsd 
%xmm1,-0x560(%rbp) ▒ 0.00 │ vcomisd %xmm5,%xmm4 ▒ 3.48 │ vmovsd %xmm5,-0x598(%rbp) ▒ 0.00 │ ↑ ja 910 ▒ 0.17 │ vcomisd 0xd15d4(%rip),%xmm0 # 4d8590 <del▒ 0.07 │ ↑ jb 910 ▒

The two variants look very similar to me except for the bookkeeping moves.
With perf stat I get:

 Performance counter stats for './a.out-slow':

         37,676.01 msec task-clock                #    1.000 CPUs utilized
               113      context-switches          #    2.999 /sec
                24      cpu-migrations            #    0.637 /sec
                70      page-faults               #    1.858 /sec
   142,922,291,843      cycles                    #    3.793 GHz
       128,124,341      stalled-cycles-frontend   #    0.09% frontend cycles idle
   110,415,979,659      stalled-cycles-backend    #   77.26% backend cycles idle
   260,215,601,013      instructions              #    1.82  insn per cycle
                                                  #    0.42  stalled cycles per insn
     9,809,351,864      branches                  #  260.361 M/sec
        23,585,751      branch-misses             #    0.24% of all branches

      37.682154164 seconds time elapsed

      37.677998000 seconds user
       0.000000000 seconds sys

 Performance counter stats for './a.out-fast':

         23,892.07 msec task-clock                #    1.000 CPUs utilized
                36      context-switches          #    1.507 /sec
                13      cpu-migrations            #    0.544 /sec
                70      page-faults               #    2.930 /sec
    84,630,483,861      cycles                    #    3.542 GHz
       105,428,600      stalled-cycles-frontend   #    0.12% frontend cycles idle
    59,095,901,330      stalled-cycles-backend    #   69.83% backend cycles idle
   191,156,427,437      instructions              #    2.26  insn per cycle
                                                  #    0.31  stalled cycles per insn
     8,612,504,734      branches                  #  360.475 M/sec
        23,436,092      branch-misses             #    0.27% of all branches

      23.902460280 seconds time elapsed

      23.885982000 seconds user
       0.008000000 seconds sys

So the slow binary executes about 36% more instructions (260.2 versus 191.2
billion), which I could imagine to be mostly the moves.
Perhaps perdida is simply very expensive to call given the job it does 0.00 │16d0: mov 0x38(%rsp),%rax ▒ 0.01 │ mov %r10,0x18(%rsp) ▒ 0.00 │ mov %r11,0x2f8(%rsp) ▒ 0.06 │ mov %r11,0x358(%rsp) ▒ 0.03 │ mov %r11,0x3b8(%rsp) ▒ 1.37 │ movq $0xfffffffffffffffc,0x2e8(%rsp) ▒ 0.01 │ movq $0x8,0x2f0(%rsp) ▒ 0.01 │ movq $0x8,0x300(%rsp) ◆ 0.01 │ movq $0x1,0x308(%rsp) ▒ 0.02 │ add %rbp,%rax ▒ 0.00 │ movq $0x1,0x310(%rsp) ▒ 0.05 │ movq $0x3,0x318(%rsp) ▒ 0.03 │ mov %rax,0x2e0(%rsp) ▒ 1.34 │ movq $0x3,0x320(%rsp) ▒ 0.03 │ movq $0x1,0x328(%rsp) ▒ 0.01 │ movq $0x3,0x330(%rsp) ▒ 0.03 │ movq $0xfffffffffffffffc,0x348(%rsp) ▒ 0.01 │ movq $0x8,0x350(%rsp) ▒ 0.02 │ movq $0x8,0x360(%rsp) ▒ 0.05 │ movq $0x1,0x368(%rsp) ▒ 0.04 │ movq $0x1,0x370(%rsp) ▒ 1.35 │ movq $0x3,0x378(%rsp) ▒ 0.03 │ movq $0x3,0x380(%rsp) ▒ 0.01 │ movq $0x1,0x388(%rsp) ▒ 0.26 │ movq $0x3,0x390(%rsp) ▒ 0.02 │ movq $0xfffffffffffffffc,0x3a8(%rsp) ▒ 0.26 │ movq $0x8,0x3b0(%rsp) ▒ 0.05 │ movq $0x8,0x3c0(%rsp) ▒ 0.09 │ movq $0x1,0x3c8(%rsp) ▒ 1.34 │ mov 0x28(%rsp),%rax ▒ 0.02 │ add %rbp,%rax ▒ 0.01 │ mov %rax,0x340(%rsp) ▒ 0.02 │ mov 0x40(%rsp),%rax ▒ 0.01 │ add %rbp,%rax ▒ 0.00 │ mov %rax,0x3a0(%rsp) ▒ 0.09 │ movq $0x1,0x3d0(%rsp) ▒ 0.04 │ mov 0x30(%rsp),%rax ▒ 1.35 │ mov %r11,0x418(%rsp) ▒ 0.03 │ mov %r11,0x478(%rsp) ▒ 0.02 │ movq $0x3,0x3d8(%rsp) ▒ 0.04 │ movq $0x3,0x3e0(%rsp) ▒ 0.01 │ movq $0x1,0x3e8(%rsp) ▒ 0.07 │ movq $0x3,0x3f0(%rsp) ▒ 0.06 │ add %rbp,%rax ▒ 0.02 │ movq $0xfffffffffffffffc,0x408(%rsp) ▒ 1.36 │ movq $0x8,0x410(%rsp) ▒ 0.48 │ mov %rax,0x400(%rsp) ▒ 0.01 │ movq $0x8,0x420(%rsp) ▒ 0.24 │ movq $0x1,0x428(%rsp) ▒ 0.01 │ movq $0x1,0x430(%rsp) ▒ 0.24 │ movq $0x3,0x438(%rsp) ▒ 0.01 │ movq $0x3,0x440(%rsp) ▒ 0.28 │ movq $0x1,0x448(%rsp) ▒ 1.32 │ movq $0x3,0x450(%rsp) ▒ 13.32 │ movq $0xfffffffffffffffc,0x468(%rsp) ▒ 0.01 │ movq $0x8,0x470(%rsp) ▒ 5.68 │ movq $0x8,0x480(%rsp) ▒ 0.01 │ movq $0x1,0x488(%rsp) ▒ 2.87 │ movq $0x1,0x490(%rsp) ▒ 0.06 │ movq $0x3,0x498(%rsp) ▒ 2.82 │ movq $0x3,0x4a0(%rsp) ▒ 0.02 │ 
movq $0x1,0x4a8(%rsp) ▒ 2.87 │ movq $0x3,0x4b0(%rsp) ▒ 0.02 │ mov 0x48(%rsp),%rax ◆ 1.41 │ add %rbp,%rax ▒ 0.02 │ add $0x48,%rbp ▒ 0.00 │ mov %rax,0x460(%rsp) ▒ 1.43 │ lea 0x170(%rsp),%rax ▒ 0.00 │ push %rax ▒ 1.42 │ lea 0x190(%rsp),%rax ▒ 0.02 │ push %rax ▒ 4.28 │ push %r13 ▒ 0.01 │ add $0x8,%r13 ▒ 0.00 │ push %r14 ▒ 4.27 │ add $0x8,%r14 ▒ │ lea 0x480(%rsp),%rax ▒ 0.01 │ push %rax ▒ 0.03 │ push %r12 ▒ 0.00 │ add $0x8,%r12 ▒ 0.00 │ lea 0x430(%rsp),%rax ▒ 1.42 │ push %rax ▒ 2.77 │ lea 0x3d8(%rsp),%rax ▒ 0.01 │ push %rax ▒ 1.43 │ lea 0x380(%rsp),%rax ▒ 0.01 │ push %rax ▒ 1.47 │ lea 0x328(%rsp),%rax ▒ 0.00 │ push %rax ▒ 4.16 │ lea 0x1f8(%rsp),%rax ▒ 0.01 │ push %rax ▒ 2.86 │ lea 0x1d8(%rsp),%rax ▒ 0.00 │ push %rax ▒ 1.56 │ lea 0x1f0(%rsp),%rax ▒ 0.01 │ push %rax ▒ 5.42 │ lea 0x248(%rsp),%rax ▒ 0.01 │ push %rax ▒ 1.43 │ lea 0x1c8(%rsp),%r9 ▒ 0.00 │ lea 0x220(%rsp),%r8 ▒ 0.0 0.00 │ lea 0x210(%rsp),%rdx ▒ 0.00 │ lea 0x208(%rsp),%rsi ▒ 0.01 │ lea 0x1e8(%rsp),%rdi ▒ 0.03 │ → call __perdida_m_MOD_perdida.constprop.0 ▒ 0 │ lea 0x258(%rsp),%rcx ▒ We do constant propagate the array descriptor (up to the one variable entry) but we do not optimize out the code initializing them. ^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2021-12-07 9:49 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-12-07
             Status|UNCONFIRMED                 |NEW

--- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
The perdida modref summary is:

  loads:
      Every base
  stores:
      Every base
  Side effects
  Nondeterministic
  Global memory read
  Global memory written
  parm 0 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 1 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 2 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 3 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 4 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 5 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 6 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 7 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 8 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 9 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape no_indirect_read
  parm 10 flags: no_direct_clobber no_direct_escape no_indirect_escape
  parm 11 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 12 flags: no_direct_clobber no_direct_escape no_indirect_escape
  parm 13 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
  parm 14 flags: no_direct_escape no_indirect_escape
  parm 15 flags: no_direct_clobber no_direct_escape no_indirect_escape
  parm 16 flags: no_direct_escape no_indirect_escape
  parm 17 flags: no_direct_escape no_indirect_escape
  parm 18 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape no_indirect_read
  parm 19 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape

__attribute__((fn spec (". r r r r r r r r r r w r w r w w w w r r ")))

So we cannot use it to optimize out the dead parameters. The reason is that
perdida calls:

  dt_parm.237.common.filename = &"fatigue2.f90"[1]{lb: 1 sz: 1};
  dt_parm.237.common.line = 1010;
  MEM <vector(2) integer(kind=4)> [(integer(kind=4) *)&dt_parm.237] = { 128, 6 };
  _gfortran_st_write (&dt_parm.237);
  _gfortran_transfer_character_write (&dt_parm.237, &"discriminant is negative in perdida, abort."[1]{lb: 1 sz: 1}, 43);
  _gfortran_st_write_done (&dt_parm.237);
  dt_parm.237 ={v} {CLOBBER};
  _gfortran_stop_string (0B, 0, 0);

and Fortran I/O has no useful fnspec on it, so we assume it accesses
everything.

To optimize out the initializer we could (I guess)
 1) teach ipa-cp to turn the one remaining field into a scalar; we would then
    detect the descriptor as non-escaping from MAIN and optimize it out
 2) teach the PTA summary of ipa-modref to be expressive enough for PTA to
    work this out.

-fipa-pta does not help (but I think it could if it worked well).
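The fn spec string quoted above can be cross-checked against the per-parameter modref flags. The sketch below assumes the usual fnspec layout as I understand it (two leading characters for the return value and the function, then two characters per argument, where 'r' marks an argument that is only read and 'w' one that may also be written); treat that reading as an assumption rather than a statement of GCC's documented format. The helper name is hypothetical.

```python
# fn spec string copied from comment #3.
FNSPEC = ". r r r r r r r r r r w r w r w w w w r r "

# Parameters whose modref flags lack no_direct_clobber and/or
# no_indirect_clobber in the summary quoted in comment #3.
MAY_CLOBBER = [10, 12, 14, 15, 16, 17]


def decode_fnspec(spec):
    """Split a fnspec string into its 2-character prefix and per-argument letters.

    Assumption: each argument occupies two characters, the access letter
    followed by a slot that is blank in this string.
    """
    prefix, args = spec[:2], spec[2:]
    return prefix, [args[i] for i in range(0, len(args), 2)]


PREFIX, LETTERS = decode_fnspec(FNSPEC)
WRITTEN = [i for i, letter in enumerate(LETTERS) if letter == "w"]

print("arguments:", len(LETTERS))
print("marked 'w':", WRITTEN)
print("matches modref clobber flags:", WRITTEN == MAY_CLOBBER)
```

Read this way, the string describes 20 arguments, and the ones marked 'w' are exactly the parameters for which the summary does not promise both no_direct_clobber and no_indirect_clobber.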
We get: <bb 69> [local count: 109362591]: _1895 = (unsigned long) _44; _1894 = _1895 + ivtmp.1050_1911; _246 = (real(kind=8) *) _1894; parm.326.data = _246; parm.326.offset = -4; parm.326.dtype.elem_len = 8; MEM <unsigned long> [(void *)&parm.326 + 24B] = 3307124817920; parm.326.span = 8; parm.326.dim[0].stride = 1; parm.326.dim[0].lbound = 1; parm.326.dim[0].ubound = 3; parm.326.dim[1].stride = 3; parm.326.dim[1].lbound = 1; parm.326.dim[1].ubound = 3; _1883 = (unsigned long) _33; _1882 = _1883 + ivtmp.1050_1911; _247 = (real(kind=8) *) _1882; parm.327.data = _247; parm.327.offset = -4; parm.327.dtype.elem_len = 8; MEM <unsigned long> [(void *)&parm.327 + 24B] = 3307124817920; parm.327.span = 8; parm.327.dim[0].stride = 1; parm.327.dim[0].lbound = 1; parm.327.dim[0].ubound = 3; parm.327.dim[1].stride = 3; parm.327.dim[1].lbound = 1; parm.327.dim[1].ubound = 3; _1881 = (unsigned long) _56; _1880 = _1881 + ivtmp.1050_1911; _248 = (real(kind=8) *) _1880; parm.328.data = _248; parm.328.offset = -4; parm.328.dtype.elem_len = 8; MEM <unsigned long> [(void *)&parm.328 + 24B] = 3307124817920; parm.328.span = 8; parm.328.dim[0].stride = 1; parm.328.dim[0].lbound = 1; parm.328.dim[0].ubound = 3; parm.328.dim[1].stride = 3; parm.328.dim[1].lbound = 1; parm.328.dim[1].ubound = 3; _1879 = (unsigned long) _20; _1878 = _1879 + ivtmp.1050_1911; _249 = (real(kind=8) *) _1878; parm.329.data = _249; parm.329.offset = -4; parm.329.dtype.elem_len = 8; MEM <unsigned long> [(void *)&parm.329 + 24B] = 3307124817920; parm.329.span = 8; parm.329.dim[0].stride = 1; parm.329.dim[0].lbound = 1; parm.329.dim[0].ubound = 3; parm.329.dim[1].stride = 3; parm.329.dim[1].lbound = 1; parm.329.dim[1].ubound = 3; _1867 = (unsigned long) _67; _1866 = _1867 + ivtmp.1050_1911; _250 = (real(kind=8) *) _1866; parm.330.data = _250; parm.330.offset = -4; parm.330.dtype.elem_len = 8; MEM <unsigned long> [(void *)&parm.330 + 24B] = 3307124817920; parm.330.span = 8; parm.330.dim[0].stride = 1; 
parm.330.dim[0].lbound = 1; parm.330.dim[0].ubound = 3; parm.330.dim[1].stride = 3; parm.330.dim[1].lbound = 1; parm.330.dim[1].ubound = 3; _251 = (real(kind=8) *) ivtmp.1049_1915; _252 = (real(kind=8) *) ivtmp.1048_1925; _253 = (real(kind=8) *) ivtmp.1047_1936; perdida.constprop (&dt, &lambda, &mu, &yield_stress, &r_infinity, &b, &x_infinity, &gamma, &eta, &plastic_strain_threshold, &parm.326, &parm.327, &parm.328, &parm.329, _253, &parm.330, _252, _251, &failure_threshold, &crack_closure_parameter); parm.326 ={v} {CLOBBER}; parm.327 ={v} {CLOBBER}; parm.328 ={v} {CLOBBER}; parm.329 ={v} {CLOBBER}; parm.330 ={v} {CLOBBER}; ^ permalink raw reply [flat|nested] 16+ messages in thread
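The merged store `MEM <unsigned long> [(void *)&parm.326 + 24B] = 3307124817920` in the dump above is easier to read once the constant is split back into fields. The sketch below assumes a little-endian x86-64 target and assumes that the 8 bytes at offset 24 hold version (32-bit), rank (8-bit), type (8-bit) and attribute (16-bit), in that order; the variable names are illustrative only.

```python
import struct

# Constant stored at byte offset 24 of each descriptor in comment #3.
PACKED = 3307124817920

# Assumed little-endian layout: int32 version, int8 rank, int8 type, int16 attribute.
raw = PACKED.to_bytes(8, "little")
version, rank, type_code, attribute = struct.unpack("<ibbh", raw)

print(hex(PACKED))
print("version=%d rank=%d type=%d attribute=%d" % (version, rank, type_code, attribute))
```

Under those assumptions the constant decodes to rank 2 and type 3 with version and attribute zero, which agrees with the separate `dtype.rank = 2` and `dtype.type = 3` stores and the zeroing `MEM <c_char[8]>` store shown in comment #1.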
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2021-12-07 10:10 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 51940
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51940&action=edit
gcov report. perdida does have some loops in it, but it seems we optimize
them out or make them relatively cheap.
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2021-12-07 10:17 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #5 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Quick counting from the gcov data: there are 585898984 calls and, from the
disassembly, the call sequence is 101 instructions, so it accounts for
roughly 58589898400 instructions (rounding the call sequence to 100
instructions). The perf reports show a difference of 69059173576
instructions, so at least instruction-count-wise most of the difference is
the call overhead of perdida.
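The estimate in comment #5 can be reproduced from numbers already quoted in the thread: the call count from the gcov data, the 101-instruction call sequence, and the instruction totals from the two perf stat runs in comment #2. This is only a back-of-the-envelope sketch; the variable names are made up for illustration.

```python
# Numbers copied from comments #2 and #5.
CALLS = 585_898_984            # calls to perdida according to gcov
INSNS_PER_CALL = 101           # length of the call sequence in the disassembly
SLOW_INSNS = 260_215_601_013   # instructions retired by ./a.out-slow
FAST_INSNS = 191_156_427_437   # instructions retired by ./a.out-fast

call_overhead = CALLS * INSNS_PER_CALL
rounded_overhead = CALLS * 100          # the figure quoted in comment #5
insn_difference = SLOW_INSNS - FAST_INSNS
share = call_overhead / insn_difference

print("call overhead:          ", call_overhead)
print("rounded (100 insns):    ", rounded_overhead)
print("perf instruction delta: ", insn_difference)
print("share of delta: %.1f%%" % (100 * share))
```

With either 100 or 101 instructions per call, the call sequence explains well over four fifths of the instruction-count difference between the two binaries.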
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: marxin at gcc dot gnu.org @ 2021-12-07 10:24 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #6 from Martin Liška <marxin at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #4)
> Created attachment 51940 [details]
> gcov report. perdida does have some loops in it but it seems we optimize
> them out/make them relatively cheap.

HTML colored version of the report:
https://splichal.eu/tmp/fatique.gcov.html
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2021-12-08 18:24 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 51952
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51952&action=edit
Patch to teach modref about global memory

This patch extends modref so that we eliminate the dead parts of array
descriptors. It does so by distinguishing accesses to global memory from
accesses to unknown memory. The former is less evil, since it cannot touch
anything that does not escape, and it is possible to arrange for unanalyzed
functions to access only global memory (since everything reaching them
escapes).

With the patch I now get

 Performance counter stats for './a.out-badmod':

         27,448.82 msec task-clock                #    0.999 CPUs utilized
               334      context-switches          #   12.168 /sec
                10      cpu-migrations            #    0.364 /sec
                70      page-faults               #    2.550 /sec
   109,079,862,444      cycles                    #    3.974 GHz
       130,234,675      stalled-cycles-frontend   #    0.12% frontend cycles idle
    80,559,978,185      stalled-cycles-backend    #   73.85% backend cycles idle
   230,337,342,608      instructions              #    2.11  insn per cycle
                                                  #    0.35  stalled cycles per insn
     9,808,852,327      branches                  #  357.351 M/sec
        24,742,926      branch-misses             #    0.25% of all branches

      27.468971377 seconds time elapsed

      27.445934000 seconds user
       0.003999000 seconds sys

so the 56% slowdown is now only 22%, which is quite a nice improvement.

The call of perdida now looks as follows:

  <bb 70> [local count: 109362591]:
  _818 = (unsigned long) _44;
  _817 = _818 + ivtmp.1050_1737;
  _246 = (real(kind=8) *) _817;
  parm.326.data = _246;
  _811 = (unsigned long) _33;
  _810 = _811 + ivtmp.1050_1737;
  _247 = (real(kind=8) *) _810;
  parm.327.data = _247;
  _712 = (unsigned long) _56;
  _711 = _712 + ivtmp.1050_1737;
  _248 = (real(kind=8) *) _711;
  parm.328.data = _248;
  _612 = (unsigned long) _20;
  _574 = _612 + ivtmp.1050_1737;
  _249 = (real(kind=8) *) _574;
  parm.329.data = _249;
  _573 = (unsigned long) _67;
  _572 = _573 + ivtmp.1050_1737;
  _250 = (real(kind=8) *) _572;
  parm.330.data = _250;
  _251 = (real(kind=8) *) ivtmp.1049_1740;
  _252 = (real(kind=8) *) ivtmp.1048_1753;
  _253 = (real(kind=8) *) ivtmp.1047_1762;
  perdida.constprop (&dt, &lambda, &mu, &yield_stress, &r_infinity, &b,
                     &x_infinity, &gamma, &eta, &plastic_strain_threshold,
                     &parm.326, &parm.327, &parm.328, &parm.329, _253,
                     &parm.330, _252, _251, &failure_threshold,
                     &crack_closure_parameter);
  parm.326 ={v} {CLOBBER};
  parm.327 ={v} {CLOBBER};
  parm.328 ={v} {CLOBBER};
  parm.329 ={v} {CLOBBER};
  parm.330 ={v} {CLOBBER};

and compiles to:

        movq    48(%rsp), %rax
        movq    %r10, 24(%rsp)
        addq    %rbp, %rax
        movq    %rax, 720(%rsp)
        movq    32(%rsp), %rax
        addq    %rbp, %rax
        movq    %rax, 816(%rsp)
        movq    64(%rsp), %rax
        addq    %rbp, %rax
        movq    %rax, 912(%rsp)
        movq    40(%rsp), %rax
        addq    %rbp, %rax
        movq    %rax, 1008(%rsp)
        movq    72(%rsp), %rax
        addq    %rbp, %rax
        addq    $72, %rbp
        movq    %rax, 1104(%rsp)
        leaq    352(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1720
        leaq    384(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1728
        pushq   %r13
        .cfi_def_cfa_offset 1736
        addq    $8, %r13
        pushq   %r12
        .cfi_def_cfa_offset 1744
        addq    $8, %r12
        leaq    1136(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1752
        pushq   %r14
        .cfi_def_cfa_offset 1760
        addq    $8, %r14
        leaq    1056(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1768
        leaq    968(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1776
        leaq    880(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1784
        leaq    792(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1792
        leaq    488(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1800
        leaq    456(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1808
        leaq    480(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1816
        leaq    568(%rsp), %rax
        pushq   %rax
        .cfi_def_cfa_offset 1824
        leaq    440(%rsp), %r9
        leaq    528(%rsp), %r8
        leaq    584(%rsp), %rcx
        leaq    512(%rsp), %rdx
        leaq    504(%rsp), %rsi
        leaq    472(%rsp), %rdi
        call    __perdida_m_MOD_perdida.constprop.0
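The situation described in comment #7 can be pictured at source level with a small C++ analogue. This is a sketch under stated assumptions, not the benchmark's code: `descriptor`, `report_error`, `first_element` and `caller` are hypothetical names, and the struct is only a rough stand-in for a Fortran array descriptor. The point is that the descriptor is a local whose address never escapes, so an opaque routine that only reaches global memory cannot read the bookkeeping fields, which makes the stores to them dead.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a Fortran array descriptor: a data pointer plus
// bookkeeping fields that a constant-propagated callee may never read.
struct descriptor {
  double *data;
  std::ptrdiff_t offset;
  std::ptrdiff_t stride;
  std::ptrdiff_t lbound;
  std::ptrdiff_t ubound;
};

// Global state that an error path may legitimately touch.
int error_count = 0;

// Stand-in for an unanalyzed routine such as runtime I/O. It only touches
// global memory and is never handed the caller's descriptor.
void report_error() { ++error_count; }

// Callee in the spirit of perdida.constprop: it reads only d.data and calls
// the opaque routine on a rarely taken path.
double first_element(const descriptor &d)
{
  if (d.data == nullptr) {
    report_error();
    return 0.0;
  }
  return d.data[0];
}

double caller(double *values, std::ptrdiff_t n)
{
  descriptor parm;      // local object; its address is not stored anywhere
  parm.data = values;   // live: the callee reads this field
  parm.offset = -1;     // dead if the callee provably ignores it
  parm.stride = 1;      // dead for the same reason
  parm.lbound = 1;      // dead for the same reason
  parm.ubound = n;      // dead for the same reason
  return first_element(parm);
}
```

For example, `caller` applied to an array starting with 3.5 returns 3.5, and passing a null pointer takes the error path and bumps `error_count`.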
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2021-12-08 18:28 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #8 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Also ipa-sra fails with

! Disqualifying parameter number 0 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 3 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 4 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 5 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 6 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 7 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 8 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 9 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 10 - Would result into a too big set of replacements.
! Disqualifying parameter number 11 - Would result into a too big set of replacements.
! Disqualifying parameter number 12 - Would result into a too big set of replacements.
! Disqualifying parameter number 13 - Would result into a too big set of replacements.
! Disqualifying parameter number 15 - Would result into a too big set of replacements.
! Disqualifying parameter number 18 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 19 - Would create a possibly illegal dereference in a caller.

I think it is overly paranoid about values passed by reference.
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2021-12-08 18:43 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #9 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Hacking around the logic in ipa-sra that disables the transform and adding
-fwhole-program, I get down to:

 Performance counter stats for './a.out-bad3':

         24,946.66 msec task-clock                #    0.999 CPUs utilized
             1,078      context-switches          #   43.212 /sec
                42      cpu-migrations            #    1.684 /sec
                71      page-faults               #    2.846 /sec
    96,144,941,575      cycles                    #    3.854 GHz
       151,439,200      stalled-cycles-frontend   #    0.16% frontend cycles idle
    68,072,941,085      stalled-cycles-backend    #   70.80% backend cycles idle
   210,675,636,303      instructions              #    2.19  insn per cycle
                                                  #    0.32  stalled cycles per insn
     9,128,994,716      branches                  #  365.941 M/sec
        24,781,891      branch-misses             #    0.27% of all branches

      24.982117481 seconds time elapsed

      24.909903000 seconds user
       0.036031000 seconds sys

which is not bad.

I think we have two ipa-sra issues:
 1) ipa-sra is overly paranoid about not adding dereferences.
    First, I think it is safe for parameters that are REFERENCE_TYPE rather
    than POINTER_TYPE.
    Second, it should do propagation from callers to callees: it is quite
    easy to figure out that a given param contains data packed into a
    structure only to make the callee happy.
 2) Since ipa-sra is run before ipa-cp, it won't simplify ipa-cp (or other)
    clones even if they are static symbols.

Also I think ipa-sra may consider packing multiple structures together. If
array descriptors are passed by reference and built in the caller, it seems
pointless to pass each as a separate struct. Not sure what kind of benefits
to expect here, though.
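The first point in comment #9 can be illustrated in C++, where the difference between a pointer and a reference parameter is visible in the source. This is a hand-written sketch with hypothetical function names, not output of ipa-sra: it only shows why hoisting a load into the caller is risky for a pointer that the callee dereferences conditionally, and harmless for a reference.

```cpp
#include <cassert>

// A pointer parameter may be null on paths where the callee never
// dereferences it, so loading it in the caller could introduce a fault that
// the original program did not have.
double scale_ptr(const double *factor, double x, bool use_factor)
{
  return use_factor ? *factor * x : x;
}

// A reference must bind to a valid object, so loading it in the caller is
// safe even though the callee reads it only conditionally.
double scale_ref(const double &factor, double x, bool use_factor)
{
  return use_factor ? factor * x : x;
}

// Hand-written equivalent of what a by-reference to by-value transform would
// produce for scale_ref: the caller performs the load and passes the value.
double scale_val(double factor, double x, bool use_factor)
{
  return use_factor ? factor * x : x;
}
```

`scale_ptr(nullptr, 3.0, false)` is well defined and returns 3.0, which is exactly the case that forbids an unconditional load in the caller; `scale_ref` and `scale_val` agree for every input.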
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2021-12-12 8:43 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #10 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
https://gcc.gnu.org/pipermail/gcc-patches/2021-December/586644.html
improves modref so it is almost able to remove the unnecessary stores to
array descriptors. Unfortunately, to eliminate them one needs to handle the
case that calling an unknown function is safe WRT all non-escaping values
(since passing them to the function makes them escape). That is sadly broken
by the Fortran FE incorrectly annotating functions calling GOMP_parallel.
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: cvs-commit at gcc dot gnu.org @ 2021-12-14 15:51 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #11 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:

https://gcc.gnu.org/g:3305135c29e1c3e988bd9bad40aefc01d138aaca

commit r12-5960-g3305135c29e1c3e988bd9bad40aefc01d138aaca
Author: Jan Hubicka <jh@suse.cz>
Date:   Tue Dec 14 16:50:27 2021 +0100

    Determine global memory accesses in ipa-modref

    As discussed in PR103585, fatigue2 is now the only benchmark from my
    usual testing set (SPEC2k6, SPEC2k17, CPP benchmarks, polyhedron,
    Firefox, clang) which sees an important regression when inlining of
    functions called once is limited. This prevents us from solving runtime
    issues in roms benchmarks and elsewhere.

    The problem is that there is the perdida function that takes many
    arguments, and some of them are array descriptors. We constant propagate
    most of their fields but still keep their initialization. Because
    perdida is quite fast, the call overhead dominates, since we need over
    100 memory stores consuming about 35% of the overall benchmark runtime.

    The memory stores would be eliminated if perdida did not call Fortran
    I/O, which makes modref think that the array descriptors could be
    accessed. We are quite close to discovering that they can't be, because
    they do not escape from the function. This patch makes modref
    distinguish between global memory accesses (only things that escape)
    and unknown accesses (that may also access non-escaping things reaching
    the function). This makes disambiguation for functions containing error
    handling better.

    Unfortunately the patch hits two semi-latent issues in the Fortran
    frontend. The first is wrong code in
    gfortran.dg/unlimited_polymorphic_3.f03. This can be turned into a
    wrong-code testcase on both mainline and gcc11 if the runtime call is
    removed, so I filed PR 103662 for it. There is a TBAA mismatch for a
    structure produced in the FE.

    The second is an issue with GOMP, where Fortran marks certain parameters
    as non-escaping and then makes them escape via GOMP_parallel. For this I
    disabled the use of escape info in verify_arg, which also disables the
    useful transform on perdida but still does useful work for e.g. GCC
    error handling. I will work on this incrementally.

    Bootstrapped/regtested x86_64-linux, lto-bootstrapped and also tested
    with clang build. I plan to commit this tomorrow if there are no
    complaints (the patch is not completely short but is conceptually simple
    and handles a lot of common cases).

    gcc/ChangeLog:

    2021-12-12  Jan Hubicka  <hubicka@ucw.cz>

            PR ipa/103585
            * ipa-modref-tree.c (modref_access_node::range_info_useful_p):
            Handle MODREF_GLOBAL_MEMORY_PARM.
            (modref_access_node::dump): Likewise.
            (modref_access_node::get_call_arg): Likewise.
            * ipa-modref-tree.h (enum modref_special_parms): Add
            MODREF_GLOBAL_MEMORY_PARM.
            (modref_access_node::useful_for_kill): Handle
            MODREF_GLOBAL_MEMORY_PARM.
            (modref:tree::merge): Add promote_unknown_to_global.
            * ipa-modref.c (verify_arg): New function.
            (may_access_nonescaping_parm_p): New function.
            (modref_access_analysis::record_global_memory_load): New member
            function.
            (modref_access_analysis::record_global_memory_store): Likewise.
            (modref_access_analysis::process_fnspec): Distinguish global and
            local memory.
            (modref_access_analysis::analyze_call): Likewise.
            * tree-ssa-alias.c (ref_may_access_global_memory_p): New function.
            (modref_may_conflict): Use it.

    gcc/testsuite/ChangeLog:

    2021-12-12  Jan Hubicka  <hubicka@ucw.cz>

            * gcc.dg/analyzer/data-model-1.c: Disable ipa-modref.
            * gcc.dg/uninit-38.c: Likewise.
            * gcc.dg/uninit-pr98578.c: Likewise.
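The distinction this commit draws between global and unknown memory accesses can be summarised as a small decision rule. The following C++ is a toy model written for illustration only: `callee_access` and `call_may_touch` are hypothetical names and do not correspond to GCC's data structures or to the functions listed in the ChangeLog.

```cpp
#include <cassert>

// What a callee's summary says about memory it may touch (toy model).
enum class callee_access { none, global_memory, unknown_memory };

// Can a call with the given summary read or write a particular object in the
// caller? `object_escapes` says whether the object's address has become
// reachable from global memory or from other functions.
inline bool call_may_touch(callee_access summary, bool object_escapes)
{
  switch (summary) {
  case callee_access::none:
    return false;            // the callee touches no memory at all
  case callee_access::global_memory:
    return object_escapes;   // only escaped objects are reachable
  case callee_access::unknown_memory:
    return true;             // must assume everything may be touched
  }
  return true;               // conservative fallback for out-of-range values
}
```

Under this rule, stores to a non-escaping local before a call whose summary is `global_memory` are candidates for removal, whereas a summary of `unknown_memory` keeps them all alive.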
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2022-01-31 10:33 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #12 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 52317
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52317&action=edit
WIP patch

This is a patch I made while working on this problem. It improves the escape
analysis of the first patch so that it cooperates with PTA. However, because
the Fortran FE sort of lies about escapes in the presence of OpenMP, I had
to disable it for the cases that are interesting for perdida. So I will look
into it in the next stage 1.
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: cvs-commit at gcc dot gnu.org @ 2022-12-14 0:05 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #13 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Martin Jambor <jamborm@gcc.gnu.org>:

https://gcc.gnu.org/g:10478270fe0c39c59eb0f35d19356a63bdf3a2ad

commit r13-4687-g10478270fe0c39c59eb0f35d19356a63bdf3a2ad
Author: Martin Jambor <mjambor@suse.cz>
Date:   Wed Dec 14 00:33:06 2022 +0100

    ipa-sra: Treat REFERENCE_TYPES as always dereferencable

    C++ and especially Fortran pass data by references, which are not
    pointers potentially pointing anywhere and so can be assumed to be
    safely dereferencable. This patch teaches IPA-SRA to treat them as such
    and avoid the dance we do to prove that we can move loads from them to
    the caller.

    When we do not know that a dereference will happen all the time, we
    need a heuristic so that we do not force memory accesses that normally
    happen only rarely. The patch simply uses the (possibly guessed) profile
    and checks whether the (expected) number of loads is at least half of
    the function invocations - the half is now configurable with a param as
    requested by Honza.

    gcc/ChangeLog:

    2022-12-13  Martin Jambor  <mjambor@suse.cz>

            PR ipa/103585
            * params.opt (ipa-sra-deref-prob-threshold): New parameter.
            * doc/invoke.texi (ipa-sra-deref-prob-threshold): Document it.
            * ipa-sra.cc (struct gensum_param_access): New field load_count.
            (struct gensum_param_desc): New field safe_ref, adjusted comments.
            (by_ref_count): Renamed to unsafe_by_ref_count, adjusted all uses.
            (dump_gensum_access): Dump the new field.
            (dump_gensum_param_descriptor): Likewise.
            (create_parameter_descriptors): Set safe_ref field, move setting
            by_ref forward.  Only increment unsafe_by_ref_count for unsafe
            by_ref parameters.
            (allocate_access): Initialize new field.
            (mark_param_dereference): Adjust indentation.  Only add data to
            bb_dereferences for unsafe by_ref parameters.
            (scan_expr_access): For loads, accumulate BB counts.
            (dereference_probable_p): New function.
            (check_gensum_access): Fix leading comment, add parameter FUN.
            Check cumulative counts of loads for safe by_ref accesses instead
            of dereferences.
            (process_scan_results): Do not propagate dereference distances for
            safe by_ref parameters.  Pass fun to check_gensum_access.  Safe
            by_ref params do not need the postdominance check.

    gcc/testsuite/ChangeLog:

    2022-11-11  Martin Jambor  <mjambor@suse.cz>

            * g++.dg/ipa/ipa-sra-5.C: New test
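The profile check described in this commit message (loads expected in at least half of the invocations, with the fraction tunable) can be sketched as a single comparison. This is not the body of `dereference_probable_p` from ipa-sra.cc; `loads_probable` is a hypothetical helper, the default of 50 simply mirrors the "half" in the commit message, and treating a zero entry count as "not probable" is an assumption made here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when loads through a reference parameter are
// expected in at least `threshold_percent` percent of the invocations.
inline bool loads_probable(std::uint64_t load_count,
                           std::uint64_t entry_count,
                           unsigned threshold_percent = 50)
{
  // Assumption of this sketch: without any entry count, stay conservative.
  if (entry_count == 0)
    return false;
  // Compare in floating point so that very large counts cannot overflow.
  return static_cast<double>(load_count) * 100.0
         >= static_cast<double>(entry_count) * threshold_percent;
}
```

With 1000 invocations, 500 expected loads pass the default threshold while 499 do not; raising the threshold to 75 rejects both.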
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: jamborm at gcc dot gnu.org @ 2023-01-19 13:21 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #14 from Martin Jambor <jamborm at gcc dot gnu.org> ---
Honza, what remains to be done here (if anything)?
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
From: hubicka at gcc dot gnu.org @ 2023-01-29 2:23 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585

--- Comment #15 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
We get 47s runtime with -O2 -flto and 53s with -O2
-fno-inline-functions-called-once.

The call sequence is:

  <bb 81> [local count: 109362591]:
  _1656 = (unsigned long) _45;
  _1655 = _1656 + ivtmp.1182_2540;
  _229 = (double *) _1655;
  _1646 = (unsigned long) _35;
  _1645 = _1646 + ivtmp.1182_2540;
  _230 = (double *) _1645;
  _1636 = (unsigned long) _55;
  _1635 = _1636 + ivtmp.1182_2540;
  _231 = (double *) _1635;
  _1628 = (unsigned long) _17;
  _1627 = _1628 + ivtmp.1182_2540;
  _232 = (double *) _1627;
  _1618 = (unsigned long) _64;
  _1617 = _1618 + ivtmp.1182_2540;
  _233 = (double *) _1617;
  _234 = (double *) ivtmp.1181_2551;
  _235 = (double *) ivtmp.1180_2575;
  _236 = (double *) ivtmp.1178_2586;
  _2607 = yield_stress;
  perdida.constprop.isra (&dt, &lambda, &mu, _2607, &r_infinity, &b,
                          &x_infinity, &gamma, &eta,
                          &plastic_strain_threshold, _229, _230, _231, _232,
                          _236, _233, _235, _234, &failure_threshold,
                          &crack_closure_parameter);

It is not clear to me why lambda is not replaced. However, for dt it seems
to be:

! Disqualifying parameter number 0 - Dereferences in callers would happen much more frequently.

I think this is too early, since if we SRA all the way down to the original
caller we will avoid all dereferences completely.

Another place for improvement is non-LTO. Here ipa-sra disables itself since
it does not have a cost model for cloning (that could also be improved).

The situation could be improved by ipa-modref, which may optimize away
unused parts of the array descriptors. However, ipa-modref gives up because
perdida has Fortran I/O, and it then stops tracking the descriptors even if
the descriptors never escape to the I/O.

For this I need to finish the non-escaping analysis, i.e. distinguish
between arguments that do not escape in the sense that once the function
returns they are not saved in global memory, and arguments that do not
escape in the sense that they are never passed down to a callee. We may also
annotate Fortran I/O and understand what it does.

So there is still a lot to do.
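The two senses of "non-escaping" mentioned in comment #15 can be told apart with three small C++ functions. The names below are hypothetical and the code is only a sketch of the distinction, not of how ipa-modref represents it.

```cpp
#include <cassert>

// Global slot that can retain an address after a function returns.
double *saved = nullptr;

// Stand-in for an unanalyzed callee such as runtime I/O; it does not keep
// its argument.
void opaque_io(const double *) {}

// The pointer is neither retained nor passed on: it does not escape in
// either sense.
double read_only(const double *p) { return *p; }

// The pointer is passed down to a callee but nothing keeps it after return:
// it escapes only in the weaker "reaches a callee" sense.
double passes_down(const double *p)
{
  opaque_io(p);
  return *p;
}

// The pointer is stored in global memory and is still reachable after the
// function returns: it escapes in the stronger sense.
double retains(double *p)
{
  saved = p;
  return *p;
}
```

After calling `read_only` or `passes_down` the global `saved` is unchanged, while `retains` leaves it pointing at the argument, which is the case any later call may observe.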