public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well
@ 2021-12-06 16:59 hubicka at gcc dot gnu.org
2021-12-06 17:15 ` [Bug tree-optimization/103585] " hubicka at gcc dot gnu.org
` (14 more replies)
0 siblings, 15 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-06 16:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
Bug ID: 103585
Summary: fatigue2 requires inlining of peridida to work well
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---
fatigue2 regresses by 35% when comparing -Ofast -march=native (zen) with -Ofast
-march=native -fno-inline-functions-called-once.
In general, inlining functions called once is not intended to be an important
performance heuristic. Here we seem to miss some propagation that supports
better vectorization.
The important inline is:
-Inlining perdida.constprop/105 size 703.
- Called once from iztaccihuatl/17 2570 insns.
- Parm map: -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4
-Updated mod-ref summary for iztaccihuatl/17
- loads:
- Every base
- stores:
- Base 0: alias set 6
- Ref 0: alias set 6
- Every access
- Side effects
- Global memory read
- Inlined into iztaccihuatl/17 which now has 3148 size
-Overall time estimate: 270429.808105 weighted by profile: 0.000000
+Overall time estimate: 93562.854248 weighted by profile: 0.000000
So we estimate perdida at 703 instructions, which is not small, and the function
it is inlined into is already quite large (2570 instructions before inlining,
3148 after).
Before the vectorizer, perdida is still 800 lines of gimple dump (w/o
declarations), but it then simplifies to 630 lines at the last rerun of dse.
iztaccihuatl is 3323 lines of gimple dump in .optimized of the build with perdida
inlined and 2832 lines in the build without perdida inlined, a difference of 491
lines. So it indeed looks like something propagates which we do not get from
ipa-cp/modref propagation.
* [Bug tree-optimization/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
@ 2021-12-06 17:15 ` hubicka at gcc dot gnu.org
2021-12-07 9:31 ` [Bug ipa/103585] " hubicka at gcc dot gnu.org
` (13 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-06 17:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
Jan Hubicka <hubicka at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |mjambor at suse dot cz
--- Comment #1 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
It seems to me that we miss some ipa-cp propagation here. In particular there
are array descriptors:
parm.326.span = 8;
MEM <c_char[8]> [(struct dtype_type *)&parm.326 + 24B] = {};
parm.326.dtype.elem_len = 8;
parm.326.dtype.rank = 2;
parm.326.dtype.type = 3;
parm.326.dim[0].lbound = 1;
parm.326.dim[0].ubound = 3;
parm.326.dim[0].stride = 1;
parm.326.dim[1].lbound = 1;
parm.326.dim[1].ubound = 3;
parm.326.dim[1].stride = 3;
_245 = _243 * 9;
_246 = &MEM <real(kind=8)[0:]> [(real(kind=8)[0:] *)_44][_245];
parm.326.data = _246;
parm.326.offset = -4;
and the resulting jump function
ref offset 64: -4 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
ref offset 128: 8 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
ref offset 224: 2 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
ref offset 232: 3 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
ref offset 256: 8 [loc_time: 0, loc_size: 703, prop_time: 0, prop_size: 703]
ref offset 320: 1 [loc_time: 248, loc_size: 697, prop_time: 248, prop_size: 697]
ref offset 384: 1 [loc_time: 1, loc_size: 702, prop_time: 1, prop_size: 702]
ref offset 448: 3 [loc_time: 1, loc_size: 702, prop_time: 1, prop_size: 702]
ref offset 512: 3 [loc_time: 245, loc_size: 700, prop_time: 245, prop_size: 700]
ref offset 576: 1 [loc_time: 1, loc_size: 702, prop_time: 1, prop_size: 702]
ref offset 640: 3 [loc_time: 1, loc_size: 702, prop_time: 1, prop_size: 702]
Here there are 11 constants, while there are 12 constant stores above.
I guess the MEM one is missing.
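To check which descriptor fields the jump function covers, the bit offsets can be recomputed from the descriptor layout. The sketch below is only an illustration: the field order and widths are my reading of the gfortran array descriptor on an LP64 target (an assumption, not taken from this report), and the class and function names are made up for the example.

```python
import ctypes

# Assumed layout of the gfortran rank-2 array descriptor on LP64.
# Field names follow the gimple dump above; widths are an assumption.
class Dtype(ctypes.Structure):
    _fields_ = [("elem_len", ctypes.c_uint64), ("version", ctypes.c_int32),
                ("rank", ctypes.c_int8), ("type", ctypes.c_int8),
                ("attribute", ctypes.c_int16)]

class Dim(ctypes.Structure):
    _fields_ = [("stride", ctypes.c_int64), ("lbound", ctypes.c_int64),
                ("ubound", ctypes.c_int64)]

class Descriptor(ctypes.Structure):
    _fields_ = [("data", ctypes.c_uint64), ("offset", ctypes.c_int64),
                ("dtype", Dtype), ("span", ctypes.c_int64), ("dim", Dim * 2)]

def bit_offsets():
    """Map every scalar field of the descriptor to its offset in bits."""
    offs = {}
    for name, ftype in Descriptor._fields_:
        base = getattr(Descriptor, name).offset  # byte offset of the field
        if ftype is Dtype:
            for sub, _ in Dtype._fields_:
                offs[f"dtype.{sub}"] = 8 * (base + getattr(Dtype, sub).offset)
        elif name == "dim":
            for i in range(2):
                for sub, _ in Dim._fields_:
                    byte = base + i * ctypes.sizeof(Dim) + getattr(Dim, sub).offset
                    offs[f"dim[{i}].{sub}"] = 8 * byte
        else:
            offs[name] = 8 * base
    return offs

# Offsets listed in the jump function dump above.
JUMP_FUNCTION_OFFSETS = {64, 128, 224, 232, 256, 320, 384, 448, 512, 576, 640}

def fields_without_constant():
    """Fields of the assumed layout that have no entry in the jump function."""
    return sorted(n for n, off in bit_offsets().items()
                  if off not in JUMP_FUNCTION_OFFSETS)

if __name__ == "__main__":
    for name, off in bit_offsets().items():
        print(f"{off:4d}  {name}")
    print("no constant for:", fields_without_constant())
```

Under this assumed layout, the only fields without a constant are the data pointer (which is genuinely variable) and the two dtype members that are written solely by the 8-byte MEM store at +24B, which is consistent with the guess that the MEM store is the one being missed.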
Call is:
call perdida (dt, lambda, mu, yield_stress, R_infinity, b, X_infinity, &
              gamma, eta, plastic_strain_threshold, stress_tensor(:,:,n), &
              strain_tensor(:,:,n), plastic_strain_tensor(:,:,n), &
              strain_rate_tensor(:,:,n), accumulated_plastic_strain(n), &
              back_stress_tensor(:,:,n), isotropic_hardening_stress(n), &
              damage(n), failure_threshold, crack_closure_parameter)
A similar situation repeats itself multiple times.
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
2021-12-06 17:15 ` [Bug tree-optimization/103585] " hubicka at gcc dot gnu.org
@ 2021-12-07 9:31 ` hubicka at gcc dot gnu.org
2021-12-07 9:49 ` hubicka at gcc dot gnu.org
` (12 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-07 9:31 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Some basic profiling with perf shows that perdida seems to account for about 36% of
the runtime of the benchmark and has the interesting property that most of the time it
returns without looping (which probably explains PR103592). It is not clear
to me why one runs faster than the other - we do a lot of spilling in both paths.
This is comparing with -fno-tree-vectorize -fno-tree-slp-vectorize since I find
it easier to parse.
36.91% a.out-slow a.out-slow [.] __perdida_m_MOD_perdida.constprop.0
34.15% a.out-fast a.out-fast [.] MAIN__
17.96% a.out-slow a.out-slow [.] MAIN__
7.37% a.out-slow a.out-slow [.] __perdida_m_MOD_generalized_hookes_law.constprop.0.isra.0
2.43% a.out-fast a.out-fast [.] __perdida_m_MOD_generalized_hookes_law.constprop.0.isra.0
0.24% a.out-slow a.out-slow [.] __perdida_m_MOD_generalized_hookes_law.constprop.1.isra.0
0.21% a.out-fast a.out-fast [.] __perdida_m_MOD_generalized_hookes_law.constprop.1.isra.0
perdida:
│ ◆
│ 0000000000402500 <__perdida_m_MOD_perdida.constprop.0>: ▒
│ __perdida_m_MOD_perdida.constprop.0(): ▒
1.37 │ push %rbp ▒
0.00 │ vxorpd %xmm13,%xmm13,%xmm13 ▒
│ mov %rsp,%rbp ▒
0.00 │ push %r15 ▒
2.07 │ push %r14 ▒
0.01 │ mov %rdx,%r15 ▒
0.00 │ push %r13 ▒
0.00 │ push %r12 ▒
0.69 │ push %rbx ▒
0.01 │ sub $0x3c8,%rsp ▒
0.01 │ mov 0x30(%rbp),%rax ▒
0.02 │ vmovsd (%rsi),%xmm6 ▒
0.01 │ mov %rcx,-0x300(%rbp) ▒
2.01 │ vmovsd (%rdx),%xmm1 ▒
0.00 │ mov %rdi,-0x380(%rbp) ▒
0.01 │ mov %r8,-0x388(%rbp) ▒
0.04 │ lea -0x240(%rbp),%rdi ▒
0.00 │ mov %r9,-0x390(%rbp) ▒
0.66 │ mov %rsi,-0x308(%rbp) ▒
0.03 │ lea -0x2a0(%rbp),%rsi ▒
0.00 │ movq $0x1,-0x218(%rbp) ▒
0.01 │ movq $0x3,-0x200(%rbp) ▒
0.66 │ movq $0x3,-0x260(%rbp) ▒
0.02 │ mov (%rax),%rbx ▒
0.06 │ mov 0x38(%rbp),%rax ▒
0.00 │ vmovsd %xmm6,-0x3d8(%rbp) ▒
0.08 │ vmovsd %xmm1,-0x3e0(%rbp) ▒
0.65 │ mov %rbx,-0x240(%rbp) ▒
0.74 │ mov (%rax),%r14 ▒
0.67 │ mov 0x40(%rbp),%rax ▒
0.01 │ mov (%rax),%r13 ▒
0.74 │ mov 0x58(%rbp),%rax ▒
0.00 │ mov (%rax),%r12 ▒
0.70 │ mov 0x48(%rbp),%rax ▒
0.00 │ mov (%rax),%rax ▒
0.69 │ mov %rax,-0x310(%rbp) ▒
0.00 │ mov %rsp,%rax ▒
0.00 │ sub $0x50,%rsp ▒
│ mov %rsp,%rcx ▒
0.00 │ mov %rax,-0x2f8(%rbp) ▒
0.01 │ mov %rcx,-0x2a0(%rbp) ▒
0.01 │ vmovsd (%r14),%xmm0 ▒
0.68 │ vsubsd 0x0(%r13),%xmm0,%xmm0 ▒
2.85 │ vmovsd %xmm0,(%rsp) ▒
2.09 │ vmovsd 0x8(%r14),%xmm0 ▒
0.00 │ vsubsd 0x8(%r13),%xmm0,%xmm0 ▒
0.01 │ vmovsd %xmm0,0x8(%rcx) ▒
0.70 │ vmovsd 0x10(%r14),%xmm0 ▒
0.01 │ vsubsd 0x10(%r13),%xmm0,%xmm0 ▒
0.02 │ vmovsd %xmm0,0x10(%rcx) ▒
0.69 │ vmovsd 0x18(%r14),%xmm0 ▒
0.00 │ vsubsd 0x18(%r13),%xmm0,%xmm0 ▒
0.59 │ vmovsd %xmm0,0x18(%rcx) ▒
1.83 │ vmovsd 0x20(%r14),%xmm0 ▒
0.00 │ vsubsd 0x20(%r13),%xmm0,%xmm0 ▒
0.02 │ vmovsd %xmm0,0x20(%rcx) ▒
0.72 │ vmovsd 0x28(%r14),%xmm0 ◆
0.00 │ vsubsd 0x28(%r13),%xmm0,%xmm0 ▒
0.01 │ vmovsd %xmm0,0x28(%rcx) ▒
0.13 │ vmovsd 0x30(%r14),%xmm0 ▒
0.01 │ vsubsd 0x30(%r13),%xmm0,%xmm0 ▒
0.01 │ vmovsd %xmm0,0x30(%rcx) ▒
0.70 │ vmovsd 0x38(%r14),%xmm0 ▒
0.00 │ vsubsd 0x38(%r13),%xmm0,%xmm0 ▒
0.02 │ vmovsd %xmm0,0x38(%rcx) ▒
0.69 │ vmovsd 0x40(%r14),%xmm0 ▒
0.00 │ vsubsd 0x40(%r13),%xmm0,%xmm0 ▒
0.01 │ vmovsd %xmm0,0x40(%rcx) ▒
0.70 │ vmovsd %xmm6,%xmm6,%xmm0 ▒
0.00 │ → call __perdida_m_MOD_generalized_hookes_law.co▒
0.00 │ mov 0x68(%rbp),%rax ▒
0.00 │ vmovsd delim.4+0xc,%xmm1 ▒
0.00 │ mov -0x2f8(%rbp),%rsp ▒
0.01 │ mov -0x308(%rbp),%r10 ▒
0.01 │ vmovsd (%rax),%xmm4 ▒
0.01 │ vsubsd %xmm4,%xmm1,%xmm8 ▒
0.04 │ vmulsd (%rbx),%xmm8,%xmm2 ▒
1.46 │ vmulsd 0x20(%rbx),%xmm8,%xmm1 ▒
3.40 │ vcomisd %xmm13,%xmm4 ▒
0.00 │ vmulsd 0x40(%rbx),%xmm8,%xmm0 ▒
0.75 │ vmulsd 0x8(%rbx),%xmm8,%xmm9 ▒
0.11 │ vmulsd 0x10(%rbx),%xmm8,%xmm6 ▒
0.13 │ vmulsd 0x18(%rbx),%xmm8,%xmm11 ▒
0.00 │ vaddsd %xmm1,%xmm2,%xmm3 ▒
0.01 │ vmovsd %xmm2,(%rbx) ▒
0.01 │ vmulsd 0x28(%rbx),%xmm8,%xmm5 ▒
0.02 │ vmovsd %xmm1,0x20(%rbx) ▒
0.00 │ vmulsd 0x30(%rbx),%xmm8,%xmm10 ▒
0.60 │ vaddsd %xmm0,%xmm3,%xmm3 ▒
1.50 │ vmovsd %xmm9,0x8(%rbx) ▒
0.01 │ vmulsd 0x38(%rbx),%xmm8,%xmm7 ▒
0.00 │ vmovsd %xmm0,0x40(%rbx) ▒
│ vmovsd %xmm6,0x10(%rbx) ▒
0.00 │ vmovsd %xmm11,0x18(%rbx) ▒
0.01 │ vmovsd %xmm5,0x28(%rbx) ▒
0.02 │ vmovsd %xmm10,0x30(%rbx) ▒
0.02 │ vmovsd %xmm7,0x38(%rbx) ▒
0.69 │ ↓ jbe 207 ▒
0.00 │ vcomisd 0xd6e2f(%rip),%xmm3 # 4d9530 <deli▒
0.86 │ ↓ jb cf3 ▒
0.24 │207: vmovsd delim.4+0x1c,%xmm7 ▒
0.01 │ mov delim.4+0xc,%rax ▒
0.00 │ mov %rax,-0x330(%rbp) ▒
0.00 │ vmulsd %xmm7,%xmm3,%xmm6 ▒
1.78 │ vmovsd %xmm6,-0x2f8(%rbp) ▒
0.45 │229: vmovsd %xmm7,-0x3a8(%rbp) ▒
0.00 │ vmovsd -0x2f8(%rbp),%xmm7 ▒
5.12 │ vmovsd 0x8(%rbx),%xmm5 ▒
0.02 │ vmovsd 0x10(%rbx),%xmm4 ▒
0.04 │ vmovsd 0x18(%rbx),%xmm3 ▒
0.01 │ vmovsd 0x28(%rbx),%xmm15 ▒
0.03 │ vmovsd 0x38(%rbx),%xmm9 ◆
0.04 │ vmovsd 0x30(%rbx),%xmm14 ▒
0.03 │ vsubsd %xmm7,%xmm2,%xmm6 ▒
2.00 │ vmovsd delim.4+0xc,%xmm2 ▒
0.00 │ vsubsd %xmm7,%xmm1,%xmm1 ▒
0.69 │ vsubsd %xmm7,%xmm0,%xmm7 ▒
0.69 │ vmovsd %xmm5,-0x350(%rbp) ▒
0.00 │ vmovsd %xmm4,-0x358(%rbp) ▒
0.01 │ vmovsd %xmm3,-0x360(%rbp) ▒
0.01 │ vmovsd %xmm6,-0x338(%rbp) ▒
0.68 │ vmovsd %xmm1,-0x340(%rbp) ▒
0.69 │ vmovsd %xmm9,-0x378(%rbp) ▒
0.01 │ vmovsd %xmm15,-0x368(%rbp) ▒
0.02 │ vmovsd %xmm14,-0x370(%rbp) ▒
0.01 │ vmovsd %xmm7,-0x348(%rbp) ▒
0.68 │ vdivsd %xmm8,%xmm2,%xmm8 ▒
0.03 │ vmulsd %xmm5,%xmm8,%xmm11 ▒
0.03 │ vsubsd 0x8(%r12),%xmm11,%xmm0 ▒
0.01 │ vmulsd %xmm4,%xmm8,%xmm5 ▒
0.00 │ vmulsd %xmm6,%xmm8,%xmm6 ▒
0.04 │ vsubsd (%r12),%xmm6,%xmm4 ▒
0.72 │ vmulsd %xmm3,%xmm8,%xmm12 ▒
0.00 │ vmovsd %xmm5,-0x318(%rbp) ▒
0.02 │ vmulsd %xmm1,%xmm8,%xmm10 ▒
0.00 │ vsubsd 0x18(%r12),%xmm12,%xmm1 ▒
0.01 │ vmulsd %xmm15,%xmm8,%xmm3 ▒
0.00 │ vmovsd %xmm6,-0x320(%rbp) ▒
0.02 │ vmulsd %xmm0,%xmm0,%xmm0 ▒
0.10 │ vmulsd %xmm9,%xmm8,%xmm2 ▒
0.64 │ vsubsd 0x28(%r12),%xmm3,%xmm9 ▒
0.01 │ vmovsd %xmm3,-0x308(%rbp) ▒
0.01 │ vfmadd132sd %xmm4,%xmm0,%xmm4 ▒
2.70 │ vsubsd 0x10(%r12),%xmm5,%xmm0 ▒
0.00 │ vmulsd %xmm14,%xmm8,%xmm15 ▒
0.00 │ vmovsd %xmm2,-0x3a0(%rbp) ▒
0.00 │ vmulsd %xmm7,%xmm8,%xmm8 ▒
0.00 │ vmulsd %xmm1,%xmm1,%xmm1 ▒
0.01 │ vmulsd %xmm9,%xmm9,%xmm9 ▒
0.01 │ vfmadd132sd %xmm0,%xmm1,%xmm0 ▒
0.06 │ vsubsd 0x20(%r12),%xmm10,%xmm1 ▒
0.66 │ vfmadd132sd %xmm1,%xmm9,%xmm1 ▒
0.06 │ vsubsd 0x30(%r12),%xmm15,%xmm9 ▒
0.00 │ vaddsd %xmm0,%xmm1,%xmm1 ▒
2.07 │ vsubsd 0x38(%r12),%xmm2,%xmm0 ▒
0.00 │ vmulsd %xmm0,%xmm0,%xmm0 ▒
0.01 │ vfmadd231sd %xmm9,%xmm9,%xmm0 ▒
0.08 │ vsubsd 0x40(%r12),%xmm8,%xmm9 ▒
0.00 │ vfmadd132sd %xmm9,%xmm4,%xmm9 ▒
0.65 │ vaddsd %xmm9,%xmm0,%xmm0 ▒
2.04 │ vaddsd %xmm1,%xmm0,%xmm0 ▒
2.09 │ vmulsd delim.4+0x24,%xmm0,%xmm14 ▒
3.47 │ mov -0x300(%rbp),%rax ▒
0.00 │ vmovsd (%rax),%xmm9 ▒
0.00 │ mov 0x60(%rbp),%rax ▒
0.00 │ vaddsd (%rax),%xmm9,%xmm4 ▒
0.01 │ vsqrtsd %xmm14,%xmm14,%xmm14 ▒
13.94 │ vsubsd %xmm4,%xmm14,%xmm4 ▒
0.71 │ vcomisd %xmm4,%xmm13 ▒
2.88 │ vmovsd %xmm4,-0x398(%rbp) ▒
0.00 │ ↓ ja 3b0 ▒
0.16 │ vcomisd 0xd6c9a(%rip),%xmm0 # 4d9548 <deli◆
0.04 │ ↓ jae 3bf ▒
1.30 │3b0: lea -0x28(%rbp),%rsp ▒
0.00 │ pop %rbx ▒
0.00 │ pop %r12 ▒
0.00 │ pop %r13 ▒
0.00 │ pop %r14 ▒
0.00 │ pop %r15 ▒
0.02 │ pop %rbp ▒
0.01 │ ← ret ▒
inlined variant:
Percent│ vmovsd (%r14),%xmm0 ▒
0.01 │ mov %rsp,%rax ▒
0.01 │ sub $0x50,%rsp ▒
0.00 │ vmovsd -0x448(%rbp),%xmm5 ▒
0.06 │ vsubsd 0x0(%r13),%xmm0,%xmm0 ▒
0.71 │ mov %rsp,%rdx ▒
0.02 │ vmovsd -0x440(%rbp),%xmm1 ▒
0.00 │ mov %rcx,-0x508(%rbp) ▒
0.00 │ mov %r10,-0x4e8(%rbp) ▒
0.01 │ mov %rax,-0x4e0(%rbp) ◆
0.00 │ lea -0x250(%rbp),%rsi ▒
│ lea -0x2b0(%rbp),%rdi ▒
0.05 │ mov %rdx,-0x250(%rbp) ▒
0.67 │ movq $0x1,-0x288(%rbp) ▒
0.02 │ vxorpd %xmm13,%xmm13,%xmm13 ▒
0.00 │ movq $0x3,-0x270(%rbp) ▒
0.00 │ mov %r12,-0x2b0(%rbp) ▒
0.01 │ movq $0x3,-0x210(%rbp) ▒
0.00 │ vmovsd %xmm5,-0x618(%rbp) ▒
0.00 │ vmovsd %xmm1,-0x620(%rbp) ▒
0.06 │ vmovsd %xmm0,(%rsp) ▒
0.69 │ vmovsd 0x8(%r14),%xmm0 ▒
0.03 │ vsubsd 0x8(%r13),%xmm0,%xmm0 ▒
0.01 │ vmovsd %xmm0,0x8(%rdx) ▒
0.01 │ vmovsd 0x10(%r14),%xmm0 ▒
0.02 │ vsubsd 0x10(%r13),%xmm0,%xmm0 ▒
0.00 │ vmovsd %xmm0,0x10(%rdx) ▒
0.01 │ vmovsd 0x18(%r14),%xmm0 ▒
0.06 │ vsubsd 0x18(%r13),%xmm0,%xmm0 ▒
0.69 │ vmovsd %xmm0,0x18(%rdx) ▒
0.03 │ vmovsd 0x20(%r14),%xmm0 ▒
0.01 │ vsubsd 0x20(%r13),%xmm0,%xmm0 ▒
0.01 │ vmovsd %xmm0,0x20(%rdx) ▒
0.02 │ vmovsd 0x28(%r14),%xmm0 ▒
0.00 │ vsubsd 0x28(%r13),%xmm0,%xmm0 ▒
0.00 │ vmovsd %xmm0,0x28(%rdx) ▒
0.06 │ vmovsd 0x30(%r14),%xmm0 ▒
0.67 │ vsubsd 0x30(%r13),%xmm0,%xmm0 ▒
0.03 │ vmovsd %xmm0,0x30(%rdx) ▒
0.01 │ vmovsd 0x38(%r14),%xmm0 ▒
0.00 │ vsubsd 0x38(%r13),%xmm0,%xmm0 ▒
0.02 │ vmovsd %xmm0,0x38(%rdx) ▒
0.01 │ vmovsd 0x40(%r14),%xmm0 ▒
0.00 │ vsubsd 0x40(%r13),%xmm0,%xmm0 ▒
0.06 │ vmovsd %xmm0,0x40(%rdx) ▒
0.68 │ vmovsd %xmm5,%xmm5,%xmm0 ▒
0.01 │ → call __perdida_m_MOD_generalized_hookes_law.c▒
0.00 │ mov -0x510(%rbp),%rax ▒
0.00 │ vmovsd delim.4+0x1c,%xmm6 ▒
0.00 │ mov -0x4e0(%rbp),%rsp ▒
0.05 │ mov -0x4e8(%rbp),%r10 ▒
0.01 │ mov -0x508(%rbp),%rcx ▒
0.07 │ vmovsd -0x10(%rax,%r15,8),%xmm4 ▒
0.64 │ vsubsd %xmm4,%xmm6,%xmm8 ▒
0.06 │ vmulsd 0x8(%r12),%xmm8,%xmm6 ▒
0.40 │ vmulsd 0x10(%r12),%xmm8,%xmm7 ▒
0.25 │ vcomisd %xmm13,%xmm4 ▒
0.02 │ vmovsd %xmm4,-0x570(%rbp) ▒
0.00 │ vmulsd 0x18(%r12),%xmm8,%xmm5 ▒
0.10 │ vmulsd (%r12),%xmm8,%xmm2 ▒
0.54 │ vmulsd 0x20(%r12),%xmm8,%xmm1 ▒
0.85 │ vmulsd 0x40(%r12),%xmm8,%xmm0 ▒
0.44 │ vmovsd %xmm6,-0x4f0(%rbp) ▒
0.04 │ vmovsd %xmm6,0x8(%r12) ▒
0.10 │ vmovsd %xmm7,-0x4f8(%rbp) ▒
0.01 │ vmovsd %xmm7,0x10(%r12) ◆
0.14 │ vmulsd 0x28(%r12),%xmm8,%xmm6 ▒
0.19 │ vmulsd 0x38(%r12),%xmm8,%xmm7 ▒
0.25 │ vmovsd %xmm5,-0x500(%rbp) ▒
0.03 │ vmovsd %xmm5,0x18(%r12) ▒
0.37 │ vmovsd %xmm2,(%r12) ▒
0.03 │ vmulsd 0x30(%r12),%xmm8,%xmm5 ▒
0.09 │ vaddsd %xmm1,%xmm2,%xmm3 ▒
0.00 │ vmovsd %xmm1,0x20(%r12) ▒
0.19 │ vmovsd %xmm0,0x40(%r12) ▒
0.16 │ vaddsd %xmm0,%xmm3,%xmm3 ▒
0.76 │ vmovsd %xmm6,0x28(%r12) ▒
0.09 │ vmovsd %xmm7,0x38(%r12) ▒
0.10 │ vmovsd %xmm5,0x30(%r12) ▒
0.10 │ ↓ jbe b51 ▒
0.01 │ vcomisd 0xd17ed(%rip),%xmm3 # 4d8578 <del▒
0.55 │ ↓ jb 1a11 ▒
0.23 │ b51: vmovsd delim.4+0x64,%xmm4 ▒
0.01 │ vmulsd %xmm4,%xmm3,%xmm3 ▒
1.38 │ vmovsd %xmm4,-0x5d8(%rbp) ▒
0.01 │ vmovsd %xmm3,-0x508(%rbp) ▒
0.48 │ b6d: vsubsd %xmm3,%xmm2,%xmm4 ▒
0.02 │ vsubsd %xmm3,%xmm1,%xmm2 ▒
0.57 │ vmovsd %xmm5,-0x600(%rbp) ▒
0.15 │ vmovsd %xmm6,-0x608(%rbp) ▒
0.18 │ vsubsd %xmm3,%xmm0,%xmm1 ▒
0.52 │ vmovsd %xmm7,-0x5f8(%rbp) ▒
0.30 │ vmovsd delim.4+0x1c,%xmm3 ▒
0.09 │ vmovsd %xmm4,-0x578(%rbp) ▒
0.95 │ vmovsd %xmm2,-0x580(%rbp) ▒
0.73 │ vmovsd %xmm1,-0x588(%rbp) ▒
0.55 │ vdivsd %xmm8,%xmm3,%xmm0 ▒
5.95 │ vmulsd -0x4f0(%rbp),%xmm0,%xmm13 ▒
2.14 │ vmulsd -0x4f8(%rbp),%xmm0,%xmm12 ▒
0.65 │ vmulsd %xmm0,%xmm5,%xmm10 ▒
0.09 │ vmulsd %xmm6,%xmm0,%xmm3 ▒
0.01 │ vmovsd 0x8(%rbx),%xmm5 ▒
0.01 │ vmulsd %xmm7,%xmm0,%xmm6 ▒
0.68 │ vmulsd %xmm1,%xmm0,%xmm1 ▒
0.15 │ vmovsd (%rbx),%xmm7 ▒
0.00 │ vmovsd %xmm3,-0x5e0(%rbp) ▒
0.01 │ vmulsd %xmm4,%xmm0,%xmm9 ▒
0.69 │ vmulsd %xmm2,%xmm0,%xmm14 ▒
0.10 │ vmovsd %xmm1,-0x4e8(%rbp) ▒
0.05 │ vmovsd %xmm6,-0x4e0(%rbp) ▒
0.62 │ vmovsd %xmm5,-0x520(%rbp) ▒
0.05 │ vmovsd 0x10(%rbx),%xmm6 ▒
0.00 │ vmulsd -0x500(%rbp),%xmm0,%xmm4 ▒
0.11 │ vsubsd %xmm5,%xmm13,%xmm0 ▒
0.01 │ vsubsd %xmm7,%xmm9,%xmm2 ▒
0.09 │ vmovsd %xmm9,-0x5c0(%rbp) ▒
0.70 │ vmovsd %xmm7,-0x518(%rbp) ▒
0.06 │ vmovsd %xmm14,-0x5c8(%rbp) ▒
0.64 │ vmulsd %xmm0,%xmm0,%xmm1 ▒
0.06 │ vmovsd %xmm0,-0x5a8(%rbp) ▒
0.71 │ vmovsd %xmm2,-0x5a0(%rbp) ▒
0.72 │ vfmadd231sd %xmm2,%xmm2,%xmm1 ◆
0.81 │ vmovsd 0x18(%rbx),%xmm2 ▒
0.01 │ vsubsd %xmm6,%xmm12,%xmm15 ▒
0.00 │ vmovsd %xmm6,-0x528(%rbp) ▒
0.05 │ vmovsd %xmm15,-0x5b0(%rbp) ▒
0.07 │ vmovsd %xmm4,-0x5f0(%rbp) ▒
0.63 │ vmovsd %xmm2,-0x530(%rbp) ▒
0.07 │ vsubsd %xmm2,%xmm4,%xmm2 ▒
0.01 │ vmovsd 0x20(%rbx),%xmm4 ▒
0.10 │ vmulsd %xmm2,%xmm2,%xmm0 ▒
0.06 │ vmovsd %xmm2,-0x5e8(%rbp) ▒
0.62 │ vfmadd231sd %xmm15,%xmm15,%xmm0 ▒
1.14 │ vmovsd %xmm4,-0x538(%rbp) ▒
0.07 │ vsubsd %xmm4,%xmm14,%xmm14 ▒
0.00 │ vaddsd %xmm0,%xmm1,%xmm1 ▒
2.07 │ vmovsd 0x28(%rbx),%xmm0 ▒
0.00 │ vsubsd %xmm0,%xmm3,%xmm15 ▒
0.00 │ vmovsd %xmm0,-0x540(%rbp) ▒
0.04 │ vmovsd 0x30(%rbx),%xmm3 ▒
0.00 │ vmovsd -0x4e8(%rbp),%xmm4 ▒
0.01 │ vmovsd %xmm10,-0x5d0(%rbp) ▒
0.05 │ vmovsd 0x38(%rbx),%xmm9 ▒
0.00 │ vmovsd -0x4e0(%rbp),%xmm7 ▒
0.69 │ vmulsd %xmm15,%xmm15,%xmm0 ▒
0.00 │ mov -0x568(%rbp),%rax ▒
0.00 │ vfmadd231sd %xmm14,%xmm14,%xmm0 ▒
0.02 │ vsubsd %xmm3,%xmm10,%xmm11 ▒
0.00 │ vmovsd 0x40(%rbx),%xmm10 ▒
0.00 │ vmovsd %xmm3,-0x548(%rbp) ▒
0.05 │ vmovsd %xmm9,-0x550(%rbp) ▒
0.06 │ vsubsd %xmm9,%xmm7,%xmm3 ▒
0.68 │ vfmadd231sd %xmm11,%xmm11,%xmm0 ▒
0.19 │ vmovsd %xmm3,-0x5b8(%rbp) ▒
0.06 │ vmovsd %xmm10,-0x558(%rbp) ▒
0.06 │ vsubsd %xmm10,%xmm4,%xmm10 ▒
0.00 │ vxorpd %xmm4,%xmm4,%xmm4 ▒
0.00 │ vaddsd %xmm0,%xmm1,%xmm1 ▒
0.91 │ vmulsd %xmm10,%xmm10,%xmm0 ▒
0.02 │ vfmadd231sd %xmm3,%xmm3,%xmm0 ▒
0.78 │ vaddsd %xmm1,%xmm0,%xmm0 ▒
2.21 │ vmovsd -0x10(%rax,%r15,8),%xmm1 ▒
0.01 │ vmulsd delim.4+0x6c,%xmm0,%xmm9 ▒
3.77 │ vmovsd %xmm1,-0x590(%rbp) ▒
0.00 │ vaddsd -0x3f8(%rbp),%xmm1,%xmm1 ▒
0.00 │ vsqrtsd %xmm9,%xmm9,%xmm9 ▒
15.05 │ vsubsd %xmm1,%xmm9,%xmm5 ▒
0.81 │ vmovsd %xmm1,-0x560(%rbp) ▒
0.00 │ vcomisd %xmm5,%xmm4 ▒
3.48 │ vmovsd %xmm5,-0x598(%rbp) ▒
0.00 │ ↑ ja 910 ▒
0.17 │ vcomisd 0xd15d4(%rip),%xmm0 # 4d8590 <del▒
0.07 │ ↑ jb 910 ▒
This looks very similar to me except for the bookkeeping moves.
With perf stat I get:
Performance counter stats for './a.out-slow':
37,676.01 msec task-clock # 1.000 CPUs utilized
113 context-switches # 2.999 /sec
24 cpu-migrations # 0.637 /sec
70 page-faults # 1.858 /sec
142,922,291,843 cycles # 3.793 GHz
128,124,341 stalled-cycles-frontend # 0.09% frontend cycles idle
110,415,979,659 stalled-cycles-backend # 77.26% backend cycles idle
260,215,601,013 instructions # 1.82 insn per cycle
# 0.42 stalled cycles per insn
9,809,351,864 branches # 260.361 M/sec
23,585,751 branch-misses # 0.24% of all branches
37.682154164 seconds time elapsed
37.677998000 seconds user
0.000000000 seconds sys
Performance counter stats for './a.out-fast':
23,892.07 msec task-clock # 1.000 CPUs utilized
36 context-switches # 1.507 /sec
13 cpu-migrations # 0.544 /sec
70 page-faults # 2.930 /sec
84,630,483,861 cycles # 3.542 GHz
105,428,600 stalled-cycles-frontend # 0.12% frontend cycles idle
59,095,901,330 stalled-cycles-backend # 69.83% backend cycles idle
191,156,427,437 instructions # 2.26 insn per cycle
# 0.31 stalled cycles per insn
8,612,504,734 branches # 360.475 M/sec
23,436,092 branch-misses # 0.27% of all branches
23.902460280 seconds time elapsed
23.885982000 seconds user
0.008000000 seconds sys
So the slow build executes about 69 billion more instructions (roughly 36% more
than the fast build), which I could imagine to be mostly the moves.
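The ratios between the two perf stat runs can be recomputed directly from the counters quoted above. This is just a small sanity-check sketch; the dictionary keys and helper names are made up for the example, and only the numbers come from the perf output.

```python
# Counters copied from the two perf stat runs above.
SLOW = {"instructions": 260_215_601_013, "cycles": 142_922_291_843}
FAST = {"instructions": 191_156_427_437, "cycles": 84_630_483_861}

def extra(metric):
    """Absolute difference slow - fast for one counter."""
    return SLOW[metric] - FAST[metric]

def relative_increase(metric):
    """How much larger the slow run is, relative to the fast run."""
    return extra(metric) / FAST[metric]

def ipc(run):
    """Instructions per cycle for one run."""
    return run["instructions"] / run["cycles"]

if __name__ == "__main__":
    print("extra instructions:", extra("instructions"))
    print(f"instructions: +{relative_increase('instructions'):.1%}")
    print(f"cycles:       +{relative_increase('cycles'):.1%}")
    print(f"IPC slow/fast: {ipc(SLOW):.2f} / {ipc(FAST):.2f}")
```

The cycle count grows considerably more than the instruction count, which matches the lower IPC and the higher share of backend stalls reported for the slow binary.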
Perhaps perdida is simply very expensive to call given the job it does:
0.00 │16d0: mov 0x38(%rsp),%rax ▒
0.01 │ mov %r10,0x18(%rsp) ▒
0.00 │ mov %r11,0x2f8(%rsp) ▒
0.06 │ mov %r11,0x358(%rsp) ▒
0.03 │ mov %r11,0x3b8(%rsp) ▒
1.37 │ movq $0xfffffffffffffffc,0x2e8(%rsp) ▒
0.01 │ movq $0x8,0x2f0(%rsp) ▒
0.01 │ movq $0x8,0x300(%rsp) ◆
0.01 │ movq $0x1,0x308(%rsp) ▒
0.02 │ add %rbp,%rax ▒
0.00 │ movq $0x1,0x310(%rsp) ▒
0.05 │ movq $0x3,0x318(%rsp) ▒
0.03 │ mov %rax,0x2e0(%rsp) ▒
1.34 │ movq $0x3,0x320(%rsp) ▒
0.03 │ movq $0x1,0x328(%rsp) ▒
0.01 │ movq $0x3,0x330(%rsp) ▒
0.03 │ movq $0xfffffffffffffffc,0x348(%rsp) ▒
0.01 │ movq $0x8,0x350(%rsp) ▒
0.02 │ movq $0x8,0x360(%rsp) ▒
0.05 │ movq $0x1,0x368(%rsp) ▒
0.04 │ movq $0x1,0x370(%rsp) ▒
1.35 │ movq $0x3,0x378(%rsp) ▒
0.03 │ movq $0x3,0x380(%rsp) ▒
0.01 │ movq $0x1,0x388(%rsp) ▒
0.26 │ movq $0x3,0x390(%rsp) ▒
0.02 │ movq $0xfffffffffffffffc,0x3a8(%rsp) ▒
0.26 │ movq $0x8,0x3b0(%rsp) ▒
0.05 │ movq $0x8,0x3c0(%rsp) ▒
0.09 │ movq $0x1,0x3c8(%rsp) ▒
1.34 │ mov 0x28(%rsp),%rax ▒
0.02 │ add %rbp,%rax ▒
0.01 │ mov %rax,0x340(%rsp) ▒
0.02 │ mov 0x40(%rsp),%rax ▒
0.01 │ add %rbp,%rax ▒
0.00 │ mov %rax,0x3a0(%rsp) ▒
0.09 │ movq $0x1,0x3d0(%rsp) ▒
0.04 │ mov 0x30(%rsp),%rax ▒
1.35 │ mov %r11,0x418(%rsp) ▒
0.03 │ mov %r11,0x478(%rsp) ▒
0.02 │ movq $0x3,0x3d8(%rsp) ▒
0.04 │ movq $0x3,0x3e0(%rsp) ▒
0.01 │ movq $0x1,0x3e8(%rsp) ▒
0.07 │ movq $0x3,0x3f0(%rsp) ▒
0.06 │ add %rbp,%rax ▒
0.02 │ movq $0xfffffffffffffffc,0x408(%rsp) ▒
1.36 │ movq $0x8,0x410(%rsp) ▒
0.48 │ mov %rax,0x400(%rsp) ▒
0.01 │ movq $0x8,0x420(%rsp) ▒
0.24 │ movq $0x1,0x428(%rsp) ▒
0.01 │ movq $0x1,0x430(%rsp) ▒
0.24 │ movq $0x3,0x438(%rsp) ▒
0.01 │ movq $0x3,0x440(%rsp) ▒
0.28 │ movq $0x1,0x448(%rsp) ▒
1.32 │ movq $0x3,0x450(%rsp) ▒
13.32 │ movq $0xfffffffffffffffc,0x468(%rsp) ▒
0.01 │ movq $0x8,0x470(%rsp) ▒
5.68 │ movq $0x8,0x480(%rsp) ▒
0.01 │ movq $0x1,0x488(%rsp) ▒
2.87 │ movq $0x1,0x490(%rsp) ▒
0.06 │ movq $0x3,0x498(%rsp) ▒
2.82 │ movq $0x3,0x4a0(%rsp) ▒
0.02 │ movq $0x1,0x4a8(%rsp) ▒
2.87 │ movq $0x3,0x4b0(%rsp) ▒
0.02 │ mov 0x48(%rsp),%rax ◆
1.41 │ add %rbp,%rax ▒
0.02 │ add $0x48,%rbp ▒
0.00 │ mov %rax,0x460(%rsp) ▒
1.43 │ lea 0x170(%rsp),%rax ▒
0.00 │ push %rax ▒
1.42 │ lea 0x190(%rsp),%rax ▒
0.02 │ push %rax ▒
4.28 │ push %r13 ▒
0.01 │ add $0x8,%r13 ▒
0.00 │ push %r14 ▒
4.27 │ add $0x8,%r14 ▒
│ lea 0x480(%rsp),%rax ▒
0.01 │ push %rax ▒
0.03 │ push %r12 ▒
0.00 │ add $0x8,%r12 ▒
0.00 │ lea 0x430(%rsp),%rax ▒
1.42 │ push %rax ▒
2.77 │ lea 0x3d8(%rsp),%rax ▒
0.01 │ push %rax ▒
1.43 │ lea 0x380(%rsp),%rax ▒
0.01 │ push %rax ▒
1.47 │ lea 0x328(%rsp),%rax ▒
0.00 │ push %rax ▒
4.16 │ lea 0x1f8(%rsp),%rax ▒
0.01 │ push %rax ▒
2.86 │ lea 0x1d8(%rsp),%rax ▒
0.00 │ push %rax ▒
1.56 │ lea 0x1f0(%rsp),%rax ▒
0.01 │ push %rax ▒
5.42 │ lea 0x248(%rsp),%rax ▒
0.01 │ push %rax ▒
1.43 │ lea 0x1c8(%rsp),%r9 ▒
0.00 │ lea 0x220(%rsp),%r8 ▒
0.00 │ lea 0x210(%rsp),%rdx ▒
0.00 │ lea 0x208(%rsp),%rsi ▒
0.01 │ lea 0x1e8(%rsp),%rdi ▒
0.03 │ → call __perdida_m_MOD_perdida.constprop.0 ▒
│ lea 0x258(%rsp),%rcx ▒
We do constant propagate the array descriptors (except for the one variable entry),
but we do not optimize out the code initializing them.
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
2021-12-06 17:15 ` [Bug tree-optimization/103585] " hubicka at gcc dot gnu.org
2021-12-07 9:31 ` [Bug ipa/103585] " hubicka at gcc dot gnu.org
@ 2021-12-07 9:49 ` hubicka at gcc dot gnu.org
2021-12-07 10:10 ` hubicka at gcc dot gnu.org
` (11 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-07 9:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
Jan Hubicka <hubicka at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Ever confirmed|0 |1
Last reconfirmed| |2021-12-07
Status|UNCONFIRMED |NEW
--- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
perdida modref summary is:
loads:
Every base
stores:
Every base
Side effects
Nondeterministic
Global memory read
Global memory written
parm 0 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
parm 1 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
parm 2 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
parm 3 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
parm 4 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
parm 5 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
parm 6 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
parm 7 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
parm 8 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
parm 9 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape no_indirect_read
parm 10 flags: no_direct_clobber no_direct_escape no_indirect_escape
parm 11 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
parm 12 flags: no_direct_clobber no_direct_escape no_indirect_escape
parm 13 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
parm 14 flags: no_direct_escape no_indirect_escape
parm 15 flags: no_direct_clobber no_direct_escape no_indirect_escape
parm 16 flags: no_direct_escape no_indirect_escape
parm 17 flags: no_direct_escape no_indirect_escape
parm 18 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape no_indirect_read
parm 19 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape
__attribute__((fn spec (". r r r r r r r r r r w r w r w w w w r r ")))
So we cannot use it to optimize out the dead parameters. The reason is that
it contains calls to
dt_parm.237.common.filename = &"fatigue2.f90"[1]{lb: 1 sz: 1};
dt_parm.237.common.line = 1010;
MEM <vector(2) integer(kind=4)> [(integer(kind=4) *)&dt_parm.237] = { 128, 6 };
_gfortran_st_write (&dt_parm.237);
_gfortran_transfer_character_write (&dt_parm.237, &"discriminant is negative in perdida, abort."[1]{lb: 1 sz: 1}, 43);
_gfortran_st_write_done (&dt_parm.237);
dt_parm.237 ={v} {CLOBBER};
_gfortran_stop_string (0B, 0, 0);
and Fortran I/O has no useful fnspec on it, so we think it accesses everything.
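The fn spec string in the modref summary above can be split per parameter to see which arguments may be written. The sketch below assumes the usual encoding as I understand it (two characters for the return value and function, then two characters per parameter, with a leading 'r' meaning read-only and 'w' meaning possibly read and written); the helper names are hypothetical, and anything that is not 'r'/'R' is conservatively treated as possibly written.

```python
# fn spec string copied from the modref dump above.
FNSPEC = ". r r r r r r r r r r w r w r w w w w r r "

def split_fnspec(spec):
    """Return (return/function part, list of 2-character parameter codes)."""
    if len(spec) % 2:
        raise ValueError("fn spec is expected to consist of 2-character groups")
    return spec[:2], [spec[i:i + 2] for i in range(2, len(spec), 2)]

def possibly_written(spec):
    """Indices of parameters whose pointed-to memory may be written.

    Conservative assumption: only 'r'/'R' codes are treated as read-only.
    """
    _, params = split_fnspec(spec)
    return [i for i, code in enumerate(params) if code[0] not in "rR"]

if __name__ == "__main__":
    ret, params = split_fnspec(FNSPEC)
    print(len(params), "parameters; possibly written:", possibly_written(FNSPEC))
```

The parameters this reports as possibly written are exactly the ones whose flags in the summary lack no_indirect_clobber.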
To optimize out the initializer we could (I guess)
1) teach ipa-cp to turn the one remaining field into a scalar; then we will
detect the descriptor as non-escaping from MAIN and optimize it out
2) teach the PTA summary of ipa-modref to be expressive enough for PTA to work
this out.
-fipa-pta does not help (but I think it could if it worked well). We get:
<bb 69> [local count: 109362591]:
_1895 = (unsigned long) _44;
_1894 = _1895 + ivtmp.1050_1911;
_246 = (real(kind=8) *) _1894;
parm.326.data = _246;
parm.326.offset = -4;
parm.326.dtype.elem_len = 8;
MEM <unsigned long> [(void *)&parm.326 + 24B] = 3307124817920;
parm.326.span = 8;
parm.326.dim[0].stride = 1;
parm.326.dim[0].lbound = 1;
parm.326.dim[0].ubound = 3;
parm.326.dim[1].stride = 3;
parm.326.dim[1].lbound = 1;
parm.326.dim[1].ubound = 3;
_1883 = (unsigned long) _33;
_1882 = _1883 + ivtmp.1050_1911;
_247 = (real(kind=8) *) _1882;
parm.327.data = _247;
parm.327.offset = -4;
parm.327.dtype.elem_len = 8;
MEM <unsigned long> [(void *)&parm.327 + 24B] = 3307124817920;
parm.327.span = 8;
parm.327.dim[0].stride = 1;
parm.327.dim[0].lbound = 1;
parm.327.dim[0].ubound = 3;
parm.327.dim[1].stride = 3;
parm.327.dim[1].lbound = 1;
parm.327.dim[1].ubound = 3;
_1881 = (unsigned long) _56;
_1880 = _1881 + ivtmp.1050_1911;
_248 = (real(kind=8) *) _1880;
parm.328.data = _248;
parm.328.offset = -4;
parm.328.dtype.elem_len = 8;
MEM <unsigned long> [(void *)&parm.328 + 24B] = 3307124817920;
parm.328.span = 8;
parm.328.dim[0].stride = 1;
parm.328.dim[0].lbound = 1;
parm.328.dim[0].ubound = 3;
parm.328.dim[1].stride = 3;
parm.328.dim[1].lbound = 1;
parm.328.dim[1].ubound = 3;
_1879 = (unsigned long) _20;
_1878 = _1879 + ivtmp.1050_1911;
_249 = (real(kind=8) *) _1878;
parm.329.data = _249;
parm.329.offset = -4;
parm.329.dtype.elem_len = 8;
MEM <unsigned long> [(void *)&parm.329 + 24B] = 3307124817920;
parm.329.span = 8;
parm.329.dim[0].stride = 1;
parm.329.dim[0].lbound = 1;
parm.329.dim[0].ubound = 3;
parm.329.dim[1].stride = 3;
parm.329.dim[1].lbound = 1;
parm.329.dim[1].ubound = 3;
_1867 = (unsigned long) _67;
_1866 = _1867 + ivtmp.1050_1911;
_250 = (real(kind=8) *) _1866;
parm.330.data = _250;
parm.330.offset = -4;
parm.330.dtype.elem_len = 8;
MEM <unsigned long> [(void *)&parm.330 + 24B] = 3307124817920;
parm.330.span = 8;
parm.330.dim[0].stride = 1;
parm.330.dim[0].lbound = 1;
parm.330.dim[0].ubound = 3;
parm.330.dim[1].stride = 3;
parm.330.dim[1].lbound = 1;
parm.330.dim[1].ubound = 3;
_251 = (real(kind=8) *) ivtmp.1049_1915;
_252 = (real(kind=8) *) ivtmp.1048_1925;
_253 = (real(kind=8) *) ivtmp.1047_1936;
perdida.constprop (&dt, &lambda, &mu, &yield_stress, &r_infinity, &b,
&x_infinity, &gamma, &eta, &plastic_strain_threshold, &parm.326, &parm.327,
&parm.328, &parm.329, _253, &parm.330, _252, _251, &failure_threshold,
&crack_closure_parameter);
parm.326 ={v} {CLOBBER};
parm.327 ={v} {CLOBBER};
parm.328 ={v} {CLOBBER};
parm.329 ={v} {CLOBBER};
parm.330 ={v} {CLOBBER};
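The 8-byte constant stored at offset 24B of each descriptor in the dump above (3307124817920) is a merged store covering several small dtype members. A minimal sketch of decoding it follows, assuming a little-endian target and that the four members at that offset are a 32-bit version, 8-bit rank, 8-bit type and 16-bit attribute (this layout is an assumption about the gfortran descriptor; the function name is hypothetical).

```python
import struct

def decode_dtype_tail(value):
    """Split the 64-bit store at descriptor offset 24B into its assumed members."""
    raw = struct.pack("<Q", value)  # the 8 bytes as laid out in memory
    version, rank, type_code, attribute = struct.unpack("<ibbh", raw)
    return {"version": version, "rank": rank,
            "type": type_code, "attribute": attribute}

if __name__ == "__main__":
    print(hex(3307124817920))
    print(decode_dtype_tail(3307124817920))
```

With that layout the constant decodes to rank 2 and type 3, the same values that the earlier dump sets through separate stores to dtype.rank and dtype.type, with the remaining members zero.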
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
` (2 preceding siblings ...)
2021-12-07 9:49 ` hubicka at gcc dot gnu.org
@ 2021-12-07 10:10 ` hubicka at gcc dot gnu.org
2021-12-07 10:17 ` hubicka at gcc dot gnu.org
` (10 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-07 10:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
--- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 51940
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51940&action=edit
gcov report. perdida does have some loops in it, but it seems we optimize them
out/make them relatively cheap.
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
` (3 preceding siblings ...)
2021-12-07 10:10 ` hubicka at gcc dot gnu.org
@ 2021-12-07 10:17 ` hubicka at gcc dot gnu.org
2021-12-07 10:24 ` marxin at gcc dot gnu.org
` (9 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-07 10:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
--- Comment #5 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Quick counting from gcov data: there are 585898984 calls and, from the disassembly,
the call sequence is 101 instructions, so it accounts for about 59175797384 instructions.
The perf report shows a difference of 69059173576, so at least instruction-count-wise
most of it is the call overhead of perdida.
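The estimate above is simple arithmetic and can be rechecked as follows. The call count and the per-call instruction count are the figures quoted in this comment, the instruction totals are from the earlier perf stat runs, and the variable and function names are only illustrative.

```python
CALLS = 585_898_984          # perdida calls, from the gcov data
INSNS_PER_CALL = 101         # instructions in the call sequence, from the disassembly
SLOW_INSNS = 260_215_601_013 # perf stat, a.out-slow
FAST_INSNS = 191_156_427_437 # perf stat, a.out-fast

def call_overhead():
    """Instructions spent setting up and making the perdida calls."""
    return CALLS * INSNS_PER_CALL

def share_of_difference():
    """Fraction of the slow/fast instruction gap explained by call overhead."""
    return call_overhead() / (SLOW_INSNS - FAST_INSNS)

if __name__ == "__main__":
    print("call overhead:", call_overhead())
    print("instruction gap:", SLOW_INSNS - FAST_INSNS)
    print(f"share explained: {share_of_difference():.1%}")
```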
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
` (4 preceding siblings ...)
2021-12-07 10:17 ` hubicka at gcc dot gnu.org
@ 2021-12-07 10:24 ` marxin at gcc dot gnu.org
2021-12-08 18:24 ` hubicka at gcc dot gnu.org
` (8 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-12-07 10:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
--- Comment #6 from Martin Liška <marxin at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #4)
> Created attachment 51940 [details]
> gcov report. perdida does have some loops in it but it seems we optimize
> them out or make them relatively cheap.
HTML colored version of the report:
https://splichal.eu/tmp/fatique.gcov.html
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
` (5 preceding siblings ...)
2021-12-07 10:24 ` marxin at gcc dot gnu.org
@ 2021-12-08 18:24 ` hubicka at gcc dot gnu.org
2021-12-08 18:28 ` hubicka at gcc dot gnu.org
` (7 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-08 18:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 51952
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51952&action=edit
Patch to teach modref about global memory
This patch extends modref so we eliminate the dead part of array descriptors.
This is done by distinguishing accesses to global memory from accesses to
unknown memory. The former is less evil since it cannot touch things that do
not escape, and it is possible to treat unanalyzed functions as accessing
only global memory (since everything reaching them escapes).
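To make the distinction concrete, here is a small C++ analogue (the benchmark itself is Fortran, and every name below is invented for illustration). The local `Descriptor` never escapes: the callee reads one field and otherwise only calls an error routine that touches global state. If that error routine is modelled as "accesses global memory" rather than "accesses anything", the stores to the unread fields are dead. Whether a given GCC version actually removes them depends on the version and flags; the sketch only illustrates the reasoning.

```cpp
#include <cassert>
#include <cstdio>

// Stand-in for a Fortran array descriptor (hypothetical layout).
struct Descriptor {
  double *data;
  long offset;
  long stride;
};

static int g_errors = 0;  // global memory, reachable from anywhere

// Stand-in for a runtime routine the optimizer cannot analyze. It can only
// reach memory that has escaped, such as g_errors.
__attribute__((noinline)) void report_error(const char *msg) {
  ++g_errors;
  std::fputs(msg, stderr);
}

// Reads only d->data. The descriptor itself is never passed to report_error.
__attribute__((noinline)) double callee(const Descriptor *d) {
  assert(d != nullptr);
  if (d->data == nullptr) {
    report_error("null data\n");
    return 0.0;
  }
  return d->data[0];
}

double caller(double *p) {
  Descriptor d;
  d.data = p;
  d.offset = -1;  // never read by callee: a candidate dead store
  d.stride = 1;   // never read by callee: a candidate dead store
  return callee(&d);
}
```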
With the patch I now get
Performance counter stats for './a.out-badmod':
27,448.82 msec task-clock # 0.999 CPUs utilized
334 context-switches # 12.168 /sec
10 cpu-migrations # 0.364 /sec
70 page-faults # 2.550 /sec
109,079,862,444 cycles # 3.974 GHz
130,234,675 stalled-cycles-frontend # 0.12% frontend cycles idle
80,559,978,185 stalled-cycles-backend # 73.85% backend cycles idle
230,337,342,608 instructions # 2.11 insn per cycle
# 0.35 stalled cycles per insn
9,808,852,327 branches # 357.351 M/sec
24,742,926 branch-misses # 0.25% of all branches
27.468971377 seconds time elapsed
27.445934000 seconds user
0.003999000 seconds sys
So the 56% slowdown is now only 22%, which is quite a nice improvement.
The call of perdida now looks as follows:
<bb 70> [local count: 109362591]:
_818 = (unsigned long) _44;
_817 = _818 + ivtmp.1050_1737;
_246 = (real(kind=8) *) _817;
parm.326.data = _246;
_811 = (unsigned long) _33;
_810 = _811 + ivtmp.1050_1737;
_247 = (real(kind=8) *) _810;
parm.327.data = _247;
_712 = (unsigned long) _56;
_711 = _712 + ivtmp.1050_1737;
_248 = (real(kind=8) *) _711;
parm.328.data = _248;
_612 = (unsigned long) _20;
_574 = _612 + ivtmp.1050_1737;
_249 = (real(kind=8) *) _574;
parm.329.data = _249;
_573 = (unsigned long) _67;
_572 = _573 + ivtmp.1050_1737;
_250 = (real(kind=8) *) _572;
parm.330.data = _250;
_251 = (real(kind=8) *) ivtmp.1049_1740;
_252 = (real(kind=8) *) ivtmp.1048_1753;
_253 = (real(kind=8) *) ivtmp.1047_1762;
perdida.constprop (&dt, &lambda, &mu, &yield_stress, &r_infinity, &b,
&x_infinity, &gamma, &eta, &plastic_strain_threshold, &parm.326, &parm.327,
&parm.328, &parm.329, _253, &parm.330, _252, _251, &failure_threshold,
&crack_closure_parameter);
parm.326 ={v} {CLOBBER};
parm.327 ={v} {CLOBBER};
parm.328 ={v} {CLOBBER};
parm.329 ={v} {CLOBBER};
parm.330 ={v} {CLOBBER};
and compiles to:
movq 48(%rsp), %rax
movq %r10, 24(%rsp)
addq %rbp, %rax
movq %rax, 720(%rsp)
movq 32(%rsp), %rax
addq %rbp, %rax
movq %rax, 816(%rsp)
movq 64(%rsp), %rax
addq %rbp, %rax
movq %rax, 912(%rsp)
movq 40(%rsp), %rax
addq %rbp, %rax
movq %rax, 1008(%rsp)
movq 72(%rsp), %rax
addq %rbp, %rax
addq $72, %rbp
movq %rax, 1104(%rsp)
leaq 352(%rsp), %rax
pushq %rax
.cfi_def_cfa_offset 1720
leaq 384(%rsp), %rax
pushq %rax
.cfi_def_cfa_offset 1728
pushq %r13
.cfi_def_cfa_offset 1736
addq $8, %r13
pushq %r12
.cfi_def_cfa_offset 1744
addq $8, %r12
leaq 1136(%rsp), %rax
pushq %rax
.cfi_def_cfa_offset 1752
pushq %r14
.cfi_def_cfa_offset 1760
addq $8, %r14
leaq 1056(%rsp), %rax
pushq %rax
.cfi_def_cfa_offset 1768
leaq 968(%rsp), %rax
pushq %rax
.cfi_def_cfa_offset 1776
leaq 880(%rsp), %rax
pushq %rax
.cfi_def_cfa_offset 1784
leaq 792(%rsp), %rax
pushq %rax
.cfi_def_cfa_offset 1792
leaq 488(%rsp), %rax
pushq %rax
.cfi_def_cfa_offset 1800
leaq 456(%rsp), %rax
pushq %rax
.cfi_def_cfa_offset 1808
leaq 480(%rsp), %rax
pushq %rax
.cfi_def_cfa_offset 1816
leaq 568(%rsp), %rax
pushq %rax
.cfi_def_cfa_offset 1824
leaq 440(%rsp), %r9
leaq 528(%rsp), %r8
leaq 584(%rsp), %rcx
leaq 512(%rsp), %rdx
leaq 504(%rsp), %rsi
leaq 472(%rsp), %rdi
call __perdida_m_MOD_perdida.constprop.0
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
` (6 preceding siblings ...)
2021-12-08 18:24 ` hubicka at gcc dot gnu.org
@ 2021-12-08 18:28 ` hubicka at gcc dot gnu.org
2021-12-08 18:43 ` hubicka at gcc dot gnu.org
` (6 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-08 18:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
--- Comment #8 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Also IPA-SRA fails with
! Disqualifying parameter number 0 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 3 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 4 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 5 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 6 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 7 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 8 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 9 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 10 - Would result into a too big set of replacements.
! Disqualifying parameter number 11 - Would result into a too big set of replacements.
! Disqualifying parameter number 12 - Would result into a too big set of replacements.
! Disqualifying parameter number 13 - Would result into a too big set of replacements.
! Disqualifying parameter number 15 - Would result into a too big set of replacements.
! Disqualifying parameter number 18 - Would create a possibly illegal dereference in a caller.
! Disqualifying parameter number 19 - Would create a possibly illegal dereference in a caller.
I think it is overparanoid about values passed by reference.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
` (7 preceding siblings ...)
2021-12-08 18:28 ` hubicka at gcc dot gnu.org
@ 2021-12-08 18:43 ` hubicka at gcc dot gnu.org
2021-12-12 8:43 ` hubicka at gcc dot gnu.org
` (5 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-08 18:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
--- Comment #9 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Hacking around the logic in ipa-sra that disables the transform and adding
-fwhole-program I get down to:
Performance counter stats for './a.out-bad3':
24,946.66 msec task-clock # 0.999 CPUs utilized
1,078 context-switches # 43.212 /sec
42 cpu-migrations # 1.684 /sec
71 page-faults # 2.846 /sec
96,144,941,575 cycles # 3.854 GHz
151,439,200 stalled-cycles-frontend # 0.16% frontend cycles idle
68,072,941,085 stalled-cycles-backend # 70.80% backend cycles idle
210,675,636,303 instructions # 2.19 insn per cycle
# 0.32 stalled cycles per insn
9,128,994,716 branches # 365.941 M/sec
24,781,891 branch-misses # 0.27% of all branches
24.982117481 seconds time elapsed
24.909903000 seconds user
0.036031000 seconds sys
which is not bad. I think we have two ipa-sra issues:
1) ipa-sra is overparanoid about not adding dereferences. First, I think it is
safe for parameters that are REFERENCE_TYPE rather than POINTER_TYPE. Second,
it should do propagation from callers to callees: it is quite easy to figure
out that a given param contains data packed into a structure only to make the
callee happy.
2) Since ipa-sra is run before ipa-cp, it won't simplify ipa-cp (or other)
clones even if they are static symbols.
Also I think ipa-sra may consider packing multiple structures together. If
array descriptors are passed by reference and built in the caller, it seems
pointless to pass each as a separate struct. Not sure what kind of benefits to
expect here though.
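The REFERENCE_TYPE versus POINTER_TYPE point in 1) can be illustrated with a small C++ sketch. All names are hypothetical and the by-value function is a hand-written picture of the result one would hope for, not actual ipa-sra output. With a pointer, moving the load into the caller could introduce a dereference of a null pointer that the original program never performed; with a reference, the language already requires a valid object, so loading in the caller is safe.

```cpp
#include <cassert>

// Pointer parameter: dereferenced only when `use` is true, so callers may
// legitimately pass nullptr with use == false. Hoisting the load of *factor
// into such a caller would add an invalid dereference.
double scale_ptr(const double *factor, double x, bool use) {
  assert(!use || factor != nullptr);
  return use ? *factor * x : x;
}

// Reference parameter: it must be bound to a valid object, so a load in the
// caller cannot fault even if the callee would have skipped it.
double scale_ref(const double &factor, double x, bool use) {
  return use ? factor * x : x;
}

// Hand-written by-value form: the load has been moved to the caller and the
// scalar travels in a register instead of through memory.
double scale_val(double factor, double x, bool use) {
  return use ? factor * x : x;
}
```

The trade-off is that the by-value form always performs the load in the caller, even on paths where the callee would not have needed it.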
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
` (8 preceding siblings ...)
2021-12-08 18:43 ` hubicka at gcc dot gnu.org
@ 2021-12-12 8:43 ` hubicka at gcc dot gnu.org
2021-12-14 15:51 ` cvs-commit at gcc dot gnu.org
` (4 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-12 8:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
--- Comment #10 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
https://gcc.gnu.org/pipermail/gcc-patches/2021-December/586644.html
improves modref so it is almost able to remove the unnecessary stores to array
descriptors. Unfortunately, to eliminate them one needs to handle the case that
calling an unknown function is safe WRT all non-escaping values (since passing
them to the function makes them escape), and that is sadly broken by the
Fortran FE incorrectly annotating functions calling GOMP_parallel.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
` (9 preceding siblings ...)
2021-12-12 8:43 ` hubicka at gcc dot gnu.org
@ 2021-12-14 15:51 ` cvs-commit at gcc dot gnu.org
2022-01-31 10:33 ` hubicka at gcc dot gnu.org
` (3 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-12-14 15:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
--- Comment #11 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:
https://gcc.gnu.org/g:3305135c29e1c3e988bd9bad40aefc01d138aaca
commit r12-5960-g3305135c29e1c3e988bd9bad40aefc01d138aaca
Author: Jan Hubicka <jh@suse.cz>
Date: Tue Dec 14 16:50:27 2021 +0100
Determine global memory accesses in ipa-modref
As discussed in PR103585, fatigue2 is now the only benchmark from my usual
testing set (SPEC2k6, SPEC2k17, CPP benchmarks, polyhedron, Firefox, clang)
which sees an important regression when inlining functions called once is
limited. This prevents us from solving runtime issues in roms benchmarks and
elsewhere.
The problem is that there is a perdida function that takes many arguments and
some of them are array descriptors. We constant propagate most of their
fields but still keep their initialization. Because perdida is quite fast,
the call overhead dominates, since we need over 100 memory stores consuming
about 35% of the overall benchmark runtime.
The memory stores would be eliminated if perdida did not call Fortran I/O,
which makes modref think that the array descriptors could be accessed. We
are quite close to discovering that they can't because they are non-escaping
from the function. This patch makes modref distinguish between global memory
accesses (only things that escape) and unknown accesses (that may also access
nonescaping things reaching the function). This makes disambiguation for
functions containing error handling better.
Unfortunately the patch hits two semi-latent issues in the Fortran frontend.
The first is wrong code in gfortran.dg/unlimited_polymorphic_3.f03. This can
be turned into a wrong code testcase on both mainline and gcc11 if the runtime
call is removed, so I filed PR 103662 for it. There is a TBAA mismatch for a
structure produced in the FE.
The second is an issue with GOMP where Fortran marks certain parameters as
non-escaping and then makes them escape via GOMP_parallel. For this I
disabled the use of escape info in verify_arg, which also disables the useful
transform on perdida but still does useful work for e.g. GCC error handling.
I will work on this incrementally.
Bootstrapped/regtested x86_64-linux, lto-bootstrapped and also tested with
clang build. I plan to commit this tomorrow if there are no complaints
(the patch is not completely short but conceptually simple and handles a lot
of common cases).
gcc/ChangeLog:
2021-12-12 Jan Hubicka <hubicka@ucw.cz>
PR ipa/103585
* ipa-modref-tree.c (modref_access_node::range_info_useful_p): Handle
MODREF_GLOBAL_MEMORY_PARM.
(modref_access_node::dump): Likewise.
(modref_access_node::get_call_arg): Likewise.
* ipa-modref-tree.h (enum modref_special_parms): Add
MODREF_GLOBAL_MEMORY_PARM.
(modref_access_node::useful_for_kill): Handle
MODREF_GLOBAL_MEMORY_PARM.
(modref_tree::merge): Add promote_unknown_to_global.
* ipa-modref.c (verify_arg): New function.
(may_access_nonescaping_parm_p): New function.
(modref_access_analysis::record_global_memory_load): New member
function.
(modref_access_analysis::record_global_memory_store): Likewise.
(modref_access_analysis::process_fnspec): Distinguish global and local
memory.
(modref_access_analysis::analyze_call): Likewise.
* tree-ssa-alias.c (ref_may_access_global_memory_p): New function.
(modref_may_conflict): Use it.
gcc/testsuite/ChangeLog:
2021-12-12 Jan Hubicka <hubicka@ucw.cz>
* gcc.dg/analyzer/data-model-1.c: Disable ipa-modref.
* gcc.dg/uninit-38.c: Likewise.
* gcc.dg/uninit-pr98578.c: Likewise.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
` (10 preceding siblings ...)
2021-12-14 15:51 ` cvs-commit at gcc dot gnu.org
@ 2022-01-31 10:33 ` hubicka at gcc dot gnu.org
2022-12-14 0:05 ` cvs-commit at gcc dot gnu.org
` (2 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2022-01-31 10:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
--- Comment #12 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 52317
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52317&action=edit
WIP patch
This is a patch I made while working on this problem. It improves the escape
analysis of the first patch so it cooperates with PTA. However, because the
Fortran FE sort of lies about escapes in the presence of OpenMP, I had to
disable it for the cases that are interesting for perdida. So I will look into
it in the next stage 1.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
` (11 preceding siblings ...)
2022-01-31 10:33 ` hubicka at gcc dot gnu.org
@ 2022-12-14 0:05 ` cvs-commit at gcc dot gnu.org
2023-01-19 13:21 ` jamborm at gcc dot gnu.org
2023-01-29 2:23 ` hubicka at gcc dot gnu.org
14 siblings, 0 replies; 16+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-12-14 0:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
--- Comment #13 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Martin Jambor <jamborm@gcc.gnu.org>:
https://gcc.gnu.org/g:10478270fe0c39c59eb0f35d19356a63bdf3a2ad
commit r13-4687-g10478270fe0c39c59eb0f35d19356a63bdf3a2ad
Author: Martin Jambor <mjambor@suse.cz>
Date: Wed Dec 14 00:33:06 2022 +0100
ipa-sra: Treat REFERENCE_TYPES as always dereferencable
C++ and especially Fortran pass data by references which are not
pointers potentially pointing anywhere and so can be assumed to be
safely dereferencable. This patch teaches IPA-SRA to treat them as
such and avoid the dance we do to prove that we can move loads from
them to the caller.
When we do not know that a dereference will happen all the time, we
need a heuristic so that we do not force memory accesses that normally
happen only rarely. The patch simply uses the (possibly guessed)
profile and checks whether the (expected) number of loads is at least
half of function invocations - the half is now configurable with a
param as requested by Honza.
gcc/ChangeLog:
2022-12-13 Martin Jambor <mjambor@suse.cz>
PR ipa/103585
* params.opt (ipa-sra-deref-prob-threshold): New parameter.
* doc/invoke.texi (ipa-sra-deref-prob-threshold): Document it.
* ipa-sra.cc (struct gensum_param_access): New field load_count.
(struct gensum_param_desc): New field safe_ref, adjusted comments.
(by_ref_count): Renamed to unsafe_by_ref_count, adjusted all uses.
(dump_gensum_access): Dump the new field.
(dump_gensum_param_descriptor): Likewise.
(create_parameter_descriptors): Set safe_ref field, move setting
by_ref forward. Only increment unsafe_by_ref_count for unsafe
by_ref parameters.
(allocate_access): Initialize new field.
(mark_param_dereference): Adjust indentation. Only add data to
bb_dereferences for unsafe by_ref parameters.
(scan_expr_access): For loads, accumulate BB counts.
(dereference_probable_p): New function.
(check_gensum_access): Fix leading comment, add parameter FUN.
Check cumulative counts of loads for safe by_ref accesses instead
of dereferences.
(process_scan_results): Do not propagate dereference distances for
safe by_ref parameters. Pass fun to check_gensum_access. Safe
by_ref params do not need the postdominance check.
gcc/testsuite/ChangeLog:
2022-11-11 Martin Jambor <mjambor@suse.cz>
* g++.dg/ipa/ipa-sra-5.C: New test
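The profile-based check described in the commit message can be pictured with a short C++ sketch. This is a simplified, hypothetical stand-in and not the code from ipa-sra.cc: the function name, the use of plain integer counts, and the "at least" comparison are assumptions taken from the wording above, with the threshold expressed as a percentage of function invocations (50 corresponds to "half").

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical check: is the expected number of loads through a by-reference
// parameter at least threshold_percent of the number of function entries?
bool dereference_probable(std::uint64_t load_count,
                          std::uint64_t entry_count,
                          unsigned threshold_percent) {
  assert(threshold_percent <= 100);
  // Compare load_count / entry_count >= threshold_percent / 100 without
  // dividing; long double avoids overflow for very large profile counts.
  return static_cast<long double>(load_count) * 100.0L >=
         static_cast<long double>(entry_count) * threshold_percent;
}
```

For example, a parameter loaded on 49 out of 100 invocations would be rejected with a threshold of 50, while one loaded on 50 out of 100 would be accepted.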
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
` (12 preceding siblings ...)
2022-12-14 0:05 ` cvs-commit at gcc dot gnu.org
@ 2023-01-19 13:21 ` jamborm at gcc dot gnu.org
2023-01-29 2:23 ` hubicka at gcc dot gnu.org
14 siblings, 0 replies; 16+ messages in thread
From: jamborm at gcc dot gnu.org @ 2023-01-19 13:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
--- Comment #14 from Martin Jambor <jamborm at gcc dot gnu.org> ---
Honza, what remains to be done here (if anything)?
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug ipa/103585] fatigue2 requires inlining of peridida to work well
2021-12-06 16:59 [Bug tree-optimization/103585] New: fatigue2 requires inlining of peridida to work well hubicka at gcc dot gnu.org
` (13 preceding siblings ...)
2023-01-19 13:21 ` jamborm at gcc dot gnu.org
@ 2023-01-29 2:23 ` hubicka at gcc dot gnu.org
14 siblings, 0 replies; 16+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-01-29 2:23 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103585
--- Comment #15 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
We get 47s runtime with -O2 -flto and 53s with -O2
-fno-inline-functions-called-once.
The call sequence is:
<bb 81> [local count: 109362591]:
_1656 = (unsigned long) _45;
_1655 = _1656 + ivtmp.1182_2540;
_229 = (double *) _1655;
_1646 = (unsigned long) _35;
_1645 = _1646 + ivtmp.1182_2540;
_230 = (double *) _1645;
_1636 = (unsigned long) _55;
_1635 = _1636 + ivtmp.1182_2540;
_231 = (double *) _1635;
_1628 = (unsigned long) _17;
_1627 = _1628 + ivtmp.1182_2540;
_232 = (double *) _1627;
_1618 = (unsigned long) _64;
_1617 = _1618 + ivtmp.1182_2540;
_233 = (double *) _1617;
_234 = (double *) ivtmp.1181_2551;
_235 = (double *) ivtmp.1180_2575;
_236 = (double *) ivtmp.1178_2586;
_2607 = yield_stress;
perdida.constprop.isra (&dt, &lambda, &mu, _2607, &r_infinity, &b,
&x_infinity, &gamma, &eta, &plastic_strain_threshold, _229, _230, _231, _232,
_236, _233, _235, _234, &failure_threshold, &crack_closure_parameter);
It is not clear to me why lambda is not replaced. However, for dt it seems to
be:
! Disqualifying parameter number 0 - Dereferences in callers would happen much more frequently.
I think this is decided too early, since if we SRA all the way down to the
original caller we will avoid all dereferences completely.
Another place for improvement is non-LTO. Here IPA-SRA disables itself since
it does not have a cost model for cloning (that could also be improved).
The situation could be improved by ipa-modref, which may optimize away unused
parts of the array descriptors. However, ipa-modref gives up because perdida
has Fortran I/O, and it then stops tracking the descriptors even if the
descriptors never escape to the I/O.
For this I need to finish the non-escaping analysis, i.e. distinguish between
arguments that do not escape in the sense that once the function returns they
are not saved in global memory, and arguments that do not escape in the sense
that they are never passed down to a callee function.
We may also annotate Fortran I/O and understand what it does.
So there is still a lot to do.
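The two notions of "non-escaping" mentioned above can be told apart with a small C++ sketch. All names are invented for illustration, and the classification in the comments is the informal one from this comment rather than the exact lattice ipa-modref uses.

```cpp
#include <cassert>

static const double *g_saved = nullptr;  // global memory

// An opaque helper standing in for any callee.
__attribute__((noinline)) double read_value(const double *p) {
  assert(p != nullptr);
  return *p;
}

// Escapes in the strong sense: the pointer is saved in global memory and
// stays reachable after the function returns.
double stores_globally(const double *p) {
  g_saved = p;
  return *p;
}

// Not saved anywhere once the function returns, but passed down to a callee,
// so the callee may access the pointed-to memory during the call.
double passes_down(const double *p) {
  return read_value(p) + 1.0;
}

// Neither saved nor passed down: only this function touches *p, so calls it
// makes to unrelated functions cannot read or write that memory.
double purely_local(const double *p) {
  return *p * 2.0;
}
```

Only the last case lets an analysis conclude that unrelated calls inside the function, such as I/O, cannot touch the argument.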
^ permalink raw reply [flat|nested] 16+ messages in thread