public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug middle-end/110692] New: epilogues for loop which can be also vectorized with half size can be improved.
@ 2023-07-16 19:31 hubicka at gcc dot gnu.org
2023-07-17 7:45 ` [Bug tree-optimization/110692] " rguenth at gcc dot gnu.org
0 siblings, 1 reply; 2+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-07-16 19:31 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110692
Bug ID: 110692
Summary: epilogues for loop which can be also vectorized with
half size can be improved.
Product: gcc
Version: 13.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---
for:
int a[99];
test()
{
for (int i = 0 ; i < 99; i++)
a[i]++;
}
we produce:
0: 66 0f 6f 0d 00 00 00 movdqa 0x0(%rip),%xmm1 # 8 <test+0x8>
7: 00
8: b8 00 00 00 00 mov $0x0,%eax
d: ba 00 00 00 00 mov $0x0,%edx
12: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
18: 66 0f 6f 00 movdqa (%rax),%xmm0
1c: 48 83 c0 10 add $0x10,%rax
20: 66 0f fe c1 paddd %xmm1,%xmm0
24: 0f 29 40 f0 movaps %xmm0,-0x10(%rax)
28: 48 39 c2 cmp %rax,%rdx
2b: 75 eb jne 18 <test+0x18>
2d: f3 0f 7e 05 00 00 00 movq 0x0(%rip),%xmm0 # 35 <test+0x35>
34: 00
35: 83 05 00 00 00 00 01 addl $0x1,0x0(%rip) # 3c <test+0x3c>
3c: f3 0f 7e 0d 00 00 00 movq 0x0(%rip),%xmm1 # 44 <test+0x44>
43: 00
44: 66 0f fe c1 paddd %xmm1,%xmm0
48: 66 0f d6 05 00 00 00 movq %xmm0,0x0(%rip) # 50 <test+0x50>
4f: 00
50: c3 ret
which does the 4x vectorized loop followed by 2x vectorized loopless epilogue
and copy of remaining byte.
When bound is unknow we do:
int a[99];
test(int n)
{
for (int i = 0 ; i < n; i++)
a[i]++;
}
0: 85 ff test %edi,%edi
2: 7e 70 jle 74 <test+0x74>
4: 8d 47 ff lea -0x1(%rdi),%eax
7: 83 f8 02 cmp $0x2,%eax
a: 76 6d jbe 79 <test+0x79>
c: 89 fa mov %edi,%edx
e: 66 0f 6f 0d 00 00 00 movdqa 0x0(%rip),%xmm1 # 16 <test+0x16>
15: 00
16: 31 c0 xor %eax,%eax
18: c1 ea 02 shr $0x2,%edx
1b: 48 c1 e2 04 shl $0x4,%rdx
1f: 66 0f 6f 80 00 00 00 movdqa 0x0(%rax),%xmm0
26: 00
27: 48 83 c0 10 add $0x10,%rax
2b: 66 0f fe c1 paddd %xmm1,%xmm0
2f: 0f 29 80 00 00 00 00 movaps %xmm0,0x0(%rax)
36: 48 39 d0 cmp %rdx,%rax
39: 75 e4 jne 1f <test+0x1f>
3b: 89 f8 mov %edi,%eax
3d: 83 e0 fc and $0xfffffffc,%eax
40: 40 f6 c7 03 test $0x3,%dil
44: 74 32 je 78 <test+0x78>
46: 48 63 d0 movslq %eax,%rdx
49: 83 04 95 00 00 00 00 addl $0x1,0x0(,%rdx,4)
50: 01
51: 8d 50 01 lea 0x1(%rax),%edx
54: 39 d7 cmp %edx,%edi
56: 7e 1c jle 74 <test+0x74>
58: 48 63 d2 movslq %edx,%rdx
5b: 83 c0 02 add $0x2,%eax
5e: 83 04 95 00 00 00 00 addl $0x1,0x0(,%rdx,4)
65: 01
66: 39 c7 cmp %eax,%edi
68: 7e 0a jle 74 <test+0x74>
6a: 48 98 cltq
6c: 83 04 85 00 00 00 00 addl $0x1,0x0(,%rax,4)
73: 01
74: c3 ret
75: 0f 1f 00 nopl (%rax)
78: c3 ret
79: 31 c0 xor %eax,%eax
7b: eb c9 jmp 46 <test+0x46>
the profitability threshold is 4.
Producing the loopless epilogue just as in the first case with additional tests
for block size would work better. The code looks quite bad for small trip
counts since there is extra jump down to 79.
^ permalink raw reply [flat|nested] 2+ messages in thread
* [Bug tree-optimization/110692] epilogues for loop which can be also vectorized with half size can be improved.
2023-07-16 19:31 [Bug middle-end/110692] New: epilogues for loop which can be also vectorized with half size can be improved hubicka at gcc dot gnu.org
@ 2023-07-17 7:45 ` rguenth at gcc dot gnu.org
0 siblings, 0 replies; 2+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-17 7:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110692
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |rguenth at gcc dot gnu.org
Keywords| |missed-optimization
Component|middle-end |tree-optimization
Blocks| |53947
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
With the variable bound we have
a[i_11] 1 times vector_load costs 12 in body
_1 + 1 1 times scalar_to_vec costs 4 in prologue
_1 + 1 1 times vector_stmt costs 4 in body
_2 1 times vector_store costs 12 in body
t.c:4:28: note: operating on full vectors for epilogue loop.
t.c:4:28: note: cost model: epilogue peel iters set to vf/2 because loop
iterations are unknown .
a[i_11] 1 times scalar_load costs 12 in epilogue
_1 + 1 1 times scalar_stmt costs 4 in epilogue
_2 1 times scalar_store costs 12 in epilogue
<unknown> 1 times cond_branch_taken costs 16 in epilogue
t.c:4:28: note: Cost model analysis:
Vector inside of loop cost: 28
Vector prologue cost: 4
Vector epilogue cost: 44
Scalar iteration cost: 28
Scalar outside cost: 32
Vector outside cost: 48
prologue iterations: 0
epilogue iterations: 1
Calculated minimum iters for profitability: 1
t.c:4:28: note: Runtime profitability threshold = 2
t.c:4:28: note: Static estimate profitability threshold = 4
t.c:4:28: missed: not vectorized: estimated iteration count too small.
with the static bound:
a[i_10] 1 times vector_load costs 12 in body
_1 + 1 1 times scalar_to_vec costs 4 in prologue
_1 + 1 1 times vector_stmt costs 4 in body
_2 1 times vector_store costs 12 in body
t.c:4:28: note: operating on full vectors for epilogue loop.
a[i_10] 1 times scalar_load costs 12 in epilogue
_1 + 1 1 times scalar_stmt costs 4 in epilogue
_2 1 times scalar_store costs 12 in epilogue
t.c:4:28: note: Cost model analysis:
Vector inside of loop cost: 28
Vector prologue cost: 4
Vector epilogue cost: 28
Scalar iteration cost: 28
Scalar outside cost: 0
Vector outside cost: 32
prologue iterations: 0
epilogue iterations: 1
Calculated minimum iters for profitability: 2
t.c:4:28: note: Runtime profitability threshold = 2
t.c:4:28: note: Static estimate profitability threshold = 2
so it's the branch of the epilog of the epilog that ups the cost, not sure
whether in a reasonable way for this case. In the end I think if the
count is unknown a fully peeled epilog with 2 iterations is a reasonable
implementation.
We could decide that costing the epilogue vectorization isn't worthwhile
but note we are not implementing all 8 byte vector ops with the same
efficiency as the 16 byte ones.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2023-07-17 7:45 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-16 19:31 [Bug middle-end/110692] New: epilogues for loop which can be also vectorized with half size can be improved hubicka at gcc dot gnu.org
2023-07-17 7:45 ` [Bug tree-optimization/110692] " rguenth at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).