public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/110449] New: Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization
@ 2023-06-28 9:22 hliu at amperecomputing dot com
2023-06-28 20:11 ` [Bug tree-optimization/110449] " rsandifo at gcc dot gnu.org
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: hliu at amperecomputing dot com @ 2023-06-28 9:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449
Bug ID: 110449
Summary: Vect: use a small step to calculate the loop induction
if the loop is unrolled during loop vectorization
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: hliu at amperecomputing dot com
Target Milestone: ---
This is inspired by clang. Compile the follwing case with "-mcpu=neoverse-n2
-O3":
void foo(int *arr, int val, int step) {
for (int i = 0; i < 1024; i++) {
arr[i] = val;
val += step;
}
}
It will be unrolled by 2 during vectorization. GCC generates code:
fmov s29, w2 # step
shl v27.2s, v29.2s, 3 # 8*step
shl v28.2s, v29.2s, 2 # 4*step
...
.L2:
mov v30.16b, v31.16b
add v31.4s, v31.4s, v27.4s # += 8*step
add v29.4s, v30.4s, v28.4s # += 4*step
stp q30, q29, [x0]
add x0, x0, 32
cmp x1, x0
bne .L2
The v27 (i.e. "8*step") is actually not necessary. We can use v29 + v28 (i.e.
"+ 4*step") and generate simpler code:
fmov s29, w2 # step
shl v28.2s, v29.2s, 2 # 4*step
...
.L2:
add v29.4s, v30.4s, v28.4s # += 4*step
stp q30, q29, [x0]
add x0, x0, 32
add v30.4s, v29.4s, v28.4s # += 4*step
cmp x1, x0
bne .L2
This has two benefits:
(1) Save 1 vector register and one "mov" instructon
(2) For floating point, the result value of small step should be closer to the
original scalar result value than large step. I.e. "A + 4*step + ... + 4*step"
should be closer to "A + step + ... + step" than "A + 8*step + ... 8*step".
Do you think if this is reasonable?
I have a simple patch to enhance the tree-vect-loop.cc
"vectorizable_induction()" to achieve this. Will send out the patch for code
review later.
^ permalink raw reply [flat|nested] 4+ messages in thread
* [Bug tree-optimization/110449] Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization
2023-06-28 9:22 [Bug tree-optimization/110449] New: Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization hliu at amperecomputing dot com
@ 2023-06-28 20:11 ` rsandifo at gcc dot gnu.org
2023-06-29 3:15 ` hliu at amperecomputing dot com
2023-07-06 16:05 ` cvs-commit at gcc dot gnu.org
2 siblings, 0 replies; 4+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2023-06-28 20:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449
rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |rguenth at gcc dot gnu.org,
| |rsandifo at gcc dot gnu.org
--- Comment #1 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Interesting idea! But I think the ideal thing here would be
to do the 8*step after the store:
.L2:
add v29.4s, v31.4s, v28.4s # += 4*step
stp q31, q29, [x0]
add v31.4s, v31.4s, v27.4s # += 8*step
add x0, x0, 32
cmp x1, x0
bne .L2
This has the advantage that the loop-carried dependency
is only one ADD instruction deep, rather than 2 ADDs deep.
I haven't looked how easy it would be to do though…
^ permalink raw reply [flat|nested] 4+ messages in thread
* [Bug tree-optimization/110449] Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization
2023-06-28 9:22 [Bug tree-optimization/110449] New: Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization hliu at amperecomputing dot com
2023-06-28 20:11 ` [Bug tree-optimization/110449] " rsandifo at gcc dot gnu.org
@ 2023-06-29 3:15 ` hliu at amperecomputing dot com
2023-07-06 16:05 ` cvs-commit at gcc dot gnu.org
2 siblings, 0 replies; 4+ messages in thread
From: hliu at amperecomputing dot com @ 2023-06-29 3:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449
--- Comment #2 from Hao Liu <hliu at amperecomputing dot com> ---
That looks better than the currently generated code (it saves one "MOV"
instruction). Yes, it has the loop-carried dependency advantage. But it still
uses one more register for "8*step" (There may be a register pressure problem
for complicated code, not for this simple case).
This is still a floating point precision problem. There is a PR84201 discussed
about the same problem for X86:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84201. The larger step makes the
floating point calculation result has larger gap compared to the original
scalar calculation result. E.g. The SPEC2017 fp benchmark 549.fotonik may
result in VE (Validation Error) after unrolling a loop of double:
319 do ifreq = 1, tmppower%nofreq <------ HERE
320 frequency(ifreq,ipower) = freq
321 freq = freq + freqstep
322 end do
it uses 4*step for unrolled vectorization version other than the 2*step for
non-unrolled vectorization version. The SPEC fp result checks the "relative
tolerance" of the fp results and it is higher than the current standard (i.e.
the compare command line option of "--reltol 1e-10").
^ permalink raw reply [flat|nested] 4+ messages in thread
* [Bug tree-optimization/110449] Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization
2023-06-28 9:22 [Bug tree-optimization/110449] New: Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization hliu at amperecomputing dot com
2023-06-28 20:11 ` [Bug tree-optimization/110449] " rsandifo at gcc dot gnu.org
2023-06-29 3:15 ` hliu at amperecomputing dot com
@ 2023-07-06 16:05 ` cvs-commit at gcc dot gnu.org
2 siblings, 0 replies; 4+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-07-06 16:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449
--- Comment #3 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jeff Law <law@gcc.gnu.org>:
https://gcc.gnu.org/g:224fd59b2dc8a5fa78a309a09863afe9b3cf2111
commit r14-2367-g224fd59b2dc8a5fa78a309a09863afe9b3cf2111
Author: Hao Liu OS <hliu@os.amperecomputing.com>
Date: Thu Jul 6 10:04:46 2023 -0600
Vect: use a small step to calculate induction for the unrolled loop (PR
tree-optimization/110449)
If a loop is unrolled by n times during vectoriation, two steps are used to
calculate the induction variable:
- The small step for the unrolled ith-copy: vec_1 = vec_iv + (VF/n *
Step)
- The large step for the whole loop: vec_loop = vec_iv + (VF * Step)
This patch calculates an extra vec_n to replace vec_loop:
vec_n = vec_prev + (VF/n * S) = vec_iv + (VF/n * S) * n = vec_loop.
So that we can save the large step register and related operations.
gcc/ChangeLog:
PR tree-optimization/110449
* tree-vect-loop.cc (vectorizable_induction): use vec_n to replace
vec_loop for the unrolled loop.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/pr110449.c: New testcase.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2023-07-06 16:05 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-28 9:22 [Bug tree-optimization/110449] New: Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization hliu at amperecomputing dot com
2023-06-28 20:11 ` [Bug tree-optimization/110449] " rsandifo at gcc dot gnu.org
2023-06-29 3:15 ` hliu at amperecomputing dot com
2023-07-06 16:05 ` cvs-commit at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).