public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations
@ 2023-12-21 8:22 fxue at os dot amperecomputing.com
2023-12-21 9:07 ` [Bug tree-optimization/113104] " rguenth at gcc dot gnu.org
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: fxue at os dot amperecomputing.com @ 2023-12-21 8:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104
Bug ID: 113104
Summary: Suboptimal loop-based slp node splicing across
iterations
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: fxue at os dot amperecomputing.com
Target Milestone: ---
Given a partial vector-sized slp node in a loop, code generation would utilize
inter-iteration parallelism to achieve full vectorization by splicing the defs of
the node from multiple iterations into one vector. This strategy is not always
good, and could be refined in some situations. To be specific, we had better not
splice a node if it participates in a full-vector-sized operation; otherwise a
permute and a vextract that are not really needed are introduced.
Suppose the target vector size is 128 bits, and an slp node is mapped to VEC_OP in an
iteration. Depending on whether the backend supports LO/HI versions of the
operation, there are two kinds of code sequences for splicing.
// Isolated 2 iterations
res_v128_I0 = VEC_OP(opnd_v64_I0, ...) // iteration #0
res_v128_I1 = VEC_OP(opnd_v64_I1, ...) // iteration #1
// Spliced (1)
opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 } // extra permute
opnd_v64_lo = [vec_unpack_lo_expr] opnd_v128_I0_I1; // extra vextract
opnd_v64_hi = [vec_unpack_hi_expr] opnd_v128_I0_I1; // extra vextract
res_v128_I0 = VEC_OP(opnd_v64_lo, ...)
res_v128_I1 = VEC_OP(opnd_v64_hi, ...)
// Spliced (2)
opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 } // extra permute
res_v128_I0 = VEC_OP_LO(opnd_v128_I0_I1, ...) // similar or same as VEC_OP
res_v128_I1 = VEC_OP_HI(opnd_v128_I0_I1, ...) // similar or same as VEC_OP
Sometimes, such a permute and vextract might be optimized away by backend passes.
But sometimes they cannot be. Here is a case on aarch64.
int test(unsigned array[4][4]);
int foo(unsigned short *a, unsigned long n)
{
  unsigned array[4][4];
  for (unsigned i = 0; i < 4; i++, a += n)
    {
      array[i][0] = a[0] << 6;
      array[i][1] = a[1] << 6;
      array[i][2] = a[2] << 6;
      array[i][3] = a[3] << 6;
    }
  return test(array);
}
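// How the listings below were presumably produced (the flags are my assumption,
// not stated in the report): compile for aarch64 at -O3, e.g. "gcc -O3 -S foo.c".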
// Current code generation
mov x2, x0
stp x29, x30, [sp, -80]!
add x3, x2, x1, lsl 1
lsl x1, x1, 1
mov x29, sp
add x4, x3, x1
ldr d0, [x2]
movi v30.4s, 0
add x0, sp, 16
ldr d31, [x2, x1]
ldr d29, [x3, x1]
ldr d28, [x4, x1]
ins v0.d[1], v31.d[0] //
ins v29.d[1], v28.d[0] //
zip1 v1.8h, v0.8h, v30.8h // superfluous
zip2 v0.8h, v0.8h, v30.8h //
zip1 v31.8h, v29.8h, v30.8h //
zip2 v29.8h, v29.8h, v30.8h //
shl v1.4s, v1.4s, 6
shl v0.4s, v0.4s, 6
shl v31.4s, v31.4s, 6
shl v29.4s, v29.4s, 6
stp q1, q0, [sp, 16]
stp q31, q29, [sp, 48]
bl test
ldp x29, x30, [sp], 80
ret
// May be optimized to:
stp x29, x30, [sp, -80]!
mov x29, sp
mov x2, x0
add x0, sp, 16
lsl x3, x1, 1
add x1, x2, x1, lsl 1
add x4, x1, x3
ldr d31, [x2, x3]
ushll v31.4s, v31.4h, 6
ldr d30, [x2]
ushll v30.4s, v30.4h, 6
str q30, [sp, 16]
ldr d30, [x1, x3]
ushll v30.4s, v30.4h, 6
str q31, [sp, 32]
ldr d31, [x4, x3]
ushll v31.4s, v31.4h, 6
stp q30, q31, [sp, 48]
bl test
ldp x29, x30, [sp], 80
ret
* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
2023-12-21 8:22 [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations fxue at os dot amperecomputing.com
@ 2023-12-21 9:07 ` rguenth at gcc dot gnu.org
2023-12-21 9:33 ` fxue at os dot amperecomputing.com
` (5 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-12-21 9:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104
Richard Biener <rguenth at gcc dot gnu.org> changed:
 What             |Removed                        |Added
----------------------------------------------------------------------------
 CC               |                               |rguenth at gcc dot gnu.org
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
See my proposal on the mailing list to lift the restriction of sticking to a
single vector size, I think this is another example showing this. If you
use BB level vectorization by disabling loop vectorization but not SLP
vectorization the code should improve?
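A minimal way to try that (the concrete flags are my mapping of the suggestion,
not part of the original comment) would be something like:
  gcc -O3 -fno-tree-loop-vectorize -S foo.c
which disables only the loop vectorizer while leaving -ftree-slp-vectorize
(BB SLP), enabled by -O3, active.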
* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
2023-12-21 8:22 [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations fxue at os dot amperecomputing.com
2023-12-21 9:07 ` [Bug tree-optimization/113104] " rguenth at gcc dot gnu.org
@ 2023-12-21 9:33 ` fxue at os dot amperecomputing.com
2023-12-21 9:41 ` rguenther at suse dot de
` (4 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: fxue at os dot amperecomputing.com @ 2023-12-21 9:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104
--- Comment #2 from Feng Xue <fxue at os dot amperecomputing.com> ---
(In reply to Richard Biener from comment #1)
> See my proposal on the mailing list to lift the restriction of sticking to a
> single vector size, I think this is another example showing this. If you
> use BB level vectorization by disabling loop vectorization but not SLP
> vectorization the code should improve?
Yes, the loop is fully unrolled, and BB SLP would improve the code.
I could not find the proposal; would you share a link with me? Thanks
* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
2023-12-21 8:22 [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations fxue at os dot amperecomputing.com
2023-12-21 9:07 ` [Bug tree-optimization/113104] " rguenth at gcc dot gnu.org
2023-12-21 9:33 ` fxue at os dot amperecomputing.com
@ 2023-12-21 9:41 ` rguenther at suse dot de
2023-12-30 12:35 ` rsandifo at gcc dot gnu.org
` (3 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: rguenther at suse dot de @ 2023-12-21 9:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104
--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 21 Dec 2023, fxue at os dot amperecomputing.com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104
>
> --- Comment #2 from Feng Xue <fxue at os dot amperecomputing.com> ---
> (In reply to Richard Biener from comment #1)
> > See my proposal on the mailing list to lift the restriction of sticking to a
> > single vector size, I think this is another example showing this. If you
> > use BB level vectorization by disabling loop vectorization but not SLP
> > vectorization the code should improve?
>
> Yes, the loop is fully unrolled, and BB SLP would improve the code.
I suspect that even when the loop isn't unrolled (just increase the iteration
count) the code would improve.
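For instance, a variant along these lines (my own illustration, not part of the
original exchange) keeps the loop from being fully unrolled by making the trip
count a runtime value:

int test2(unsigned (*array)[4], unsigned m);
int foo2(unsigned (*array)[4], unsigned short *a, unsigned long n, unsigned m)
{
  /* m is unknown at compile time, so the loop cannot be fully unrolled.  */
  for (unsigned i = 0; i < m; i++, a += n)
    {
      array[i][0] = a[0] << 6;
      array[i][1] = a[1] << 6;
      array[i][2] = a[2] << 6;
      array[i][3] = a[3] << 6;
    }
  return test2(array, m);
}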
> I could not find the proposal; would you share a link with me? Thanks
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640476.html
* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
2023-12-21 8:22 [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations fxue at os dot amperecomputing.com
` (2 preceding siblings ...)
2023-12-21 9:41 ` rguenther at suse dot de
@ 2023-12-30 12:35 ` rsandifo at gcc dot gnu.org
2024-01-05 16:25 ` cvs-commit at gcc dot gnu.org
` (2 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2023-12-30 12:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104
Richard Sandiford <rsandifo at gcc dot gnu.org> changed:
 What             |Removed                        |Added
----------------------------------------------------------------------------
 Status           |UNCONFIRMED                    |ASSIGNED
 Last reconfirmed |                               |2023-12-30
 Ever confirmed   |0                              |1
 CC               |                               |rsandifo at gcc dot gnu.org
 Assignee         |unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org
--- Comment #4 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
FWIW, we do get the desired code with -march=armv8-a+sve (even though the test
doesn't use SVE). This is because of:
  /* Consider enabling VECT_COMPARE_COSTS for SVE, both so that we
     can compare SVE against Advanced SIMD and so that we can compare
     multiple SVE vectorization approaches against each other.  There's
     not really any point doing this for Advanced SIMD only, since the
     first mode that works should always be the best.  */
  if (TARGET_SVE && aarch64_sve_compare_costs)
    flags |= VECT_COMPARE_COSTS;
The testcase in this PR is a counterexample to the claim in the final sentence.
I think the comment might predate significant support for mixed-sized Advanced
SIMD vectorisation.
If we enable SVE (or uncomment the "if" line), the costs are 13 units per
vector iteration for 128-bit vectors and 4 units per vector iteration for
64-bit vectors (so 8 units per 128 bits on a parity basis). The 64-bit version
is therefore seen as significantly cheaper and is chosen ahead of the 128-bit
version.
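Restating those numbers on a per-128-bit basis:
  128-bit vectors: 13 units per vector iteration -> 13 units per 128 bits
   64-bit vectors:  4 units per vector iteration ->  8 units per 128 bits (two iterations)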
I think this PR is enough proof that we should enable VECT_COMPARE_COSTS even
without SVE. Assigning to myself for that.
* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
2023-12-21 8:22 [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations fxue at os dot amperecomputing.com
` (3 preceding siblings ...)
2023-12-30 12:35 ` rsandifo at gcc dot gnu.org
@ 2024-01-05 16:25 ` cvs-commit at gcc dot gnu.org
2024-01-05 16:32 ` rsandifo at gcc dot gnu.org
2024-01-10 5:01 ` fxue at os dot amperecomputing.com
6 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-01-05 16:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104
--- Comment #5 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The trunk branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:
https://gcc.gnu.org/g:7328faf89e9b4953baaff10e18262c70fbd3e578
commit r14-6961-g7328faf89e9b4953baaff10e18262c70fbd3e578
Author: Richard Sandiford <richard.sandiford@arm.com>
Date: Fri Jan 5 16:25:16 2024 +0000
aarch64: Extend VECT_COMPARE_COSTS to !SVE [PR113104]
When SVE is enabled, we try vectorising with multiple different SVE and
Advanced SIMD approaches and use the cost model to pick the best one.
Until now, we've not done that for Advanced SIMD, since "the first mode
that works should always be the best".
The testcase is a counterexample. Each iteration of the scalar loop
vectorises naturally with 64-bit input vectors and 128-bit output
vectors. We do try that for SVE, and choose it as the best approach.
But the first approach we try is instead to use:
- a vectorisation factor of 2
- 1 128-bit vector for the inputs
- 2 128-bit vectors for the outputs
But since the stride is variable, the cost of marshalling the input
vector from two iterations outweighs the benefit of doing two iterations
at once.
This patch therefore generalises aarch64-sve-compare-costs to
aarch64-vect-compare-costs and applies it to non-SVE compilations.
gcc/
PR target/113104
* doc/invoke.texi (aarch64-sve-compare-costs): Replace with...
(aarch64-vect-compare-costs): ...this.
* config/aarch64/aarch64.opt (-param=aarch64-sve-compare-costs=):
Replace with...
(-param=aarch64-vect-compare-costs=): ...this new param.
* config/aarch64/aarch64.cc (aarch64_override_options_internal):
Don't disable it when vectorizing for Advanced SIMD only.
(aarch64_autovectorize_vector_modes): Apply VECT_COMPARE_COSTS
whenever aarch64_vect_compare_costs is true.
gcc/testsuite/
PR target/113104
* gcc.target/aarch64/pr113104.c: New test.
* gcc.target/aarch64/sve/cond_arith_1.c: Update for new parameter
names.
* gcc.target/aarch64/sve/cond_arith_1_run.c: Likewise.
* gcc.target/aarch64/sve/cond_arith_3.c: Likewise.
* gcc.target/aarch64/sve/cond_arith_3_run.c: Likewise.
* gcc.target/aarch64/sve/gather_load_6.c: Likewise.
* gcc.target/aarch64/sve/gather_load_7.c: Likewise.
* gcc.target/aarch64/sve/load_const_offset_2.c: Likewise.
* gcc.target/aarch64/sve/load_const_offset_3.c: Likewise.
* gcc.target/aarch64/sve/mask_gather_load_6.c: Likewise.
* gcc.target/aarch64/sve/mask_gather_load_7.c: Likewise.
* gcc.target/aarch64/sve/mask_load_slp_1.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_load_1.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_load_2.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_load_3.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_load_4.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_store_1.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_store_1_run.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_store_2.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_store_2_run.c: Likewise.
* gcc.target/aarch64/sve/pack_1.c: Likewise.
* gcc.target/aarch64/sve/reduc_4.c: Likewise.
* gcc.target/aarch64/sve/scatter_store_6.c: Likewise.
* gcc.target/aarch64/sve/scatter_store_7.c: Likewise.
* gcc.target/aarch64/sve/strided_load_3.c: Likewise.
* gcc.target/aarch64/sve/strided_store_3.c: Likewise.
* gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Likewise.
* gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise.
* gcc.target/aarch64/sve/unpack_signed_1.c: Likewise.
* gcc.target/aarch64/sve/unpack_unsigned_1.c: Likewise.
* gcc.target/aarch64/sve/unpack_unsigned_1_run.c: Likewise.
* gcc.target/aarch64/sve/vcond_11.c: Likewise.
* gcc.target/aarch64/sve/vcond_11_run.c: Likewise.
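For reference, the renamed parameter can also be exercised explicitly; the exact
invocations below are my illustration rather than part of the commit:
  gcc -O3 --param=aarch64-vect-compare-costs=1 -S pr113104.c
  gcc -O3 --param=aarch64-vect-compare-costs=0 -S pr113104.c
The first form (which, as I read the patch, is now the default) costs all candidate
vector modes against each other; the second restores the old first-mode-wins
behaviour.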
* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
2023-12-21 8:22 [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations fxue at os dot amperecomputing.com
` (4 preceding siblings ...)
2024-01-05 16:25 ` cvs-commit at gcc dot gnu.org
@ 2024-01-05 16:32 ` rsandifo at gcc dot gnu.org
2024-01-10 5:01 ` fxue at os dot amperecomputing.com
6 siblings, 0 replies; 8+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2024-01-05 16:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104
Richard Sandiford <rsandifo at gcc dot gnu.org> changed:
 What             |Removed                        |Added
----------------------------------------------------------------------------
 Status           |ASSIGNED                       |RESOLVED
 Resolution       |---                            |FIXED
--- Comment #6 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
Fixed. Thanks for the report.
* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
2023-12-21 8:22 [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations fxue at os dot amperecomputing.com
` (5 preceding siblings ...)
2024-01-05 16:32 ` rsandifo at gcc dot gnu.org
@ 2024-01-10 5:01 ` fxue at os dot amperecomputing.com
6 siblings, 0 replies; 8+ messages in thread
From: fxue at os dot amperecomputing.com @ 2024-01-10 5:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104
Feng Xue <fxue at os dot amperecomputing.com> changed:
 What             |Removed                        |Added
----------------------------------------------------------------------------
 Resolution       |FIXED                          |---
 Status           |RESOLVED                       |REOPENED
--- Comment #7 from Feng Xue <fxue at os dot amperecomputing.com> ---
Partial permutation (especially extract-low/high) would incur inefficient
splicing in some situations even when the vector mode with the lowest cost is used.
int test(unsigned array[4][4]);
int foo(unsigned short *a, unsigned long n)
{
  unsigned array[4][4];
  for (unsigned i = 0; i < 4; i++, a += n)
    {
      array[i][0] = (a[0] << 3) - (a[4] << 6);
      array[i][1] = (a[1] << 3) - (a[5] << 6);
      array[i][2] = (a[2] << 3) - (a[6] << 6);
      array[i][3] = (a[3] << 3) - (a[7] << 6);
    }
  return test(array);
}
// After vect-compare-cost fix
mov x2, x0
stp x29, x30, [sp, -80]!
add x3, x2, x1, lsl 1
lsl x1, x1, 1
mov x29, sp
add x4, x3, x1
add x0, sp, 16
ldr q5, [x2]
ldr q31, [x4, x1]
ldr q0, [x3, x1]
ldr q1, [x2, x1]
movi v28.4s, 0 //
zip1 v29.2d, v0.2d, v31.2d //
zip1 v2.2d, v5.2d, v1.2d //
zip2 v31.2d, v0.2d, v31.2d //
zip2 v1.2d, v5.2d, v1.2d //
zip1 v30.8h, v29.8h, v28.8h //
zip1 v4.8h, v2.8h, v28.8h // superfluous
zip1 v27.8h, v31.8h, v28.8h //
zip1 v3.8h, v1.8h, v28.8h //
zip2 v29.8h, v29.8h, v28.8h //
zip2 v31.8h, v31.8h, v28.8h //
zip2 v2.8h, v2.8h, v28.8h //
zip2 v1.8h, v1.8h, v28.8h //
shl v30.4s, v30.4s, 3
shl v29.4s, v29.4s, 3
shl v4.4s, v4.4s, 3
shl v2.4s, v2.4s, 3
shl v27.4s, v27.4s, 6
shl v31.4s, v31.4s, 6
shl v3.4s, v3.4s, 6
shl v1.4s, v1.4s, 6
sub v27.4s, v30.4s, v27.4s
sub v31.4s, v29.4s, v31.4s
sub v3.4s, v4.4s, v3.4s
sub v1.4s, v2.4s, v1.4s
stp q27, q31, [sp, 48]
stp q3, q1, [sp, 16]
bl test
ldp x29, x30, [sp], 80
ret
// Expect it to be optimized as:
lsl x3, x1, 1
mov x2, x0
stp x29, x30, [sp, -80]!
add x1, x2, x1, lsl 1
add x4, x1, x3
mov x29, sp
add x0, sp, 16
ldr q30, [x2, x3]
ldr q0, [x2]
ushll v31.4s, v30.4h, 3
ushll2 v30.4s, v30.8h, 6
ushll v29.4s, v0.4h, 3
ushll2 v0.4s, v0.8h, 6
sub v30.4s, v31.4s, v30.4s
sub v0.4s, v29.4s, v0.4s
str q0, [sp, 16]
ldr q0, [x1, x3]
str q30, [sp, 32]
ldr q30, [x4, x3]
ushll v29.4s, v0.4h, 3
ushll2 v0.4s, v0.8h, 6
ushll v31.4s, v30.4h, 3
ushll2 v30.4s, v30.8h, 6
sub v0.4s, v29.4s, v0.4s
sub v30.4s, v31.4s, v30.4s
stp q0, q30, [sp, 48]
bl test
ldp x29, x30, [sp], 80
ret
Based on the cost arising from splicing, we still need a way to select the most
profitable vector mode per slp node, rather than the single global vector mode
specified in vinfo.
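As a purely illustrative sketch, in the same pseudo-code style as the splicing
examples at the top of this report (none of these hooks exist in GCC today), the
idea would be to cost both layouts per node and keep the cheaper one:
// For each slp node, compare the spliced full-vector layout against keeping
// the node at its natural partial width, instead of inheriting the single
// vector mode recorded in vinfo.
for each node in slp_instance:
  cost_spliced  = cost(VEC_OP on full_vector_mode)
                + cost(permute) + cost(vextract_lo) + cost(vextract_hi)
  cost_isolated = 2 * cost(VEC_OP on half_vector_mode)
  node.vector_mode = cost_isolated < cost_spliced ? half_vector_mode : full_vector_mode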