public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations
From: fxue at os dot amperecomputing.com @ 2023-12-21  8:22 UTC (permalink / raw)
To: gcc-bugs
(0 siblings, 7 replies; 8+ messages in thread)

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

            Bug ID: 113104
           Summary: Suboptimal loop-based slp node splicing across
                    iterations
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Given a partial vector-sized SLP node in a loop, code generation utilizes
inter-iteration parallelism to achieve full vectorization by splicing the
defs of the node from multiple iterations into one vector. This strategy is
not always good, and could be refined in some situations. Specifically, we
had better not splice a node if it participates in a full-vector-sized
operation; otherwise, a permute and vextracts that are really unneeded would
be introduced.

Suppose the target vector size is 128 bits, and an SLP node is mapped to
VEC_OP in an iteration. Depending on whether the backend supports LO/HI
versions of the operation, there are two kinds of code sequences for
splicing.

// Isolated 2 iterations
res_v128_I0 = VEC_OP(opnd_v64_I0, ...)   // iteration #0
res_v128_I1 = VEC_OP(opnd_v64_I1, ...)   // iteration #1

// Spliced (1)
opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 }        // extra permute
opnd_v64_lo = [vec_unpack_lo_expr] opnd_v128_I0_I1;   // extra vextract
opnd_v64_hi = [vec_unpack_hi_expr] opnd_v128_I0_I1;   // extra vextract
res_v128_I0 = VEC_OP(opnd_v64_lo, ...)
res_v128_I1 = VEC_OP(opnd_v64_hi, ...)

// Spliced (2)
opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 }  // extra permute
res_v128_I0 = VEC_OP_LO(opnd_v128_i0_i1, ...)   // similar or same as VEC_OP
res_v128_I1 = VEC_OP_HI(opnd_v128_i0_i1, ...)   // similar or same as VEC_OP

Sometimes, such permute and vextract instructions can be optimized away by
backend passes, but sometimes they cannot. Here is a case on aarch64.

int test(unsigned array[4][4]);

int foo(unsigned short *a, unsigned long n)
{
  unsigned array[4][4];

  for (unsigned i = 0; i < 4; i++, a += n) {
    array[i][0] = a[0] << 6;
    array[i][1] = a[1] << 6;
    array[i][2] = a[2] << 6;
    array[i][3] = a[3] << 6;
  }
  return test(array);
}

// Current code generation
        mov     x2, x0
        stp     x29, x30, [sp, -80]!
        add     x3, x2, x1, lsl 1
        lsl     x1, x1, 1
        mov     x29, sp
        add     x4, x3, x1
        ldr     d0, [x2]
        movi    v30.4s, 0
        add     x0, sp, 16
        ldr     d31, [x2, x1]
        ldr     d29, [x3, x1]
        ldr     d28, [x4, x1]
        ins     v0.d[1], v31.d[0]       //
        ins     v29.d[1], v28.d[0]      //
        zip1    v1.8h, v0.8h, v30.8h    // superfluous
        zip2    v0.8h, v0.8h, v30.8h    //
        zip1    v31.8h, v29.8h, v30.8h  //
        zip2    v29.8h, v29.8h, v30.8h  //
        shl     v1.4s, v1.4s, 6
        shl     v0.4s, v0.4s, 6
        shl     v31.4s, v31.4s, 6
        shl     v29.4s, v29.4s, 6
        stp     q1, q0, [sp, 16]
        stp     q31, q29, [sp, 48]
        bl      test
        ldp     x29, x30, [sp], 80
        ret

// May be optimized to:
        stp     x29, x30, [sp, -80]!
        mov     x29, sp
        mov     x2, x0
        add     x0, sp, 16
        lsl     x3, x1, 1
        add     x1, x2, x1, lsl 1
        add     x4, x1, x3
        ldr     d31, [x2, x3]
        ushll   v31.4s, v31.4h, 6
        ldr     d30, [x2]
        ushll   v30.4s, v30.4h, 6
        str     q30, [sp, 16]
        ldr     d30, [x1, x3]
        ushll   v30.4s, v30.4h, 6
        str     q31, [sp, 32]
        ldr     d31, [x4, x3]
        ushll   v31.4s, v31.4h, 6
        stp     q30, q31, [sp, 48]
        bl      test
        ldp     x29, x30, [sp], 80
        ret

^ permalink raw reply	[flat|nested] 8+ messages in thread
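As an illustration only (not GCC internals), the two lowerings above can be modelled in scalar C. Here VEC_OP is taken to be the testcase's widening left shift (u16 lanes widened to u32, then shifted by 6), and every function and variable name below is invented for the sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Isolated: each iteration widens and shifts its own 4 x u16 operand.  */
static void vec_op_isolated(const uint16_t opnd[4], uint32_t res[4])
{
  for (int lane = 0; lane < 4; lane++)
    res[lane] = (uint32_t)opnd[lane] << 6;
}

/* Spliced: concatenate two 64-bit halves into one 128-bit vector
   (the extra permute), then unpack the lo/hi halves (the extra
   vextracts) before applying the same operation to each half.  */
static void vec_op_spliced(const uint16_t i0[4], const uint16_t i1[4],
                           uint32_t res0[4], uint32_t res1[4])
{
  uint16_t spliced[8];                 /* { opnd_v64_I0, opnd_v64_I1 } */
  for (int lane = 0; lane < 4; lane++) {
    spliced[lane] = i0[lane];
    spliced[lane + 4] = i1[lane];
  }
  for (int lane = 0; lane < 4; lane++) {
    res0[lane] = (uint32_t)spliced[lane] << 6;     /* unpack_lo + shift */
    res1[lane] = (uint32_t)spliced[lane + 4] << 6; /* unpack_hi + shift */
  }
}
```

Both paths compute identical results, which is the point of the report: the permute and vextracts are pure data movement with no semantic benefit when the operation already consumes a full vector.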
* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
From: rguenth at gcc dot gnu.org @ 2023-12-21  9:07 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
See my proposal on the mailing list to lift the restriction of sticking to a
single vector size; I think this is another example showing this. If you use
BB-level vectorization by disabling loop vectorization but not SLP
vectorization, the code should improve?
* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
From: fxue at os dot amperecomputing.com @ 2023-12-21  9:33 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

--- Comment #2 from Feng Xue <fxue at os dot amperecomputing.com> ---
(In reply to Richard Biener from comment #1)
> See my proposal on the mailing list to lift the restriction of sticking to a
> single vector size, I think this is another example showing this.  If you
> use BB level vectorization by disabling loop vectorization but not SLP
> vectorization the code should improve?

Yes, the loop is fully unrolled, and BB SLP would improve the code.

I could not find the proposal; could you share a link? Thanks.
* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
From: rguenther at suse dot de @ 2023-12-21  9:41 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 21 Dec 2023, fxue at os dot amperecomputing.com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104
>
> --- Comment #2 from Feng Xue <fxue at os dot amperecomputing.com> ---
> (In reply to Richard Biener from comment #1)
> > See my proposal on the mailing list to lift the restriction of sticking to a
> > single vector size, I think this is another example showing this.  If you
> > use BB level vectorization by disabling loop vectorization but not SLP
> > vectorization the code should improve?
>
> Yes, the loop is fully unrolled, and BB SLP would.

I suspect even when the loop isn't unrolled (just increase the iteration
count) the code would improve.

> I could not find the proposal, would you share me a link? Thanks

https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640476.html
* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
From: rsandifo at gcc dot gnu.org @ 2023-12-30 12:35 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Richard Sandiford <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2023-12-30
     Ever confirmed|0                           |1
                 CC|                            |rsandifo at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org|rsandifo at gcc dot gnu.org

--- Comment #4 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
FWIW, we do get the desired code with -march=armv8-a+sve (even though the
test doesn't use SVE). This is because of:

  /* Consider enabling VECT_COMPARE_COSTS for SVE, both so that we
     can compare SVE against Advanced SIMD and so that we can compare
     multiple SVE vectorization approaches against each other.  There's
     not really any point doing this for Advanced SIMD only, since the
     first mode that works should always be the best.  */
  if (TARGET_SVE && aarch64_sve_compare_costs)
    flags |= VECT_COMPARE_COSTS;

The testcase in this PR is a counterexample to the claim in the final
sentence. I think the comment might predate significant support for
mixed-sized Advanced SIMD vectorisation.

If we enable SVE (or uncomment the "if" line), the costs are 13 units per
vector iteration for 128-bit vectors and 4 units per vector iteration for
64-bit vectors (so 8 units per 128 bits on a parity basis).
The 64-bit version is therefore seen as significantly cheaper and is chosen
ahead of the 128-bit version.

I think this PR is enough proof that we should enable VECT_COMPARE_COSTS
even without SVE. Assigning to myself for that.
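The parity normalisation in the comment can be spelled out explicitly: a 64-bit vector iteration covers only half as much data as a 128-bit one, so its per-iteration cost must be doubled before comparing. The unit figures (13 and 4) come from the comment above; the helper function is purely illustrative:

```c
#include <assert.h>

/* Cost normalised to 128 bits of data processed: a mode that needs
   `iters_per_128_bits` vector iterations to cover 128 bits contributes
   that many copies of its per-iteration cost.  */
static unsigned cost_per_128_bits(unsigned cost_per_vector_iter,
                                  unsigned iters_per_128_bits)
{
  return cost_per_vector_iter * iters_per_128_bits;
}
```

With these figures, the 64-bit mode costs 8 units per 128 bits against 13 for the 128-bit mode, which is why VECT_COMPARE_COSTS picks it.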
* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
From: cvs-commit at gcc dot gnu.org @ 2024-01-05 16:25 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

--- Comment #5 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The trunk branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:7328faf89e9b4953baaff10e18262c70fbd3e578

commit r14-6961-g7328faf89e9b4953baaff10e18262c70fbd3e578
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Fri Jan 5 16:25:16 2024 +0000

    aarch64: Extend VECT_COMPARE_COSTS to !SVE [PR113104]

    When SVE is enabled, we try vectorising with multiple different SVE and
    Advanced SIMD approaches and use the cost model to pick the best one.
    Until now, we've not done that for Advanced SIMD, since "the first mode
    that works should always be the best".

    The testcase is a counterexample.  Each iteration of the scalar loop
    vectorises naturally with 64-bit input vectors and 128-bit output
    vectors.  We do try that for SVE, and choose it as the best approach.
    But the first approach we try is instead to use:

    - a vectorisation factor of 2
    - 1 128-bit vector for the inputs
    - 2 128-bit vectors for the outputs

    But since the stride is variable, the cost of marshalling the input
    vector from two iterations outweighs the benefit of doing two
    iterations at once.

    This patch therefore generalises aarch64-sve-compare-costs to
    aarch64-vect-compare-costs and applies it to non-SVE compilations.
    gcc/
            PR target/113104
            * doc/invoke.texi (aarch64-sve-compare-costs): Replace with...
            (aarch64-vect-compare-costs): ...this.
            * config/aarch64/aarch64.opt (-param=aarch64-sve-compare-costs=):
            Replace with...
            (-param=aarch64-vect-compare-costs=): ...this new param.
            * config/aarch64/aarch64.cc (aarch64_override_options_internal):
            Don't disable it when vectorizing for Advanced SIMD only.
            (aarch64_autovectorize_vector_modes): Apply VECT_COMPARE_COSTS
            whenever aarch64_vect_compare_costs is true.

    gcc/testsuite/
            PR target/113104
            * gcc.target/aarch64/pr113104.c: New test.
            * gcc.target/aarch64/sve/cond_arith_1.c: Update for new parameter
            names.
            * gcc.target/aarch64/sve/cond_arith_1_run.c: Likewise.
            * gcc.target/aarch64/sve/cond_arith_3.c: Likewise.
            * gcc.target/aarch64/sve/cond_arith_3_run.c: Likewise.
            * gcc.target/aarch64/sve/gather_load_6.c: Likewise.
            * gcc.target/aarch64/sve/gather_load_7.c: Likewise.
            * gcc.target/aarch64/sve/load_const_offset_2.c: Likewise.
            * gcc.target/aarch64/sve/load_const_offset_3.c: Likewise.
            * gcc.target/aarch64/sve/mask_gather_load_6.c: Likewise.
            * gcc.target/aarch64/sve/mask_gather_load_7.c: Likewise.
            * gcc.target/aarch64/sve/mask_load_slp_1.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_load_1.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_load_2.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_load_3.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_load_4.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_store_1.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_store_1_run.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_store_2.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_store_2_run.c: Likewise.
            * gcc.target/aarch64/sve/pack_1.c: Likewise.
            * gcc.target/aarch64/sve/reduc_4.c: Likewise.
            * gcc.target/aarch64/sve/scatter_store_6.c: Likewise.
            * gcc.target/aarch64/sve/scatter_store_7.c: Likewise.
            * gcc.target/aarch64/sve/strided_load_3.c: Likewise.
            * gcc.target/aarch64/sve/strided_store_3.c: Likewise.
            * gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Likewise.
            * gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise.
            * gcc.target/aarch64/sve/unpack_signed_1.c: Likewise.
            * gcc.target/aarch64/sve/unpack_unsigned_1.c: Likewise.
            * gcc.target/aarch64/sve/unpack_unsigned_1_run.c: Likewise.
            * gcc.target/aarch64/sve/vcond_11.c: Likewise.
            * gcc.target/aarch64/sve/vcond_11_run.c: Likewise.
* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
From: rsandifo at gcc dot gnu.org @ 2024-01-05 16:32 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Richard Sandiford <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #6 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
Fixed. Thanks for the report.
* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
From: fxue at os dot amperecomputing.com @ 2024-01-10  5:01 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Feng Xue <fxue at os dot amperecomputing.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|FIXED                       |---
             Status|RESOLVED                    |REOPENED

--- Comment #7 from Feng Xue <fxue at os dot amperecomputing.com> ---
Partial permutation (especially extract-low/high) can incur inefficient
splicing in some situations, even when the vector mode with the lowest cost
is used.

int test(unsigned array[4][4]);

int foo(unsigned short *a, unsigned long n)
{
  unsigned array[4][4];

  for (unsigned i = 0; i < 4; i++, a += n) {
    array[i][0] = (a[0] << 3) - (a[4] << 6);
    array[i][1] = (a[1] << 3) - (a[5] << 6);
    array[i][2] = (a[2] << 3) - (a[6] << 6);
    array[i][3] = (a[3] << 3) - (a[7] << 6);
  }
  return test(array);
}

// After vect-compare-cost fix
        mov     x2, x0
        stp     x29, x30, [sp, -80]!
        add     x3, x2, x1, lsl 1
        lsl     x1, x1, 1
        mov     x29, sp
        add     x4, x3, x1
        add     x0, sp, 16
        ldr     q5, [x2]
        ldr     q31, [x4, x1]
        ldr     q0, [x3, x1]
        ldr     q1, [x2, x1]
        movi    v28.4s, 0               //
        zip1    v29.2d, v0.2d, v31.2d   //
        zip1    v2.2d, v5.2d, v1.2d     //
        zip2    v31.2d, v0.2d, v31.2d   //
        zip2    v1.2d, v5.2d, v1.2d     //
        zip1    v30.8h, v29.8h, v28.8h  //
        zip1    v4.8h, v2.8h, v28.8h    // superfluous
        zip1    v27.8h, v31.8h, v28.8h  //
        zip1    v3.8h, v1.8h, v28.8h    //
        zip2    v29.8h, v29.8h, v28.8h  //
        zip2    v31.8h, v31.8h, v28.8h  //
        zip2    v2.8h, v2.8h, v28.8h    //
        zip2    v1.8h, v1.8h, v28.8h    //
        shl     v30.4s, v30.4s, 3
        shl     v29.4s, v29.4s, 3
        shl     v4.4s, v4.4s, 3
        shl     v2.4s, v2.4s, 3
        shl     v27.4s, v27.4s, 6
        shl     v31.4s, v31.4s, 6
        shl     v3.4s, v3.4s, 6
        shl     v1.4s, v1.4s, 6
        sub     v27.4s, v30.4s, v27.4s
        sub     v31.4s, v29.4s, v31.4s
        sub     v3.4s, v4.4s, v3.4s
        sub     v1.4s, v2.4s, v1.4s
        stp     q27, q31, [sp, 48]
        stp     q3, q1, [sp, 16]
        bl      test
        ldp     x29, x30, [sp], 80
        ret

// Expect it to be optimized as:
        lsl     x3, x1, 1
        mov     x2, x0
        stp     x29, x30, [sp, -80]!
        add     x1, x2, x1, lsl 1
        add     x4, x1, x3
        mov     x29, sp
        add     x0, sp, 16
        ldr     q30, [x2, x3]
        ldr     q0, [x2]
        ushll   v31.4s, v30.4h, 3
        ushll2  v30.4s, v30.8h, 6
        ushll   v29.4s, v0.4h, 3
        ushll2  v0.4s, v0.8h, 6
        sub     v30.4s, v31.4s, v30.4s
        sub     v0.4s, v29.4s, v0.4s
        str     q0, [sp, 16]
        ldr     q0, [x1, x3]
        str     q30, [sp, 32]
        ldr     q30, [x4, x3]
        ushll   v29.4s, v0.4h, 3
        ushll2  v0.4s, v0.8h, 6
        ushll   v31.4s, v30.4h, 3
        ushll2  v30.4s, v30.8h, 6
        sub     v0.4s, v29.4s, v0.4s
        sub     v30.4s, v31.4s, v30.4s
        stp     q0, q30, [sp, 48]
        bl      test
        ldp     x29, x30, [sp], 80
        ret

Based on the cost arising from splicing, we still need a way to select the
most profitable vector mode per SLP node, rather than the single global
vector mode specified in the vinfo.
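The expected sequence above works because each row of the new testcase consumes one contiguous 8 x u16 load: its low half feeds the << 3 term and its high half feeds the << 6 term, so a per-row ushll/ushll2 pair suffices and no cross-row zips are needed. A scalar model (illustrative only; all names invented) of that per-row lowering:

```c
#include <assert.h>
#include <stdint.h>

/* Reference: one row of the reopened testcase,
   array[i][j] = (a[j] << 3) - (a[j + 4] << 6).  */
static void row_ref(const uint16_t a[8], uint32_t out[4])
{
  for (int j = 0; j < 4; j++)
    out[j] = ((uint32_t)a[j] << 3) - ((uint32_t)a[j + 4] << 6);
}

/* Lowered model: widen-shift the low half (ushll), widen-shift the
   high half (ushll2), then subtract -- everything stays within the
   single 128-bit row load, with no cross-row permutes.  */
static void row_lowered(const uint16_t a[8], uint32_t out[4])
{
  uint32_t lo[4], hi[4];
  for (int j = 0; j < 4; j++) {
    lo[j] = (uint32_t)a[j] << 3;      /* ushll  vN.4s, vM.4h, 3 */
    hi[j] = (uint32_t)a[j + 4] << 6;  /* ushll2 vN.4s, vM.8h, 6 */
  }
  for (int j = 0; j < 4; j++)
    out[j] = lo[j] - hi[j];           /* sub vN.4s, ...          */
}
```

Both functions compute identical rows, lane for lane, which is why the zip-heavy spliced form carries only overhead here.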
end of thread, other threads:[~2024-01-10  5:01 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-21  8:22 [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations fxue at os dot amperecomputing.com
2023-12-21  9:07 ` [Bug tree-optimization/113104] " rguenth at gcc dot gnu.org
2023-12-21  9:33 ` fxue at os dot amperecomputing.com
2023-12-21  9:41 ` rguenther at suse dot de
2023-12-30 12:35 ` rsandifo at gcc dot gnu.org
2024-01-05 16:25 ` cvs-commit at gcc dot gnu.org
2024-01-05 16:32 ` rsandifo at gcc dot gnu.org
2024-01-10  5:01 ` fxue at os dot amperecomputing.com