public inbox for gcc-bugs@sourceware.org
From: "fxue at os dot amperecomputing.com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations
Date: Thu, 21 Dec 2023 08:22:15 +0000
Message-ID: <bug-113104-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

            Bug ID: 113104
           Summary: Suboptimal loop-based slp node splicing across iterations
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Given a partial-vector-sized slp node in a loop, code generation uses
inter-iteration parallelism to achieve full vectorization by splicing the defs
of the node from multiple iterations into one vector. This strategy is not
always good, and could be refined in some situations. Specifically, we had
better not splice a node if it participates in a full-vector-sized operation;
otherwise a permute and vextracts that are not actually needed would be
introduced.

Suppose the target vector size is 128 bits, and a slp node is mapped to VEC_OP
in an iteration. Depending on whether the backend supports LO/HI versions of
the operation, there are two kinds of code sequence for splicing.

  // Isolated 2 iterations
  res_v128_I0 = VEC_OP(opnd_v64_I0, ...)    // iteration #0
  res_v128_I1 = VEC_OP(opnd_v64_I1, ...)    // iteration #1

  // Spliced (1)
  opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 }           // extra permute
  opnd_v64_lo = [vec_unpack_lo_expr] opnd_v128_I0_I1;      // extra vextract
  opnd_v64_hi = [vec_unpack_hi_expr] opnd_v128_I0_I1;      // extra vextract
  res_v128_I0 = VEC_OP(opnd_v64_lo, ...)
  res_v128_I1 = VEC_OP(opnd_v64_hi, ...)

  // Spliced (2)
  opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 }           // extra permute
  res_v128_I0 = VEC_OP_LO(opnd_v128_I0_I1, ...)   // similar or same as VEC_OP
  res_v128_I1 = VEC_OP_HI(opnd_v128_I0_I1, ...)   // similar or same as VEC_OP

Sometimes such a permute and vextract can be optimized away by backend passes,
but sometimes they cannot. Here is a case on aarch64.

  int test(unsigned array[4][4]);

  int foo(unsigned short *a, unsigned long n)
  {
    unsigned array[4][4];

    for (unsigned i = 0; i < 4; i++, a += n)
      {
        array[i][0] = a[0] << 6;
        array[i][1] = a[1] << 6;
        array[i][2] = a[2] << 6;
        array[i][3] = a[3] << 6;
      }

    return test(array);
  }

  // Current code generation
        mov     x2, x0
        stp     x29, x30, [sp, -80]!
        add     x3, x2, x1, lsl 1
        lsl     x1, x1, 1
        mov     x29, sp
        add     x4, x3, x1
        ldr     d0, [x2]
        movi    v30.4s, 0
        add     x0, sp, 16
        ldr     d31, [x2, x1]
        ldr     d29, [x3, x1]
        ldr     d28, [x4, x1]
        ins     v0.d[1], v31.d[0]         //
        ins     v29.d[1], v28.d[0]        //
        zip1    v1.8h, v0.8h, v30.8h      // superfluous
        zip2    v0.8h, v0.8h, v30.8h      //
        zip1    v31.8h, v29.8h, v30.8h    //
        zip2    v29.8h, v29.8h, v30.8h    //
        shl     v1.4s, v1.4s, 6
        shl     v0.4s, v0.4s, 6
        shl     v31.4s, v31.4s, 6
        shl     v29.4s, v29.4s, 6
        stp     q1, q0, [sp, 16]
        stp     q31, q29, [sp, 48]
        bl      test
        ldp     x29, x30, [sp], 80
        ret

  // May be optimized to:
        stp     x29, x30, [sp, -80]!
        mov     x29, sp
        mov     x2, x0
        add     x0, sp, 16
        lsl     x3, x1, 1
        add     x1, x2, x1, lsl 1
        add     x4, x1, x3
        ldr     d31, [x2, x3]
        ushll   v31.4s, v31.4h, 6
        ldr     d30, [x2]
        ushll   v30.4s, v30.4h, 6
        str     q30, [sp, 16]
        ldr     d30, [x1, x3]
        ushll   v30.4s, v30.4h, 6
        str     q31, [sp, 32]
        ldr     d31, [x4, x3]
        ushll   v31.4s, v31.4h, 6
        stp     q30, q31, [sp, 48]
        bl      test
        ldp     x29, x30, [sp], 80
        ret
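
For readers who want to experiment with the two shapes discussed in the report, here is a minimal sketch in C using the GCC/Clang generic vector extensions. It is not taken from the report or from the vectorizer. The names rows_scalar, rows_isolated, rows_spliced and paths_agree are hypothetical, results are stored into an `out` parameter instead of being passed to test(), and `__builtin_convertvector` is assumed to be available (GCC 9 or later, or Clang). rows_isolated models the "isolated" form (one 64-bit load, widen, shift per iteration), while rows_spliced models "Spliced (1)" by first concatenating two iterations into a 128-bit vector and then extracting the halves again.

```c
#include <assert.h>
#include <string.h>

/* Generic vector types: 4 x u16 (64-bit), 8 x u16 and 4 x u32 (128-bit). */
typedef unsigned short v4hi __attribute__((vector_size(8)));
typedef unsigned short v8hi __attribute__((vector_size(16)));
typedef unsigned int   v4si __attribute__((vector_size(16)));

/* Scalar reference: the loop from the report, storing into `out`. */
static void rows_scalar(const unsigned short *a, unsigned long n,
                        unsigned out[4][4])
{
    for (unsigned i = 0; i < 4; i++, a += n)
        for (unsigned j = 0; j < 4; j++)
            out[i][j] = (unsigned)a[j] << 6;
}

/* Isolated form: each iteration widens and shifts its own 64-bit operand. */
static void rows_isolated(const unsigned short *a, unsigned long n,
                          unsigned out[4][4])
{
    const v4si six = {6, 6, 6, 6};
    for (unsigned i = 0; i < 4; i++, a += n) {
        v4hi row;
        memcpy(&row, a, sizeof row);                    /* 64-bit load */
        v4si wide = __builtin_convertvector(row, v4si) << six;
        memcpy(out[i], &wide, sizeof wide);             /* 128-bit store */
    }
}

/* Spliced form (1): two iterations are glued into one 128-bit operand,
   which is then split again before the full-vector operation. */
static void rows_spliced(const unsigned short *a, unsigned long n,
                         unsigned out[4][4])
{
    const v4si six = {6, 6, 6, 6};
    for (unsigned i = 0; i < 4; i += 2, a += 2 * n) {
        v8hi both;
        v4hi lo, hi;
        memcpy((char *)&both, a, 8);                    /* extra permute */
        memcpy((char *)&both + 8, a + n, 8);
        memcpy(&lo, (char *)&both, 8);                  /* extra vextract */
        memcpy(&hi, (char *)&both + 8, 8);              /* extra vextract */
        v4si wlo = __builtin_convertvector(lo, v4si) << six;
        v4si whi = __builtin_convertvector(hi, v4si) << six;
        memcpy(out[i], &wlo, sizeof wlo);
        memcpy(out[i + 1], &whi, sizeof whi);
    }
}

/* Returns 1 when all three forms produce identical results. */
static int paths_agree(const unsigned short *a, unsigned long n)
{
    unsigned ref[4][4], iso[4][4], spl[4][4];
    rows_scalar(a, n, ref);
    rows_isolated(a, n, iso);
    rows_spliced(a, n, spl);
    assert(sizeof ref == 4 * sizeof(v4si));
    return memcmp(ref, iso, sizeof ref) == 0
        && memcmp(ref, spl, sizeof ref) == 0;
}
```

All three functions compute the same values; the difference is only in the data movement each one expresses. Whether a given compiler turns rows_isolated into the ushll-based sequence shown above depends on the compiler version and flags, so it is worth checking the generated assembly rather than assuming it.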
Thread overview: 8+ messages
2023-12-21  8:22 fxue at os dot amperecomputing.com [this message]
2023-12-21  9:07 ` [Bug tree-optimization/113104] " rguenth at gcc dot gnu.org
2023-12-21  9:33 ` fxue at os dot amperecomputing.com
2023-12-21  9:41 ` rguenther at suse dot de
2023-12-30 12:35 ` rsandifo at gcc dot gnu.org
2024-01-05 16:25 ` cvs-commit at gcc dot gnu.org
2024-01-05 16:32 ` rsandifo at gcc dot gnu.org
2024-01-10  5:01 ` fxue at os dot amperecomputing.com