From: Richard Sandiford
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] RFC: Extend SLP permutation optimisations
Date: Thu, 04 Aug 2022 10:06:19 +0100
In-Reply-To: (Richard Biener's message of "Wed, 3 Aug 2022 11:23:37 +0000 (UTC)")

Richard Biener writes:
> On Tue, 2 Aug 2022, Richard Sandiford wrote:
>
>> Currently SLP tries to force permute operations "down" the graph
>> from loads in the hope of reducing the total number of permutes
>> needed or (in the best case) removing the need for the permutes
>> entirely. This patch tries to extend it as follows:
>>
>> - Allow loads to take a different permutation from the one they
>>   started with, rather than choosing between "original permutation"
>>   and "no permutation".
>>
>> - Allow changes in both directions, if the target supports the
>>   reverse permute operation.
>>
>> - Treat the placement of permute operations as a two-way dataflow
>>   problem: after propagating information from leaves to roots (as now),
>>   propagate information back up the graph.
>>
>> - Take execution frequency into account when optimising for speed,
>>   so that (for example) permutes inside loops have a higher cost
>>   than permutes outside loops.
>>
>> - Try to reduce the total number of permutes when optimising for
>>   size, even if that increases the number of permutes on a given
>>   execution path.
>>
>> See the big block comment above vect_optimize_slp_pass for
>> a detailed description.
>>
>> A while back I posted a patch to extend the existing optimisation
>> to non-consecutive loads. This patch doesn't include that change
>> (although it's a possible future addition).
>>
>> The original motivation for doing this was to add a framework that would
>> allow other layout differences in future. The two main ones are:
>>
>> - Make it easier to represent predicated operations, including
>>   predicated operations with gaps. E.g.:
>>
>>     a[0] += 1;
>>     a[1] += 1;
>>     a[3] += 1;
>>
>>   could be a single load/add/store for SVE. We could handle this
>>   by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
>>   (depending on what's being counted). We might need to move
>>   elements between lanes at various points, like with permutes.
>>
>>   (This would first mean adding support for stores with gaps.)
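>>
>>   To make this concrete, here is a rough ACLE sketch (not part of the
>>   patch -- the function name and the way the predicate is built are
>>   just for illustration) of the kind of single predicated
>>   load/add/store the example could become:
>>
>>     /* Assumes SVE is enabled, e.g. -march=armv8.2-a+sve.  */
>>     #include <arm_sve.h>
>>     #include <stdint.h>
>>
>>     void add_with_gap(int32_t *a)
>>     {
>>       /* Lanes { 0, 1, _, 3 }: the dupq pattern repeats every 128 bits,
>>          so mask it down to the first four elements.  */
>>       svbool_t lanes = svdupq_n_b32(1, 1, 0, 1);
>>       svbool_t pg = svand_b_z(svptrue_b32(), lanes,
>>                               svptrue_pat_b32(SV_VL4));
>>       svint32_t v = svld1_s32(pg, a); /* single load with a gap */
>>       v = svadd_n_s32_m(pg, v, 1);    /* single predicated add */
>>       svst1_s32(pg, a, v);            /* single store with a gap */
>>     }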
>>
>> - Make it easier to switch between an even/odd and unpermuted layout
>>   when switching between wide and narrow elements. E.g. if a widening
>>   operation produces an even vector and an odd vector, we should try
>>   to keep operations on the wide elements in that order rather than
>>   force them to be permuted back "in order".
>>
>> To give some examples of what the current patch does:
>>
>> int f1(int *__restrict a, int *__restrict b, int *__restrict c,
>>        int *__restrict d)
>> {
>>   a[0] = (b[1] << c[3]) - d[1];
>>   a[1] = (b[0] << c[2]) - d[0];
>>   a[2] = (b[3] << c[1]) - d[3];
>>   a[3] = (b[2] << c[0]) - d[2];
>> }
>>
>> continues to produce the same code as before when optimising for
>> speed: b, c and d are permuted at load time. But when optimising
>> for size we instead permute c into the same order as b+d and then
>> permute the result of the arithmetic into the same order as a:
>>
>>         ldr     q1, [x2]
>>         ldr     q0, [x1]
>>         ext     v1.16b, v1.16b, v1.16b, #8   // <------
>>         sshl    v0.4s, v0.4s, v1.4s
>>         ldr     q1, [x3]
>>         sub     v0.4s, v0.4s, v1.4s
>>         rev64   v0.4s, v0.4s                 // <------
>>         str     q0, [x0]
>>         ret
>>
>> The following function:
>>
>> int f2(int *__restrict a, int *__restrict b, int *__restrict c,
>>        int *__restrict d)
>> {
>>   a[0] = (b[3] << c[3]) - d[3];
>>   a[1] = (b[2] << c[2]) - d[2];
>>   a[2] = (b[1] << c[1]) - d[1];
>>   a[3] = (b[0] << c[0]) - d[0];
>> }
>>
>> continues to push the reverse down to just before the store,
>> like the current code does.
>>
>> In:
>>
>> int f3(int *__restrict a, int *__restrict b, int *__restrict c,
>>        int *__restrict d)
>> {
>>   for (int i = 0; i < 100; ++i)
>>     {
>>       a[0] = (a[0] + c[3]);
>>       a[1] = (a[1] + c[2]);
>>       a[2] = (a[2] + c[1]);
>>       a[3] = (a[3] + c[0]);
>>       c += 4;
>>     }
>> }
>>
>> the loads of a are hoisted and the stores of a are sunk, so that
>> only the load from c happens in the loop. When optimising for
>> speed, we prefer to have the loop operate on the reversed layout,
>> changing on entry and exit from the loop:
>>
>>         mov     x3, x0
>>         adrp    x0, .LC0
>>         add     x1, x2, 1600
>>         ldr     q2, [x0, #:lo12:.LC0]
>>         ldr     q0, [x3]
>>         mov     v1.16b, v0.16b
>>         tbl     v0.16b, {v0.16b - v1.16b}, v2.16b   // <--------
>>         .p2align 3,,7
>> .L6:
>>         ldr     q1, [x2], 16
>>         add     v0.4s, v0.4s, v1.4s
>>         cmp     x2, x1
>>         bne     .L6
>>         mov     v1.16b, v0.16b
>>         adrp    x0, .LC0
>>         ldr     q2, [x0, #:lo12:.LC0]
>>         tbl     v0.16b, {v0.16b - v1.16b}, v2.16b   // <--------
>>         str     q0, [x3]
>>         ret
>>
>> Similarly, for the very artificial testcase:
>>
>> int f4(int *__restrict a, int *__restrict b, int *__restrict c,
>>        int *__restrict d)
>> {
>>   int a0 = a[0];
>>   int a1 = a[1];
>>   int a2 = a[2];
>>   int a3 = a[3];
>>   for (int i = 0; i < 100; ++i)
>>     {
>>       a0 ^= c[0];
>>       a1 ^= c[1];
>>       a2 ^= c[2];
>>       a3 ^= c[3];
>>       c += 4;
>>       for (int j = 0; j < 100; ++j)
>>         {
>>           a0 += d[1];
>>           a1 += d[0];
>>           a2 += d[3];
>>           a3 += d[2];
>>           d += 4;
>>         }
>>       b[0] = a0;
>>       b[1] = a1;
>>       b[2] = a2;
>>       b[3] = a3;
>>       b += 4;
>>     }
>>   a[0] = a0;
>>   a[1] = a1;
>>   a[2] = a2;
>>   a[3] = a3;
>> }
>>
>> the a vector in the inner loop maintains the order { 1, 0, 3, 2 },
>> even though it's part of an SCC that includes the outer loop.
>> In other words, this is a motivating case for not assigning
>> permutes at SCC granularity. The code we get is:
>>
>>         ldr     q0, [x0]
>>         mov     x4, x1
>>         mov     x5, x0
>>         add     x1, x3, 1600
>>         add     x3, x4, 1600
>>         .p2align 3,,7
>> .L11:
>>         ldr     q1, [x2], 16
>>         sub     x0, x1, #1600
>>         eor     v0.16b, v1.16b, v0.16b
>>         rev64   v0.4s, v0.4s                 // <---
>>         .p2align 3,,7
>> .L10:
>>         ldr     q1, [x0], 16
>>         add     v0.4s, v0.4s, v1.4s
>>         cmp     x0, x1
>>         bne     .L10
>>         rev64   v0.4s, v0.4s                 // <---
>>         add     x1, x0, 1600
>>         str     q0, [x4], 16
>>         cmp     x3, x4
>>         bne     .L11
>>         str     q0, [x5]
>>         ret
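>>
>> For reference, the { 1, 0, 3, 2 } layout is just a pairwise swap of
>> lanes. Written as a generic GNU C shuffle (only an illustration,
>> not something the patch adds), it is:
>>
>>   typedef int v4si __attribute__ ((vector_size (16)));
>>
>>   v4si swap_pairs(v4si x)
>>   {
>>     return __builtin_shuffle(x, (v4si) { 1, 0, 3, 2 });
>>   }
>>
>> which AArch64 can do with a single rev64, as in the code above.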
>>
>> The patch is still incomplete, hence the FIXMEs, but it's at the stage
>> where it passes bootstrap & regtest on aarch64-linux-gnu. (There's one
>> regression in slp-11b.c, but I think it's arguably restoring the
>> expected load-lanes behaviour.) Does this look like a plausible
>> direction, before I go filling in the FIXME parts?
>
> Yes, it looks nice!

Thanks

> I wonder if you spent much thought on optimizing layouts for SLP
> patterns and/or how to integrate both?

Hadn't thought of that. But yeah, it looks like there's some overlap.
Perhaps the main extensions that would help SLP pattern matching are:

(1) Look for multiple loads from the same group. Represent them as
    a single load node followed by individual VEC_PERM_EXPR nodes.
    Try to optimise the placement of these new VEC_PERM_EXPRs.

(2) Support non-bijective permutes.

(3) Look for CSE opportunities that can be exposed by moving the
    permutes (not sure about this one -- maybe we wouldn't discover
    much that wouldn't also be discovered on scalar code).

If (1) moved the new VEC_PERM_EXPRs as far from the loads as possible
then complex mul could become a peephole pattern match. But moving
VEC_PERM_EXPRs as far away as possible isn't always the best thing
to do in other cases, so I think in practice we'd still need to track
multiple possibilities.

So yeah, it looks like integrating them would help, but I'm not sure
how much it would reduce the overall complexity.

Richard