From: Richard Sandiford
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] RFC: Extend SLP permutation optimisations
Date: Thu, 04 Aug 2022 10:06:19 +0100
In-Reply-To: (Richard Biener's message of "Wed, 3 Aug 2022 11:23:37 +0000 (UTC)")

Richard Biener writes:
> On Tue, 2 Aug 2022, Richard Sandiford wrote:
>
>> Currently SLP tries to force permute operations "down" the graph
>> from loads in the hope of reducing the total number of permutes
>> needed or (in the best case) removing the need for the permutes
>> entirely. This patch tries to extend it as follows:
>>
>> - Allow loads to take a different permutation from the one they
>>   started with, rather than choosing between "original permutation"
>>   and "no permutation".
>>
>> - Allow changes in both directions, if the target supports the
>>   reverse permute operation.
>>
>> - Treat the placement of permute operations as a two-way dataflow
>>   problem: after propagating information from leaves to roots (as now),
>>   propagate information back up the graph.
>>
>> - Take execution frequency into account when optimising for speed,
>>   so that (for example) permutes inside loops have a higher cost
>>   than permutes outside loops.
>>
>> - Try to reduce the total number of permutes when optimising for
>>   size, even if that increases the number of permutes on a given
>>   execution path.
>>
>> See the big block comment above vect_optimize_slp_pass for
>> a detailed description.
>>
>> A while back I posted a patch to extend the existing optimisation
>> to non-consecutive loads. This patch doesn't include that change
>> (although it's a possible future addition).
>>
>> The original motivation for doing this was to add a framework that would
>> allow other layout differences in future. The two main ones are:
>>
>> - Make it easier to represent predicated operations, including
>>   predicated operations with gaps. E.g.:
>>
>>     a[0] += 1;
>>     a[1] += 1;
>>     a[3] += 1;
>>
>>   could be a single load/add/store for SVE. We could handle this
>>   by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
>>   (depending on what's being counted). We might need to move
>>   elements between lanes at various points, like with permutes.
>>
>>   (This would first mean adding support for stores with gaps.)
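>>
>>   To make this concrete, here is a rough ACLE sketch (not part of the
>>   patch -- the function name and the way the predicate is built are
>>   just for illustration) of the kind of single predicated
>>   load/add/store the example could become:
>>
>>     /* Assumes SVE is enabled, e.g. -march=armv8.2-a+sve.  */
>>     #include <arm_sve.h>
>>     #include <stdint.h>
>>
>>     void add_with_gap(int32_t *a)
>>     {
>>       /* Lanes { 0, 1, _, 3 }: the dupq pattern repeats every 128 bits,
>>          so mask it down to the first four elements.  */
>>       svbool_t lanes = svdupq_n_b32(1, 1, 0, 1);
>>       svbool_t pg = svand_b_z(svptrue_b32(), lanes,
>>                               svptrue_pat_b32(SV_VL4));
>>       svint32_t v = svld1_s32(pg, a); /* single load with a gap */
>>       v = svadd_n_s32_m(pg, v, 1);    /* single predicated add */
>>       svst1_s32(pg, a, v);            /* single store with a gap */
>>     }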
>>
>> - Make it easier to switch between an even/odd and unpermuted layout
>>   when switching between wide and narrow elements. E.g. if a widening
>>   operation produces an even vector and an odd vector, we should try
>>   to keep operations on the wide elements in that order rather than
>>   force them to be permuted back "in order".
>>
>> To give some examples of what the current patch does:
>>
>> int f1(int *__restrict a, int *__restrict b, int *__restrict c,
>>        int *__restrict d)
>> {
>>   a[0] = (b[1] << c[3]) - d[1];
>>   a[1] = (b[0] << c[2]) - d[0];
>>   a[2] = (b[3] << c[1]) - d[3];
>>   a[3] = (b[2] << c[0]) - d[2];
>> }
>>
>> continues to produce the same code as before when optimising for
>> speed: b, c and d are permuted at load time. But when optimising
>> for size we instead permute c into the same order as b+d and then
>> permute the result of the arithmetic into the same order as a:
>>
>>         ldr     q1, [x2]
>>         ldr     q0, [x1]
>>         ext     v1.16b, v1.16b, v1.16b, #8   // <------
>>         sshl    v0.4s, v0.4s, v1.4s
>>         ldr     q1, [x3]
>>         sub     v0.4s, v0.4s, v1.4s
>>         rev64   v0.4s, v0.4s                 // <------
>>         str     q0, [x0]
>>         ret
>>
>> The following function:
>>
>> int f2(int *__restrict a, int *__restrict b, int *__restrict c,
>>        int *__restrict d)
>> {
>>   a[0] = (b[3] << c[3]) - d[3];
>>   a[1] = (b[2] << c[2]) - d[2];
>>   a[2] = (b[1] << c[1]) - d[1];
>>   a[3] = (b[0] << c[0]) - d[0];
>> }
>>
>> continues to push the reverse down to just before the store,
>> like the current code does.
>>
>> In:
>>
>> int f3(int *__restrict a, int *__restrict b, int *__restrict c,
>>        int *__restrict d)
>> {
>>   for (int i = 0; i < 100; ++i)
>>     {
>>       a[0] = (a[0] + c[3]);
>>       a[1] = (a[1] + c[2]);
>>       a[2] = (a[2] + c[1]);
>>       a[3] = (a[3] + c[0]);
>>       c += 4;
>>     }
>> }
>>
>> the loads of a are hoisted and the stores of a are sunk, so that
>> only the load from c happens in the loop. When optimising for
>> speed, we prefer to have the loop operate on the reversed layout,
>> changing on entry and exit from the loop:
>>
>>         mov     x3, x0
>>         adrp    x0, .LC0
>>         add     x1, x2, 1600
>>         ldr     q2, [x0, #:lo12:.LC0]
>>         ldr     q0, [x3]
>>         mov     v1.16b, v0.16b
>>         tbl     v0.16b, {v0.16b - v1.16b}, v2.16b   // <--------
>>         .p2align 3,,7
>> .L6:
>>         ldr     q1, [x2], 16
>>         add     v0.4s, v0.4s, v1.4s
>>         cmp     x2, x1
>>         bne     .L6
>>         mov     v1.16b, v0.16b
>>         adrp    x0, .LC0
>>         ldr     q2, [x0, #:lo12:.LC0]
>>         tbl     v0.16b, {v0.16b - v1.16b}, v2.16b   // <--------
>>         str     q0, [x3]
>>         ret
>>
>> Similarly, for the very artificial testcase:
>>
>> int f4(int *__restrict a, int *__restrict b, int *__restrict c,
>>        int *__restrict d)
>> {
>>   int a0 = a[0];
>>   int a1 = a[1];
>>   int a2 = a[2];
>>   int a3 = a[3];
>>   for (int i = 0; i < 100; ++i)
>>     {
>>       a0 ^= c[0];
>>       a1 ^= c[1];
>>       a2 ^= c[2];
>>       a3 ^= c[3];
>>       c += 4;
>>       for (int j = 0; j < 100; ++j)
>>         {
>>           a0 += d[1];
>>           a1 += d[0];
>>           a2 += d[3];
>>           a3 += d[2];
>>           d += 4;
>>         }
>>       b[0] = a0;
>>       b[1] = a1;
>>       b[2] = a2;
>>       b[3] = a3;
>>       b += 4;
>>     }
>>   a[0] = a0;
>>   a[1] = a1;
>>   a[2] = a2;
>>   a[3] = a3;
>> }
>>
>> the a vector in the inner loop maintains the order { 1, 0, 3, 2 },
>> even though it's part of an SCC that includes the outer loop.
>> In other words, this is a motivating case for not assigning
>> permutes at SCC granularity. The code we get is:
>>
>>         ldr     q0, [x0]
>>         mov     x4, x1
>>         mov     x5, x0
>>         add     x1, x3, 1600
>>         add     x3, x4, 1600
>>         .p2align 3,,7
>> .L11:
>>         ldr     q1, [x2], 16
>>         sub     x0, x1, #1600
>>         eor     v0.16b, v1.16b, v0.16b
>>         rev64   v0.4s, v0.4s                 // <---
>>         .p2align 3,,7
>> .L10:
>>         ldr     q1, [x0], 16
>>         add     v0.4s, v0.4s, v1.4s
>>         cmp     x0, x1
>>         bne     .L10
>>         rev64   v0.4s, v0.4s                 // <---
>>         add     x1, x0, 1600
>>         str     q0, [x4], 16
>>         cmp     x3, x4
>>         bne     .L11
>>         str     q0, [x5]
>>         ret
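>>
>> For reference, the { 1, 0, 3, 2 } layout is just a pairwise swap of
>> lanes. Written as a generic GNU C shuffle (only an illustration,
>> not something the patch adds), it is:
>>
>>   typedef int v4si __attribute__ ((vector_size (16)));
>>
>>   v4si swap_pairs(v4si x)
>>   {
>>     return __builtin_shuffle(x, (v4si) { 1, 0, 3, 2 });
>>   }
>>
>> which AArch64 can do with a single rev64, as in the code above.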
>>
>> The patch is still incomplete, hence the FIXMEs, but it's at the stage
>> where it passes bootstrap & regtest on aarch64-linux-gnu. (There's one
>> regression in slp-11b.c, but I think it's arguably restoring the
>> expected load-lanes behaviour.) Does this look like a plausible
>> direction, before I go filling in the FIXME parts?
>
> Yes, it looks nice!

Thanks

> I wonder if you spent much thought on optimizing layouts for SLP
> patterns and/or how to integrate both?

Hadn't thought of that. But yeah, it looks like there's some overlap.
Perhaps the main extensions that would help SLP pattern matching are:

(1) Look for multiple loads from the same group. Represent them as
    a single load node followed by individual VEC_PERM_EXPR nodes.
    Try to optimise the placement of these new VEC_PERM_EXPRs.

(2) Support non-bijective permutes.

(3) Look for CSE opportunities that can be exposed by moving the
    permutes (not sure about this one -- maybe we wouldn't discover
    much that wouldn't also be discovered on scalar code).

If (1) moved the new VEC_PERM_EXPRs as far from the loads as possible
then complex mul could become a peephole pattern match. But moving
VEC_PERM_EXPRs as far away as possible isn't always the best thing
to do in other cases, so I think in practice we'd still need to track
multiple possibilities.

So yeah, it looks like integrating them would help, but I'm not sure
how much it would reduce the overall complexity.

Richard