From: Richard Sandiford
To: Jeff Law via Gcc-patches
Subject: Re: [PATCH 6/6] Extend SLP permutation optimisations
Date: Tue, 30 Aug 2022 15:50:44 +0100

Jeff Law via Gcc-patches writes:
> On 8/25/2022 7:07 AM, Richard Sandiford via Gcc-patches wrote:
>> Currently SLP tries to force permute operations "down" the graph
>> from loads in the hope of reducing the total number of permutations
>> needed or (in the best case) removing the need for the permutations
>> entirely.  This patch tries to extend it as follows:
>>
>> - Allow loads to take a different permutation from the one they
>> started with, rather than choosing between "original permutation"
>> and "no permutation".
>>
>> - Allow changes in both directions, if the target supports the
>> reverse permutation.
>>
>> - Treat the placement of permutations as a two-way dataflow problem:
>> after propagating information from leaves to roots (as now), propagate
>> information back up the graph.
>>
>> - Take execution frequency into account when optimising for speed,
>> so that (for example) permutations inside loops have a higher
>> cost than permutations outside loops.
>>
>> - Try to reduce the total number of permutations when optimising for
>> size, even if that increases the number of permutations on a given
>> execution path.
>>
>> See the big block comment above vect_optimize_slp_pass for
>> a detailed description.
>>
>> The original motivation for doing this was to add a framework that would
>> allow other layout differences in future.  The two main ones are:
>>
>> - Make it easier to represent predicated operations, including
>> predicated operations with gaps.  E.g.:
>>
>>   a[0] += 1;
>>   a[1] += 1;
>>   a[3] += 1;
>>
>> could be a single load/add/store for SVE.  We could handle this
>> by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
>> (depending on what's being counted).  We might need to move
>> elements between lanes at various points, like with permutes.
>>
>> (This would first mean adding support for stores with gaps.)
>>
>> - Make it easier to switch between an even/odd and unpermuted layout
>> when switching between wide and narrow elements.  E.g. if a widening
>> operation produces an even vector and an odd vector, we should try
>> to keep operations on the wide elements in that order rather than
>> force them to be permuted back "in order".
> Very cool.  Richi and I discussed this a bit a year or so ago --
> basically noting that bi-directional movement is really the way to go
> and that the worst thing to do is push a permute down into the *middle*
> of a computation chain since that will tend to break FMA generation.
> Moving to the loads or stores or to another permute point ought to be
> the goal.

Hmm, I hadn't thought specifically about the case of permutes ending
up in the middle of a fusable operation.  The series doesn't address
that directly.  If we have:

  permute(a) * permute(b) + c

then the tendency will still be to convert that into:

  permute(a * b) + c

Damn.  Another case to think about ;-)  I've pushed the series for now
though (thanks to Richi for the reviews).

Richard
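
As an illustration of the predicated-operations example above, here is a
minimal C sketch using the SVE ACLE intrinsics.  The function name, the
intrinsic-level coding and the restriction to one four-element group are
assumptions made for illustration only; the patch itself targets the
autovectoriser, not intrinsic code.

  #include <stdbool.h>
  #include <stdint.h>
  #include <arm_sve.h>

  /* Bump a[0], a[1] and a[3] while leaving a[2] untouched, using a single
     predicated load/add/store.  The quadword predicate marks lanes 0, 1
     and 3 as active; svwhilelt keeps any lanes beyond the first group of
     four inactive on longer vectors.  */
  void add_one_with_gap (int32_t *a)
  {
    svbool_t group = svdupq_n_b32 (true, true, false, true);
    svbool_t pg = svand_b_z (svwhilelt_b32_s32 (0, 4), group, group);
    svint32_t v = svld1_s32 (pg, a);   /* predicated load   */
    v = svadd_n_s32_m (pg, v, 1);      /* predicated add    */
    svst1_s32 (pg, a, v);              /* predicated store  */
  }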
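
And for the permute(a) * permute(b) + c shape discussed at the end, a
hypothetical C loop (again only a sketch; the pair-swapped access pattern
and the names are assumed, not taken from the patch or its testsuite).
Both multiplicands are read in a permuted order while the addend is not,
so pushing the permute past the multiply would leave it sitting between
the multiply and the add, which is the FMA-unfriendly placement Jeff
describes.

  void fused_shape (double *restrict out, const double *restrict a,
                    const double *restrict b, const double *restrict c,
                    int n)
  {
    for (int i = 0; i + 1 < n; i += 2)
      {
        /* a and b are read in a pair-swapped (permuted) order; c is not.  */
        out[i]     = a[i + 1] * b[i + 1] + c[i];
        out[i + 1] = a[i]     * b[i]     + c[i + 1];
      }
  }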