Date: Wed, 4 Nov 2020 14:35:47 +0100 (CET)
From: Richard Biener
To: Tamar Christina
cc: gcc-patches@gcc.gnu.org, nd@arm.com, ook@ucw.cz
Subject: Re: [PATCH v2 10/18]middle-end simplify lane permutes which selects from loads from the same DR.

On Tue, 3 Nov 2020, Tamar Christina wrote:

> Hi All,
>
> This change allows one to simplify lane permutes that select from multiple
> load leafs that load from the same DR group by promoting the VEC_PERM node
> into a load itself and pushing the lane permute into it as a load permute.
>
> This saves us from having to calculate where to materialize a new load
> node.
> If the resulting loads are now unused they are freed and are removed
> from the graph.
>
> This allows us to handle cases where we would have generated:
>
>         movi    v4.4s, 0
>         adrp    x3, .LC0
>         ldr     q5, [x3, #:lo12:.LC0]
>         mov     x3, 0
>         .p2align 3,,7
> .L2:
>         mov     v0.16b, v4.16b
>         mov     v3.16b, v4.16b
>         ldr     q1, [x1, x3]
>         ldr     q2, [x0, x3]
>         fcmla   v0.4s, v2.4s, v1.4s, #0
>         fcmla   v3.4s, v1.4s, v2.4s, #0
>         fcmla   v0.4s, v2.4s, v1.4s, #270
>         fcmla   v3.4s, v1.4s, v2.4s, #270
>         mov     v1.16b, v3.16b
>         tbl     v0.16b, {v0.16b - v1.16b}, v5.16b
>         str     q0, [x2, x3]
>         add     x3, x3, 16
>         cmp     x3, 1600
>         bne     .L2
>         ret
>
> and instead generate
>
>         mov     x3, 0
>         .p2align 3,,7
> .L27:
>         ldr     q0, [x2, x3]
>         ldr     q1, [x0, x3]
>         ldr     q2, [x1, x3]
>         fcmla   v0.2d, v1.2d, v2.2d, #0
>         fcmla   v0.2d, v1.2d, v2.2d, #270
>         str     q0, [x2, x3]
>         add     x3, x3, 16
>         cmp     x3, 512
>         bne     .L27
>         ret
>
> This runs as a pre step such that permute simplification can still
> inspect this permute is needed
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> Tests are included as part of the final patch as they need the SLP
> pattern matcher to insert permutes in between.
>
> Ok for master?

So I think this is too specialized for the general issue that we're
doing a bad job in CSEing the load part of different permutes of the
same group.  I've played with fixing this half a year ago (again) in
multiple general ways but they all caused some regressions.

So you're now adding some heuristics as to when to anticipate "CSE"
(or merging with followup permutes).

To quickly recap what I did: consider two loads (V2DF), one
{ a[0], a[1] } and the other { a[1], a[0] }.  They currently are two
SLP nodes, one with a load_permutation.  My original attempts focused
on trying to get rid of load_permutation in favor of lane_permute
nodes, and thus during SLP discovery I turned the second into
{ a[0], a[1] } (magically unified with the other load) and a followup
lane-permute node.
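To make that recap concrete, here is a sketch using GCC's generic vector
extensions (an illustration of the data flow only, not the internal SLP
representation; the function and variable names are invented):

```c
typedef double v2df __attribute__ ((vector_size (16)));
typedef long long v2di __attribute__ ((vector_size (16)));

/* Illustration: the two permutes of the same group share one load
   node { a[0], a[1] }; the swapped variant becomes a followup
   lane permute of that single CSEd load.  */
void
load_then_permute (const double *a, v2df *orig, v2df *swapped)
{
  v2df v = { a[0], a[1] };	/* the single, unified load */
  *orig = v;			/* identity: { a[0], a[1] } */
  *swapped = __builtin_shuffle (v, (v2di) { 1, 0 });	/* { a[1], a[0] } */
}
```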
So for your case you have, IIUC, { a[0], a[0] } and { a[1], a[1] },
which eventually will (due to patterns) be lane-permuted into
{ a[0], a[1] }, right?  So generalizing this as a single { a[0], a[1] }
plus two lane-permute nodes { 0, 0 } and { 1, 1 } early would solve the
issue as well?

Now, in general it might be more profitable to generate the
{ a[0], a[0] } and { a[1], a[1] } via scalar-load-and-splat rather than
vector load and permute, so we have to be careful not to over-optimize
here, or be prepared to do the reverse transform.

The patch itself is a bit ugly since it modifies the SLP graph when we
already produced the graphds graph, so I would do any of this before.

I did consider gathering all load nodes loading from a group and then
trying to apply some heuristic to alter the SLP graph so it can be
better optimized.  In fact, when we want to generate the same code as
the non-SLP interleaving scheme does, we do have to look at those since
we have to unify loads there.  I'd put this after
vect_slp_build_vertices but before the new_graph call - altering
'vertices' / 'leafs' should be more easily possible, and the 'leafs'
array contains all loads already (vect_slp_build_vertices could be
massaged to provide a map from DR_GROUP_FIRST_ELEMENT to slp_tree,
giving us the meta we want).

That said, I'd like to see something more forward-looking rather than
the ad-hoc special-casing of what you run into with the pattern
matching.

In case we want to still go with the special-casing, it should IMHO be
done in a pre-order walk simply looking for lane-permute nodes with
children that all load from the same group, performing what you do
before any of the vertices/graph stuff is built.  That's probably
easiest at this point, and it can be done when the bst_map is still
around so you can properly CSE the new load you build.

Thanks,
Richard.

> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* tree-vect-slp.c (vect_optimize_slp): Promote permutes.
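As an aside, the generalization discussed above - one shared load plus
splat-style lane permutes { 0, 0 } and { 1, 1 } - can also be sketched
with GCC generic vectors (again purely illustrative; names invented):

```c
typedef double v2df __attribute__ ((vector_size (16)));
typedef long long v2di __attribute__ ((vector_size (16)));

/* Sketch: one load from the DR group replaces the two splat-shaped
   loads { a[0], a[0] } and { a[1], a[1] } by adding lane permutes.
   Whether this beats scalar-load-and-splat is a cost question.  */
void
splats_via_lane_permute (const double *a, v2df *s0, v2df *s1)
{
  v2df v = { a[0], a[1] };	/* the single load node */
  *s0 = __builtin_shuffle (v, (v2di) { 0, 0 });	/* { a[0], a[0] } */
  *s1 = __builtin_shuffle (v, (v2di) { 1, 1 });	/* { a[1], a[1] } */
}
```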
--
Richard Biener
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer