From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx2.suse.de (mx2.suse.de [195.135.220.15]) by sourceware.org (Postfix) with ESMTPS id 4C0B83857C4F for ; Wed, 4 Nov 2020 15:12:24 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 4C0B83857C4F Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=suse.de Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=rguenther@suse.de X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 1B63EB2C5; Wed, 4 Nov 2020 15:12:23 +0000 (UTC) Date: Wed, 4 Nov 2020 16:12:22 +0100 (CET) From: Richard Biener Sender: rguenther@c653.arch.suse.de To: Tamar Christina cc: "gcc-patches@gcc.gnu.org" , nd , "ook@ucw.cz" Subject: RE: [PATCH v2 10/18]middle-end simplify lane permutes which selects from loads from the same DR. In-Reply-To: Message-ID: References: User-Agent: Alpine 2.21 (LSU 202 2017-01-01) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-Spam-Status: No, score=-5.2 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 04 Nov 2020 15:12:26 -0000 On Wed, 4 Nov 2020, Tamar Christina wrote: > Hi Richi, > > > -----Original Message----- > > From: rguenther@c653.arch.suse.de On > > Behalf Of Richard Biener > > Sent: Wednesday, November 4, 2020 1:36 PM > > To: Tamar Christina > > Cc: gcc-patches@gcc.gnu.org; nd ; ook@ucw.cz > > Subject: Re: [PATCH v2 10/18]middle-end simplify lane permutes which > > selects from loads from the same DR. > > > > On Tue, 3 Nov 2020, Tamar Christina wrote: > > > > > Hi All, > > > > > > This change allows one to simplify lane permutes that select from > > > multiple load leafs that load from the same DR group by promoting the > > > VEC_PERM node into a load itself and pushing the lane permute into it as a > > load permute. > > > > > > This saves us from having to calculate where to materialize a new load node. > > > If the resulting loads are now unused they are freed and are removed > > > from the graph. > > > > > > This allows us to handle cases where we would have generated: > > > > > > movi v4.4s, 0 > > > adrp x3, .LC0 > > > ldr q5, [x3, #:lo12:.LC0] > > > mov x3, 0 > > > .p2align 3,,7 > > > .L2: > > > mov v0.16b, v4.16b > > > mov v3.16b, v4.16b > > > ldr q1, [x1, x3] > > > ldr q2, [x0, x3] > > > fcmla v0.4s, v2.4s, v1.4s, #0 > > > fcmla v3.4s, v1.4s, v2.4s, #0 > > > fcmla v0.4s, v2.4s, v1.4s, #270 > > > fcmla v3.4s, v1.4s, v2.4s, #270 > > > mov v1.16b, v3.16b > > > tbl v0.16b, {v0.16b - v1.16b}, v5.16b > > > str q0, [x2, x3] > > > add x3, x3, 16 > > > cmp x3, 1600 > > > bne .L2 > > > ret > > > > > > and instead generate > > > > > > mov x3, 0 > > > .p2align 3,,7 > > > .L27: > > > ldr q0, [x2, x3] > > > ldr q1, [x0, x3] > > > ldr q2, [x1, x3] > > > fcmla v0.2d, v1.2d, v2.2d, #0 > > > fcmla v0.2d, v1.2d, v2.2d, #270 > > > str q0, [x2, x3] > > > add x3, x3, 16 > > > cmp x3, 512 > > > bne .L27 > > > ret > > > > > > This runs as a pre step such that permute simplification can still > > > inspect this permute is needed > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > > Tests are included as part of the final patch as they need the SLP > > > pattern matcher to insert permutes in between. > > > > > > Ok for master? > > > > So I think this is too specialized for the general issue that we're doing a bad > > job in CSEing the load part of different permutes of the same group. I've > > played with fixing this half a year ago (again) in multiple general ways but > > they all caused some regressions. > > > > So you're now adding some heuristics as to when to anticipate "CSE" (or > > merging with followup permutes). > > > > To quickly recap what I did consider two loads (V2DF) one { a[0], a[1] } and > > the other { a[1], a[0] }. They currently are two SLP nodes and one with a > > load_permutation. > > My original attempts focused on trying to get rid of load_permutation in > > favor of lane_permute nodes and thus during SLP discovery I turned the > > second into { a[0], a[1] } (magically unified with the other load) and a > > followup lane-permute node. > > > > So for your case you have IIUC { a[0], a[0] } and { a[1], a[1] } which eventually > > will (due to patterns) be lane-permuted into { a[0], a[1] }, right? So > > generalizing this as a single { a[0], a[1] } plus two lane-permute nodes { 0, 0 } > > and { 1, 1 } early would solve the issue as well? > > Correct, I did wonder why it was generating two different nodes instead of a lane > permute but didn't pay much attention that it was just a short coming. > > > Now, in general it might be > > more profitable to generate the { a[0], a[0] } and { a[1], a[1] } via scalar-load- > > and-splat rather than vector load and permute so we have to be careful to > > not over-optimize here or be prepared to do the reverse transform. > > This in principle can be done in optimize_slp then right? Since it would do > a lot of the same work already and find the materialization points. > > > > > The patch itself is a bit ugly since it modifies the SLP graph when we already > > produced the graphds graph so I would do any of this before. I did consider > > gathering all loads nodes loading from a group and then trying to apply some > > heuristic to alter the SLP graph so it can be better optimized. In fact when we > > want to generate the same code as the non-SLP interleaving scheme does > > we do have to look at those since we have to unify loads there. > > > > Yes.. I will concede the patch isn't my finest work.. I also don't like the fact that I > had to keep leafs in tact less I break things later. But wanted feedback :) > > > I'd put this after vect_slp_build_vertices but before the new_graph call - > > altering 'vertices' / 'leafs' should be more easily possible and the 'leafs' array > > contains all loads already (vect_slp_build_vertices could be massaged to > > provide a map from DR_GROUP_FIRST_ELEMENT to slp_tree, giving us the > > meta we want). > > > > That said, I'd like to see something more forward-looking rather than the ad- > > hoc special-casing of what you run into with the pattern matching. > > > > Yeah, I like your suggestion about doing it at build time and CSEing early, but > don't think I can get that work in a week given that you've already tried multiple times :) > Happy to give it a go next stage-1 opening though. > > > In case we want to still go with the special-casing it should IMHO be done in a > > pre-order walk simply looking for lane permute nodes with children that all > > load from the same group performing what you do before any of the > > vertices/graph stuff is built. That's probably easiest at this point and it can be > > done when then bst_map is still around so you can properly CSE the new > > load you build. > > That's fair enough. I do think I need a temporary (not terrible) workaround...This would > then need to be somewhere in vect_analyze_slp. Would you prefer I do it during the > construction of the instance of afterwards? Well, right after pattern recog processed all instances - the issue only pops up due to pattern recog, right? > Regards, > Tamar > > > > > Thanks, > > Richard. > > > > > > > > > Thanks, > > > Tamar > > > > > > gcc/ChangeLog: > > > > > > * tree-vect-slp.c (vect_optimize_slp): Promote permutes. > > > > > > > > > > -- > > Richard Biener > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 > > Nuernberg, Germany; GF: Felix Imend > -- Richard Biener SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imend