From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <rguenther@suse.de>
Received: from mx2.suse.de (mx2.suse.de [195.135.220.15])
 by sourceware.org (Postfix) with ESMTPS id 4C0B83857C4F
 for <gcc-patches@gcc.gnu.org>; Wed,  4 Nov 2020 15:12:24 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 4C0B83857C4F
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org;
 spf=pass smtp.mailfrom=rguenther@suse.de
X-Virus-Scanned: by amavisd-new at test-mx.suse.de
Received: from relay2.suse.de (unknown [195.135.221.27])
 by mx2.suse.de (Postfix) with ESMTP id 1B63EB2C5;
 Wed,  4 Nov 2020 15:12:23 +0000 (UTC)
Date: Wed, 4 Nov 2020 16:12:22 +0100 (CET)
From: Richard Biener <rguenther@suse.de>
Sender: rguenther@c653.arch.suse.de
To: Tamar Christina <Tamar.Christina@arm.com>
cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, nd <nd@arm.com>, 
 "ook@ucw.cz" <ook@ucw.cz>
Subject: RE: [PATCH v2 10/18]middle-end simplify lane permutes which selects
 from loads from the same DR.
In-Reply-To: <VI1PR08MB5325858DC4D0E73005EF5953FFEF0@VI1PR08MB5325.eurprd08.prod.outlook.com>
Message-ID: <nycvar.YFH.7.76.2011041611320.10073@p653.nepu.fhfr.qr>
References: <patch-13720-tamar@arm.com>
 <nycvar.YFH.7.76.2011041413100.10073@p653.nepu.fhfr.qr>
 <VI1PR08MB5325858DC4D0E73005EF5953FFEF0@VI1PR08MB5325.eurprd08.prod.outlook.com>
User-Agent: Alpine 2.21 (LSU 202 2017-01-01)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-Spam-Status: No, score=-5.2 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS,
 RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 04 Nov 2020 15:12:26 -0000

On Wed, 4 Nov 2020, Tamar Christina wrote:

> Hi Richi,
> 
> > -----Original Message-----
> > From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> > Behalf Of Richard Biener
> > Sent: Wednesday, November 4, 2020 1:36 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; ook@ucw.cz
> > Subject: Re: [PATCH v2 10/18]middle-end simplify lane permutes which
> > selects from loads from the same DR.
> > 
> > On Tue, 3 Nov 2020, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > This change allows one to simplify lane permutes that select from
> > > multiple load leafs that load from the same DR group by promoting the
> > > VEC_PERM node into a load itself and pushing the lane permute into it as a
> > load permute.
> > >
> > > This saves us from having to calculate where to materialize a new load node.
> > > If the resulting loads are now unused they are freed and are removed
> > > from the graph.
> > >
> > > This allows us to handle cases where we would have generated:
> > >
> > > 	movi    v4.4s, 0
> > > 	adrp    x3, .LC0
> > > 	ldr     q5, [x3, #:lo12:.LC0]
> > > 	mov     x3, 0
> > > 	.p2align 3,,7
> > > .L2:
> > > 	mov     v0.16b, v4.16b
> > > 	mov     v3.16b, v4.16b
> > > 	ldr     q1, [x1, x3]
> > > 	ldr     q2, [x0, x3]
> > > 	fcmla   v0.4s, v2.4s, v1.4s, #0
> > > 	fcmla   v3.4s, v1.4s, v2.4s, #0
> > > 	fcmla   v0.4s, v2.4s, v1.4s, #270
> > > 	fcmla   v3.4s, v1.4s, v2.4s, #270
> > > 	mov     v1.16b, v3.16b
> > > 	tbl     v0.16b, {v0.16b - v1.16b}, v5.16b
> > > 	str     q0, [x2, x3]
> > > 	add     x3, x3, 16
> > > 	cmp     x3, 1600
> > > 	bne     .L2
> > > 	ret
> > >
> > > and instead generate
> > >
> > > 	mov     x3, 0
> > > 	.p2align 3,,7
> > > .L27:
> > > 	ldr     q0, [x2, x3]
> > > 	ldr     q1, [x0, x3]
> > > 	ldr     q2, [x1, x3]
> > > 	fcmla   v0.2d, v1.2d, v2.2d, #0
> > > 	fcmla   v0.2d, v1.2d, v2.2d, #270
> > > 	str     q0, [x2, x3]
> > > 	add     x3, x3, 16
> > > 	cmp     x3, 512
> > > 	bne     .L27
> > > 	ret
> > >
> > > This runs as a pre step such that permute simplification can still
> > > inspect this permute is needed
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > > Tests are included as part of the final patch as they need the SLP
> > > pattern matcher to insert permutes in between.
> > >
> > > Ok for master?
> > 
> > So I think this is too specialized for the general issue that we're doing a bad
> > job in CSEing the load part of different permutes of the same group.  I've
> > played with fixing this half a year ago (again) in multiple general ways but
> > they all caused some regressions.
> > 
> > So you're now adding some heuristics as to when to anticipate "CSE" (or
> > merging with followup permutes).
> > 
> > To quickly recap what I did consider two loads (V2DF) one { a[0], a[1] } and
> > the other { a[1], a[0] }.  They currently are two SLP nodes and one with a
> > load_permutation.
> > My original attempts focused on trying to get rid of load_permutation in
> > favor of lane_permute nodes and thus during SLP discovery I turned the
> > second into { a[0], a[1] } (magically unified with the other load) and a
> > followup lane-permute node.
> > 
> > So for your case you have IIUC { a[0], a[0] } and { a[1], a[1] } which eventually
> > will (due to patterns) be lane-permuted into { a[0], a[1] }, right?  So
> > generalizing this as a single { a[0], a[1] } plus two lane-permute nodes  { 0, 0 }
> > and { 1, 1 } early would solve the issue as well?
> 
> Correct, I did wonder why it was generating two different nodes instead of a lane
> permute but didn't pay much attention that it was just a short coming.
> 
> > Now, in general it might be
> > more profitable to generate the { a[0], a[0] } and { a[1], a[1] } via scalar-load-
> > and-splat rather than vector load and permute so we have to be careful to
> > not over-optimize here or be prepared to do the reverse transform.
> 
> This in principle can be done in optimize_slp then right? Since it would do
> a lot of the same work already and find the materialization points. 
> 
> > 
> > The patch itself is a bit ugly since it modifies the SLP graph when we already
> > produced the graphds graph so I would do any of this before.  I did consider
> > gathering all loads nodes loading from a group and then trying to apply some
> > heuristic to alter the SLP graph so it can be better optimized.  In fact when we
> > want to generate the same code as the non-SLP interleaving scheme does
> > we do have to look at those since we have to unify loads there.
> > 
> 
> Yes.. I will concede the patch isn't my finest work.. I also don't like the fact that I
> had to keep leafs in tact less I break things later. But wanted feedback :) 
> 
> > I'd put this after vect_slp_build_vertices but before the new_graph call -
> > altering 'vertices' / 'leafs' should be more easily possible and the 'leafs' array
> > contains all loads already (vect_slp_build_vertices could be massaged to
> > provide a map from DR_GROUP_FIRST_ELEMENT to slp_tree, giving us the
> > meta we want).
> > 
> > That said, I'd like to see something more forward-looking rather than the ad-
> > hoc special-casing of what you run into with the pattern matching.
> > 
> 
> Yeah, I like your suggestion about doing it at build time and CSEing early, but
> don't think I can get that work in a week given that you've already tried multiple times :)
> Happy to give it a go next stage-1 opening though.
> 
> > In case we want to still go with the special-casing it should IMHO be done in a
> > pre-order walk simply looking for lane permute nodes with children that all
> > load from the same group performing what you do before any of the
> > vertices/graph stuff is built.  That's probably easiest at this point and it can be
> > done when then bst_map is still around so you can properly CSE the new
> > load you build.
> 
> That's fair enough. I do think I need a temporary (not terrible) workaround...This would
> then need to be somewhere in vect_analyze_slp. Would you prefer I do it during the
> construction of the instance of afterwards?

Well, right after pattern recog processed all instances - the issue only
pops up due to pattern recog, right?

> Regards,
> Tamar
> 
> > 
> > Thanks,
> > Richard.
> > 
> > 
> > 
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > 	* tree-vect-slp.c (vect_optimize_slp): Promote permutes.
> > >
> > >
> > 
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > Nuernberg, Germany; GF: Felix Imend
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imend