From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Biener <rguenther@suse.de>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
nd <nd@arm.com>, "ook@ucw.cz" <ook@ucw.cz>
Subject: RE: [PATCH v2 10/18]middle-end simplify lane permutes which selects from loads from the same DR.
Date: Wed, 4 Nov 2020 14:02:47 +0000 [thread overview]
Message-ID: <VI1PR08MB5325858DC4D0E73005EF5953FFEF0@VI1PR08MB5325.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <nycvar.YFH.7.76.2011041413100.10073@p653.nepu.fhfr.qr>
Hi Richi,
> -----Original Message-----
> From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> Behalf Of Richard Biener
> Sent: Wednesday, November 4, 2020 1:36 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; ook@ucw.cz
> Subject: Re: [PATCH v2 10/18]middle-end simplify lane permutes which
> selects from loads from the same DR.
>
> On Tue, 3 Nov 2020, Tamar Christina wrote:
>
> > Hi All,
> >
> > This change allows one to simplify lane permutes that select from
> > multiple load leafs that load from the same DR group by promoting the
> > VEC_PERM node into a load itself and pushing the lane permute into it as a
> > load permute.
> >
> > This saves us from having to calculate where to materialize a new load node.
> > If the resulting loads are now unused they are freed and are removed
> > from the graph.
> >
> > This allows us to handle cases where we would have generated:
> >
> > movi v4.4s, 0
> > adrp x3, .LC0
> > ldr q5, [x3, #:lo12:.LC0]
> > mov x3, 0
> > .p2align 3,,7
> > .L2:
> > mov v0.16b, v4.16b
> > mov v3.16b, v4.16b
> > ldr q1, [x1, x3]
> > ldr q2, [x0, x3]
> > fcmla v0.4s, v2.4s, v1.4s, #0
> > fcmla v3.4s, v1.4s, v2.4s, #0
> > fcmla v0.4s, v2.4s, v1.4s, #270
> > fcmla v3.4s, v1.4s, v2.4s, #270
> > mov v1.16b, v3.16b
> > tbl v0.16b, {v0.16b - v1.16b}, v5.16b
> > str q0, [x2, x3]
> > add x3, x3, 16
> > cmp x3, 1600
> > bne .L2
> > ret
> >
> > and instead generate
> >
> > mov x3, 0
> > .p2align 3,,7
> > .L27:
> > ldr q0, [x2, x3]
> > ldr q1, [x0, x3]
> > ldr q2, [x1, x3]
> > fcmla v0.2d, v1.2d, v2.2d, #0
> > fcmla v0.2d, v1.2d, v2.2d, #270
> > str q0, [x2, x3]
> > add x3, x3, 16
> > cmp x3, 512
> > bne .L27
> > ret
> >
> > This runs as a pre-step so that permute simplification can still
> > determine whether this permute is needed.
> >
> > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> > Tests are included as part of the final patch as they need the SLP
> > pattern matcher to insert permutes in between.
> >
> > Ok for master?
>
> So I think this is too specialized for the general issue that we're doing a bad
> job in CSEing the load part of different permutes of the same group. I've
> played with fixing this half a year ago (again) in multiple general ways but
> they all caused some regressions.
>
> So you're now adding some heuristics as to when to anticipate "CSE" (or
> merging with followup permutes).
>
> To quickly recap what I did consider two loads (V2DF) one { a[0], a[1] } and
> the other { a[1], a[0] }. They currently are two SLP nodes and one with a
> load_permutation.
> My original attempts focused on trying to get rid of load_permutation in
> favor of lane_permute nodes and thus during SLP discovery I turned the
> second into { a[0], a[1] } (magically unified with the other load) and a
> followup lane-permute node.
>
> So for your case you have IIUC { a[0], a[0] } and { a[1], a[1] } which eventually
> will (due to patterns) be lane-permuted into { a[0], a[1] }, right? So
> generalizing this as a single { a[0], a[1] } plus two lane-permute nodes { 0, 0 }
> and { 1, 1 } early would solve the issue as well?
Correct. I did wonder why it was generating two different nodes instead of a lane
permute, but didn't pay much attention to it, assuming it was just a shortcoming.
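To make that concrete, here is a small standalone sketch (my own illustration on plain C arrays, not vectorizer code) of the representation you describe: one shared, unpermuted load of the group feeding two lane-permute nodes.

```c
#include <assert.h>

/* Illustration only (not GCC data structures): a lane-permute node
   selects lanes from its child by index, so the two operands
   { a[0], a[0] } and { a[1], a[1] } can be expressed as one shared
   load { a[0], a[1] } plus permutes { 0, 0 } and { 1, 1 }.  */
static void
lane_permute (const double *src, const int *sel, double *dst, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[sel[i]];
}
```

For example, with `load = { a[0], a[1] }`, applying the selector `{ 0, 0 }` yields `{ a[0], a[0] }` and `{ 1, 1 }` yields `{ a[1], a[1] }`, so only one load node is needed.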
> Now, in general it might be
> more profitable to generate the { a[0], a[0] } and { a[1], a[1] } via scalar-load-
> and-splat rather than vector load and permute so we have to be careful to
> not over-optimize here or be prepared to do the reverse transform.
This could in principle be done in optimize_slp then, right? Since it already does
a lot of the same work and finds the materialization points.
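For the profitability point, the two lowerings in question can be sketched as follows (plain C with illustrative names of my own; real codegen would use a dup vs. a vector load plus a permute):

```c
/* Illustration only: two ways to materialize { a[0], a[0] }.  */

/* Scalar load and splat: one scalar load, broadcast to all lanes.  */
static void
splat_via_scalar (const double *a, double *v)
{
  double s = a[0];
  v[0] = s;
  v[1] = s;
}

/* Vector load and permute: load the whole group, then select
   lane 0 twice via the permute { 0, 0 }.  */
static void
splat_via_permute (const double *a, double *v)
{
  double load[2] = { a[0], a[1] };
  v[0] = load[0];
  v[1] = load[0];
}
```

Both produce the same value, so which is cheaper is purely a cost question, which is why over-optimizing one way without being able to do the reverse transform is risky.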
>
> The patch itself is a bit ugly since it modifies the SLP graph when we already
> produced the graphds graph so I would do any of this before. I did consider
> gathering all loads nodes loading from a group and then trying to apply some
> heuristic to alter the SLP graph so it can be better optimized. In fact when we
> want to generate the same code as the non-SLP interleaving scheme does
> we do have to look at those since we have to unify loads there.
>
Yes... I will concede the patch isn't my finest work. I also don't like that I
had to keep the leafs intact lest I break things later, but I wanted feedback :)
> I'd put this after vect_slp_build_vertices but before the new_graph call -
> altering 'vertices' / 'leafs' should be more easily possible and the 'leafs' array
> contains all loads already (vect_slp_build_vertices could be massaged to
> provide a map from DR_GROUP_FIRST_ELEMENT to slp_tree, giving us the
> meta we want).
>
> That said, I'd like to see something more forward-looking rather than the ad-
> hoc special-casing of what you run into with the pattern matching.
>
Yeah, I like your suggestion about doing it at build time and CSEing early, but I
don't think I can get that to work in a week given that you've already tried multiple times :)
Happy to give it a go when stage-1 opens next, though.
> In case we want to still go with the special-casing it should IMHO be done in a
> pre-order walk simply looking for lane permute nodes with children that all
> load from the same group performing what you do before any of the
> vertices/graph stuff is built. That's probably easiest at this point and it can be
> done when then bst_map is still around so you can properly CSE the new
> load you build.
That's fair enough. I do think I need a temporary (not terrible) workaround... This would
then need to be somewhere in vect_analyze_slp. Would you prefer I do it during the
construction of the instance or afterwards?
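For my own understanding of the pre-order-walk shape you suggest, here is a toy model (my own simplified structures, not GCC's slp_tree): walk the tree top-down looking for permute nodes whose children are all loads from the same group.

```c
#include <stddef.h>

/* Toy model only: a tiny SLP-like tree.  */
enum kind { LOAD, PERMUTE, OTHER };

struct node
{
  enum kind kind;
  int group;                 /* meaningful for LOAD nodes only */
  int nchildren;
  struct node *children[2];
};

/* Return nonzero if every child of N is a LOAD from the same group,
   i.e. the shape that could be promoted to a single load with a
   load permutation.  */
static int
mergeable_permute_p (const struct node *n)
{
  if (n->kind != PERMUTE || n->nchildren == 0)
    return 0;
  int group = n->children[0]->group;
  for (int i = 0; i < n->nchildren; i++)
    if (n->children[i]->kind != LOAD || n->children[i]->group != group)
      return 0;
  return 1;
}

/* Pre-order walk counting such permute nodes.  */
static int
count_mergeable (const struct node *n)
{
  int count = mergeable_permute_p (n);
  for (int i = 0; i < n->nchildren; i++)
    count += count_mergeable (n->children[i]);
  return count;
}
```

In the real thing the walk would run before the vertices/graph construction, while the bst_map is still around so the newly built load can be CSEd.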
Regards,
Tamar
>
> Thanks,
> Richard.
>
>
>
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > * tree-vect-slp.c (vect_optimize_slp): Promote permutes.
> >
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg, Germany; GF: Felix Imendörffer
Thread overview: 5+ messages
2020-11-03 15:07 Tamar Christina
2020-11-04 13:35 ` Richard Biener
2020-11-04 14:02 ` Tamar Christina [this message]
2020-11-04 15:12 ` Richard Biener
2020-11-04 15:17 ` Tamar Christina