From: Vidya Praveen <vidyapraveen@arm.com>
To: Richard Biener <rguenther@suse.de>
Cc: "gcc@gcc.gnu.org" <gcc@gcc.gnu.org>, "ook@ucw.cz" <ook@ucw.cz>
Subject: Re: [RFC] Vectorization of indexed elements
Date: Mon, 30 Sep 2013 13:01:00 -0000 [thread overview]
Message-ID: <20130930130145.GE3460@e103625-lin.cambridge.arm.com> (raw)
In-Reply-To: <alpine.LNX.2.00.1309251116040.29411@zhemvz.fhfr.qr>
On Wed, Sep 25, 2013 at 10:22:05AM +0100, Richard Biener wrote:
> On Tue, 24 Sep 2013, Vidya Praveen wrote:
>
> > On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote:
> > > On Mon, 9 Sep 2013, Marc Glisse wrote:
> > >
> > > > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > This post details some thoughts on an enhancement to the vectorizer that
> > > > > could take advantage of the SIMD instructions that allows indexed element
> > > > > as an operand thus reducing the need for duplication and possibly improve
> > > > > reuse of previously loaded data.
> > > > >
> > > > > Appreciate your opinion on this.
> > > > >
> > > > > ---
> > > > >
> > > > > A phrase like this:
> > > > >
> > > > > for(i=0;i<4;i++)
> > > > > a[i] = b[i] <op> c[2];
> > > > >
> > > > > is usually vectorized as:
> > > > >
> > > > > va:V4SI = a[0:3]
> > > > > vb:V4SI = b[0:3]
> > > > > t = c[2]
> > > > > vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
> > > > > ...
> > > > > va:V4SI = vb:V4SI <op> vc:V4SI
> > > > >
> > > > > But this could be simplified further if a target has instructions that
> > > > > support
> > > > > indexed element as a parameter. For example an instruction like this:
> > > > >
> > > > > mul v0.4s, v1.4s, v2.4s[2]
> > > > >
> > > > > can perform multiplication of each element of v2.4s with the third element
> > > > > of
> > > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > > > elements of v0.4s.
> > > > >
> > > > > For this to happen, vectorizer needs to understand this idiom and treat the
> > > > > operand c[2] specially (and by taking in to consideration if the machine
> > > > > supports indexed element as an operand for <op> through a target hook or
> > > > > macro)
> > > > > and consider this as vectorizable statement without having to duplicate the
> > > > > elements explicitly.
> > > > >
> > > > > There are fews ways this could be represented at gimple:
> > > > >
> > > > > ...
> > > > > va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > > > ...
> > > > >
> > > > > or by allowing a vectorizer treat an indexed element as a valid operand in a
> > > > > vectorizable statement:
> > > >
> > > > Might as well allow any scalar then...
> > >
> > > I agree. The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form
> > > would necessarily be two extra separate statements and thus subject
> > > to CSE obfuscating it enough for RTL expansion to no longer notice it.
> >
> > I also thought about having a specialized expression like
> >
> > VEC_INDEXED_<op>_EXPR < arg0, arg1, arg2, index>
> >
> > to mean:
> >
> > arg0 = arg1 <op> arg2[index]
> >
> > and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_EXPR
> > is handled in expr.c. But I dropped this idea since we may need to introduce
> > many such nodes.
> >
> > >
> > > That said, allowing mixed scalar/vector ops isn't very nice and
> > > your scheme can be simplified by just using
> > >
> > > vc:V4SI = VEC_DUPLICATE_EXPR <...>
> > > va:V4SI = vb:V4SI <op> vc:V4SI
> > >
> > > where the expander only has to see that vc:V4SI is defined by
> > > a duplicate.
> >
> > I did try out something like this quickly before I posted this RFC, though
> > I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select())
> >
> > for:
> >
> > for(i=0;i<8;i++)
> > a[i] = b[2] * c[i];
> >
> > I could generate:
> >
> > ...
> > <bb 8>:
> > _88 = prolog_loop_adjusted_niters.6_60 * 4;
> > vectp_c.13_87 = c_10(D) + _88;
> > vect_ldidx_.16_92 = MEM[(int *)b_8(D) + 8B]; <<<<<<<<
> > vect_idxed_.17_93 = (vect_ldidx_.16_92) <<< ??? >>> (0); <<<<<<<<
> > _96 = prolog_loop_adjusted_niters.6_60 * 4;
> > vectp_a.19_95 = a_6(D) + _96;
> > vect__12.14_115 = MEM[(int *)vectp_c.13_87];
> > vect_patt_40.15_116 = vect__12.14_115 * vect_idxed_.17_93; <<<<<<<<
> > MEM[(int *)vectp_a.19_95] = vect_patt_40.15_116; <<<<<<<<
> > vectp_c.12_118 = vectp_c.13_87 + 16;
> > vectp_a.18_119 = vectp_a.19_95 + 16;
> > ivtmp_120 = 1;
> > if (ivtmp_120 < bnd.8_62)
> > goto <bb 9>;
> > else
> > goto <bb 11>;
> >
> > <bb 9>:
> > # vectp_c.12_89 = PHI <vectp_c.12_118(8)>
> > # vectp_a.18_97 = PHI <vectp_a.18_119(8)>
> > # ivtmp_14 = PHI <ivtmp_120(8)>
> > vect__12.14_91 = MEM[(int *)vectp_c.12_89]; <<<<<<<<
> > vect_patt_40.15_94 = vect__12.14_91 * vect_idxed_.17_93; <<<<<<<<
> > MEM[(int *)vectp_a.18_97] = vect_patt_40.15_94;
> > ...
> >
> > It's a crude implementation so VEC_DUP is printed as:
> >
> > (vect_ldidx_.16_92) <<< ??? >>> (0);
> >
> >
> > > > > ...
> > > > > va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
> > > > > ...
> > > > >
> > > > > For the sake of explanation, the above two representations assumes that
> > > > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is
> > > > > the
> > > > > only use of 'c' then it may be safer to just load one element and use it
> > > > > like
> > > > > this:
> > > > >
> > > > > vc:V4SI[0] = c[2]
> > > > > va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > > >
> > > > > This could also mean that expressions involving scalar could be treated
> > > > > similarly. For example,
> > > > >
> > > > > for(i=0;i<4;i++)
> > > > > a[i] = b[i] <op> c
> > > > >
> > > > > could be vectorized as:
> > > > >
> > > > > vc:V4SI[0] = c
> > > > > va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > > >
> > > > > Such a change would also require new standard pattern names to be defined
> > > > > for
> > > > > each <op>.
> > > > >
> > > > > Alternatively, having something like this:
> > > > >
> > > > > ...
> > > > > vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > > > ...
> > > > >
> > > > > would remove the need to introduce several new standard pattern names but
> > > > > have
> > > > > just one to represent vec_duplicate(vec_select()) but ofcourse this will
> > > > > expect
> > > > > the target to have combiner patterns.
> > > >
> > > > The cost estimation wouldn't be very good, but aren't combine patterns enough
> > > > for the whole thing? Don't you model your mul instruction as:
> > > >
> > > > (mult:V4SI
> > > > (match_operand:V4SI)
> > > > (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))
> > > >
> > > > anyway? Seems that combine should be able to handle it. What currently happens
> > > > that we fail to generate the right instruction?
> > > >
> > > > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR for
> > > > vec_duplicate, adding new nodes is always painful.
> > >
> > > True, though CONSTRUCTOR isn't a good vec_duplicate primitive. But yes,
> > > we have it that way at the moment and indeed adding new nodes is always
> > > painful.
> > >
> > > > > This enhancement could possibly help further optimizing larger scenarios
> > > > > such
> > > > > as linear systems.
> > >
> > > Given that the vectorizer already handles all cases you quote but
> > > just the expansion doesn't use the targets special abilities - can't
> > > you just teach the expander to lookup the definition of the
> > > vectors and see if it is an uniform CONSTRUCTOR?
> > >
> > > Richard.
> > >
> >
> > I did think of handling this as a part of expanding CONSTRUCTOR but I thought
> > it may not a good idea if we want to enhance this support in the future to
> > handle larger cases like this one (hypothetical example):
> >
> > for i = 0 to 3
> > for j = 0 to 3
> > a[j] += b[j] * c[i]
> >
> > to
> >
> > a[0:3] += b[0:3] + c[0]
> > a[0:3] += b[0:3] + c[1]
> > a[0:3] += b[0:3] + c[2]
> > a[0:3] += b[0:3] + c[3]
Actually there was a typo in my previous example, it should've been:
a[0:3] += b[0:3] * c[0]
a[0:3] += b[0:3] * c[1]
a[0:3] += b[0:3] * c[2]
a[0:3] += b[0:3] * c[3]
> >
> > Secondly, I am not sure if introducing a single lane load at the time of
> > expansion and removing or expecting the existing scalar load to be removed
> > later as it is unused, is a good idea.
>
> The above example requires unrolling and scalar cleanup anyway so you'll
> currently see sth like
>
> tem00_1 = c[0];
> tem0_2 = { tem00_1, tem00_1, tem00_1, tem00_1 };
> temb_3 = b[0:3];
> tems_4 = temb_3 + tem0_2;
> tema_5 = a[0:3];
> tems_6 = tema_5 + tems_4;
> a[0:3] = tems_6;
>
> which at expansion time can be short-cut easily without even going to
> expand the constructor. That's the feature of TER which defers
> expansion of SSA name definition statements until they are first
> used. In this case when you expand tems_4 you can pattern match
> the case you are interested in (not 100% sure which it is, I suppose
> it is vector register + vector_select < vector register >) and
> simply expand it optimally. The "consumed" SSA names will have no
> further uses and will simply not be expanded at all.
Thanks. This sounds better. What I'm interested in is:
<vecreg> += <vecreg> * vec_duplicate( vec_select( <vecreg> <index>))
Cheers
VP
next prev parent reply other threads:[~2013-09-30 13:01 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-09-09 17:25 Vidya Praveen
2013-09-09 18:02 ` Marc Glisse
2013-09-10 8:25 ` Richard Biener
2013-09-24 15:03 ` Vidya Praveen
2013-09-25 9:22 ` Richard Biener
2013-09-30 13:01 ` Vidya Praveen [this message]
2013-09-24 15:04 ` Vidya Praveen
2013-09-25 9:25 ` Richard Biener
2013-09-27 14:50 ` Vidya Praveen
2013-09-27 15:19 ` Vidya Praveen
2013-09-30 12:55 ` Vidya Praveen
2013-09-30 13:19 ` Richard Biener
2013-09-30 14:00 ` Vidya Praveen
2013-10-01 8:26 ` Richard Biener
2013-10-11 14:54 ` Vidya Praveen
2013-10-11 15:05 ` Jakub Jelinek
2013-12-04 17:07 ` Vidya Praveen
2013-10-14 8:05 ` Richard Biener
2013-12-04 16:10 ` Vidya Praveen
2013-12-06 11:48 ` Richard Biener
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130930130145.GE3460@e103625-lin.cambridge.arm.com \
--to=vidyapraveen@arm.com \
--cc=gcc@gcc.gnu.org \
--cc=ook@ucw.cz \
--cc=rguenther@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).