From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 24850 invoked by alias); 25 Sep 2013 09:22:11 -0000 Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-owner@gcc.gnu.org Received: (qmail 24840 invoked by uid 89); 25 Sep 2013 09:22:10 -0000 Received: from cantor2.suse.de (HELO mx2.suse.de) (195.135.220.15) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 25 Sep 2013 09:22:10 +0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-3.1 required=5.0 tests=AWL,BAYES_00,RDNS_NONE autolearn=no version=3.3.2 X-HELO: mx2.suse.de Received: from relay1.suse.de (unknown [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id E53D8A51FF; Wed, 25 Sep 2013 11:22:06 +0200 (CEST) Date: Wed, 25 Sep 2013 09:22:00 -0000 From: Richard Biener To: Vidya Praveen Cc: "gcc@gcc.gnu.org" , "ook@ucw.cz" Subject: Re: [RFC] Vectorization of indexed elements In-Reply-To: <20130924150343.GD22907@e103625-lin.cambridge.arm.com> Message-ID: References: <20130909172533.GA25330@e103625-lin.cambridge.arm.com> <20130924150343.GD22907@e103625-lin.cambridge.arm.com> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-SW-Source: 2013-09/txt/msg00224.txt.bz2 On Tue, 24 Sep 2013, Vidya Praveen wrote: > On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote: > > On Mon, 9 Sep 2013, Marc Glisse wrote: > > > > > On Mon, 9 Sep 2013, Vidya Praveen wrote: > > > > > > > Hello, > > > > > > > > This post details some thoughts on an enhancement to the vectorizer that > > > > could take advantage of the SIMD instructions that allows indexed element > > > > as an operand thus reducing the need for duplication and possibly improve > > > > reuse of previously loaded data. > > > > > > > > Appreciate your opinion on this. > > > > > > > > --- > > > > > > > > A phrase like this: > > > > > > > > for(i=0;i<4;i++) > > > > a[i] = b[i] c[2]; > > > > > > > > is usually vectorized as: > > > > > > > > va:V4SI = a[0:3] > > > > vb:V4SI = b[0:3] > > > > t = c[2] > > > > vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init > > > > ... > > > > va:V4SI = vb:V4SI vc:V4SI > > > > > > > > But this could be simplified further if a target has instructions that > > > > support > > > > indexed element as a parameter. For example an instruction like this: > > > > > > > > mul v0.4s, v1.4s, v2.4s[2] > > > > > > > > can perform multiplication of each element of v2.4s with the third element > > > > of > > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding > > > > elements of v0.4s. > > > > > > > > For this to happen, vectorizer needs to understand this idiom and treat the > > > > operand c[2] specially (and by taking in to consideration if the machine > > > > supports indexed element as an operand for through a target hook or > > > > macro) > > > > and consider this as vectorizable statement without having to duplicate the > > > > elements explicitly. > > > > > > > > There are fews ways this could be represented at gimple: > > > > > > > > ... > > > > va:V4SI = vb:V4SI VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > > > ... > > > > > > > > or by allowing a vectorizer treat an indexed element as a valid operand in a > > > > vectorizable statement: > > > > > > Might as well allow any scalar then... > > > > I agree. The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form > > would necessarily be two extra separate statements and thus subject > > to CSE obfuscating it enough for RTL expansion to no longer notice it. > > I also thought about having a specialized expression like > > VEC_INDEXED__EXPR < arg0, arg1, arg2, index> > > to mean: > > arg0 = arg1 arg2[index] > > and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_EXPR > is handled in expr.c. But I dropped this idea since we may need to introduce > many such nodes. > > > > > That said, allowing mixed scalar/vector ops isn't very nice and > > your scheme can be simplified by just using > > > > vc:V4SI = VEC_DUPLICATE_EXPR <...> > > va:V4SI = vb:V4SI vc:V4SI > > > > where the expander only has to see that vc:V4SI is defined by > > a duplicate. > > I did try out something like this quickly before I posted this RFC, though > I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select()) > > for: > > for(i=0;i<8;i++) > a[i] = b[2] * c[i]; > > I could generate: > > ... > : > _88 = prolog_loop_adjusted_niters.6_60 * 4; > vectp_c.13_87 = c_10(D) + _88; > vect_ldidx_.16_92 = MEM[(int *)b_8(D) + 8B]; <<<<<<<< > vect_idxed_.17_93 = (vect_ldidx_.16_92) <<< ??? >>> (0); <<<<<<<< > _96 = prolog_loop_adjusted_niters.6_60 * 4; > vectp_a.19_95 = a_6(D) + _96; > vect__12.14_115 = MEM[(int *)vectp_c.13_87]; > vect_patt_40.15_116 = vect__12.14_115 * vect_idxed_.17_93; <<<<<<<< > MEM[(int *)vectp_a.19_95] = vect_patt_40.15_116; <<<<<<<< > vectp_c.12_118 = vectp_c.13_87 + 16; > vectp_a.18_119 = vectp_a.19_95 + 16; > ivtmp_120 = 1; > if (ivtmp_120 < bnd.8_62) > goto ; > else > goto ; > > : > # vectp_c.12_89 = PHI > # vectp_a.18_97 = PHI > # ivtmp_14 = PHI > vect__12.14_91 = MEM[(int *)vectp_c.12_89]; <<<<<<<< > vect_patt_40.15_94 = vect__12.14_91 * vect_idxed_.17_93; <<<<<<<< > MEM[(int *)vectp_a.18_97] = vect_patt_40.15_94; > ... > > It's a crude implementation so VEC_DUP is printed as: > > (vect_ldidx_.16_92) <<< ??? >>> (0); > > > > > > ... > > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 2) > > > > ... > > > > > > > > For the sake of explanation, the above two representations assumes that > > > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is > > > > the > > > > only use of 'c' then it may be safer to just load one element and use it > > > > like > > > > this: > > > > > > > > vc:V4SI[0] = c[2] > > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > > > > > This could also mean that expressions involving scalar could be treated > > > > similarly. For example, > > > > > > > > for(i=0;i<4;i++) > > > > a[i] = b[i] c > > > > > > > > could be vectorized as: > > > > > > > > vc:V4SI[0] = c > > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > > > > > Such a change would also require new standard pattern names to be defined > > > > for > > > > each . > > > > > > > > Alternatively, having something like this: > > > > > > > > ... > > > > vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > > > va:V4SI = vb:V4SI vt:V4SI > > > > ... > > > > > > > > would remove the need to introduce several new standard pattern names but > > > > have > > > > just one to represent vec_duplicate(vec_select()) but ofcourse this will > > > > expect > > > > the target to have combiner patterns. > > > > > > The cost estimation wouldn't be very good, but aren't combine patterns enough > > > for the whole thing? Don't you model your mul instruction as: > > > > > > (mult:V4SI > > > (match_operand:V4SI) > > > (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI)))) > > > > > > anyway? Seems that combine should be able to handle it. What currently happens > > > that we fail to generate the right instruction? > > > > > > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR for > > > vec_duplicate, adding new nodes is always painful. > > > > True, though CONSTRUCTOR isn't a good vec_duplicate primitive. But yes, > > we have it that way at the moment and indeed adding new nodes is always > > painful. > > > > > > This enhancement could possibly help further optimizing larger scenarios > > > > such > > > > as linear systems. > > > > Given that the vectorizer already handles all cases you quote but > > just the expansion doesn't use the targets special abilities - can't > > you just teach the expander to lookup the definition of the > > vectors and see if it is an uniform CONSTRUCTOR? > > > > Richard. > > > > I did think of handling this as a part of expanding CONSTRUCTOR but I thought > it may not a good idea if we want to enhance this support in the future to > handle larger cases like this one (hypothetical example): > > for i = 0 to 3 > for j = 0 to 3 > a[j] += b[j] * c[i] > > to > > a[0:3] += b[0:3] + c[0] > a[0:3] += b[0:3] + c[1] > a[0:3] += b[0:3] + c[2] > a[0:3] += b[0:3] + c[3] > > Secondly, I am not sure if introducing a single lane load at the time of > expansion and removing or expecting the existing scalar load to be removed > later as it is unused, is a good idea. The above example requires unrolling and scalar cleanup anyway so you'll currently see sth like tem00_1 = c[0]; tem0_2 = { tem00_1, tem00_1, tem00_1, tem00_1 }; temb_3 = b[0:3]; tems_4 = temb_3 + tem0_2; tema_5 = a[0:3]; tems_6 = tema_5 + tems_4; a[0:3] = tems_6; which at expansion time can be short-cut easily without even going to expand the constructor. That's the feature of TER which defers expansion of SSA name definition statements until they are first used. In this case when you expand tems_4 you can pattern match the case you are interested in (not 100% sure which it is, I suppose it is vector register + vector_select < vector register >) and simply expand it optimally. The "consumed" SSA names will have no further uses and will simply not be expanded at all. Richard.