Date: Mon, 14 Oct 2013 08:05:00 -0000
From: Richard Biener
To: Vidya Praveen
Cc: gcc@gcc.gnu.org, ook@ucw.cz, marc.glisse@inria.fr
Subject: Re: [RFC] Vectorization of indexed elements

On Fri, 11 Oct 2013, Vidya Praveen wrote:

> On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote:
> > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > 
> > > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> > > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > > > 
> > > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100,
> > > > > Vidya Praveen wrote:
> > > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > > > > [...]
> > > > > > > > > I can't really insist on the single lane load.. something like:
> > > > > > > > > 
> > > > > > > > > vc:V4SI[0] = c
> > > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > > > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > > > > > > > 
> > > > > > > > > Or is there any other way to do this?
> > > > > > > > 
> > > > > > > > Can you elaborate on "I can't really insist on the single lane load"?
> > > > > > > > What's the single lane load in your example?
> > > > > > > 
> > > > > > > Loading just one lane of the vector, like this:
> > > > > > > 
> > > > > > > vc:V4SI[0] = c   // from the above scalar example
> > > > > > > 
> > > > > > > or
> > > > > > > 
> > > > > > > vc:V4SI[0] = c[2]
> > > > > > > 
> > > > > > > is what I meant by single lane load. In this example:
> > > > > > > 
> > > > > > > t = c[2]
> > > > > > > ...
> > > > > > > vb:v4si = b[0:3]
> > > > > > > vc:v4si = { t, t, t, t }
> > > > > > > va:v4si = vb:v4si <op> vc:v4si
> > > > > > > 
> > > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
> > > > > > > insist on 't' being a vector and t = c[2] being vect_t[0] = c[2] (which
> > > > > > > could be seen as vec_select:SI (vect_t 0)).
> > > > > > > 
> > > > > > > > I'd expect the instruction
> > > > > > > > pattern as quoted to just work (and I hope we expand a uniform
> > > > > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > > > > 
> > > > > > > As far as I went through the code, this is only done using vect_init. It is
> > > > > > > not expanded as vec_duplicate from, for example, store_constructor() of expr.c.
> > > > > > 
> > > > > > Do you see any issues if we expand such a constructor as vec_duplicate directly
> > > > > > instead of going through the vect_init way?
> > > > > 
> > > > > Sorry, that was a bad question.
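[Editor's note: the uniform constructor being discussed can be written directly in C with GCC's generic vector extensions. The following is a compilable sketch of the `t = c[2]; vc = { t, t, t, t }` sequence above; the function name `mul_by_lane` is hypothetical, and how GCC expands the constructor (vec_init vs. vec_duplicate) is exactly the question in this thread.]

```c
/* Sketch of the GIMPLE sequence quoted above, using GCC vector
   extensions: load one scalar element, broadcast it to all four
   lanes via a uniform constructor, then multiply elementwise.  */
typedef int v4si __attribute__ ((vector_size (16)));

v4si
mul_by_lane (v4si vb, const int *c)
{
  int t = c[2];                 /* t = c[2]: scalar load of one element */
  v4si vc = { t, t, t, t };     /* uniform constructor: a vec_duplicate */
  return vb * vc;               /* va = vb <op> vc, elementwise */
}
```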
> > > > > 
> > > > > But here's what I would like to propose as a first step. Please tell me if this
> > > > > is acceptable or if it makes sense:
> > > > > 
> > > > > - Introduce standard pattern names
> > > > > 
> > > > >   "vmulim4" - vector multiply with the second operand as an indexed operand
> > > > > 
> > > > >   Example:
> > > > > 
> > > > >   (define_insn "vmuliv4si4"
> > > > >     [(set (match_operand:V4SI 0 "register_operand")
> > > > >           (mul:V4SI (match_operand:V4SI 1 "register_operand")
> > > > >                     (vec_duplicate:V4SI
> > > > >                       (vec_select:SI
> > > > >                         (match_operand:V4SI 2 "register_operand")
> > > > >                         (match_operand:V4SI 3 "immediate_operand")))))]
> > > > >     ...
> > > > >   )
> > > > 
> > > > We could factor this by providing a standard pattern name for
> > > > 
> > > > (define_insn "vdupi"
> > > >   [(set (match_operand: 0 "register_operand")
> > > >         (vec_duplicate:
> > > >           (vec_select:
> > > >             (match_operand: 1 "register_operand")
> > > >             (match_operand:SI 2 "immediate_operand"))))]
> > > 
> > > This is good. I did think about this, but then I thought of avoiding the need
> > > for combiner patterns :-)
> > > 
> > > But do you find the lane-specific mov pattern I proposed acceptable?
> > 
> > The specific mul pattern?  As said, consider factoring to vdupi to
> > avoid an explosion in required special optabs.
> > 
> > > > (you use V4SI for the immediate?
> > > 
> > > Sorry, typo again!! It should've been SI.
> > > 
> > > > Ideally vdupi has another custom
> > > > mode for the vector index).
> > > > 
> > > > Note that this factored pattern is already available as vec_perm_const!
> > > > It is simply (vec_perm_const:V4SI ...).
> > > > 
> > > > Which means that on the GIMPLE level we should try to combine
> > > > 
> > > > el_4 = BIT_FIELD_REF <...>;
> > > > v_5 = { el_4, el_4, ... };
> > > 
> > > I don't think we reach this state at all for the scenarios in discussion.
> > > What we generally have is:
> > > 
> > > el_4 = MEM_REF <array + index*size>
> > > v_5 = { el_4, ... }
> > > 
> > > Or am I missing something?
> > 
> > Well, but in that case I doubt it is profitable (or even valid!) to
> > turn this into a vector lane load from the array.  If it is profitable
> > to perform a vector read (because we're going to use the other elements
> > of the vector as well) then the vectorizer should produce a vector
> > load and materialize the uniform vector from one of its elements.
> > 
> > Maybe at this point you should show us a compilable C testcase
> > with a loop that should be vectorized using your instructions in
> > the end?
> 
> Here's a compilable example:
> 
> void
> foo (int *__restrict__ a,
>      int *__restrict__ b,
>      int *__restrict__ c)
> {
>   int i;
> 
>   for (i = 0; i < 8; i++)
>     a[i] = b[i] * c[2];
> }
> 
> This is vectorized by duplicating c[2] now. But I'm trying to take advantage
> of target instructions that can take a vector register as the second argument
> but use only one element (by using the same value for all the lanes) of the
> vector register.
> 
> E.g. mul <vd>, <vn>, <vm>[index]
>      mla <vd>, <vn>, <vm>[index]   // multiply and add
> 
> But for a loop like the one in the C example given, I will have to load
> c[2] into one element of the vector register (leaving the remaining lanes
> unused) instead. This is why I was proposing to load just one element into
> a vector register (what I meant by "lane-specific load"). The benefit of
> doing this is that we avoid explicit duplication; however, such a
> simplification can only be done where such support is available - the
> reason why I was thinking in terms of an optional standard pattern name.
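[Editor's note: the duplication-based strategy the vectorizer uses today for this testcase can be sketched in plain C with GCC vector extensions. This is an illustrative hand-written version, not the vectorizer's actual output; the name `foo_vec` is hypothetical.]

```c
/* Hand-vectorized form of the testcase above: c[2] is duplicated
   across all lanes once, outside the loop, then each 4-wide chunk
   of b is multiplied by that uniform vector.  */
typedef int v4si __attribute__ ((vector_size (16)));

void
foo_vec (int *__restrict__ a,
         int *__restrict__ b,
         int *__restrict__ c)
{
  v4si vc = { c[2], c[2], c[2], c[2] };   /* explicit duplication of c[2] */
  for (int i = 0; i < 8; i += 4)
    {
      v4si vb;
      __builtin_memcpy (&vb, b + i, sizeof vb);   /* vb = b[i:i+3] */
      vb = vb * vc;                               /* elementwise multiply */
      __builtin_memcpy (a + i, &vb, sizeof vb);   /* a[i:i+3] = vb */
    }
}
```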
> Another benefit is we will also be able to support scalars in the
> expression, like in the following example:
> 
> void
> foo (int *__restrict__ a,
>      int *__restrict__ b,
>      int c)
> {
>   int i;
> 
>   for (i = 0; i < 8; i++)
>     a[i] = b[i] * c;
> }

Both cases can be handled by patterns that match

  (mul:VXSI (reg:VXSI) (vec_duplicate:VXSI (reg:SI)))

You'd then "consume" the vec_duplicate and implement it as a load of the
scalar into element zero of the vector, and use the indexed multiply with
index zero.  This handling can be implemented either at the combine level
(this is a two-instruction combine only!) or via TER.

> Another example which can take advantage of the target feature:
> 
> void
> foo (int *__restrict__ a,
>      int *__restrict__ b,
>      int *__restrict__ c)
> {
>   int i, j;
> 
>   for (i = 0; i < 8; i++)
>     for (j = 0; j < 8; j++)
>       a[j] += b[j] * c[i];
> }
> 
> This is the scenario we discussed earlier (you suggested handling it at TER).

Yeah, though for optimal code you'd want to load c[0:8] once with a
vector load.  Eventually the vectorizer's outer-loop vectorization
support can be massaged to handle this optimally.

Richard.
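[Editor's note: the "load c[0:8] once" suggestion for the nested loop can be sketched as follows, again with GCC vector extensions. The per-iteration lane selection stands in for the indexed-multiply (mul ...[index]) instructions discussed above; `foo_nested_vec` is a hypothetical name and this is an illustration, not generated code.]

```c
/* Hand-vectorized form of the nested-loop testcase: all of c[0:7] is
   loaded into vector registers once, and each outer iteration reuses
   one lane of it, rather than re-loading and re-duplicating c[i].  */
typedef int v4si __attribute__ ((vector_size (16)));

void
foo_nested_vec (int *__restrict__ a,
                int *__restrict__ b,
                int *__restrict__ c)
{
  v4si clo, chi, va0, va1, vb0, vb1;
  __builtin_memcpy (&clo, c,     sizeof clo);   /* c[0:3], loaded once */
  __builtin_memcpy (&chi, c + 4, sizeof chi);   /* c[4:7], loaded once */
  __builtin_memcpy (&vb0, b,     sizeof vb0);
  __builtin_memcpy (&vb1, b + 4, sizeof vb1);
  __builtin_memcpy (&va0, a,     sizeof va0);
  __builtin_memcpy (&va1, a + 4, sizeof va1);

  for (int i = 0; i < 8; i++)
    {
      int ci = i < 4 ? clo[i] : chi[i - 4];   /* select lane i of c */
      v4si vc = { ci, ci, ci, ci };           /* stands in for mul ...[index] */
      va0 += vb0 * vc;                        /* a[0:3] += b[0:3] * c[i] */
      va1 += vb1 * vc;                        /* a[4:7] += b[4:7] * c[i] */
    }

  __builtin_memcpy (a,     &va0, sizeof va0);
  __builtin_memcpy (a + 4, &va1, sizeof va1);
}
```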