From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
References: <20190121131954.GA21839@bell-sw.com>
In-Reply-To: <20190121131954.GA21839@bell-sw.com>
From: Richard Biener
Date: Thu, 24 Jan 2019 10:48:00 -0000
Subject: Re: SLP-based reduction vectorization
To: Anton Youdkevitch
Cc: GCC Development
X-SW-Source: 2019-01/txt/msg00204.txt.bz2

On Mon, Jan 21, 2019 at 2:20 PM Anton Youdkevitch wrote:
>
> Here is the prototype for doing vectorized reduction
> using the SLP approach.
> I would appreciate feedback on whether this
> is a feasible approach and whether the overall direction
> is right.
>
> The idea is to vectorize a reduction like this
>
>   S = A[0] + A[1] + ... + A[N];
>
> into
>
>   Sv = Av[0] + Av[1] + ... + Av[N/VL];
>
> so that, for instance, the following code:
>
> typedef double T;
> T sum;
>
> void foo (T* __restrict__ a)
> {
>   sum = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + a[6] + a[7];
> }
>
> instead of:
>
> foo:
> .LFB23:
>         .cfi_startproc
>         movsd   (%rdi), %xmm0
>         movsd   16(%rdi), %xmm1
>         addsd   8(%rdi), %xmm0
>         addsd   24(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   32(%rdi), %xmm1
>         addsd   40(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   48(%rdi), %xmm1
>         addsd   56(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   %xmm0, sum(%rip)
>         ret
>         .cfi_endproc
>
> is compiled into:
>
> foo:
> .LFB11:
>         .cfi_startproc
>         movupd  32(%rdi), %xmm0
>         movupd  48(%rdi), %xmm3
>         movupd  (%rdi), %xmm1
>         movupd  16(%rdi), %xmm2
>         addpd   %xmm3, %xmm0
>         addpd   %xmm2, %xmm1
>         addpd   %xmm1, %xmm0
>         haddpd  %xmm0, %xmm0
>         movlpd  %xmm0, sum(%rip)
>         ret
>         .cfi_endproc
>
> As this is a very crude prototype, there are some things
> to consider:
>
> 1. Since the current SLP framework assumes the presence of
> group stores, I cannot use it directly: a reduction does
> not require group stores (or even stores at all). So I am
> partially reusing the existing functionality, but sometimes
> I have to create a stripped-down version of it for my own
> needs.
>
> 2. The current version considers only PLUS reductions,
> as they are encountered most often and are therefore the
> most practical.
>
> 3. While normally the SLP transformation should operate
> inside a single basic block, this requirement greatly
> restricts its practical application: in sufficiently
> complex code there will be vectorizable subexpressions
> defined in basic block(s) different from the one where the
> reduction result resides. However, for the sake of
> simplicity, only single uses in the same block are
> considered for now.
>
> 4.
> For the same sake, the current version does not deal
> with partial reductions, which would require partial-sum
> merging and careful removal of the scalars that participate
> in the vector part. The latter gets done automatically
> by DCE in the case of full reduction vectorization.
>
> 5. There is no cost model yet, for the reasons mentioned
> in paragraphs 3 and 4.

First, sorry for the late response.

No, I don't think your approach of bypassing the "rest" is OK.
I once started to implement BB reduction support but somehow
got distracted, IIRC.

Your testcase (and the prototype I worked on) still has a
(scalar, non-grouped) store which can be keyed on in the SLP
discovery phase. You should be able to "re-use" (by a lot of
refactoring, I guess) the reduction finding code
(vect_is_slp_reduction) to see whether such a store is fed by a
reduction chain.

Note that if you adjust the testcase to have

  sum[0] = a[0] + ... + a[n];
  sum[1] = b[0] + ... + b[n];

then you'll have a grouped store fed by reductions. You can also
consider

  for (i = ...)
    {
      sum[i] = a[i*4] + a[i*4+1] + a[i*4+2] + a[i*4+3];
    }

which we should be able to handle.

So the whole problem of doing BB reductions boils down to SLP
tree discovery; the rest should be more straightforward (of
course code generation needs to be adapted for the non-loop case
as well). It's not the easiest problem you're trying to tackle,
btw ;)

May I suggest you become familiar with the code by BB vectorizing
vector CONSTRUCTORs instead?

typedef int v4si __attribute__((vector_size(16)));

v4si foo (int *i, int *j)
{
  return (v4si) { i[0] + j[0], i[1] + j[1],
                  i[2] + j[2], i[3] + j[3] };
}

It has the same SLP discovery "issue", this time somewhat easier
as the CONSTRUCTOR directly plays the role of the "grouped store".

Richard.

> Thanks in advance.
>
> --
> Anton
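
[Editorial note] The transformation discussed above can be illustrated in source form with GCC's vector extensions. The sketch below is hand-written and hypothetical (the function name `foo_vec` and the `memcpy`-based unaligned loads are illustrative choices, not part of the prototype); it mirrors the movupd/addpd/haddpd sequence that the vectorized `foo` is expected to produce:

```c
#include <string.h>

/* Two doubles per vector, matching the 128-bit SSE2 registers
   used in the assembly above.  */
typedef double v2df __attribute__ ((vector_size (16)));

double sum;

/* Hand-written equivalent of the vectorized reduction: four
   unaligned V2DF loads, pairwise vector adds, and a final
   horizontal add of the two lanes.  */
void
foo_vec (const double *restrict a)
{
  v2df v0, v1, v2, v3;
  memcpy (&v0, a + 0, sizeof v0);   /* movupd (%rdi), ...   */
  memcpy (&v1, a + 2, sizeof v1);
  memcpy (&v2, a + 4, sizeof v2);
  memcpy (&v3, a + 6, sizeof v3);
  v2df t = (v0 + v1) + (v2 + v3);   /* addpd                */
  sum = t[0] + t[1];                /* haddpd + store       */
}
```

Note that the floating-point association order differs from the left-to-right scalar source, which is why such reductions normally require -ffast-math (or an otherwise relaxed FP model) to vectorize.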