From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
References: <20190121131954.GA21839@bell-sw.com>
In-Reply-To: <20190121131954.GA21839@bell-sw.com>
From: Richard Biener
Date: Thu, 24 Jan 2019 10:48:00 -0000
Subject: Re: SLP-based reduction vectorization
To: Anton Youdkevitch
Cc: GCC Development
X-SW-Source: 2019-01/txt/msg00204.txt.bz2

On Mon, Jan 21, 2019 at 2:20 PM Anton Youdkevitch wrote:
>
> Here is the prototype for doing vectorized reduction
> using the SLP approach.
> I would appreciate feedback on whether this
> is a feasible approach and whether the overall direction
> is right.
>
> The idea is to vectorize a reduction like this
>
>   S = A[0] + A[1] + ... + A[N];
>
> into
>
>   Sv = Av[0] + Av[1] + ... + Av[N/VL];
>
> so that, for instance, the following code:
>
> typedef double T;
> T sum;
>
> void foo (T* __restrict__ a)
> {
>   sum = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + a[6] + a[7];
> }
>
> instead of:
>
> foo:
> .LFB23:
>         .cfi_startproc
>         movsd   (%rdi), %xmm0
>         movsd   16(%rdi), %xmm1
>         addsd   8(%rdi), %xmm0
>         addsd   24(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   32(%rdi), %xmm1
>         addsd   40(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   48(%rdi), %xmm1
>         addsd   56(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   %xmm0, sum(%rip)
>         ret
>         .cfi_endproc
>
> is compiled into:
>
> foo:
> .LFB11:
>         .cfi_startproc
>         movupd  32(%rdi), %xmm0
>         movupd  48(%rdi), %xmm3
>         movupd  (%rdi), %xmm1
>         movupd  16(%rdi), %xmm2
>         addpd   %xmm3, %xmm0
>         addpd   %xmm2, %xmm1
>         addpd   %xmm1, %xmm0
>         haddpd  %xmm0, %xmm0
>         movlpd  %xmm0, sum(%rip)
>         ret
>         .cfi_endproc
>
> As this is a very crude prototype, there are some things
> to consider:
>
> 1. Since the current SLP framework assumes the presence of
> group stores, I cannot use it directly: a reduction does
> not require group stores (or even stores at all). So I am
> partially reusing the existing functionality, but sometimes
> I have to create a stripped-down version of it for my own
> needs.
>
> 2. The current version considers only PLUS reductions,
> as they are encountered most often and are therefore the
> most practical.
>
> 3. While normally the SLP transformation should operate
> inside a single basic block, this requirement greatly
> restricts its practical application: in sufficiently
> complex code there will be vectorizable subexpressions
> defined in basic block(s) different from the one where the
> reduction result resides. However, for the sake of
> simplicity, only single uses in the same block are
> considered for now.
>
> 4.
> For the same sake, the current version does not deal
> with partial reductions, which would require partial-sum
> merging and careful removal of the scalars that participate
> in the vector part. The latter gets done automatically
> by DCE in the case of full reduction vectorization.
>
> 5. There is no cost model yet, for the reasons mentioned
> in paragraphs 3 and 4.

First, sorry for the late response.

No, I don't think your approach of bypassing the "rest" is OK.
I once started to implement BB reduction support but somehow
got distracted, IIRC.

Your testcase (and the prototype I worked on) still has a
(scalar, non-grouped) store which can be keyed on in the SLP
discovery phase. You should be able to "re-use" (by a lot of
refactoring, I guess) the reduction finding code
(vect_is_slp_reduction) to see whether such a store is fed by a
reduction chain.

Note that if you adjust the testcase to have

  sum[0] = a[0] + ... + a[n];
  sum[1] = b[0] + ... + b[n];

then you'll have a grouped store fed by reductions. You can also
consider

  for (i = ...)
    {
      sum[i] = a[i*4] + a[i*4+1] + a[i*4+2] + a[i*4+3];
    }

which we should be able to handle.

So the whole problem of doing BB reductions boils down to SLP
tree discovery; the rest should be more straightforward (of
course code generation needs to be adapted for the non-loop case
as well). It's not the easiest problem you're trying to tackle,
btw ;)

May I suggest you become familiar with the code by BB vectorizing
vector CONSTRUCTORs instead?

typedef int v4si __attribute__((vector_size(16)));

v4si foo (int *i, int *j)
{
  return (v4si) { i[0] + j[0], i[1] + j[1],
                  i[2] + j[2], i[3] + j[3] };
}

It has the same SLP discovery "issue", this time somewhat easier
as the CONSTRUCTOR directly plays the role of the "grouped store".

Richard.

> Thanks in advance.
>
> --
> Anton
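
[Editorial note] The transformation discussed above can be illustrated in source form with GCC's vector extensions. The sketch below is hand-written and hypothetical (the function name `foo_vec` and the `memcpy`-based unaligned loads are illustrative choices, not part of the prototype); it mirrors the movupd/addpd/haddpd sequence that the vectorized `foo` is expected to produce:

```c
#include <string.h>

/* Two doubles per vector, matching the 128-bit SSE2 registers
   used in the assembly above.  */
typedef double v2df __attribute__ ((vector_size (16)));

double sum;

/* Hand-written equivalent of the vectorized reduction: four
   unaligned V2DF loads, pairwise vector adds, and a final
   horizontal add of the two lanes.  */
void
foo_vec (const double *restrict a)
{
  v2df v0, v1, v2, v3;
  memcpy (&v0, a + 0, sizeof v0);   /* movupd (%rdi), ...   */
  memcpy (&v1, a + 2, sizeof v1);
  memcpy (&v2, a + 4, sizeof v2);
  memcpy (&v3, a + 6, sizeof v3);
  v2df t = (v0 + v1) + (v2 + v3);   /* addpd                */
  sum = t[0] + t[1];                /* haddpd + store       */
}
```

Note that the floating-point association order differs from the left-to-right scalar source, which is why such reductions normally require -ffast-math (or an otherwise relaxed FP model) to vectorize.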