public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
From: Anton Youdkevitch <anton.youdkevitch@bell-sw.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Development <gcc@gcc.gnu.org>
Subject: Re: SLP-based reduction vectorization
Date: Thu, 24 Jan 2019 12:04:00 -0000	[thread overview]
Message-ID: <5C49A9E0.8030604@bell-sw.com> (raw)
In-Reply-To: <CAFiYyc2BVP9RVH8PV++--T9erW2jzXaNp85w3f6qzRqZoxpcFA@mail.gmail.com>

Richard,

Thanks a lot for the response! I will definitely
try the constructor approach.

You mentioned non-grouped store. Is the handling
of such stores actually there and I just missed
it? It was a big hassle for me as I didn't manage
to find it and assumed there isn't one.

I have a question (a lot of them, though, but this
one is bothering me most). It is actually paragraph
4 of my previous letter. In real world code there can
be a case that the loading of the elements and the use
of them (for a reduction) are in different BBs (I saw
this myself). So, not only it complicates the things
in general but for me it breaks some SLP code assuming
single BB operation (IIRC, some dataref analysis phase
assumes single BB). Did anybody consider this before?

OK, I know I start looking kind of stubborn here but
what about the case:

foo(A[0]+...A[n])

There won't be any store here by definition while a
reduction will. Or is it something too rarely seen?

-- 
   Thanks,
   Anton


On 24/1/2019 13:47, Richard Biener wrote:
> On Mon, Jan 21, 2019 at 2:20 PM Anton Youdkevitch
> <anton.youdkevitch@bell-sw.com> wrote:
>>
>> Here is the prototype for doing vectorized reduction
>> using SLP approach. I would appreciate feedback if this
>> is a feasible approach and if overall the direction is
>> right.
>>
>> The idea is to vectorize reduction like this
>>
>> S = A[0]+A[1]+...A[N];
>>
>> into
>>
>> Sv = Av[0]+Av[1]+...+Av[N/VL];
>>
>>
>> So that, for instance, the following code:
>>
>> typedef double T;
>> T sum;
>>
>> void foo (T*  __restrict__ a)
>> {
>>          sum = a[0]+ a[1] + a[2]+ a[3] + a[4]+ a[5] + a[6]+ a[7];
>> }
>>
>>
>> instead of:
>>
>> foo:
>> .LFB23:
>>          .cfi_startproc
>>          movsd   (%rdi), %xmm0
>>          movsd   16(%rdi), %xmm1
>>          addsd   8(%rdi), %xmm0
>>          addsd   24(%rdi), %xmm1
>>          addsd   %xmm1, %xmm0
>>          movsd   32(%rdi), %xmm1
>>          addsd   40(%rdi), %xmm1
>>          addsd   %xmm1, %xmm0
>>          movsd   48(%rdi), %xmm1
>>          addsd   56(%rdi), %xmm1
>>          addsd   %xmm1, %xmm0
>>          movsd   %xmm0, sum2(%rip)
>>          ret
>>          .cfi_endproc
>>
>>
>> be compiled into:
>>
>> foo:
>> .LFB11:
>>          .cfi_startproc
>>          movupd  32(%rdi), %xmm0
>>          movupd  48(%rdi), %xmm3
>>          movupd  (%rdi), %xmm1
>>          movupd  16(%rdi), %xmm2
>>          addpd   %xmm3, %xmm0
>>          addpd   %xmm2, %xmm1
>>          addpd   %xmm1, %xmm0
>>          haddpd  %xmm0, %xmm0
>>          movlpd  %xmm0, sum(%rip)
>>          ret
>>          .cfi_endproc
>>
>>
>> As this is a very crude prototype there are some things
>> to consider.
>>
>> 1. As the current SLP framework assumes presence of
>> group stores I cannot use directly it as reduction
>> does not require group stores (or even stores at all),
>> so, I'm partially using the existing functionality but
>> sometimes I have to create a stripped down version
>> of it for my own needs;
>>
>> 2. The current version considers only PLUS reduction
>> as it is encountered most often and therefore is the
>> most practical;
>>
>> 3. While normally SLP transformation should operate
>> inside single basic block this requirement greatly
>> restricts it's practical application as in a code
>> complex enough there will be vectorizable subexpressions
>> defined in basic block(s) different from that where the
>> reduction result resides. However, for the sake of
>> simplicity only single uses in the same block are
>> considered now;
>>
>> 4. For the same sake the current version does not deal
>> with partial reductions which would require partial sum
>> merging and careful removal of the scalars that participate
>> in the vector part. The latter gets done automatically
>> by DCE in the case of full reduction vectorization;
>>
>> 5. There is no cost model yet for the reasons mentioned
>> in the paragraphs 3 and 4.
>
> First sorry for the late response.
>
> No, I don't think your approach of bypassing the "rest"
> is OK.  I've once started to implement BB reduction
> support but somehow got distracted IIRC.  Your testcase
> (and the prototype I worked on) still has a (scalar non-grouped)
> store which can be keyed on in SLP discovery phase.
>
> You should be able to "re-use" (by a lot of refactoring I guess)
> the reduction finding code (vect_is_slp_reduction) to see
> wheter such a store is fed by a reduction chain.  Note that
> if you adjust the testcase to have
>
>   sum[0] = a[0] + ... + a[n];
>   sum[1] = b[0] + ... + b[n];
>
> then you'll have a grouped store fed by reductions.  You
> can also consider
>
>   for (i = ...)
>    {
>       sum[i] = a[i*4] + a[i*4+1] + a[i*4+2] + a[i*4+3];
>    }
>
> which we should be able to handle.
>
> So the whole problem of doing BB reductions boils down
> to SLP tree discovery, the rest should be more straight-forward
> (of course code-gen needs to be adapted for the non-loop case
> as well).
>
> It's not the easiest problem you try to tackle btw ;)  May
> I suggest you become familiar with the code by BB vectorizing
> vector CONSTRUCTORs instead?
>
> typedef int v4si __attribute__((vector_size(16)));
>
> v4si foo (int *i, *j)
> {
>    return (v4si) { i[0] + j[0], i[1] + j[1], i[2] + j[2], i[3] + j[3] };
> }
>
> it has the same SLP discovery "issue", this time somewhat
> easier as a CONSTRUCTOR directly plays the role of the
> "grouped store".
>
> Richard.
>
>> Thanks in advance.
>>
>> --
>>    Anton

  reply	other threads:[~2019-01-24 12:04 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-21 13:20 Anton Youdkevitch
2019-01-24 10:48 ` Richard Biener
2019-01-24 12:04   ` Anton Youdkevitch [this message]
2019-01-24 12:16     ` Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5C49A9E0.8030604@bell-sw.com \
    --to=anton.youdkevitch@bell-sw.com \
    --cc=gcc@gcc.gnu.org \
    --cc=richard.guenther@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).