From: "Bin.Cheng"
Date: Wed, 28 Jun 2017 13:09:00 -0000
Subject: Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
To: Richard Biener
Cc: "gcc-patches@gcc.gnu.org"
X-SW-Source: 2017-06/txt/msg02157.txt.bz2

On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener wrote:
> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng wrote:
>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener wrote:
>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng wrote:
>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener wrote:
>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng wrote:
>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng wrote:
>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng wrote:
>>>>>>>> Hi,
>>>>>> Rebased V3 for changes in previous patches.  Bootstrapped and tested on
>>>>>> x86_64 and aarch64.
>>>>>
>>>>> Why is ldist-12.c no longer distributed?  Your comment says it doesn't expose
>>>>> more "parallelism", but the point is to reduce memory bandwidth requirements,
>>>>> which it clearly does.
>>>>>
>>>>> Likewise for -13.c, -14.c.  -4.c may be a questionable case, but the wording
>>>>> of "parallelism" still confuses me.
>>>>>
>>>>> Can you elaborate on that?  Now onto the patch:
>>>> Given we don't model data locality or memory bandwidth, whether
>>>> distribution enables loops that can be executed in parallel has become the
>>>> major criterion for distribution.  BTW, I think a good memory stream
>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>> etc., appropriate for distribution.
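
For illustration only (this is not the actual ldist-12.c/-13.c testcase, and
the names f/f_distributed are made up): a sketch of the kind of loop being
discussed.  The second statement is a recurrence, so the fused loop cannot be
vectorized as a whole, while distribution exposes one loop that can be:

  void
  f (int *a, int *b, int *c, int n)
  {
    for (int i = 1; i < n; i++)
      {
        a[i] = b[i] + 1;          /* independent across iterations */
        c[i] = c[i - 1] + a[i];   /* recurrence on c[], blocks vectorization */
      }
  }

  /* What distribution conceptually produces; only valid when a, b and c
     do not overlap, hence the runtime alias check discussed below.  */
  void
  f_distributed (int *a, int *b, int *c, int n)
  {
    for (int i = 1; i < n; i++)
      a[i] = b[i] + 1;            /* vectorizable on its own */
    for (int i = 1; i < n; i++)
      c[i] = c[i - 1] + a[i];     /* still sequential */
  }
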
>>>
>>> True.  But what does "parallel" mean here?  ldist-13.c, if partitioned into two loops,
>>> can be executed "in parallel"
>> So if a loop by itself can be vectorized (or, so to speak, can be executed
>> in parallel), we tend not to distribute it into smaller ones.  But there
>> is one exception: if the distributed small loops are recognized
>> as builtin functions, we still distribute the loop.  I assume it's generally
>> better to call builtin memory functions than to vectorize the loop in GCC?
>
> Yes.
>
>>>
>>>>>
>>>>> +   Loop distribution is the dual of loop fusion.  It separates statements
>>>>> +   of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>> +   same loop header.  The major goal is to separate statements which may
>>>>> +   be vectorized from those that can't.  This pass implements distribution
>>>>> +   in the following steps:
>>>>>
>>>>> misses the goal of being a memory stream optimization, not only a vectorization
>>>>> enabler.  Distributing a loop can also reduce register pressure.
>>>> I will revise the comment, but as explained, enabling more
>>>> vectorization is the major criterion for distribution to some extent
>>>> now.
>>>
>>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
>> Let's see if any performance drop will be reported against this patch.
>> Let's see if we can create a cost model for it.
>
> Fine.
I will run some benchmarks to see if there is breakage.
>
>>>
>>>>>
>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>> didn't look into yet).  If you don't use that please introduce it separately.
>>>> Hmm, yes, it is introduced in patch [01/n] and set in this patch.
>>>>
>>>>>
>>>>> +      /* Be conservative.  If data references are not well analyzed,
>>>>> +        or the two data references have the same base address and
>>>>> +        offset, add dependence and consider it alias to each other.
>>>>> +        In other words, the dependence can not be resolved by
>>>>> +        runtime alias check.  */
>>>>> +      if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>> +         || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>> +         || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>> +         || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>> +         || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>> +         || res == 0)
>>>>>
>>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>>> a specific case?
>>>> I guess you mean runtime_alias_check_p, which I factored out previously?
>>>> Unfortunately, it was factored out for the vectorizer's usage and doesn't fit
>>>> here straightforwardly.  Shall I try to further generalize the
>>>> interface as an independent patch from this one?
>>>
>>> That would be nice.
>>>
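
As a rough, source-level sketch only (not the code the patch emits, and
continuing the made-up example f above): the versioning under a runtime alias
check conceptually looks like the following.  In the patch the version
condition is actually a call to IFN_LOOP_DIST_ALIAS that the vectorizer later
folds or cancels, and only one pair of references is checked here for brevity;
the plain pointer comparison is just an illustrative stand-in:

  /* Conceptual view of a loop versioned under a runtime alias check;
     the name f_versioned is hypothetical.  */
  void
  f_versioned (int *a, int *b, int *c, int n)
  {
    if (c + n <= a || a + n <= c)       /* a[] and c[] do not overlap */
      {
        /* distributed copy */
        for (int i = 1; i < n; i++)
          a[i] = b[i] + 1;
        for (int i = 1; i < n; i++)
          c[i] = c[i - 1] + a[i];
      }
    else
      {
        /* original loop, kept for the may-alias case */
        for (int i = 1; i < n; i++)
          {
            a[i] = b[i] + 1;
            c[i] = c[i - 1] + a[i];
          }
      }
  }
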
>>>>>
>>>>> +  /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS.  */
>>>>> +  if (flag_tree_loop_vectorize)
>>>>> +    {
>>>>>
>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>> on flag_tree_loop_vectorize.  You seem to support versioning w/o
>>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>>>> That looks somewhat inconsistent...
>>>> It is a bit complicated.  In function version_for_distribution_p, we have
>>>> +
>>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>>> +  if (alias_ddrs->length () > 0)
>>>> +    return true;
>>>> +
>>>> +  /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if the
>>>> +     vectorizer is not enabled because no other pass can fold it.  */
>>>> +  if (!flag_tree_loop_vectorize)
>>>> +    return false;
>>>> +
>>>>
>>>> It means we also version loops even if the runtime alias check is
>>>> unnecessary.  I did this because we lack a cost model and the current
>>>> approach may result in too much distribution.  If that's the case,
>>>> at least the vectorizer will remove the distributed version of the loop
>>>> and fall back to the original one.  Hmm, shall I change it into the code below:
>>>
>>> Hmm, ok - that really ties things to vectorization apart from the
>>> builtin recognition.  So what happens if the vectorizer can vectorize
>>> both the distributed and non-distributed loops?
>> Hmm, there are no such cases.  Under the condition that no builtins are
>> recognized, we wouldn't distribute a loop if it can be vectorized by
>> itself.  Does this make sense?  Of course, a cost model for memory
>> behavior can change this behavior and is wanted.
>
> So which cases _do_ we split loops then?  "more parallelism" -- but what
> does that mean exactly?  Is there any testcase that shows the desired
> splits for vectorization?
At least one of the distributed loops can be executed in parallel while
the original loop can't.  Not many, but ldist-26.c is added by one of the
patches.
>
>>>
>>>> +
>>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>>> +  if (alias_ddrs->length () == 0)
>>>> +    return false;
>>>> +
>>>> +  /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if the
>>>> +     vectorizer is not enabled because no other pass can fold it.  */
>>>> +  if (!flag_tree_loop_vectorize)
>>>> +    return false;
>>>> +
>>>>
>>>> Then I can remove the check you mentioned in function
>>>> version_loop_by_alias_check?
>>>
>>> Yeah, I guess that would be easier to understand.  Need to update
>>> the comment above the alias_ddrs check though.
>> Yes, the logic here is complicated.  On one hand, I want to be
>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that the vectorizer
>> can undo all "unnecessary" distribution before memory behavior is
>> modeled.
>>
>>>
>>>>>
>>>>> +  /* Don't version loop if any partition is builtin.  */
>>>>> +  for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>> +    {
>>>>> +      if (partition->kind != PKIND_NORMAL)
>>>>> +       break;
>>>>> +    }
>>>>>
>>>>> why's that?  Do you handle the case where only a subset of partitions
>>>> One reason is that I generally consider distributed builtin functions a
>>>> win, thus distribution won't be canceled later in the vectorizer.  Another
>>>> reason is that if all distributed loops are recognized as builtins, we
>>>> can't really version with the current logic because the
>>>> IFN_LOOP_DIST_ALIAS won't be folded in the vectorizer.
>>>
>>> Ah, ok.  I guess the maze was too twisted for me to see what
>>> version_for_distribution_p does ;)
>>>
>>>>> require a runtime alias check to be generated?  Thus from a loop
>>>>> generate three loops, only two of them being versioned?  The
>>>>> complication would be that both the runtime alias and non-alias
>>>>> versions would be "transformed".  Or do we rely on recursively
>>>>> distributing the distribution result, thus if we have partitions that
>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>> and recurse on those?
>>>> No, this is not precisely handled now; the pass will version the whole
>>>> loop once.  Though I think it's not very difficult to do two-stage
>>>> distribution, I am not sure how useful it is.
>>>
>>> Not sure either.
>>>
>>> So to understand: you're looking at loop distribution as a vectorization enabler
>>> and pattern detector.  I think that is reasonable without a better cost model.
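
To illustrate the "pattern detector" side (again only a made-up sketch with
hypothetical names g/g_distributed, not a testcase from the patch): when one
of the distributed partitions matches a known memory idiom, it is emitted as a
builtin call instead of a loop, which is the exception mentioned above where
distribution is kept even without the vectorizer:

  #include <string.h>

  void
  g (int *a, int *b, int n)
  {
    for (int i = 0; i < n; i++)
      {
        a[i] = 0;               /* this partition matches the memset idiom */
        b[i] = b[i] * b[i];     /* this partition stays a (vectorizable) loop */
      }
  }

  /* Conceptually, after distribution plus builtin recognition
     (again subject to an a/b runtime alias check).  */
  void
  g_distributed (int *a, int *b, int n)
  {
    if (n > 0)
      memset (a, 0, (size_t) n * sizeof (int));
    for (int i = 0; i < n; i++)
      b[i] = b[i] * b[i];
  }
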
>>>
>>> Note that I think the state before your patches had the sensible cost model
>>> that it fused very conservatively and just produced the maximum distribution,
>>> with the idea that the looping overhead itself is cheap.  Note that with
>>> a more "maximum" distribution the vectorizer also gets the chance to
>>> do "partial vectorization" in case profitability is different.  Of course the
>>> setup cost may offset that in the case all loops end up vectorized...
>> Ideally, we would have a cost model for memory behavior in distribution.  If
>> we know distribution is beneficial, we can simply
>> distribute the loop; otherwise we pass the distribution cost (including memory
>> cost as well as runtime alias check cost) as an argument to
>> IFN_LOOP_DIST_ALIAS, so the vectorizer can weigh the cost/benefit
>> together with vectorization.
>
> Yes.  The old cost model wasn't really one, thus loop distribution was never
> enabled by default.
>
>>>
>>> So in reality the vectorizer itself should "distribute away"
>>> non-vectorizable parts of the loop ... :/
>> Do we really want to put more and more things into the vectorizer? :)
>> Michael's opinion (IIUC, about introducing different loop
>> optimizations as building blocks for a loop optimization infrastructure)
>> is quite inspiring to me.  In this way, we can at least make the different
>> optimizations conservative.  In general, it's much worse to do an
>> aggressive (unnecessary) transformation than to simply miss the
>> opportunity?
>
> Heh, but with an increasing number of "different loop optimizations" we
> miss an overall goal.  Either each pass knows what the other
> would do (vectorize or predcom, for example), or we need to have
> a meta-engine driving the building blocks, with the building blocks
> divided into an analysis + costing phase and a transform phase (and if ever
> two transforms shall apply, their analysis may not be invalidated
> by an earlier one).  Integrating the simple building blocks into one
> framework is what is sexy about GRAPHITE ...
>
> Predcom might be more beneficial than vectorization
> with v2df vectors, for example.
We need to be careful/conservative about transformations with complicated
impact.  But..., as talk is easy, I'd better stop here. :)

Thanks,
bin
>
> Richard.
>
>>
>> Thanks,
>> bin
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> bin