From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-458317-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 103621 invoked by alias); 17 Jul 2017 12:09:20 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 102624 invoked by uid 89); 17 Jul 2017 12:09:19 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-2.2 required=5.0 tests=AWL,BAYES_00,RCVD_IN_DNSWL_NONE,RCVD_IN_SORBS_SPAM,SPF_PASS autolearn=no version=3.3.2 spammy=accident, revise, difficulty, questionable
X-HELO: mail-ua0-f181.google.com
Received: from mail-ua0-f181.google.com (HELO mail-ua0-f181.google.com) (209.85.217.181) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 17 Jul 2017 12:09:16 +0000
Received: by mail-ua0-f181.google.com with SMTP id z22so91807811uah.1        for <gcc-patches@gcc.gnu.org>; Mon, 17 Jul 2017 05:09:15 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;        d=1e100.net; s=20161025;        h=x-gm-message-state:mime-version:in-reply-to:references:from:date         :message-id:subject:to:cc;        bh=5snpltf1FepgWuoqYSkg9F73cF+QQWmjxNCLiDU7hJI=;        b=J8Im27PrUr5Vo0cvQC6WbZl3JMB3G4NXrfAQ65k6xa3u73M9Xjax1oFrH6s5kRhAq8         eJqY5btWf/fw3efJFfR+0XLeSFJlgmFp5V9AiHue7qbt+DpFqiNOZFnEnC4fstCOSwKe         Hu2zCZpRcI2ZblDwuOiBEEIHG3dRa36k0RezwxsBiqWYmB0T3WmS5qQhJv4ZZztrO0X0         pVI97YYn0jO2XW7RV5jyS3Nxer2inoyyKvwakUu0ot8uEpliZ1hWooWn3+UyuAHlI7f3         /Cp5x49d1Kp+bJsWH7lyE4v7v920HmuhulW2bKKdRIny2QwvSyJwabjg4rLg57tB0xSu         HaHg==
X-Gm-Message-State: AIVw112K1/ICvNVt2Y0DxiJq2uMmu70w5tfmfaj5pbAD9Pg3Sp5H1SJ5	JmF0A3JDyYYp5TjmKb+ckkRx19oKOZhs
X-Received: by 10.176.71.87 with SMTP id i23mr11065776uac.150.1500293354068; Mon, 17 Jul 2017 05:09:14 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.103.91.71 with HTTP; Mon, 17 Jul 2017 05:09:13 -0700 (PDT)
In-Reply-To: <CAHFci2_X7F+nNW+-pMmOV97bXpGkdCQU5Yn9xK7J9cdo7bM=kg@mail.gmail.com>
References: <VI1PR0802MB2176F36122DDF66429B0FB3AE7CD0@VI1PR0802MB2176.eurprd08.prod.outlook.com> <CAHFci29yG3+oHwefL44Dx6whLCkYn=yCp-Eg2nMKjSv+ETs0Xg@mail.gmail.com> <CAHFci2_AMQFc857ai1KnabtO4iJcHjksx4ZXATa+vhLOgfj+wA@mail.gmail.com> <CAFiYyc2vH25ZhrWFuJyvqwqHda8SggCP2m66YG8Lc8+43vXDBw@mail.gmail.com> <CAHFci2_TzjRQFecSkckJEX0cBAoNwSpsas=n3oUrqdFujpmU5w@mail.gmail.com> <CAFiYyc3VvAahs1KtNRPiTFAd=cCoy9fvUbVWYkkwErFxXnByMQ@mail.gmail.com> <CAHFci29NWPwbEd_s9+E1fHRuG2kHVSjMcoYwerOk0KarbwdKpg@mail.gmail.com> <CAFiYyc1Beu=yCvpOdgRfvH6tO=42Puh0eXZJBR0hwak7fW-8FA@mail.gmail.com> <CAHFci2-Tbnku=UNMOQtPmyGBXwekLfYcTexvDX5XRv7GMuy4mQ@mail.gmail.com> <CAHFci28gbSZYSiAXCfRbTi=m7wYB9syWc5LEKF5ssAP4eL48NA@mail.gmail.com> <CAKdteObgf-6OGGwELRTfn-5LwF2NZtnwKDXpgzXYca0mmaCTLw@mail.gmail.com> <CAHFci2_X7F+nNW+-pMmOV97bXpGkdCQU5Yn9xK7J9cdo7bM=kg@mail.gmail.com>
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Mon, 17 Jul 2017 12:09:00 -0000
Message-ID: <CAKdteOYH6ZwPe_gT0Ofv7e75DxEaa-SK3migF_XMq1_H0h4_fw@mail.gmail.com>
Subject: Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
To: "Bin.Cheng" <amker.cheng@gmail.com>
Cc: Richard Biener <richard.guenther@gmail.com>, 	"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Content-Type: text/plain; charset="UTF-8"
X-IsSubscribed: yes
X-SW-Source: 2017-07/txt/msg00966.txt.bz2

On 17 July 2017 at 12:06, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Mon, Jul 10, 2017 at 10:32 AM, Christophe Lyon
> <christophe.lyon@linaro.org> wrote:
>> Hi Bin,
>>
>> On 30 June 2017 at 12:43, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>> Rebased V3 for changes in previous patches.  Bootstap and test on
>>>>>>>>>> x86_64 and aarch64.
>>>>>>>>>
>>>>>>>>> why is ldist-12.c no longer distributed?  your comment says it doesn't expose
>>>>>>>>> more "parallelism" but the point is to reduce memory bandwith requirements
>>>>>>>>> which it clearly does.
>>>>>>>>>
>>>>>>>>> Likewise for -13.c, -14.c.  -4.c may be a questionable case but the wording
>>>>>>>>> of "parallelism" still confuses me.
>>>>>>>>>
>>>>>>>>> Can you elaborate on that.  Now onto the patch:
>>>>>>>> Given we don't model data locality or memory bandwidth, whether
>>>>>>>> distribution enables loops that can be executed paralleled becomes the
>>>>>>>> major criteria for distribution.  BTW, I think a good memory stream
>>>>>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>>>>>> etc., appropriate for distribution.
>>>>>>>
>>>>>>> True.  But what means "parallel" here?  ldist-13.c if partitioned into two loops
>>>>>>> can be executed "in parallel"
>>>>>> So if a loop by itself can be vectorized (or so called can be executed
>>>>>> paralleled), we tend to no distribute it into small ones.  But there
>>>>>> is one exception here, if the distributed small loops are recognized
>>>>>> as builtin functions, we still distribute it.  I assume it's generally
>>>>>> better to call builtin memory functions than vectorize it by GCC?
>>>>>
>>>>> Yes.
>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> +   Loop distribution is the dual of loop fusion.  It separates statements
>>>>>>>>> +   of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>>>>>> +   same loop header.  The major goal is to separate statements which may
>>>>>>>>> +   be vectorized from those that can't.  This pass implements distribution
>>>>>>>>> +   in the following steps:
>>>>>>>>>
>>>>>>>>> misses the goal of being a memory stream optimization, not only a vectorization
>>>>>>>>> enabler.  distributing a loop can also reduce register pressure.
>>>>>>>> I will revise the comment, but as explained, enabling more
>>>>>>>> vectorization is the major criteria for distribution to some extend
>>>>>>>> now.
>>>>>>>
>>>>>>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
>>>>>> Let's see if any performance drop will be reported against this patch.
>>>>>> Let's see if we can create a cost model for it.
>>>>>
>>>>> Fine.
>>>> I will run some benchmarks to see if there is breakage.
>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>>>>>> didn't look
>>>>>>>>> into yet).  If you don't use that please introduce it separately.
>>>>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> +             /* Be conservative.  If data references are not well analyzed,
>>>>>>>>> +                or the two data references have the same base address and
>>>>>>>>> +                offset, add dependence and consider it alias to each other.
>>>>>>>>> +                In other words, the dependence can not be resolved by
>>>>>>>>> +                runtime alias check.  */
>>>>>>>>> +             if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>>>>>> +                 || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>>>>>> +                 || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>>>>>> +                 || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>>>>>> +                 || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>>>>>> +                 || res == 0)
>>>>>>>>>
>>>>>>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>>>>>>> a specific case?
>>>>>>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>>>>>>>  Unfortunately, it's factored out vectorizer's usage and doesn't fit
>>>>>>>> here straightforwardly.  Shall I try to further generalize the
>>>>>>>> interface as independence patch to this one?
>>>>>>>
>>>>>>> That would be nice.
>>>>>>>
>>>>>>>>>
>>>>>>>>> +  /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS.  */
>>>>>>>>> +  if (flag_tree_loop_vectorize)
>>>>>>>>> +    {
>>>>>>>>>
>>>>>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>>>>>> on flag_tree_loop_vectorize.  You seem to support versioning w/o
>>>>>>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>>>>>>>> That looks somewhat inconsistent...
>>>>>>>> It is a bit complicated.  In function version_for_distribution_p, we have
>>>>>>>> +
>>>>>>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>>>>>>> +  if (alias_ddrs->length () > 0)
>>>>>>>> +    return true;
>>>>>>>> +
>>>>>>>> +  /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>>>> +     vectorizer is not enable because no other pass can fold it.  */
>>>>>>>> +  if (!flag_tree_loop_vectorize)
>>>>>>>> +    return false;
>>>>>>>> +
>>>>>>>>
>>>>>>>> It means we also versioning loops even if runtime alias check is
>>>>>>>> unnecessary.  I did this because we lack cost model and current
>>>>>>>> distribution may result in too many distribution?  If that's the case,
>>>>>>>> at least vectorizer will remove distributed version loop and fall back
>>>>>>>> to the original one.  Hmm, shall I change it into below code:
>>>>>>>
>>>>>>> Hmm, ok - that really ties things to vectorization apart from the
>>>>>>> builtin recognition.  So what happens if the vectorizer can vectorize
>>>>>>> both the distributed and non-distributed loops?
>>>>>> Hmm, there is no such cases.  Under condition no builtins is
>>>>>> recognized, we wouldn't distribute loop if by itself can be
>>>>>> vectorized.  Does this make sense?  Of course, cost model for memory
>>>>>> behavior can change this behavior and is wanted.
>>>>>
>>>>> So which cases _do_ we split loops then?  "more parallelism" -- but what
>>>>> does that mean exactly?  Is there any testcase that shows the desired
>>>>> splits for vectorization?
>>>> At least one of distributed loop can be executed paralleled while the
>>>> original loop can't.
>>>> Not many, but ldist-26.c is added by one of patch.
>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>>>>>>> +  if (alias_ddrs->length () == 0)
>>>>>>>> +    return false;
>>>>>>>> +
>>>>>>>> +  /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>>>> +     vectorizer is not enable because no other pass can fold it.  */
>>>>>>>> +  if (!flag_tree_loop_vectorize)
>>>>>>>> +    return false;
>>>>>>>> +
>>>>>>>>
>>>>>>>> Then I can remove the check you mentioned in function
>>>>>>>> version_loop_by_alias_check?
>>>>>>>
>>>>>>> Yeah, I guess that would be easier to understand.  Need to update
>>>>>>> the comment above the alias_ddrs check though.
>>>>>> Yes the logic here is complicated.  On one hand, I want to be
>>>>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer
>>>>>> can undo all "unnecessary" distribution before memory behavior is
>>>>>> modeled.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> +  /* Don't version loop if any partition is builtin.  */
>>>>>>>>> +  for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>>>>>> +    {
>>>>>>>>> +      if (partition->kind != PKIND_NORMAL)
>>>>>>>>> +       break;
>>>>>>>>> +    }
>>>>>>>>>
>>>>>>>>> why's that?  Do you handle the case where only a subset of partitions
>>>>>>>> One reason is I generally consider distributed builtin functions as a
>>>>>>>> win, thus distribution won't be canceled later in vectorizer.  Another
>>>>>>>> reason is if all distributed loops are recognized as builtins, we
>>>>>>>> can't really version with current logic because the
>>>>>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
>>>>>>>
>>>>>>> Ah, ok.  I guess the maze was too twisted for me to see what
>>>>>>> version_for_distribution_p
>>>>>>> does ;)
>>>>>>>
>>>>>>>>> require a runtime alias check to be generated?  Thus from a loop
>>>>>>>>> generate three loops, only two of them being versioned?  The
>>>>>>>>> complication would be that both the runtime alias and non-alias
>>>>>>>>> versions would be "transformed".  Or do we rely on recursively
>>>>>>>>> distributing the distribution result, thus if we have partitions that
>>>>>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>>>>>> and recurse on those?
>>>>>>>> No, this is not precisely handled now, the pass will version the whole
>>>>>>>> loop once.  Though I think it's not very difficult to do two stages
>>>>>>>> distribution, I am not sure how useful it is.
>>>>>>>
>>>>>>> Not sure either.
>>>>>>>
>>>>>>> So to understand you're looking at loop-distribution as vectorization enabler
>>>>>>> and pattern detector.  I think that is reasonable without a better cost model.
>>>>>>>
>>>>>>> Note that I think the state before your patches had the sensible cost-model
>>>>>>> that it fused very conservatively and just produced the maximum distribution
>>>>>>> with the idea that the looping overhead itself is cheap.  Note that with
>>>>>>> a more "maximum" distribution the vectorizer also gets the chance to
>>>>>>> do "partial vectorization" in case profitability is different.  Of course the
>>>>>>> setup cost may offset that in the case all loops end up vectorized...
>>>>>> Ideally, we have cost model for memory behavior in distribution.  If
>>>>>> we know distribution is beneficial in loop distribution, we can simply
>>>>>> distribute it; otherwise we pass distribution cost (including memory
>>>>>> cost as well as runtime alias check cost as an argument to
>>>>>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit
>>>>>> together with vectorization.
>>>>>
>>>>> Yes.  The old cost model wasn't really one thus loop distribution was never
>>>>> enabled by default.
>>>>>
>>>
>>> Hi,
>>> This is updated patch.  It makes below changes as well as renaming
>>> ldist_alias_id to orig_loop_num.
>>> 1) Simplifies relation with flag_tree_loop_vectorization.  Now it only
>>> versions loop if runtime alias check is needed.
>>> 2) Record the new versioned loop as original loop in order to simplify
>>> dominance working routine.  It also makes sense since versioned loop
>>> is the one same with the original loop.
>>>
>>> No change for ChangeLog entry.  Bootstrap and test.  Is it OK?
>>>
>>
>> I've noticed that this patch introduces regressions on armeb:
>> FAIL:    gcc.dg/torture/pr52028.c   -O3 -fomit-frame-pointer
>> -funroll-loops -fpeel-loops -ftracer -finline-functions  execution
>> test
>> FAIL:    gcc.dg/torture/pr52028.c   -O3 -g  execution test
>>
>> For instance for
>> armeb-none-linux-gnueabihf
>> --with-cpu=cortex-a9
>> --with-fpu=neon-fp16
> Hi Christophe,
> I am having difficulty in reproducing this one.  According to you test
> results at
> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/report-build-info.html
> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/armeb-none-linux-gnueabihf/fail-gcc-rh60-armeb-none-linux-gnueabihf-arm-cortex-a9-neon-fp16.txt
>
> It started at least from r248356?  Also I didn't get any difference in
> generated assembly between r248356/r248342 for my local toolchain.
> Maybe because I used different configuration options?
>

Hi Bin,

Sorry I should have shared more details.
I noticed the regression when you committed this patch (r249994):
http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/249994/report-build-info.html

Checking in my validation logs, it was also present
between r248356 and r249990 (your patch 9/13 Simply cost model merges
partitions with the same references)
and is failing since r249994 (this patch)

Was it "fixed" by accident by r249990 and latent until r249994 ?

HTH

Thanks,

Christophe

> Thanks,
> bin
>>
>> Christophe
>>
>>> Thanks,
>>> bin