From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 103621 invoked by alias); 17 Jul 2017 12:09:20 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 102624 invoked by uid 89); 17 Jul 2017 12:09:19 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.2 required=5.0 tests=AWL,BAYES_00,RCVD_IN_DNSWL_NONE,RCVD_IN_SORBS_SPAM,SPF_PASS autolearn=no version=3.3.2 spammy=accident, revise, difficulty, questionable X-HELO: mail-ua0-f181.google.com Received: from mail-ua0-f181.google.com (HELO mail-ua0-f181.google.com) (209.85.217.181) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 17 Jul 2017 12:09:16 +0000 Received: by mail-ua0-f181.google.com with SMTP id z22so91807811uah.1 for ; Mon, 17 Jul 2017 05:09:15 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=5snpltf1FepgWuoqYSkg9F73cF+QQWmjxNCLiDU7hJI=; b=J8Im27PrUr5Vo0cvQC6WbZl3JMB3G4NXrfAQ65k6xa3u73M9Xjax1oFrH6s5kRhAq8 eJqY5btWf/fw3efJFfR+0XLeSFJlgmFp5V9AiHue7qbt+DpFqiNOZFnEnC4fstCOSwKe Hu2zCZpRcI2ZblDwuOiBEEIHG3dRa36k0RezwxsBiqWYmB0T3WmS5qQhJv4ZZztrO0X0 pVI97YYn0jO2XW7RV5jyS3Nxer2inoyyKvwakUu0ot8uEpliZ1hWooWn3+UyuAHlI7f3 /Cp5x49d1Kp+bJsWH7lyE4v7v920HmuhulW2bKKdRIny2QwvSyJwabjg4rLg57tB0xSu HaHg== X-Gm-Message-State: AIVw112K1/ICvNVt2Y0DxiJq2uMmu70w5tfmfaj5pbAD9Pg3Sp5H1SJ5 JmF0A3JDyYYp5TjmKb+ckkRx19oKOZhs X-Received: by 10.176.71.87 with SMTP id i23mr11065776uac.150.1500293354068; Mon, 17 Jul 2017 05:09:14 -0700 (PDT) MIME-Version: 1.0 Received: by 10.103.91.71 with HTTP; Mon, 17 Jul 2017 05:09:13 -0700 (PDT) In-Reply-To: References: From: Christophe Lyon Date: Mon, 17 Jul 2017 12:09:00 -0000 Message-ID: Subject: Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check To: "Bin.Cheng" Cc: Richard Biener , "gcc-patches@gcc.gnu.org" Content-Type: text/plain; charset="UTF-8" X-IsSubscribed: yes X-SW-Source: 2017-07/txt/msg00966.txt.bz2 On 17 July 2017 at 12:06, Bin.Cheng wrote: > On Mon, Jul 10, 2017 at 10:32 AM, Christophe Lyon > wrote: >> Hi Bin, >> >> On 30 June 2017 at 12:43, Bin.Cheng wrote: >>> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng wrote: >>>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener >>>> wrote: >>>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng wrote: >>>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener >>>>>> wrote: >>>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng wrote: >>>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener >>>>>>>> wrote: >>>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng wrote: >>>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng wrote: >>>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng wrote: >>>>>>>>>>>> Hi, >>>>>>>>>> Rebased V3 for changes in previous patches. Bootstap and test on >>>>>>>>>> x86_64 and aarch64. >>>>>>>>> >>>>>>>>> why is ldist-12.c no longer distributed? your comment says it doesn't expose >>>>>>>>> more "parallelism" but the point is to reduce memory bandwith requirements >>>>>>>>> which it clearly does. >>>>>>>>> >>>>>>>>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording >>>>>>>>> of "parallelism" still confuses me. >>>>>>>>> >>>>>>>>> Can you elaborate on that. Now onto the patch: >>>>>>>> Given we don't model data locality or memory bandwidth, whether >>>>>>>> distribution enables loops that can be executed paralleled becomes the >>>>>>>> major criteria for distribution. BTW, I think a good memory stream >>>>>>>> optimization model shouldn't consider small loops as in ldist-12.c, >>>>>>>> etc., appropriate for distribution. >>>>>>> >>>>>>> True. But what means "parallel" here? ldist-13.c if partitioned into two loops >>>>>>> can be executed "in parallel" >>>>>> So if a loop by itself can be vectorized (or so called can be executed >>>>>> paralleled), we tend to no distribute it into small ones. But there >>>>>> is one exception here, if the distributed small loops are recognized >>>>>> as builtin functions, we still distribute it. I assume it's generally >>>>>> better to call builtin memory functions than vectorize it by GCC? >>>>> >>>>> Yes. >>>>> >>>>>>> >>>>>>>>> >>>>>>>>> + Loop distribution is the dual of loop fusion. It separates statements >>>>>>>>> + of a loop (or loop nest) into multiple loops (or loop nests) with the >>>>>>>>> + same loop header. The major goal is to separate statements which may >>>>>>>>> + be vectorized from those that can't. This pass implements distribution >>>>>>>>> + in the following steps: >>>>>>>>> >>>>>>>>> misses the goal of being a memory stream optimization, not only a vectorization >>>>>>>>> enabler. distributing a loop can also reduce register pressure. >>>>>>>> I will revise the comment, but as explained, enabling more >>>>>>>> vectorization is the major criteria for distribution to some extend >>>>>>>> now. >>>>>>> >>>>>>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC. >>>>>> Let's see if any performance drop will be reported against this patch. >>>>>> Let's see if we can create a cost model for it. >>>>> >>>>> Fine. >>>> I will run some benchmarks to see if there is breakage. >>>>> >>>>>>> >>>>>>>>> >>>>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I >>>>>>>>> didn't look >>>>>>>>> into yet). If you don't use that please introduce it separately. >>>>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch. >>>>>>>> >>>>>>>>> >>>>>>>>> + /* Be conservative. If data references are not well analyzed, >>>>>>>>> + or the two data references have the same base address and >>>>>>>>> + offset, add dependence and consider it alias to each other. >>>>>>>>> + In other words, the dependence can not be resolved by >>>>>>>>> + runtime alias check. */ >>>>>>>>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2) >>>>>>>>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2) >>>>>>>>> + || !DR_INIT (dr1) || !DR_INIT (dr2) >>>>>>>>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1)) >>>>>>>>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2)) >>>>>>>>> + || res == 0) >>>>>>>>> >>>>>>>>> ISTR a helper that computes whether we can handle a runtime alias check for >>>>>>>>> a specific case? >>>>>>>> I guess you mean runtime_alias_check_p that I factored out previously? >>>>>>>> Unfortunately, it's factored out vectorizer's usage and doesn't fit >>>>>>>> here straightforwardly. Shall I try to further generalize the >>>>>>>> interface as independence patch to this one? >>>>>>> >>>>>>> That would be nice. >>>>>>> >>>>>>>>> >>>>>>>>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */ >>>>>>>>> + if (flag_tree_loop_vectorize) >>>>>>>>> + { >>>>>>>>> >>>>>>>>> so at this point I'd condition the whole runtime-alias check generating >>>>>>>>> on flag_tree_loop_vectorize. You seem to support versioning w/o >>>>>>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize. >>>>>>>>> That looks somewhat inconsistent... >>>>>>>> It is a bit complicated. In function version_for_distribution_p, we have >>>>>>>> + >>>>>>>> + /* Need to version loop if runtime alias check is necessary. */ >>>>>>>> + if (alias_ddrs->length () > 0) >>>>>>>> + return true; >>>>>>>> + >>>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if >>>>>>>> + vectorizer is not enable because no other pass can fold it. */ >>>>>>>> + if (!flag_tree_loop_vectorize) >>>>>>>> + return false; >>>>>>>> + >>>>>>>> >>>>>>>> It means we also versioning loops even if runtime alias check is >>>>>>>> unnecessary. I did this because we lack cost model and current >>>>>>>> distribution may result in too many distribution? If that's the case, >>>>>>>> at least vectorizer will remove distributed version loop and fall back >>>>>>>> to the original one. Hmm, shall I change it into below code: >>>>>>> >>>>>>> Hmm, ok - that really ties things to vectorization apart from the >>>>>>> builtin recognition. So what happens if the vectorizer can vectorize >>>>>>> both the distributed and non-distributed loops? >>>>>> Hmm, there is no such cases. Under condition no builtins is >>>>>> recognized, we wouldn't distribute loop if by itself can be >>>>>> vectorized. Does this make sense? Of course, cost model for memory >>>>>> behavior can change this behavior and is wanted. >>>>> >>>>> So which cases _do_ we split loops then? "more parallelism" -- but what >>>>> does that mean exactly? Is there any testcase that shows the desired >>>>> splits for vectorization? >>>> At least one of distributed loop can be executed paralleled while the >>>> original loop can't. >>>> Not many, but ldist-26.c is added by one of patch. >>>>> >>>>>>> >>>>>>>> + >>>>>>>> + /* Need to version loop if runtime alias check is necessary. */ >>>>>>>> + if (alias_ddrs->length () == 0) >>>>>>>> + return false; >>>>>>>> + >>>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if >>>>>>>> + vectorizer is not enable because no other pass can fold it. */ >>>>>>>> + if (!flag_tree_loop_vectorize) >>>>>>>> + return false; >>>>>>>> + >>>>>>>> >>>>>>>> Then I can remove the check you mentioned in function >>>>>>>> version_loop_by_alias_check? >>>>>>> >>>>>>> Yeah, I guess that would be easier to understand. Need to update >>>>>>> the comment above the alias_ddrs check though. >>>>>> Yes the logic here is complicated. On one hand, I want to be >>>>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer >>>>>> can undo all "unnecessary" distribution before memory behavior is >>>>>> modeled. >>>>>> >>>>>> >>>>>>> >>>>>>>>> >>>>>>>>> + /* Don't version loop if any partition is builtin. */ >>>>>>>>> + for (i = 0; partitions->iterate (i, &partition); ++i) >>>>>>>>> + { >>>>>>>>> + if (partition->kind != PKIND_NORMAL) >>>>>>>>> + break; >>>>>>>>> + } >>>>>>>>> >>>>>>>>> why's that? Do you handle the case where only a subset of partitions >>>>>>>> One reason is I generally consider distributed builtin functions as a >>>>>>>> win, thus distribution won't be canceled later in vectorizer. Another >>>>>>>> reason is if all distributed loops are recognized as builtins, we >>>>>>>> can't really version with current logic because the >>>>>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer. >>>>>>> >>>>>>> Ah, ok. I guess the maze was too twisted for me to see what >>>>>>> version_for_distribution_p >>>>>>> does ;) >>>>>>> >>>>>>>>> require a runtime alias check to be generated? Thus from a loop >>>>>>>>> generate three loops, only two of them being versioned? The >>>>>>>>> complication would be that both the runtime alias and non-alias >>>>>>>>> versions would be "transformed". Or do we rely on recursively >>>>>>>>> distributing the distribution result, thus if we have partitions that >>>>>>>>> can be handled w/o runtime alias check fuse the remaining partitions >>>>>>>>> and recurse on those? >>>>>>>> No, this is not precisely handled now, the pass will version the whole >>>>>>>> loop once. Though I think it's not very difficult to do two stages >>>>>>>> distribution, I am not sure how useful it is. >>>>>>> >>>>>>> Not sure either. >>>>>>> >>>>>>> So to understand you're looking at loop-distribution as vectorization enabler >>>>>>> and pattern detector. I think that is reasonable without a better cost model. >>>>>>> >>>>>>> Note that I think the state before your patches had the sensible cost-model >>>>>>> that it fused very conservatively and just produced the maximum distribution >>>>>>> with the idea that the looping overhead itself is cheap. Note that with >>>>>>> a more "maximum" distribution the vectorizer also gets the chance to >>>>>>> do "partial vectorization" in case profitability is different. Of course the >>>>>>> setup cost may offset that in the case all loops end up vectorized... >>>>>> Ideally, we have cost model for memory behavior in distribution. If >>>>>> we know distribution is beneficial in loop distribution, we can simply >>>>>> distribute it; otherwise we pass distribution cost (including memory >>>>>> cost as well as runtime alias check cost as an argument to >>>>>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit >>>>>> together with vectorization. >>>>> >>>>> Yes. The old cost model wasn't really one thus loop distribution was never >>>>> enabled by default. >>>>> >>> >>> Hi, >>> This is updated patch. It makes below changes as well as renaming >>> ldist_alias_id to orig_loop_num. >>> 1) Simplifies relation with flag_tree_loop_vectorization. Now it only >>> versions loop if runtime alias check is needed. >>> 2) Record the new versioned loop as original loop in order to simplify >>> dominance working routine. It also makes sense since versioned loop >>> is the one same with the original loop. >>> >>> No change for ChangeLog entry. Bootstrap and test. Is it OK? >>> >> >> I've noticed that this patch introduces regressions on armeb: >> FAIL: gcc.dg/torture/pr52028.c -O3 -fomit-frame-pointer >> -funroll-loops -fpeel-loops -ftracer -finline-functions execution >> test >> FAIL: gcc.dg/torture/pr52028.c -O3 -g execution test >> >> For instance for >> armeb-none-linux-gnueabihf >> --with-cpu=cortex-a9 >> --with-fpu=neon-fp16 > Hi Christophe, > I am having difficulty in reproducing this one. According to you test > results at > http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/report-build-info.html > http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/armeb-none-linux-gnueabihf/fail-gcc-rh60-armeb-none-linux-gnueabihf-arm-cortex-a9-neon-fp16.txt > > It started at least from r248356? Also I didn't get any difference in > generated assembly between r248356/r248342 for my local toolchain. > Maybe because I used different configuration options? > Hi Bin, Sorry I should have shared more details. I noticed the regression when you committed this patch (r249994): http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/249994/report-build-info.html Checking in my validation logs, it was also present between r248356 and r249990 (your patch 9/13 Simply cost model merges partitions with the same references) and is failing since r249994 (this patch) Was it "fixed" by accident by r249990 and latent until r249994 ? HTH Thanks, Christophe > Thanks, > bin >> >> Christophe >> >>> Thanks, >>> bin