From: "Bin.Cheng"
Date: Wed, 28 Jun 2017 13:09:00 -0000
Subject: Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
To: Richard Biener
Cc: "gcc-patches@gcc.gnu.org"
X-SW-Source: 2017-06/txt/msg02157.txt.bz2

On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener wrote:
> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng wrote:
>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener wrote:
>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng wrote:
>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener wrote:
>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng wrote:
>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng wrote:
>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng wrote:
>>>>>>>> Hi,
>>>>>> Rebased V3 for changes in previous patches.  Bootstrapped and tested on
>>>>>> x86_64 and aarch64.
>>>>>
>>>>> Why is ldist-12.c no longer distributed?  Your comment says it doesn't expose
>>>>> more "parallelism", but the point is to reduce memory bandwidth requirements,
>>>>> which it clearly does.
>>>>>
>>>>> Likewise for -13.c, -14.c.  -4.c may be a questionable case, but the wording
>>>>> of "parallelism" still confuses me.
>>>>>
>>>>> Can you elaborate on that?  Now onto the patch:
>>>> Given we don't model data locality or memory bandwidth, whether
>>>> distribution enables loops that can be executed in parallel has become the
>>>> major criterion for distribution.  BTW, I think a good memory stream
>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>> etc., appropriate for distribution.
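
For illustration only (this is not the actual ldist-12.c/-13.c testcase, and
the names f/f_distributed are made up): a sketch of the kind of loop being
discussed.  The second statement is a recurrence, so the fused loop cannot be
vectorized as a whole, while distribution exposes one loop that can be:

  void
  f (int *a, int *b, int *c, int n)
  {
    for (int i = 1; i < n; i++)
      {
        a[i] = b[i] + 1;          /* independent across iterations */
        c[i] = c[i - 1] + a[i];   /* recurrence on c[], blocks vectorization */
      }
  }

  /* What distribution conceptually produces; only valid when a, b and c
     do not overlap, hence the runtime alias check discussed below.  */
  void
  f_distributed (int *a, int *b, int *c, int n)
  {
    for (int i = 1; i < n; i++)
      a[i] = b[i] + 1;            /* vectorizable on its own */
    for (int i = 1; i < n; i++)
      c[i] = c[i - 1] + a[i];     /* still sequential */
  }
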
>>>
>>> True.  But what does "parallel" mean here?  ldist-13.c, if partitioned into two loops,
>>> can be executed "in parallel"
>> So if a loop by itself can be vectorized (or, so to speak, can be executed
>> in parallel), we tend not to distribute it into smaller ones.  But there
>> is one exception: if the distributed small loops are recognized
>> as builtin functions, we still distribute the loop.  I assume it's generally
>> better to call builtin memory functions than to vectorize the loop in GCC?
>
> Yes.
>
>>>
>>>>>
>>>>> +   Loop distribution is the dual of loop fusion.  It separates statements
>>>>> +   of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>> +   same loop header.  The major goal is to separate statements which may
>>>>> +   be vectorized from those that can't.  This pass implements distribution
>>>>> +   in the following steps:
>>>>>
>>>>> misses the goal of being a memory stream optimization, not only a vectorization
>>>>> enabler.  Distributing a loop can also reduce register pressure.
>>>> I will revise the comment, but as explained, enabling more
>>>> vectorization is the major criterion for distribution to some extent
>>>> now.
>>>
>>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
>> Let's see if any performance drop will be reported against this patch.
>> Let's see if we can create a cost model for it.
>
> Fine.
I will run some benchmarks to see if there is breakage.
>
>>>
>>>>>
>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>> didn't look into yet).  If you don't use that please introduce it separately.
>>>> Hmm, yes, it is introduced in patch [01/n] and set in this patch.
>>>>
>>>>>
>>>>> +      /* Be conservative.  If data references are not well analyzed,
>>>>> +        or the two data references have the same base address and
>>>>> +        offset, add dependence and consider it alias to each other.
>>>>> +        In other words, the dependence can not be resolved by
>>>>> +        runtime alias check.  */
>>>>> +      if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>> +         || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>> +         || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>> +         || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>> +         || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>> +         || res == 0)
>>>>>
>>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>>> a specific case?
>>>> I guess you mean runtime_alias_check_p, which I factored out previously?
>>>> Unfortunately, it was factored out for the vectorizer's usage and doesn't fit
>>>> here straightforwardly.  Shall I try to further generalize the
>>>> interface as an independent patch from this one?
>>>
>>> That would be nice.
>>>
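
As a rough, source-level sketch only (not the code the patch emits, and
continuing the made-up example f above): the versioning under a runtime alias
check conceptually looks like the following.  In the patch the version
condition is actually a call to IFN_LOOP_DIST_ALIAS that the vectorizer later
folds or cancels, and only one pair of references is checked here for brevity;
the plain pointer comparison is just an illustrative stand-in:

  /* Conceptual view of a loop versioned under a runtime alias check;
     the name f_versioned is hypothetical.  */
  void
  f_versioned (int *a, int *b, int *c, int n)
  {
    if (c + n <= a || a + n <= c)       /* a[] and c[] do not overlap */
      {
        /* distributed copy */
        for (int i = 1; i < n; i++)
          a[i] = b[i] + 1;
        for (int i = 1; i < n; i++)
          c[i] = c[i - 1] + a[i];
      }
    else
      {
        /* original loop, kept for the may-alias case */
        for (int i = 1; i < n; i++)
          {
            a[i] = b[i] + 1;
            c[i] = c[i - 1] + a[i];
          }
      }
  }
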
>>>>>
>>>>> +  /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS.  */
>>>>> +  if (flag_tree_loop_vectorize)
>>>>> +    {
>>>>>
>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>> on flag_tree_loop_vectorize.  You seem to support versioning w/o
>>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>>>> That looks somewhat inconsistent...
>>>> It is a bit complicated.  In function version_for_distribution_p, we have
>>>> +
>>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>>> +  if (alias_ddrs->length () > 0)
>>>> +    return true;
>>>> +
>>>> +  /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if the
>>>> +     vectorizer is not enabled because no other pass can fold it.  */
>>>> +  if (!flag_tree_loop_vectorize)
>>>> +    return false;
>>>> +
>>>>
>>>> It means we also version loops even if the runtime alias check is
>>>> unnecessary.  I did this because we lack a cost model and the current
>>>> approach may result in too much distribution.  If that's the case,
>>>> at least the vectorizer will remove the distributed version of the loop
>>>> and fall back to the original one.  Hmm, shall I change it into the code below:
>>>
>>> Hmm, ok - that really ties things to vectorization apart from the
>>> builtin recognition.  So what happens if the vectorizer can vectorize
>>> both the distributed and non-distributed loops?
>> Hmm, there are no such cases.  Under the condition that no builtins are
>> recognized, we wouldn't distribute a loop if it can be vectorized by
>> itself.  Does this make sense?  Of course, a cost model for memory
>> behavior can change this behavior and is wanted.
>
> So which cases _do_ we split loops then?  "more parallelism" -- but what
> does that mean exactly?  Is there any testcase that shows the desired
> splits for vectorization?
At least one of the distributed loops can be executed in parallel while
the original loop can't.  Not many, but ldist-26.c is added by one of the
patches.
>
>>>
>>>> +
>>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>>> +  if (alias_ddrs->length () == 0)
>>>> +    return false;
>>>> +
>>>> +  /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if the
>>>> +     vectorizer is not enabled because no other pass can fold it.  */
>>>> +  if (!flag_tree_loop_vectorize)
>>>> +    return false;
>>>> +
>>>>
>>>> Then I can remove the check you mentioned in function
>>>> version_loop_by_alias_check?
>>>
>>> Yeah, I guess that would be easier to understand.  Need to update
>>> the comment above the alias_ddrs check though.
>> Yes, the logic here is complicated.  On one hand, I want to be
>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that the vectorizer
>> can undo all "unnecessary" distribution before memory behavior is
>> modeled.
>>
>>>
>>>>>
>>>>> +  /* Don't version loop if any partition is builtin.  */
>>>>> +  for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>> +    {
>>>>> +      if (partition->kind != PKIND_NORMAL)
>>>>> +       break;
>>>>> +    }
>>>>>
>>>>> why's that?  Do you handle the case where only a subset of partitions
>>>> One reason is that I generally consider distributed builtin functions a
>>>> win, thus distribution won't be canceled later in the vectorizer.  Another
>>>> reason is that if all distributed loops are recognized as builtins, we
>>>> can't really version with the current logic because the
>>>> IFN_LOOP_DIST_ALIAS won't be folded in the vectorizer.
>>>
>>> Ah, ok.  I guess the maze was too twisted for me to see what
>>> version_for_distribution_p does ;)
>>>
>>>>> require a runtime alias check to be generated?  Thus from a loop
>>>>> generate three loops, only two of them being versioned?  The
>>>>> complication would be that both the runtime alias and non-alias
>>>>> versions would be "transformed".  Or do we rely on recursively
>>>>> distributing the distribution result, thus if we have partitions that
>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>> and recurse on those?
>>>> No, this is not precisely handled now; the pass will version the whole
>>>> loop once.  Though I think it's not very difficult to do two-stage
>>>> distribution, I am not sure how useful it is.
>>>
>>> Not sure either.
>>>
>>> So to understand: you're looking at loop distribution as a vectorization enabler
>>> and pattern detector.  I think that is reasonable without a better cost model.
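
To illustrate the "pattern detector" side (again only a made-up sketch with
hypothetical names g/g_distributed, not a testcase from the patch): when one
of the distributed partitions matches a known memory idiom, it is emitted as a
builtin call instead of a loop, which is the exception mentioned above where
distribution is kept even without the vectorizer:

  #include <string.h>

  void
  g (int *a, int *b, int n)
  {
    for (int i = 0; i < n; i++)
      {
        a[i] = 0;               /* this partition matches the memset idiom */
        b[i] = b[i] * b[i];     /* this partition stays a (vectorizable) loop */
      }
  }

  /* Conceptually, after distribution plus builtin recognition
     (again subject to an a/b runtime alias check).  */
  void
  g_distributed (int *a, int *b, int n)
  {
    if (n > 0)
      memset (a, 0, (size_t) n * sizeof (int));
    for (int i = 0; i < n; i++)
      b[i] = b[i] * b[i];
  }
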
>>>
>>> Note that I think the state before your patches had the sensible cost model
>>> that it fused very conservatively and just produced the maximum distribution,
>>> with the idea that the looping overhead itself is cheap.  Note that with
>>> a more "maximum" distribution the vectorizer also gets the chance to
>>> do "partial vectorization" in case profitability is different.  Of course the
>>> setup cost may offset that in the case all loops end up vectorized...
>> Ideally, we would have a cost model for memory behavior in distribution.  If
>> we know distribution is beneficial, we can simply
>> distribute the loop; otherwise we pass the distribution cost (including memory
>> cost as well as runtime alias check cost) as an argument to
>> IFN_LOOP_DIST_ALIAS, so the vectorizer can weigh the cost/benefit
>> together with vectorization.
>
> Yes.  The old cost model wasn't really one, thus loop distribution was never
> enabled by default.
>
>>>
>>> So in reality the vectorizer itself should "distribute away"
>>> non-vectorizable parts of the loop ... :/
>> Do we really want to put more and more things into the vectorizer? :)
>> Michael's opinion (IIUC, about introducing different loop
>> optimizations as building blocks for a loop optimization infrastructure)
>> is quite inspiring to me.  In this way, we can at least make the different
>> optimizations conservative.  In general, it's much worse to do an
>> aggressive (unnecessary) transformation than to simply miss the
>> opportunity?
>
> Heh, but with an increasing number of "different loop optimizations" we
> miss an overall goal.  Either each pass knows what the other
> would do (vectorize or predcom, for example), or we need to have
> a meta-engine driving the building blocks, with the building blocks
> divided into an analysis + costing phase and a transform phase (and if ever
> two transforms shall apply, their analysis may not be invalidated
> by an earlier one).  Integrating the simple building blocks into one
> framework is what is sexy about GRAPHITE ...
>
> Predcom might be more beneficial than vectorization
> with v2df vectors, for example.
We need to be careful/conservative about transformations with complicated
impact.  But..., as talk is easy, I'd better stop here. :)

Thanks,
bin
>
> Richard.
>
>>
>> Thanks,
>> bin
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> bin