From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 60369 invoked by alias); 7 Jun 2017 10:19:01 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 60344 invoked by uid 89); 7 Jun 2017 10:19:00 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.3 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.2 spammy= X-HELO: mail-ua0-f169.google.com Received: from mail-ua0-f169.google.com (HELO mail-ua0-f169.google.com) (209.85.217.169) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 07 Jun 2017 10:18:58 +0000 Received: by mail-ua0-f169.google.com with SMTP id h39so3816106uaa.3 for ; Wed, 07 Jun 2017 03:19:02 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=7xfkGEowuznOCPvBPM+rt/gH03jW1VyYz0v/gCjR3Wo=; b=q2vC/XxKpWpZaDYCSBlbmwR4he6UDCknpH8esRf1sjIQuRxp66bO5Il0hFRMY0I7lf N0rOno10CXOl0+v+oG+XB+VewFGkWAAKahk7vb+G3TvFKkF91FcwtEOOROAg/lNVslvp ofJH5TGH9qUxYu133Z8JFGFHxvl4VeSfmZh01BCMdlu5vBfHY3IqPe3dbcSD0DGcrZU0 P9OekD9VQcJbpSP4fVxUd9aGwTdR73Hp61U2j+1agukeAx6p7QHtH/tiqObbT4IesF0t xjP+Exr0QOyjjmRVmzxS1UEf3X5oydue4n6Y33JmdeOqZK9Sl5efzw7y6aZ8LV/PfO1D mvWQ== X-Gm-Message-State: AODbwcAr4MYq205pLu/3rNnY0EhsP9GPRbDainNJ/j/c0+y8kO/CEmSk mLaxYpxqFozGNvxCcybO16Cd5WmMHw== X-Received: by 10.176.2.79 with SMTP id 73mr9517113uas.111.1496830740708; Wed, 07 Jun 2017 03:19:00 -0700 (PDT) MIME-Version: 1.0 Received: by 10.103.49.9 with HTTP; Wed, 7 Jun 2017 03:19:00 -0700 (PDT) In-Reply-To: References: From: "Bin.Cheng" Date: Wed, 07 Jun 2017 10:19:00 -0000 Message-ID: Subject: Re: [PATCH GCC][4/5]Improve loop distribution to handle hmmer To: Richard Biener Cc: "gcc-patches@gcc.gnu.org" Content-Type: text/plain; charset="UTF-8" X-IsSubscribed: yes X-SW-Source: 2017-06/txt/msg00417.txt.bz2 On Wed, Jun 7, 2017 at 11:03 AM, Richard Biener wrote: > On Fri, Jun 2, 2017 at 1:51 PM, Bin Cheng wrote: >> Hi, >> This is the main patch of the change. It improves loop distribution by versioning loop under >> runtime alias check conditions, as well as better partition fusion. As described in comments, >> the patch basically implements distribution in the following steps: >> >> 1) Seed partitions with specific type statements. For now we support >> two types seed statements: statement defining variable used outside >> of loop; statement storing to memory. >> 2) Build reduced dependence graph (RDG) for loop to be distributed. >> The vertices (RDG:V) model all statements in the loop and the edges >> (RDG:E) model flow and control dependences between statements. >> 3) Apart from RDG, compute data dependences between memory references. >> 4) Starting from seed statement, build up partition by adding depended >> statements according to RDG's dependence information. Partition is >> classified as parallel type if it can be executed parallelly; or as >> sequential type if it can't. Parallel type partition is further >> classified as different builtin kinds if it can be implemented as >> builtin function calls. >> 5) Build partition dependence graph (PG) based on data dependences. >> The vertices (PG:V) model all partitions and the edges (PG:E) model >> all data dependences between every partitions pair. In general, >> data dependence is either compilation time known or unknown. In C >> family languages, there exists quite amount compilation time unknown >> dependences because of possible alias relation of data references. >> We categorize PG's edge to two types: "true" edge that represents >> compilation time known data dependences; "alias" edge for all other >> data dependences. >> 6) Traverse subgraph of PG as if all "alias" edges don't exist. Merge >> partitions in each strong connected commponent (SCC) correspondingly. >> Build new PG for merged partitions. >> 7) Traverse PG again and this time with both "true" and "alias" edges >> included. We try to break SCCs by removing some edges. Because >> SCCs by "true" edges are all fused in step 6), we can break SCCs >> by removing some "alias" edges. It's NP-hard to choose optimal >> edge set, fortunately simple approximation is good enough for us >> given the small problem scale. >> 8) Collect all data dependences of the removed "alias" edges. Create >> runtime alias checks for collected data dependences. >> 9) Version loop under the condition of runtime alias checks. Given >> loop distribution generally introduces additional overhead, it is >> only useful if vectorization is achieved in distributed loop. We >> version loop with internal function call IFN_LOOP_DIST_ALIAS. If >> no distributed loop can be vectorized, we simply remove distributed >> loops and recover to the original one. >> >> Also, there are some more to improve in the future (which shouldn't be difficult): >> TODO: >> 1) We only distribute innermost loops now. This pass should handle loop >> nests in the future. >> 2) We only fuse partitions in SCC now. A better fusion algorithm is >> desired to minimize loop overhead, maximize parallelism and maximize >> >> This patch also fixes couple of latent bugs in the original implementation. > > It would be nice to split this patch up, for example fixing the latent > bugs first > (so we can backport that part). I only remember the major one in which topological order is needed when sorting statements, dominance order is not enough. I will try to split it as much as possible, but the change falls in a big chunk to some extend. > > You now compute _all_ dependences in the loop while I originally designed loop > distribution to do less dependence computation by only computing dependences > between datarefs in different partitions (and delaying that until > after partition fusing > required by things not requiring dependence info). > Please do not remove this optimization. Right, I did that and cache it in hash table because we need to query dependence for two references multiple times. I will restore the on-demand computing behavior. Thanks, bin > > Richard. > >> After this change, kernel loop in hmmer can be distributed and vectorized as a result. >> This gives obvious performance improvement. There is still inefficient code generation >> issue which I will try to fix in loop split. Apart from this, the next opportunity in hmmer >> is to eliminate number of dead stores under proper alias information. >> Bootstrap and test at O2/O3 on x86_64 and AArch64. is it OK? >> >> Thanks, >> bin >> 2017-05-31 Bin Cheng >> >> * cfgloop.h (struct loop): New field ldist_alias_id. >> * cfgloopmanip.c (lv_adjust_loop_entry_edge): Refine comment for >> new internal function. >> * internal-fn.c (expand_LOOP_DIST_ALIAS): New function. >> * internal-fn.def (IFN_LOOP_DIST_ALIAS): New internal function. >> * tree-loop-distribution.c: Add general explanantion on the pass. >> Include header file. >> (struct ddr_entry, struct ddr_entry_hasher): New structs. >> (ddr_entry_hasher::hash, ddr_entry_hasher::equal): New functions. >> (bb_top_order_index, bb_top_order_index_size): New static vars. >> (bb_top_order_cmp): New function. >> (stmts_from_loop): Get basic blocks in topological order. Don't >> free data references. >> (build_rdg): New parameter pointing to vector of data references. >> Store data references in it. >> (enum partition_type): New enum. >> (enum partition_kind, struct partition): Add comments. New fields. >> (partition_alloc, partition_free): Handle new fields of partition. >> (enum fuse_type, fuse_message): New enum and varaible. >> (partition_merge_into): New parameter. Update implementation wrto >> new fields of partition. >> (generate_loops_for_partition): Set ldist_alias_id information. >> (record_ddr, get_ddr, possible_data_dep_cycle_p): New functions. >> (build_rdg_partition_for_vertex): New parameter. Compute type info >> for partition. >> (classify_partition): New parameter. Only classify partition as >> reduction if the reduction is not included by all partitions. >> Retrieve cached ddr, rather than compute it on the fly. >> (ref_base_address): Delete. >> (similar_memory_accesses): Rename to ... >> (share_memory_accesses): ... this. >> (rdg_build_partitions): New parameter and update uses. >> (pg_add_dependence_edges): New parameter. Store ddr in parameter >> vector if it could be resolved by runtime alias check. >> (rdg_compute_data_dependence): New function. >> (struct pg_vdata, pg_edata, pg_edge_callback_data): New structs. >> (init_partition_graph_vertices, add_partition_graph_edge): New. >> (pg_skip_alias_edge, free_partition_graph_edata_cb): New. >> (free_partition_graph_vdata, build_partition_graph): New. >> (sort_partitions_by_post_order, merge_dep_scc_partitions): New. >> (pg_collect_alias_ddrs, break_alias_scc_partitions): New. >> (data_ref_segment_size, latch_dominated_by_data_ref): New. >> (compute_alias_check_pairs, version_loop_by_alias_check): New. >> (version_for_distribution_p, finalize_partitions): New. >> (distribute_loop): Update uses. Break SCC and version loop >> with runtime alias checks. >> (pass_loop_distribution::execute): Compute topological order for >> basic blocks. Update uses. >> * tree-vectorizer.c (vect_loop_dist_alias_call): New. >> (fold_loop_dist_alias_call): New. >> (vectorize_loops): Fold IFN_LOOP_DIST_ALIAS call depending on >> successful vectorization or not. >> >> gcc/testsuite/ChangeLog >> 2017-05-31 Bin Cheng >> >> * gcc.dg/tree-ssa/ldist-4.c: Adjust test string. >> * gcc.dg/tree-ssa/ldist-12.c: Ditto. >> * gcc.dg/tree-ssa/ldist-13.c: Ditto. >> * gcc.dg/tree-ssa/ldist-14.c: Ditto.