From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from nikam.ms.mff.cuni.cz (nikam.ms.mff.cuni.cz [195.113.20.16]) by sourceware.org (Postfix) with ESMTPS id D21823851C0A for ; Thu, 27 Aug 2020 15:19:00 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org D21823851C0A Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=ucw.cz Authentication-Results: sourceware.org; spf=none smtp.mailfrom=hubicka@kam.mff.cuni.cz Received: by nikam.ms.mff.cuni.cz (Postfix, from userid 16202) id 3A001282A43; Thu, 27 Aug 2020 17:18:56 +0200 (CEST) Date: Thu, 27 Aug 2020 17:18:56 +0200 From: Jan Hubicka To: Giuliano Belinassi Cc: gcc-patches@gcc.gnu.org, richard.guenther@gmail.com Subject: Re: [PATCH 2/6] Implement a new partitioner for parallel compilation Message-ID: <20200827151856.GA95047@kam.mff.cuni.cz> References: <20200820220019.3131804-1-giuliano.belinassi@usp.br> <20200820220019.3131804-3-giuliano.belinassi@usp.br> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200820220019.3131804-3-giuliano.belinassi@usp.br> User-Agent: Mutt/1.10.1 (2018-07-13) X-Spam-Status: No, score=-16.9 required=5.0 tests=BAYES_00, GIT_PATCH_0, HEADER_FROM_DIFFERENT_DOMAINS, KAM_DMARC_STATUS, KAM_LAZY_DOMAIN_SECURITY, KAM_STOCKGEN, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Aug 2020 15:19:03 -0000 > When using the LTO infrastructure to compile files in parallel, we > can't simply use any of the LTO partitioner, once extra dependency > analysis is required to ensure that some nodes are correctly > partitioned together. > > Therefore, here we implement a new partitioner called > "lto_merge_comdat_map" that does all these required analysis. > The partitioner works as follows: > > 1. We create a number of disjoint sets and inserts each node into a > separate set, which may be merged together in the future. > > 2. Find COMDAT groups, and mark them to be partitioned together. > > 3. Check all nodes that would require any COMDAT group to be > copied to its partition (which we name "COMDAT frontier"), > and mark them to be partitioned together. > This avoids duplication of COMDAT groups and crashes on the LTO > partitioning infrastructure. What kind of crash you get here? > > 4. Check if the user allows the partitioner to promote non-public > functions or variables to global to improve parallelization > opportunity with a cost of modifying the output code layout. > > 5. Balance generated partitions for performance unless not told to. > > The choice of 1. was by design, so we could use a union-find > data structure, which are know for being very fast on set unite > operations. In LTO partitioner the groups of objects that "must go toghether" are discovered when first object is placed into the partition (via add_to_partition) because with the LTO rules it is always possible to discover all members from the group starting from any random element via graph walking. I guess it is same with your partitioner? I basically wonder how much code can be shared and what needs to be duplicated. It is not very nice to have partitioning implemented twice - it is bit subtle problem when it comes to details so I would be happier if we brought in the lto/lto-partition.c to middle end and updaed/cleaned it up as needed. > > For 3. to work properly, we also had to modify > lto_promote_cross_file_statics to handle this case. > > The parameters --param=promote-statics and --param=balance-partitions > control 4. and 5., respectively > > gcc/ChangeLog: > 2020-08-20 Giuliano Belinassi > > * Makefile.in: Add lto-partition.o > * cgraph.h (struct symtab_node::aux2): New variable. > * lto-partition.c: Move from gcc/lto/lto-partition.c > (add_symbol_to_partition_1): Only compute insn size > if information is available. > (node_cmp): Same as above. > (class union_find): New. > (ds_print_roots): New function. > (balance_partitions): New function. > (build_ltrans_partitions): New function. > (merge_comdat_nodes): New function. > (merge_static_calls): New function. > (merge_contained_symbols): New function. > (lto_merge_comdat_map): New function. > (privatize_symbol_name_1): Handle when WPA is not enabled. > (privatize_symbol_name): Same as above. > (lto_promote_cross_file_statics): New parameter to select when > to promote to global. > (lto_check_usage_from_other_partitions): New function. > * lto-partition.h: Move from gcc/lto/lto-partition.h > (lto_promote_cross_file_statics): Update prototype. > (lto_check_usage_from_other_partitions): Declare. > (lto_merge_comdat_map): Declare. > > gcc/lto/ChangeLog: > 2020-08-20 Giuliano Belinassi > > * lto-partition.c: Move to gcc/lto-partition.c. > * lto-partition.h: Move to gcc/lto-partition.h. > * lto.c: Update call to lto_promote_cross_file_statics. > * Makefile.in: Remove lto-partition.o. > --- > gcc/Makefile.in | 1 + > gcc/cgraph.h | 1 + > gcc/{lto => }/lto-partition.c | 463 +++++++++++++++++++++++++++++++++- > gcc/{lto => }/lto-partition.h | 4 +- > gcc/lto/Make-lang.in | 4 +- > gcc/lto/lto.c | 2 +- > gcc/params.opt | 8 + > gcc/tree.c | 23 +- > 8 files changed, 489 insertions(+), 17 deletions(-) > rename gcc/{lto => }/lto-partition.c (78%) > rename gcc/{lto => }/lto-partition.h (89%) > > diff --git a/gcc/Makefile.in b/gcc/Makefile.in > index 79e854aa938..be42b15f4ff 100644 > --- a/gcc/Makefile.in > +++ b/gcc/Makefile.in > @@ -1457,6 +1457,7 @@ OBJS = \ > lra-spills.o \ > lto-cgraph.o \ > lto-streamer.o \ > + lto-partition.o \ > lto-streamer-in.o \ > lto-streamer-out.o \ > lto-section-in.o \ > diff --git a/gcc/cgraph.h b/gcc/cgraph.h > index 0211f08964f..b4a7871bd3d 100644 > --- a/gcc/cgraph.h > +++ b/gcc/cgraph.h > @@ -615,6 +615,7 @@ public: > struct lto_file_decl_data * lto_file_data; > > PTR GTY ((skip)) aux; > + int aux2; We do not really want to add more pass specific data to the symbol table (since it is critical wrt memory use during WPA stage). It is possible to attach extra info using the symbol-summary.h > > /* Comdat group the symbol is in. Can be private if GGC allowed that. */ > tree x_comdat_group; > diff --git a/gcc/lto/lto-partition.c b/gcc/lto-partition.c > similarity index 78% > rename from gcc/lto/lto-partition.c > rename to gcc/lto-partition.c > index 8e0488ab13e..ca962e69b5d 100644 > --- a/gcc/lto/lto-partition.c > +++ b/gcc/lto-partition.c > @@ -170,7 +170,11 @@ add_symbol_to_partition_1 (ltrans_partition part, symtab_node *node) > { > struct cgraph_edge *e; > if (!node->alias && c == SYMBOL_PARTITION) > - part->insns += ipa_size_summaries->get (cnode)->size; > + { > + /* FIXME: Find out why this is being returned NULL in some cases. */ > + if (ipa_size_summaries->get (cnode)) > + part->insns += ipa_size_summaries->get (cnode)->size; It returns NULL when symbol summary is not available. Either symbol is not defined (in which case it should not be SYMBOL_PARTITION) or it is not computed (probaly because of -O0?) > +/* Quickly balance partitions, trying to reach target_size in each of > + them. Returns true if something was done, or false if we decided > + that it is not worth. */ > + > +static bool > +balance_partitions (union_find *ds, int n, int jobs) Generally options should be documented, so I would learn what N means ;) > +{ > + int *sizes, i, j; > + int total_size = 0, max_size = -1; > + int target_size; > + const int eps = 0; > + > + symtab_node *node; > + > + sizes = (int *) alloca (n * sizeof (*sizes)); > + memset (sizes, 0, n * sizeof (*sizes)); And we avoid using alloca for arrays that can grow over stack limit. I assume this has the size of symbol table, which means that you want to use vec.h API and allocae on heap. > + > + /* Compute costs. */ > + i = 0; > + FOR_EACH_SYMBOL (node) > + { > + int root = ds->find (i); > + > + if (cgraph_node *cnode = dyn_cast (node)) > + { > + ipa_size_summary *summary = ipa_size_summaries->get (cnode); > + if (summary) > + sizes[root] += summary->size; > + else > + sizes[root] += 10; > + } > + else > + sizes[root] += 10; Do you have testcases where summary is mising? > + > + > + i++; > + } > + > + /* Compute the total size and maximum size. */ > + for (i = 0; i < n; ++i) > + { > + total_size += sizes[i]; > + max_size = MAX (max_size, sizes[i]); Also i think we should start using 64bit values for total sizes of units. In some extreme cases they already get close to overflow. > + } > + > + /* Quick return if total size is small. */ > + if (total_size < param_min_partition_size) > + return false; > + > + target_size = total_size / (jobs + 1); > + > + /* Unite small partitions. */ > + for (i = 0, j = 0; j < n; ++j) > + { > + if (sizes[j] == 0) > + continue; > + > + if (i == -1) > + i = j; > + else > + { > + if (sizes[i] + sizes[j] < target_size + eps) > + { > + ds->unite (i, j); > + sizes[i] += sizes[j]; > + sizes[j] = 0; > + } > + else > + i = j; > + } > + } > + return true; Note that partitioning is not free, since reference to static var or a call from one partition to another will lead to more expensive relocations/instructions to be used (since gas will not be able to relax it to IP relative addressing where available). For that reason LTO partitioner has the logic tracking boundary and minimizing it. I think we should merge them and also cleanup lto/lto-partition.c - the partitioner was one late night experiment that I intended as a first proof of concept to be rewritten later, but we never got into impementing anything much smarter. On the other hand we added a lot of extra hacks to it (order preserving and other things), so it deserves some TLC. Also I think you unite partitions in the FOR_EACH_* order that is not really meaningful, the code layout is controlled by node->order values. > +} > + > +/* Builds the LTRANS partitions, or return if not needed. */ > + > +static int > +build_ltrans_partitions (union_find *ds, int n) > +{ > + int i, n_partitions; > + symtab_node *node; > + > + int *compression = (int *) alloca (n * sizeof (*compression)); > + for (i = 0; i < n; ++i) > + compression[i] = -1; /* Invalid value. */ > + > + i = 0, n_partitions = 0; > + FOR_EACH_SYMBOL (node) > + { > + int root = ds->find (i); > + node->aux2 = root; > + node->aux = NULL; > + > + if (node->get_partitioning_class () == SYMBOL_PARTITION > + && compression[root] < 0) > + compression[root] = n_partitions++; > + i++; > + } What exactly compression is used for? > + > + if (dump_file) > + fprintf (dump_file, "n_partitions = %d\n", n_partitions); > + > + if (n_partitions <= 1) > + return false; > + > + /* Create LTRANS partitions. */ > + ltrans_partitions.create (n_partitions); > + for (i = 0; i < n_partitions; i++) > + new_partition (""); > + > + FOR_EACH_SYMBOL (node) > + { > + if (node->get_partitioning_class () != SYMBOL_PARTITION > + || symbol_partitioned_p (node)) > + continue; > + > + int p = compression[node->aux2]; > + if (dump_file) > + fprintf (dump_file, "p = %d\t;; %s\n", p, node->dump_name ()); > + add_symbol_to_partition (ltrans_partitions[p], node); > + } > + > + return true; > +} > + > +/* Partition COMDAT groups together, and also bring together nodes that > + requires them. Such nodes that are not in the COMDAT group that have > + references to COMDAT grouped nodes are called the COMDAT frontier. */ > + > +static bool > +merge_comdat_nodes (symtab_node *node, int set) > +{ > + enum symbol_partitioning_class c = node->get_partitioning_class (); > + bool ret = false; > + symtab_node *node1; > + cgraph_edge *e; > + > + /* If node is already analysed, quickly return. */ > + if (node->aux) > + return false; > + > + /* Mark as analysed. */ > + node->aux = (void *) 1; > + > + > + /* Aglomerate the COMDAT group into the same partition. */ > + if (node->same_comdat_group) > + { > + for (node1 = node->same_comdat_group; > + node1 != node; node1 = node1->same_comdat_group) > + if (!node->alias) > + { > + ds->unite (node1->aux2, set); > + merge_comdat_nodes (node1, set); > + ret = true; > + } > + } > + > + /* Look at nodes that can reach the COMDAT group, and aglomerate to the > + same partition. These nodes are called the "COMDAT Frontier". The > + idea is that every unpartitioned node that reaches a COMDAT group MUST > + go through the COMDAT frontier before reaching it. Therefore, only > + nodes in the frontier are exported. */ > + if (node->same_comdat_group || c == SYMBOL_DUPLICATE) > + { > + int i; > + struct ipa_ref *ref = NULL; > + > + if (cgraph_node *cnode = dyn_cast (node)) > + { > + /* Add all inline clones and callees that are duplicated. */ > + for (e = cnode->callers; e; e = e->next_caller) > + if (!e->inline_failed || c == SYMBOL_DUPLICATE) > + { > + ds->unite (set, e->caller->aux2); > + merge_comdat_nodes (e->caller, set); > + ret = true; > + } > + > + /* Add all thunks associated with the function. */ > + for (e = cnode->callees; e; e = e->next_callee) > + if (e->caller->thunk.thunk_p && !e->caller->inlined_to) > + { > + ds->unite (set, e->callee->aux2); > + merge_comdat_nodes (e->callee, set); > + ret = true; > + } > + } I do not think it is strictly necessary to merge comdat function with all users. If the comdat is some easy accessor this may prevent a lot of merging. All you need to do is to place it into one of partitins and mark set "used" flag on the symbols used in other partitions, so it does not get optimized away when unused. Other partitions should reffer to it w/o problems. I am not sure what exactly the goal is here about not changing code layout. > +/* Partition the program into several partitions with a restriction that > + COMDATS are partitioned together with all nodes requiring them. If > + promote_statics is false, we also partition together static functions > + and nodes that call eachother, so non-public functions are not promoted > + to globals. */ > + > +void > +lto_merge_comdat_map (bool balance, bool promote_statics, int jobs) BTW I think it is odd name for partitioner. Comdat is just one of problems it deals with. But I would still like to see this merged with the lto partitioning logic, so we could also use all kinds of partitioners in both modes. It all looks quite nice, but lets work on avoiding the code duplication here... Honza