Date: Wed, 25 Nov 2015 09:04:00 -0000
From: Jan Hubicka
To: gcc-patches@gcc.gnu.org, rguenther@suse.de, ak@linux.intel.com, hongjiu.lu@intel.com, ccoutant@google.com, iant@google.com
Subject: [RFC] Getting LTO incremental linking work
Message-ID: <20151125085912.GD58491@kam.mff.cuni.cz>

Hi,
PR 67548 is about LTO not supporting incremental linking. I never really considered our current incremental linking very useful, because it triggers code generation at incremental link time, basically nullifying any benefit of whole program optimization. In fact I think it is harmful, because it sort of works and, without any warning, produces poorly optimized code.

Basically there are 3 schemes how to make incremental link work:
 1) Turn LTO objects to non-LTO, as we do now
 2) Concatenate LTO sections, as implemented by Andi and HJ
 3) Do actual linking of LTO sections

The problem with the current implementation of 1) is that GCC thinks the resulting object file will not be used for static linking and thus assumes that hidden symbols can be turned to static. In the log of PR67548, HJ actually pointed out that we do have an API on the linker plugin side which says what type of output is being produced. This is cool because we can also use it to drop -fpic when building a static binary. This is common in Firefox, where some objects are built with -fpic and linked into both binaries and libraries.

Moreover we do have all the infrastructure ready to implement 3). Our tree merging and symbol table handling is fully incremental, and I think I made a patch to implement it today.

The scheme is easy:
 1) The linker plugin is modified to pass -flinker-output to the lto wrapper.
    linker-output is either dyn (.so), pie or exec. For incremental linking
    I added rel for 3) and noltorel for 1). Currently it does rel. Neither 3)
    nor 2) can be done when incremental linking is done on both LTO and
    non-LTO objects. In this case the linker plugin outputs a warning about
    code quality loss and switches to noltorel.
 2) With -flinker-output=rel the lto wrapper behaves the same way as with
    -flto-partition=none.
 3) The lto frontend parses -flinker-output and sets our internal flags
    accordingly. I added a new flag_incremental_link to inform the middle-end
    that the output is going to be statically linked again. A non-zero value
    disables the privatization of hidden symbols, and if set to 2 it also
    triggers the LTO IL streaming.

Incremental linking in rel mode now streams in all global streams, merges trees, merges the symbol table, removes unreachable symbols (which are a result of merging) and streams everything out to a .s file. I only tested the patch on incrementally linking libbackend.o.
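
The three steps of the scheme can be illustrated with a small standalone C sketch. It is not the patch itself: every `sketch_`/`SK_` name and the `struct sketch_flags` type are hypothetical and exist only for illustration. Only the option strings (rel, noltorel, dyn, pie, exec), the fallback from rel to noltorel for mixed inputs, and the meaning of the incremental-link values 1 and 2 are taken from the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the output kinds named in the mail.  */
enum sketch_output
{
  SK_UNKNOWN, SK_REL, SK_NOLTOREL, SK_DYN, SK_PIE, SK_EXEC
};

/* Hypothetical subset of the state the LTO front end adjusts.  */
struct sketch_flags
{
  int incremental_link;  /* 0: final link, 1: non-LTO .o, 2: LTO .o.  */
  bool stream_lto_il;    /* LTO IL is streamed out again.  */
  bool privatize_hidden; /* Hidden symbols may be turned to static.  */
};

/* Step 1 (plugin side): an incremental link falls back from rel to
   noltorel as soon as some input object carries no LTO IL.  */
enum sketch_output
sketch_pick_rel_mode (unsigned non_lto_files)
{
  return non_lto_files ? SK_NOLTOREL : SK_REL;
}

/* Parse a complete "-flinker-output=VALUE" argument.  */
enum sketch_output
sketch_parse_output (const char *arg)
{
  static const char prefix[] = "-flinker-output=";
  size_t n = sizeof prefix - 1;

  if (strncmp (arg, prefix, n) != 0)
    return SK_UNKNOWN;
  arg += n;
  if (strcmp (arg, "rel") == 0)
    return SK_REL;
  if (strcmp (arg, "noltorel") == 0)
    return SK_NOLTOREL;
  if (strcmp (arg, "dyn") == 0)
    return SK_DYN;
  if (strcmp (arg, "pie") == 0)
    return SK_PIE;
  if (strcmp (arg, "exec") == 0)
    return SK_EXEC;
  return SK_UNKNOWN;
}

/* Step 2 (wrapper side): only rel disables partitioning.  */
bool
sketch_no_partition (enum sketch_output out)
{
  return out == SK_REL;
}

/* Step 3 (front end): translate the output kind into internal state.  */
struct sketch_flags
sketch_apply_output (enum sketch_output out)
{
  struct sketch_flags f = { 0, false, true };

  switch (out)
    {
    case SK_REL:
      f.incremental_link = 2;
      f.stream_lto_il = true;
      f.privatize_hidden = false;
      break;
    case SK_NOLTOREL:
      f.incremental_link = 1;
      f.privatize_hidden = false;
      break;
    default:
      break;
    }
  return f;
}
```

For example, `sketch_apply_output (sketch_pick_rel_mode (0))` yields the rel configuration with IL streaming enabled, while any non-zero count of non-LTO inputs selects the noltorel configuration.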

The linking time is 46 seconds:

Execution times (seconds)
 phase opt and generate  :  35.75 (81%) usr   0.90 (76%) sys  36.63 (81%) wall    5008 kB ( 1%) ggc
 phase stream in         :   8.57 (19%) usr   0.28 (24%) sys   8.86 (19%) wall  700851 kB (99%) ggc
 callgraph optimization  :   0.08 ( 0%) usr   0.01 ( 1%) sys   0.08 ( 0%) wall       0 kB ( 0%) ggc
 ipa dead code removal   :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 ipa cp                  :   0.36 ( 1%) usr   0.04 ( 3%) sys   0.41 ( 1%) wall   42862 kB ( 6%) ggc
 ipa inlining heuristics :   0.18 ( 0%) usr   0.02 ( 2%) sys   0.19 ( 0%) wall   26771 kB ( 4%) ggc
 lto stream inflate      :   3.57 ( 8%) usr   0.14 (12%) sys   3.70 ( 8%) wall       0 kB ( 0%) ggc
 lto stream deflate      :  20.13 (45%) usr   0.05 ( 4%) sys  19.42 (43%) wall       0 kB ( 0%) ggc
 lto stream output       :   9.70 (22%) usr   0.32 (27%) sys  10.50 (23%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :   0.66 ( 1%) usr   0.24 (20%) sys   1.09 ( 2%) wall    4655 kB ( 1%) ggc
 ipa lto decl in         :   5.87 (13%) usr   0.11 ( 9%) sys   6.10 (13%) wall  552108 kB (78%) ggc
 ipa lto decl out        :   2.91 ( 7%) usr   0.16 (14%) sys   3.07 ( 7%) wall       0 kB ( 0%) ggc
 ipa lto constructors in :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     108 kB ( 0%) ggc
 ipa lto constructors out:   0.12 ( 0%) usr   0.03 ( 3%) sys   0.13 ( 0%) wall     178 kB ( 0%) ggc
 ipa lto cgraph I/O      :   0.12 ( 0%) usr   0.02 ( 2%) sys   0.15 ( 0%) wall   70005 kB (10%) ggc
 ipa lto decl merge      :   0.31 ( 1%) usr   0.00 ( 0%) sys   0.30 ( 1%) wall    1023 kB ( 0%) ggc
 ipa lto cgraph merge    :   0.11 ( 0%) usr   0.00 ( 0%) sys   0.11 ( 0%) wall    7972 kB ( 1%) ggc
 ipa profile             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   0.01 ( 0%) usr   0.01 ( 1%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 ipa icf                 :   0.04 ( 0%) usr   0.01 ( 1%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 varconst                :   0.02 ( 0%) usr   0.01 ( 1%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                   :  44.32             1.18            45.49              707846 kB

There are a few low hanging fruits. First, streaming LTO files is slow because of vprintf:

        case 1:
          /* TODO: Print in hex with fast function, important for -flto. */
          fprintf (f, "\\%03o", c);
          break;

a trivial bug to fix; I will send a separate patch for this. Second, most of the inflate/deflate time goes to compressing and uncompressing sections that are being copied. This is also trivial to fix and I will do that in a separate patch. It also affects WPA and /tmp space usage.

The size of the library is cut to about a half:

-rw-r--r-- 1 hubicka _cvsadmin 211854942 Nov 25 09:18 libbackend.a
-rw-r--r-- 1 hubicka _cvsadmin 121986816 Nov 25 09:16 libbackend.o

and linking of the cc1 binary goes from 1m31s to 1m20s. Because we link libbackend.a more than 4 times, it would actually pay back even in the GCC setting, though I suppose the main utility would be in parallelizing the builds (like the kernel does).

WPA stage times are:

Execution times (seconds)
 phase opt and generate  :   3.76 (52%) usr   0.07 ( 6%) sys   3.83 (41%) wall   53777 kB (13%) ggc
 phase stream in         :   3.04 (42%) usr   0.33 (28%) sys   3.37 (36%) wall  346427 kB (86%) ggc
 phase stream out        :   0.40 ( 6%) usr   0.78 (66%) sys   2.18 (23%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   0.05 ( 1%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall      18 kB ( 0%) ggc
 ipa dead code removal   :   0.46 ( 6%) usr   0.00 ( 0%) sys   0.44 ( 5%) wall       0 kB ( 0%) ggc
 ipa cp                  :   0.40 ( 6%) usr   0.05 ( 4%) sys   0.47 ( 5%) wall   55439 kB (14%) ggc
 ipa inlining heuristics :   1.95 (27%) usr   0.02 ( 2%) sys   1.97 (21%) wall   65871 kB (16%) ggc
 lto stream inflate      :   0.60 ( 8%) usr   0.11 ( 9%) sys   0.67 ( 7%) wall       0 kB ( 0%) ggc
 ipa lto decl in         :   1.93 (27%) usr   0.18 (15%) sys   2.10 (22%) wall  205593 kB (51%) ggc
 ipa lto decl out        :   0.28 ( 4%) usr   0.02 ( 2%) sys   0.29 ( 3%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   0.09 ( 1%) usr   0.02 ( 2%) sys   0.12 ( 1%) wall   62797 kB (16%) ggc
 ipa lto decl merge      :   0.20 ( 3%) usr   0.00 ( 0%) sys   0.20 ( 2%) wall    1023 kB ( 0%) ggc
 whopr partitioning      :   0.56 ( 8%) usr   0.00 ( 0%) sys   0.56 ( 6%) wall    1419 kB ( 0%) ggc
 ipa reference           :   0.17 ( 2%) usr   0.00 ( 0%) sys   0.17 ( 2%) wall       0 kB ( 0%) ggc
 ipa pure const          :   0.17 ( 2%) usr   0.00 ( 0%) sys   0.16 ( 2%) wall       0 kB ( 0%) ggc
 ipa icf                 :   0.07 ( 1%) usr   0.00 ( 0%) sys   0.07 ( 1%) wall     485 kB ( 0%) ggc
 unaccounted todo        :   0.06 ( 1%) usr   0.00 ( 0%) sys   0.06 ( 1%) wall       0 kB ( 0%) ggc
 TOTAL                   :   7.20             1.18             9.39              402192 kB

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1986 kB ( 0%) ggc
 phase opt and generate  :   6.66 (39%) usr   0.38 (22%) sys   7.03 (36%) wall  199143 kB (21%) ggc
 phase stream in         :   9.33 (54%) usr   0.38 (22%) sys   9.71 (50%) wall  764698 kB (79%) ggc
 phase stream out        :   0.82 ( 5%) usr   0.97 (55%) sys   2.23 (11%) wall       2 kB ( 0%) ggc
 phase finalize          :   0.40 ( 2%) usr   0.03 ( 2%) sys   0.43 ( 2%) wall       0 kB ( 0%) ggc
 garbage collection      :   0.79 ( 5%) usr   0.01 ( 1%) sys   0.80 ( 4%) wall       0 kB ( 0%) ggc
 ipa dead code removal   :   0.41 ( 2%) usr   0.00 ( 0%) sys   0.45 ( 2%) wall       0 kB ( 0%) ggc
 ipa cp                  :   0.33 ( 2%) usr   0.05 ( 3%) sys   0.41 ( 2%) wall   56753 kB ( 6%) ggc
 ipa inlining heuristics :   1.74 (10%) usr   0.02 ( 1%) sys   1.80 ( 9%) wall   55600 kB ( 6%) ggc
 lto stream inflate      :   2.18 (13%) usr   0.12 ( 7%) sys   2.28 (12%) wall       0 kB ( 0%) ggc
 ipa lto gimple in       :   0.62 ( 4%) usr   0.23 (13%) sys   0.96 ( 5%) wall  135317 kB (14%) ggc
 ipa lto decl in         :   6.63 (39%) usr   0.15 ( 9%) sys   6.70 (35%) wall  598144 kB (62%) ggc
 ipa lto decl out        :   0.55 ( 3%) usr   0.01 ( 1%) sys   0.57 ( 3%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   0.14 ( 1%) usr   0.03 ( 2%) sys   0.15 ( 1%) wall   76843 kB ( 8%) ggc
 ipa lto decl merge      :   0.35 ( 2%) usr   0.00 ( 0%) sys   0.34 ( 2%) wall    1023 kB ( 0%) ggc
 ipa lto cgraph merge    :   0.13 ( 1%) usr   0.00 ( 0%) sys   0.13 ( 1%) wall    9284 kB ( 1%) ggc
 whopr partitioning      :   0.51 ( 3%) usr   0.00 ( 0%) sys   0.50 ( 3%) wall    1496 kB ( 0%) ggc
 ipa reference           :   0.18 ( 1%) usr   0.00 ( 0%) sys   0.19 ( 1%) wall       0 kB ( 0%) ggc
 ipa pure const          :   0.20 ( 1%) usr   0.01 ( 1%) sys   0.20 ( 1%) wall       0 kB ( 0%) ggc
 ipa icf                 :   1.82 (11%) usr   0.05 ( 3%) sys   1.85 (10%) wall    2138 kB ( 0%) ggc
 tree operand scan       :   0.13 ( 1%) usr   0.06 ( 3%) sys   0.17 ( 1%) wall   21674 kB ( 2%) ggc
 TOTAL                   :  17.21             1.76            19.41              965830 kB

so a 50% cut in memory use and a reasonable speedup. I need to check what happens with ICF.

The WPA stats are as follows:

WPA statistics
[WPA] read 891308 SCCs of average size 1.972195
[WPA] 1757833 tree bodies read in total
[WPA] tree SCC table: size 524287, 230881 elements, collision ratio: 1.107788
[WPA] tree SCC max chain length 39 (size 1)
[WPA] Compared 73318 SCCs, 81315 collisions (1.109073)
[WPA] Merged 52578 SCCs
[WPA] Merged 502850 tree bodies
[WPA] Merged 36730 types
[WPA] 205971 types prevailed (565069 associated trees)
[WPA] GIMPLE canonical type table: size 16381, 1251 elements, 28138 searches, 444 collisions (ratio: 0.015779)
[WPA] GIMPLE canonical type pointer-map: 1251 elements, 99917 searches
[WPA] # of input files: 125
[WPA] Compression: 23123694 input bytes, 79799028 uncompressed bytes (ratio: 3.450963)
[WPA] Size of mmap'd section decls: 23123694 bytes

compared to

WPA statistics
[WPA] read 3633234 SCCs of average size 2.539347
[WPA] 9226041 tree bodies read in total
[WPA] tree SCC table: size 524287, 257562 elements, collision ratio: 0.673833
[WPA] tree SCC max chain length 39 (size 1)
[WPA] Compared 500618 SCCs, 646007 collisions (1.290419)
[WPA] Merged 478513 SCCs
[WPA] Merged 5659960 tree bodies
[WPA] Merged 326141 types
[WPA] 207806 types prevailed (562649 associated trees)
[WPA] GIMPLE canonical type table: size 16381, 1246 elements, 27925 searches, 437 collisions (ratio: 0.015649)
[WPA] GIMPLE canonical type pointer-map: 1246 elements, 97858 searches
[WPA] # of input files: 461
[WPA] Compression: 95695388 input bytes, 303240971 uncompressed bytes (ratio: 3.168815)
[WPA] Size of mmap'd section decls: 95695388 bytes

So about a 5-fold improvement in the number of trees and decls read.
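
As a quick check of the "about 5fold" figure, the ratios can be computed directly from the two WPA statistics dumps. The sketch below only does that arithmetic. The struct and variable names are hypothetical, and it assumes the dump with 125 input files is the run that uses the incrementally linked libbackend.o while the dump with 461 input files is the ordinary link.

```c
#include <stdio.h>

/* Counts copied from the two "WPA statistics" dumps in this mail.  */
struct wpa_counts
{
  double sccs;        /* "read N SCCs".  */
  double tree_bodies; /* "N tree bodies read in total".  */
  double decl_bytes;  /* "Size of mmap'd section decls".  */
};

/* Assumption: 125 input files = link using libbackend.o,
   461 input files = ordinary link.  */
const struct wpa_counts wpa_incremental = { 891308.0, 1757833.0, 23123694.0 };
const struct wpa_counts wpa_plain = { 3633234.0, 9226041.0, 95695388.0 };

/* How many times larger the ordinary link is for one counter.  */
double
wpa_ratio (double plain, double incremental)
{
  return plain / incremental;
}

/* Print the three ratios with two decimal places.  */
void
print_wpa_ratios (void)
{
  printf ("SCCs read:          %.2fx\n",
          wpa_ratio (wpa_plain.sccs, wpa_incremental.sccs));
  printf ("tree bodies read:   %.2fx\n",
          wpa_ratio (wpa_plain.tree_bodies, wpa_incremental.tree_bodies));
  printf ("decl section bytes: %.2fx\n",
          wpa_ratio (wpa_plain.decl_bytes, wpa_incremental.decl_bytes));
}
```

Under that assumption the tree body count shrinks a little more than 5 times, while the SCC count and the decls section size shrink roughly 4 times.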

By end of WPA:

[WPA] 1757833 tree bodies read in total
[WPA] # of input files: 125
[WPA] # of input cgraph nodes: 36977
[WPA] # of function bodies: 651
[WPA] # of output files: 31
[WPA] # of output symtab nodes: 185336
[WPA] # of output tree pickle references: 629336
[WPA] # of output tree bodies: 129898
[WPA] # callgraph partitions: 31
[WPA] Compression: 30134544 input bytes, 100590102 uncompressed bytes (ratio: 3.338033)
[WPA] Size of mmap'd section decls: 23123694 bytes
[WPA] Size of mmap'd section function_body: 2641029 bytes
[WPA] Size of mmap'd section statics: 0 bytes
[WPA] Size of mmap'd section symtab: 0 bytes
[WPA] Size of mmap'd section refs: 408500 bytes
[WPA] Size of mmap'd section asm: 0 bytes
[WPA] Size of mmap'd section jmpfuncs: 1432063 bytes
[WPA] Size of mmap'd section pureconst: 80213 bytes
[WPA] Size of mmap'd section reference: 0 bytes
[WPA] Size of mmap'd section profile: 2439 bytes
[WPA] Size of mmap'd section symbol_nodes: 1413364 bytes
[WPA] Size of mmap'd section opts: 0 bytes
[WPA] Size of mmap'd section cgraphopt: 0 bytes
[WPA] Size of mmap'd section inline: 1005113 bytes
[WPA] Size of mmap'd section ipcp_trans: 0 bytes
[WPA] Size of mmap'd section icf: 28129 bytes
[WPA] Size of mmap'd section offload_table: 0 bytes
[WPA] Size of mmap'd section mode_table: 0 bytes

[WPA] 9226041 tree bodies read in total
[WPA] # of input files: 461
[WPA] # of input cgraph nodes: 36888
[WPA] # of function bodies: 7690
[WPA] # of output files: 31
[WPA] # of output symtab nodes: 191489
[WPA] # of output tree pickle references: 1444221
[WPA] # of output tree bodies: 261141
[WPA] # callgraph partitions: 31
[WPA] Compression: 112942159 input bytes, 347530231 uncompressed bytes (ratio: 3.077064)
[WPA] Size of mmap'd section decls: 95695388 bytes
[WPA] Size of mmap'd section function_body: 11747200 bytes
[WPA] Size of mmap'd section statics: 0 bytes
[WPA] Size of mmap'd section symtab: 0 bytes
[WPA] Size of mmap'd section refs: 395831 bytes
[WPA] Size of mmap'd section asm: 0 bytes
[WPA] Size of mmap'd section jmpfuncs: 1666954 bytes
[WPA] Size of mmap'd section pureconst: 94608 bytes
[WPA] Size of mmap'd section reference: 0 bytes
[WPA] Size of mmap'd section profile: 9259 bytes
[WPA] Size of mmap'd section symbol_nodes: 1769069 bytes
[WPA] Size of mmap'd section opts: 0 bytes
[WPA] Size of mmap'd section cgraphopt: 0 bytes
[WPA] Size of mmap'd section inline: 1266586 bytes
[WPA] Size of mmap'd section ipcp_trans: 0 bytes
[WPA] Size of mmap'd section icf: 297264 bytes
[WPA] Size of mmap'd section offload_table: 0 bytes
[WPA] Size of mmap'd section mode_table: 0 bytes

Does anyone see problems with this approach? I think this is easy enough and fixes PR67548, so it may still get to mainline? I need to do more testing, but in general I think the implementation is OK as it is. We need a way to force the noltorel model for the testsuite, as the new default will bypass codegen for all our -r -nostdlib testcases.

BTW ltrans now dies with -ftime-report. Any ideas why?

Honza

Index: gcc/common.opt
===================================================================
--- gcc/common.opt (revision 230847)
+++ gcc/common.opt (working copy)
@@ -46,6 +46,13 @@ int optimize_fast
 Variable
 bool in_lto_p = false
 
+; This variable is set to non-0 only by LTO front-end.  1 indicates that
+; the output produced will be used for incremental linking (thus weak symbols
+; can still be bound) and 2 indicates that the IL is going to be linked
+; and output to LTO object file.
+Variable
+int flag_incremental_link = 0
+
 ; 0 means straightforward implementation of complex divide acceptable.
 ; 1 means wide ranges of inputs must work for complex divide.
 ; 2 means C99-like requirements for complex multiply and divide.
Index: gcc/lto-streamer-out.c
===================================================================
--- gcc/lto-streamer-out.c (revision 230847)
+++ gcc/lto-streamer-out.c (working copy)
@@ -2286,13 +2286,16 @@ lto_output (void)
             }
           decl_state = lto_new_out_decl_state ();
           lto_push_out_decl_state (decl_state);
-          if (gimple_has_body_p (node->decl) || !flag_wpa
+          if (gimple_has_body_p (node->decl)
               /* Thunks have no body but they may be synthetized
                  at WPA time.  */
               || DECL_ARGUMENTS (node->decl))
             output_function (node);
           else
-            copy_function_or_variable (node);
+            {
+              gcc_checking_assert (flag_wpa || flag_incremental_link == 2);
+              copy_function_or_variable (node);
+            }
           gcc_assert (lto_get_out_decl_state () == decl_state);
           lto_pop_out_decl_state ();
           lto_record_function_out_decl_state (node->decl, decl_state);
@@ -2318,7 +2321,7 @@ lto_output (void)
           decl_state = lto_new_out_decl_state ();
           lto_push_out_decl_state (decl_state);
           if (DECL_INITIAL (node->decl) != error_mark_node
-              || !flag_wpa)
+              || (!flag_wpa && flag_incremental_link != 2))
             output_constructor (node);
           else
             copy_function_or_variable (node);
Index: gcc/passes.c
===================================================================
--- gcc/passes.c (revision 230847)
+++ gcc/passes.c (working copy)
@@ -2530,7 +2530,7 @@ ipa_write_summaries (void)
         {
           struct cgraph_node *node = order[i];
 
-          if (node->has_gimple_body_p ())
+          if (gimple_has_body_p (node->decl))
             {
               /* When streaming out references to statements as part of some IPA
                  pass summary, the statements need to have uids assigned and the
Index: gcc/cgraphunit.c
===================================================================
--- gcc/cgraphunit.c (revision 230847)
+++ gcc/cgraphunit.c (working copy)
@@ -2270,8 +2270,10 @@ ipa_passes (void)
   if (flag_generate_lto || flag_generate_offload)
     targetm.asm_out.lto_start ();
 
-  if (!in_lto_p)
+  if (!in_lto_p || flag_incremental_link == 2)
     {
+      if (!quiet_flag)
+        fprintf (stderr, "Streaming LTO\n");
       if (g->have_offload)
         {
           section_name_prefix = OFFLOAD_SECTION_NAME_PREFIX;
@@ -2290,7 +2292,9 @@ ipa_passes (void)
   if (flag_generate_lto || flag_generate_offload)
     targetm.asm_out.lto_end ();
 
-  if (!flag_ltrans && (in_lto_p || !flag_lto || flag_fat_lto_objects))
+  if (!flag_ltrans
+      && ((in_lto_p && flag_incremental_link != 2)
+          || !flag_lto || flag_fat_lto_objects))
     execute_ipa_pass_list (passes->all_regular_ipa_passes);
   invoke_plugin_callbacks (PLUGIN_ALL_IPA_PASSES_END, NULL);
 
@@ -2381,7 +2385,8 @@ symbol_table::compile (void)
 
   /* Do nothing else if any IPA pass found errors or if we are just streaming LTO.  */
   if (seen_error ()
-      || (!in_lto_p && flag_lto && !flag_fat_lto_objects))
+      || ((!in_lto_p || flag_incremental_link == 2)
+          && flag_lto && !flag_fat_lto_objects))
     {
       timevar_pop (TV_CGRAPHOPT);
       return;
Index: gcc/lto-cgraph.c
===================================================================
--- gcc/lto-cgraph.c (revision 230847)
+++ gcc/lto-cgraph.c (working copy)
@@ -534,7 +534,10 @@ lto_output_node (struct lto_simple_outpu
   bp_pack_value (&bp, node->thunk.thunk_p, 1);
   bp_pack_value (&bp, node->parallelized_function, 1);
   bp_pack_enum (&bp, ld_plugin_symbol_resolution,
-                LDPR_NUM_KNOWN, node->resolution);
+                LDPR_NUM_KNOWN,
+                /* When doing incremental link, we will get new resolution
+                   info next time we process the file.  */
+                flag_incremental_link ? LDPR_UNKNOWN : node->resolution);
   bp_pack_value (&bp, node->instrumentation_clone, 1);
   bp_pack_value (&bp, node->split_part, 1);
   streamer_write_bitpack (&bp);
Index: gcc/toplev.c
===================================================================
--- gcc/toplev.c (revision 230847)
+++ gcc/toplev.c (working copy)
@@ -504,7 +504,8 @@ compile_file (void)
 
   /* Compilation unit is finalized.  When producing non-fat LTO object, we are
      basically finished.  */
-  if (in_lto_p || !flag_lto || flag_fat_lto_objects)
+  if ((in_lto_p && flag_incremental_link != 2)
+      || !flag_lto || flag_fat_lto_objects)
     {
       /* File-scope initialization for AddressSanitizer.  */
       if (flag_sanitize & SANITIZE_ADDRESS)
Index: gcc/flag-types.h
===================================================================
--- gcc/flag-types.h (revision 230847)
+++ gcc/flag-types.h (working copy)
@@ -265,6 +265,15 @@ enum lto_partition_model {
   LTO_PARTITION_MAX = 4
 };
 
+/* flag_lto_linker_output initialization values.  */
+enum lto_linker_output {
+  LTO_LINKER_OUTPUT_UNKNOWN,
+  LTO_LINKER_OUTPUT_REL,
+  LTO_LINKER_OUTPUT_NOLTOREL,
+  LTO_LINKER_OUTPUT_DYN,
+  LTO_LINKER_OUTPUT_PIE,
+  LTO_LINKER_OUTPUT_EXEC
+};
 
 /* gfortran -finit-real= values.  */
 
Index: gcc/lto/lto.c
===================================================================
--- gcc/lto/lto.c (revision 230847)
+++ gcc/lto/lto.c (working copy)
@@ -3188,6 +3188,8 @@ lto_eh_personality (void)
 static void
 lto_process_name (void)
 {
+  if (flag_incremental_link == 2)
+    setproctitle ("lto1-incremental-link");
   if (flag_lto)
     setproctitle ("lto1-lto");
   if (flag_wpa)
Index: gcc/lto/lang.opt
===================================================================
--- gcc/lto/lang.opt (revision 230847)
+++ gcc/lto/lang.opt (working copy)
@@ -24,6 +24,32 @@
 Language
 LTO
 
+Enum
+Name(lto_linker_output) Type(enum lto_linker_output) UnknownError(unknown linker output %qs)
+
+EnumValue
+Enum(lto_linker_output) String(unknown) Value(LTO_LINKER_OUTPUT_UNKNOWN)
+
+EnumValue
+Enum(lto_linker_output) String(rel) Value(LTO_LINKER_OUTPUT_REL)
+
+EnumValue
+Enum(lto_linker_output) String(noltorel) Value(LTO_LINKER_OUTPUT_NOLTOREL)
+
+EnumValue
+Enum(lto_linker_output) String(dyn) Value(LTO_LINKER_OUTPUT_DYN)
+
+EnumValue
+Enum(lto_linker_output) String(pie) Value(LTO_LINKER_OUTPUT_PIE)
+
+EnumValue
+Enum(lto_linker_output) String(exec) Value(LTO_LINKER_OUTPUT_EXEC)
+
+flinker-output=
+LTO Report Driver Joined RejectNegative Enum(lto_linker_output) Var(flag_lto_linker_output) Init(LTO_LINKER_OUTPUT_UNKNOWN)
+Set linker output type (used internally during LTO optimization)
+
+
 fltrans
 LTO Report Var(flag_ltrans)
 Run the link-time optimizer in local transformation (LTRANS) mode.
Index: gcc/lto/lto-lang.c
===================================================================
--- gcc/lto/lto-lang.c (revision 230847)
+++ gcc/lto/lto-lang.c (working copy)
@@ -819,6 +819,56 @@ lto_post_options (const char **pfilename
   if (flag_wpa)
     flag_generate_lto = 1;
 
+  /* Initialize the codegen flags according to the output type.  */
+  switch (flag_lto_linker_output)
+    {
+    case LTO_LINKER_OUTPUT_REL: /* .o: incremental link producing LTO IL  */
+      /* Configure compiler same way as normal frontend would do with -flto:
+         this way we read the trees (declarations & types), symbol table,
+         optimization summaries and link them.  Subsequently we output new LTO
+         file.  */
+      flag_lto = "";
+      flag_incremental_link = 2;
+      flag_whole_program = 0;
+      flag_wpa = 0;
+      flag_generate_lto = 1;
+      /* It would be cool to produce .o file directly, but our current
+         simple objects does not contain the lto symbol markers.  Go the slow
+         way through the asm file.  */
+      lang_hooks.lto.begin_section = lhd_begin_section;
+      lang_hooks.lto.append_data = lhd_append_data;
+      lang_hooks.lto.end_section = lhd_end_section;
+      if (flag_ltrans)
+        error ("-flinker-output=rel and -fltrans are mutually exclusive");
+      break;
+
+    case LTO_LINKER_OUTPUT_NOLTOREL: /* .o: incremental link producing asm  */
+      flag_whole_program = 0;
+      flag_incremental_link = 1;
+      break;
+
+    case LTO_LINKER_OUTPUT_DYN: /* .so: PIC library  */
+      /* On some targets, like i386 it makes sense to build PIC library without
+         -fpic for performance reasons.  So no need to adjust flags.  */
+      break;
+
+    case LTO_LINKER_OUTPUT_PIE: /* PIE binary  */
+      /* If -fPIC or -fPIE was used at compile time, be sure that
+         flag_pie is 2.  */
+      if (!flag_pie && flag_pic)
+        flag_pie = flag_pic;
+      flag_pic = 0;
+      break;
+
+    case LTO_LINKER_OUTPUT_EXEC: /* Normal executable  */
+      flag_pic = 0;
+      flag_pie = 0;
+      break;
+
+    case LTO_LINKER_OUTPUT_UNKNOWN:
+      break;
+    }
+
   /* Excess precision other than "fast" requires front-end support.  */
   flag_excess_precision_cmdline = EXCESS_PRECISION_FAST;
 
@@ -1214,7 +1264,7 @@ lto_init (void)
   int i;
 
   /* We need to generate LTO if running in WPA mode.  */
-  flag_generate_lto = (flag_wpa != NULL);
+  flag_generate_lto = (flag_incremental_link == 2 || flag_wpa != NULL);
 
   /* Create the basic integer types.  */
   build_common_tree_nodes (flag_signed_char, flag_short_double);
Index: gcc/ipa-visibility.c
===================================================================
--- gcc/ipa-visibility.c (revision 230847)
+++ gcc/ipa-visibility.c (working copy)
@@ -217,13 +217,13 @@ cgraph_externally_visible_p (struct cgra
      This improves code quality and we know we will duplicate them at most twice
      (in the case that we are not using plugin and link with object file
       implementing same COMDAT)  */
-  if ((in_lto_p || whole_program)
+  if (((in_lto_p || whole_program) && !flag_incremental_link)
       && DECL_COMDAT (node->decl)
       && comdat_can_be_unshared_p (node))
     return false;
 
   /* When doing link time optimizations, hidden symbols become local.  */
-  if (in_lto_p
+  if ((in_lto_p && !flag_incremental_link)
       && (DECL_VISIBILITY (node->decl) == VISIBILITY_HIDDEN
           || DECL_VISIBILITY (node->decl) == VISIBILITY_INTERNAL)
       /* Be sure that node is defined in IR file, not in other object
@@ -293,13 +293,13 @@ varpool_node::externally_visible_p (void
      so this does not enable more optimization, but referring static var
      is faster for dynamic linking.  Also this match logic hidding vtables
      from LTO symbol tables.  */
-  if ((in_lto_p || flag_whole_program)
+  if (((in_lto_p || flag_whole_program) && !flag_incremental_link)
       && DECL_COMDAT (decl)
       && comdat_can_be_unshared_p (this))
     return false;
 
   /* When doing link time optimizations, hidden symbols become local.  */
-  if (in_lto_p
+  if (in_lto_p && !flag_incremental_link
       && (DECL_VISIBILITY (decl) == VISIBILITY_HIDDEN
           || DECL_VISIBILITY (decl) == VISIBILITY_INTERNAL)
       /* Be sure that node is defined in IR file, not in other object
Index: gcc/lto-wrapper.c
===================================================================
--- gcc/lto-wrapper.c (revision 230847)
+++ gcc/lto-wrapper.c (working copy)
@@ -953,9 +953,15 @@ run_gcc (unsigned argc, char *argv[])
           file_offset = (off_t) loffset;
         }
       fd = open (filename, O_RDONLY | O_BINARY);
+      /* Linker plugin passes -fresolution and -flinker-output options.  */
       if (fd == -1)
         {
           lto_argv[lto_argc++] = argv[i];
+          if (strcmp (argv[i], "-flinker-output=rel") == 0)
+            {
+              no_partition = true;
+              lto_mode = LTO_MODE_LTO;
+            }
           continue;
         }
 
Index: lto-plugin/lto-plugin.c
===================================================================
--- lto-plugin/lto-plugin.c (revision 230847)
+++ lto-plugin/lto-plugin.c (working copy)
@@ -151,6 +151,7 @@ static ld_plugin_add_symbols add_symbols
 
 static struct plugin_file_info *claimed_files = NULL;
 static unsigned int num_claimed_files = 0;
+static unsigned int non_claimed_files = 0;
 
 static struct plugin_file_info *offload_files = NULL;
 static unsigned int num_offload_files = 0;
@@ -167,6 +168,7 @@ static unsigned int num_pass_through_ite
 static char debug;
 static char nop;
 static char *resolution_file = NULL;
+static const char *linker_output = NULL;
 
 /* The version of gold being used, or -1 if not gold.  The number is
    MAJOR * 100 + MINOR.  */
@@ -624,7 +626,7 @@ all_symbols_read_handler (void)
 {
   unsigned i;
   unsigned num_lto_args
-    = num_claimed_files + num_offload_files + lto_wrapper_num_args + 1;
+    = num_claimed_files + num_offload_files + lto_wrapper_num_args + 2;
   char **lto_argv;
   const char **lto_arg_ptr;
   if (num_claimed_files + num_offload_files == 0)
@@ -648,6 +650,15 @@ all_symbols_read_handler (void)
   for (i = 0; i < lto_wrapper_num_args; i++)
     *lto_arg_ptr++ = lto_wrapper_argv[i];
 
+  assert (linker_output);
+  if (non_claimed_files && !strcmp (linker_output, "-flinker-output=rel"))
+    {
+      linker_output = "-flinker-output=noltorel";
+      message (LDPL_WARNING, "incremental linking of LTO and non-LTO "
+               "objects will produce final assembly for LTO objects and "
+               "bypass whole program optimization");
+    }
+  *lto_arg_ptr++ = xstrdup (linker_output);
   for (i = 0; i < num_claimed_files; i++)
     {
       struct plugin_file_info *info = &claimed_files[i];
@@ -985,6 +996,8 @@ claim_file_handler (const struct ld_plug
               num_claimed_files * sizeof (struct plugin_file_info));
       claimed_files[num_claimed_files - 1] = lto_file;
     }
+  else
+    non_claimed_files++;
 
   if (obj.found == 0 && obj.offload == 1)
     {
@@ -1054,6 +1067,31 @@ process_option (const char *option)
     }
 }
 
+/* Pass -flinker-output to the wrapper.  */
+
+void
+add_linker_output_option (int val)
+{
+  switch (val)
+    {
+    case LDPO_REL:
+      linker_output = "-flinker-output=rel";
+      break;
+    case LDPO_DYN:
+      linker_output = "-flinker-output=dyn";
+      break;
+    case LDPO_PIE:
+      linker_output = "-flinker-output=pie";
+      break;
+    case LDPO_EXEC:
+      linker_output = "-flinker-output=exec";
+      break;
+    default:
+      message (LDPL_FATAL, "unsupported linker output %i", val);
+      break;
+    }
+}
+
 /* Called by gold after loading the plugin. TV is the transfer vector. */
 
 enum ld_plugin_status
@@ -1100,6 +1138,9 @@ onload (struct ld_plugin_tv *tv)
         case LDPT_GOLD_VERSION:
           gold_version = p->tv_u.tv_val;
           break;
+        case LDPT_LINKER_OUTPUT:
+          add_linker_output_option (p->tv_u.tv_val);
+          break;
         default:
           break;
         }