From: Richard Biener
Date: Wed, 16 Jan 2019 12:44:00 -0000
Subject: Re: Parallelize the compilation using Threads
To: Giuliano Belinassi
Cc: GCC Development, kernel-usp@googlegroups.com, gold@ime.usp.br, Alfredo Goldman, Gregory.Mounie@imag.fr

On Tue, Jan 15, 2019 at 10:45 PM Giuliano Belinassi wrote:
>
> Hi
>
> I've managed to compile gimple-match.c with -ftime-report, and "phase opt and
> generate" seems to be what takes most of the compilation time. This is captured
> by the "TV_PHASE_OPT_GEN" timevar, and all its occurrences seem to be in
> toplev.c and lto.c.

TV_PHASE_OPT_GEN covers nearly everything besides parsing.  Thus all of the
entries below the "phase *" rows are covered by one of the phases.  It would
probably be nice to split up TV_PHASE_OPT_GEN into GIMPLE, IPA and RTL
optimization phases.
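
As a rough sketch of what such a split could look like (TV_PHASE_GIMPLE_OPT,
TV_PHASE_IPA_OPT and TV_PHASE_RTL_GEN are made-up names for illustration,
not timevars that exist in timevar.def today):

  /* gcc/timevar.def -- hypothetical finer-grained phase timevars.  */
  DEFTIMEVAR (TV_PHASE_GIMPLE_OPT , "phase GIMPLE optimization")
  DEFTIMEVAR (TV_PHASE_IPA_OPT    , "phase IPA optimization")
  DEFTIMEVAR (TV_PHASE_RTL_GEN    , "phase RTL generation")

  /* Wherever TV_PHASE_OPT_GEN is started and stopped today (toplev.c and
     lto.c, as noted above), the IPA, GIMPLE and RTL parts would be
     bracketed separately, e.g.  */
  timevar_start (TV_PHASE_IPA_OPT);
  /* ... run the IPA passes ...  */
  timevar_stop (TV_PHASE_IPA_OPT);

-ftime-report would then show separate "phase" rows instead of one bucket
holding 95% of the time.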

> Any ideas of which part such that this variable captures is
> the most costly? Also, is that percentage in "GGC" column the amount of time
> inside the Garbage Collector?

The percentage for the GGC column is the percentage of total GGC memory,
not time.  See timevar.c:print_row.

The most costly part of opt-and-generate is the various verifiers.  See
the note printed at the bottom:

> Extra diagnostic checks enabled; compiler may run slowly.
> Configure with --enable-checking=release to disable checks.

You can get a clearer picture when you configure GCC with
--enable-checking=release.  For a quick start, passing -fno-checking will
disable the most costly bits already.

Richard.

>
> Time variable : usr  sys  wall  GGC
> phase setup : 0.01 ( 0%)  0.01 ( 0%)  0.02 ( 0%)  1473 kB ( 0%)
> phase parsing : 3.74 ( 4%)  1.43 (30%)  5.17 ( 5%)  294287 kB (16%)
> phase lang. deferred : 0.08 ( 0%)  0.03 ( 1%)  0.11 ( 0%)  7582 kB ( 0%)
> phase opt and generate : 94.10 (95%)  3.26 (67%)  97.46 (93%)  1543477 kB (82%)
> phase last asm : 0.89 ( 1%)  0.09 ( 2%)  0.98 ( 1%)  39802 kB ( 2%)
> phase finalize : 0.00 ( 0%)  0.01 ( 0%)  0.50 ( 0%)  0 kB ( 0%)
> |name lookup : 0.42 ( 0%)  0.12 ( 2%)  0.46 ( 0%)  6162 kB ( 0%)
> |overload resolution : 0.37 ( 0%)  0.13 ( 3%)  0.42 ( 0%)  18172 kB ( 1%)
> garbage collection : 2.99 ( 3%)  0.03 ( 1%)  3.02 ( 3%)  0 kB ( 0%)
> dump files : 0.11 ( 0%)  0.01 ( 0%)  0.16 ( 0%)  0 kB ( 0%)
> callgraph construction : 0.35 ( 0%)  0.01 ( 0%)  0.24 ( 0%)  61143 kB ( 3%)
> callgraph optimization : 0.21 ( 0%)  0.01 ( 0%)  0.17 ( 0%)  175 kB ( 0%)
> ipa function summary : 0.12 ( 0%)  0.00 ( 0%)  0.14 ( 0%)  2216 kB ( 0%)
> ipa dead code removal : 0.04 ( 0%)  0.01 ( 0%)  0.00 ( 0%)  0 kB ( 0%)
> ipa devirtualization : 0.00 ( 0%)  0.00 ( 0%)  0.01 ( 0%)  0 kB ( 0%)
> ipa cp : 0.33 ( 0%)  0.01 ( 0%)  0.39 ( 0%)  9073 kB ( 0%)
> ipa inlining heuristics : 0.48 ( 0%)  0.00 ( 0%)  0.48 ( 0%)  6175 kB ( 0%)
> ipa function splitting : 0.10 ( 0%)  0.01 ( 0%)  0.07 ( 0%)  9111 kB ( 0%)
> ipa comdats : 0.01 ( 0%)  0.00 ( 0%)  0.00 ( 0%)  0 kB ( 0%)
> ipa various optimizations : 0.03 ( 0%)  0.03 ( 1%)  0.01 ( 0%)  480 kB ( 0%)
> ipa reference : 0.01 ( 0%)  0.00 ( 0%)  0.02 ( 0%)  0 kB ( 0%)
> ipa profile : 0.01 ( 0%)  0.00 ( 0%)  0.01 ( 0%)  0 kB ( 0%)
> ipa pure const : 0.13 ( 0%)  0.00 ( 0%)  0.12 ( 0%)  8 kB ( 0%)
> ipa icf : 0.08 ( 0%)  0.00 ( 0%)  0.08 ( 0%)  6 kB ( 0%)
> ipa SRA : 1.26 ( 1%)  0.28 ( 6%)  1.78 ( 2%)  165814 kB ( 9%)
> ipa free lang data : 0.01 ( 0%)  0.00 ( 0%)  0.00 ( 0%)  0 kB ( 0%)
> ipa free inline summary : 0.00 ( 0%)  0.00 ( 0%)  0.03 ( 0%)  0 kB ( 0%)
> cfg construction : 0.09 ( 0%)  0.00 ( 0%)  0.09 ( 0%)  7926 kB ( 0%)
> cfg cleanup : 1.84 ( 2%)  0.00 ( 0%)  1.73 ( 2%)  13673 kB ( 1%)
> CFG verifier : 6.05 ( 6%)  0.12 ( 2%)  6.80 ( 7%)  0 kB ( 0%)
> trivially dead code : 0.32 ( 0%)  0.01 ( 0%)  0.38 ( 0%)  0 kB ( 0%)
> df scan insns : 0.23 ( 0%)  0.00 ( 0%)  0.30 ( 0%)  28 kB ( 0%)
> df multiple defs : 0.13 ( 0%)  0.00 ( 0%)  0.20 ( 0%)  0 kB ( 0%)
> df reaching defs : 0.52 ( 1%)  0.00 ( 0%)  0.55 ( 1%)  0 kB ( 0%)
> df live regs : 2.70 ( 3%)  0.02 ( 0%)  3.08 ( 3%)  425 kB ( 0%)
> df live&initialized regs : 1.28 ( 1%)  0.00 ( 0%)  1.13 ( 1%)  0 kB ( 0%)
> df must-initialized regs : 0.14 ( 0%)  0.00 ( 0%)  0.16 ( 0%)  0 kB ( 0%)
> df use-def / def-use chains : 0.32 ( 0%)  0.00 ( 0%)  0.26 ( 0%)  0 kB ( 0%)
> df reg dead/unused notes : 0.96 ( 1%)  0.01 ( 0%)  0.89 ( 1%)  11726 kB ( 1%)
> register information : 0.29 ( 0%)  0.00 ( 0%)  0.21 ( 0%)  0 kB ( 0%)
> alias analysis : 0.54 ( 1%)  0.00 ( 0%)  0.53 ( 1%)  17487 kB ( 1%)
> alias stmt walking : 1.10 ( 1%)  0.08 ( 2%)  1.22 ( 1%)  118 kB ( 0%)
> register scan : 0.08 ( 0%)  0.01 ( 0%)  0.08 ( 0%)  118 kB ( 0%)
> rebuild jump labels : 0.12 ( 0%)  0.01 ( 0%)  0.11 ( 0%)  0 kB ( 0%)
> preprocessing : 0.29 ( 0%)  0.43 ( 9%)  0.65 ( 1%)  37409 kB ( 2%)
> parser (global) : 0.39 ( 0%)  0.39 ( 8%)  0.94 ( 1%)  92661 kB ( 5%)
> parser struct body : 0.07 ( 0%)  0.00 ( 0%)  0.08 ( 0%)  6159 kB ( 0%)
> parser enumerator list : 0.01 ( 0%)  0.00 ( 0%)  0.01 ( 0%)  3342 kB ( 0%)
> parser function body : 2.37 ( 2%)  0.43 ( 9%)  2.82 ( 3%)  119124 kB ( 6%)
> parser inl. func. body : 0.18 ( 0%)  0.05 ( 1%)  0.16 ( 0%)  10354 kB ( 1%)
> parser inl. meth. body : 0.04 ( 0%)  0.01 ( 0%)  0.03 ( 0%)  2986 kB ( 0%)
> template instantiation : 0.17 ( 0%)  0.08 ( 2%)  0.26 ( 0%)  15801 kB ( 1%)
> constant expression evaluation : 0.06 ( 0%)  0.05 ( 1%)  0.07 ( 0%)  516 kB ( 0%)
> early inlining heuristics : 0.13 ( 0%)  0.00 ( 0%)  0.08 ( 0%)  19547 kB ( 1%)
> inline parameters : 0.14 ( 0%)  0.01 ( 0%)  0.22 ( 0%)  3372 kB ( 0%)
> integration : 1.00 ( 1%)  0.23 ( 5%)  1.22 ( 1%)  132386 kB ( 7%)
> tree gimplify : 0.36 ( 0%)  0.02 ( 0%)  0.31 ( 0%)  63162 kB ( 3%)
> tree eh : 0.03 ( 0%)  0.00 ( 0%)  0.04 ( 0%)  4173 kB ( 0%)
> tree CFG construction : 0.07 ( 0%)  0.00 ( 0%)  0.07 ( 0%)  20805 kB ( 1%)
> tree CFG cleanup : 1.40 ( 1%)  0.14 ( 3%)  1.57 ( 2%)  3995 kB ( 0%)
> tree tail merge : 0.17 ( 0%)  0.01 ( 0%)  0.16 ( 0%)  7251 kB ( 0%)
> tree VRP : 1.94 ( 2%)  0.08 ( 2%)  1.83 ( 2%)  40527 kB ( 2%)
> tree Early VRP : 0.27 ( 0%)  0.03 ( 1%)  0.30 ( 0%)  3298 kB ( 0%)
> tree copy propagation : 0.14 ( 0%)  0.00 ( 0%)  0.08 ( 0%)  427 kB ( 0%)
> tree PTA : 0.61 ( 1%)  0.03 ( 1%)  0.53 ( 1%)  3861 kB ( 0%)
> tree PHI insertion : 0.01 ( 0%)  0.02 ( 0%)  0.03 ( 0%)  8529 kB ( 0%)
> tree SSA rewrite : 0.23 ( 0%)  0.03 ( 1%)  0.43 ( 0%)  24334 kB ( 1%)
> tree SSA other : 0.10 ( 0%)  0.01 ( 0%)  0.10 ( 0%)  538 kB ( 0%)
> tree SSA incremental : 0.79 ( 1%)  0.07 ( 1%)  0.88 ( 1%)  11828 kB ( 1%)
> tree operand scan : 1.33 ( 1%)  0.30 ( 6%)  1.51 ( 1%)  56249 kB ( 3%)
> dominator optimization : 1.92 ( 2%)  0.07 ( 1%)  1.90 ( 2%)  31786 kB ( 2%)
> backwards jump threading : 0.20 ( 0%)  0.02 ( 0%)  0.16 ( 0%)  8676 kB ( 0%)
> tree SRA : 0.17 ( 0%)  0.01 ( 0%)  0.09 ( 0%)  6050 kB ( 0%)
> isolate eroneous paths : 0.01 ( 0%)  0.00 ( 0%)  0.04 ( 0%)  1319 kB ( 0%)
> tree CCP : 0.67 ( 1%)  0.08 ( 2%)  0.62 ( 1%)  4190 kB ( 0%)
> tree PHI const/copy prop : 0.10 ( 0%)  0.00 ( 0%)  0.02 ( 0%)  132 kB ( 0%)
> tree split crit edges : 0.12 ( 0%)  0.00 ( 0%)  0.15 ( 0%)  10236 kB ( 1%)
> tree reassociation : 0.14 ( 0%)  0.00 ( 0%)  0.08 ( 0%)  168 kB ( 0%)
> tree PRE : 0.74 ( 1%)  0.04 ( 1%)  0.76 ( 1%)  16728 kB ( 1%)
> tree FRE : 0.69 ( 1%)  0.04 ( 1%)  0.60 ( 1%)  5370 kB ( 0%)
> tree code sinking : 0.06 ( 0%)  0.01 ( 0%)  0.06 ( 0%)  9670 kB ( 1%)
> tree linearize phis : 0.10 ( 0%)  0.00 ( 0%)  0.09 ( 0%)  699 kB ( 0%)
> tree backward propagate : 0.03 ( 0%)  0.00 ( 0%)  0.01 ( 0%)  0 kB ( 0%)
> tree forward propagate : 0.52 ( 1%)  0.04 ( 1%)  0.48 ( 0%)  3055 kB ( 0%)
> tree phiprop : 0.05 ( 0%)  0.00 ( 0%)  0.01 ( 0%)  0 kB ( 0%)
> tree conservative DCE : 0.27 ( 0%)  0.03 ( 1%)  0.43 ( 0%)  1557 kB ( 0%)
> tree aggressive DCE : 0.21 ( 0%)  0.04 ( 1%)  0.23 ( 0%)  2565 kB ( 0%)
> tree buildin call DCE : 0.00 ( 0%)  0.00 ( 0%)  0.04 ( 0%)  0 kB ( 0%)
> tree DSE : 0.18 ( 0%)  0.01 ( 0%)  0.18 ( 0%)  274 kB ( 0%)
> PHI merge : 0.07 ( 0%)  0.00 ( 0%)  0.06 ( 0%)  3170 kB ( 0%)
> tree loop optimization : 0.00 ( 0%)  0.00 ( 0%)  0.04 ( 0%)  0 kB ( 0%)
> loopless fn : 0.01 ( 0%)  0.00 ( 0%)  0.01 ( 0%)  0 kB ( 0%)
> tree loop invariant motion : 0.03 ( 0%)  0.00 ( 0%)  0.02 ( 0%)  0 kB ( 0%)
> tree canonical iv : 0.01 ( 0%)  0.00 ( 0%)  0.00 ( 0%)  58 kB ( 0%)
> complete unrolling : 0.00 ( 0%)  0.00 ( 0%)  0.01 ( 0%)  361 kB ( 0%)
> tree iv optimization : 0.00 ( 0%)  0.00 ( 0%)  0.01 ( 0%)  128 kB ( 0%)
> tree copy headers : 0.02 ( 0%)  0.00 ( 0%)  0.01 ( 0%)  414 kB ( 0%)
> tree SSA uncprop : 0.06 ( 0%)  0.00 ( 0%)  0.09 ( 0%)  0 kB ( 0%)
> tree NRV optimization : 0.01 ( 0%)  0.00 ( 0%)  0.05 ( 0%)  14 kB ( 0%)
> tree SSA verifier : 8.44 ( 9%)  0.26 ( 5%)  8.77 ( 8%)  0 kB ( 0%)
> tree STMT verifier : 12.57 (13%)  0.35 ( 7%)  13.03 (12%)  0 kB ( 0%)
> tree switch conversion : 0.00 ( 0%)  0.00 ( 0%)  0.01 ( 0%)  5 kB ( 0%)
> tree switch lowering : 0.02 ( 0%)  0.00 ( 0%)  0.02 ( 0%)  1194 kB ( 0%)
> gimple CSE sin/cos : 0.01 ( 0%)  0.00 ( 0%)  0.01 ( 0%)  0 kB ( 0%)
> gimple widening/fma detection : 0.06 ( 0%)  0.00 ( 0%)  0.03 ( 0%)  2 kB ( 0%)
> tree strlen optimization : 0.03 ( 0%)  0.00 ( 0%)  0.05 ( 0%)  0 kB ( 0%)
> callgraph verifier : 0.93 ( 1%)  0.07 ( 1%)  0.99 ( 1%)  0 kB ( 0%)
> dominance frontiers : 0.14 ( 0%)  0.00 ( 0%)  0.07 ( 0%)  0 kB ( 0%)
> dominance computation : 1.98 ( 2%)  0.05 ( 1%)  2.17 ( 2%)  0 kB ( 0%)
> control dependences : 0.03 ( 0%)  0.00 ( 0%)  0.01 ( 0%)  0 kB ( 0%)
> out of ssa : 0.11 ( 0%)  0.00 ( 0%)  0.11 ( 0%)  253 kB ( 0%)
> expand vars : 0.12 ( 0%)  0.00 ( 0%)  0.12 ( 0%)  5803 kB ( 0%)
> expand : 0.68 ( 1%)  0.02 ( 0%)  0.75 ( 1%)  129150 kB ( 7%)
> post expand cleanups : 0.09 ( 0%)  0.00 ( 0%)  0.03 ( 0%)  1400 kB ( 0%)
> varconst : 0.01 ( 0%)  0.01 ( 0%)  0.01 ( 0%)  13 kB ( 0%)
> lower subreg : 0.02 ( 0%)  0.00 ( 0%)  0.02 ( 0%)  63 kB ( 0%)
> forward prop : 0.32 ( 0%)  0.01 ( 0%)  0.34 ( 0%)  7384 kB ( 0%)
> CSE : 1.03 ( 1%)  0.02 ( 0%)  0.95 ( 1%)  4656 kB ( 0%)
> dead code elimination : 0.23 ( 0%)  0.00 ( 0%)  0.22 ( 0%)  0 kB ( 0%)
> dead store elim1 : 0.40 ( 0%)  0.00 ( 0%)  0.34 ( 0%)  5665 kB ( 0%)
> dead store elim2 : 0.60 ( 1%)  0.00 ( 0%)  0.65 ( 1%)  9079 kB ( 0%)
> loop analysis : 0.01 ( 0%)  0.00 ( 0%)  0.02 ( 0%)  0 kB ( 0%)
> loop init : 1.31 ( 1%)  0.05 ( 1%)  1.64 ( 2%)  5802 kB ( 0%)
> loop invariant motion : 0.02 ( 0%)  0.00 ( 0%)  0.02 ( 0%)  19 kB ( 0%)
> loop fini : 0.02 ( 0%)  0.01 ( 0%)  0.04 ( 0%)  0 kB ( 0%)
> CPROP : 1.27 ( 1%)  0.01 ( 0%)  1.14 ( 1%)  30881 kB ( 2%)
> PRE : 0.61 ( 1%)  0.00 ( 0%)  0.59 ( 1%)  1920 kB ( 0%)
> CSE 2 : 0.57 ( 1%)  0.01 ( 0%)  0.58 ( 1%)  2822 kB ( 0%)
> branch prediction : 0.08 ( 0%)  0.01 ( 0%)  0.10 ( 0%)  887 kB ( 0%)
> combiner : 1.15 ( 1%)  0.00 ( 0%)  1.28 ( 1%)  35520 kB ( 2%)
> if-conversion : 0.24 ( 0%)  0.00 ( 0%)  0.22 ( 0%)  5851 kB ( 0%)
> integrated RA : 2.29 ( 2%)  0.03 ( 1%)  2.37 ( 2%)  54041 kB ( 3%)
> LRA non-specific : 0.97 ( 1%)  0.01 ( 0%)  1.04 ( 1%)  5294 kB ( 0%)
> LRA virtuals elimination : 0.44 ( 0%)  0.00 ( 0%)  0.39 ( 0%)  6089 kB ( 0%)
> LRA reload inheritance : 0.17 ( 0%)  0.00 ( 0%)  0.27 ( 0%)  5783 kB ( 0%)
> LRA create live ranges : 1.07 ( 1%)  0.00 ( 0%)  1.09 ( 1%)  1004 kB ( 0%)
> LRA hard reg assignment : 0.11 ( 0%)  0.00 ( 0%)  0.09 ( 0%)  0 kB ( 0%)
> LRA rematerialization : 0.20 ( 0%)  0.00 ( 0%)  0.20 ( 0%)  0 kB ( 0%)
> reload : 0.02 ( 0%)  0.00 ( 0%)  0.03 ( 0%)  0 kB ( 0%)
> reload CSE regs : 0.90 ( 1%)  0.01 ( 0%)  0.80 ( 1%)  13780 kB ( 1%)
> ree : 0.13 ( 0%)  0.00 ( 0%)  0.10 ( 0%)  589 kB ( 0%)
> thread pro- & epilogue : 0.51 ( 1%)  0.01 ( 0%)  0.57 ( 1%)  2328 kB ( 0%)
> if-conversion 2 : 0.08 ( 0%)  0.00 ( 0%)  0.08 ( 0%)  319 kB ( 0%)
> combine stack adjustments : 0.04 ( 0%)  0.00 ( 0%)  0.02 ( 0%)  0 kB ( 0%)
> peephole 2 : 0.12 ( 0%)  0.00 ( 0%)  0.18 ( 0%)  1242 kB ( 0%)
> hard reg cprop : 0.57 ( 1%)  0.00 ( 0%)  0.49 ( 0%)  189 kB ( 0%)
> scheduling 2 : 2.53 ( 3%)  0.03 ( 1%)  2.53 ( 2%)  5740 kB ( 0%)
> machine dep reorg : 0.08 ( 0%)  0.00 ( 0%)  0.07 ( 0%)  0 kB ( 0%)
> reorder blocks : 0.74 ( 1%)  0.01 ( 0%)  0.69 ( 1%)  6926 kB ( 0%)
> shorten branches : 0.20 ( 0%)  0.00 ( 0%)  0.16 ( 0%)  0 kB ( 0%)
> final : 0.85 ( 1%)  0.01 ( 0%)  0.97 ( 1%)  115151 kB ( 6%)
> symout : 1.17 ( 1%)  0.11 ( 2%)  1.25 ( 1%)  202121 kB (11%)
> variable tracking : 0.77 ( 1%)  0.01 ( 0%)  0.81 ( 1%)  45792 kB ( 2%)
> var-tracking dataflow : 1.30 ( 1%)  0.01 ( 0%)  1.24 ( 1%)  926 kB ( 0%)
> var-tracking emit : 1.43 ( 1%)  0.01 ( 0%)  1.42 ( 1%)  57281 kB ( 3%)
> tree if-combine : 0.06 ( 0%)  0.00 ( 0%)  0.02 ( 0%)  417 kB ( 0%)
> uninit var analysis : 0.03 ( 0%)  0.00 ( 0%)  0.02 ( 0%)  0 kB ( 0%)
> straight-line strength reduction : 0.04 ( 0%)  0.00 ( 0%)  0.03 ( 0%)  525 kB ( 0%)
> store merging : 0.04 ( 0%)  0.00 ( 0%)  0.03 ( 0%)  492 kB ( 0%)
> initialize rtl : 0.01 ( 0%)  0.00 ( 0%)  0.04 ( 0%)  12 kB ( 0%)
> address lowering : 0.04 ( 0%)  0.00 ( 0%)  0.02 ( 0%)  2 kB ( 0%)
> early local passes : 0.02 ( 0%)  0.01 ( 0%)  0.00 ( 0%)  0 kB ( 0%)
> unaccounted optimizations : 0.01 ( 0%)  0.00 ( 0%)  0.00 ( 0%)  0 kB ( 0%)
> rest of compilation : 1.29 ( 1%)  0.01 ( 0%)  1.11 ( 1%)  5063 kB ( 0%)
> remove unused locals : 0.25 ( 0%)  0.04 ( 1%)  0.25 ( 0%)  37 kB ( 0%)
> address taken : 0.11 ( 0%)  0.10 ( 2%)  0.25 ( 0%)  0 kB ( 0%)
> verify loop closed : 0.00 ( 0%)  0.00 ( 0%)  0.01 ( 0%)  0 kB ( 0%)
> verify RTL sharing : 5.24 ( 5%)  0.05 ( 1%)  5.37 ( 5%)  0 kB ( 0%)
> rebuild frequencies : 0.04 ( 0%)  0.00 ( 0%)  0.06 ( 0%)  621 kB ( 0%)
> repair loop structures : 0.17 ( 0%)  0.00 ( 0%)  0.24 ( 0%)  0 kB ( 0%)
> TOTAL : 98.82  4.83  104.24  1886632 kB
> Extra diagnostic checks enabled; compiler may run slowly.
> Configure with --enable-checking=release to disable checks.
>
> real    1m54.934s
> user    1m48.938s
> sys     0m5.196s
>
>
> Thank you
> Giuliano.
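
Summing just the verifier rows in the report above shows how much of that
"phase opt and generate" time is the extra checking:

  CFG verifier          6.05
  tree SSA verifier     8.44
  tree STMT verifier   12.57
  callgraph verifier    0.93
  verify RTL sharing    5.24
  --------------------------
                       33.23 s   of 98.82 s user time, roughly a third

most of which should disappear with --enable-checking=release or -fno-checking.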

>
> On 01/14, Richard Biener wrote:
> > On Mon, Jan 14, 2019 at 12:41 PM Giuliano Belinassi wrote:
> > >
> > > Hi,
> > >
> > > I am currently studying the GIMPLE IR documentation and thinking about a
> > > way easily gather the timing information. I was thinking about
> > > adding this feature to gcc to show/dump the elapsed time on GIMPLE. Does
> > > this makes sense? Is this already implemented somewhere? Where is a good
> > > way to start it?
> >
> > There's -ftime-report which more-or-less tells you the time spent in the
> > individual passes.  I think there's no overall group to count GIMPLE
> > optimizers vs. RTL optimizers though.
> >
> > > Richard Biener: I would like to know What is your nickname in IRC :)
> >
> > It's richi.
> >
> > Richard.
> >
> > > Thank you,
> > > Giuliano.
> > >
> > > On 12/17, Richard Biener wrote:
> > > > On Wed, Dec 12, 2018 at 4:46 PM Giuliano Augusto Faulin Belinassi wrote:
> > > > >
> > > > > Hi, I have some news. :-)
> > > > >
> > > > > I replicated the Martin Liška experiment [1] on a 64-cores machine for
> > > > > gcc [2] and Linux kernel [3] (Linux kernel was fully parallelized),
> > > > > and I am excited to dive into this problem. As a result, I want to
> > > > > propose GSoC project on this issue, starting with something like:
> > > > >     1- Systematically create a benchmark for easily information
> > > > > gathering. Martin Liška already made the first version of it, but I
> > > > > need to improve it.
> > > > >     2- Find and document the global states (Try to reduce the gcc's
> > > > > global states as well).
> > > > >     3- Define the parallelization strategy.
> > > > >     4- First parallelization attempt.
> > > > >
> > > > > I also proposed this issue as a research project to my advisor and he
> > > > > supported me on this idea. So I can work for at least one year on
> > > > > this, and other things related to it.
> > > > >
> > > > > Would anyone be willing to mentor me on this?
> > > >
> > > > As the one who initially suggested the project I'm certainly willing
> > > > to mentor you on this.
> > > >
> > > > Richard.
> > > >
> > > > > [1] https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> > > > > [2] https://www.ime.usp.br/~belinass/64cores-experiment.svg
> > > > > [3] https://www.ime.usp.br/~belinass/64cores-kernel-experiment.svg
> > > > > On Mon, Nov 19, 2018 at 8:53 AM Richard Biener wrote:
> > > > > >
> > > > > > On Fri, Nov 16, 2018 at 8:00 PM Giuliano Augusto Faulin Belinassi wrote:
> > > > > > >
> > > > > > > Hi! Sorry for the late reply again :P
> > > > > > >
> > > > > > > On Thu, Nov 15, 2018 at 8:29 AM Richard Biener wrote:
> > > > > > > >
> > > > > > > > On Wed, Nov 14, 2018 at 10:47 PM Giuliano Augusto Faulin Belinassi wrote:
> > > > > > > > >
> > > > > > > > > As a brief introduction, I am a graduate student that got interested
> > > > > > > > >
> > > > > > > > > in the "Parallelize the compilation using threads"(GSoC 2018 [1]). I
> > > > > > > > > am a newcomer in GCC, but already have sent some patches, some of
> > > > > > > > > them have already been accepted [2].
> > > > > > > > >
> > > > > > > > > I brought this subject up in IRC, but maybe here is a proper place to
> > > > > > > > > discuss this topic.
> > > > > > > > >
> > > > > > > > > From my point of view, parallelizing GCC itself will only speed up the
> > > > > > > > > compilation of projects which have a big file that creates a
> > > > > > > > > bottleneck in the whole project compilation (note: by big, I mean the
> > > > > > > > > amount of code to generate).
> > > > > > > >
> > > > > > > > That's true.  During GCC bootstrap there are some of those (see PR84402).
> > > > > > > >
> > > > > > > > One way to improve parallelism is to use link-time optimization where
> > > > > > > > even single source files can be split up into multiple link-time units.  But
> > > > > > > > then there's the serial whole-program analysis part.
> > > > > > >
> > > > > > > Did you mean this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402 ?
> > > > > > > That is a lot of data :-)
> > > > > > >
> > > > > > > It seems that 'phase opt and generate' is the most time-consuming
> > > > > > > part. Is that the 'GIMPLE optimization pipeline' you were talking
> > > > > > > about in this thread:
> > > > > > > https://gcc.gnu.org/ml/gcc/2018-03/msg00202.html
> > > > > >
> > > > > > It's everything that comes after the frontend parsing bits, thus this
> > > > > > includes in particular RTL optimization and early GIMPLE optimizations.
> > > > > >
> > > > > > > > > Additionally, I know that GCC must not
> > > > > > > > > change the project layout, but from the software engineering perspective,
> > > > > > > > > this may be a bad smell that indicates that the file should be broken
> > > > > > > > > into smaller files. Finally, the Makefiles will take care of the
> > > > > > > > > parallelization task.
> > > > > > > >
> > > > > > > > What do you mean by GCC must not change the project layout?  GCC
> > > > > > > > happily re-orders functions and link-time optimization will reorder
> > > > > > > > TUs (well, linking may as well).
> > > > > > >
> > > > > > > That was a response to a comment made on IRC:
> > > > > > >
> > > > > > > On Thu, Nov 15, 2018 at 9:44 AM Jonathan Wakely wrote:
> > > > > > > >I think this is in response to a comment I made on IRC. Giuliano said
> > > > > > > >that if a project has a very large file that dominates the total build
> > > > > > > >time, the file should be split up into smaller pieces. I said "GCC
> > > > > > > >can't restructure people's code. it can only try to compile it
> > > > > > > >faster". We weren't referring to code transformations in the compiler
> > > > > > > >like re-ordering functions, but physically refactoring the source
> > > > > > > >code.
> > > > > > >
> > > > > > > Yes. But from one of the attachments from PR84402, it seems that such
> > > > > > > files exist on GCC,
> > > > > > > https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> > > > > > >
> > > > > > > > > My questions are:
> > > > > > > > >
> > > > > > > > > 1. Is there any project compilation that will significantly be improved
> > > > > > > > > if GCC runs in parallel? Do someone has data about something related
> > > > > > > > > to that? How about the Linux Kernel? If not, I can try to bring some.
> > > > > > > >
> > > > > > > > We do not have any data about this apart from experiments with
> > > > > > > > splitting up source files for PR84402.
> > > > > > > >
> > > > > > > > > 2. Did I correctly understand the goal of the parallelization? Can
> > > > > > > > > anyone provide extra details to me?
> > > > > > > >
> > > > > > > > You may want to search the mailing list archives since we had a
> > > > > > > > student application (later revoked) for the task with some discussion.
> > > > > > > >
> > > > > > > > In my view (I proposed the thing) the most interesting parts are
> > > > > > > > getting GCCs global state documented and reduced.  The parallelization
> > > > > > > > itself is an interesting experiment but whether there will be any
> > > > > > > > substantial improvement for builds that can already benefit from make
> > > > > > > > parallelism remains a question.
> > > > > > >
> > > > > > > As I agree that documenting GCC's global states is good for the
> > > > > > > community and the development of GCC, I really don't think this a good
> > > > > > > motivation for parallelizing a compiler from a research standpoint.
> > > > > >
> > > > > > True ;)  Note that my suggestions to the other GSoC student were
> > > > > > purely based on where it's easiest to experiment with parallelization
> > > > > > and not where it would be most beneficial.
> > > > > >
> > > > > > > There must be something or someone that could take advantage of the
> > > > > > > fine-grained parallelism. But that data from PR84402 seems to have the
> > > > > > > answer to it. :-)
> > > > > > >
> > > > > > > On Thu, Nov 15, 2018 at 4:07 PM Szabolcs Nagy wrote:
> > > > > > > >
> > > > > > > > On 15/11/18 10:29, Richard Biener wrote:
> > > > > > > > > In my view (I proposed the thing) the most interesting parts are
> > > > > > > > > getting GCCs global state documented and reduced.  The parallelization
> > > > > > > > > itself is an interesting experiment but whether there will be any
> > > > > > > > > substantial improvement for builds that can already benefit from make
> > > > > > > > > parallelism remains a question.
> > > > > > > >
> > > > > > > > in the common case (project with many small files, much more than
> > > > > > > > core count) i'd expect a regression:
> > > > > > > >
> > > > > > > > if gcc itself tries to parallelize that introduces inter thread
> > > > > > > > synchronization and potential false sharing in gcc (e.g. malloc
> > > > > > > > locks) that does not exist with make parallelism (glibc can avoid
> > > > > > > > some atomic instructions when a process is single threaded).
> > > > > > >
> > > > > > > That is what I am mostly worried about. Or the most costly part is not
> > > > > > > parallelizable at all. Also, I would expect a regression on very small
> > > > > > > files, which probably could be avoided implementing this feature as a
> > > > > > > flag?
> > > > > >
> > > > > > I think the issue should be avoided by avoiding fine-grained parallelism.
> > > > > > Which might be somewhat hard given there are core data structures that
> > > > > > are shared (the memory allocator for a start).
> > > > > >
> > > > > > The other issue I am more worried about is that we probably have to
> > > > > > interact with make somehow so that we do not end up with 64 threads
> > > > > > when one does -j8 on a 8 core machine.  That's basically the same
> > > > > > issue we run into with -flto and its threaded WPA writeout or recursive
> > > > > > invocation of make.
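
To make the -j interaction concrete: GNU make hands out job slots through the
jobserver it advertises in MAKEFLAGS, which is what -flto=jobserver already
relies on, and a threaded cc1 would have to speak the same protocol. Below is
a minimal standalone sketch of a jobserver client, purely illustrative and not
existing GCC code; it assumes the classic pipe-based protocol:

  /* Sketch of a GNU make jobserver client: each concurrent worker beyond
     the first needs a one-byte token from the jobserver pipe and must give
     it back when it is done.  */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int
  main (void)
  {
    const char *makeflags = getenv ("MAKEFLAGS");
    int rfd = -1, wfd = -1;

    /* make >= 4.2 passes --jobserver-auth=R,W, older makes --jobserver-fds=R,W.
       (make 4.4 may pass a fifo:PATH form instead, not handled here.)  */
    const char *opt = makeflags ? strstr (makeflags, "--jobserver-auth=") : NULL;
    if (!opt && makeflags)
      opt = strstr (makeflags, "--jobserver-fds=");
    if (!opt || sscanf (strchr (opt, '=') + 1, "%d,%d", &rfd, &wfd) != 2)
      return 0;  /* No jobserver: act as if -j1 and stay single-threaded.  */

    char token;
    if (read (rfd, &token, 1) == 1)   /* blocks until make grants a slot */
      {
        /* ... run one extra worker thread here ... */
        write (wfd, &token, 1);       /* hand the slot back to make */
      }
    return 0;
  }

One wrinkle: make only keeps those descriptors open for commands it considers
recursive (marked with '+' or invoking $(MAKE)), so the driver would have to
arrange for that as well.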

> > > > > >
> > > > > > >
> > > > > > > On Fri, Nov 16, 2018 at 11:05 AM Martin Jambor wrote:
> > > > > > > >
> > > > > > > > Hi Giuliano,
> > > > > > > >
> > > > > > > > On Thu, Nov 15 2018, Richard Biener wrote:
> > > > > > > > > You may want to search the mailing list archives since we had a
> > > > > > > > > student application (later revoked) for the task with some discussion.
> > > > > > > >
> > > > > > > > Specifically, the whole thread beginning with
> > > > > > > > https://gcc.gnu.org/ml/gcc/2018-03/msg00179.html
> > > > > > > >
> > > > > > > > Martin
> > > > > > >
> > > > > > > Yes, I will research this carefully ;-)
> > > > > > >
> > > > > > > Thank you