From: David Malcolm
To: gcc-patches@gcc.gnu.org, binutils@sourceware.org
Cc: David Malcolm
Subject: [PATCH 00/16] RFC: Embedding as and ld inside gcc driver and into libgccjit
Date: Mon, 01 Jun 2015 20:50:00 -0000
Message-Id: <1433192664-50156-1-git-send-email-dmalcolm@redhat.com>

[Crossposting to both the gcc-patches and binutils lists, since this
patch kit touches both source trees.]

Binutils devs: GCC 5 gained a way to build GCC as a shared library,
libgccjit.so.

I've been experimenting with ways of optimizing libgccjit, and the
following patch kit (touching both gcc and binutils) achieves a 5x
speedup of gcc/testsuite/jit.dg/test-benchmark.c on this x86_64 box
(Fedora 20).

The benchmark constructs IR for a simple function in memory, compiles
it, and runs it, 100 times in a row, in the hope of simulating the
workload of an interpreter/VM/language runtime, where bytecode functions
gradually become "hot" (e.g. the interpretation count exceeds a
threshold) and are compiled to machine code, all within one process.

gcc's backend code emits .s files, and libgccjit currently uses pex to
invoke the gcc driver to turn the .s file into a .so file (the driver in
turn invokes "as" and "ld"). These invocations dominate the time taken
by libgccjit, so the patch series attempts to time them, and to move
them in-process; doing so largely eliminates their cost.

Here are the performance gains:

jit.dg/test-benchmark.c, 100 iterations at optlevel 0:
  Without embedded driver:      wallclock of 5.300s (0.053s per iteration)
  With embedded driver:         wallclock of 4.630s (0.046s per iteration)
  With embedded driver & gas:   wallclock of 3.510s (0.035s per iteration)
  With embedded driver&as&ld:   wallclock of 2.130s (0.021s per iteration)
  As above, hacking up ld args: wallclock of 1.030s (0.010s per iteration)

i.e. about a 5x speedup (5.300s / 1.030s is roughly 5.1).

There are some memory leaks, FIXMEs, etc, and it hasn't been fully
tested yet, but I thought it was time to post this for discussion.

The patch kit also generalizes gcc's timevar mechanism in such a way
that it can be used both by jit client code, and by "as" and "ld".

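To illustrate the general idea of named timing items that accumulate across many iterations and are reported together, here is a minimal Python sketch. It is not gcc's timevar API, nor the interface proposed in this patch kit: the class `TimingReport`, its `item` method and the injectable `clock` parameter are all hypothetical names chosen for illustration, and the sketch tracks wallclock time only and does not model nested items.

```python
import time
from contextlib import contextmanager


class TimingReport:
    """Hypothetical helper: accumulates wallclock seconds per named item."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the sketch can be driven by a fake
        # clock; by default it uses a monotonic high-resolution timer.
        self._clock = clock
        self.totals = {}  # item name -> accumulated seconds

    @contextmanager
    def item(self, name):
        """Time the enclosed block and add it to the named item's total."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed

    def lines(self):
        """Yield one report line per item, with its share of the total."""
        total = sum(self.totals.values())
        for name, secs in self.totals.items():
            pct = 100.0 * secs / total if total else 0.0
            yield f"{name:<20}: {secs:6.2f} ({pct:3.0f}%) wall"


if __name__ == "__main__":
    report = TimingReport()
    for _ in range(100):
        with report.item("compile"):
            sum(range(1000))  # stand-in for real work
        with report.item("load JIT result"):
            sum(range(100))   # stand-in for real work
    for line in report.lines():
        print(line)
```

Because every caller records into the same table of named items, timings from different components end up in a single report, which is the property the combined report relies on.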
An example of a combined report on the accumulated timings of 100
iterations of jit.dg/test-benchmark.c at optlevel 0:

Execution times (seconds)
Client items:
 test_jit                          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 create_code                       :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 compile                           :   0.21 (30%) usr   0.13 (45%) sys   0.25 (25%) wall   14939 kB (74%) ggc
 verify_code                       :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
GCC items:
 phase setup                       :   0.15 (22%) usr   0.02 ( 7%) sys   0.15 (15%) wall   10661 kB (53%) ggc
 phase parsing                     :   0.02 ( 3%) usr   0.00 ( 0%) sys   0.02 ( 2%) wall     653 kB ( 3%) ggc
 callgraph construction            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 2%) wall     242 kB ( 1%) ggc
 callgraph optimization            :   0.01 ( 1%) usr   0.01 ( 3%) sys   0.01 ( 1%) wall     142 kB ( 1%) ggc
 cfg construction                  :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      17 kB ( 0%) ggc
 cfg cleanup                       :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall       0 kB ( 0%) ggc
 df live regs                      :   0.02 ( 3%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall      23 kB ( 0%) ggc
 register information              :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall       0 kB ( 0%) ggc
 parser (global)                   :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall     199 kB ( 1%) ggc
 tree eh                           :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall       0 kB ( 0%) ggc
 tree CFG construction             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall     196 kB ( 1%) ggc
 tree operand scan                 :   0.00 ( 0%) usr   0.01 ( 3%) sys   0.00 ( 0%) wall     100 kB ( 0%) ggc
 out of ssa                        :   0.00 ( 0%) usr   0.02 ( 7%) sys   0.01 ( 1%) wall       0 kB ( 0%) ggc
 expand                            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall     398 kB ( 2%) ggc
 loop init                         :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      67 kB ( 0%) ggc
 integrated RA                     :   0.07 (10%) usr   0.02 ( 7%) sys   0.02 ( 2%) wall    2468 kB (12%) ggc
 LRA virtuals elimination          :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      56 kB ( 0%) ggc
 machine dep reorg                 :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall       0 kB ( 0%) ggc
 shorten branches                  :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.02 ( 2%) wall       0 kB ( 0%) ggc
 final                             :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall     216 kB ( 1%) ggc
 initialize rtl                    :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall      12 kB ( 0%) ggc
 rest of compilation               :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 3%) wall     232 kB ( 1%) ggc
 unaccounted todo                  :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.02 ( 2%) wall       0 kB ( 0%) ggc
 replay of JIT client activity     :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall     309 kB ( 2%) ggc
 driver                            :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 driver: setup                     :   0.04 ( 6%) usr   0.00 ( 0%) sys   0.06 ( 6%) wall       0 kB ( 0%) ggc
 driver: do spec on infiles        :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.02 ( 2%) wall       0 kB ( 0%) ggc
 driver: run linker                :   0.00 ( 0%) usr   0.01 ( 3%) sys   0.02 ( 2%) wall       0 kB ( 0%) ggc
 driver: embedded assembler        :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall       0 kB ( 0%) ggc
 driver: embedded linker           :   0.04 ( 6%) usr   0.02 ( 7%) sys   0.04 ( 4%) wall       0 kB ( 0%) ggc
 load JIT result                   :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
Embedded 'as':
 gas_main                          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 before pass                       :   0.03 ( 4%) usr   0.02 ( 7%) sys   0.13 (13%) wall       0 kB ( 0%) ggc
 perform_an_assembly_pass          :   0.06 ( 9%) usr   0.01 ( 3%) sys   0.06 ( 6%) wall       0 kB ( 0%) ggc
 after pass                        :   0.04 ( 6%) usr   0.00 ( 0%) sys   0.03 ( 3%) wall       0 kB ( 0%) ggc
 cleanup                           :   0.02 ( 3%) usr   0.00 ( 0%) sys   0.03 ( 3%) wall       0 kB ( 0%) ggc
Embedded 'ld':
 ld_internal_main: init            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 ldmain.c: lang_final              :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 ldmain.c: lang_process            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 lang_process: 1st half            :   0.00 ( 0%) usr   0.02 ( 7%) sys   0.02 ( 2%) wall       0 kB ( 0%) ggc
 open_output                       :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall       0 kB ( 0%) ggc
 open_input_bfds                   :   0.01 ( 1%) usr   0.02 ( 7%) sys   0.01 ( 1%) wall       0 kB ( 0%) ggc
 lang_input_statement_enum         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 open_input_bfds:load_symbols      :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 load_symbols: ldfile_open_file    :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 ldlang_add_file                   :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 load_symbols: bfd_link_add_symbols:   0.02 ( 3%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 lang_process: 2nd half            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 4%) wall       0 kB ( 0%) ggc
 ldmain.c: ldwrite                 :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 3%) wall       0 kB ( 0%) ggc
 ld_main cleanup                   :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                             :   0.69             0.29             0.99              20298 kB

Thoughts?

-- 
1.8.5.3