From: Giuliano Belinassi
Date: Mon, 23 Mar 2020 21:37:12 -0300
To: Richard Biener
Cc: gcc@gcc.gnu.org, mjambor@suse.cz, hubicka@ucw.cz
Subject: Re: [GSoC 2020] Automatic Detection of Parallel Compilation Viability
Message-ID: <20200324003712.6mskkje7dw32isz2@smtp.gmail.com>
References: <20200313201535.4hguknuxjejpt6f3@smtp.gmail.com> <20200317202427.2cho2t3qjatw6suv@smtp.gmail.com>

Hi, Richi

On 03/18, Richard Biener wrote:
> On Tue, 17 Mar 2020, Giuliano Belinassi wrote:
> 
> > Hi, all
> > 
> > I have applied some reviews to the project. Please see the new proposal
> > here:
> 
> Looks good, some editorial changes below
> 
> > https://www.ime.usp.br/~belinass/Automatic_Detection_of_Parallel_Compilation_Viability.pdf
> > 
> > **Automatic Detection of Parallel Compilation Viability**
> > 
> > [Giuliano Belinassi]{style="color: darkgreen"}\
> > Timezone: GMT$-$3:00\
> > University of São Paulo -- Brazil\
> > IRC: giulianob in \#gcc\
> > Email: [`giuliano.belinassi@usp.br`](mailto:giuliano.belinassi@usp.br)\
> > Github: \
> > Date:
> > 
> > About Me
> > 
> > Computer Science Bachelor (University of São Paulo), currently pursuing
> > a Master's Degree in Computer Science at the same institution.
> > I've always been fascinated by topics such as High-Performance Computing
> > and Code Optimization, having worked on a parallel GPU implementation of
> > a Boundary Elements Method software. I am currently conducting research
> > on compiler parallelization and developing the
> > [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc) project, having
> > already presented it at [GNU Cauldron
> > 2019](https://www.youtube.com/watch?v=jd6R3IK__1Q).
> > 
> > **Skills**: Strong knowledge of C, Concurrency, Shared Memory
> > Parallelism, Multithreaded Debugging and other typical programming
> > tools.
> > 
> > Brief Introduction
> > 
> > In [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc), we showed that
> > parallelizing the Intra Procedural optimizations speeds up the
> > compilation of huge files by a factor of 1.8x on a 4-core machine, and
> > also showed that these optimizations account for 75% of compilation
> > time.
> > 
> > In this project we plan to use the LTO infrastructure to improve
> > compilation performance in the non-LTO case, with the tradeoff that the
> > generated binary is only as good as if LTO were disabled. Here, we will
> > automatically detect when a single file will benefit from parallelism,
> > and proceed with the compilation in parallel if so.
> > 
> > Use of LTO
> > 
> > Link Time Optimization (LTO) is a compilation technique that allows the
> > compiler to analyse the program as a whole, instead of analysing and
> > compiling one file at a time. Therefore, LTO is able to collect more
> > information about the program and generate a better optimization plan.
> > LTO is divided into three parts:
> > 
> > - *LGEN (Local Generation)*: Each file is translated to GIMPLE. This
> >   stage runs sequentially within each file and, therefore, in parallel
> >   across the files of a project.
> > 
> > - *WPA (Whole Program Analysis)*: Run the Inter Procedural Analysis
> >   (IPA) on the entire program. This stage runs serially over the
> >   project.
> > 
> > - *LTRANS (Local Transformation)*: Execute all Intra Procedural
> >   Optimizations in each partition. This stage runs in parallel.
> > 
> > Since WPA runs serially over the entire project and can therefore
> > bottleneck the compilation, LTO was designed to produce faster binaries,
> > not to produce binaries fast.
> > 
> > Here, the proposed use of LTO to address this problem is to run the IPA
> > for each Translation Unit (TU), instead in the Whole Program, and
> 
> This proposal is to use LTO to produce binaries fast by running
> the IPA phase separately for each Translation Unit (TU), instead of on the
> Whole Program and ...
> 
> > automatically detect when to partition the TU into multiple LTRANS to
> > improve compilation performance. The advantage of this approach is:
> > 
> > - It can generate binaries as good as when LTO is disabled.
> > 
> > - It is faster, as we can partition big files into multiple partitions
> >   and compile these partitions in parallel.
> > 
> > - It can interact with GNU Make Jobserver, improving CPU utilization.
> 
> This reads a bit odd, regular compilation already interacts with the
> GNU Make Jobserver. I'd reorder and reword it w/o dashes like
> 
> We can partition big files into multiple partitions and compile these
> partitions in parallel which should improve CPU utilization by exposing
> smaller chunks to the GNU Make Jobserver. Code generation quality
> should be unaffected by this.
How about:

```
The advantage of this approach is: by partitioning big files into multiple
partitions, we can improve the compilation performance by exposing these
partitions to the Jobserver. Therefore, it can improve CPU utilization on
manycore machines. Generated code quality should be unaffected by this
procedure, which means that the generated code should run as fast as when
LTO is disabled.
```

?

> 
> Thanks,
> Richard.
> 
> > Planned Tasks
> > 
> > I plan to use the GSoC time to develop the following topics:
> > 
> > - Week \[1, 3\] -- April 27 to May 15:\
> >   Update `cc1`, `cc1plus`, `f771`, ..., to partition the Compilation
> >   Unit (CU) after IPA analysis directly into multiple LTRANS
> >   partitions, instead of generating a temporary GIMPLE file, and to
> >   accept an additional parameter `-fsplit-outputs=<filename>`, to which
> >   the generated ASM filenames will be written.
> > 
> >   There are two possible approaches I could work on:
> > 
> >   1. *Fork*: After the CU is partitioned into multiple LTRANS, `cc1`
> >      will fork and compile these partitions, each of them generating an
> >      ASM file, and write the generated asm names into `<filename>`.
> >      Note that if the number of partitions is one, then this part is
> >      not necessary.
> > 
> >   2. *Stream LTRANS IR*: After the CU is partitioned into multiple
> >      LTRANS, `cc1` will write these partitions to disk so that LTO can
> >      read these files and proceed as a standard LTO operation in order
> >      to generate a partially linked object file.
> > 
> >   Approach 1 has the advantage of lower overhead than approach 2, as it
> >   requires fewer IO operations; however, it may be harder to implement,
> >   since the assembler file may already be open before forking, so
> >   caution is necessary to ensure a one-to-one relationship between the
> >   assembler file and the compilation process. Approach 2, on the other
> >   hand, can easily interact with the GNU Jobserver.
> > 
> > - Week \[4, 7\] -- May 18 to June 12:\
> >   Update the `gcc` driver to take each file in `<filename>`, then
> >   assemble and partially link them together. Here, an important
> >   optimization is to use a named pipe as `<filename>` to avoid having
> >   to wait for every partition to finish compiling before assembling
> >   the files.
> > 
> > - Week 8 -- June 15 to 19: **First Evaluation**\
> >   Deliver a non-optimized version of the project. Some programs ought
> >   to compile correctly, but there will probably be a large overhead
> >   because, at this point, there is no way of interacting with the GNU
> >   Jobserver.
> > 
> > - Week \[9, 11\] -- June 22 to July 10:\
> >   Work on GNU Make Jobserver integration. One way of doing this is to
> >   adapt the way the LTO WPA -> LTRANS stage interacts with the
> >   Jobserver. Another way is to make each forked `cc1` consume a
> >   Jobserver token until its compilation finishes, then return the token
> >   when done.
> > 
> > - Week 12 -- July 13 to 17: **Second Evaluation**\
> >   Deliver a more optimized version of the project. Here we should
> >   separate files that compile fast from files that require
> >   partitioning, and interact with the GNU Jobserver. Therefore we
> >   should see some speedup.
> > 
> > - Week \[13, 15\] -- July 20 to August 10:\
> >   Develop adequate test coverage and address unexpected issues so that
> >   this feature can be merged into trunk for the next GCC release.
> > 
> > - Week 16: **Final evaluation**\
> >   Deliver the final product as a series of patches for trunk.
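To make the *Fork* option from week \[1, 3\] above a bit more concrete, here
is a minimal, illustrative-only sketch. The helpers `compile_partition` and
`partition_asm_name` are hypothetical placeholders, not existing GCC
interfaces, and the "assembler file already open before fork" issue mentioned
above is ignored:

```
/* Illustrative sketch of the "Fork" approach: after IPA splits the CU
   into N LTRANS partitions, fork one child per partition, let each
   child emit its own assembler file, and record the file names for the
   driver.  compile_partition() and partition_asm_name() are
   hypothetical placeholders, not real GCC interfaces.  */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern int compile_partition (int i);           /* run LTRANS passes, emit asm */
extern const char *partition_asm_name (int i);  /* asm file produced for partition i */

/* Compile N partitions in parallel; write the resulting asm file names,
   one per line, into ASM_LIST (the -fsplit-outputs file).  */
static int
compile_partitions_in_parallel (int n, FILE *asm_list)
{
  pid_t *pids = calloc (n, sizeof (pid_t));
  int failed = 0;

  if (!pids)
    return -1;

  for (int i = 0; i < n; i++)
    {
      pids[i] = fork ();
      if (pids[i] == 0)
        _exit (compile_partition (i));  /* child: compile exactly one partition */
      if (pids[i] < 0)
        failed = 1;                     /* fork failed; nothing compiled for i */
    }

  for (int i = 0; i < n; i++)
    {
      int status;
      if (pids[i] <= 0)
        continue;
      waitpid (pids[i], &status, 0);
      if (WIFEXITED (status) && WEXITSTATUS (status) == 0)
        fprintf (asm_list, "%s\n", partition_asm_name (i));
      else
        failed = 1;
    }

  free (pids);
  return failed ? -1 : 0;
}
```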
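Similarly, a rough sketch of the named-pipe optimization from week \[4, 7\],
seen from the driver side: create a FIFO, hand its name to `cc1` as the
`-fsplit-outputs=` file, and assemble each partition as soon as its name
arrives instead of waiting for the whole translation unit. `assemble_file` is
a hypothetical stand-in for spawning the assembler, not a real driver
function:

```
/* Hypothetical driver-side sketch: read asm file names from a FIFO and
   assemble them as they become available.  Assumes cc1 has been (or is
   about to be) spawned with -fsplit-outputs=<fifo_path>.  */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

extern int assemble_file (const char *asm_name);  /* spawn the assembler (illustrative) */

static int
assemble_from_fifo (const char *fifo_path)
{
  char line[4096];
  FILE *f;

  if (mkfifo (fifo_path, 0600) != 0)
    return -1;

  /* Opening the FIFO for reading blocks until cc1 opens it for writing,
     so the driver naturally waits for the first partition.  */
  f = fopen (fifo_path, "r");
  if (!f)
    return -1;

  /* Each line names one finished partition; assemble it immediately.  */
  while (fgets (line, sizeof line, f))
    {
      line[strcspn (line, "\n")] = '\0';
      if (*line && assemble_file (line) != 0)
        {
          fclose (f);
          return -1;
        }
    }

  fclose (f);
  unlink (fifo_path);
  return 0;
}
```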
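And for the Jobserver integration in week \[9, 11\], a sketch of the client
side of the GNU Make jobserver protocol, assuming the classic pipe-based form
advertised in `MAKEFLAGS` (`--jobserver-auth=R,W`, or `--jobserver-fds=R,W` in
older makes): reading one byte from the read end acquires a job token, and
writing the byte back releases it. Each forked `cc1` would hold one token for
the duration of its partition's compilation; error handling and the newer
FIFO-based variant are omitted:

```
/* Minimal GNU Make jobserver client sketch (pipe-based protocol only).
   A token is one byte: read() acquires it, write() of the same byte
   releases it.  Real code also needs to cope with EINTR and with
   descriptors that were not actually passed down by make.  */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int jobserver_rfd = -1, jobserver_wfd = -1;

/* Locate the jobserver file descriptors in $MAKEFLAGS, if any.  */
static int
jobserver_init (void)
{
  const char *flags = getenv ("MAKEFLAGS");
  const char *p;

  if (!flags)
    return -1;
  p = strstr (flags, "--jobserver-auth=");
  if (!p)
    p = strstr (flags, "--jobserver-fds=");
  if (!p)
    return -1;
  if (sscanf (strchr (p, '=') + 1, "%d,%d", &jobserver_rfd, &jobserver_wfd) != 2
      || jobserver_rfd < 0 || jobserver_wfd < 0)
    return -1;
  return 0;
}

/* Block until a job token is available; store it in *TOKEN.  */
static int
jobserver_acquire (char *token)
{
  return read (jobserver_rfd, token, 1) == 1 ? 0 : -1;
}

/* Return a previously acquired token so other jobs can run.  */
static int
jobserver_release (char token)
{
  return write (jobserver_wfd, &token, 1) == 1 ? 0 : -1;
}
```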
> > 
> > On 03/13, Giuliano Belinassi wrote:
> > > Hi, all
> > > 
> > > I want to propose and apply for the following GSoC project: Automatic
> > > Detection of Parallel Compilation Viability.
> > > 
> > > https://www.ime.usp.br/~belinass/Automatic_Detection_of_Parallel_Compilation_Viability.pdf
> > > 
> > > Feedback is welcome :)
> > > 
> > > Here is a markdown version of it:
> > > 
> > > **Automatic Detection of Parallel Compilation Viability**
> > > 
> > > [Giuliano Belinassi]{style="color: darkgreen"}\
> > > Timezone: GMT$-$3:00\
> > > University of São Paulo -- Brazil\
> > > IRC: giulianob in \#gcc\
> > > Email: [`giuliano.belinassi@usp.br`](mailto:giuliano.belinassi@usp.br)\
> > > Github: \
> > > Date:
> > > 
> > > About Me
> > > 
> > > Computer Science Bachelor (University of São Paulo), currently pursuing
> > > a Master's Degree in Computer Science at the same institution. I've
> > > always been fascinated by topics such as High-Performance Computing and
> > > Code Optimization, having worked on a parallel GPU implementation of a
> > > Boundary Elements Method software. I am currently conducting research
> > > on compiler parallelization and developing the
> > > [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc) project, having
> > > already presented it at [GNU Cauldron
> > > 2019](https://www.youtube.com/watch?v=jd6R3IK__1Q).
> > > 
> > > **Skills**: Strong knowledge of C, Concurrency, Shared Memory
> > > Parallelism, Multithreaded Debugging and other typical programming
> > > tools.
> > > 
> > > Brief Introduction
> > > 
> > > In [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc), we showed that
> > > parallelizing the Intra Procedural optimizations speeds up the
> > > compilation of huge files by a factor of 1.8x on a 4-core machine, and
> > > also showed that these optimizations account for 75% of compilation
> > > time.
> > > 
> > > In this project we plan to use the LTO infrastructure to improve
> > > compilation performance in the non-LTO case, with the tradeoff that the
> > > generated binary is only as good as if LTO were disabled. Here, we will
> > > automatically detect when a single file will benefit from parallelism,
> > > and proceed with the compilation in parallel if so.
> > > 
> > > Use of LTO
> > > 
> > > Link Time Optimization (LTO) is a compilation technique that allows the
> > > compiler to analyse the program as a whole, instead of analysing and
> > > compiling one file at a time. Therefore, LTO is able to collect more
> > > information about the program and generate a better optimization plan.
> > > LTO is divided into three parts:
> > > 
> > > - *LGEN (Local Generation)*: Each file is translated to GIMPLE. This
> > >   stage runs sequentially within each file and, therefore, in parallel
> > >   across the files of a project.
> > > 
> > > - *WPA (Whole Program Analysis)*: Run the Inter Procedural Analysis
> > >   (IPA) on the entire program. This stage runs serially over the
> > >   project.
> > > 
> > > - *LTRANS (Local Transformation)*: Execute all Intra Procedural
> > >   Optimizations in each partition. This stage runs in parallel.
> > > 
> > > Since WPA runs serially over the entire project and can therefore
> > > bottleneck the compilation, LTO was designed to produce faster
> > > binaries, not to produce binaries fast.
> > > 
> > > Here, the proposed use of LTO to address this problem is to run the IPA
> > > for each Translation Unit (TU), instead of on the Whole Program, and
> > > automatically detect when to partition the TU into multiple LTRANS to
> > > improve performance.
> > > The advantage of this approach is:
> > > 
> > > - It can generate binaries as good as when LTO is disabled.
> > > 
> > > - It is faster, as we can partition big files into multiple partitions
> > >   and compile these partitions in parallel.
> > > 
> > > - It can interact with GNU Make Jobserver, improving CPU utilization.
> > > 
> > > Planned Tasks
> > > 
> > > I plan to use the GSoC time to develop the following topics:
> > > 
> > > - Week \[1, 3\] -- April 27 to May 15:\
> > >   Update `cc1`, `cc1plus`, `f771`, ..., to partition the data after
> > >   IPA analysis directly into multiple LTRANS partitions, instead of
> > >   generating a temporary GIMPLE file.
> > > 
> > > - Week \[4, 7\] -- May 18 to June 12:\
> > >   Update the `gcc` driver to take these multiple LTRANS partitions,
> > >   then call the compiler and assembler for each of them, and merge the
> > >   results into one object file. Here I will use the LTO LTRANS object
> > >   streaming, therefore it should interact with the GNU Make Jobserver.
> > > 
> > > - Week 8 -- June 15 to 19: **First Evaluation**\
> > >   Deliver a non-optimized version of the project. Some programs ought
> > >   to compile correctly, but there will probably be a huge overhead
> > >   because, so far, there will not be any criteria for when to
> > >   partition. Some tests are also planned for this evaluation.
> > > 
> > > - Week \[9, 11\] -- June 22 to July 10:\
> > >   Implement criteria for when to partition, and iteratively improve
> > >   them based on data.
> > > 
> > > - Week 12 -- July 13 to 17: **Second Evaluation**\
> > >   Deliver a more optimized version of the project. Here we should
> > >   separate files that compile fast from files that require
> > >   partitioning, and therefore we should see some speedup.
> > > 
> > > - Week \[13, 15\] -- July 20 to August 10:\
> > >   Develop adequate test coverage and address unexpected issues so that
> > >   this feature can be merged into trunk for the next GCC release.
> > > 
> > > - Week 16: **Final evaluation**\
> > >   Deliver the final product as a series of patches for trunk.
> > > 
> > 
> 
> -- 
> Richard Biener
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

Thank you,
Giuliano