From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id D2E40385E005 for ; Fri, 13 Nov 2020 08:10:56 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org D2E40385E005 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 4237E142F for ; Fri, 13 Nov 2020 00:10:56 -0800 (PST) Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id DDB2A3F718 for ; Fri, 13 Nov 2020 00:10:55 -0800 (PST) From: Richard Sandiford To: gcc-patches@gcc.gnu.org Mail-Followup-To: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com Subject: [00/23] Make fwprop use an on-the-side RTL SSA representation Date: Fri, 13 Nov 2020 08:10:54 +0000 Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-6.6 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 13 Nov 2020 08:10:58 -0000 Just after GCC 10 stage 1 closed (oops), I posted a patch to add a new combine pass. One of its main aims was to allow instructions to move around where necessary in order to make a combination possible. It also tried to parallelise instructions that use the same resource. That pass contained its own code for maintaining limited def-use chains. When I posted the patch, Segher asked why we wanted yet another piece of pass-specific code to do that. Although I had specific reasons (which I explained at the time) I've gradually come round to agreeing that that was a flaw. This series of patches is the result of a Covid-time project to add a more general, pass-agnostic framework. There are two parts: adding the framework itself, and using it to make fwprop.c faster. The framework part ------------------ The framework provides an optional, on-the-side SSA view of existing RTL instructions. Each instruction gets a list of definitions and a list of uses, with each use having a single definition. Phi nodes handle cases in which there are multiple possible definitions of a register on entry to a basic block. There are also routines for updating instructions while keeping the SSA representation intact. The aim is only to provide a different view of existing RTL instructions. Unlike gimple, and unlike (IIRC) the old RTL SSA project from way back, the new framework isn't a =E2=80=9Cnative=E2=80=9D SSA representation. Thi= s means that all inputs to a phi node for a register R are also definitions of register R; no move operation is =E2=80=9Chidden=E2=80=9D in the phi node. Like gimple, the framework treats memory as a single unified resource. A more in-depth summary is contained in the doc patch, but some other random notes: * At the moment, the SSA information is local to one pass, but it might be good to maintain it between passes in future. * The SSA code groups blocks into extended basic blocks, with the EBBs rather than individual blocks having phi nodes.=20=20 * The framework also provides live range information for registers within an extended basic block and allows instructions to move within their EBB. It might be useful to allow further movement in future; I just don't have a use case for it yet. * One advantage of the new infrastructure is that it gives recog_for_combine-like behaviour: if recog wants to add clobbers of things like the flags register, the SSA code will make sure that the flags register is free. * All current queries and updates have amortised sublinear complexity. Some updates are done lazily in order to avoid an upfront linear cost. * I've tried to optimise the code for both memory footprint and compile time. The first part involves quite a bit of overloading of pointers and various other kinds of reuse, so most of the new data structures use private member variables and public accessor functions. I know that style isn't universally popular, but I think it's justified here. Things could easily go wrong if passes tried to operate directly on the underlying data structures. * Debug instructions get SSA information too, on a best-effort basis. Providing complete information would be significantly more expensive. * I wasn't sure for new C++ code whether to stick to the old C /* =E2=80=A6= */ comments, or whether to switch to //. In the end I went for //, on the basis that: - The ranger code already does this. - // is certainly more idiomatic in C++. - // is in the lisp tradition of per-line comments and it matches the ;; used in .md files. I feel sure that GCC would have been written using // from the outset if that had been possible. The patches only do this for new files. The aim is to ensure that each file is at least self-consistent. Using RTL SSA to make fwprop faster ----------------------------------- In order to show the thing in action, I tried to port fwprop.c to use RTL SSA while preserving the pass's current heuristics as much as possible. To get an extreme measurement of speed, I made each fwprop pass run 5000 times, calling: df_finish_pass (false); after each iteration. Usually only the first iteration would actually do any optimisation, the other iterations would simply test the cost of the instruction processing. In the case of the =E2=80=9Cold=E2=80=9D pass,= this included: - df_analyze (including solving the notes and md problems) - the dominator walk to build a list of single definitions In the case of the =E2=80=9Cnew=E2=80=9D pass, this included: - df_analyze (with no additional problems) - building the SSA representation On an --enable-checking=3Drelease compiler, the post-patch version was 23% faster than the pre-patch version when compiling simplify-rtx.ii at -O. When compiling simplify-rtx.ii at -O normally (without the hack above), the compile-time improvement is ~0.5% (which was outside the noise). The assembly output was unchanged. Testing ------- Tested so far on aarch64-linux-gnu, arm-linux-gnueabihf and x86_64-linux-gnu. I'll test on powerpc64le-linux-gnu too. I also tried comparing code with the old and new fwprop.c implementations. When testing the =E2=80=9Cold=E2=80=9D fwprop.c implementation I applied th= e attached patch to avoid a couple of quirks that would otherwise skew the results: (1) The code that handled LO_SUM propagations had: /* OP1 is likely not a legitimate address, otherwise there would have been no LO_SUM. We want it to disappear if it is invalid, return false in that case. */ return memory_address_p (mode, tem); But this early exit occurs before any replacement has been made, so the pass never substituted into LO_SUMs, even on a true return. (2) use_killed_between didn't take advantage of the fact that frame_pointer_rtx and arg_pointer_rtx are function invariants. Sometimes it could prove this indirectly, but not always. With those tweaks, the =E2=80=9Cold=E2=80=9D and =E2=80=9Cnew=E2=80=9D pass= es produced the same assembly output for some spot-checked GCC files, such as simplify-rtx.ii, optabs.ii, etc., compiled with -O2 -g. I also tried building glibc for aarch64-linux-gnu with both versions. There were some cases in which INDEX+BASE addresses were canonicalised as BASE+INDEX, but there were no other differences. I also tried compiling gcc.dg, g++.dg and gcc.c-torture at -O2 -ftree-vectorize on at least one target per CPU directory. The same kind of address canonicalisation differences showed up here too, but for most targets there were only a handful (<=3D10) cases in which the new file had more or fewer lines. Most of the differences were improvements. As a side-effect of the above, I tried building with gcc 10.2.1, gcc 7.4.0 and gcc 5.4.0. I also tried gcc 4.8.5 and clang 10 for compatibility purposes. Thanks, Richard