From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <richard.sandiford@arm.com>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
 by sourceware.org (Postfix) with ESMTP id D2E40385E005
 for <gcc-patches@gcc.gnu.org>; Fri, 13 Nov 2020 08:10:56 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org D2E40385E005
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
 by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 4237E142F
 for <gcc-patches@gcc.gnu.org>; Fri, 13 Nov 2020 00:10:56 -0800 (PST)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.126])
 by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id DDB2A3F718
 for <gcc-patches@gcc.gnu.org>; Fri, 13 Nov 2020 00:10:55 -0800 (PST)
From: Richard Sandiford <richard.sandiford@arm.com>
To: gcc-patches@gcc.gnu.org
Mail-Followup-To: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com
Subject: [00/23] Make fwprop use an on-the-side RTL SSA representation
Date: Fri, 13 Nov 2020 08:10:54 +0000
Message-ID: <mpth7ptad81.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-6.6 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Fri, 13 Nov 2020 08:10:58 -0000

Just after GCC 10 stage 1 closed (oops), I posted a patch to add a new
combine pass.  One of its main aims was to allow instructions to move
around where necessary in order to make a combination possible.
It also tried to parallelise instructions that use the same resource.

That pass contained its own code for maintaining limited def-use chains.
When I posted the patch, Segher asked why we wanted yet another piece
of pass-specific code to do that.  Although I had specific reasons
(which I explained at the time) I've gradually come round to agreeing
that that was a flaw.

This series of patches is the result of a Covid-time project to add
a more general, pass-agnostic framework.  There are two parts:
adding the framework itself, and using it to make fwprop.c faster.

The framework part
------------------

The framework provides an optional, on-the-side SSA view of existing
RTL instructions.  Each instruction gets a list of definitions and a
list of uses, with each use having a single definition.  Phi nodes
handle cases in which there are multiple possible definitions of a
register on entry to a basic block.  There are also routines for
updating instructions while keeping the SSA representation intact.

The aim is only to provide a different view of existing RTL instructions.
Unlike gimple, and unlike (IIRC) the old RTL SSA project from way back,
the new framework isn't a =E2=80=9Cnative=E2=80=9D SSA representation.  Thi=
s means that
all inputs to a phi node for a register R are also definitions of
register R; no move operation is =E2=80=9Chidden=E2=80=9D in the phi node.

Like gimple, the framework treats memory as a single unified resource.

A more in-depth summary is contained in the doc patch, but some
other random notes:

* At the moment, the SSA information is local to one pass, but it might
  be good to maintain it between passes in future.

* The SSA code groups blocks into extended basic blocks, with the
  EBBs rather than individual blocks having phi nodes.=20=20

* The framework also provides live range information for registers
  within an extended basic block and allows instructions to move within
  their EBB.  It might be useful to allow further movement in future;
  I just don't have a use case for it yet.

* One advantage of the new infrastructure is that it gives
  recog_for_combine-like behaviour: if recog wants to add clobbers
  of things like the flags register, the SSA code will make sure
  that the flags register is free.

* All current queries and updates have amortised sublinear complexity.
  Some updates are done lazily in order to avoid an upfront linear cost.

* I've tried to optimise the code for both memory footprint and
  compile time.  The first part involves quite a bit of overloading
  of pointers and various other kinds of reuse, so most of the new data
  structures use private member variables and public accessor functions.
  I know that style isn't universally popular, but I think it's
  justified here.  Things could easily go wrong if passes tried
  to operate directly on the underlying data structures.

* Debug instructions get SSA information too, on a best-effort basis.
  Providing complete information would be significantly more expensive.

* I wasn't sure for new C++ code whether to stick to the old C /* =E2=80=A6=
 */
  comments, or whether to switch to //.  In the end I went for //,
  on the basis that:

  - The ranger code already does this.

  - // is certainly more idiomatic in C++.

  - // is in the lisp tradition of per-line comments and it matches the
    ;; used in .md files.  I feel sure that GCC would have been written
    using // from the outset if that had been possible.

  The patches only do this for new files.  The aim is to ensure that
  each file is at least self-consistent.

Using RTL SSA to make fwprop faster
-----------------------------------

In order to show the thing in action, I tried to port fwprop.c
to use RTL SSA while preserving the pass's current heuristics as
much as possible.

To get an extreme measurement of speed, I made each fwprop pass
run 5000 times, calling:

  df_finish_pass (false);

after each iteration.  Usually only the first iteration would actually
do any optimisation, the other iterations would simply test the cost of
the instruction processing.  In the case of the =E2=80=9Cold=E2=80=9D pass,=
 this included:

  - df_analyze (including solving the notes and md problems)
  - the dominator walk to build a list of single definitions

In the case of the =E2=80=9Cnew=E2=80=9D pass, this included:

  - df_analyze (with no additional problems)
  - building the SSA representation

On an --enable-checking=3Drelease compiler, the post-patch version was 23%
faster than the pre-patch version when compiling simplify-rtx.ii at -O.

When compiling simplify-rtx.ii at -O normally (without the hack above),
the compile-time improvement is ~0.5% (which was outside the noise).
The assembly output was unchanged.

Testing
-------

Tested so far on aarch64-linux-gnu, arm-linux-gnueabihf and
x86_64-linux-gnu.  I'll test on powerpc64le-linux-gnu too.

I also tried comparing code with the old and new fwprop.c implementations.
When testing the =E2=80=9Cold=E2=80=9D fwprop.c implementation I applied th=
e attached
patch to avoid a couple of quirks that would otherwise skew the results:

(1) The code that handled LO_SUM propagations had:

	  /* OP1 is likely not a legitimate address, otherwise there would have
	     been no LO_SUM.  We want it to disappear if it is invalid, return
	     false in that case.  */
	  return memory_address_p (mode, tem);

    But this early exit occurs before any replacement has been made,
    so the pass never substituted into LO_SUMs, even on a true return.

(2) use_killed_between didn't take advantage of the fact that
    frame_pointer_rtx and arg_pointer_rtx are function invariants.
    Sometimes it could prove this indirectly, but not always.

With those tweaks, the =E2=80=9Cold=E2=80=9D and =E2=80=9Cnew=E2=80=9D pass=
es produced the same assembly
output for some spot-checked GCC files, such as simplify-rtx.ii, optabs.ii,
etc., compiled with -O2 -g.

I also tried building glibc for aarch64-linux-gnu with both versions.
There were some cases in which INDEX+BASE addresses were canonicalised
as BASE+INDEX, but there were no other differences.

I also tried compiling gcc.dg, g++.dg and gcc.c-torture at -O2
-ftree-vectorize on at least one target per CPU directory.  The same
kind of address canonicalisation differences showed up here too,
but for most targets there were only a handful (<=3D10) cases in
which the new file had more or fewer lines.  Most of the differences
were improvements.

As a side-effect of the above, I tried building with gcc 10.2.1, gcc 7.4.0
and gcc 5.4.0.  I also tried gcc 4.8.5 and clang 10 for compatibility
purposes.

Thanks,
Richard