This is a v6 of the aarch64 load/store pair fusion pass, which
addresses the feedback from Richard's last review here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640539.html

In particular, this version implements the suggested changes which
greatly simplify the double list walk.

Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

-- >8 --

This adds a new aarch64-specific RTL-SSA pass dedicated to forming load
and store pairs (LDPs and STPs).

As a motivating example for the kind of thing this improves, take the
following testcase:

extern double c[20];

double f(double x)
{
  double y = x*x;
  y += c[16];
  y += c[17];
  y += c[18];
  y += c[19];
  return y;
}

for which we currently generate (at -O2):

f:
        adrp    x0, c
        add     x0, x0, :lo12:c
        ldp     d31, d29, [x0, 128]
        ldr     d30, [x0, 144]
        fmadd   d0, d0, d0, d31
        ldr     d31, [x0, 152]
        fadd    d0, d0, d29
        fadd    d0, d0, d30
        fadd    d0, d0, d31
        ret

but with the pass, we generate:

f:
.LFB0:
        adrp    x0, c
        add     x0, x0, :lo12:c
        ldp     d31, d29, [x0, 128]
        fmadd   d0, d0, d0, d31
        ldp     d30, d31, [x0, 144]
        fadd    d0, d0, d29
        fadd    d0, d0, d30
        fadd    d0, d0, d31
        ret

The pass is local (it only considers a BB at a time).  In theory, it
should be possible to extend it to run over EBBs, at least in the case
of pure (MEM_READONLY_P) loads, but this is left for future work.

The pass works by identifying two kinds of bases: tree decls obtained
via MEM_EXPR, and RTL register bases in the form of RTL-SSA def_infos.
If a candidate memory access has a MEM_EXPR base, then we track it via
this base, and otherwise, if it is of a simple reg + offset form, we
track it via the RTL-SSA def_info for the base register.

For each BB, for a given kind of base, we build up a hash table mapping
the base to an access_group.  The access_group data structure holds a
list of accesses at each offset relative to the same base.  It uses a
splay tree to support efficient insertion (while walking the BB), and
the nodes are chained using a linked list to support efficient
iteration (while doing the transformation).

For each base, we then iterate over the access_group to identify
adjacent accesses, and try to form load/store pairs for those insns
that access adjacent memory.  (A simplified sketch of this grouping
and pairing walk appears at the end of this message.)

The pass is currently run twice, both before and after register
allocation.  The first copy of the pass runs late in the pre-RA RTL
pipeline, immediately after sched1, since it was found that sched1
increased register pressure when the pass was run before it.  The
second copy of the pass runs immediately before peephole2, so as to
catch any opportunities that the existing ldp/stp peepholes can handle.

There are some cases that we punt on before RA, e.g. accesses relative
to eliminable regs (such as the soft frame pointer).  We do this since
we can't know the elimination offset before RA, and we want to avoid
the RA reloading the offset (due to it being out of ldp/stp immediate
range), as this can generate worse code.

The post-RA copy of the pass is there to pick up the crumbs that were
left behind / things we punted on in the pre-RA pass.  Among other
things, it's needed to handle accesses relative to the stack pointer.
It can also handle code that didn't exist at the time the pre-RA pass
was run (spill code, prologue/epilogue code).

This is an initial implementation, and there are (among other possible
improvements) the following notable caveats / missing features that
are left for future work, but could give further improvements:

 - Moving accesses between BBs within an EBB, see above.
 - Out-of-range opportunities: currently the pass refuses to form
   pairs if there isn't a suitable base register with an immediate in
   range for ldp/stp, but it can be profitable to emit anchor
   addresses in the case that there are four or more out-of-range
   nearby accesses that can be formed into pairs.  This is handled by
   the current ldp/stp peepholes, so it would be good to support this
   in the future.

 - Discovery: currently we prioritize MEM_EXPR bases over RTL bases,
   which can lead to us missing opportunities in the case that two
   accesses have distinct MEM_EXPR bases (i.e. different DECLs) but
   are still adjacent in memory (e.g. adjacent variables on the
   stack).  I hope to address this for GCC 15, ideally getting to the
   point where we can remove the ldp/stp peepholes and scheduling
   hooks.  Furthermore, it would be nice to make the pass aware of
   section anchors (adding these as a third kind of base), allowing us
   to merge accesses to adjacent variables within the same section.

gcc/ChangeLog:

	* config.gcc: Add aarch64-ldp-fusion.o to extra_objs for
	aarch64.
	* config/aarch64/aarch64-passes.def: Add copies of
	pass_ldp_fusion before and after RA.
	* config/aarch64/aarch64-protos.h (make_pass_ldp_fusion):
	Declare.
	* config/aarch64/aarch64.opt (-mearly-ldp-fusion): New.
	(-mlate-ldp-fusion): New.
	(--param=aarch64-ldp-alias-check-limit): New.
	(--param=aarch64-ldp-writeback): New.
	* config/aarch64/t-aarch64: Add rule for aarch64-ldp-fusion.o.
	* config/aarch64/aarch64-ldp-fusion.cc: New file.
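
As an aside, to make the grouping and pairing steps above concrete,
here is a minimal, self-contained C++ sketch of the idea.  It is
purely illustrative and is not the code in aarch64-ldp-fusion.cc: the
pass proper works on RTL-SSA insn_info/def_info objects and uses a
splay tree chained with a linked list, whereas here an ordered
std::map stands in for that sorted structure, and the access,
access_group, and find_pair_candidates names are all invented for the
sketch.

/* Illustrative sketch only: all types and names are stand-ins, not
   the interfaces used by the pass.  */
#include <cstdint>
#include <cstdio>
#include <iterator>
#include <map>
#include <vector>

/* One candidate mem access within a BB (a stand-in for an RTL-SSA
   insn_info plus its decomposed mem address).  */
struct access
{
  int uid;         /* insn identifier, in program order.  */
  int64_t offset;  /* byte offset from the common base.  */
  int64_t size;    /* access size in bytes.  */
};

/* Accesses sharing a base, keyed by offset.  An ordered map gives the
   sorted insertion and iteration that the splay tree + chained list
   provide in the pass.  */
using access_group = std::map<int64_t, std::vector<access>>;

/* Walk the group in offset order and greedily report adjacent
   same-size accesses as ldp/stp candidates, consuming both members of
   each pair found.  */
static void
find_pair_candidates (const access_group &group)
{
  auto it = group.begin ();
  while (it != group.end ())
    {
      auto next = std::next (it);
      if (next == group.end ())
        break;
      const access &lo = it->second.front ();
      const access &hi = next->second.front ();
      if (lo.size == hi.size && lo.offset + lo.size == hi.offset)
        {
          printf ("pair candidate: insns %d,%d at offsets %lld,%lld\n",
                  lo.uid, hi.uid, (long long) lo.offset,
                  (long long) hi.offset);
          it = std::next (next);  /* Skip past both accesses.  */
        }
      else
        it = next;
    }
}

int
main ()
{
  /* Mirrors the motivating example: four 8-byte loads of c[16..19],
     i.e. offsets 128/136/144/152 from the anchor address in x0.  */
  access_group group;
  group[128].push_back ({1, 128, 8});
  group[136].push_back ({2, 136, 8});
  group[144].push_back ({3, 144, 8});
  group[152].push_back ({4, 152, 8});
  find_pair_candidates (group);  /* Reports pairs (1,2) and (3,4).  */
  return 0;
}

Of course, the real pass has to do much more before fusing a pair:
among other things, it checks that the combined access is
representable as an ldp/stp (immediate range, register constraints)
and performs alias checking when moving one access to meet the other
(cf. the --param=aarch64-ldp-alias-check-limit knob above).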