From: Richard Sandiford
To: Alex Coplan
Cc: gcc-patches@gcc.gnu.org, Kyrylo Tkachov
Subject: Re: [PATCH v3 10/11] aarch64: Add new load/store pair fusion pass
Date: Wed, 13 Dec 2023 23:31:12 +0000
In-Reply-To: (Alex Coplan's message of "Thu, 7 Dec 2023 14:48:48 +0000")

Thanks for the update.  The new comments are really nice, and I think
make the implementation much easier to follow.  I was going to say OK
with the changes below, but there's one question/comment near the end
about the double list walk.

Alex Coplan writes:
> +// Convenience wrapper around strip_offset that can also look through
> +// RTX_AUTOINC addresses. The interface is like strip_offset except we take a
> +// MEM so that we know the mode of the access.
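For illustration (example mine, not taken from the patch), with made-up
register numbers and an 8-byte access:

    ldr x0, [x1], #8      // post-increment:
                          //   returns (reg x1), *offset = 8
    ldr x0, [x1, #16]!    // pre-modify:
                          //   returns (reg x1), *offset = 16

In both cases *offset is the adjustment applied to the base register; for
the post-modify forms the access itself still happens at offset 0, which
track_access accounts for separately.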
> +static rtx ldp_strip_offset (rtx mem, poly_int64 *offset) Formatting nit, should be: static rtx ldp_strip_offset (rtx mem, poly_int64 *offset) > +{ > + rtx addr = XEXP (mem, 0); > + > + switch (GET_CODE (addr)) > + { > + case PRE_MODIFY: > + case POST_MODIFY: > + addr = strip_offset (XEXP (addr, 1), offset); > + gcc_checking_assert (REG_P (addr)); > + gcc_checking_assert (rtx_equal_p (XEXP (XEXP (mem, 0), 0), addr)); > + break; > + case PRE_INC: > + case POST_INC: > + addr = XEXP (addr, 0); > + *offset = GET_MODE_SIZE (GET_MODE (mem)); > + gcc_checking_assert (REG_P (addr)); > + break; > + case PRE_DEC: > + case POST_DEC: > + addr = XEXP (addr, 0); > + *offset = -GET_MODE_SIZE (GET_MODE (mem)); > + gcc_checking_assert (REG_P (addr)); > + break; > + > + default: > + addr = strip_offset (addr, offset); > + } > + > + return addr; > +} > [...] > +// Main function to begin pair discovery. Given a memory access INSN, > +// determine whether it could be a candidate for fusing into an ldp/stp, > +// and if so, track it in the appropriate data structure for this basic > +// block. LOAD_P is true if the access is a load, and MEM is the mem > +// rtx that occurs in INSN. > +void > +ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem) > +{ > + // We can't combine volatile MEMs, so punt on these. > + if (MEM_VOLATILE_P (mem)) > + return; > + > + // Ignore writeback accesses if the param says to do so. > + if (!aarch64_ldp_writeback > + && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC) > + return; > + > + const machine_mode mem_mode = GET_MODE (mem); > + if (!ldp_operand_mode_ok_p (mem_mode)) > + return; > + > + // Note ldp_operand_mode_ok_p already rejected VL modes. > + const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant (); > + > + rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p); > + > + // We want to segregate FP/SIMD accesses from GPR accesses. > + // > + // Before RA, we use the modes, noting that stores of constant zero > + // operands use GPRs (even in non-integer modes). After RA, we use > + // the hard register numbers. > + const bool fpsimd_op_p > + = reload_completed > + ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op))) > + : (GET_MODE_CLASS (mem_mode) != MODE_INT > + && (load_p || !aarch64_const_zero_rtx_p (reg_op))); > + > + const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size }; > + > + if (track_via_mem_expr (insn, mem, lfs)) > + return; > + > + poly_int64 mem_off; > + rtx addr = XEXP (mem, 0); > + const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC; > + rtx base = ldp_strip_offset (mem, &mem_off); > + if (!REG_P (base)) > + return; > + > + // Need to calculate two (possibly different) offsets: > + // - Offset at which the access occurs. > + // - Offset of the new base def. > + poly_int64 access_off; > + if (autoinc_p && any_post_modify_p (addr)) > + access_off = 0; > + else > + access_off = mem_off; > + > + poly_int64 new_def_off = mem_off; > + > + // Punt on accesses relative to the eliminable regs: since we don't > + // know the elimination offset pre-RA, we should postpone forming > + // pairs on such accesses until after RA. > + if (!reload_completed > + && (REGNO (base) == FRAME_POINTER_REGNUM > + || REGNO (base) == ARG_POINTER_REGNUM)) > + return; FTR, this reason still doesn't feel entirely convincing. Although we don't know the offset, we do know that the offset will be consistent for both accesses, and LRA will/should be able to reload the pair address if necessary. 
In fact we could even end up with fewer reloads (given the new single-mem representation of the pair). But perhaps one reason to defer is that elimination reduces the number of fixed base registers in play, and so there should be strictly more fusing opportunities after register allocation. No need to change anything though, just noting it in passing. > + > + // Now need to find def of base register. > + def_info *base_def; > + use_info *base_use = find_access (insn->uses (), REGNO (base)); > + gcc_assert (base_use); > + base_def = base_use->def (); Think this'd be more natural as: use_info *base_use = find_access (insn->uses (), REGNO (base)); gcc_assert (base_use); use_info *base_def = base_use->def (); > + if (!base_def) > + { > + if (dump_file) > + fprintf (dump_file, > + "base register (regno %d) of insn %d is undefined", > + REGNO (base), insn->uid ()); > + return; > + } > + > + alt_base *canon_base = canon_base_map.get (base_def); > + if (canon_base) > + { > + // Express this as the combined offset from the canonical base. > + base_def = canon_base->base; > + new_def_off += canon_base->offset; > + access_off += canon_base->offset; > + } > + > + if (autoinc_p) > + { > + auto def = find_access (insn->defs (), REGNO (base)); > + gcc_assert (def); > + > + // Record that DEF = BASE_DEF + MEM_OFF. > + if (dump_file) > + { > + pretty_printer pp; > + pp_access (&pp, def, 0); > + pp_string (&pp, " = "); > + pp_access (&pp, base_def, 0); > + fprintf (dump_file, "[bb %u] recording %s + ", > + m_bb->index (), pp_formatted_text (&pp)); > + print_dec (new_def_off, dump_file); > + fprintf (dump_file, "\n"); > + } > + > + alt_base base_rec { base_def, new_def_off }; > + if (canon_base_map.put (def, base_rec)) > + gcc_unreachable (); // Base defs should be unique. > + } > + > + // Punt on misaligned offsets. LDP/STP require offsets to be a multiple of > + // the access size. > + if (!multiple_p (mem_off, mem_size)) > + return; > + > + const auto key = std::make_pair (base_def, encode_lfs (lfs)); > + access_group &group = def_map.get_or_insert (key, NULL); > + auto alloc = [&](access_record *access) { return node_alloc (access); }; > + group.track (alloc, access_off, insn); > + > + if (dump_file) > + { > + pretty_printer pp; > + pp_access (&pp, base_def, 0); > + > + fprintf (dump_file, "[bb %u] tracking insn %d via %s", > + m_bb->index (), insn->uid (), pp_formatted_text (&pp)); > + fprintf (dump_file, > + " [L=%d, WB=%d, FP=%d, %smode, off=", > + lfs.load_p, autoinc_p, lfs.fpsimd_p, mode_name[mem_mode]); > + print_dec (access_off, dump_file); > + fprintf (dump_file, "]\n"); > + } > +} > + > +// Dummy predicate that never ignores any insns. > +static bool no_ignore (insn_info *) { return false; } > + > +// Return the latest dataflow hazard before INSN. > +// > +// If IGNORE is non-NULL, this points to a sub-rtx which we should ignore for > +// dataflow purposes. This is needed when considering changing the RTL base of > +// an access discovered through a MEM_EXPR base. > +// > +// If IGNORE_INSN is non-NULL, we should further ignore any hazards arising > +// from that insn. > +// > +// N.B. we ignore any defs/uses of memory here as we deal with that separately, > +// making use of alias disambiguation. > +static insn_info * > +latest_hazard_before (insn_info *insn, rtx *ignore, > + insn_info *ignore_insn = nullptr) > +{ > + insn_info *result = nullptr; > + > + // Return true if we registered the hazard. 
> + auto hazard = [&](insn_info *h) -> bool > + { > + gcc_checking_assert (*h < *insn); > + if (h == ignore_insn) > + return false; > + > + if (!result || *h > *result) > + result = h; > + > + return true; > + }; > + > + rtx pat = PATTERN (insn->rtl ()); > + auto ignore_use = [&](use_info *u) > + { > + if (u->is_mem ()) > + return true; > + > + return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore); > + }; > + > + // Find defs of uses in INSN (RaW). > + for (auto use : insn->uses ()) > + if (!ignore_use (use) && use->def ()) > + hazard (use->def ()->insn ()); > + > + // Find previous defs (WaW) or previous uses (WaR) of defs in INSN. > + for (auto def : insn->defs ()) > + { > + if (def->is_mem ()) > + continue; > + > + if (def->prev_def ()) > + { > + hazard (def->prev_def ()->insn ()); // WaW > + > + auto set = dyn_cast (def->prev_def ()); > + if (set && set->has_nondebug_insn_uses ()) > + for (auto use : set->reverse_nondebug_insn_uses ()) > + if (use->insn () != insn && hazard (use->insn ())) // WaR > + break; > + } > + > + if (!HARD_REGISTER_NUM_P (def->regno ())) > + continue; > + > + // Also need to check backwards for call clobbers (WaW). > + for (auto call_group : def->ebb ()->call_clobbers ()) > + { > + if (!call_group->clobbers (def->resource ())) > + continue; > + > + auto clobber_insn = prev_call_clobbers_ignoring (*call_group, > + def->insn (), > + no_ignore); > + if (clobber_insn) > + hazard (clobber_insn); > + } > + Nit: excess blank line. > + } > + > + return result; > +} > [...] > +// Ensure we have a sensible scheme for combining REG_NOTEs > +// given two candidate insns I1 and I2 where *I1 < *I2. The comment doesn't seem to match the code. The function isn't checking whether the combination is possible, and aborts rather than returning failure. Maybe something along the lines of "Return the notes that should be attached to a combination of I1 and I2, where *I1 < *I2." (just suggested wording, feel free to use something else). > +static rtx > +combine_reg_notes (insn_info *i1, insn_info *i2) > +{ > + bool found_eh_region = false; > + rtx result = NULL_RTX; > + result = filter_notes (REG_NOTES (i2->rtl ()), result, &found_eh_region); > + return filter_notes (REG_NOTES (i1->rtl ()), result, &found_eh_region); > +} > [...] > +// INSNS contains either {nullptr, pair insn} (when promoting an existing > +// non-writeback pair) or contains the candidate insns used to form the pair > +// (when fusing a new pair). > +// > +// PAIR_RANGE specifies where we want to form the final pair. > +// INITIAL_OFFSET gives the current base offset for the pair, > +// INITIAL_WRITEBACK says whether either of the initial accesses had > +// writeback. Maybe: Bit I of INITIAL_WRITEBACK is set if INSNS[I] initially had writeback. > +// ACCESS_SIZE gives the access size for a single arm of the pair. > +// BASE_DEF gives the initial def of the base register consumed by the pair. > +// > +// Given the above, this function looks for a trailing destructive update of the > +// base register. If there is one, we choose the first such update after > +// INSNS[1] that is still in the same BB as our pair. We return the > +// new def in *ADD_DEF and the resulting writeback effect in > +// *WRITEBACK_EFFECT. 
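For reference, the kind of fold this enables (illustration mine, not taken
from the patch).  With the initial pair offset at zero, any add offset can
become a post-index update:

    ldp x0, x1, [x2]
    add x2, x2, #32
    -->
    ldp x0, x1, [x2], #32

and when the add offset matches the initial offset we get the pre-index
form:

    ldp x0, x1, [x2, #16]
    add x2, x2, #16
    -->
    ldp x0, x1, [x2, #16]!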
> +static insn_info * > +find_trailing_add (insn_info *insns[2], > + const insn_range_info &pair_range, > + int initial_writeback, > + rtx *writeback_effect, > + def_info **add_def, > + def_info *base_def, > + poly_int64 initial_offset, > + unsigned access_size) > +{ > + insn_info *pair_dst = pair_range.singleton (); > + gcc_assert (pair_dst); > + > + def_info *def = base_def->next_def (); > + > + // In the case that either of the initial pair insns had writeback, > + // then there will be intervening defs of the base register. > + // Skip over these. > + for (int i = 0; i < 2; i++) > + if (initial_writeback & (1 << i)) > + { > + gcc_assert (def->insn () == insns[i]); > + def = def->next_def (); > + } > + > + if (!def || def->bb () != pair_dst->bb ()) > + return nullptr; > + > + // DEF should now be the first def of the base register after PAIR_DST. > + insn_info *cand = def->insn (); > + gcc_assert (*cand > *pair_dst); > + > + const auto base_regno = base_def->regno (); > + > + // If CAND doesn't also use our base register, > + // it can't destructively update it. > + if (!find_access (cand->uses (), base_regno)) > + return nullptr; > + > + auto rti = cand->rtl (); > + > + if (!INSN_P (rti)) > + return nullptr; > + > + auto pat = PATTERN (rti); > + if (GET_CODE (pat) != SET) > + return nullptr; > + > + auto dest = XEXP (pat, 0); > + if (!REG_P (dest) || REGNO (dest) != base_regno) > + return nullptr; > + > + poly_int64 offset; > + rtx rhs_base = strip_offset (XEXP (pat, 1), &offset); > + if (!REG_P (rhs_base) > + || REGNO (rhs_base) != base_regno > + || !offset.is_constant ()) > + return nullptr; > + > + // If the initial base offset is zero, we can handle any add offset > + // (post-inc). Otherwise, we require the offsets to match (pre-inc). > + if (!known_eq (initial_offset, 0) && !known_eq (offset, initial_offset)) > + return nullptr; > + > + auto off_hwi = offset.to_constant (); > + > + if (off_hwi % access_size != 0) > + return nullptr; > + > + off_hwi /= access_size; > + > + if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM) > + return nullptr; > + > + auto dump_prefix = [&]() > + { > + if (!insns[0]) > + fprintf (dump_file, "existing pair i%d: ", insns[1]->uid ()); > + else > + fprintf (dump_file, " (%d,%d)", > + insns[0]->uid (), insns[1]->uid ()); > + }; > + > + insn_info *hazard = latest_hazard_before (cand, nullptr, insns[1]); > + if (!hazard || *hazard <= *pair_dst) > + { > + if (dump_file) > + { > + dump_prefix (); > + fprintf (dump_file, > + "folding in trailing add (%d) to use writeback form\n", > + cand->uid ()); > + } > + > + *add_def = def; > + *writeback_effect = copy_rtx (pat); > + return cand; > + } > + > + if (dump_file) > + { > + dump_prefix (); > + fprintf (dump_file, > + "can't fold in trailing add (%d), hazard = %d\n", > + cand->uid (), hazard->uid ()); > + } > + > + return nullptr; > +} > [...] > +// Try and actually fuse the pair given by insns I1 and I2. Please document the other parameters too (something like find_trailing_add would be good). 
> +bool > +ldp_bb_info::fuse_pair (bool load_p, > + unsigned access_size, > + int writeback, > + insn_info *i1, insn_info *i2, > + base_cand &base, > + const insn_range_info &move_range) > +{ > + auto attempt = crtl->ssa->new_change_attempt (); > + > + auto make_change = [&attempt](insn_info *insn) > + { > + return crtl->ssa->change_alloc (attempt, insn); > + }; > + auto make_delete = [&attempt](insn_info *insn) > + { > + return crtl->ssa->change_alloc (attempt, > + insn, > + insn_change::DELETE); > + }; > + > + insn_info *first = (*i1 < *i2) ? i1 : i2; > + insn_info *second = (first == i1) ? i2 : i1; > + > + insn_info *insns[2] = { first, second }; > + > + auto_vec changes (4); > + auto_vec tombstone_uids (2); > + > + rtx pats[2] = { > + PATTERN (first->rtl ()), > + PATTERN (second->rtl ()) > + }; > + > + use_array input_uses[2] = { first->uses (), second->uses () }; > + def_array input_defs[2] = { first->defs (), second->defs () }; > + > + int changed_insn = -1; > + if (base.from_insn != -1) > + { > + // If we're not already using a shared base, we need > + // to re-write one of the accesses to use the base from > + // the other insn. > + gcc_checking_assert (base.from_insn == 0 || base.from_insn == 1); > + changed_insn = !base.from_insn; > + > + rtx base_pat = pats[base.from_insn]; > + rtx change_pat = pats[changed_insn]; > + rtx base_mem = XEXP (base_pat, load_p); > + rtx change_mem = XEXP (change_pat, load_p); > + > + const bool lower_base_p = (insns[base.from_insn] == i1); > + HOST_WIDE_INT adjust_amt = access_size; > + if (!lower_base_p) > + adjust_amt *= -1; > + > + rtx change_reg = XEXP (change_pat, !load_p); > + machine_mode mode_for_mem = GET_MODE (change_mem); > + rtx effective_base = drop_writeback (base_mem); > + rtx new_mem = adjust_address_nv (effective_base, > + mode_for_mem, > + adjust_amt); > + rtx new_set = load_p > + ? gen_rtx_SET (change_reg, new_mem) > + : gen_rtx_SET (new_mem, change_reg); > + > + pats[changed_insn] = new_set; > + > + auto keep_use = [&](use_info *u) > + { > + return refers_to_regno_p (u->regno (), u->regno () + 1, > + change_pat, &XEXP (change_pat, load_p)); > + }; > + > + // Drop any uses that only occur in the old address. > + input_uses[changed_insn] = filter_accesses (attempt, > + input_uses[changed_insn], > + keep_use); > + } > + > + rtx writeback_effect = NULL_RTX; > + if (writeback) > + writeback_effect = extract_writebacks (load_p, pats, changed_insn); > + > + const auto base_regno = base.def->regno (); > + > + if (base.from_insn == -1 && (writeback & 1)) > + { > + // If the first of the candidate insns had a writeback form, we'll need to > + // drop the use of the updated base register from the second insn's uses. > + // > + // N.B. we needn't worry about the base register occurring as a store > + // operand, as we checked that there was no non-address true dependence > + // between the insns in try_fuse_pair. > + gcc_checking_assert (find_access (input_uses[1], base_regno)); > + input_uses[1] = check_remove_regno_access (attempt, > + input_uses[1], > + base_regno); > + } > + > + // Go through and drop uses that only occur in register notes, > + // as we won't be preserving those. 
> + for (int i = 0; i < 2; i++) > + { > + auto rti = insns[i]->rtl (); > + if (!REG_NOTES (rti)) > + continue; > + > + input_uses[i] = remove_note_accesses (attempt, input_uses[i]); > + } > + > + // Edge case: if the first insn is a writeback load and the > + // second insn is a non-writeback load which transfers into the base > + // register, then we should drop the writeback altogether as the > + // update of the base register from the second load should prevail. > + // > + // For example: > + // ldr x2, [x1], #8 > + // ldr x1, [x1] > + // --> > + // ldp x2, x1, [x1] > + if (writeback == 1 > + && load_p > + && find_access (input_defs[1], base_regno)) > + { > + if (dump_file) > + fprintf (dump_file, > + " ldp: i%d has wb but subsequent i%d has non-wb " > + "update of base (r%d), dropping wb\n", > + insns[0]->uid (), insns[1]->uid (), base_regno); > + gcc_assert (writeback_effect); > + writeback_effect = NULL_RTX; > + } > + > + // So far the patterns have been in instruction order, > + // now we want them in offset order. > + if (i1 != first) > + std::swap (pats[0], pats[1]); > + > + poly_int64 offsets[2]; > + for (int i = 0; i < 2; i++) > + { > + rtx mem = XEXP (pats[i], load_p); > + gcc_checking_assert (MEM_P (mem)); > + rtx base = strip_offset (XEXP (mem, 0), offsets + i); > + gcc_checking_assert (REG_P (base)); > + gcc_checking_assert (base_regno == REGNO (base)); > + } > + > + // If either of the original insns had writeback, but the resulting pair insn > + // does not (can happen e.g. in the ldp edge case above, or if the writeback > + // effects cancel out), then drop the def(s) of the base register as > + // appropriate. > + // > + // Also drop the first def in the case that both of the original insns had > + // writeback. The second def could well have uses, but the first def should > + // only be used by the second insn (and we dropped that use above). > + for (int i = 0; i < 2; i++) > + if ((!writeback_effect && (writeback & (1 << i))) > + || (i == 0 && writeback == 3)) > + input_defs[i] = check_remove_regno_access (attempt, > + input_defs[i], > + base_regno); > + > + // If we don't currently have a writeback pair, and we don't have > + // a load that clobbers the base register, look for a trailing destructive > + // update of the base register and try and fold it in to make this into a > + // writeback pair. > + insn_info *trailing_add = nullptr; > + if (aarch64_ldp_writeback > 1 > + && !writeback_effect > + && (!load_p || (!refers_to_regno_p (base_regno, base_regno + 1, > + XEXP (pats[0], 0), nullptr) > + && !refers_to_regno_p (base_regno, base_regno + 1, > + XEXP (pats[1], 0), nullptr)))) > + { > + def_info *add_def; > + trailing_add = find_trailing_add (insns, move_range, writeback, > + &writeback_effect, > + &add_def, base.def, offsets[0], > + access_size); > + if (trailing_add) > + { > + // The def of the base register from the trailing add should prevail. > + input_defs[0] = insert_access (attempt, add_def, input_defs[0]); > + gcc_assert (input_defs[0].is_valid ()); > + } > + } > + > + // Now that we know what base mem we're going to use, check if it's OK > + // with the ldp/stp policy. 
> + rtx first_mem = XEXP (pats[0], load_p); > + if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem, > + load_p, > + GET_MODE (first_mem))) > + { > + if (dump_file) > + fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n", > + i1->uid (), i2->uid ()); > + return false; > + } > + > + rtx reg_notes = combine_reg_notes (first, second); > + > + rtx pair_pat; > + if (writeback_effect) > + { > + auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]); > + pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec); > + } > + else if (load_p) > + pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0), > + XEXP (pats[1], 0), > + XEXP (pats[0], 1)); > + else > + pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0), > + XEXP (pats[0], 1), > + XEXP (pats[1], 1)); > + > + insn_change *pair_change = nullptr; > + auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) { > + rtx_insn *rti = change->insn ()->rtl (); > + gcc_assert (validate_unshare_change (rti, &PATTERN (rti), pair_pat, > + true)); > + gcc_assert (validate_change (rti, ®_NOTES (rti), > + reg_notes, true)); gcc_asserts can be compiled out, so usually it's necessary to do: if (!...) gcc_unreachable (); instead. But for these it's probably best just to drop the assertions. The return values aren't meant to be useful for in-group changes, since the real checking happens at the end. More cases of the same thing later. Formatting nit, but: the level of indentation doesn't seem to match the braces. > + }; > + > + if (load_p) > + { > + changes.quick_push (make_delete (first)); > + pair_change = make_change (second); > + changes.quick_push (pair_change); > + > + pair_change->move_range = move_range; > + pair_change->new_defs = merge_access_arrays (attempt, > + input_defs[0], > + input_defs[1]); > + gcc_assert (pair_change->new_defs.is_valid ()); > + > + pair_change->new_uses > + = merge_access_arrays (attempt, > + drop_memory_access (input_uses[0]), > + drop_memory_access (input_uses[1])); > + gcc_assert (pair_change->new_uses.is_valid ()); > + set_pair_pat (pair_change); > + } > + else > + { > + insn_info *store_to_change = decide_stp_strategy (first, second, > + move_range); > + > + if (store_to_change && dump_file) > + fprintf (dump_file, " stp: re-purposing store %d\n", > + store_to_change->uid ()); > + > + insn_change *change; > + for (int i = 0; i < 2; i++) > + { > + change = make_change (insns[i]); > + if (insns[i] == store_to_change) > + { > + set_pair_pat (change); > + change->new_uses = merge_access_arrays (attempt, > + input_uses[0], > + input_uses[1]); > + auto d1 = drop_memory_access (input_defs[0]); > + auto d2 = drop_memory_access (input_defs[1]); > + change->new_defs = merge_access_arrays (attempt, d1, d2); > + gcc_assert (change->new_defs.is_valid ()); > + def_info *stp_def = memory_access (store_to_change->defs ()); > + change->new_defs = insert_access (attempt, > + stp_def, > + change->new_defs); > + gcc_assert (change->new_defs.is_valid ()); > + change->move_range = move_range; > + pair_change = change; > + } > + else > + { > + // Note that we are turning this insn into a tombstone, > + // we need to keep track of these if we go ahead with the > + // change. 
> + tombstone_uids.quick_push (insns[i]->uid ()); > + rtx_insn *rti = insns[i]->rtl (); > + gcc_assert (validate_change (rti, &PATTERN (rti), > + gen_tombstone (), true)); > + gcc_assert (validate_change (rti, ®_NOTES (rti), > + NULL_RTX, true)); > + change->new_uses = use_array (nullptr, 0); > + } > + gcc_assert (change->new_uses.is_valid ()); > + changes.quick_push (change); > + } > + > + if (!store_to_change) > + { > + // Tricky case. Cannot re-purpose existing insns for stp. > + // Need to insert new insn. > + if (dump_file) > + fprintf (dump_file, > + " stp fusion: cannot re-purpose candidate stores\n"); > + > + auto new_insn = crtl->ssa->create_insn (attempt, INSN, pair_pat); > + change = make_change (new_insn); > + change->move_range = move_range; > + change->new_uses = merge_access_arrays (attempt, > + input_uses[0], > + input_uses[1]); > + gcc_assert (change->new_uses.is_valid ()); > + > + auto d1 = drop_memory_access (input_defs[0]); > + auto d2 = drop_memory_access (input_defs[1]); > + change->new_defs = merge_access_arrays (attempt, d1, d2); > + gcc_assert (change->new_defs.is_valid ()); > + > + auto new_set = crtl->ssa->create_set (attempt, new_insn, memory); > + change->new_defs = insert_access (attempt, new_set, > + change->new_defs); > + gcc_assert (change->new_defs.is_valid ()); > + changes.safe_insert (1, change); > + pair_change = change; > + } > + } > + > + if (trailing_add) > + changes.quick_push (make_delete (trailing_add)); > + > + auto n_changes = changes.length (); > + gcc_checking_assert (n_changes >= 2 && n_changes <= 4); > + > + Formatting nit: excess blank line. > + auto is_changing = insn_is_changing (changes); > + for (unsigned i = 0; i < n_changes; i++) > + gcc_assert (rtl_ssa::restrict_movement_ignoring (*changes[i], is_changing)); > + > + // Check the pair pattern is recog'd. > + if (!rtl_ssa::recog_ignoring (attempt, *pair_change, is_changing)) > + { > + if (dump_file) > + fprintf (dump_file, " failed to form pair, recog failed\n"); > + > + // Free any reg notes we allocated. > + while (reg_notes) > + { > + rtx next = XEXP (reg_notes, 1); > + free_EXPR_LIST_node (reg_notes); > + reg_notes = next; > + } > + cancel_changes (0); > + return false; > + } > + > + gcc_assert (crtl->ssa->verify_insn_changes (changes)); > + > + confirm_change_group (); > + crtl->ssa->change_insns (changes); > + > + gcc_checking_assert (tombstone_uids.length () <= 2); > + for (auto uid : tombstone_uids) > + track_tombstone (uid); > + > + return true; > +} > [...] > +// Given INSNS (in program order) which are known to be adjacent, look > +// to see if either insn has a suitable RTL (register) base that we can > +// use to form a pair. Push these to BASE_CANDS if we find any. CAND_MEMs > +// gives the relevant mems from the candidate insns, ACCESS_SIZE gives the > +// size of a single candidate access, and REVERSED says whether the accesses > +// are inverted in offset order. > +// > +// Returns an integer where bit (1 << i) is set if INSNS[i] uses writeback > +// addressing. > +static int > +get_viable_bases (insn_info *insns[2], > + vec &base_cands, > + rtx cand_mems[2], > + unsigned access_size, > + bool reversed) > +{ > + // We discovered this pair through a common base. Need to ensure that > + // we have a common base register that is live at both locations. 
> + def_info *base_defs[2] = {}; > + int writeback = 0; > + for (int i = 0; i < 2; i++) > + { > + const bool is_lower = (i == reversed); > + poly_int64 poly_off; > + rtx base = ldp_strip_offset (cand_mems[i], &poly_off); > + if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC) > + writeback |= (1 << i); > + > + if (!REG_P (base) || !poly_off.is_constant ()) > + continue; > + > + // Punt on accesses relative to eliminable regs. Since we don't know the > + // elimination offset pre-RA, we should postpone forming pairs on such > + // accesses until after RA. > + if (!reload_completed > + && (REGNO (base) == FRAME_POINTER_REGNUM > + || REGNO (base) == ARG_POINTER_REGNUM)) > + continue; > + > + HOST_WIDE_INT base_off = poly_off.to_constant (); > + > + // It should be unlikely that we ever punt here, since MEM_EXPR offset > + // alignment should be a good proxy for register offset alignment. > + if (base_off % access_size != 0) > + { > + if (dump_file) > + fprintf (dump_file, > + "base not viable, offset misaligned (insn %d)\n", > + insns[i]->uid ()); > + continue; > + } > + > + base_off /= access_size; > + > + if (!is_lower) > + base_off--; > + > + if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM) > + continue; > + > + for (auto use : insns[i]->uses ()) > + if (use->is_reg () && use->regno () == REGNO (base)) > + { > + base_defs[i] = use->def (); > + break; > + } Looks like this could use your new find_access helper. > + } > + > + if (!base_defs[0] && !base_defs[1]) > + { > + if (dump_file) > + fprintf (dump_file, "no viable base register for pair (%d,%d)\n", > + insns[0]->uid (), insns[1]->uid ()); > + return writeback; > + } > + > + for (int i = 0; i < 2; i++) > + if ((writeback & (1 << i)) && !base_defs[i]) > + { > + if (dump_file) > + fprintf (dump_file, "insn %d has writeback but base isn't viable\n", > + insns[i]->uid ()); > + return writeback; > + } > + > + if (writeback == 3 > + && base_defs[0]->regno () != base_defs[1]->regno ()) > + { > + if (dump_file) > + fprintf (dump_file, > + "pair (%d,%d): double writeback with distinct regs (%d,%d): " > + "punting\n", > + insns[0]->uid (), insns[1]->uid (), > + base_defs[0]->regno (), base_defs[1]->regno ()); > + return writeback; > + } > + > + if (base_defs[0] && base_defs[1] > + && base_defs[0]->regno () == base_defs[1]->regno ()) > + { > + // Easy case: insns already share the same base reg. > + base_cands.quick_push (base_defs[0]); > + return writeback; > + } > + > + // Otherwise, we know that one of the bases must change. > + // > + // Note that if there is writeback we must use the writeback base > + // (we know now there is exactly one). > + for (int i = 0; i < 2; i++) > + if (base_defs[i] && (!writeback || (writeback & (1 << i)))) > + base_cands.quick_push (base_cand { base_defs[i], i }); > + > + return writeback; > +} > + > +// Given two adjacent memory accesses of the same size, I1 and I2, try > +// and see if we can merge them into a ldp or stp. > +// > +// ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true > +// if the accesses are both loads, otherwise they are both stores. 
> +bool > +ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size, > + insn_info *i1, insn_info *i2) > +{ > + if (dump_file) > + fprintf (dump_file, "analyzing pair (load=%d): (%d,%d)\n", > + load_p, i1->uid (), i2->uid ()); > + > + insn_info *insns[2]; > + bool reversed = false; > + if (*i1 < *i2) > + { > + insns[0] = i1; > + insns[1] = i2; > + } > + else > + { > + insns[0] = i2; > + insns[1] = i1; > + reversed = true; > + } > + > + rtx cand_mems[2]; > + rtx reg_ops[2]; > + rtx pats[2]; > + for (int i = 0; i < 2; i++) > + { > + pats[i] = PATTERN (insns[i]->rtl ()); > + cand_mems[i] = XEXP (pats[i], load_p); > + reg_ops[i] = XEXP (pats[i], !load_p); > + } > + > + if (load_p && reg_overlap_mentioned_p (reg_ops[0], reg_ops[1])) > + { > + if (dump_file) > + fprintf (dump_file, > + "punting on ldp due to reg conflcits (%d,%d)\n", > + insns[0]->uid (), insns[1]->uid ()); > + return false; > + } > + > + if (cfun->can_throw_non_call_exceptions > + && (find_reg_note (insns[0]->rtl (), REG_EH_REGION, NULL_RTX) > + || find_reg_note (insns[1]->rtl (), REG_EH_REGION, NULL_RTX)) > + && insn_could_throw_p (insns[0]->rtl ()) > + && insn_could_throw_p (insns[1]->rtl ())) My comment in the previous review still stands here: I don't see why the insn_could_throw_p tests are needed. In particular, if somehow one of the insns throws internally and the other doesn't, the one that throws internally should be enough to prevent the pair (AIUI). I think we should just delete the last two lines and trust the notes. > + { > + if (dump_file) > + fprintf (dump_file, > + "can't combine insns with EH side effects (%d,%d)\n", > + insns[0]->uid (), insns[1]->uid ()); > + return false; > + } > + > + auto_vec base_cands (2); > + > + int writeback = get_viable_bases (insns, base_cands, cand_mems, > + access_size, reversed); > + if (base_cands.is_empty ()) > + { > + if (dump_file) > + fprintf (dump_file, "no viable base for pair (%d,%d)\n", > + insns[0]->uid (), insns[1]->uid ()); > + return false; > + } > + > + rtx *ignore = &XEXP (pats[1], load_p); > + for (auto use : insns[1]->uses ()) > + if (!use->is_mem () > + && refers_to_regno_p (use->regno (), use->regno () + 1, pats[1], ignore) > + && use->def () && use->def ()->insn () == insns[0]) > + { > + // N.B. we allow a true dependence on the base address, as this > + // happens in the case of auto-inc accesses. Consider a post-increment > + // load followed by a regular indexed load, for example. 
> + if (dump_file) > + fprintf (dump_file, > + "%d has non-address true dependence on %d, rejecting pair\n", > + insns[1]->uid (), insns[0]->uid ()); > + return false; > + } > + > + unsigned i = 0; > + while (i < base_cands.length ()) > + { > + base_cand &cand = base_cands[i]; > + > + rtx *ignore[2] = {}; > + for (int j = 0; j < 2; j++) > + if (cand.from_insn == !j) > + ignore[j] = &XEXP (cand_mems[j], 0); > + > + insn_info *h = first_hazard_after (insns[0], ignore[0]); > + if (h && *h <= *insns[1]) > + cand.hazards[0] = h; > + > + h = latest_hazard_before (insns[1], ignore[1]); > + if (h && *h >= *insns[0]) > + cand.hazards[1] = h; > + > + if (!cand.viable ()) > + { > + if (dump_file) > + fprintf (dump_file, > + "pair (%d,%d): rejecting base %d due to dataflow " > + "hazards (%d,%d)\n", > + insns[0]->uid (), > + insns[1]->uid (), > + cand.def->regno (), > + cand.hazards[0]->uid (), > + cand.hazards[1]->uid ()); > + > + base_cands.ordered_remove (i); > + } > + else > + i++; > + } > + > + if (base_cands.is_empty ()) > + { > + if (dump_file) > + fprintf (dump_file, > + "can't form pair (%d,%d) due to dataflow hazards\n", > + insns[0]->uid (), insns[1]->uid ()); > + return false; > + } > + > + insn_info *alias_hazards[4] = {}; > + > + // First def of memory after the first insn, and last def of memory > + // before the second insn, respectively. > + def_info *mem_defs[2] = {}; > + if (load_p) > + { > + if (!MEM_READONLY_P (cand_mems[0])) > + { > + mem_defs[0] = memory_access (insns[0]->uses ())->def (); > + gcc_checking_assert (mem_defs[0]); > + mem_defs[0] = mem_defs[0]->next_def (); > + } > + if (!MEM_READONLY_P (cand_mems[1])) > + { > + mem_defs[1] = memory_access (insns[1]->uses ())->def (); > + gcc_checking_assert (mem_defs[1]); > + } > + } > + else > + { > + mem_defs[0] = memory_access (insns[0]->defs ())->next_def (); > + mem_defs[1] = memory_access (insns[1]->defs ())->prev_def (); > + gcc_checking_assert (mem_defs[0]); > + gcc_checking_assert (mem_defs[1]); > + } > + > + auto tombstone_p = [&](insn_info *insn) -> bool { > + return m_emitted_tombstone > + && bitmap_bit_p (&m_tombstone_bitmap, insn->uid ()); > + }; > + > + store_walker > + forward_store_walker (mem_defs[0], cand_mems[0], insns[1], tombstone_p); > + > + store_walker > + backward_store_walker (mem_defs[1], cand_mems[1], insns[0], tombstone_p); > + > + alias_walker *walkers[4] = {}; > + if (mem_defs[0]) > + walkers[0] = &forward_store_walker; > + if (mem_defs[1]) > + walkers[1] = &backward_store_walker; > + > + if (load_p && (mem_defs[0] || mem_defs[1])) > + do_alias_analysis (alias_hazards, walkers, load_p); > + else > + { > + // We want to find any loads hanging off the first store. > + mem_defs[0] = memory_access (insns[0]->defs ()); > + load_walker forward_load_walker (mem_defs[0], insns[0], insns[1]); > + load_walker backward_load_walker (mem_defs[1], insns[1], insns[0]); > + walkers[2] = &forward_load_walker; > + walkers[3] = &backward_load_walker; > + do_alias_analysis (alias_hazards, walkers, load_p); > + // Now consolidate hazards back down. 
> + if (alias_hazards[2] > + && (!alias_hazards[0] || (*alias_hazards[2] < *alias_hazards[0]))) > + alias_hazards[0] = alias_hazards[2]; > + > + if (alias_hazards[3] > + && (!alias_hazards[1] || (*alias_hazards[3] > *alias_hazards[1]))) > + alias_hazards[1] = alias_hazards[3]; > + } > + > + if (alias_hazards[0] && alias_hazards[1] > + && *alias_hazards[0] <= *alias_hazards[1]) > + { > + if (dump_file) > + fprintf (dump_file, > + "cannot form pair (%d,%d) due to alias conflicts (%d,%d)\n", > + i1->uid (), i2->uid (), > + alias_hazards[0]->uid (), alias_hazards[1]->uid ()); > + return false; > + } > + > + // Now narrow the hazards on each base candidate using > + // the alias hazards. > + i = 0; > + while (i < base_cands.length ()) > + { > + base_cand &cand = base_cands[i]; > + if (alias_hazards[0] && (!cand.hazards[0] > + || *alias_hazards[0] < *cand.hazards[0])) > + cand.hazards[0] = alias_hazards[0]; > + if (alias_hazards[1] && (!cand.hazards[1] > + || *alias_hazards[1] > *cand.hazards[1])) > + cand.hazards[1] = alias_hazards[1]; > + > + if (cand.viable ()) > + i++; > + else > + { > + if (dump_file) > + fprintf (dump_file, "pair (%d,%d): rejecting base %d due to " > + "alias/dataflow hazards (%d,%d)", > + insns[0]->uid (), insns[1]->uid (), > + cand.def->regno (), > + cand.hazards[0]->uid (), > + cand.hazards[1]->uid ()); > + > + base_cands.ordered_remove (i); > + } > + } > + > + if (base_cands.is_empty ()) > + { > + if (dump_file) > + fprintf (dump_file, > + "cannot form pair (%d,%d) due to alias/dataflow hazards", > + insns[0]->uid (), insns[1]->uid ()); > + > + return false; > + } > + > + base_cand *base = &base_cands[0]; > + if (base_cands.length () > 1) > + { > + // If there are still multiple viable bases, it makes sense > + // to choose one that allows us to reduce register pressure, > + // for loads this means moving further down, for stores this > + // means moving further up. > + gcc_checking_assert (base_cands.length () == 2); > + const int hazard_i = !load_p; > + if (base->hazards[hazard_i]) > + { > + if (!base_cands[1].hazards[hazard_i]) > + base = &base_cands[1]; > + else if (load_p > + && *base_cands[1].hazards[hazard_i] > + > *(base->hazards[hazard_i])) > + base = &base_cands[1]; > + else if (!load_p > + && *base_cands[1].hazards[hazard_i] > + < *(base->hazards[hazard_i])) > + base = &base_cands[1]; > + } > + } > + > + // Otherwise, hazards[0] > hazards[1]. > + // Pair can be formed anywhere in (hazards[1], hazards[0]). > + insn_range_info range (insns[0], insns[1]); > + if (base->hazards[1]) > + range.first = base->hazards[1]; > + if (base->hazards[0]) > + range.last = base->hazards[0]->prev_nondebug_insn (); > + > + // Placement strategy: push loads down and pull stores up, this should > + // help register pressure by reducing live ranges. 
> + if (load_p) > + range.first = range.last; > + else > + range.last = range.first; > + > + if (dump_file) > + { > + auto print_hazard = [](insn_info *i) > + { > + if (i) > + fprintf (dump_file, "%d", i->uid ()); > + else > + fprintf (dump_file, "-"); > + }; > + auto print_pair = [print_hazard](insn_info **i) > + { > + print_hazard (i[0]); > + fprintf (dump_file, ","); > + print_hazard (i[1]); > + }; > + > + fprintf (dump_file, "fusing pair [L=%d] (%d,%d), base=%d, hazards: (", > + load_p, insns[0]->uid (), insns[1]->uid (), > + base->def->regno ()); > + print_pair (base->hazards); > + fprintf (dump_file, "), move_range: (%d,%d)\n", > + range.first->uid (), range.last->uid ()); > + } > + > + return fuse_pair (load_p, access_size, writeback, > + i1, i2, *base, range); > +} > [...] > +// The algorithm is straightforward: if we consider a combined list > +// of candidates X obtained by merging LEFT_LIST and RIGHT_LIST in > +// program order, then we advance through X until we > +// reach a crossing point (where X[i] and X[i+1] come from different > +// source lists). > +// > +// At this point we know X[i] and X[i+1] are adjacent accesses, and > +// we try to fuse them into a pair. If this succeeds, we remove X[i] > +// and X[i+1] from their original lists and continue as above. We > +// queue the access that came from RIGHT_LIST for deletion by adding > +// it to TO_DELETE, so that we don't try and merge it in subsequent > +// iterations of transform_for_base. See below for a description of the > +// handling in the failure case. > +void > +ldp_bb_info::merge_pairs (insn_list_t &left_list, > + insn_list_t &right_list, > + hash_set &to_delete, > + bool load_p, > + unsigned access_size) > +{ > + auto iter_l = left_list.begin (); > + auto iter_r = right_list.begin (); > + > + while (!left_list.empty () && !right_list.empty ()) > + { > + auto next_l = std::next (iter_l); > + auto next_r = std::next (iter_r); > + if (**iter_l < **iter_r > + && next_l != left_list.end () > + && **next_l < **iter_r) > + { > + iter_l = next_l; > + continue; > + } > + else if (**iter_r < **iter_l > + && next_r != right_list.end () > + && **next_r < **iter_l) > + { > + iter_r = next_r; > + continue; > + } > + > + if (try_fuse_pair (load_p, access_size, *iter_l, *iter_r)) > + { > + if (to_delete.add (*iter_r)) > + gcc_unreachable (); // Shouldn't get added twice. > + > + iter_l = erase_one (left_list, iter_l); > + iter_r = erase_one (right_list, iter_r); > + } > + else > + { > + // If we failed to merge the pair, then we delete the entire > + // prefix of insns that originated from the same source list. > + // The rationale for this is as follows. > + // > + // In the store case, the insns in the prefix can't be > + // re-ordered over each other as they are guaranteed to store > + // to the same location, so we're guaranteed not to lose > + // opportunities by doing this. > + // > + // In the load case, subsequent loads from the same location > + // are either redundant (in which case they should have been > + // cleaned up by an earlier optimization pass) or there is an > + // intervening aliasing hazard, in which case we can't > + // re-order them anyway, so provided earlier passes have > + // cleaned up redundant loads, we shouldn't miss opportunities > + // by doing this. > + if (**iter_l < **iter_r) > + // Delete everything from l_begin to iter_l, inclusive. > + iter_l = erase_prefix (left_list, iter_l); > + else > + // Delete everything from r_begin to iter_r, inclusive. 
> + iter_r = erase_prefix (right_list, iter_r); > + } > + } > +} It seems like this would be simpler without the erase_prefixes. The function never moves backwards through the list (AIUI) and so it shouldn't matter whether the elements before iter_l or iter_r are erased or not. And AIUI left_list is never used again after this function returns. Perhaps then it wouldn't be necessary to copy right_list before calling this function, since the erase_ones would erase only those accesses that can't be considered candidates for the next offset pairing. That should then simplify... > + > +// Given a list of insns LEFT_ORIG with all accesses adjacent to > +// those in RIGHT_ORIG, try and form them into pairs. > +// > +// Return true iff we formed all the RIGHT_ORIG candidates into > +// pairs. > +bool > +ldp_bb_info::try_form_pairs (insn_list_t *left_orig, > + insn_list_t *right_orig, > + bool load_p, unsigned access_size) > +{ > + // Make a copy of the right list which we can modify to > + // exclude candidates locally for this invocation. > + insn_list_t right_copy (*right_orig); > + > + if (dump_file) > + { > + fprintf (dump_file, "try_form_pairs [L=%d], cand vecs ", load_p); > + dump_insn_list (dump_file, *left_orig); > + fprintf (dump_file, " x "); > + dump_insn_list (dump_file, right_copy); > + fprintf (dump_file, "\n"); > + } > + > + // List of candidate insns to delete from the original right_list > + // (because they were formed into a pair). > + hash_set to_delete; > + > + // Now we have a 2D matrix of candidates, traverse it to try and > + // find a pair of insns that are already adjacent (within the > + // merged list of accesses). > + merge_pairs (*left_orig, right_copy, to_delete, load_p, access_size); > + > + // If we formed all right candidates into pairs, > + // then we can skip the next iteration. > + if (to_delete.elements () == right_orig->size ()) > + return true; > + > + // Delete items from to_delete. > + auto right_iter = right_orig->begin (); > + auto right_end = right_orig->end (); > + while (right_iter != right_end) > + { > + auto right_next = std::next (right_iter); > + > + if (to_delete.contains (*right_iter)) > + { > + right_orig->erase (right_iter); > + right_end = right_orig->end (); > + } > + > + right_iter = right_next; > + } > + > + return false; ...this function. Perhaps I've misunderstood the logic though. > +} > [...] > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt > index f5a518202a1..116ec1892dc 100644 > --- a/gcc/config/aarch64/aarch64.opt > +++ b/gcc/config/aarch64/aarch64.opt > @@ -271,6 +271,16 @@ mtrack-speculation > Target Var(aarch64_track_speculation) > Generate code to track when the CPU might be speculating incorrectly. > > +mearly-ldp-fusion > +Target Var(flag_aarch64_early_ldp_fusion) Optimization Init(1) > +Enable the pre-RA AArch64-specific pass to fuse loads and stores into > +ldp and stp instructions. Probably best to expand "RA". Maybe: Try to form LDP and STP instructions before register allocation. (based on fschedule-insns) or: Enable a pass that tries to form LDP and STP instructions before register allocation. Or simply use the first sentence from the .texi documentation (which LGTM). Similarly for mlate-ldp-fusion. Thanks, Richard > + > +mlate-ldp-fusion > +Target Var(flag_aarch64_late_ldp_fusion) Optimization Init(1) > +Enable the post-RA AArch64-specific pass to fuse loads and stores into > +ldp and stp instructions. 
> + > mstack-protector-guard= > Target RejectNegative Joined Enum(stack_protector_guard) Var(aarch64_stack_protector_guard) Init(SSP_GLOBAL) > Use given stack-protector guard. > @@ -360,3 +370,16 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER) > > EnumValue > Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED) > + > +-param=aarch64-ldp-alias-check-limit= > +Target Joined UInteger Var(aarch64_ldp_alias_check_limit) Init(8) IntegerRange(0, 65536) Param > +Limit on number of alias checks performed when attempting to form an ldp/stp. > + > +-param=aarch64-ldp-writeback= > +Target Joined UInteger Var(aarch64_ldp_writeback) Init(2) IntegerRange(0,2) Param > +Param to control which writeback opportunities we try to handle in the > +load/store pair fusion pass. A value of zero disables writeback > +handling. One means we try to form pairs involving one or more existing > +individual writeback accesses where possible. A value of two means we > +also try to opportunistically form writeback opportunities by folding in > +trailing destructive updates of the base register used by a pair. > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64 > index 0d96ae3d0b2..f7b24256b4d 100644 > --- a/gcc/config/aarch64/t-aarch64 > +++ b/gcc/config/aarch64/t-aarch64 > @@ -194,6 +194,13 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \ > $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \ > $(srcdir)/config/aarch64/aarch64-cc-fusion.cc > > +aarch64-ldp-fusion.o: $(srcdir)/config/aarch64/aarch64-ldp-fusion.cc \ > + $(CONFIG_H) $(SYSTEM_H) $(CORETYPES_H) $(BACKEND_H) $(RTL_H) $(DF_H) \ > + $(RTL_SSA_H) cfgcleanup.h tree-pass.h ordered-hash-map.h tree-dfa.h \ > + fold-const.h tree-hash-traits.h print-tree.h > + $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \ > + $(srcdir)/config/aarch64/aarch64-ldp-fusion.cc > + > comma=, > MULTILIB_OPTIONS = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG)))) > MULTILIB_DIRNAMES = $(subst $(comma), ,$(TM_MULTILIB_CONFIG)) > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > index 32f535e1ed4..29b4a337549 100644 > --- a/gcc/doc/invoke.texi > +++ b/gcc/doc/invoke.texi > @@ -801,7 +801,7 @@ Objective-C and Objective-C++ Dialects}. > -moverride=@var{string} -mverbose-cost-dump > -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg} > -mstack-protector-guard-offset=@var{offset} -mtrack-speculation > --moutline-atomics } > +-moutline-atomics -mearly-ldp-fusion -mlate-ldp-fusion} > > @emph{Adapteva Epiphany Options} > @gccoptlist{-mhalf-reg-file -mprefer-short-insn-regs > @@ -16774,6 +16774,20 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp. > With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the > source pointer is aligned to at least double the alignment of the type. > > +@item aarch64-ldp-alias-check-limit > +Limit on the number of alias checks performed by the AArch64 load/store pair > +fusion pass when attempting to form an ldp/stp. Higher values make the pass > +more aggressive at re-ordering loads over stores, at the expense of increased > +compile time. > + > +@item aarch64-ldp-writeback > +Param to control which writeback opportunities we try to handle in the AArch64 > +load/store pair fusion pass. A value of zero disables writeback handling. 
One > +means we try to form pairs involving one or more existing individual writeback > +accesses where possible. A value of two means we also try to opportunistically > +form writeback opportunities by folding in trailing destructive updates of the > +base register used by a pair. > + > @item aarch64-loop-vect-issue-rate-niters > The tuning for some AArch64 CPUs tries to take both latencies and issue > rates into account when deciding whether a loop should be vectorized > @@ -21190,6 +21204,16 @@ Enable compiler hardening against straight line speculation (SLS). > In addition, @samp{-mharden-sls=all} enables all SLS hardening while > @samp{-mharden-sls=none} disables all SLS hardening. > > +@opindex mearly-ldp-fusion > +@item -mearly-ldp-fusion > +Enable the copy of the AArch64 load/store pair fusion pass that runs before > +register allocation. Enabled by default at @samp{-O} and above. > + > +@opindex mlate-ldp-fusion > +@item -mlate-ldp-fusion > +Enable the copy of the AArch64 load/store pair fusion pass that runs after > +register allocation. Enabled by default at @samp{-O} and above. > + > @opindex msve-vector-bits > @item -msve-vector-bits=@var{bits} > Specify the number of bits in an SVE vector register. This option only has