public inbox for gcc-patches@gcc.gnu.org
From: Alex Coplan <alex.coplan@arm.com>
To: Ajit Agarwal <aagarwa1@linux.ibm.com>
Cc: Richard Sandiford <richard.sandiford@arm.com>,
	"Kewen.Lin" <linkw@linux.ibm.com>,
	Segher Boessenkool <segher@kernel.crashing.org>,
	Michael Meissner <meissner@linux.ibm.com>,
	David Edelsohn <dje.gcc@gmail.com>,
	Peter Bergner <bergner@linux.ibm.com>,
	gcc-patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file
Date: Wed, 10 Apr 2024 09:12:14 +0100	[thread overview]
Message-ID: <ZhZJ3pG5e8ocskYU@arm.com> (raw)
In-Reply-To: <54ab422a-4ffa-4ca9-8029-eb23c42ed653@linux.ibm.com>

Hi Ajit,

On 09/04/2024 20:59, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 09/04/24 8:39 pm, Alex Coplan wrote:
> > On 09/04/2024 20:01, Ajit Agarwal wrote:
> >> Hello Alex:
> >>
> >> On 09/04/24 7:29 pm, Alex Coplan wrote:
> >>> On 09/04/2024 17:30, Ajit Agarwal wrote:
> >>>>
> >>>>
> >>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
> >>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
> >>>>>> Hello Alex/Richard:
> >>>>>>
> >>>>>> All review comments are incorporated.
> >>>>>
> >>>>> Thanks, I was kind-of expecting you to also send the renaming patch as a
> >>>>> preparatory patch as we discussed.
> >>>>>
> >>>>> Sorry for another meta comment, but: I think the reason that the Linaro
> >>>>> CI isn't running tests on your patches is actually because you're
> >>>>> sending 1/3 of a series but not sending the rest of the series.
> >>>>>
> >>>>> So please can you either send this as an individual preparatory patch
> >>>>> (not marked as a series) or if you're going to send a series (e.g. with
> >>>>> a preparatory rename patch as 1/2 and this as 2/2) then send the entire
> >>>>> series when you make updates.  That way the CI should test your patches,
> >>>>> which would be helpful.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>  
> >>>>>>
> >>>>>> Common infrastructure of load store pair fusion is divided into target
> >>>>>> independent and target dependent changed code.
> >>>>>>
> >>>>>> Target independent code is the generic code with pure virtual functions
> >>>>>> to interface between target independent and dependent code.
> >>>>>>
> >>>>>> Target dependent code is the implementation of pure virtual function for
> >>>>>> aarch64 target and the call to target independent code.
> >>>>>>
> >>>>>> Thanks & Regards
> >>>>>> Ajit
> >>>>>>
> >>>>>>
> >>>>>> aarch64: Place target independent and dependent changed code in one file
> >>>>>>
> >>>>>> Common infrastructure of load store pair fusion is divided into target
> >>>>>> independent and target dependent changed code.
> >>>>>>
> >>>>>> Target independent code is the generic code with pure virtual functions
> >>>>>> to interface between target independent and dependent code.
> >>>>>>
> >>>>>> Target dependent code is the implementation of pure virtual function for
> >>>>>> aarch64 target and the call to target independent code.
> >>>>>>
> >>>>>> 2024-04-06  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
> >>>>>>
> >>>>>> gcc/ChangeLog:
> >>>>>>
> >>>>>> 	* config/aarch64/aarch64-ldp-fusion.cc: Place target
> >>>>>> 	independent and dependent changed code.
> >>>>>
> >>>>> You're going to need a proper ChangeLog eventually, but I guess there's
> >>>>> no need for that right now.
> >>>>>
> >>>>>> ---
> >>>>>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++++++++++++++--------
> >>>>>>  1 file changed, 249 insertions(+), 122 deletions(-)
> >>>>>>
> >>>>>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>>>> index 22ed95eb743..cb21b514ef7 100644
> >>>>>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>>>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>>>> @@ -138,8 +138,122 @@ struct alt_base
> >>>>>>    poly_int64 offset;
> >>>>>>  };
> >>>>>>  
> >>>>>> +// Virtual base class for load/store walkers used in alias analysis.
> >>>>>> +struct alias_walker
> >>>>>> +{
> >>>>>> +  virtual bool conflict_p (int &budget) const = 0;
> >>>>>> +  virtual insn_info *insn () const = 0;
> >>>>>> +  virtual bool valid () const  = 0;
> >>>>>
> >>>>> Heh, looking at this made me realise there is a whitespace bug here in
> >>>>> the existing code (double space after const).  Sorry about that!  I'll
> >>>>> push an obvious fix for that.
> >>>>>
> >>>>>> +  virtual void advance () = 0;
> >>>>>> +};
> >>>>>> +
> >>>>>> +struct pair_fusion {
> >>>>>> +
> >>>>>> +  pair_fusion () {};
> >>>>>
> >>>>> This ctor looks pointless at the moment.  Perhaps instead we could put
> >>>>> the contents of ldp_fusion_init in here and then delete that function?
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>
> >>>>>> +  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
> >>>>>> +			       bool load_p) = 0;
> >>>>>
> >>>>> Please can we have comments above each of these virtual functions
> >>>>> describing any parameters, what the purpose of the hook is, and the
> >>>>> interpretation of the return value?  This will serve as the
> >>>>> documentation for other targets that want to make use of the pass.
> >>>>>
> >>>>> It might make sense to have a default-false implementation for
> >>>>> fpsimd_op_p, especially if you don't want to make use of this bit for
> >>>>> rs6000.
> >>>>>
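> >>>>> I.e. a sketch, keeping the existing parameters:
> >>>>>
> >>>>>   virtual bool fpsimd_op_p (rtx, machine_mode, bool)
> >>>>>   {
> >>>>>     return false;
> >>>>>   }
> >>>>>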
> >>>>
> >>>> Addressed.
> >>>>  
> >>>>>> +
> >>>>>> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
> >>>>>> +  virtual bool pair_trailing_writeback_p () = 0;
> >>>>>
> >>>>> Sorry for the run-around, but: I think this and
> >>>>> handle_writeback_opportunities () should be the same function, either
> >>>>> returning an enum or taking an enum and returning a boolean.
> >>>>>
> >>>>> At a minimum they should have more similar sounding names.
> >>>>>
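> >>>>> E.g. a rough sketch of the enum-returning variant (names here are just
> >>>>> placeholders):
> >>>>>
> >>>>>   enum class writeback { NEVER, EXISTING, ALL };
> >>>>>
> >>>>>   // NEVER: ignore writeback opportunities entirely.
> >>>>>   // EXISTING: handle existing writeback accesses only.
> >>>>>   // ALL: additionally try to fold in trailing base-register updates.
> >>>>>   virtual writeback writeback_policy () = 0;
> >>>>>
> >>>>> where the aarch64 override would map aarch64_ldp_writeback onto the
> >>>>> enum.
> >>>>>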
> >>>>
> >>>> Addressed.
> >>>>
> >>>>>> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> >>>>>> +				      machine_mode mem_mode) = 0;
> >>>>>> +  virtual int pair_mem_alias_check_limit () = 0;
> >>>>>> +  virtual bool handle_writeback_opportunities () = 0 ;
> >>>>>> +  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
> >>>>>> +					machine_mode mode) = 0;
> >>>>>> +  virtual rtx gen_mem_pair (rtx *pats,  rtx writeback,
> >>>>>
> >>>>> Nit: excess whitespace after pats,
> >>>>>
> >>>>>> +			    bool load_p) = 0;
> >>>>>> +  virtual bool pair_mem_promote_writeback_p (rtx pat) = 0;
> >>>>>> +  virtual bool track_load_p () = 0;
> >>>>>> +  virtual bool track_store_p () = 0;
> >>>>>
> >>>>> I think it would probably make more sense for these two to have
> >>>>> default-true implementations rather than being pure virtual functions.
> >>>>>
> >>>>> Also, sorry for the bikeshedding, but please can we keep the plural
> >>>>> names (so track_loads_p and track_stores_p).
> >>>>>
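> >>>>> I.e. something like:
> >>>>>
> >>>>>   virtual bool track_loads_p () { return true; }
> >>>>>   virtual bool track_stores_p () { return true; }
> >>>>>
> >>>>> with aarch64 overriding both to check the tuning policy as now.
> >>>>>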
> >>>>>> +  virtual bool cand_insns_empty_p (std::list<insn_info *> &insns) = 0;
> >>>>>
> >>>>> Why does this need to be virtualised?  I would expect it to
> >>>>> just be insns.empty () on all targets.
> >>>>>
> >>>>>> +  virtual bool pair_mem_in_range_p (HOST_WIDE_INT off) = 0;
> >>>>>> +  void ldp_fusion_bb (bb_info *bb);
> >>>>>
> >>>>> Just to check, are you planning to rename this function in a separate
> >>>>> patch?
> >>>>>
> >>>>
> >>>> Yes.
> >>>>>> +
> >>>>>> +  ~pair_fusion() { }
> >>>>>
> >>>>> As with the ctor, perhaps we could put the contents of ldp_fusion_destroy
> >>>>> in here and then delete that function?
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>> +};
> >>>>>> +
> >>>>>> +struct aarch64_pair_fusion : public pair_fusion
> >>>>>> +{
> >>>>>> +public:
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>> Nit: there shouldn't be a need for this as public is already the default
> >>>>> for structs.
> >>>>>
> >>>>>> +  aarch64_pair_fusion () : pair_fusion () {};
> >>>>>
> >>>>> This ctor looks pointless, the default one should do, so can we remove
> >>>>> this?
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>> +  bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
> >>>>>> +		    bool load_p) override final
> >>>>>> +  {
> >>>>>> +    const bool fpsimd_op_p
> >>>>>> +      = reload_completed
> >>>>>> +      ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
> >>>>>> +      : (GET_MODE_CLASS (mem_mode) != MODE_INT
> >>>>>> +	 && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
> >>>>>> +    return fpsimd_op_p;
> >>>>>
> >>>>> Can this just directly return the boolean expression instead of storing
> >>>>> it into a variable first?
> >>>>>
> >>>> Addressed.
> >>>>>> +  }
> >>>>>> +
> >>>>>> +  bool pair_mem_promote_writeback_p (rtx pat)
> >>>>>> +  {
> >>>>>> +     if (reload_completed
> >>>>>> +	 && aarch64_ldp_writeback > 1
> >>>>>> +	 && GET_CODE (pat) == PARALLEL
> >>>>>> +	 && XVECLEN (pat, 0) == 2)
> >>>>>> +	return true;
> >>>>>> +
> >>>>>> +     return false;
> >>>>>> +  }
> >>>>>> +
> >>>>>> +  bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
> >>>>>> +				machine_mode mode)
> >>>>>> +  {
> >>>>>> +    return aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
> >>>>>> +						     load_p,
> >>>>>> +						     mode);
> >>>>>> +  }
> >>>>>> +  bool pair_operand_mode_ok_p (machine_mode mode);
> >>>>>
> >>>>> Why have you implemented this (one-line function) out-of-line but the
> >>>>> others are implemented inline?  Please be consistent.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>> +
> >>>>>> +  rtx gen_mem_pair (rtx *pats, rtx writeback, bool load_p);
> >>>>>> +
> >>>>>> +  bool pair_trailing_writeback_p  ()
> >>>>>> +  {
> >>>>>> +    return aarch64_ldp_writeback > 1;
> >>>>>> +  }
> >>>>>> +  bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> >>>>>> +			      machine_mode mem_mode)
> >>>>>> +  {
> >>>>>> +    return (load_p
> >>>>>> +	     ? aarch64_ldp_reg_operand (reg_op, mem_mode)
> >>>>>> +	     : aarch64_stp_reg_operand (reg_op, mem_mode));
> >>>>>> +  }
> >>>>>> +  int pair_mem_alias_check_limit ()
> >>>>>> +  {
> >>>>>> +    return aarch64_ldp_alias_check_limit;
> >>>>>> +  }
> >>>>>> +  bool handle_writeback_opportunities ()
> >>>>>> +  {
> >>>>>> +    return aarch64_ldp_writeback;
> >>>>>> +  }
> >>>>>> +  bool track_load_p ()
> >>>>>> +  {
> >>>>>> +    const bool track_loads
> >>>>>> +      = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>>>> +    return track_loads;
> >>>>>> +  }
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>> As above, can this just return the boolean expression directly without
> >>>>> the temporary?
> >>>>>
> >>>>>> +  bool track_store_p ()
> >>>>>> +  {
> >>>>>> +    const bool track_stores
> >>>>>> +      = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>>>> +    return track_stores;
> >>>>>> +  }
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>> Likewise.
> >>>>>
> >>>>>> +  bool cand_insns_empty_p (std::list<insn_info *> &insns)
> >>>>>> +  {
> >>>>>> +    return insns.empty();
> >>>>>> +  }
> >>>>>
> >>>>> As mentioned on the declaration, I don't see why this needs to be
> >>>>> virtualised.
> >>>>>
> >>>>>> +  bool pair_mem_in_range_p (HOST_WIDE_INT off)
> >>>>>> +  {
> >>>>>> +    return (off < LDP_MIN_IMM || off > LDP_MAX_IMM);
> >>>>>> +  }
> >>>>>
> >>>>> Hmm, this looks to test exactly the opposite condition from what the
> >>>>> name suggests, i.e. the implementation looks like what I would expect
> >>>>> for a function called pair_mem_out_of_range_p.
> >>>>>
> >>>>> Please can you invert the condition and fix up callers appropriately?
> >>>>>
> >>>>> This suggests you're not really thinking about the code you're writing,
> >>>>> please try to think a bit more carefully before posting patches.
> >>>>>
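> >>>>> I.e. the implementation should become:
> >>>>>
> >>>>>   bool pair_mem_in_range_p (HOST_WIDE_INT off)
> >>>>>   {
> >>>>>     return off >= LDP_MIN_IMM && off <= LDP_MAX_IMM;
> >>>>>   }
> >>>>>
> >>>>> with the callers negated to match.
> >>>>>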
> >>>>
> >>>> Sorry for my mistake; I had overlooked it.
> >>>>>> +};
> >>>>>> +
> >>>>>> +
> >>>>>>  // State used by the pass for a given basic block.
> >>>>>> -struct ldp_bb_info
> >>>>>> +struct pair_fusion_bb_info
> >>>>>>  {
> >>>>>>    using def_hash = nofree_ptr_hash<def_info>;
> >>>>>>    using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
> >>>>>> @@ -160,14 +274,17 @@ struct ldp_bb_info
> >>>>>>  
> >>>>>>    static const size_t obstack_alignment = sizeof (void *);
> >>>>>>    bb_info *m_bb;
> >>>>>> +  pair_fusion *bb_state;
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>
> >>>>> Can we call this m_pass instead of bb_state?  The pair_fusion class is
> >>>>> explicitly at the pass level rather than the bb level.
> >>>>>
> >>>>> Also I think this (and indeed m_bb, although that's pre-existing) could
> >>>>> be private members.  Not sure there's a good reason for making them
> >>>>> public.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>  
> >>>>> It might be slightly nicer to use a reference for m_pass unless there's
> >>>>> a good reason to use a pointer, although I don't feel strongly about
> >>>>> this.
> >>>>>
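> >>>>> I.e. roughly (untested):
> >>>>>
> >>>>>   pair_fusion_bb_info (bb_info *bb, pair_fusion &pass)
> >>>>>     : m_bb (bb), m_pass (pass), m_emitted_tombstone (false)
> >>>>>   {
> >>>>>     // Obstack setup as before.
> >>>>>   }
> >>>>>
> >>>>> with m_bb and m_pass as private members.
> >>>>>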
> >>>>>>  
> >>>>>> -  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
> >>>>>> +  pair_fusion_bb_info (bb_info *bb,
> >>>>>> +		      aarch64_pair_fusion *d) : m_bb (bb),
> >>>>>
> >>>>> This generic class needs to take the generic pass type, i.e. just a
> >>>>> pair_fusion * rather than an aarch64_pair_fusion *.
> >>>>>
> >>>>>> +		      bb_state (d), m_emitted_tombstone (false)
> >>>>>>    {
> >>>>>>      obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
> >>>>>>  				obstack_alignment, obstack_chunk_alloc,
> >>>>>>  				obstack_chunk_free);
> >>>>>>    }
> >>>>>> -  ~ldp_bb_info ()
> >>>>>> +  ~pair_fusion_bb_info ()
> >>>>>>    {
> >>>>>>      obstack_free (&m_obstack, nullptr);
> >>>>>>  
> >>>>>> @@ -177,10 +294,32 @@ struct ldp_bb_info
> >>>>>>  	bitmap_obstack_release (&m_bitmap_obstack);
> >>>>>>        }
> >>>>>>    }
> >>>>>> +  void track_access (insn_info *, bool load, rtx mem);
> >>>>>> +  void transform ();
> >>>>>> +  void cleanup_tombstones ();
> >>>>>
> >>>>> I think most (or indeed all?) of the changes here shouldn't be needed.
> >>>>> AFAICS, there shouldn't be a need to:
> >>>>>  - drop inline from these functions.
> >>>>>  - change the accessibility of any members.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>> Please can you drop these changes or explain why they're needed (from
> >>>>> here until the end of the class).
> >>>>>
> >>>>>> +  void merge_pairs (insn_list_t &, insn_list_t &,
> >>>>>> +		    bool load_p, unsigned access_size);
> >>>>>> +  void transform_for_base (int load_size, access_group &group);
> >>>>>> +
> >>>>>> +  bool try_fuse_pair (bool load_p, unsigned access_size,
> >>>>>> +			     insn_info *i1, insn_info *i2);
> >>>>>> +
> >>>>>> +  bool fuse_pair (bool load_p, unsigned access_size,
> >>>>>> +		  int writeback,
> >>>>>> +		  insn_info *i1, insn_info *i2,
> >>>>>> +		  base_cand &base,
> >>>>>> +		  const insn_range_info &move_range);
> >>>>>>  
> >>>>>> -  inline void track_access (insn_info *, bool load, rtx mem);
> >>>>>> -  inline void transform ();
> >>>>>> -  inline void cleanup_tombstones ();
> >>>>>> +  void do_alias_analysis (insn_info *alias_hazards[4],
> >>>>>> +			  alias_walker *walkers[4],
> >>>>>> +			  bool load_p);
> >>>>>> +
> >>>>>> +  void track_tombstone (int uid);
> >>>>>> +
> >>>>>> +  bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
> >>>>>> +
> >>>>>> +  template<typename Map>
> >>>>>> +    void traverse_base_map (Map &map);
> >>>>>>  
> >>>>>>  private:
> >>>>>>    obstack m_obstack;
> >>>>>> @@ -191,30 +330,32 @@ private:
> >>>>>>    bool m_emitted_tombstone;
> >>>>>>  
> >>>>>>    inline splay_tree_node<access_record *> *node_alloc (access_record *);
> >>>>>> +};
> >>>>>>  
> >>>>>> -  template<typename Map>
> >>>>>> -  inline void traverse_base_map (Map &map);
> >>>>>> -  inline void transform_for_base (int load_size, access_group &group);
> >>>>>> -
> >>>>>> -  inline void merge_pairs (insn_list_t &, insn_list_t &,
> >>>>>> -			   bool load_p, unsigned access_size);
> >>>>>> -
> >>>>>> -  inline bool try_fuse_pair (bool load_p, unsigned access_size,
> >>>>>> -			     insn_info *i1, insn_info *i2);
> >>>>>> -
> >>>>>> -  inline bool fuse_pair (bool load_p, unsigned access_size,
> >>>>>> -			 int writeback,
> >>>>>> -			 insn_info *i1, insn_info *i2,
> >>>>>> -			 base_cand &base,
> >>>>>> -			 const insn_range_info &move_range);
> >>>>>> -
> >>>>>> -  inline void track_tombstone (int uid);
> >>>>>> +rtx aarch64_pair_fusion::gen_mem_pair (rtx *pats,
> >>>>>
> >>>>> Style nit: newline after rtx.
> >>>>>
> >>>> Addressed.
> >>>>
> >>>>>> +		  rtx writeback,
> >>>>>> +		  bool load_p)
> >>>>>> +  {
> >>>>>> +    rtx pair_pat;
> >>>>>>  
> >>>>>> -  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
> >>>>>> -};
> >>>>>> +    if (writeback)
> >>>>>> +      {
> >>>>>> +	auto patvec = gen_rtvec (3, writeback, pats[0], pats[1]);
> >>>>>> +	pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
> >>>>>> +      }
> >>>>>> +    else if (load_p)
> >>>>>> +      pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
> >>>>>> +					XEXP (pats[1], 0),
> >>>>>> +					XEXP (pats[0], 1));
> >>>>>> +    else
> >>>>>> +      pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
> >>>>>> +					 XEXP (pats[0], 1),
> >>>>>> +					 XEXP (pats[1], 1));
> >>>>>> +     return pair_pat;
> >>>>>
> >>>>> Can the assignments just be direct returns here, please?  Then
> >>>>> get rid of the temporary.
> >>>>>
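> >>>>> I.e.:
> >>>>>
> >>>>>   if (writeback)
> >>>>>     return gen_rtx_PARALLEL (VOIDmode,
> >>>>>                              gen_rtvec (3, writeback, pats[0], pats[1]));
> >>>>>   if (load_p)
> >>>>>     return aarch64_gen_load_pair (XEXP (pats[0], 0), XEXP (pats[1], 0),
> >>>>>                                   XEXP (pats[0], 1));
> >>>>>   return aarch64_gen_store_pair (XEXP (pats[0], 0), XEXP (pats[0], 1),
> >>>>>                                  XEXP (pats[1], 1));
> >>>>>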
> >>>>
> >>>> Addressed.
> >>>>  
> >>>>>> +  }
> >>>>>>  
> >>>>>>  splay_tree_node<access_record *> *
> >>>>>> -ldp_bb_info::node_alloc (access_record *access)
> >>>>>> +pair_fusion_bb_info::node_alloc (access_record *access)
> >>>>>>  {
> >>>>>>    using T = splay_tree_node<access_record *>;
> >>>>>>    void *addr = obstack_alloc (&m_obstack, sizeof (T));
> >>>>>> @@ -262,7 +403,7 @@ drop_writeback (rtx mem)
> >>>>>>  // RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
> >>>>>>  // MEM so that we know the mode of the access.
> >>>>>>  static rtx
> >>>>>> -ldp_strip_offset (rtx mem, poly_int64 *offset)
> >>>>>> +pair_mem_strip_offset (rtx mem, poly_int64 *offset)
> >>>>>
> >>>>> I thought we discussed splitting the renaming into a separate patch?
> >>>>>
> >>>>>>  {
> >>>>>>    rtx addr = XEXP (mem, 0);
> >>>>>>  
> >>>>>> @@ -332,6 +473,12 @@ ldp_operand_mode_ok_p (machine_mode mode)
> >>>>>>    return reload_completed || mode != TImode;
> >>>>>>  }
> >>>>>>  
> >>>>>> +bool
> >>>>>> +aarch64_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
> >>>>>> +{
> >>>>>> +  return ldp_operand_mode_ok_p (mode);
> >>>>>> +}
> >>>>>> +
> >>>>>>  // Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
> >>>>>>  // into an integer for use as a hash table key.
> >>>>>>  static int
> >>>>>> @@ -396,7 +543,7 @@ access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
> >>>>>>  // MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
> >>>>>>  // LFS is used as part of the key to the hash table, see track_access.
> >>>>>>  bool
> >>>>>> -ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >>>>>> +pair_fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >>>>>
> >>>>> Again, please do any renaming in a separate patch.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  {
> >>>>>>    if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
> >>>>>>      return false;
> >>>>>> @@ -412,9 +559,10 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >>>>>>    const machine_mode mem_mode = GET_MODE (mem);
> >>>>>>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
> >>>>>>  
> >>>>>> -  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
> >>>>>> -  // multiple of the access size, and we believe that misaligned offsets on
> >>>>>> -  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
> >>>>>> +  // Punt on misaligned offsets.  Paired memory access instructions require
> >>>>>> +  // offsets to be a multiple of the access size, and we believe that
> >>>>>> +  // misaligned offsets on MEM_EXPR bases are likely to lead to misaligned
> >>>>>> +  // offsets w.r.t. RTL bases.
> >>>>>>    if (!multiple_p (offset, mem_size))
> >>>>>>      return false;
> >>>>>>  
> >>>>>> @@ -438,46 +586,38 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >>>>>>  }
> >>>>>>  
> >>>>>>  // Main function to begin pair discovery.  Given a memory access INSN,
> >>>>>> -// determine whether it could be a candidate for fusing into an ldp/stp,
> >>>>>> +// determine whether it could be a candidate for fusing into an pair mem,
> >>>>>
> >>>>> s/an pair mem/a paired access/
> >>>>>
> >>>>>>  // and if so, track it in the appropriate data structure for this basic
> >>>>>>  // block.  LOAD_P is true if the access is a load, and MEM is the mem
> >>>>>>  // rtx that occurs in INSN.
> >>>>>>  void
> >>>>>> -ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>>> +pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>>>  {
> >>>>>>    // We can't combine volatile MEMs, so punt on these.
> >>>>>>    if (MEM_VOLATILE_P (mem))
> >>>>>>      return;
> >>>>>>  
> >>>>>> -  // Ignore writeback accesses if the param says to do so.
> >>>>>> -  if (!aarch64_ldp_writeback
> >>>>>> +  // Ignore writeback accesses if the param says to do so
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>> This comment needs updating, I guess something like "if the hook says to
> >>>>> do so".  But also please don't delete the full stop.
> >>>>>
> >>>>>> +  if (!bb_state->handle_writeback_opportunities ()
> >>>>>>        && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
> >>>>>>      return;
> >>>>>>  
> >>>>>>    const machine_mode mem_mode = GET_MODE (mem);
> >>>>>> -  if (!ldp_operand_mode_ok_p (mem_mode))
> >>>>>> +
> >>>>>
> >>>>> I don't think we need the extra whitespace here.
> >>>>>
> >>>>>> +  if (!bb_state->pair_operand_mode_ok_p (mem_mode))
> >>>>>>      return;
> >>>>>>  
> >>>>>>    rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
> >>>>>>  
> >>>>>> -  // Ignore the access if the register operand isn't suitable for ldp/stp.
> >>>>>> -  if (load_p
> >>>>>> -      ? !aarch64_ldp_reg_operand (reg_op, mem_mode)
> >>>>>> -      : !aarch64_stp_reg_operand (reg_op, mem_mode))
> >>>>>> +  if (!bb_state->pair_reg_operand_ok_p (load_p, reg_op, mem_mode))
> >>>>>>      return;
> >>>>>> -
> >>>>>>    // We want to segregate FP/SIMD accesses from GPR accesses.
> >>>>>>    //
> >>>>>>    // Before RA, we use the modes, noting that stores of constant zero
> >>>>>>    // operands use GPRs (even in non-integer modes).  After RA, we use
> >>>>>>    // the hard register numbers.
> >>>>>> -  const bool fpsimd_op_p
> >>>>>> -    = reload_completed
> >>>>>> -    ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
> >>>>>> -    : (GET_MODE_CLASS (mem_mode) != MODE_INT
> >>>>>> -       && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
> >>>>>> -
> >>>>>> -  // Note ldp_operand_mode_ok_p already rejected VL modes.
> >>>>>> +  const bool fpsimd_op_p = bb_state->fpsimd_op_p (reg_op, mem_mode, load_p);
> >>>>>> +  // Note pair_operand_mode_ok_p already rejected VL modes.
> >>>>>>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
> >>>>>>    const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
> >>>>>>  
> >>>>>> @@ -487,7 +627,7 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>>>    poly_int64 mem_off;
> >>>>>>    rtx addr = XEXP (mem, 0);
> >>>>>>    const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
> >>>>>> -  rtx base = ldp_strip_offset (mem, &mem_off);
> >>>>>> +  rtx base = pair_mem_strip_offset (mem, &mem_off);
> >>>>>
> >>>>> Again, let's split the renaming off into a separate patch.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>    if (!REG_P (base))
> >>>>>>      return;
> >>>>>>  
> >>>>>> @@ -506,8 +646,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>>>    // elimination offset pre-RA, we should postpone forming pairs on such
> >>>>>>    // accesses until after RA.
> >>>>>>    //
> >>>>>> -  // As it stands, addresses with offsets in range for LDR but not
> >>>>>> -  // in range for LDP/STP are currently reloaded inefficiently,
> >>>>>> +  // As it stands, addresses in range for an individual load/store but not
> >>>>>> +  // for a paired access are currently reloaded inefficiently,
> >>>>>>    // ending up with a separate base register for each pair.
> >>>>>>    //
> >>>>>>    // In theory LRA should make use of
> >>>>>> @@ -519,8 +659,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>>>    // that calls targetm.legitimize_address_displacement.
> >>>>>>    //
> >>>>>>    // So for now, it's better to punt when we can't be sure that the
> >>>>>> -  // offset is in range for LDP/STP.  Out-of-range cases can then be
> >>>>>> -  // handled after RA by the out-of-range LDP/STP peepholes.  Eventually, it
> >>>>>> +  // offset is in range for paired access.  Out-of-range cases can then be
> >>>>>> +  // handled after RA by the out-of-range PAIR MEM  peepholes.  Eventually, it
> >>>>>>    // would be nice to handle known out-of-range opportunities in the
> >>>>>>    // pass itself (for stack accesses, this would be in the post-RA pass).
> >>>>>>    if (!reload_completed
> >>>>>> @@ -573,8 +713,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>>>  	gcc_unreachable (); // Base defs should be unique.
> >>>>>>      }
> >>>>>>  
> >>>>>> -  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
> >>>>>> -  // the access size.
> >>>>>> +  // Punt on misaligned offsets.  Paired memory accesses require offsets
> >>>>>> +  // to be a multiple of the access size.
> >>>>>>    if (!multiple_p (mem_off, mem_size))
> >>>>>>      return;
> >>>>>>  
> >>>>>> @@ -614,7 +754,7 @@ static bool no_ignore (insn_info *) { return false; }
> >>>>>>  // making use of alias disambiguation.
> >>>>>>  static insn_info *
> >>>>>>  latest_hazard_before (insn_info *insn, rtx *ignore,
> >>>>>> -		      insn_info *ignore_insn = nullptr)
> >>>>>> +		      insn_info *ignore_insn = 0)
> >>>>>
> >>>>> Is this change needed?
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  {
> >>>>>>    insn_info *result = nullptr;
> >>>>>>  
> >>>>>> @@ -1150,7 +1290,7 @@ extract_writebacks (bool load_p, rtx pats[2], int changed)
> >>>>>>        const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
> >>>>>>  
> >>>>>>        poly_int64 offset;
> >>>>>> -      rtx this_base = ldp_strip_offset (mem, &offset);
> >>>>>> +      rtx this_base = pair_mem_strip_offset (mem, &offset);
> >>>>>>        gcc_assert (REG_P (this_base));
> >>>>>>        if (base_reg)
> >>>>>>  	gcc_assert (rtx_equal_p (base_reg, this_base));
> >>>>>> @@ -1286,7 +1426,11 @@ find_trailing_add (insn_info *insns[2],
> >>>>>>  
> >>>>>>    off_hwi /= access_size;
> >>>>>>  
> >>>>>> -  if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
> >>>>>> +  pair_fusion *pfuse;
> >>>>>> +  aarch64_pair_fusion derived;
> >>>>>> +  pfuse = &derived;
> >>>>>
> >>>>> Huh? We'll want to use the same instance of the pass structure (pair_fusion)
> >>>>> that gets created at the top level, so that should get passed around everywhere.
> >>>>>
> >>>>> In the case of find_trailing_add we probably don't want to add any more
> >>>>> parameters (as Segher noted there are already too many), so perhaps the
> >>>>> easiest thing would be to make find_trailing_add itself a (non-virtual)
> >>>>> member function of pair_fusion.
> >>>>>
> >>>>> Then here we can just query pair_mem_in_range_p directly.
> >>>>>
> >>>>> For try_promote_writeback itself you can either make it a non-virtual
> >>>>> member function of pair_fusion or just pass a reference to the pass
> >>>>> object in when you call it.
> >>>>>
> >>>>> Note also that you don't have to create a pointer of the parent type in
> >>>>> order to call a virtual function on a derived object, you can just call
> >>>>> it directly.
> >>>>>
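> >>>>> I.e. a rough sketch:
> >>>>>
> >>>>>   // Non-virtual member of pair_fusion (parameters as before, elided
> >>>>>   // here):
> >>>>>   insn_info *find_trailing_add (insn_info *insns[2], /* ... */);
> >>>>>
> >>>>> and then inside it the check simply becomes:
> >>>>>
> >>>>>   if (!pair_mem_in_range_p (off_hwi))
> >>>>>     return nullptr;
> >>>>>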
> >>>>
> >>>> Addressed.
> >>>>>> +
> >>>>>> +  if (pfuse->pair_mem_in_range_p (off_hwi))
> >>>>>>      return nullptr;
> >>>>>
> >>>>> Note of course this condition needs inverting when you invert the
> >>>>> implementation.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  
> >>>>>>    auto dump_prefix = [&]()
> >>>>>> @@ -1328,7 +1472,7 @@ find_trailing_add (insn_info *insns[2],
> >>>>>>  // We just emitted a tombstone with uid UID, track it in a bitmap for
> >>>>>>  // this BB so we can easily identify it later when cleaning up tombstones.
> >>>>>>  void
> >>>>>> -ldp_bb_info::track_tombstone (int uid)
> >>>>>> +pair_fusion_bb_info::track_tombstone (int uid)
> >>>>>
> >>>>> Again, please split the renaming off.  Same with other instances below.
> >>>>>
> >>>>>>  {
> >>>>>>    if (!m_emitted_tombstone)
> >>>>>>      {
> >>>>>> @@ -1528,7 +1672,7 @@ fixup_debug_uses (obstack_watermark &attempt,
> >>>>>>  	  gcc_checking_assert (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0)))
> >>>>>>  			       == RTX_AUTOINC);
> >>>>>>  
> >>>>>> -	  base = ldp_strip_offset (mem, &offset);
> >>>>>> +	  base = pair_mem_strip_offset (mem, &offset);
> >>>>>>  	  gcc_checking_assert (REG_P (base) && REGNO (base) == base_regno);
> >>>>>>  	}
> >>>>>>        fixup_debug_use (attempt, use, def, base, offset);
> >>>>>> @@ -1664,7 +1808,7 @@ fixup_debug_uses (obstack_watermark &attempt,
> >>>>>>  // BASE gives the chosen base candidate for the pair and MOVE_RANGE is
> >>>>>>  // a singleton range which says where to place the pair.
> >>>>>>  bool
> >>>>>> -ldp_bb_info::fuse_pair (bool load_p,
> >>>>>> +pair_fusion_bb_info::fuse_pair (bool load_p,
> >>>>>>  			unsigned access_size,
> >>>>>>  			int writeback,
> >>>>>>  			insn_info *i1, insn_info *i2,
> >>>>>> @@ -1800,7 +1944,7 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>>>      {
> >>>>>>        if (dump_file)
> >>>>>>  	fprintf (dump_file,
> >>>>>> -		 "  ldp: i%d has wb but subsequent i%d has non-wb "
> >>>>>> +		 "  load pair: i%d has wb but subsequent i%d has non-wb "
> >>>>>>  		 "update of base (r%d), dropping wb\n",
> >>>>>>  		 insns[0]->uid (), insns[1]->uid (), base_regno);
> >>>>>>        gcc_assert (writeback_effect);
> >>>>>> @@ -1823,7 +1967,7 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>>>      }
> >>>>>>  
> >>>>>>    // If either of the original insns had writeback, but the resulting pair insn
> >>>>>> -  // does not (can happen e.g. in the ldp edge case above, or if the writeback
> >>>>>> +  // does not (can happen e.g. in the load pair edge case above, or if the writeback
> >>>>>>    // effects cancel out), then drop the def(s) of the base register as
> >>>>>>    // appropriate.
> >>>>>>    //
> >>>>>> @@ -1842,7 +1986,7 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>>>    // update of the base register and try and fold it in to make this into a
> >>>>>>    // writeback pair.
> >>>>>>    insn_info *trailing_add = nullptr;
> >>>>>> -  if (aarch64_ldp_writeback > 1
> >>>>>> +  if (bb_state->pair_trailing_writeback_p ()
> >>>>>>        && !writeback_effect
> >>>>>>        && (!load_p || (!refers_to_regno_p (base_regno, base_regno + 1,
> >>>>>>  					 XEXP (pats[0], 0), nullptr)
> >>>>>> @@ -1863,14 +2007,14 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>>>      }
> >>>>>>  
> >>>>>>    // Now that we know what base mem we're going to use, check if it's OK
> >>>>>> -  // with the ldp/stp policy.
> >>>>>> +  // with the pair mem policy.
> >>>>>>    rtx first_mem = XEXP (pats[0], load_p);
> >>>>>> -  if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
> >>>>>> -						load_p,
> >>>>>> -						GET_MODE (first_mem)))
> >>>>>> +  if (!bb_state->pair_mem_ok_with_policy (first_mem,
> >>>>>> +			  load_p,
> >>>>>> +			  GET_MODE (first_mem)))
> >>>>>>      {
> >>>>>>        if (dump_file)
> >>>>>> -	fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n",
> >>>>>> +	fprintf (dump_file, "punting on pair (%d,%d), pair mem  policy says no\n",
> >>>>>>  		 i1->uid (), i2->uid ());
> >>>>>>        return false;
> >>>>>>      }
> >>>>>> @@ -1878,21 +2022,10 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>>>    rtx reg_notes = combine_reg_notes (first, second, load_p);
> >>>>>>  
> >>>>>>    rtx pair_pat;
> >>>>>> -  if (writeback_effect)
> >>>>>> -    {
> >>>>>> -      auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
> >>>>>> -      pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
> >>>>>> -    }
> >>>>>> -  else if (load_p)
> >>>>>> -    pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
> >>>>>> -				      XEXP (pats[1], 0),
> >>>>>> -				      XEXP (pats[0], 1));
> >>>>>> -  else
> >>>>>> -    pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
> >>>>>> -				       XEXP (pats[0], 1),
> >>>>>> -				       XEXP (pats[1], 1));
> >>>>>>  
> >>>>>> +  pair_pat = bb_state->gen_mem_pair (pats, writeback_effect, load_p);
> >>>>>>    insn_change *pair_change = nullptr;
> >>>>>> +
> >>>>>>    auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
> >>>>>>      rtx_insn *rti = change->insn ()->rtl ();
> >>>>>>      validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
> >>>>>> @@ -2133,15 +2266,6 @@ load_modified_by_store_p (insn_info *load,
> >>>>>>    return false;
> >>>>>>  }
> >>>>>>  
> >>>>>> -// Virtual base class for load/store walkers used in alias analysis.
> >>>>>> -struct alias_walker
> >>>>>> -{
> >>>>>> -  virtual bool conflict_p (int &budget) const = 0;
> >>>>>> -  virtual insn_info *insn () const = 0;
> >>>>>> -  virtual bool valid () const  = 0;
> >>>>>> -  virtual void advance () = 0;
> >>>>>> -};
> >>>>>> -
> >>>>>>  // Implement some common functionality used by both store_walker
> >>>>>>  // and load_walker.
> >>>>>>  template<bool reverse>
> >>>>>> @@ -2259,13 +2383,13 @@ public:
> >>>>>>  //
> >>>>>>  // We try to maintain the invariant that if a walker becomes invalid, we
> >>>>>>  // set its pointer to null.
> >>>>>> -static void
> >>>>>> -do_alias_analysis (insn_info *alias_hazards[4],
> >>>>>> +void
> >>>>>> +pair_fusion_bb_info::do_alias_analysis (insn_info *alias_hazards[4],
> >>>>>
> >>>>> It would probably make more sense for this to be a (non-virtual) member
> >>>>> function of the pair_fusion class, since it doesn't need to access the
> >>>>> BB-specific state.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  		   alias_walker *walkers[4],
> >>>>>>  		   bool load_p)
> >>>>>>  {
> >>>>>>    const int n_walkers = 2 + (2 * !load_p);
> >>>>>> -  int budget = aarch64_ldp_alias_check_limit;
> >>>>>> +  int budget = bb_state->pair_mem_alias_check_limit ();
> >>>>>>  
> >>>>>>    auto next_walker = [walkers,n_walkers](int current) -> int {
> >>>>>>      for (int j = 1; j <= n_walkers; j++)
> >>>>>> @@ -2365,7 +2489,7 @@ get_viable_bases (insn_info *insns[2],
> >>>>>>      {
> >>>>>>        const bool is_lower = (i == reversed);
> >>>>>>        poly_int64 poly_off;
> >>>>>> -      rtx base = ldp_strip_offset (cand_mems[i], &poly_off);
> >>>>>> +      rtx base = pair_mem_strip_offset (cand_mems[i], &poly_off);
> >>>>>>        if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
> >>>>>>  	writeback |= (1 << i);
> >>>>>>  
> >>>>>> @@ -2373,7 +2497,7 @@ get_viable_bases (insn_info *insns[2],
> >>>>>>  	continue;
> >>>>>>  
> >>>>>>        // Punt on accesses relative to eliminable regs.  See the comment in
> >>>>>> -      // ldp_bb_info::track_access for a detailed explanation of this.
> >>>>>> +      // pair_fusion_bb_info::track_access for a detailed explanation of this.
> >>>>>>        if (!reload_completed
> >>>>>>  	  && (REGNO (base) == FRAME_POINTER_REGNUM
> >>>>>>  	      || REGNO (base) == ARG_POINTER_REGNUM))
> >>>>>> @@ -2397,7 +2521,11 @@ get_viable_bases (insn_info *insns[2],
> >>>>>>        if (!is_lower)
> >>>>>>  	base_off--;
> >>>>>>  
> >>>>>> -      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
> >>>>>> +      pair_fusion *pfuse;
> >>>>>> +      aarch64_pair_fusion derived;
> >>>>>> +      pfuse = &derived;
> >>>>>
> >>>>> Again, please don't do this but instead use the same instance
> >>>>> everywhere.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>> +
> >>>>>> +      if (pfuse->pair_mem_in_range_p (base_off))
> >>>>>>  	continue;
> >>>>>>  
> >>>>>>        use_info *use = find_access (insns[i]->uses (), REGNO (base));
> >>>>>> @@ -2454,12 +2582,12 @@ get_viable_bases (insn_info *insns[2],
> >>>>>>  }
> >>>>>>  
> >>>>>>  // Given two adjacent memory accesses of the same size, I1 and I2, try
> >>>>>> -// and see if we can merge them into a ldp or stp.
> >>>>>> +// and see if we can merge them into a paired accesses load and store.
> >>>>>
> >>>>> "into a load pair or store pair" or "into a paired access" if you
> >>>>> prefer.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  //
> >>>>>>  // ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
> >>>>>>  // if the accesses are both loads, otherwise they are both stores.
> >>>>>>  bool
> >>>>>> -ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> >>>>>> +pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> >>>>>>  			    insn_info *i1, insn_info *i2)
> >>>>>>  {
> >>>>>>    if (dump_file)
> >>>>>> @@ -2494,7 +2622,7 @@ ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> >>>>>>      {
> >>>>>>        if (dump_file)
> >>>>>>  	fprintf (dump_file,
> >>>>>> -		 "punting on ldp due to reg conflcits (%d,%d)\n",
> >>>>>> +		 "punting on pair mem load  due to reg conflcits (%d,%d)\n",
> >>>>>>  		 insns[0]->uid (), insns[1]->uid ());
> >>>>>>        return false;
> >>>>>>      }
> >>>>>> @@ -2843,7 +2971,7 @@ debug (const insn_list_t &l)
> >>>>>>  // we can't re-order them anyway, so provided earlier passes have cleaned up
> >>>>>>  // redundant loads, we shouldn't miss opportunities by doing this.
> >>>>>>  void
> >>>>>> -ldp_bb_info::merge_pairs (insn_list_t &left_list,
> >>>>>> +pair_fusion_bb_info::merge_pairs (insn_list_t &left_list,
> >>>>>>  			  insn_list_t &right_list,
> >>>>>>  			  bool load_p,
> >>>>>>  			  unsigned access_size)
> >>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
> >>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
> >>>>>>  // merge_pairs.
> >>>>>>  void
> >>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>> -				 access_group &group)
> >>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
> >>>>>> +					 access_group &group)
> >>>>>>  {
> >>>>>>    const auto lfs = decode_lfs (encoded_lfs);
> >>>>>>    const unsigned access_size = lfs.size;
> >>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>  		       access.cand_insns,
> >>>>>>  		       lfs.load_p,
> >>>>>>  		       access_size);
> >>>>>> -	  skip_next = access.cand_insns.empty ();
> >>>>>> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
> >>>>>
> >>>>> As above, why is this needed?
> >>>>
> >>>> For rs6000 we want to always return true, as load/store pairs
> >>>> to be merged are occurring at offsets 8/16, 16/32 and 32/64 for rs6000,
> >>>> and we only want the 8/16 and 32/64 pairs. That's why we want
> >>>> to always return true for rs6000, to skip pairs as above.
> >>>
> >>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
> >>> load/store pair instructions where the two arms of the access are storing
> >>> operands of different sizes?  Or something else?
> >>>
> >>> As it stands the logic is to skip the next iteration only if we
> >>> exhausted all the candidate insns for the current access.  In the case
> >>> that we didn't exhaust all such candidates, then the idea is that when
> >>> access becomes prev_access, we can attempt to use those candidates as
> >>> the "left-hand side" of a pair in the next iteration since we failed to
> >>> use them as the "right-hand side" of a pair in the current iteration.
> >>> I don't see why you wouldn't want that behaviour.  Please can you
> >>> explain?
> >>>
> >>
> >> In merge_pairs we get the 2 load candidates: one load from offset 0 and
> >> the other load from offset 16. Then in the next iteration we get a load
> >> from offset 16 and the other load from offset 32. In the next iteration
> >> we get a load from offset 32 and the other load from offset 48.
> >>
> >> For example:
> >>
> >> Currently we get the load candidates as follows.
> >>
> >> pairs:
> >>
> >> load from 0th offset.
> >> load from 16th offset.
> >>
> >> next pairs:
> >>
> >> load from 16th offset.
> >> load from 32nd offset.
> >>
> >> next pairs:
> >>
> >> load from 32nd offset
> >> load from 48th offset.
> >>
> >> Instead in rs6000 we should get:
> >>
> >> pairs:
> >>
> >> load from 0th offset
> >> load from 16th offset.
> >>
> >> next pairs:
> >>
> >> load from 32nd offset
> >> load from 48th offset.
> > 
> > Hmm, so then I guess my question is: why wouldn't you consider merging
> > the pair with offsets (16,32) for rs6000?  Is it because you have a
> > stricter alignment requirement on the base pair offsets (i.e. they have
> > to be a multiple of 32 when the operand size is 16)?  So the pair
> > offsets have to be a multiple of the entire pair size rather than a
> > single operand size?
> 
> We get a load pair at one program point with offsets (0, 16), and at
> another program point we get a load pair with offsets (32, 48).
> 
> The current implementation takes the load offsets as (0, 16),
> (16, 32), (32, 48).
> 
> But in rs6000 we want the load pairs to be merged at different points,
> as (0, 16) and (32, 48): for (0, 16) we want to replace the loads with
> an lxvp at offset 0, and the other loads (32, 48) with an lxvp at
> offset 32.
> 
> In the current case it will merge into an lxvp at offset 0 and an lxvp
> at offset 16, then an lxvp at offset 32 and an lxvp at offset 48, which
> is incorrect in our case, as in the (16, 32) case the load at offset 16
> will not load from an even register and will break for rs6000.

Sorry, I think I'm still missing something here.  Why does the address offset
affect the parity of the transfer register?  ISTM they needn't be related at
all (and indeed we can't even know the parity of the transfer register before
RA, but perhaps you're only intending on running the pass after RA?)

How do you plan on handling this even-odd requirement for rs6000?
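
For concreteness, the pair_offset_alignment_ok_p hook I suggested
(quoted below) might look roughly like this (untested):

  virtual bool pair_offset_alignment_ok_p (poly_int64 offset,
                                           unsigned access_size)
  {
    // aarch64: pair offsets need only be a multiple of a single
    // operand's size.
    return multiple_p (offset, access_size);
  }

with the rs6000 override instead returning
multiple_p (offset, access_size * 2).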

Thanks,
Alex

> 
> lxvp should load from an even register, and then the loaded value will
> be in the even register and the even register + 1 (which is odd).
> 
> Thanks & Regards
> Ajit
> > If that is the case then I think it would be better to introduce a
> > virtual function (say pair_offset_alignment_ok_p) that vets the base
> > offset of the pair (prev_access->offset in transform_for_base).  I guess
> > it would also take access_size as a parameter and for aarch64 it should
> > check:
> > 
> >   multiple_p (offset, access_size)
> > 
> > and for rs6000 it could check:
> > 
> >   multiple_p (offset, access_size * 2)
> > 
> > and we would then incorporate a call to that predicate in the else if
> > condition of transform_for_base.
> > 
> > It would have the same limitation whereby we assume that MEM_EXPR offset
> > alignment is a good proxy for RTL offset alignment, but we already
> > depend on that with the multiple_p check in track_via_mem_expr.
> > 
> > Thanks,
> > Alex
> > 
> >>
> >> Thanks & Regards
> >> Ajit
> >>> Thanks,
> >>> Alex
> >>>
> >>>>
> >>>>>
> >>>>>>  	}
> >>>>>>        prev_access = &access;
> >>>>>>      }
> >>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>  // and remove all the tombstone insns, being sure to reparent any uses
> >>>>>>  // of mem to previous defs when we do this.
> >>>>>>  void
> >>>>>> -ldp_bb_info::cleanup_tombstones ()
> >>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
> >>>>>>  {
> >>>>>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
> >>>>>>    if (!m_emitted_tombstone)
> >>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
> >>>>>>  
> >>>>>>  template<typename Map>
> >>>>>>  void
> >>>>>> -ldp_bb_info::traverse_base_map (Map &map)
> >>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
> >>>>>>  {
> >>>>>>    for (auto kv : map)
> >>>>>>      {
> >>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
> >>>>>>  }
> >>>>>>  
> >>>>>>  void
> >>>>>> -ldp_bb_info::transform ()
> >>>>>> +pair_fusion_bb_info::transform ()
> >>>>>>  {
> >>>>>>    traverse_base_map (expr_map);
> >>>>>>    traverse_base_map (def_map);
> >>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
> >>>>>>  // for load/store candidates.  If running after RA, also try and promote
> >>>>>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
> >>>>>>  // candidates into pairs.
> >>>>>> -void ldp_fusion_bb (bb_info *bb)
> >>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
> >>>>>>  {
> >>>>>> -  const bool track_loads
> >>>>>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>>>> -  const bool track_stores
> >>>>>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>>>> +  const bool track_loads = track_load_p ();
> >>>>>> +  const bool track_stores = track_store_p ();
> >>>>>>  
> >>>>>> -  ldp_bb_info bb_state (bb);
> >>>>>
> >>>>> This:
> >>>>>
> >>>>>> +  aarch64_pair_fusion derived;
> >>>>>
> >>>>> can be deleted and then:
> >>>>>
> >>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
> >>>>>
> >>>>> can just be:
> >>>>>
> >>>>>   pair_fusion_bb_info bb_info (bb, this);
> >>>>>
> >>>>> (or you can pass *this if you make bb_info take a reference).
> >>>>>
> >>>>> I don't think there's a particular need to change the variable name
> >>>>> (bb_state -> bb_info).  I chose the former because it doesn't clash
> >>>>> with the RTL-SSA structure of the same name as the latter.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  
> >>>>>>    for (auto insn : bb->nondebug_insns ())
> >>>>>>      {
> >>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
> >>>>>>  	continue;
> >>>>>>  
> >>>>>>        rtx pat = PATTERN (rti);
> >>>>>> -      if (reload_completed
> >>>>>> -	  && aarch64_ldp_writeback > 1
> >>>>>> -	  && GET_CODE (pat) == PARALLEL
> >>>>>> -	  && XVECLEN (pat, 0) == 2)
> >>>>>> +      if (pair_mem_promote_writeback_p (pat))
> >>>>>>  	try_promote_writeback (insn);
> >>>>>
> >>>>> It looks like try_promote_writeback itself will need some further work
> >>>>> to make it target-independent.  I suppose this check:
> >>>>>
> >>>>>   auto rti = insn->rtl ();
> >>>>>   const auto attr = get_attr_ldpstp (rti);
> >>>>>   if (attr == LDPSTP_NONE)
> >>>>>     return;
> >>>>>
> >>>>>   bool load_p = (attr == LDPSTP_LDP);
> >>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
> >>>>>
> >>>>> will need to become part of the pair_mem_promote_writeback_p hook that you
> >>>>> added, potentially changing it to return a boolean for load_p.
> >>>>>
> >>>>> Then I guess we will need hooks for destructuring the pair insn and
> >>>>> another hook to wrap aarch64_gen_writeback_pair.
> >>>>>
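> >>>>> A sketch of the reshaped hook (the exact signature is just a
> >>>>> suggestion):
> >>>>>
> >>>>>   // Return true iff INSN is a pair that we should try to promote to
> >>>>>   // use writeback addressing, setting LOAD_P iff the pair is a load.
> >>>>>   virtual bool pair_mem_promote_writeback_p (insn_info *insn,
> >>>>>                                              bool &load_p) = 0;
> >>>>>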
> >>>>
> >>>> Addressed.
> >>>>>>  
> >>>>>>        if (GET_CODE (pat) != SET)
> >>>>>>  	continue;
> >>>>>>  
> >>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
> >>>>>> -	bb_state.track_access (insn, false, XEXP (pat, 0));
> >>>>>> +	bb_info.track_access (insn, false, XEXP (pat, 0));
> >>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
> >>>>>> -	bb_state.track_access (insn, true, XEXP (pat, 1));
> >>>>>> +	bb_info.track_access (insn, true, XEXP (pat, 1));
> >>>>>>      }
> >>>>>>  
> >>>>>> -  bb_state.transform ();
> >>>>>> -  bb_state.cleanup_tombstones ();
> >>>>>> +  bb_info.transform ();
> >>>>>> +  bb_info.cleanup_tombstones ();
> >>>>>>  }
> >>>>>>  
> >>>>>>  void ldp_fusion ()
> >>>>>>  {
> >>>>>>    ldp_fusion_init ();
> >>>>>> +  pair_fusion *pfuse;
> >>>>>> +  aarch64_pair_fusion derived;
> >>>>>> +  pfuse = &derived;
> >>>>>
> >>>>> This is indeed the one place where I think it is acceptable to
> >>>>> instantiate aarch64_pair_fusion.  But again there's no need to create a
> >>>>> pointer to the parent class, just call any function you like directly.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  
> >>>>>>    for (auto bb : crtl->ssa->bbs ())
> >>>>>> -    ldp_fusion_bb (bb);
> >>>>>> +    pfuse->ldp_fusion_bb (bb);
> >>>>>
> >>>>> I think even the code to iterate over bbs should itself be a member
> >>>>> function of pair_fusion (say "run") and then that becomes part of the
> >>>>> generic code.
> >>>>>
> >>>>> So this function would just become:
> >>>>>
> >>>>> aarch64_pair_fusion pass;
> >>>>> pass.run ();
> >>>>>
> >>>>> and could be inlined into the caller.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>> Perhaps you could also add an early return like:
> >>>>>
> >>>>>   if (!track_loads_p () && !track_stores_p ())
> >>>>>     return;
> >>>>>
> >>>>> in pair_fusion::run () and then remove the corresponding code from
> >>>>> pass_ldp_fusion::gate?
> >>>>>
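> >>>>> I.e. roughly:
> >>>>>
> >>>>>   void
> >>>>>   pair_fusion::run ()
> >>>>>   {
> >>>>>     if (!track_loads_p () && !track_stores_p ())
> >>>>>       return;
> >>>>>
> >>>>>     for (auto bb : crtl->ssa->bbs ())
> >>>>>       ldp_fusion_bb (bb);
> >>>>>   }
> >>>>>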
> >>>>
> >>>> Addressed.
> >>>>
> >>>>>>  
> >>>>>>    ldp_fusion_destroy ();
> >>>>>>  }
> >>>>>> -- 
> >>>>>> 2.39.3
> >>>>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Alex
> >>>>
> >>>> Thanks & Regards
> >>>> Ajit

Thread overview: 18+ messages
2024-04-05  8:23 Ajit Agarwal
2024-04-05 16:33 ` Alex Coplan
2024-04-09 12:00   ` Ajit Agarwal
2024-04-09 13:59     ` Alex Coplan
2024-04-09 14:31       ` Ajit Agarwal
2024-04-09 15:09         ` Alex Coplan
2024-04-09 15:29           ` Ajit Agarwal
2024-04-10  8:12             ` Alex Coplan [this message]
2024-04-10 10:01               ` Ajit Agarwal
2024-04-10 14:22                 ` Alex Coplan
2024-04-10 18:18                   ` Ajit Agarwal
2024-04-11 14:25                     ` Alex Coplan
2024-04-12 14:32                       ` Ajit Agarwal
2024-04-12 14:45                         ` Alex Coplan
2024-04-12 17:32                           ` Ajit Agarwal
2024-04-14 16:59                             ` Ajit Agarwal
2024-04-22  7:31                               ` Ajit Agarwal
2024-05-03  8:55                                 ` Alex Coplan
