From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=OCIQ=LR=linux.ibm.com=aagarwa1@sourceware.org>
Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5])
	by sourceware.org (Postfix) with ESMTPS id 5F4A23858D38
	for <gcc-patches@gcc.gnu.org>; Fri, 12 Apr 2024 17:33:08 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 5F4A23858D38
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=linux.ibm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linux.ibm.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 5F4A23858D38
Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=148.163.158.5
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1712943191; cv=none;
	b=D8PQGBROQMNkR36FuBEzZau+XemyFSymnTX4J51ccEPQ1yrBJ7dqNIccdHOpfe30IgEcReklLRi9sTa2LhHr7poSszVYdMRcjBumbp0U/hqqO6TzEqfu+jMeb/SNi6KyEjSTRzZRp6Y65worYuptBciNeINc/GCqfwB2owwJksM=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
	t=1712943191; c=relaxed/simple;
	bh=cw3cpbeVFcFyK9r5oyClUJN4sLqSE5kwVGrZQ0/Qkxg=;
	h=DKIM-Signature:Message-ID:Date:MIME-Version:Subject:To:From; b=lIemTa356N81/88/CtCtEKNKCZ5U+THYazwzzZ0ImfZcsKWz68vuHohxls+9TNWgbaCpFutHGK3NoSl7JrrtmHV7z7wpx6zXiVaYQHCATjaY3uWw81/7dJ1bgHpueNtuH+7hQKbJfz0zpc7Jn7ofiiyxXxb32HeL9bjkuFnAW9Y=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from pps.filterd (m0353723.ppops.net [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 43CH0ft3012936;
	Fri, 12 Apr 2024 17:33:06 GMT
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=message-id : date :
 mime-version : subject : to : cc : references : from : in-reply-to :
 content-type : content-transfer-encoding; s=pp1;
 bh=4ft/D3TuSDm9lqd0y/NOVl0Yx6+EwSB4hjtEaKMF6UE=;
 b=d8qVIR0OyMfBF0L3rvYtWNXiJtl1DgUnEtMG8+HBr6a93IhgqACyg1wUh+oI6ksdqYVy
 Rd6KflWaeDUky2L7iTG/VZPRqVM76zMYkbGKVk5K2hBkFYSr4cP8ZTb9j07IU6m9WV+5
 A0RZIQgG6Jh//HCW3jZb/YpEm5yAPyCx5r7oAt2ErHE1lxbRdYnehvGscTzxzJ2DqkdX
 bvuEtRZy5jDHq1u7qcSNCDydOi89EYRNauaxrC//TcDPakO3Ro/+OCug8QiFeeQ7wFxa
 u5lDYFksU4s7Tq4QfyRv5udQOrzJrgLxuQmLQECoif6PAACoh9uN3kN1ESu2CcUW8qjW Ug== 
Received: from pps.reinject (localhost [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3xf4pg8pd3-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Fri, 12 Apr 2024 17:33:05 +0000
Received: from m0353723.ppops.net (m0353723.ppops.net [127.0.0.1])
	by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 43CHX5bk028934;
	Fri, 12 Apr 2024 17:33:05 GMT
Received: from ppma11.dal12v.mail.ibm.com (db.9e.1632.ip4.static.sl-reverse.com [50.22.158.219])
	by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3xf4pg8pd1-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Fri, 12 Apr 2024 17:33:05 +0000
Received: from pps.filterd (ppma11.dal12v.mail.ibm.com [127.0.0.1])
	by ppma11.dal12v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 43CG5slY016965;
	Fri, 12 Apr 2024 17:33:04 GMT
Received: from smtprelay03.dal12v.mail.ibm.com ([172.16.1.5])
	by ppma11.dal12v.mail.ibm.com (PPS) with ESMTPS id 3xbke32tap-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Fri, 12 Apr 2024 17:33:04 +0000
Received: from smtpav04.dal12v.mail.ibm.com (smtpav04.dal12v.mail.ibm.com [10.241.53.103])
	by smtprelay03.dal12v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 43CHX18C59441592
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
	Fri, 12 Apr 2024 17:33:03 GMT
Received: from smtpav04.dal12v.mail.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id 83B9158065;
	Fri, 12 Apr 2024 17:33:01 +0000 (GMT)
Received: from smtpav04.dal12v.mail.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id E56BF5805E;
	Fri, 12 Apr 2024 17:32:53 +0000 (GMT)
Received: from [9.43.112.105] (unknown [9.43.112.105])
	by smtpav04.dal12v.mail.ibm.com (Postfix) with ESMTP;
	Fri, 12 Apr 2024 17:32:53 +0000 (GMT)
Message-ID: <ae430b0c-3224-4da2-ae90-41aa66f8d5e8@linux.ibm.com>
Date: Fri, 12 Apr 2024 23:02:51 +0530
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH V4 1/3] aarch64: Place target independent and dependent
 changed code in one file
To: Alex Coplan <alex.coplan@arm.com>
Cc: Richard Sandiford <richard.sandiford@arm.com>,
        "Kewen.Lin" <linkw@linux.ibm.com>,
        Segher Boessenkool <segher@kernel.crashing.org>,
        Michael Meissner <meissner@linux.ibm.com>,
        David Edelsohn
 <dje.gcc@gmail.com>,
        Peter Bergner <bergner@linux.ibm.com>,
        gcc-patches <gcc-patches@gcc.gnu.org>
References: <ZhVJ17JfKAHzrNpv@arm.com>
 <a1a614fb-e670-44f2-8821-40e0289ef159@linux.ibm.com>
 <ZhVaLOLg8my3VA6c@arm.com>
 <54ab422a-4ffa-4ca9-8029-eb23c42ed653@linux.ibm.com>
 <ZhZJ3pG5e8ocskYU@arm.com>
 <ea70e9a7-ad50-4e2d-8494-4f0c36354d67@linux.ibm.com>
 <Zhagvsj2Xn/b5Okv@arm.com>
 <eb43075f-d03a-4ce1-aae1-2ed9e5389a34@linux.ibm.com>
 <Zhfy1uNwiBMvh+P7@arm.com>
 <595cd863-3ba5-492b-84d1-ab470411507e@linux.ibm.com>
 <ZhlJEoAW9izedHqA@arm.com>
Content-Language: en-US
From: Ajit Agarwal <aagarwa1@linux.ibm.com>
In-Reply-To: <ZhlJEoAW9izedHqA@arm.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: L0ufxppatmXApUki5Of7Id9oDQYiWfmQ
X-Proofpoint-ORIG-GUID: GgFyaGqKlW0012PqDNQ3ewoLye1Gy9dC
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.272,Aquarius:18.0.1011,Hydra:6.0.619,FMLib:17.11.176.26
 definitions=2024-04-12_13,2024-04-09_01,2023-05-22_02
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxscore=0 bulkscore=0
 adultscore=0 suspectscore=0 clxscore=1015 phishscore=0 mlxlogscore=999
 impostorscore=0 spamscore=0 malwarescore=0 lowpriorityscore=0
 priorityscore=1501 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2404010000 definitions=main-2404120127
X-Spam-Status: No, score=-5.0 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_EF,RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Hello Alex:

On 12/04/24 8:15 pm, Alex Coplan wrote:
> On 12/04/2024 20:02, Ajit Agarwal wrote:
>> Hello Alex:
>>
>> On 11/04/24 7:55 pm, Alex Coplan wrote:
>>> On 10/04/2024 23:48, Ajit Agarwal wrote:
>>>> Hello Alex:
>>>>
>>>> On 10/04/24 7:52 pm, Alex Coplan wrote:
>>>>> Hi Ajit,
>>>>>
>>>>> On 10/04/2024 15:31, Ajit Agarwal wrote:
>>>>>> Hello Alex:
>>>>>>
>>>>>> On 10/04/24 1:42 pm, Alex Coplan wrote:
>>>>>>> Hi Ajit,
>>>>>>>
>>>>>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
>>>>>>>> Hello Alex:
>>>>>>>>
>>>>>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
>>>>>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
>>>>>>>>>> Hello Alex:
>>>>>>>>>>
>>>>>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>>>>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>>>>>>>>>> Hello Alex/Richard:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> All review comments are incorporated.
>>>>> <snip>
>>>>>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
>>>>>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
>>>>>>>>>>>>>>  // merge_pairs.
>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>> -				 access_group &group)
>>>>>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>> +					 access_group &group)
>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
>>>>>>>>>>>>>>    const unsigned access_size = lfs.size;
>>>>>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>>  		       access.cand_insns,
>>>>>>>>>>>>>>  		       lfs.load_p,
>>>>>>>>>>>>>>  		       access_size);
>>>>>>>>>>>>>> -	  skip_next = access.cand_insns.empty ();
>>>>>>>>>>>>>> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
>>>>>>>>>>>>>
>>>>>>>>>>>>> As above, why is this needed?
>>>>>>>>>>>>
>>>>>>>>>>>> For rs6000 we want to return always true. as load store pair
>>>>>>>>>>>> that are to be merged with 8/16 16/32 32/64 is occuring for rs6000.
>>>>>>>>>>>> And we want load store pair to 8/16 32/64. Thats why we want
>>>>>>>>>>>> to generate always true for rs6000 to skip pairs as above.
>>>>>>>>>>>
>>>>>>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
>>>>>>>>>>> load/store pair instructions where the two arms of the access are storing
>>>>>>>>>>> operands of different sizes?  Or something else?
>>>>>>>>>>>
>>>>>>>>>>> As it stands the logic is to skip the next iteration only if we
>>>>>>>>>>> exhausted all the candidate insns for the current access.  In the case
>>>>>>>>>>> that we didn't exhaust all such candidates, then the idea is that when
>>>>>>>>>>> access becomes prev_access, we can attempt to use those candidates as
>>>>>>>>>>> the "left-hand side" of a pair in the next iteration since we failed to
>>>>>>>>>>> use them as the "right-hand side" of a pair in the current iteration.
>>>>>>>>>>> I don't see why you wouldn't want that behaviour.  Please can you
>>>>>>>>>>> explain?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In merge_pair we get the 2 load candiates one load from 0 offset and
>>>>>>>>>> other load is from 16th offset. Then in next iteration we get load
>>>>>>>>>> from 16th offset and other load from 32 offset. In next iteration
>>>>>>>>>> we get load from 32 offset and other load from 48 offset.
>>>>>>>>>>
>>>>>>>>>> For example:
>>>>>>>>>>
>>>>>>>>>> Currently we get the load candiates as follows.
>>>>>>>>>>
>>>>>>>>>> pairs:
>>>>>>>>>>
>>>>>>>>>> load from 0th offset.
>>>>>>>>>> load from 16th offset.
>>>>>>>>>>
>>>>>>>>>> next pairs:
>>>>>>>>>>
>>>>>>>>>> load from 16th offset.
>>>>>>>>>> load from 32th offset.
>>>>>>>>>>
>>>>>>>>>> next pairs:
>>>>>>>>>>
>>>>>>>>>> load from 32th offset
>>>>>>>>>> load from 48th offset.
>>>>>>>>>>
>>>>>>>>>> Instead in rs6000 we should get:
>>>>>>>>>>
>>>>>>>>>> pairs:
>>>>>>>>>>
>>>>>>>>>> load from 0th offset
>>>>>>>>>> load from 16th offset.
>>>>>>>>>>
>>>>>>>>>> next pairs:
>>>>>>>>>>
>>>>>>>>>> load from 32th offset
>>>>>>>>>> load from 48th offset.
>>>>>>>>>
>>>>>>>>> Hmm, so then I guess my question is: why wouldn't you consider merging
>>>>>>>>> the pair with offsets (16,32) for rs6000?  Is it because you have a
>>>>>>>>> stricter alignment requirement on the base pair offsets (i.e. they have
>>>>>>>>> to be a multiple of 32 when the operand size is 16)?  So the pair
>>>>>>>>> offsets have to be a multiple of the entire pair size rather than a
>>>>>>>>> single operand size> 
>>>>>>>>
>>>>>>>> We get load pair at a certain point with (0,16) and other program
>>>>>>>> point we get load pair (32, 48).
>>>>>>>>
>>>>>>>> In current implementation it takes offsets loads as (0, 16),
>>>>>>>> (16, 32), (32, 48).
>>>>>>>>
>>>>>>>> But In rs6000 we want  the load pair to be merged at different points
>>>>>>>> as (0,16) and (32, 48). for (0,16) we want to replace load lxvp with
>>>>>>>> 0 offset and other load (32, 48) with lxvp with 32 offset.
>>>>>>>>
>>>>>>>> In current case it will merge with lxvp with 0 offset and lxvp with
>>>>>>>> 16 offset, then lxvp with 32 offset and lxvp with 48 offset which
>>>>>>>> is incorrect in our case as the (16-32) case 16 offset will not
>>>>>>>> load from even register and break for rs6000.
>>>>>>>
>>>>>>> Sorry, I think I'm still missing something here.  Why does the address offset
>>>>>>> affect the parity of the tranfser register?  ISTM they needn't be related at
>>>>>>> all (and indeed we can't even know the parity of the transfer register before
>>>>>>> RA, but perhaps you're only intending on running the pass after RA?)
>>>>>>>
>>>>>>
>>>>>> We have load pair with (0,16) wherein these loads are adjacent and
>>>>>> replaced with lxvp.
>>>>>>
>>>>>> Semantic of lxvp instruction is that it loads adjacent load pair in
>>>>>> even register and even_register + 1.
>>>>>>
>>>>>> We replace the above load pair with lxvp instruction and then we
>>>>>> dont need to merge (16,32) as (0, 16) is already merged and instead
>>>>>> we merge (32,48).
>>>>>
>>>>> Ok, but the existing logic should already account for this.  I.e. if we
>>>>> successfully merge (0,16), then we don't attempt to merge (16,32).  We'd only
>>>>> attempt to merge (16,32) if the merge of (0,16) failed (for whatever reason).
>>>>> So I don't see that there's anything specific to lxvp that requires this logic
>>>>> to change, _unless_ you have a stricter alignment requirement on the offsets as
>>>>> I mentioned before.
>>>>>
>>>>
>>>> Thanks for the suggestion. It worked for rs6000 also with current changes.
>>>> Sorry for the confusion.
>>>
>>> Alright, glad we got to the bottom of this!
>>
>> Thanks.
>>>
>>>>
>>>>>>
>>>>>> Yes you are correct, the addresss offset doesn't affect the parity of
>>>>>> the register transfer as we are doing fusion pass before RA.
>>>>>> If that is the case then I think it would be better to introduce a
>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>>>>>>>> it would also take access_size as a parameter and for aarch64 it should
>>>>>>>>> check:
>>>>>>>>>
>>>>>>>>>   multiple_p (offset, access_size)
>>>>>>>>>
>>>>>>>>> and for rs6000 it could check:
>>>>>>>>>
>>>>>>>>>   multiple_p (offset, access_size * 2)
>>>>>>>>>
>>>>>>>>> and we would then incorporate a call to that predicate in the else if
>>>>>>>>> condition of tranform_for_base.
>>>>>>>>>
>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>>>>>
>>>>>> I have addressed the above hooks and it worked fine with both rs6000
>>>>>> and aarch64. I am sending subsequent patch in some time that address
>>>>>> above.
>>>>>>
>>>>>>> How do you plan on handling this even-odd requirement for rs6000?
>>>>>>>
>>>>>>
>>>>>> We plan to handle with V16QI subreg: 0 and V16QI subreg : 16 to
>>>>>> generate register pair and thats what we generate and implement
>>>>>> in rs6000 target
>>>>>> code.
>>>>>
>>>>> Ah, this is coming back to me now.  Sorry, I should have remembered this from
>>>>> the previous discussion with Richard S.
>>>>>
>>>>> Apologies for going on a slight tangent here, but if you're running
>>>>> before RA are you planning to create a new OImode pseudo register for
>>>>> the lxvp insn and then somehow update uses of the old transfer registers
>>>>> to replace them with subregs of that OImode pseudo?  
>>>>
>>>> Yes I do the same as you have mentioned. We generate register pairs
>>>> with 256 bit mode with two subregs of 128 bit modes with 0 and
>>>> 16 offset.
>>>>
>>>> Or do you just plan
>>>>> or replacing the individual loads with moves (instead of deleting them)?
>>>>> I guess the latter would be simpler and might work better in the
>>>>> presence of hard regs.
>>>>>
>>>>
>>>> Would you mind explaining how to generate register pairs with lxvp by
>>>> replacing loads with moves.
>>>
>>> Yeah, so suppose you have something like:
>>>
>>> (set (reg:V16QI v1) (mem:V16QI addr))
>>> (set (reg:V16QI v2) (mem:V16QI addr+16))
>>>
>>> then when you insert the lxvp you can then (logically) replace the
>>> original load instructions with moves from the appropriate subreg, as
>>> follows:
>>>
>>> (set (reg:OI pair-pseudo) (mem:OI addr)) ; lxvp
>>> (set (reg:V16QI v1) (subreg:V16QI (reg:OI pair-pseudo) 0))
>>> (set (reg:V16QI v2) (subreg:V16QI (reg:OI pair-pseudo) 16))
>>>
>>
>> Any Pseudo created with gen_rtx_REG like 
>> gen_rtx_REG (OOmode, REGNO (dest_exp) will error'ed out 
>> with unrecognize insn by LRA.
> 
> I'm not surprised that goes wrong: you can't just create a new REG
> rtx in a different mode and reuse the regno of an existing pseudo.
> 
>>
>> If I create pseudo with gen_reg_rtx (OOmode) will error'ed
>> out with new_defs Pseudo register is not found in
>> change->new_defs.
> 
> Yeah, I suppose you'd need to add an RTL-SSA def for the new pseudo.
> 

Would you mind explaining how can I add and RTL-SSA def for the
new pseudo.

>>
>> Also the sequential register pairs are not generated by
>> Register Allocator.
> 
> So how do you get the sequential pairs with your current approach?  My
> understanding was that what I suggested above doesn't really change what
> you're doing w.r.t the lxvp insn itself, but maybe I didn't properly
> understand the approach taken in your initial patchset.
> 

I generate (set (reg:OO pair-pseudo) (mem:OI addr)) ; lxvp
and then at the use point of pair_pseudo generate the following.

(subreg:V16QI (reg:OI pair-pseudo) 0))
(subreg:V16QI (reg:OI pair-pseudo) 16))

Thanks & Regards
Ajit
> Thanks,
> Alex
> 
>>
>> Thats why I haven't used above method as I also thought
>> through.
>>
>> Please let me know what do you think.
>>
>> Thanks & Regards
>> Ajit 
>>> now I'm not sure off-hand if this exact combination of subregs and mode
>>> changes is valid, but hopefully you get the idea.  The benefit of this
>>> approach is that it keeps the transformation local and is more
>>> compile-time efficient (we don't have to look through all mentions of
>>> v1/v2 and replace them with subregs).  I'd hope that the RA can then
>>> clean up any redundant moves (especially in the case that v1/v2 are
>>> pseudos).
>>>
>>> That would mean that you don't need all the grubbing around with DF
>>> looking for occurrences of the transfer regs.  Instead we'd just need
>>> some logic to insert the extra moves after the lxvp for rs6000, and I
>>> think this might fit more cleanly into the current pass.
>>>
>>> Does that make sense?  In any case, this shouldn't really affect the
>>> preparatory aarch64 patch because we were planning to defer adding any
>>> hooks that are only needed for rs6000 from the initial aarch64/generic
>>> split.
>>>
>>> Thanks,
>>> Alex
>>>
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>  
>>>>> Thanks,
>>>>> Alex
>>>>>
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Ajit 
>>>>>>> Thanks,
>>>>>>> Alex
>>>>>>>
>>>>>>>>
>>>>>>>> lxvp should load from even registers and then loaded value will
>>>>>>>> be in even register and even register +1 (which is odd).
>>>>>>>>
>>>>>>>> Thanks & Regards
>>>>>>>> Ajit
>>>>>>>>> If that is the case then I think it would be better to introduce a
>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>>>>>>>> it would also take access_size as a parameter and for aarch64 it should
>>>>>>>>> check:
>>>>>>>>>
>>>>>>>>>   multiple_p (offset, access_size)
>>>>>>>>>
>>>>>>>>> and for rs6000 it could check:
>>>>>>>>>
>>>>>>>>>   multiple_p (offset, access_size * 2)
>>>>>>>>>
>>>>>>>>> and we would then incorporate a call to that predicate in the else if
>>>>>>>>> condition of tranform_for_base.
>>>>>>>>>
>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks & Regards
>>>>>>>>>> Ajit
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Alex
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>  	}
>>>>>>>>>>>>>>        prev_access = &access;
>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>>  // and remove all the tombstone insns, being sure to reparent any uses
>>>>>>>>>>>>>>  // of mem to previous defs when we do this.
>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>> -ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
>>>>>>>>>>>>>>    if (!m_emitted_tombstone)
>>>>>>>>>>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>  template<typename Map>
>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>> -ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>    for (auto kv : map)
>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>> -ldp_bb_info::transform ()
>>>>>>>>>>>>>> +pair_fusion_bb_info::transform ()
>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>    traverse_base_map (expr_map);
>>>>>>>>>>>>>>    traverse_base_map (def_map);
>>>>>>>>>>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>>>>>>>>>>>>>>  // for load/store candidates.  If running after RA, also try and promote
>>>>>>>>>>>>>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
>>>>>>>>>>>>>>  // candidates into pairs.
>>>>>>>>>>>>>> -void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>> -  const bool track_loads
>>>>>>>>>>>>>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>>>>>> -  const bool track_stores
>>>>>>>>>>>>>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>>>>>> +  const bool track_loads = track_load_p ();
>>>>>>>>>>>>>> +  const bool track_stores = track_store_p ();
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>> -  ldp_bb_info bb_state (bb);
>>>>>>>>>>>>>
>>>>>>>>>>>>> This:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>>>>>
>>>>>>>>>>>>> can be deleted and then:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
>>>>>>>>>>>>>
>>>>>>>>>>>>> can just be:
>>>>>>>>>>>>>
>>>>>>>>>>>>>   pair_fusion_bb_info bb_info (bb, this);
>>>>>>>>>>>>>
>>>>>>>>>>>>> (or you can pass *this if you make bb_info take a reference).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think there's a particular need to change the variable name
>>>>>>>>>>>>> (bb_state -> bb_info).  I chose the former because it doens't clash
>>>>>>>>>>>>> with the RTL-SSA structure of the same name as the latter.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>    for (auto insn : bb->nondebug_insns ())
>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>>  	continue;
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>        rtx pat = PATTERN (rti);
>>>>>>>>>>>>>> -      if (reload_completed
>>>>>>>>>>>>>> -	  && aarch64_ldp_writeback > 1
>>>>>>>>>>>>>> -	  && GET_CODE (pat) == PARALLEL
>>>>>>>>>>>>>> -	  && XVECLEN (pat, 0) == 2)
>>>>>>>>>>>>>> +      if (pair_mem_promote_writeback_p (pat))
>>>>>>>>>>>>>>  	try_promote_writeback (insn);
>>>>>>>>>>>>>
>>>>>>>>>>>>> It looks like try_promote_writeback itself will need some further work
>>>>>>>>>>>>> to make it target-independent.  I suppose this check:
>>>>>>>>>>>>>
>>>>>>>>>>>>>   auto rti = insn->rtl ();
>>>>>>>>>>>>>   const auto attr = get_attr_ldpstp (rti);
>>>>>>>>>>>>>   if (attr == LDPSTP_NONE)
>>>>>>>>>>>>>     return;
>>>>>>>>>>>>>
>>>>>>>>>>>>>   bool load_p = (attr == LDPSTP_LDP);
>>>>>>>>>>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
>>>>>>>>>>>>>
>>>>>>>>>>>>> will need to become part of the pair_mem_promote_writeback_p hook that you
>>>>>>>>>>>>> added, potentially changing it to return a boolean for load_p.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then I guess we will need hooks for destructuring the pair insn and
>>>>>>>>>>>>> another hook to wrap aarch64_gen_writeback_pair.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>        if (GET_CODE (pat) != SET)
>>>>>>>>>>>>>>  	continue;
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
>>>>>>>>>>>>>> -	bb_state.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>>>>> +	bb_info.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
>>>>>>>>>>>>>> -	bb_state.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>>>>> +	bb_info.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>> -  bb_state.transform ();
>>>>>>>>>>>>>> -  bb_state.cleanup_tombstones ();
>>>>>>>>>>>>>> +  bb_info.transform ();
>>>>>>>>>>>>>> +  bb_info.cleanup_tombstones ();
>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>  void ldp_fusion ()
>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>    ldp_fusion_init ();
>>>>>>>>>>>>>> +  pair_fusion *pfuse;
>>>>>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>>>>>> +  pfuse = &derived;
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is indeed the one place where I think it is acceptable to
>>>>>>>>>>>>> instantiate aarch64_pair_fusion.  But again there's no need to create a
>>>>>>>>>>>>> pointer to the parent class, just call any function you like directly.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>    for (auto bb : crtl->ssa->bbs ())
>>>>>>>>>>>>>> -    ldp_fusion_bb (bb);
>>>>>>>>>>>>>> +    pfuse->ldp_fusion_bb (bb);
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think even the code to iterate over bbs should itself be a member
>>>>>>>>>>>>> function of pair_fusion (say "run") and then that becomes part of the
>>>>>>>>>>>>> generic code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So this function would just become:
>>>>>>>>>>>>>
>>>>>>>>>>>>> aarch64_pair_fusion pass;
>>>>>>>>>>>>> pass.run ();
>>>>>>>>>>>>>
>>>>>>>>>>>>> and could be inlined into the caller.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>> Perhaps you could also add an early return like:
>>>>>>>>>>>>>
>>>>>>>>>>>>>   if (!track_loads_p () && !track_stores_p ())
>>>>>>>>>>>>>     return;
>>>>>>>>>>>>>
>>>>>>>>>>>>> in pair_fusion::run () and then remove the corresponding code from
>>>>>>>>>>>>> pass_ldp_fusion::gate?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>    ldp_fusion_destroy ();
>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>> 2.39.3
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Alex
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks & Regards
>>>>>>>>>>>> Ajit