Subject: Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file
Date: Mon, 22 Apr 2024 13:01:31 +0530
From: Ajit Agarwal
To: Alex Coplan
Cc: Richard Sandiford, "Kewen.Lin", Segher Boessenkool, Michael Meissner, David Edelsohn, Peter Bergner, gcc-patches
References: <54ab422a-4ffa-4ca9-8029-eb23c42ed653@linux.ibm.com> <595cd863-3ba5-492b-84d1-ab470411507e@linux.ibm.com> <52a437cc-4452-4e0e-86a9-84e653401275@linux.ibm.com>
In-Reply-To: <52a437cc-4452-4e0e-86a9-84e653401275@linux.ibm.com>

Hello Alex:

On 14/04/24 10:29 pm, Ajit Agarwal wrote:
> Hello Alex:
>
> On 12/04/24 11:02 pm, Ajit Agarwal wrote:
>> Hello Alex:
>>
>> On 12/04/24 8:15 pm, Alex Coplan wrote:
>>> On 12/04/2024 20:02, Ajit Agarwal wrote:
>>>> Hello Alex:
>>>>
>>>> On 11/04/24 7:55 pm, Alex Coplan wrote:
>>>>> On 10/04/2024 23:48, Ajit Agarwal wrote:
>>>>>> Hello Alex:
>>>>>>
>>>>>> On 10/04/24 7:52 pm, Alex Coplan wrote:
>>>>>>> Hi Ajit,
>>>>>>>
>>>>>>> On 10/04/2024 15:31, Ajit Agarwal wrote:
>>>>>>>> Hello Alex:
>>>>>>>>
>>>>>>>> On 10/04/24 1:42 pm, Alex Coplan wrote:
>>>>>>>>> Hi Ajit,
>>>>>>>>>
>>>>>>>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
>>>>>>>>>> Hello Alex:
>>>>>>>>>>
>>>>>>>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
>>>>>>>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
>>>>>>>>>>>> Hello Alex:
>>>>>>>>>>>>
>>>>>>>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>>>>>>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>>>>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>>>>>>>>>>>> Hello Alex/Richard:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> All review comments are incorporated.
>>>>>>>
>>>>>>>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
>>>>>>>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
>>>>>>>>>>>>>>>>  // merge_pairs.
>>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>>>> -				 access_group &group)
>>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>>>> +					 access_group &group)
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
>>>>>>>>>>>>>>>>    const unsigned access_size = lfs.size;
>>>>>>>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>>>>  					   access.cand_insns,
>>>>>>>>>>>>>>>>  					   lfs.load_p,
>>>>>>>>>>>>>>>>  					   access_size);
>>>>>>>>>>>>>>>> -	  skip_next = access.cand_insns.empty ();
>>>>>>>>>>>>>>>> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As above, why is this needed?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For rs6000 we want to always return true, as the load/store pairs
>>>>>>>>>>>>>> to be merged occur at offsets 8/16, 16/32, 32/64 for rs6000, and we
>>>>>>>>>>>>>> only want the load/store pairs at 8/16 and 32/64.  That's why we
>>>>>>>>>>>>>> want to always return true for rs6000, to skip pairs as above.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
>>>>>>>>>>>>> load/store pair instructions where the two arms of the access are storing
>>>>>>>>>>>>> operands of different sizes?  Or something else?
>>>>>>>>>>>>>
>>>>>>>>>>>>> As it stands the logic is to skip the next iteration only if we
>>>>>>>>>>>>> exhausted all the candidate insns for the current access.  In the case
>>>>>>>>>>>>> that we didn't exhaust all such candidates, then the idea is that when
>>>>>>>>>>>>> access becomes prev_access, we can attempt to use those candidates as
>>>>>>>>>>>>> the "left-hand side" of a pair in the next iteration since we failed to
>>>>>>>>>>>>> use them as the "right-hand side" of a pair in the current iteration.
>>>>>>>>>>>>> I don't see why you wouldn't want that behaviour.  Please can you
>>>>>>>>>>>>> explain?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> In merge_pairs we get the two load candidates, one load from offset 0
>>>>>>>>>>>> and the other load from offset 16.  Then in the next iteration we get
>>>>>>>>>>>> a load from offset 16 and the other load from offset 32.  In the next
>>>>>>>>>>>> iteration we get a load from offset 32 and the other load from
>>>>>>>>>>>> offset 48.
>>>>>>>>>>>>
>>>>>>>>>>>> For example:
>>>>>>>>>>>>
>>>>>>>>>>>> Currently we get the load candidates as follows.
>>>>>>>>>>>>
>>>>>>>>>>>> pairs:
>>>>>>>>>>>>
>>>>>>>>>>>> load from offset 0.
>>>>>>>>>>>> load from offset 16.
>>>>>>>>>>>>
>>>>>>>>>>>> next pairs:
>>>>>>>>>>>>
>>>>>>>>>>>> load from offset 16.
>>>>>>>>>>>> load from offset 32.
>>>>>>>>>>>>
>>>>>>>>>>>> next pairs:
>>>>>>>>>>>>
>>>>>>>>>>>> load from offset 32.
>>>>>>>>>>>> load from offset 48.
>>>>>>>>>>>>
>>>>>>>>>>>> Instead, for rs6000 we should get:
>>>>>>>>>>>>
>>>>>>>>>>>> pairs:
>>>>>>>>>>>>
>>>>>>>>>>>> load from offset 0.
>>>>>>>>>>>> load from offset 16.
>>>>>>>>>>>>
>>>>>>>>>>>> next pairs:
>>>>>>>>>>>>
>>>>>>>>>>>> load from offset 32.
>>>>>>>>>>>> load from offset 48.
>>>>>>>>>>>
>>>>>>>>>>> Hmm, so then I guess my question is: why wouldn't you consider merging
>>>>>>>>>>> the pair with offsets (16,32) for rs6000?  Is it because you have a
>>>>>>>>>>> stricter alignment requirement on the base pair offsets (i.e. they have
>>>>>>>>>>> to be a multiple of 32 when the operand size is 16)?
>>>>>>>>>>> So the pair
>>>>>>>>>>> offsets have to be a multiple of the entire pair size rather than a
>>>>>>>>>>> single operand size?
>>>>>>>>>>
>>>>>>>>>> We get a load pair at one program point with (0,16), and at another
>>>>>>>>>> program point we get a load pair (32,48).
>>>>>>>>>>
>>>>>>>>>> The current implementation takes the load offsets as (0,16),
>>>>>>>>>> (16,32), (32,48).
>>>>>>>>>>
>>>>>>>>>> But in rs6000 we want the load pairs to be merged at different points,
>>>>>>>>>> as (0,16) and (32,48): for (0,16) we want to replace the loads with
>>>>>>>>>> lxvp with offset 0, and the other loads (32,48) with lxvp with
>>>>>>>>>> offset 32.
>>>>>>>>>>
>>>>>>>>>> In the current case it will merge with lxvp with offset 0 and lxvp
>>>>>>>>>> with offset 16, then lxvp with offset 32 and lxvp with offset 48,
>>>>>>>>>> which is incorrect in our case: in the (16,32) case the load at
>>>>>>>>>> offset 16 will not load from an even register and breaks for rs6000.
>>>>>>>>>
>>>>>>>>> Sorry, I think I'm still missing something here.  Why does the address offset
>>>>>>>>> affect the parity of the transfer register?  ISTM they needn't be related at
>>>>>>>>> all (and indeed we can't even know the parity of the transfer register before
>>>>>>>>> RA, but perhaps you're only intending on running the pass after RA?)
>>>>>>>>>
>>>>>>>>
>>>>>>>> We have a load pair with (0,16) wherein these loads are adjacent and
>>>>>>>> replaced with lxvp.
>>>>>>>>
>>>>>>>> The semantics of the lxvp instruction are that it loads the adjacent
>>>>>>>> load pair into an even register and even_register + 1.
>>>>>>>>
>>>>>>>> We replace the above load pair with an lxvp instruction, and then we
>>>>>>>> don't need to merge (16,32) as (0,16) is already merged; instead
>>>>>>>> we merge (32,48).
>>>>>>>
>>>>>>> Ok, but the existing logic should already account for this.  I.e. if we
>>>>>>> successfully merge (0,16), then we don't attempt to merge (16,32).  We'd only
>>>>>>> attempt to merge (16,32) if the merge of (0,16) failed (for whatever reason).
>>>>>>> So I don't see that there's anything specific to lxvp that requires this logic
>>>>>>> to change, _unless_ you have a stricter alignment requirement on the offsets as
>>>>>>> I mentioned before.
>>>>>>>
>>>>>>
>>>>>> Thanks for the suggestion.  It worked for rs6000 also with the current
>>>>>> changes.  Sorry for the confusion.
>>>>>
>>>>> Alright, glad we got to the bottom of this!
>>>>
>>>> Thanks.
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> Yes, you are correct: the address offset doesn't affect the parity of
>>>>>>>> the register transfer, as we are doing the fusion pass before RA.
>>>>>>>>
>>>>>>>>>>> If that is the case then I think it would be better to introduce a
>>>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>>>>>>>>>> it would also take access_size as a parameter and for aarch64 it should
>>>>>>>>>>> check:
>>>>>>>>>>>
>>>>>>>>>>> multiple_p (offset, access_size)
>>>>>>>>>>>
>>>>>>>>>>> and for rs6000 it could check:
>>>>>>>>>>>
>>>>>>>>>>> multiple_p (offset, access_size * 2)
>>>>>>>>>>>
>>>>>>>>>>> and we would then incorporate a call to that predicate in the else if
>>>>>>>>>>> condition of transform_for_base.
>>>>>>>>>>>
>>>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
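For concreteness, such a hook might look like the sketch below (the class
names and exact signature here are illustrative assumptions, not the
committed patch):

  /* Generic interface: vet the base offset of a candidate pair.  */
  virtual bool pair_offset_alignment_ok_p (poly_int64 offset,
					   unsigned access_size) = 0;

  /* aarch64: each arm of the pair only needs single-operand alignment.  */
  bool
  aarch64_pair_fusion::pair_offset_alignment_ok_p (poly_int64 offset,
						   unsigned access_size)
  {
    return multiple_p (offset, access_size);
  }

  /* rs6000: lxvp/stxvp want the base offset aligned to the whole pair.  */
  bool
  rs6000_pair_fusion::pair_offset_alignment_ok_p (poly_int64 offset,
						  unsigned access_size)
  {
    return multiple_p (offset, access_size * 2);
  }

transform_for_base would then call this predicate in its else-if condition
instead of hard-coding multiple_p (offset, access_size).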
>>>>>>>>>>>
>>>>>>>> I have addressed the above hooks and it worked fine with both rs6000
>>>>>>>> and aarch64.  I am sending a subsequent patch in some time that
>>>>>>>> addresses the above.
>>>>>>>>
>>>>>>>>> How do you plan on handling this even-odd requirement for rs6000?
>>>>>>>>>
>>>>>>>>
>>>>>>>> We plan to handle it with V16QI subreg:0 and V16QI subreg:16 to
>>>>>>>> generate the register pair, and that's what we generate and implement
>>>>>>>> in the rs6000 target code.
>>>>>>>
>>>>>>> Ah, this is coming back to me now.  Sorry, I should have remembered this from
>>>>>>> the previous discussion with Richard S.
>>>>>>>
>>>>>>> Apologies for going on a slight tangent here, but if you're running
>>>>>>> before RA are you planning to create a new OImode pseudo register for
>>>>>>> the lxvp insn and then somehow update uses of the old transfer registers
>>>>>>> to replace them with subregs of that OImode pseudo?
>>>>>>
>>>>>> Yes, I do the same as you have mentioned.  We generate register pairs
>>>>>> with a 256-bit mode with two subregs of 128-bit modes at offsets 0 and
>>>>>> 16.
>>>>>>
>>>>>>> Or do you just plan
>>>>>>> on replacing the individual loads with moves (instead of deleting them)?
>>>>>>> I guess the latter would be simpler and might work better in the
>>>>>>> presence of hard regs.
>>>>>>>
>>>>>>
>>>>>> Would you mind explaining how to generate register pairs with lxvp by
>>>>>> replacing loads with moves?
>>>>>
>>>>> Yeah, so suppose you have something like:
>>>>>
>>>>> (set (reg:V16QI v1) (mem:V16QI addr))
>>>>> (set (reg:V16QI v2) (mem:V16QI addr+16))
>>>>>
>>>>> then when you insert the lxvp you can then (logically) replace the
>>>>> original load instructions with moves from the appropriate subreg, as
>>>>> follows:
>>>>>
>>>>> (set (reg:OI pair-pseudo) (mem:OI addr)) ; lxvp
>>>>> (set (reg:V16QI v1) (subreg:V16QI (reg:OI pair-pseudo) 0))
>>>>> (set (reg:V16QI v2) (subreg:V16QI (reg:OI pair-pseudo) 16))
>>>>>
>>>>
>>>> Any pseudo created with gen_rtx_REG, like
>>>> gen_rtx_REG (OOmode, REGNO (dest_exp)), gets rejected as an
>>>> unrecognized insn by LRA.
>>>
>>> I'm not surprised that goes wrong: you can't just create a new REG
>>> rtx in a different mode and reuse the regno of an existing pseudo.
>>>
> Thanks for the suggestion.
>>>>
>>>> If I create the pseudo with gen_reg_rtx (OOmode), it errors out
>>>> because the pseudo register is not found in
>>>> change->new_defs.
>>>
>>> Yeah, I suppose you'd need to add an RTL-SSA def for the new pseudo.
>>>
>> Would you mind explaining how I can add an RTL-SSA def for the
>> new pseudo?
> I have added an RTL-SSA def for the new pseudo.  With that I could
> get the register pairs correctly.
>>
>>>> Also, the sequential register pairs are not generated by the
>>>> register allocator.
>>>
>>> So how do you get the sequential pairs with your current approach?  My
>>> understanding was that what I suggested above doesn't really change what
>>> you're doing w.r.t. the lxvp insn itself, but maybe I didn't properly
>>> understand the approach taken in your initial patchset.
>>>
>> I generate (set (reg:OO pair-pseudo) (mem:OI addr)) ; lxvp
>> and then at the use points of pair-pseudo generate the following:
>>
>> (subreg:V16QI (reg:OI pair-pseudo) 0)
>> (subreg:V16QI (reg:OI pair-pseudo) 16)
>>
> Generating RTL as you have suggested, I get the register pairs
> correctly, but we get extra moves that would impact performance.
>
> Please let me know what you think.
> For the below testcase:

#include

void
foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
{
  __vector_quad acc;
  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
  *dst = acc;
}

My earlier implementation without moves generates the following assembly
for the above testcase:

.LFB0:
	.cfi_startproc
	.localentry	_Z3fooPu13__vector_quadPDv16_hS1_,1
	lxvp %vs32,0(%r4)
	xvf32ger 0,%vs34,%vs33
	xvf32gerpp 0,%vs34,%vs32
	xxmfacc 0
	stxvp %vs2,0(%r3)
	stxvp %vs0,32(%r3)
	blr

With the moves you have suggested, I get the generated assembly below:

.LFB0:
	.cfi_startproc
	.localentry	_Z3fooPu13__vector_quadPDv16_hS1_,1
	lxvp %vs0,0(%r4)
	xxlor %vs33,%vs1,%vs1
	xxlor %vs32,%vs0,%vs0
	xvf32ger 0,%vs34,%vs33
	xvf32gerpp 0,%vs34,%vs32
	xxmfacc 0
	stxvp %vs2,0(%r3)
	stxvp %vs0,32(%r3)
	blr

As you can see, with the register moves two extra xxlor instructions are
generated.  Please let me know what you think.

Thanks & Regards
Ajit

> Thanks & Regards
> Ajit
>> Thanks & Regards
>> Ajit
>>> Thanks,
>>> Alex
>>>
>>>>
>>>> That's why I haven't used the above method, as I also thought it
>>>> through.
>>>>
>>>> Please let me know what you think.
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>> now I'm not sure off-hand if this exact combination of subregs and mode
>>>>> changes is valid, but hopefully you get the idea.  The benefit of this
>>>>> approach is that it keeps the transformation local and is more
>>>>> compile-time efficient (we don't have to look through all mentions of
>>>>> v1/v2 and replace them with subregs).  I'd hope that the RA can then
>>>>> clean up any redundant moves (especially in the case that v1/v2 are
>>>>> pseudos).
>>>>>
>>>>> That would mean that you don't need all the grubbing around with DF
>>>>> looking for occurrences of the transfer regs.  Instead we'd just need
>>>>> some logic to insert the extra moves after the lxvp for rs6000, and I
>>>>> think this might fit more cleanly into the current pass.
>>>>>
>>>>> Does that make sense?  In any case, this shouldn't really affect the
>>>>> preparatory aarch64 patch because we were planning to defer adding any
>>>>> hooks that are only needed for rs6000 from the initial aarch64/generic
>>>>> split.
>>>>>
>>>>> Thanks,
>>>>> Alex
>>>>>
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Ajit
>>>>>>
>>>>>>> Thanks,
>>>>>>> Alex
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks & Regards
>>>>>>>> Ajit
>>>>>>>>> Thanks,
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> lxvp should load from even registers, and then the loaded value will
>>>>>>>>>> be in the even register and even register + 1 (which is odd).
>>>>>>>>>>
>>>>>>>>>> Thanks & Regards
>>>>>>>>>> Ajit
>>>>>>>>>>> If that is the case then I think it would be better to introduce a
>>>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>>>>>>>>>> it would also take access_size as a parameter and for aarch64 it should
>>>>>>>>>>> check:
>>>>>>>>>>>
>>>>>>>>>>> multiple_p (offset, access_size)
>>>>>>>>>>>
>>>>>>>>>>> and for rs6000 it could check:
>>>>>>>>>>>
>>>>>>>>>>> multiple_p (offset, access_size * 2)
>>>>>>>>>>>
>>>>>>>>>>> and we would then incorporate a call to that predicate in the else if
>>>>>>>>>>> condition of transform_for_base.
>>>>>>>>>>>
>>>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Alex
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks & Regards
>>>>>>>>>>>> Ajit
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  	}
>>>>>>>>>>>>>>>>        prev_access = &access;
>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>>>>  // and remove all the tombstone insns, being sure to reparent any uses
>>>>>>>>>>>>>>>>  // of mem to previous defs when we do this.
>>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>>> -ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
>>>>>>>>>>>>>>>>    if (!m_emitted_tombstone)
>>>>>>>>>>>>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  template<typename Map>
>>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>>> -ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>    for (auto kv : map)
>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>>> -ldp_bb_info::transform ()
>>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform ()
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>    traverse_base_map (expr_map);
>>>>>>>>>>>>>>>>    traverse_base_map (def_map);
>>>>>>>>>>>>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>>>>>>>>>>>>>>>>  // for load/store candidates.  If running after RA, also try and promote
>>>>>>>>>>>>>>>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
>>>>>>>>>>>>>>>>  // candidates into pairs.
>>>>>>>>>>>>>>>> -void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>> -  const bool track_loads
>>>>>>>>>>>>>>>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>>>>>>>> -  const bool track_stores
>>>>>>>>>>>>>>>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>>>>>>>> +  const bool track_loads = track_load_p ();
>>>>>>>>>>>>>>>> +  const bool track_stores = track_store_p ();
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -  ldp_bb_info bb_state (bb);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> can be deleted and then:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> can just be:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   pair_fusion_bb_info bb_info (bb, this);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (or you can pass *this if you make bb_info take a reference).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't think there's a particular need to change the variable name
>>>>>>>>>>>>>>> (bb_state -> bb_info).  I chose the former because it doesn't clash
>>>>>>>>>>>>>>> with the RTL-SSA structure of the same name as the latter.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Addressed.
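To keep the overall shape of this refactor in view, the class split being
reviewed amounts to roughly the following skeleton (a sketch only: member
lists are abbreviated, and any detail not visible in the diffs above is an
assumption):

  /* Target-independent pass interface; targets override the hooks.  */
  struct pair_fusion
  {
    virtual bool track_load_p () = 0;
    virtual bool track_store_p () = 0;
    void ldp_fusion_bb (bb_info *bb);	/* Generic per-BB driver.  */
  };

  /* aarch64 implementation of the hooks, mirroring the old globals.  */
  struct aarch64_pair_fusion : public pair_fusion
  {
    bool track_load_p () override
    {
      return (aarch64_tune_params.ldp_policy_model
	      != AARCH64_LDP_STP_POLICY_NEVER);
    }
    bool track_store_p () override
    {
      return (aarch64_tune_params.stp_policy_model
	      != AARCH64_LDP_STP_POLICY_NEVER);
    }
  };

  /* Per-basic-block state, handed a pointer to the pass object.  */
  struct pair_fusion_bb_info
  {
    pair_fusion_bb_info (bb_info *bb, pair_fusion *pass);
    void track_access (insn_info *insn, bool load_p, rtx mem);
    void transform ();
    void cleanup_tombstones ();
  };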
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    for (auto insn : bb->nondebug_insns ())
>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>>>>  	continue;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        rtx pat = PATTERN (rti);
>>>>>>>>>>>>>>>> -      if (reload_completed
>>>>>>>>>>>>>>>> -	  && aarch64_ldp_writeback > 1
>>>>>>>>>>>>>>>> -	  && GET_CODE (pat) == PARALLEL
>>>>>>>>>>>>>>>> -	  && XVECLEN (pat, 0) == 2)
>>>>>>>>>>>>>>>> +      if (pair_mem_promote_writeback_p (pat))
>>>>>>>>>>>>>>>>  	try_promote_writeback (insn);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It looks like try_promote_writeback itself will need some further work
>>>>>>>>>>>>>>> to make it target-independent.  I suppose this check:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   auto rti = insn->rtl ();
>>>>>>>>>>>>>>>   const auto attr = get_attr_ldpstp (rti);
>>>>>>>>>>>>>>>   if (attr == LDPSTP_NONE)
>>>>>>>>>>>>>>>     return;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   bool load_p = (attr == LDPSTP_LDP);
>>>>>>>>>>>>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> will need to become part of the pair_mem_promote_writeback_p hook that you
>>>>>>>>>>>>>>> added, potentially changing it to return a boolean for load_p.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then I guess we will need hooks for destructuring the pair insn and
>>>>>>>>>>>>>>> another hook to wrap aarch64_gen_writeback_pair.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        if (GET_CODE (pat) != SET)
>>>>>>>>>>>>>>>>  	continue;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
>>>>>>>>>>>>>>>> -	bb_state.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>>>>>>> +	bb_info.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>>>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
>>>>>>>>>>>>>>>> -	bb_state.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>>>>>>> +	bb_info.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -  bb_state.transform ();
>>>>>>>>>>>>>>>> -  bb_state.cleanup_tombstones ();
>>>>>>>>>>>>>>>> +  bb_info.transform ();
>>>>>>>>>>>>>>>> +  bb_info.cleanup_tombstones ();
>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  void ldp_fusion ()
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>    ldp_fusion_init ();
>>>>>>>>>>>>>>>> +  pair_fusion *pfuse;
>>>>>>>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>>>>>>>> +  pfuse = &derived;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is indeed the one place where I think it is acceptable to
>>>>>>>>>>>>>>> instantiate aarch64_pair_fusion.  But again there's no need to create a
>>>>>>>>>>>>>>> pointer to the parent class, just call any function you like directly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    for (auto bb : crtl->ssa->bbs ())
>>>>>>>>>>>>>>>> -    ldp_fusion_bb (bb);
>>>>>>>>>>>>>>>> +    pfuse->ldp_fusion_bb (bb);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think even the code to iterate over bbs should itself be a member
>>>>>>>>>>>>>>> function of pair_fusion (say "run") and then that becomes part of the
>>>>>>>>>>>>>>> generic code.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So this function would just become:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   aarch64_pair_fusion pass;
>>>>>>>>>>>>>>>   pass.run ();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and could be inlined into the caller.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Addressed.
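For reference, the suggested pair_fusion::run () might look roughly like the
sketch below (assuming the hook and helper names from the diff above; where
ldp_fusion_init/ldp_fusion_destroy end up living is an assumption):

  /* Generic driver: set up, fuse pairs in every block, tear down.  */
  void
  pair_fusion::run ()
  {
    ldp_fusion_init ();

    for (auto bb : crtl->ssa->bbs ())
      ldp_fusion_bb (bb);

    ldp_fusion_destroy ();
  }

  /* The old entry point then reduces to:  */
  void ldp_fusion ()
  {
    aarch64_pair_fusion pass;
    pass.run ();
  }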
>>>>>>>>>>>>>>> Perhaps you could also add an early return like:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   if (!track_loads_p () && !track_stores_p ())
>>>>>>>>>>>>>>>     return;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> in pair_fusion::run () and then remove the corresponding code from
>>>>>>>>>>>>>>> pass_ldp_fusion::gate?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    ldp_fusion_destroy ();
>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> 2.39.3
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks & Regards
>>>>>>>>>>>>>> Ajit