From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 17 Jan 2024 15:04:35 +0530
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH V1] rs6000: New pass for replacement of adjacent (load) lxv with lxvp
From: Ajit Agarwal
To: "Kewen.Lin", Richard Biener, Peter Bergner
Cc: Vladimir Makarov, Michael Meissner, Segher Boessenkool, David Edelsohn, gcc-patches, Richard Sandiford, alex.coplan@arm.com
References: <7b7e1da7-19bd-4192-b5a3-db3fed3a0aaf@linux.ibm.com> <00272349-aa2a-4ea3-9859-913b7b4fe049@linux.ibm.com> <0efa1db8-323e-4cdb-9b95-bf6cee25d03d@linux.ibm.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hello Kewen:

On 17/01/24 12:32 pm, Kewen.Lin wrote:
> on 2024/1/16 06:22, Ajit Agarwal wrote:
>> Hello Richard:
>>
>> On 15/01/24 6:25 pm, Ajit Agarwal wrote:
>>>
>>> On 15/01/24 6:14 pm, Ajit Agarwal wrote:
>>>> Hello Richard:
>>>>
>>>> On 15/01/24 3:03 pm, Richard Biener wrote:
>>>>> On Sun, Jan 14, 2024 at 4:29 PM Ajit Agarwal wrote:
>>>>>>
>>>>>> Hello All:
>>>>>>
>>>>>> This patch adds the vecload
>>>>>> pass to replace adjacent memory accesses lxv with lxvp
>>>>>> instructions. This pass is added before the ira pass.
>>>>>>
>>>>>> The vecload pass removes one of the two adjacent lxv (load)
>>>>>> instructions and replaces the pair with lxvp. Because one of the
>>>>>> defining loads is removed, its allocno has only uses but no defs.
>>>>>>
>>>>>> Because of this, the IRA pass does not assign register pairs as
>>>>>> sequential registers. Changes are made in the IRA register
>>>>>> allocator to assign sequential registers to adjacent loads.
>>>>>>
>>>>>> Some of the registers are cleared and are not marked as
>>>>>> profitable because a zero cost is greater than a negative cost;
>>>>>> checks are added to compare positive costs.
>>>>>>
>>>>>> LRA is changed not to reassign them to different registers, so
>>>>>> the sequential register pairs stay intact.
>>>>>>
>>>>>> contrib/check_GNU_style.sh run on the patch looks good.
>>>>>>
>>>>>> Bootstrapped and regtested for powerpc64-linux-gnu.
>>>>>>
>>>>>> Spec2017 benchmarks were run and I get impressive benefits for
>>>>>> some of the FP benchmarks.
>>>>>
>>>>> I want to point out that the aarch64 target recently got a ld/st
>>>>> fusion pass which sounds related. It would be nice to have at
>>>>> least common infrastructure for this (the aarch64 one also looks
>>>>> quite a bit more powerful).
>
> Thanks Richi for pointing out this pass. Yeah, it would be nice if we
> can share something common. CC the author Alex as well in case he has
> more insightful comments.
>
>>>>
>>>> The load/store fusion pass in aarch64 is scheduled to run before
>>>> the peephole2 pass and after the register allocator pass. In our
>>>> case, if we do it after the register allocator, then we should
>>>> keep the register assigned to the lower-offset load, while the
>>>> other load, adjacent to the previous load with an offset
>>>> difference of 16, is removed.
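[Editor's note: to make the pairing described in the patch summary above concrete, here is a minimal Python sketch of the adjacency test. The Insn record and the function name are hypothetical toy models, not the actual GCC pass, which operates on RTL insns.]

```python
from dataclasses import dataclass

@dataclass
class Insn:
    op: str      # mnemonic, e.g. "lxv"
    dest: str    # destination vsx register, e.g. "vs32"
    base: str    # base address register, e.g. "r2"
    offset: int  # byte offset from the base register

def fuse_adjacent_loads(insns):
    """Fuse pairs of lxv insns that read from the same base register
    at offsets differing by 16 into a single lxvp; the second load of
    each pair is removed, mirroring the pass description above."""
    fused, i = [], 0
    while i < len(insns):
        a = insns[i]
        b = insns[i + 1] if i + 1 < len(insns) else None
        if (b is not None and a.op == b.op == "lxv" and a.base == b.base
                and b.offset - a.offset == 16):
            fused.append(Insn("lxvp", a.dest, a.base, a.offset))
            i += 2  # the second lxv disappears: lxvp loads both halves
        else:
            fused.append(a)
            i += 1
    return fused

pair = [Insn("lxv", "vs32", "r2", 0), Insn("lxv", "vs45", "r2", 16)]
print([x.op for x in fuse_adjacent_loads(pair)])  # ['lxvp']
```

Removing the second load is exactly what leaves its allocno with uses but no defs, which is why the IRA changes described above become necessary.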
>>>>
>>>> Then we are left with one load with the lower offset, and the
>>>> register assigned by the register allocator to the lower-offset
>>>> load should be lower than that of the other adjacent load. If
>>>> not, we need to change it to the lower register and propagate it
>>>> to all the uses of the variable. Similarly, for the other
>>>> adjacent load that we are removing, the register needs to be
>>>> propagated to all of its uses.
>>>>
>>>> In that case we are doing the work of the register allocator. In
>>>> most of our example testcases the lower-offset load is assigned a
>>>> greater register than the other adjacent load by the register
>>>> allocator, and hence we are always left propagating them, almost
>>>> redoing the register allocator's work.
>>>>
>>>> Is it okay to use the load/store fusion pass as on aarch64 for
>>>> our cases, considering the above scenario?
>>>>
>>>> Please let me know what you think.
>>
>> I have gone through the implementation of ld/st fusion in aarch64.
>>
>> Here is my understanding:
>>
>> First of all, it is my mistake that I mentioned in my earlier mail
>> that this pass is done before peephole2 after the RA pass.
>>
>> The pass runs both before the RA pass (early, before early-remat)
>> and after the RA pass (before peephole2).
>>
>> This pass fuses two ldr instructions with adjacent accesses into an
>> ldp instruction.
>>
>> The assembly syntax of the ldp instruction is
>>
>> ldp w3, w7, [x0]
>>
>> It loads [x0] into w3 and [x0+4] into w7.
>>
>> Both registers that form the pair are mentioned in the ldp
>> instruction, and they need not be in sequential order, i.e. first
>> register w3 and then next register w3+1.
>>
>> That's why the pass before the RA pass works: it has both the defs,
>> and they are not required to be in sequential order like first_reg
>> and then first_reg+1. They can be any valid registers.
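[Editor's note: a hedged sketch of the post-RA fix-up described above: if the register allocator gave the lower-offset load the higher register number, every use of both registers must be rewritten so the lower-offset load ends up with the lower register. The helper name and list-of-uses model are assumptions for illustration only.]

```python
def propagate_pair(uses, lo_offset_reg, hi_offset_reg):
    """uses: register names referenced by later insns. Swap every
    occurrence of the two registers so that the register of the
    lower-offset load becomes the lower-numbered one of the pair."""
    swap = {lo_offset_reg: hi_offset_reg, hi_offset_reg: lo_offset_reg}
    return [swap.get(r, r) for r in uses]

# RA assigned vs33 to the offset-0 load and vs32 to the offset-16 load,
# so all uses of both registers must be exchanged:
print(propagate_pair(["vs33", "vs40", "vs32"], "vs33", "vs32"))
# ['vs32', 'vs40', 'vs33']
```

As the message above notes, having to do this renumbering for almost every testcase amounts to redoing the register allocator's work, which is the argument for pairing before RA instead.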
>>
>> But for the lxvp instruction:
>>
>> lxv vs32, 0(r2)
>> lxv vs45, 16(r2)
>>
>> When we combine the above lxv instructions into lxvp, the lxvp
>> instruction becomes
>>
>> lxvp vs32, 0(r2)
>>
>> where r2+0 is loaded into vs32 and r2+16 is loaded into the vs33
>> register (sequential registers); vs33 is implicit in the lxvp
>> instruction. This is a mandatory requirement for the lxvp
>> instruction and cannot be any other sequence: the register
>> assignment difference must be 1.
>
> Note that the first register number in the pair should be even; it
> means the so-called sequential order should be X, X + 1 (X is even).
> This is also the reason why we preferred this pairing to be done
> before RA (it can catch more opportunities).
>
>>
>> All the uses of r45 have to be propagated with r33.
>
> I think you meant s/r45/vs45/ and s/r33/vs33/.
>

Yes, I meant the same.

>>
>> And also the register allocator can allocate the two lxv
>> instructions to the following registers:
>>
>> lxv vs33, 0(r2)
>> lxv vs32, 16(r2)
>>
>> To generate lxvp for the above lxv instructions:
>>
>> lxvp vs32, 0(r2)
>>
>> all uses of vs33 have to be propagated with vs32, and vs32 has to
>> be propagated with vs33, if we do the vecload pass after the RA
>> pass.
>>
>> If we do it before the RA pass, IRA and LRA register allocation
>> cannot assign registers with a difference of 1; the order
>> difference can be anything with a positive difference.
>
> This sounds unexpected. IMHO if you adopt OOmode for the paired
> load, RA should be able to allocate two sequential vsx registers
> where the first is even, since OOmode is only ok for an even vsx
> register and its size makes it take two consecutive vsx registers.
>
> Hi Peter, is my understanding correct?
>

I tried all the combinations in the past; RA is not allocating
sequential registers. I don't see any such code in RA that generates
sequential registers. We need to add explicit code in RA to generate
sequential registers. This is what I have done in the patch submitted.
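[Editor's note: the lxvp register constraint discussed above can be summarized in a small check: the pair occupies two consecutive vsx registers, and per Kewen's note the first register number must be even. The helper name is hypothetical; this is a sketch of the stated rule, not GCC code.]

```python
def lxvp_pair_ok(first_vsx: int) -> bool:
    """lxvp vs32, 0(r2) loads 0(r2) into vs32 and 16(r2) into vs33:
    the second register is implicitly first + 1, so the only freedom
    is the (even) first register number."""
    return first_vsx % 2 == 0

print(lxvp_pair_ok(32))  # True: vs32/vs33 is a valid pair
print(lxvp_pair_ok(45))  # False: vs45 cannot start a pair
```

This is the contrast with aarch64 ldp drawn above: ldp names both destination registers explicitly, while lxvp hides the second register behind a sequential, even-aligned pairing rule, so the register allocator must cooperate.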
>>
>> IRA allocated one in vs32 and the other can be in vs45.
>>
>> In the vecload pass we remove one lxv of the two lxv instructions;
>> the second lxv instruction with an offset of 16 is removed, and the
>> uses of the second lxv's register will not have defs, so the IRA
>> pass cannot allocate them in order with a difference of 1.
>
> With Peter's patch to allow subreg from OOmode, I'd expect that we
> replace all uses of the first lxv (from the low part address) with
> (subreg:VnX ) and all uses of the second lxv (from the high address)
> with (subreg:VnX ). Currently after vecload, with the associated
> vecload.C, we transform from:
>
> (insn 8 4 10 2 (set (reg:V16QI 124 [ *ptr_4(D) ])
>         (mem:V16QI (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) 1186 {vsx_movv16qi_64bit}
>      (nil))
> (insn 10 8 9 2 (set (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>         (mem:V16QI (plus:DI (reg/v/f:DI 122 [ ptr ])
>                 (const_int 16 [0x10])) [0 MEM[(__vector unsigned char *)ptr_4(D) + 16B]+0 S16 A128])) 1186 {vsx_movv16qi_64bit}
>      (expr_list:REG_DEAD (reg/v/f:DI 122 [ ptr ])
>         (nil)))
> (insn 9 10 11 2 (set (reg:XO 119 [ _7 ])
>         (unspec:XO [
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 124 [ *ptr_4(D) ])
>             ] UNSPEC_MMA_XVF32GER)) 2203 {mma_xvf32ger}
>      (expr_list:REG_DEAD (reg:V16QI 124 [ *ptr_4(D) ])
>         (nil)))
> (insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
>         (unspec:XO [
>                 (reg:XO 119 [ _7 ])
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>             ] UNSPEC_MMA_XVF32GERPP)) 2217 {mma_xvf32gerpp}
>      (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>         (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
>             (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
>                 (nil)))))
>
> to:
>
> (insn 19 4 9 2 (set (reg:OO 124 [ *ptr_4(D) ])
>         (mem:OO (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) -1
>      (nil))
> (insn 9 19 11 2 (set (reg:XO 119 [ _7 ])
>         (unspec:XO [
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>             ] UNSPEC_MMA_XVF32GER)) 2203 {mma_xvf32ger}
>      (expr_list:REG_DEAD (reg:OO 124 [ *ptr_4(D) ])
>         (nil)))
> (insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
>         (unspec:XO [
>                 (reg:XO 119 [ _7 ])
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:OO 124 [ *ptr_4(D) ])
>             ] UNSPEC_MMA_XVF32GERPP)) 2217 {mma_xvf32gerpp}
>      (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>         (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
>             (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
>
> After vecload IMHO this code sequence doesn't look valid: no insn
> defines pseudo 125, and pseudo 124 has OOmode but is used to replace
> a use of V16QI. Both are expected to be subregs from pseudo 124 (it
> should be better to use a new pseudo for the paired load).
>

I tried that in the past and no sequential code is generated by RA.

> Without Peter's patch, UNSPEC_MMA_EXTRACT can be used for
> extracting. I think it should only result in sub-optimal code with
> possible extra register moves, but not have any correctness issue.
>
> This patch can NOT be bootstrapped on x86_64-redhat-linux; I guess
> it's caused by the proposed RA changes. And it's NOT regress-tested
> on Power10 LE with some go failures (note that I configured with
> --enable-languages="c,c++,fortran,objc,obj-c++,go"), also on Power9
> BE and LE with a few failures.
>

Did you try to bootstrap it on x86_64-redhat-linux? Did this patch
fail to bootstrap? If so, I will fix it.

> btw, I think Mike wants to test this pairing support on both
> adjacent loads and stores, so could you also extend the pairing to
> cover the stores? It can be guarded by a separate flag and disabled
> by default, considering the known issue on paired stores; then Mike
> can test whether the pairing satisfies what he is looking for.
>

Sure, I will do that.

Thanks & Regards
Ajit

> BR,
> Kewen
>