Date: Wed, 17 Jan 2024 15:02:01 +0800
Subject: Re: [PATCH V1] rs6000: New pass for replacement of adjacent (load) lxv with lxvp
From: "Kewen.Lin"
To: Ajit Agarwal, Richard Biener, Peter Bergner
Cc: Vladimir Makarov, Michael Meissner, Segher Boessenkool, David Edelsohn, gcc-patches, Richard Sandiford, alex.coplan@arm.com

on 2024/1/16 06:22, Ajit Agarwal wrote:
> Hello Richard:
>
> On 15/01/24 6:25 pm, Ajit Agarwal wrote:
>>
>> On 15/01/24 6:14 pm, Ajit Agarwal wrote:
>>> Hello Richard:
>>>
>>> On 15/01/24 3:03 pm, Richard Biener wrote:
>>>> On Sun, Jan 14, 2024 at 4:29 PM Ajit Agarwal
wrote:
>>>>>
>>>>> Hello All:
>>>>>
>>>>> This patch adds the vecload pass to replace adjacent memory accesses (lxv) with
>>>>> lxvp instructions. This pass is added before the ira pass.
>>>>>
>>>>> The vecload pass removes one of the two adjacent defined lxv (load) instructions
>>>>> and replaces the other with lxvp. Due to the removal of one of the defined loads,
>>>>> its allocno has only uses but no defs.
>>>>>
>>>>> Because of this, the IRA pass doesn't assign register pairs as sequential
>>>>> registers. Changes are made in the IRA register allocator to assign sequential
>>>>> registers to adjacent loads.
>>>>>
>>>>> Some of the registers are cleared and are not marked as profitable because a
>>>>> zero cost is greater than a negative cost; checks are added to compare positive
>>>>> costs.
>>>>>
>>>>> LRA is changed not to reassign them to different registers, so the sequential
>>>>> register pairs stay intact.
>>>>>
>>>>> contrib/check_GNU_style.sh run on the patch looks good.
>>>>>
>>>>> Bootstrapped and regtested on powerpc64-linux-gnu.
>>>>>
>>>>> Spec2017 benchmarks were run and I see impressive benefits for some of the FP
>>>>> benchmarks.
>>>>
>>>> I want to point out that the aarch64 target recently got a ld/st fusion
>>>> pass which sounds related. It would be nice to have at least common
>>>> infrastructure for this (the aarch64 one also looks quite a bit more powerful).

Thanks Richi for pointing out this pass. Yeah, it would be nice if we can share
something common. CC'ing the author Alex as well in case he has more insightful
comments.

>>> The load/store fusion pass in aarch64 is scheduled to run before the peephole2
>>> pass and after the register allocator pass. In our case, if we do it after the
>>> register allocator, then we should keep the register assigned to the lower-offset
>>> load, while the other load, adjacent to the previous load with an offset
>>> difference of 16, is removed.
>>>
>>> Then we are left with one load with the lower offset, and the register assigned
>>> by the register allocator for the lower-offset load should be lower than the
>>> other adjacent load's. If not, we need to change it to the lower register and
>>> propagate that to all the uses of the variable. Similarly, for the other
>>> adjacent load that we are removing, the register needs to be propagated to
>>> all the uses.
>>>
>>> In that case we are doing the work of the register allocator. In most of our
>>> example testcases the lower-offset load is assigned a greater register
>>> than the other adjacent load by the register allocator, and hence we are left
>>> with always propagating them and almost redoing the register allocator's work.
>>>
>>> Is it okay to use the load/store fusion pass as on aarch64 for our cases,
>>> considering the above scenario?
>>>
>>> Please let me know what you think.
>
> I have gone through the implementation of ld/st fusion in aarch64.
>
> Here is my understanding:
>
> First of all, it was my mistake to say in my earlier mail that this pass is
> done before peephole2, after the RA pass.
>
> This pass runs both early, before early-remat, and also after the RA pass,
> before peephole2.
>
> This pass fuses 2 ldr instructions with adjacent accesses into an ldp
> instruction.
>
> The assembly syntax of the ldp instruction is
>
> ldp w3, w7, [x0]
>
> It loads [x0] into w3 and [x0+4] into w7.
>
> Both registers that form the pair are mentioned in the ldp instruction
> and need not be in sequential order, i.e. it is not required that the first
> register is w3 and the next register is w3+1.
>
> That's why the pass works before the RA pass: it has both the defs, and they
> are not required to be in sequential order like first_reg and then
> first_reg+1. It can be any pair of valid registers.
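For concreteness, the adjacent-access shape both passes look for can be modeled in portable C (a minimal sketch with hypothetical names, not code from either pass or its testsuite):

```c
#include <stdint.h>
#include <string.h>

/* Portable stand-in for two adjacent 16-byte vector loads: the fusion
   passes look for exactly this shape -- same base address, offsets 0
   and 16 -- and combine the two loads into a single paired load
   (ldp on aarch64, lxvp on Power10).  */
typedef struct { uint8_t b[16]; } v16qi;

uint32_t sum_adjacent(const uint8_t *ptr)
{
    v16qi lo, hi;
    memcpy(&lo, ptr,      16);  /* would be: lxv vsX,  0(r)  */
    memcpy(&hi, ptr + 16, 16);  /* would be: lxv vsY, 16(r)  */

    uint32_t s = 0;
    for (int i = 0; i < 16; i++)
        s += lo.b[i] + hi.b[i];
    return s;
}
```

With bytes 0..31 stored at `ptr`, `sum_adjacent` returns 496 (the sum 0+1+...+31).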
>
> But for the lxvp instruction:
>
> lxv vs32, 0(r2)
> lxv vs45, 16(r2)
>
> When we combine the above lxv instructions into lxvp, the lxvp instruction
> becomes
>
> lxvp vs32, 0(r2)
>
> wherein r2+0 is loaded into vs32 and r2+16 is loaded into the vs33 register
> (sequential registers). vs33 is hidden in the lxvp instruction.
> This is a mandatory requirement of the lxvp instruction; the registers cannot
> be in any other sequence, and the register number difference must be 1.

Note that the first register number in the pair should be even, meaning the
so-called sequential order should be X, X + 1 (X is even). This is also the
reason why we preferred this pairing to be done before RA (it can catch more
opportunities).

> All the uses of r45 have to be propagated with r33.

I think you meant s/r45/vs45/ and s/r33/vs33/.

> And also the register allocator can allocate the two lxv instructions
> to the following registers:
>
> lxv vs33, 0(r2)
> lxv vs32, 16(r2)
>
> To generate lxvp for the above lxv instructions:
>
> lxvp vs32, 0(r2).
>
> And all the uses of vs33 have to be propagated with vs32, and vs32 has to be
> propagated with vs33, if we do the vecload pass after the RA pass.
>
> If we do it before the RA pass, the IRA and LRA register allocation cannot
> assign registers with a difference of 1, and the order difference can be
> anything with a positive difference.

This sounds unexpected. IMHO if you adopt OOmode for the paired load, RA should
be able to allocate two sequential vsx registers with the first one even, since
OOmode is only ok for an even vsx register and its size makes it take two
consecutive vsx registers. Hi Peter, is my understanding correct?

> IRA allocated one in vs32 and the other can be in vs45.
>
> In the vecload pass we remove one lxv of the 2 lxv instructions: the 2nd lxv
> instruction with an offset of 16 is removed, and the uses of the 2nd lxv's
> register will have no defs, so the IRA pass cannot allocate them in order
> with a difference of 1.
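The two pairing rules discussed above can be captured in a few lines of C (a sketch of the constraints only, with hypothetical helper names):

```c
#include <stdbool.h>

/* aarch64 ldp encodes both destination registers in the instruction,
   so any two distinct valid registers can be paired.  */
static bool ldp_pair_ok(int reg1, int reg2)
{
    return reg1 != reg2;
}

/* Power10 lxvp encodes only the first VSX register; the second is
   implicitly first + 1, and the first must be even.  */
static bool lxvp_pair_ok(int first, int second)
{
    return (first % 2) == 0 && second == first + 1;
}
```

So `lxvp_pair_ok(32, 33)` holds (vs32/vs33), while `lxvp_pair_ok(33, 34)` (odd first register) and `lxvp_pair_ok(32, 45)` (non-adjacent registers) do not.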
With Peter's patch to allow subreg from OOmode, I'd expect that we replace all
uses of the first lxv (from the low part address) with (subreg:VnX ) and all
uses of the second lxv (from the high address) with (subreg:VnX ). Currently
after vecload, with the associated vecload.C, we transform from:

(insn 8 4 10 2 (set (reg:V16QI 124 [ *ptr_4(D) ])
        (mem:V16QI (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) 1186 {vsx_movv16qi_64bit}
     (nil))
(insn 10 8 9 2 (set (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
        (mem:V16QI (plus:DI (reg/v/f:DI 122 [ ptr ])
                (const_int 16 [0x10])) [0 MEM[(__vector unsigned char *)ptr_4(D) + 16B]+0 S16 A128])) 1186 {vsx_movv16qi_64bit}
     (expr_list:REG_DEAD (reg/v/f:DI 122 [ ptr ])
        (nil)))
(insn 9 10 11 2 (set (reg:XO 119 [ _7 ])
        (unspec:XO [
                (reg/v:V16QI 123 [ src ])
                (reg:V16QI 124 [ *ptr_4(D) ])
            ] UNSPEC_MMA_XVF32GER)) 2203 {mma_xvf32ger}
     (expr_list:REG_DEAD (reg:V16QI 124 [ *ptr_4(D) ])
        (nil)))
(insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
        (unspec:XO [
                (reg:XO 119 [ _7 ])
                (reg/v:V16QI 123 [ src ])
                (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
            ] UNSPEC_MMA_XVF32GERPP)) 2217 {mma_xvf32gerpp}
     (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
        (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
            (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
                (nil)))))

to:

(insn 19 4 9 2 (set (reg:OO 124 [ *ptr_4(D) ])
        (mem:OO (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) -1
     (nil))
(insn 9 19 11 2 (set (reg:XO 119 [ _7 ])
        (unspec:XO [
                (reg/v:V16QI 123 [ src ])
                (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
            ] UNSPEC_MMA_XVF32GER)) 2203 {mma_xvf32ger}
     (expr_list:REG_DEAD (reg:OO 124 [ *ptr_4(D) ])
        (nil)))
(insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
        (unspec:XO [
                (reg:XO 119 [ _7 ])
                (reg/v:V16QI 123 [ src ])
                (reg:OO 124 [ *ptr_4(D) ])
            ] UNSPEC_MMA_XVF32GERPP)) 2217 {mma_xvf32gerpp}
     (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
        (expr_list:REG_DEAD (reg/v:V16QI
123 [ src ])
            (expr_list:REG_DEAD (reg:XO 119 [ _7 ])

After vecload, IMHO this code sequence doesn't look valid: no insn defines
pseudo 125, and pseudo 124 has OOmode but is used to replace a use of V16QI.
Both are expected to be subregs of pseudo 124 (it would be better to use a new
pseudo for the paired load). Without Peter's patch, UNSPEC_MMA_EXTRACT can be
used for the extracting; I think it should only result in sub-optimal code with
possible extra register moves, but not have any correctness issue.

This patch can NOT be bootstrapped on x86_64-redhat-linux; I guess that's
caused by the proposed RA changes. And it does NOT regress-test cleanly on
Power10 LE, with some go failures (note that I configured with
--enable-languages="c,c++,fortran,objc,obj-c++,go"), nor on Power9 BE and LE,
which show a few failures.

btw, I think Mike wants to test this pairing support on both adjacent loads and
stores, so could you also extend the pairing to cover stores? That could be
guarded by a separate flag and disabled by default, considering the known issue
on paired stores; then Mike can test whether the pairing satisfies what he is
looking for.

BR,
Kewen