From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 17 Jan 2024 15:04:35 +0530
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH V1] rs6000: New pass for replacement of adjacent (load) lxv with lxvp
From: Ajit Agarwal
To: "Kewen.Lin", Richard Biener, Peter Bergner
Cc: Vladimir Makarov, Michael Meissner, Segher Boessenkool, David Edelsohn, gcc-patches, Richard Sandiford, alex.coplan@arm.com
References: <7b7e1da7-19bd-4192-b5a3-db3fed3a0aaf@linux.ibm.com> <00272349-aa2a-4ea3-9859-913b7b4fe049@linux.ibm.com> <0efa1db8-323e-4cdb-9b95-bf6cee25d03d@linux.ibm.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hello Kewen:

On 17/01/24 12:32 pm, Kewen.Lin wrote:
> on 2024/1/16 06:22, Ajit Agarwal wrote:
>> Hello Richard:
>>
>> On 15/01/24 6:25 pm, Ajit Agarwal wrote:
>>>
>>> On 15/01/24 6:14 pm, Ajit Agarwal wrote:
>>>> Hello Richard:
>>>>
>>>> On 15/01/24 3:03 pm, Richard Biener wrote:
>>>>> On Sun, Jan 14, 2024 at 4:29 PM Ajit Agarwal wrote:
>>>>>>
>>>>>> Hello All:
>>>>>>
>>>>>> This patch adds the vecload
>>>>>> pass to replace adjacent memory accesses lxv with lxvp
>>>>>> instructions. This pass is added before the ira pass.
>>>>>>
>>>>>> The vecload pass removes one of the two adjacent lxv (load)
>>>>>> instructions and replaces the pair with lxvp. Because one of the
>>>>>> defining loads is removed, its allocno has only uses but no defs.
>>>>>>
>>>>>> Because of this, the IRA pass does not assign register pairs as
>>>>>> sequential registers. Changes are made in the IRA register
>>>>>> allocator to assign sequential registers to adjacent loads.
>>>>>>
>>>>>> Some of the registers are cleared and are not marked as
>>>>>> profitable because a zero cost is greater than a negative cost;
>>>>>> checks are added to compare positive costs.
>>>>>>
>>>>>> LRA is changed not to reassign them to different registers, so
>>>>>> the sequential register pairs stay intact.
>>>>>>
>>>>>> contrib/check_GNU_style.sh run on the patch looks good.
>>>>>>
>>>>>> Bootstrapped and regtested for powerpc64-linux-gnu.
>>>>>>
>>>>>> Spec2017 benchmarks were run and I get impressive benefits for
>>>>>> some of the FP benchmarks.
>>>>>
>>>>> I want to point out that the aarch64 target recently got a ld/st
>>>>> fusion pass which sounds related. It would be nice to have at
>>>>> least common infrastructure for this (the aarch64 one also looks
>>>>> quite a bit more powerful).
>
> Thanks Richi for pointing out this pass. Yeah, it would be nice if we
> can share something common. CC the author Alex as well in case he has
> more insightful comments.
>
>>>>
>>>> The load/store fusion pass in aarch64 is scheduled to run before
>>>> the peephole2 pass and after the register allocator pass. In our
>>>> case, if we do it after the register allocator, then we should
>>>> keep the register assigned to the lower-offset load, while the
>>>> other load, adjacent to the previous load with an offset
>>>> difference of 16, is removed.
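[Editor's note: to make the pairing described in the patch summary above concrete, here is a minimal Python sketch of the adjacency test. The Insn record and the function name are hypothetical toy models, not the actual GCC pass, which operates on RTL insns.]

```python
from dataclasses import dataclass

@dataclass
class Insn:
    op: str      # mnemonic, e.g. "lxv"
    dest: str    # destination vsx register, e.g. "vs32"
    base: str    # base address register, e.g. "r2"
    offset: int  # byte offset from the base register

def fuse_adjacent_loads(insns):
    """Fuse pairs of lxv insns that read from the same base register
    at offsets differing by 16 into a single lxvp; the second load of
    each pair is removed, mirroring the pass description above."""
    fused, i = [], 0
    while i < len(insns):
        a = insns[i]
        b = insns[i + 1] if i + 1 < len(insns) else None
        if (b is not None and a.op == b.op == "lxv" and a.base == b.base
                and b.offset - a.offset == 16):
            fused.append(Insn("lxvp", a.dest, a.base, a.offset))
            i += 2  # the second lxv disappears: lxvp loads both halves
        else:
            fused.append(a)
            i += 1
    return fused

pair = [Insn("lxv", "vs32", "r2", 0), Insn("lxv", "vs45", "r2", 16)]
print([x.op for x in fuse_adjacent_loads(pair)])  # ['lxvp']
```

Removing the second load is exactly what leaves its allocno with uses but no defs, which is why the IRA changes described above become necessary.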
>>>>
>>>> Then we are left with one load with the lower offset, and the
>>>> register assigned by the register allocator to the lower-offset
>>>> load should be lower than that of the other adjacent load. If
>>>> not, we need to change it to the lower register and propagate it
>>>> to all the uses of the variable. Similarly, for the other
>>>> adjacent load that we are removing, the register needs to be
>>>> propagated to all of its uses.
>>>>
>>>> In that case we are doing the work of the register allocator. In
>>>> most of our example testcases the lower-offset load is assigned a
>>>> greater register than the other adjacent load by the register
>>>> allocator, and hence we are always left propagating them, almost
>>>> redoing the register allocator's work.
>>>>
>>>> Is it okay to use the load/store fusion pass as on aarch64 for
>>>> our cases, considering the above scenario?
>>>>
>>>> Please let me know what you think.
>>
>> I have gone through the implementation of ld/st fusion in aarch64.
>>
>> Here is my understanding:
>>
>> First of all, it is my mistake that I mentioned in my earlier mail
>> that this pass is done before peephole2 after the RA pass.
>>
>> The pass runs both before the RA pass (early, before early-remat)
>> and after the RA pass (before peephole2).
>>
>> This pass fuses two ldr instructions with adjacent accesses into an
>> ldp instruction.
>>
>> The assembly syntax of the ldp instruction is
>>
>> ldp w3, w7, [x0]
>>
>> It loads [x0] into w3 and [x0+4] into w7.
>>
>> Both registers that form the pair are mentioned in the ldp
>> instruction, and they need not be in sequential order, i.e. first
>> register w3 and then next register w3+1.
>>
>> That's why the pass before the RA pass works: it has both the defs,
>> and they are not required to be in sequential order like first_reg
>> and then first_reg+1. They can be any valid registers.
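[Editor's note: a hedged sketch of the post-RA fix-up described above: if the register allocator gave the lower-offset load the higher register number, every use of both registers must be rewritten so the lower-offset load ends up with the lower register. The helper name and list-of-uses model are assumptions for illustration only.]

```python
def propagate_pair(uses, lo_offset_reg, hi_offset_reg):
    """uses: register names referenced by later insns. Swap every
    occurrence of the two registers so that the register of the
    lower-offset load becomes the lower-numbered one of the pair."""
    swap = {lo_offset_reg: hi_offset_reg, hi_offset_reg: lo_offset_reg}
    return [swap.get(r, r) for r in uses]

# RA assigned vs33 to the offset-0 load and vs32 to the offset-16 load,
# so all uses of both registers must be exchanged:
print(propagate_pair(["vs33", "vs40", "vs32"], "vs33", "vs32"))
# ['vs32', 'vs40', 'vs33']
```

As the message above notes, having to do this renumbering for almost every testcase amounts to redoing the register allocator's work, which is the argument for pairing before RA instead.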
>>
>> But for the lxvp instruction:
>>
>> lxv vs32, 0(r2)
>> lxv vs45, 16(r2)
>>
>> When we combine the above lxv instructions into lxvp, the lxvp
>> instruction becomes
>>
>> lxvp vs32, 0(r2)
>>
>> where r2+0 is loaded into vs32 and r2+16 is loaded into the vs33
>> register (sequential registers); vs33 is implicit in the lxvp
>> instruction. This is a mandatory requirement for the lxvp
>> instruction and cannot be any other sequence: the register
>> assignment difference must be 1.
>
> Note that the first register number in the pair should be even; it
> means the so-called sequential order should be X, X + 1 (X is even).
> This is also the reason why we preferred this pairing to be done
> before RA (it can catch more opportunities).
>
>>
>> All the uses of r45 have to be propagated with r33.
>
> I think you meant s/r45/vs45/ and s/r33/vs33/.
>

Yes, I meant the same.

>>
>> And also the register allocator can allocate the two lxv
>> instructions to the following registers:
>>
>> lxv vs33, 0(r2)
>> lxv vs32, 16(r2)
>>
>> To generate lxvp for the above lxv instructions:
>>
>> lxvp vs32, 0(r2)
>>
>> all uses of vs33 have to be propagated with vs32, and vs32 has to
>> be propagated with vs33, if we do the vecload pass after the RA
>> pass.
>>
>> If we do it before the RA pass, IRA and LRA register allocation
>> cannot assign registers with a difference of 1; the order
>> difference can be anything with a positive difference.
>
> This sounds unexpected. IMHO if you adopt OOmode for the paired
> load, RA should be able to allocate two sequential vsx registers
> where the first is even, since OOmode is only ok for an even vsx
> register and its size makes it take two consecutive vsx registers.
>
> Hi Peter, is my understanding correct?
>

I tried all the combinations in the past; RA is not allocating
sequential registers. I don't see any such code in RA that generates
sequential registers. We need to add explicit code in RA to generate
sequential registers. This is what I have done in the patch submitted.
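[Editor's note: the lxvp register constraint discussed above can be summarized in a small check: the pair occupies two consecutive vsx registers, and per Kewen's note the first register number must be even. The helper name is hypothetical; this is a sketch of the stated rule, not GCC code.]

```python
def lxvp_pair_ok(first_vsx: int) -> bool:
    """lxvp vs32, 0(r2) loads 0(r2) into vs32 and 16(r2) into vs33:
    the second register is implicitly first + 1, so the only freedom
    is the (even) first register number."""
    return first_vsx % 2 == 0

print(lxvp_pair_ok(32))  # True: vs32/vs33 is a valid pair
print(lxvp_pair_ok(45))  # False: vs45 cannot start a pair
```

This is the contrast with aarch64 ldp drawn above: ldp names both destination registers explicitly, while lxvp hides the second register behind a sequential, even-aligned pairing rule, so the register allocator must cooperate.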
>>
>> IRA allocated one in vs32 and the other can be in vs45.
>>
>> In the vecload pass we remove one lxv of the two lxv instructions;
>> the second lxv instruction with an offset of 16 is removed, and the
>> uses of the second lxv's register will not have defs, so the IRA
>> pass cannot allocate them in order with a difference of 1.
>
> With Peter's patch to allow subreg from OOmode, I'd expect that we
> replace all uses of the first lxv (from the low part address) with
> (subreg:VnX ) and all uses of the second lxv (from the high address)
> with (subreg:VnX ). Currently after vecload, with the associated
> vecload.C, we transform from:
>
> (insn 8 4 10 2 (set (reg:V16QI 124 [ *ptr_4(D) ])
>         (mem:V16QI (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) 1186 {vsx_movv16qi_64bit}
>      (nil))
> (insn 10 8 9 2 (set (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>         (mem:V16QI (plus:DI (reg/v/f:DI 122 [ ptr ])
>                 (const_int 16 [0x10])) [0 MEM[(__vector unsigned char *)ptr_4(D) + 16B]+0 S16 A128])) 1186 {vsx_movv16qi_64bit}
>      (expr_list:REG_DEAD (reg/v/f:DI 122 [ ptr ])
>         (nil)))
> (insn 9 10 11 2 (set (reg:XO 119 [ _7 ])
>         (unspec:XO [
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 124 [ *ptr_4(D) ])
>             ] UNSPEC_MMA_XVF32GER)) 2203 {mma_xvf32ger}
>      (expr_list:REG_DEAD (reg:V16QI 124 [ *ptr_4(D) ])
>         (nil)))
> (insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
>         (unspec:XO [
>                 (reg:XO 119 [ _7 ])
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>             ] UNSPEC_MMA_XVF32GERPP)) 2217 {mma_xvf32gerpp}
>      (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>         (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
>             (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
>                 (nil)))))
>
> to:
>
> (insn 19 4 9 2 (set (reg:OO 124 [ *ptr_4(D) ])
>         (mem:OO (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) -1
>      (nil))
> (insn 9 19 11 2 (set (reg:XO 119 [ _7 ])
>         (unspec:XO [
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>             ] UNSPEC_MMA_XVF32GER)) 2203 {mma_xvf32ger}
>      (expr_list:REG_DEAD (reg:OO 124 [ *ptr_4(D) ])
>         (nil)))
> (insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
>         (unspec:XO [
>                 (reg:XO 119 [ _7 ])
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:OO 124 [ *ptr_4(D) ])
>             ] UNSPEC_MMA_XVF32GERPP)) 2217 {mma_xvf32gerpp}
>      (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>         (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
>             (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
>
> After vecload IMHO this code sequence doesn't look valid: no insn
> defines pseudo 125, and pseudo 124 has OOmode but is used to replace
> a use of V16QI. Both are expected to be subregs from pseudo 124 (it
> should be better to use a new pseudo for the paired load).
>

I tried that in the past and no sequential code is generated by RA.

> Without Peter's patch, UNSPEC_MMA_EXTRACT can be used for
> extracting. I think it should only result in sub-optimal code with
> possible extra register moves, but not have any correctness issue.
>
> This patch can NOT be bootstrapped on x86_64-redhat-linux; I guess
> it's caused by the proposed RA changes. And it's NOT regress-tested
> on Power10 LE with some go failures (note that I configured with
> --enable-languages="c,c++,fortran,objc,obj-c++,go"), also on Power9
> BE and LE with a few failures.
>

Did you try to bootstrap it on x86_64-redhat-linux? Did this patch
fail to bootstrap? If so, I will fix it.

> btw, I think Mike wants to test this pairing support on both
> adjacent loads and stores, so could you also extend the pairing to
> cover the stores? It can be guarded by a separate flag and disabled
> by default, considering the known issue on paired stores; then Mike
> can test whether the pairing satisfies what he is looking for.
>

Sure, I will do that.

Thanks & Regards
Ajit

> BR,
> Kewen
>