From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=b53L=NC=linux.ibm.com=aagarwa1@sourceware.org>
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1])
	by sourceware.org (Postfix) with ESMTPS id 756E13870871
	for <gcc-patches@gcc.gnu.org>; Fri, 31 May 2024 16:59:35 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 756E13870871
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=linux.ibm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linux.ibm.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 756E13870871
Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=148.163.156.1
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1717174778; cv=none;
	b=c/0lz4s5h7Vj6AN5awctsDSEJWKQnPnOQDqn0JVbMqbA+cGXm3OprUOQfBe/lTTnLor/Z2qAag364fI4SMaUKjAxrjUpMin8wtm5BLyQr6B/opyCaEj7fD9JQkVZQHvggVfvgbE/h+T3Tm7T8wgYH2CFgCAmm4ilxyyeBpJe1ws=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
	t=1717174778; c=relaxed/simple;
	bh=KVm9huL/cHmW9Xbw49/8uJ3ErS2/y/mEXWOtgMgALHQ=;
	h=DKIM-Signature:Message-ID:Date:MIME-Version:Subject:To:From; b=n+JpPMe4CZE9jU6z//TYYBT3fIlEKqYMLLUz7Misrc9DkvaittXPx2D1ACvuxz+Ks4WMBWYhlJIkTP9lj8aMG2393Q2qc4j5djnMkczyEx/fy4gxCgyazKyzGKkpkCb8zvSRup3K4rnMnl805YmK7uCqJ3bVrXTjJsDnP6JXPC4=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from pps.filterd (m0353729.ppops.net [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 44VGlGH6003465;
	Fri, 31 May 2024 16:59:31 GMT
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com;
 h=content-transfer-encoding : content-type : date : from : in-reply-to :
 message-id : mime-version : references : subject : to; s=pp1;
 bh=JN0px/hv745AZ/KkhRtj+pC3JoS0GIeCrrZEuQwsE54=;
 b=mHRNOKLeaF4BUf/suP1o83kJuXnk2aqEDHHrU9kkpTC93omc0DwfoSvnFJmWphR+aQK0
 +pIVHD+t159xiSt77pWhhs5DEn5JNcLWv68Geh8WGYNjR9Tadnkh4me+a2pZSWCPL2DI
 gpnRLgVQgACMTkKGiCg0k37wtJSpz1MF7SnmRQdIrR5Lk+mVgBmNfGBSN4oLF4HRQhw2
 1q5modWHbvTUk57kKX8sqJg9yUuqKGo0Ob0TyBv1ZiC9ZzADIZBpQ/QC9WWSzn+0MTst
 54IdsGwKAvXYMtnJny74njVkt3KlhSAQODK/6d/ixwecpc1KYwrdaFlkSMb94MxEKTv6 wA== 
Received: from pps.reinject (localhost [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3yfjccr0mm-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Fri, 31 May 2024 16:59:30 +0000
Received: from m0353729.ppops.net (m0353729.ppops.net [127.0.0.1])
	by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 44VGsXGY014088;
	Fri, 31 May 2024 16:59:30 GMT
Received: from ppma13.dal12v.mail.ibm.com (dd.9e.1632.ip4.static.sl-reverse.com [50.22.158.221])
	by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3yfjccr0mj-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Fri, 31 May 2024 16:59:30 +0000
Received: from pps.filterd (ppma13.dal12v.mail.ibm.com [127.0.0.1])
	by ppma13.dal12v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 44VG2HuE002783;
	Fri, 31 May 2024 16:59:29 GMT
Received: from smtprelay02.wdc07v.mail.ibm.com ([172.16.1.69])
	by ppma13.dal12v.mail.ibm.com (PPS) with ESMTPS id 3ydpb110wq-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Fri, 31 May 2024 16:59:29 +0000
Received: from smtpav06.wdc07v.mail.ibm.com (smtpav06.wdc07v.mail.ibm.com [10.39.53.233])
	by smtprelay02.wdc07v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 44VGxP2k28639766
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
	Fri, 31 May 2024 16:59:27 GMT
Received: from smtpav06.wdc07v.mail.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id 901DC58060;
	Fri, 31 May 2024 16:59:25 +0000 (GMT)
Received: from smtpav06.wdc07v.mail.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id 00A2B58068;
	Fri, 31 May 2024 16:59:22 +0000 (GMT)
Received: from [9.43.13.10] (unknown [9.43.13.10])
	by smtpav06.wdc07v.mail.ibm.com (Postfix) with ESMTP;
	Fri, 31 May 2024 16:59:21 +0000 (GMT)
Message-ID: <95a33b0a-8090-4218-a62c-da1f53bebbb7@linux.ibm.com>
Date: Fri, 31 May 2024 22:29:19 +0530
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [Patch, rs6000, aarch64, middle-end] Add implementation for
 different targets for pair mem fusion
To: Alex Coplan <alex.coplan@arm.com>, "Kewen.Lin" <linkw@linux.ibm.com>,
        Segher Boessenkool <segher@kernel.crashing.org>,
        Michael Meissner <meissner@linux.ibm.com>,
        Peter Bergner <bergner@linux.ibm.com>,
        David Edelsohn <dje.gcc@gmail.com>,
        gcc-patches <gcc-patches@gcc.gnu.org>, richard.sandiford@arm.com
References: <53ba46de-6c01-4c68-bd98-1ba6950a793a@linux.ibm.com>
 <mpt34py71e8.fsf@arm.com>
 <cc31b3fb-552b-43bd-9029-5b84f6e6c437@linux.ibm.com>
 <mpted9i59n1.fsf@arm.com>
Content-Language: en-US
From: Ajit Agarwal <aagarwa1@linux.ibm.com>
In-Reply-To: <mpted9i59n1.fsf@arm.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: v5HTJqTZQ7DryuAHD8eoFQJivHxWFq1E
X-Proofpoint-ORIG-GUID: lTx6KfD4NgCxj1htBGMrTMNw3-ZZFWwG
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.293,Aquarius:18.0.1039,Hydra:6.0.650,FMLib:17.12.28.16
 definitions=2024-05-31_12,2024-05-30_01,2024-05-17_01
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 adultscore=0 suspectscore=0
 spamscore=0 impostorscore=0 priorityscore=1501 malwarescore=0 bulkscore=0
 lowpriorityscore=0 clxscore=1015 phishscore=0 mlxscore=0 mlxlogscore=999
 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2405010000
 definitions=main-2405310128
X-Spam-Status: No, score=-5.7 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_EF,KAM_MANYTO,KAM_SHORT,RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Hello Richard:

On 31/05/24 8:08 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> On 31/05/24 3:23 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> Hello All:
>>>>
>>>> Common infrastructure using generic code for pair mem fusion of different
>>>> targets.
>>>>
>>>> rs6000 target specific specific code implements virtual functions defined
>>>> by generic code.
>>>>
>>>> Code is implemented with pure virtual functions to interface with target
>>>> code.
>>>>
>>>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>>>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>>>
>>>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>
>>>>
>>>> aarch64, rs6000, middle-end: Add implementation for different targets for pair mem fusion
>>>>
>>>> Common infrastructure using generic code for pair mem fusion of different
>>>> targets.
>>>>
>>>> rs6000 target specific specific code implements virtual functions defined
>>>> by generic code.
>>>>
>>>> Code is implemented with pure virtual functions to interface with target
>>>> code.
>>>>
>>>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>>>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>>>
>>>> 2024-05-31  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>> 	* config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>>>> 	implementation of additional virtual functions added in pair_fusion
>>>> 	struct.
>>>> 	* config/rs6000/rs6000-passes.def: New mem fusion pass
>>>> 	before pass_early_remat.
>>>> 	* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>>>> 	Add target specific implementation using pure virtual
>>>> 	functions.
>>>> 	* config.gcc: Add new object file.
>>>> 	* config/rs6000/rs6000-protos.h: Add new prototype for mem
>>>> 	fusion pass.
>>>> 	* config/rs6000/t-rs6000: Add new rule.
>>>> 	* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>>>> 	from private.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>> 	* g++.target/powerpc/me-fusion.C: New test.
>>>> 	* g++.target/powerpc/mem-fusion-1.C: New test.
>>>> 	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
>>>> ---
>>>
>>> This isn't a complete review, just some initial questions & comments
>>> about selected parts.
>>>
>>>> [...]
>>>> +/* Check whether load can be fusable or not.
>>>> +   Return true if dependent use is UNSPEC otherwise false.  */
>>>> +bool
>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>>> +{
>>>> +  rtx_insn *insn = info->rtl ();
>>>> +
>>>> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
>>>> +    if (REG_NOTE_KIND (note) == REG_EQUAL
>>>> +	|| REG_NOTE_KIND (note) == REG_EQUIV)
>>>> +      return false;
>>>
>>> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
>>> note.  What's the reason for doing this?  Are you trying to avoid
>>> fusing pairs before reload that are equivalent to a MEM (i.e. have
>>> a natural spill slot)?  I think Alex hit a similar situation.
>>>
>>
>> We have used the above check because of some SPEC benchmarks failing with
>> with MEM pairs having REG_EQUAL/EQUIV notes.
>>
>> By adding the checks the benchmarks passes and also it improves the
>> performance.
>>
>> This checks were added during initial implementation of pair fusion
>> pass.
>>
>> I will investigate further if this check is still required or not.
> 
> Thanks.  If it does affect SPEC results, it would be good to look
> at the underlying reason, as a justification for the check.
> 
> AIUI, the case Alex was due to the way that the RA recognises:
> 
>   (set (reg R) (mem address-of-a-stack-variable))
>     REG_EQUIV: (mem address-of-a-stack-variable)
> 
> where the REG_EQUIV is either explicit or detected by the RA.
> If R needs to be spilled, it can then be spilled to its existing
> location on the stack.  And if R needs to be spilled in the
> instruction above (because of register pressure before the first
> use of R), the RA is able to delete the instruction.
> 
> But if that is the reason, the condition should be restricted
> to cases in which the note is a memory.
> 
> I think Alex had tried something similar and found that it wasn't
> always effective.
> 

Sure I will check.
>> [...]
>>>> +		   && info->is_real ())
>>>> +		  {
>>>> +		    rtx_insn *rtl_insn = info->rtl ();
>>>> +		    rtx set = single_set (rtl_insn);
>>>> +
>>>> +		    if (set == NULL_RTX)
>>>> +		      return false;
>>>> +
>>>> +		    rtx op0 = SET_SRC (set);
>>>> +		    if (GET_CODE (op0) != UNSPEC)
>>>> +		      return false;
>>>
>>> What's the motivation for rejecting unspecs?  It's unsual to treat
>>> all unspecs as a distinct group.
>>>
>>> Also, using single_set means that the function still lets through
>>> parallels of two sets in which the sources are unspecs.  Is that
>>> intentional?
>>>
>>> The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
>>> need to be described in comments, so that other people coming to this
>>> code later can understand the motivation.  The same thing applies to
>>> other decisions in the patch.
>>>
>>
>> Adjacent load pair fusion with 256 bit OOmode is seen and valid with use of load
>> in UNSPEC. Thats why this check is added.
> 
> Can you give an example (in terms of rtl)?


(insn 23 72 25 4 (set (reg:OO 124 [ vect__1.24 ])
        (mem:OO (reg:DI 118 [ ivtmp.73 ]) [0 MEM <vector(16) unsigned char> [(unsigned char *)_65]+0 S16 A8]))  {*movoo}
     (nil))
(insn 25 23 103 4 (set (reg:OO 174 [ vect__1.26 ])
        (mem:OO (plus:DI (reg:DI 118 [ ivtmp.73 ])
                (const_int 32 [0x20])) [0 MEM <vector(16) unsigned char> [(unsigned char *)_65 + 32B]+0 S16 A8]))  {*movoo}
     (nil))
(insn 103 25 27 4 (set (reg:DI 118 [ ivtmp.73 ])
        (plus:DI (reg:DI 118 [ ivtmp.73 ])
            (const_int 64 [0x40]))) 66 {*adddi3}
     (nil))
(insn 27 103 29 4 (set (reg:V16QI 176 [ vect_perm_even_123 ])
        (unspec:V16QI [
                (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 0)
                (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 16)
                (reg:V16QI 300)
            ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
     (nil))

This is the example where UNSPEC is used with 256 bit OOmode.

> 
>> [...]
>>>> [...]
>>>> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
>>>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>>>> +					      rtl_ssa::insn_info *i2) = 0;
>>>> +
>>>
>>> This name seems a bit misleading.  The function is used in:
>>>
>>> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>        reversed = true;
>>>      }
>>>  
>>> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
>>> +    return false;
>>> +
>>>    rtx cand_mems[2];
>>>    rtx reg_ops[2];
>>>    rtx pats[2];
>>>
>>> and so it acts as a general opt-out.  The insns aren't known to be unordered.
>>>
>>> It looks like the rs6000 override requires the original insns to be
>>> in offset order.  Could you say why that's necessary?  (Both in email
>>> and as a comment in the code.)
>>>
>>
>> Yes rs6000 requires the original load insns to be in offset order.
>> Some regression tests like vect-outer-4f fails if we do load pair
>> fusion with load offsets are not in offset order as this breaks lxvp 
>> semantics.
> 
> How does it break the semantics though?  In principle, the generic code
> only fuses if it has "proved" that the loads can happen in either order.
> So it shouldn't matter which order the hardware does things in.
> 
> Could you give an example of the kind of situation that you want
> to avoid, and why it generates the wrong result?
>

(insn 31 62 32 2 (set (reg:V16QI 177 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 64 [0x40])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B]+0 S16 A16]))  {vsx_movv16qi_64bit}
     (nil))
(insn 32 31 16 2 (set (reg:V16QI 178 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 80 [0x50])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B]+0 S16 A16]))  {vsx_movv16qi_64bit}
     (nil))
(insn 16 32 21 2 (set (reg:V16QI 159 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B]+0 S16 A16]))  {vsx_movv16qi_64bit}
     (nil))
(insn 21 16 22 2 (set (reg:V16QI 165 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 32 [0x20])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B]+0 S16 A16])) {vsx_movv16qi_64bit}
     (nil))
(insn 22 21 37 2 (set (reg:V16QI 166 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 48 [0x30])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B]+0 S16 A16])) {vsx_movv16qi_64bit}
     (nil))

insn 22 and insn 31 is merged in the failure case and breaks the code.

 
> Thanks,
> Richard
>

Thanks & Regards
Ajit