From: Jiufu Guo
To: Jiufu Guo via Gcc-patches
Cc: Richard Biener, Segher Boessenkool, Jeff Law, rguenth@gcc.gnu.org, pinskia@gcc.gnu.org, linkw@gcc.gnu.org, dje.gcc@gmail.com
Subject: Re: [RFC] propgation leap over memory copy for struct
References: <20221031024235.110995-1-guojiufu@linux.ibm.com> <20221101004956.GL25951@gate.crashing.org> <7e1qqnwb36.fsf@pike.rch.stglabs.ibm.com> <381qr8s3-53n-pr61-7r1n-6q8q71nsqnq@fhfr.qr> <7esfiuum3n.fsf@pike.rch.stglabs.ibm.com>
Date: Wed, 09 Nov 2022 15:51:52 +0800
In-Reply-To: <7esfiuum3n.fsf@pike.rch.stglabs.ibm.com> (Jiufu Guo via Gcc-patches's message of "Tue, 08 Nov 2022 12:05:48 +0800")
Message-ID: <7eleoktvjb.fsf@pike.rch.stglabs.ibm.com>
Jiufu Guo via Gcc-patches writes:

> Richard Biener writes:
>
>> On Tue, 1 Nov 2022, Jiufu Guo wrote:
>>
>>> Segher Boessenkool writes:
>>>
>>> > On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
>>> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>>> >> > We know that for struct variable assignment, a memory copy may be used.
>>> >> > And for memcpy, we may load and store as many bytes as possible at one time,
>>> >> > while that may not be best here:
>>> >
>>> >> So the first question in my mind is: can we do better at the gimple
>>> >> phase?  For the second case in particular, can't we just "return a"
>>> >> rather than copying a into <retval> then returning <retval>?  This feels
>>> >> a lot like the return value optimization from C++.  I'm not sure if it
>>> >> applies to the first case or not; it's been a long time since I looked
>>> >> at NRV optimizations, but it might be worth poking around in there a bit
>>> >> (tree-nrv.cc).
>>> >
>>> > If it is a bigger struct you end up with quite a lot of stuff in
>>> > registers.  GCC will eventually put that all in memory so it will work
>>> > out fine in the end, but you are likely to get inefficient code.
>>> Yes. We may need to use memory to save registers for a big struct.
>>> For a small struct it may be practical to use registers. We may leverage
>>> the idea that some kinds of small struct are passed to functions through
>>> registers.
>>>
>>> > OTOH, 8 bytes isn't as big as we would want these days, is it? So it
>>> > would be useful to put smaller temporaries, say 32 bytes and smaller,
>>> > in registers instead of in memory.
>>> I think you mean: we should try to use registers to avoid memory accesses,
>>> and using registers would be ok for larger memcpys (32 bytes).
>>> Great suggestion, thanks a lot!
>>>
>>> Like the idea below:
>>> [r100:TI, r101:TI] = src;  // Or r100:OI/OO = src;
>>> dest = [r100:TI, r101:TI];
>>>
>>> Currently, for an 8-byte structure, we are using TImode for it,
>>> and the subreg/fwprop/cse passes are able to optimize it as expected.
>>> Two concerns here: the larger int modes (OI/OO/..) may not be introduced yet,
>>> and I'm not sure if the current infrastructure supports using two or more
>>> registers for one structure.
>>>
>>> >> But even so, these kinds of things are still bound to happen, so it's
>>> >> probably worth thinking about if we can do better in RTL as well.
>>> >
>>> > Always.
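[Editorial note: the register-temporary idea discussed above can be sketched at the source level. The struct and function names below are hypothetical illustrations, not from the patch; the point is that a 16-byte assignment can already be expanded as one wide (TImode) load/store, while a 32-byte copy is the case the thread proposes to keep in two or more registers.]

```c
#include <string.h>

/* Hypothetical 16-byte struct: the assignment below is small enough
   that the compiler may expand it as a single TImode (128-bit)
   load/store pair instead of calling memcpy.  */
struct pair { long a; long b; };

struct pair
copy_pair (const struct pair *src)
{
  struct pair dst;
  dst = *src;   /* struct assignment; lowered to a block move */
  return dst;
}

/* A 32-byte struct: today this copy is more likely to go through
   memory.  The thread's proposal is to keep it in registers, roughly
   [r100:TI, r101:TI] = src; dest = [r100:TI, r101:TI];  */
struct quad { long x[4]; };

struct quad
copy_quad (const struct quad *src)
{
  struct quad dst;
  memcpy (&dst, src, sizeof dst);   /* equivalent block copy */
  return dst;
}
```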
>>> > It is a mistake to think that having better high-level
>>> > optimisations means that you don't need good low-level optimisations
>>> > anymore: in fact deficiencies there become more glaringly apparent if
>>> > the early pipeline opts become better :-)
>>> Understood, thanks :)
>>>
>>> >> The first thing that comes to my mind is to annotate memcpy calls that
>>> >> are structure assignments.  The idea here is that we may want to expand
>>> >> a memcpy differently in those cases.  Changing how we expand an opaque
>>> >> memcpy call is unlikely to be beneficial in most cases.  But changing
>>> >> how we expand a structure copy may be beneficial by exposing the
>>> >> underlying field values.  This would roughly correspond to your method
>>> >> #1.
>>> >>
>>> >> Or instead of changing how we expand, teach the optimizers about these
>>> >> annotated memcpy calls -- they're just a copy of each field.  That's
>>> >> how CSE and the propagators could treat them.  After some point we'd
>>> >> lower them in the usual ways, but at least early in the RTL pipeline we
>>> >> could keep them as annotated memcpy calls.  This roughly corresponds to
>>> >> your second suggestion.
>>> >
>>> > Ideally this won't ever make it as far as RTL, if the structures do not
>>> > need to go via memory. All high-level optimisations should have been
>>> > done earlier, and hopefully it was not expand itself that forced stuff
>>> > into memory! :-/
>>> Currently, after early gimple optimization, the struct member accesses
>>> may still need to be in memory (if the mode of the struct is BLK).
>>> For example:
>>>
>>> _Bool foo (const A a) { return a.a[0] > 1.0; }
>>>
>>> The optimized gimple would be:
>>> _1 = a.a[0];
>>> _3 = _1 > 1.0e+0;
>>> return _3;
>>>
>>> During expand to RTL, parm 'a' is stored to memory from the arg regs first,
>>> and "a.a[0]" is also read from memory. It may be better to use
>>> "f1" for "a.a[0]" here.
>>>
>>> Maybe method3 is similar to your idea: using "parallel:BLK {DF;DF;DF;DF}"
>>> for the struct (BLK may be changed), and using 4 DF registers to access
>>> the structure in the expand pass.
>>
>> I think for cases like this it might be a good idea to perform
>> SRA-like analysis at RTL expansion time when we know how parameters
>> arrive (in pieces) and take that knowledge into account when
>> assigning the RTL to a decl.  The same applies for the return ABI.
>> Since we rely on RTL to elide copies to/from return/argument
>> registers/slots we have to assign "layout compatible" registers
>> to the corresponding auto vars.

In other words, for this kind of parameter, we create a few scalars for
the pieces, and accesses to the parameter are expanded to accesses to
the scalars accordingly. This would also avoid memory accesses for the
parameter.

Maybe we could use something like "parallel:M? {DF;DF;DF;DF}" or
"parallel:M? {DI;DI;DI;DI}" to group the scalars in DECL_RTL. For this,
we would need to support 'move'/'access' on these sub-RTLs.

Any more suggestions? Thanks.

BR,
Jeff(Jiufu)

> Thanks for pointing this out!
> This looks like a kind of SRA, especially for parm and return value.
> As you pointed out, there are some things that we may need to take care
> to adjust:
> 1. We would use the "layout compatible" mode reg for the scalar, e.g.
>    DF for "{double arr[4];}", but DI for "{double arr[3]; long l;}".
>
> 2. For an aggregate that will be assigned to the return value, before
>    expanding the 'return stmt', we may not be sure if we need to assign
>    'scalar rtl(s)' to the decl.
>    To handle this issue, we may use 'scalar rtl(s)' for every struct decl
>    as if it were a parm or return result.
>    Then method3 may be similar to this idea: using "parallel RTL" for
>    the decl (may use DECL_RTL directly).
>
> Please point out any misunderstandings or suggestions.
> Thanks again!
>
> BR,
> Jeff(Jiufu)

>>> Thanks again for your kind and helpful comments!
>>>
>>> BR,
>>> Jeff(Jiufu)
>>>
>>> > Segher
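[Editorial note: the foo example quoted in the thread can be made self-contained. The typedef below is an assumption on my part, inferred from the "4 DF registers" and "f1" remarks (a homogeneous aggregate of four doubles); it is not given in the original mail.]

```c
/* Self-contained version of the thread's example.  Under the powerpc64le
   ELFv2 ABI, a homogeneous aggregate of four doubles is passed in FP
   registers f1..f4; the discussion is about expanding a.a[0] directly
   from f1 instead of spilling the whole parameter to the stack and
   reloading it.  */
typedef struct { double a[4]; } A;

_Bool
foo (const A a)
{
  return a.a[0] > 1.0;
}
```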