From: Jiufu Guo
To: Richard Biener
Cc: Segher Boessenkool, Jeff Law, gcc-patches@gcc.gnu.org, rguenth@gcc.gnu.org,
    pinskia@gcc.gnu.org, linkw@gcc.gnu.org, dje.gcc@gmail.com
Subject: Re: [RFC] propagation leap over memory copy for struct
References: <20221031024235.110995-1-guojiufu@linux.ibm.com>
    <20221101004956.GL25951@gate.crashing.org>
    <7e1qqnwb36.fsf@pike.rch.stglabs.ibm.com>
    <381qr8s3-53n-pr61-7r1n-6q8q71nsqnq@fhfr.qr>
Date: Tue, 08 Nov 2022 12:05:48 +0800
In-Reply-To: <381qr8s3-53n-pr61-7r1n-6q8q71nsqnq@fhfr.qr>
    (Richard Biener's message of "Sat, 5 Nov 2022 15:13:55 +0100 (CET)")
"Sat, 5 Nov 2022 15:13:55 +0100 (CET)") Message-ID: <7esfiuum3n.fsf@pike.rch.stglabs.ibm.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: Z95DRvXtZw5p-PnBPu22tQg8qVSj_Vo8 X-Proofpoint-GUID: iqJ5YtsTSPTAIoVtqLqD3jW1pcLE5vi9 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.219,Aquarius:18.0.895,Hydra:6.0.545,FMLib:17.11.122.1 definitions=2022-11-07_11,2022-11-07_02,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 suspectscore=0 lowpriorityscore=0 adultscore=0 phishscore=0 malwarescore=0 priorityscore=1501 mlxlogscore=999 mlxscore=0 impostorscore=0 clxscore=1011 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2210170000 definitions=main-2211080020 X-Spam-Status: No, score=-5.9 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_EF,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Richard Biener writes: > On Tue, 1 Nov 2022, Jiufu Guo wrote: > >> Segher Boessenkool writes: >>=20 >> > On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote: >> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote: >> >> >We know that for struct variable assignment, memory copy may be used. >> >> >And for memcpy, we may load and store more bytes as possible at one = time. >> >> >While it may be not best here: >> > >> >> So the first question in my mind is can we do better at the gimple=20 >> >> phase?=C2=A0 For the second case in particular can't we just "return = a"=20 >> >> rather than copying a into then returning ?=C2=A0 Th= is feels=20 >> >> a lot like the return value optimization from C++.=C2=A0 I'm not sure= if it=20 >> >> applies to the first case or not, it's been a long time since I looke= d=20 >> >> at NRV optimizations, but it might be worth poking around in there a = bit=20 >> >> (tree-nrv.cc). >> > >> > If it is a bigger struct you end up with quite a lot of stuff in >> > registers. GCC will eventually put that all in memory so it will work >> > out fine in the end, but you are likely to get inefficient code. >> Yes. We may need to use memory to save regiters for big struct. >> Small struct may be practical to use registers. We may leverage the >> idea that: some type of small struct are passing to function through >> registers.=20 >>=20 >> > >> > OTOH, 8 bytes isn't as big as we would want these days, is it? So it >> > would be useful to put smaller temportaries, say 32 bytes and smaller, >> > in registers instead of in memory. >> I think you mean: we should try to registers to avoid memory accesing, >> and using registers would be ok for more bytes memcpy(32bytes). >> Great sugguestion, thanks a lot! >>=20 >> Like below idea: >> [r100:TI, r101:TI] =3D src; //Or r100:OI/OO =3D src; >> dest =3D [r100:TI, r101:TI]; >>=20 >> Currently, for 8bytes structure, we are using TImode for it. >> And subreg/fwprop/cse passes are able to optimize it as expected. >> Two concerns here: larger int modes(OI/OO/..) may be not introduced yet; >> I'm not sure if current infrastructure supports to use two more >> registers for one structure. >>=20 >> > >> >> But even so, these kinds of things are still bound to happen, so it's= =20 >> >> probably worth thinking about if we can do better in RTL as well. >> > >> > Always. 
>> > It is a mistake to think that having better high-level
>> > optimisations means that you don't need good low-level optimisations
>> > anymore: in fact deficiencies there become more glaringly apparent if
>> > the early pipeline opts become better :-)
>> Understood, thanks :)
>>
>> >
>> >> The first thing that comes to my mind is to annotate memcpy calls that
>> >> are structure assignments.  The idea here is that we may want to expand
>> >> a memcpy differently in those cases.  Changing how we expand an opaque
>> >> memcpy call is unlikely to be beneficial in most cases.  But changing
>> >> how we expand a structure copy may be beneficial by exposing the
>> >> underlying field values.  This would roughly correspond to your method
>> >> #1.
>> >>
>> >> Or instead of changing how we expand, teach the optimizers about these
>> >> annotated memcpy calls -- they're just a copy of each field.  That's
>> >> how CSE and the propagators could treat them.  After some point we'd
>> >> lower them in the usual ways, but at least early in the RTL pipeline we
>> >> could keep them as annotated memcpy calls.  This roughly corresponds to
>> >> your second suggestion.
>> >
>> > Ideally this won't ever make it as far as RTL, if the structures do not
>> > need to go via memory.  All high-level optimisations should have been
>> > done earlier, and hopefully it was not expand itself that forced stuff
>> > into memory! :-/
>> Currently, after early gimple optimization, the struct member access
>> may still need to go through memory (if the mode of the struct is BLK).
>> For example:
>>
>>   _Bool foo (const A a) { return a.a[0] > 1.0; }
>>
>> The optimized gimple would be:
>>   _1 = a.a[0];
>>   _3 = _1 > 1.0e+0;
>>   return _3;
>>
>> During expansion to RTL, parm 'a' is first stored to memory from the arg
>> regs, and "a.a[0]" is then read back from memory.  It may be better to use
>> "f1" for "a.a[0]" here.
>>
>> Maybe method3 is similar to your idea: using "parallel:BLK {DF;DF;DF;DF}"
>> for the struct (BLK may be changed), and using 4 DF registers to access
>> the structure in the expand pass.
>
> I think for cases like this it might be a good idea to perform
> SRA-like analysis at RTL expansion time when we know how parameters
> arrive (in pieces) and take that knowledge into account when
> assigning the RTL to a decl.  The same applies for the return ABI.
> Since we rely on RTL to elide copies to/from return/argument
> registers/slots we have to assign "layout compatible" registers
> to the corresponding auto vars.
>

Thanks for pointing this out!  This looks like a kind of SRA, especially
for the parameter and the return value.

As you pointed out, there are some things we may need to take care of:
1. We would use the "layout compatible" mode reg for the scalar,
   e.g. DF for "{double arr[4];}", but DI for "{double arr[3]; long l;}".
2. For an aggregate that will be assigned to the return value, before
   expanding the 'return stmt' we may not be sure whether we need to assign
   'scalar rtl(s)' to the decl.
   To handle this, we may use 'scalar rtl(s)' for all struct decls, as if
   each were a parm or the return result.

Then method3 may be similar to this idea: using "parallel RTL" for the
decl (may use DECL_RTL directly).

Please point out any misunderstandings; any suggestions are welcome.
Thanks again!

BR,
Jeff (Jiufu)

>>
>> Thanks again for your kind and helpful comments!
>>
>> BR,
>> Jeff (Jiufu)
>>
>> >
>> >
>> > Segher
>>
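
As a concrete illustration of the cases discussed in this thread, here is a
minimal sketch.  The layout of struct A and the function names are
assumptions (the struct is modeled on the "{double arr[4];}" shape mentioned
above, with the member named 'a' to match the "a.a[0]" example); the actual
testcases live in the referenced RFC patch and are not reproduced here.

  /* Sketch only: this struct layout is an assumption, not the layout
     from the original patch.  Member 'a' holds four doubles, matching
     the "{double arr[4];}" shape and the "a.a[0]" access quoted above.  */
  typedef struct { double a[4]; } A;

  /* Shape of the struct-assignment/return case: the copy into 'tmp' may
     be expanded as a block (memcpy-style) copy even though the argument
     arrives in FP registers; "return arg;" directly would avoid the
     temporary (the NRV-style question raised above).  */
  A copy_and_return (A arg)
  {
    A tmp = arg;
    return tmp;
  }

  /* The member-access case quoted above: at expand time the parameter is
     spilled to a stack slot and a.a[0] is loaded back from memory instead
     of being taken from the incoming FP register (e.g. f1 on powerpc64).  */
  _Bool first_gt_one (const A a)
  {
    return a.a[0] > 1.0;
  }

Compiling this with -O2 and -fdump-rtl-expand should show whether, on a given
target, the aggregate goes through a stack slot or stays in registers.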