From: Jiufu Guo
To: Richard Biener
Cc: Segher Boessenkool, Jeff Law, gcc-patches@gcc.gnu.org, rguenth@gcc.gnu.org,
    pinskia@gcc.gnu.org, linkw@gcc.gnu.org, dje.gcc@gmail.com
Subject: Re: [RFC] propagation leap over memory copy for struct
References: <20221031024235.110995-1-guojiufu@linux.ibm.com>
    <20221101004956.GL25951@gate.crashing.org>
    <7e1qqnwb36.fsf@pike.rch.stglabs.ibm.com>
    <381qr8s3-53n-pr61-7r1n-6q8q71nsqnq@fhfr.qr>
Date: Tue, 08 Nov 2022 12:05:48 +0800
In-Reply-To: <381qr8s3-53n-pr61-7r1n-6q8q71nsqnq@fhfr.qr>
    (Richard Biener's message of "Sat, 5 Nov 2022 15:13:55 +0100 (CET)")
"Sat, 5 Nov 2022 15:13:55 +0100 (CET)") Message-ID: <7esfiuum3n.fsf@pike.rch.stglabs.ibm.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: Z95DRvXtZw5p-PnBPu22tQg8qVSj_Vo8 X-Proofpoint-GUID: iqJ5YtsTSPTAIoVtqLqD3jW1pcLE5vi9 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.219,Aquarius:18.0.895,Hydra:6.0.545,FMLib:17.11.122.1 definitions=2022-11-07_11,2022-11-07_02,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 suspectscore=0 lowpriorityscore=0 adultscore=0 phishscore=0 malwarescore=0 priorityscore=1501 mlxlogscore=999 mlxscore=0 impostorscore=0 clxscore=1011 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2210170000 definitions=main-2211080020 X-Spam-Status: No, score=-5.9 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_EF,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Richard Biener writes: > On Tue, 1 Nov 2022, Jiufu Guo wrote: > >> Segher Boessenkool writes: >>=20 >> > On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote: >> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote: >> >> >We know that for struct variable assignment, memory copy may be used. >> >> >And for memcpy, we may load and store more bytes as possible at one = time. >> >> >While it may be not best here: >> > >> >> So the first question in my mind is can we do better at the gimple=20 >> >> phase?=C2=A0 For the second case in particular can't we just "return = a"=20 >> >> rather than copying a into then returning ?=C2=A0 Th= is feels=20 >> >> a lot like the return value optimization from C++.=C2=A0 I'm not sure= if it=20 >> >> applies to the first case or not, it's been a long time since I looke= d=20 >> >> at NRV optimizations, but it might be worth poking around in there a = bit=20 >> >> (tree-nrv.cc). >> > >> > If it is a bigger struct you end up with quite a lot of stuff in >> > registers. GCC will eventually put that all in memory so it will work >> > out fine in the end, but you are likely to get inefficient code. >> Yes. We may need to use memory to save regiters for big struct. >> Small struct may be practical to use registers. We may leverage the >> idea that: some type of small struct are passing to function through >> registers.=20 >>=20 >> > >> > OTOH, 8 bytes isn't as big as we would want these days, is it? So it >> > would be useful to put smaller temportaries, say 32 bytes and smaller, >> > in registers instead of in memory. >> I think you mean: we should try to registers to avoid memory accesing, >> and using registers would be ok for more bytes memcpy(32bytes). >> Great sugguestion, thanks a lot! >>=20 >> Like below idea: >> [r100:TI, r101:TI] =3D src; //Or r100:OI/OO =3D src; >> dest =3D [r100:TI, r101:TI]; >>=20 >> Currently, for 8bytes structure, we are using TImode for it. >> And subreg/fwprop/cse passes are able to optimize it as expected. >> Two concerns here: larger int modes(OI/OO/..) may be not introduced yet; >> I'm not sure if current infrastructure supports to use two more >> registers for one structure. >>=20 >> > >> >> But even so, these kinds of things are still bound to happen, so it's= =20 >> >> probably worth thinking about if we can do better in RTL as well. >> > >> > Always. 
>> > It is a mistake to think that having better high-level
>> > optimisations means that you don't need good low-level optimisations
>> > anymore: in fact deficiencies there become more glaringly apparent if
>> > the early pipeline opts become better :-)
>> Understood, thanks :)
>>
>> >
>> >> The first thing that comes to my mind is to annotate memcpy calls that
>> >> are structure assignments.  The idea here is that we may want to expand
>> >> a memcpy differently in those cases.  Changing how we expand an opaque
>> >> memcpy call is unlikely to be beneficial in most cases.  But changing
>> >> how we expand a structure copy may be beneficial by exposing the
>> >> underlying field values.  This would roughly correspond to your method
>> >> #1.
>> >>
>> >> Or instead of changing how we expand, teach the optimizers about these
>> >> annotated memcpy calls -- they're just a copy of each field.  That's
>> >> how CSE and the propagators could treat them.  After some point we'd
>> >> lower them in the usual ways, but at least early in the RTL pipeline we
>> >> could keep them as annotated memcpy calls.  This roughly corresponds to
>> >> your second suggestion.
>> >
>> > Ideally this won't ever make it as far as RTL, if the structures do not
>> > need to go via memory.  All high-level optimisations should have been
>> > done earlier, and hopefully it was not expand itself that forced stuff
>> > into memory! :-/
>> Currently, after early gimple optimization, the struct member access
>> may still need to go through memory (if the mode of the struct is BLK).
>> For example:
>>
>>   _Bool foo (const A a) { return a.a[0] > 1.0; }
>>
>> The optimized gimple would be:
>>   _1 = a.a[0];
>>   _3 = _1 > 1.0e+0;
>>   return _3;
>>
>> During expansion to RTL, parm 'a' is first stored to memory from the arg
>> regs, and "a.a[0]" is then read back from memory.  It may be better to use
>> "f1" for "a.a[0]" here.
>>
>> Maybe method3 is similar to your idea: using "parallel:BLK {DF;DF;DF;DF}"
>> for the struct (BLK may be changed), and using 4 DF registers to access
>> the structure in the expand pass.
>
> I think for cases like this it might be a good idea to perform
> SRA-like analysis at RTL expansion time when we know how parameters
> arrive (in pieces) and take that knowledge into account when
> assigning the RTL to a decl.  The same applies for the return ABI.
> Since we rely on RTL to elide copies to/from return/argument
> registers/slots we have to assign "layout compatible" registers
> to the corresponding auto vars.
>

Thanks for pointing this out!  This looks like a kind of SRA, especially
for the parameter and the return value.

As you pointed out, there are some things we may need to take care of:
1. We would use the "layout compatible" mode reg for the scalar,
   e.g. DF for "{double arr[4];}", but DI for "{double arr[3]; long l;}".
2. For an aggregate that will be assigned to the return value, before
   expanding the 'return stmt' we may not be sure whether we need to assign
   'scalar rtl(s)' to the decl.
   To handle this, we may use 'scalar rtl(s)' for all struct decls, as if
   each were a parm or the return result.

Then method3 may be similar to this idea: using "parallel RTL" for the
decl (may use DECL_RTL directly).

Please point out any misunderstandings; any suggestions are welcome.
Thanks again!

BR,
Jeff (Jiufu)

>>
>> Thanks again for your kind and helpful comments!
>>
>> BR,
>> Jeff (Jiufu)
>>
>> >
>> >
>> > Segher
>>
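
As a concrete illustration of the cases discussed in this thread, here is a
minimal sketch.  The layout of struct A and the function names are
assumptions (the struct is modeled on the "{double arr[4];}" shape mentioned
above, with the member named 'a' to match the "a.a[0]" example); the actual
testcases live in the referenced RFC patch and are not reproduced here.

  /* Sketch only: this struct layout is an assumption, not the layout
     from the original patch.  Member 'a' holds four doubles, matching
     the "{double arr[4];}" shape and the "a.a[0]" access quoted above.  */
  typedef struct { double a[4]; } A;

  /* Shape of the struct-assignment/return case: the copy into 'tmp' may
     be expanded as a block (memcpy-style) copy even though the argument
     arrives in FP registers; "return arg;" directly would avoid the
     temporary (the NRV-style question raised above).  */
  A copy_and_return (A arg)
  {
    A tmp = arg;
    return tmp;
  }

  /* The member-access case quoted above: at expand time the parameter is
     spilled to a stack slot and a.a[0] is loaded back from memory instead
     of being taken from the incoming FP register (e.g. f1 on powerpc64).  */
  _Bool first_gt_one (const A a)
  {
    return a.a[0] > 1.0;
  }

Compiling this with -O2 and -fdump-rtl-expand should show whether, on a given
target, the aggregate goes through a stack slot or stays in registers.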