From: Jiufu Guo
To: Richard Biener
Cc: GCC Patches, David Edelsohn, Segher Boessenkool, linkw@gcc.gnu.org
Subject: Re: [RFC]rs6000: split complicated constant to memory
References: <20220815052519.194582-1-guojiufu@linux.ibm.com> <7ek078ludi.fsf@pike.rch.stglabs.ibm.com>
Date: Tue, 16 Aug 2022 14:45:12 +0800
In-Reply-To: <7ek078ludi.fsf@pike.rch.stglabs.ibm.com> (Jiufu Guo's message of "Tue, 16 Aug 2022 11:50:17 +0800")
Message-ID: <7ev8qsk7pj.fsf@pike.rch.stglabs.ibm.com>

Jiufu Guo writes:

> Hi,
>
> Richard Biener writes:
>
>> On Mon, Aug 15, 2022 at 7:26 AM Jiufu Guo via Gcc-patches
>> wrote:
>>>
>>> Hi,
>>>
>>> This patch tries to put the constant into the constant pool if building
>>> the constant requires 3 or more instructions.
>>>
>>> But there is a concern: I'm wondering if this patch is really profitable.
>>>
>>> Because, as I tested, 1. for a simple case, if the instructions are not
>>> run in parallel, loading the constant from memory may be faster; but 2. if
>>> some instructions could run in parallel, loading the constant from memory
>>> is not a win compared with building the constant.  See the examples below.
>>>
>>> For f1.c and f3.c, 'loading' the constant would be acceptable in terms of
>>> runtime; for f2.c and f4.c, 'loading' the constant is visibly slower.
>>>
>>> For real-world cases, both kinds of code sequences exist.
>>>
>>> So, I'm not sure if we need to push this patch.
>>>
>>> To check the runtime, I ran the functions below many times (1000000000).
>>> f1.c:
>>> long foo (long *arg, long*, long *)
>>> {
>>>   *arg = 0x1234567800000000;
>>> }
>>> asm building constant:
>>>         lis 10,0x1234
>>>         ori 10,10,0x5678
>>>         sldi 10,10,32
>>> vs. asm loading
>>>         addis 10,2,.LC0@toc@ha
>>>         ld 10,.LC0@toc@l(10)
>>> The runtimes of 'building' and 'loading' are similar: sometimes
>>> 'building' is faster; sometimes 'loading' is faster.  And the difference
>>> is slight.
>>
>> I wonder if it is possible to decide this during scheduling - choose the
>> variant that, when the result is needed, is cheaper?  Post-RA might
>> be a bit difficult (I see the load from memory needs the TOC, but then
>> when the TOC is not available we could just always emit the build form),
>> and pre-reload precision might not be good enough to make this worth
>> the experiment?
> Thanks a lot for your comments!
>
> Yes, Post-RA may not handle all cases.
> If there is no TOC available, we are not able to load the constant through
> the TOC.  As Segher pointed out, crtl->uses_const_pool may be a way to
> approximate this.
> The sched2 pass could optimize some cases (e.g. f2.c and f4.c), but for
> some cases, it may not distribute those 'building' instructions.
>
> So, maybe we add a peephole after sched2.  If the five instructions
> that build the constant are still consecutive, then replace them with a
> 'load' (need to check that the TOC is available).
> But I'm not sure if it is worthwhile.

Oh, after checking the object files (from GCC bootstrap and SPEC), it is rare
that the five instructions are consecutive.  Often 1 (or 2) insns are
scheduled elsewhere, and the other 4 (or 3) instructions are consecutive.
So, using a peephole may not be very helpful.

BR,
Jeff(Jiufu)
>
>>
>> Of course the scheduler might lack on the technical side as well.
>
>
> BR,
> Jeff(Jiufu)
>
>>
>>>
>>> f2.c
>>> long foo (long *arg, long *arg2, long *arg3)
>>> {
>>>   *arg = 0x1234567800000000;
>>>   *arg2 = 0x7965234700000000;
>>>   *arg3 = 0x4689123700000000;
>>> }
>>> asm building constant:
>>>         lis 7,0x1234
>>>         lis 10,0x7965
>>>         lis 9,0x4689
>>>         ori 7,7,0x5678
>>>         ori 10,10,0x2347
>>>         ori 9,9,0x1237
>>>         sldi 7,7,32
>>>         sldi 10,10,32
>>>         sldi 9,9,32
>>> vs. loading
>>>         addis 7,2,.LC0@toc@ha
>>>         addis 10,2,.LC1@toc@ha
>>>         addis 9,2,.LC2@toc@ha
>>>         ld 7,.LC0@toc@l(7)
>>>         ld 10,.LC1@toc@l(10)
>>>         ld 9,.LC2@toc@l(9)
>>> For this case, 'loading' is always slower than 'building' (>15%).
>>>
>>> f3.c
>>> long foo (long *arg, long *, long *)
>>> {
>>>   *arg = 384307168202282325;
>>> }
>>>         lis 10,0x555
>>>         ori 10,10,0x5555
>>>         sldi 10,10,32
>>>         oris 10,10,0x5555
>>>         ori 10,10,0x5555
>>> For this case, 'building' (through 5 instructions) is slower, and
>>> 'loading' is ~5% faster.
>>>
>>> f4.c
>>> long foo (long *arg, long *arg2, long *arg3)
>>> {
>>>   *arg = 384307168202282325;
>>>   *arg2 = -6148914691236517205;
>>>   *arg3 = 768614336404564651;
>>> }
>>>         lis 7,0x555
>>>         lis 10,0xaaaa
>>>         lis 9,0xaaa
>>>         ori 7,7,0x5555
>>>         ori 10,10,0xaaaa
>>>         ori 9,9,0xaaaa
>>>         sldi 7,7,32
>>>         sldi 10,10,32
>>>         sldi 9,9,32
>>>         oris 7,7,0x5555
>>>         oris 10,10,0xaaaa
>>>         oris 9,9,0xaaaa
>>>         ori 7,7,0x5555
>>>         ori 10,10,0xaaab
>>>         ori 9,9,0xaaab
>>> For this case, since the 'building' instructions can run in parallel,
>>> 'loading' is slower: ~8%.  On p10, 'loading' (through 'pld') is also
>>> slower: >4%.
>>>
>>>
>>> BR,
>>> Jeff(Jiufu)
>>>
>>> ---
>>>  gcc/config/rs6000/rs6000.cc                | 14 ++++++++++++++
>>>  gcc/testsuite/gcc.target/powerpc/pr63281.c | 11 +++++++++++
>>>  2 files changed, 25 insertions(+)
>>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
>>>
>>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>>> index 4b727d2a500..3798e11bdbc 100644
>>> --- a/gcc/config/rs6000/rs6000.cc
>>> +++ b/gcc/config/rs6000/rs6000.cc
>>> @@ -10098,6 +10098,20 @@ rs6000_emit_set_const (rtx dest, rtx source)
>>>           c = ((c & 0xffffffff) ^ 0x80000000) - 0x80000000;
>>>           emit_move_insn (lo, GEN_INT (c));
>>>         }
>>> +      else if (base_reg_operand (dest, mode)
>>> +               && num_insns_constant (source, mode) > 2)
>>> +        {
>>> +          rtx sym = force_const_mem (mode, source);
>>> +          if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0))
>>> +              && use_toc_relative_ref (XEXP (sym, 0), mode))
>>> +            {
>>> +              rtx toc = create_TOC_reference (XEXP (sym, 0), copy_rtx (dest));
>>> +              sym = gen_const_mem (mode, toc);
>>> +              set_mem_alias_set (sym, get_TOC_alias_set ());
>>> +            }
>>> +
>>> +          emit_insn (gen_rtx_SET (dest, sym));
>>> +        }
>>>        else
>>>          rs6000_emit_set_long_const (dest, c);
>>>        break;
>>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c b/gcc/testsuite/gcc.target/powerpc/pr63281.c
>>> new file mode 100644
>>> index 00000000000..469a8f64400
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
>>> @@ -0,0 +1,11 @@
>>> +/* PR target/63281 */
>>> +/* { dg-do compile { target lp64 } } */
>>> +/* { dg-options "-O2 -std=c99" } */
>>> +
>>> +void
>>> +foo (unsigned long long *a)
>>> +{
>>> +  *a = 0x020805006106003;
>>> +}
>>> +
>>> +/* { dg-final { scan-assembler-times {\mp?ld\M} 1 } } */
>>> --
>>> 2.17.1
>>>