From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by sourceware.org (Postfix) with ESMTPS id C77673858D39; Wed, 12 Jun 2024 03:14:44 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org C77673858D39 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=linux.ibm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linux.ibm.com ARC-Filter: OpenARC Filter v1.0.0 sourceware.org C77673858D39 Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=148.163.158.5 ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1718162087; cv=none; b=mZSexuIN2NI4oOMupJGoXU1CFP4g3Trn+yxnd3LH3uF8m9Hwt32mzV5ODJWYUz9v3V32HmDrYsa9KdcmItysozQ1dJX1o+uo+2BWOARQTj0xsib8BkCSIrPqI6J4aiuOAn/jgPSfCpiHAyFQkBKUYaWdlRLbv0dIPbEKQQHs3QM= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1718162087; c=relaxed/simple; bh=mK1wKxzWJoUC2lrfQ7V1Zmz/YrUYfdtc9yCteB7Yrxw=; h=DKIM-Signature:From:To:Subject:Date:Message-ID:MIME-Version; b=FNp/F2x5yoIBHSRBfuCLnINB+4mZS6IIRlMkhhDFBUVvSzrgrkYS4TQpiJHv9Qe2JfBmxdsMJ8+77Hj/aEfXxq80GmRnpZLZwMPS9rUYwt13ao4D6pqEaokg8IgPCRY/IXRUy3oKdiUvYmUH2pCSnup7TabbyRKVwJcq08a0mO0= ARC-Authentication-Results: i=1; server2.sourceware.org Received: from pps.filterd (m0353723.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.18.1.2/8.18.1.2) with ESMTP id 45C1QdAw021314; Wed, 12 Jun 2024 03:14:43 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from :to:cc:subject:references:date:in-reply-to:message-id :mime-version:content-type; s=pp1; bh=VPaAAHK3rnrDN0r61DHJsctzr7 qN19bGokz6cf0yXno=; b=X2zrmnwlvHCGeX9kiQWhqPTHlIHrp9zdHY+B+9vW8P gERnpGyEnZUJ+zHZ+F67Xd87nlDTYdLtt10Au3q7qiazVmQYvrPB8O4SRgN6m+JY pz2CrdNebH8erN/SCKDn8Mljoymze6oJdPEzNUqS9QVA7qOlgfrVVYHDgPdzdyVv TO8Y7F7csi0KFNwG4LywaiuKq+DXoeBLJ/q4W+p7wUXjXrUUo+mMdJF043CO6EcT TKRfY36MPr9/yyqx/b3clyJtxkA6JeN5o8/bmE39rf5AYIm0GKSFBHibTlPheY+G ykmupT8B3ws4x3GNCNb3b2ccQA1WxnOJOrn39w45gumA== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3yq0sa0a5v-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 12 Jun 2024 03:14:43 +0000 (GMT) Received: from m0353723.ppops.net (m0353723.ppops.net [127.0.0.1]) by pps.reinject (8.18.0.8/8.18.0.8) with ESMTP id 45C3EgNu013254; Wed, 12 Jun 2024 03:14:42 GMT Received: from ppma11.dal12v.mail.ibm.com (db.9e.1632.ip4.static.sl-reverse.com [50.22.158.219]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3yq0sa0a5s-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 12 Jun 2024 03:14:42 +0000 (GMT) Received: from pps.filterd (ppma11.dal12v.mail.ibm.com [127.0.0.1]) by ppma11.dal12v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 45C1XweL008716; Wed, 12 Jun 2024 03:14:41 GMT Received: from smtprelay05.wdc07v.mail.ibm.com ([172.16.1.72]) by ppma11.dal12v.mail.ibm.com (PPS) with ESMTPS id 3yn4b3984p-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 12 Jun 2024 03:14:41 +0000 Received: from smtpav02.wdc07v.mail.ibm.com (smtpav02.wdc07v.mail.ibm.com [10.39.53.229]) by smtprelay05.wdc07v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 45C3Ed5L66584968 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 12 Jun 2024 03:14:41 GMT Received: from smtpav02.wdc07v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 1B87E581F1; Wed, 12 Jun 2024 03:14:39 +0000 (GMT) Received: from smtpav02.wdc07v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id B8845581DC; Wed, 12 Jun 2024 03:14:38 +0000 (GMT) Received: from genoa (unknown [9.40.192.157]) by smtpav02.wdc07v.mail.ibm.com (Postfix) with ESMTPS; Wed, 12 Jun 2024 03:14:38 +0000 (GMT) From: Jiufu Guo To: Segher Boessenkool Cc: gcc-patches@gcc.gnu.org, linkw@gcc.gnu.org, dje.gcc@gmail.com, bergner@linux.ibm.com Subject: Re: [PATCH V4 1/2] split complicate 64bit constant to memory References: <20240611083727.2642461-1-guojiufu@linux.ibm.com> <20240611165959.GE19790@gate.crashing.org> Date: Wed, 12 Jun 2024 11:14:27 +0800 In-Reply-To: <20240611165959.GE19790@gate.crashing.org> (Segher Boessenkool's message of "Tue, 11 Jun 2024 11:59:59 -0500") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: m5fa9wBZmYilN3IEsSE1IvQLK3S_Fz6F X-Proofpoint-GUID: p3dQj6JHQ8zm_ypGAeBDg2fCiBPBK-M5 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.293,Aquarius:18.0.1039,Hydra:6.0.680,FMLib:17.12.28.16 definitions=2024-06-11_13,2024-06-11_01,2024-05-17_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 impostorscore=0 lowpriorityscore=0 adultscore=0 clxscore=1015 bulkscore=0 phishscore=0 suspectscore=0 spamscore=0 mlxlogscore=999 malwarescore=0 priorityscore=1501 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.19.0-2405170001 definitions=main-2406120019 X-Spam-Status: No, score=-10.2 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_EF,GIT_PATCH_0,KAM_SHORT,KAM_STOCKGEN,RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Hi Segher, Thanks a lot for your review! Segher Boessenkool writes: > Hi! > > On Tue, Jun 11, 2024 at 04:37:25PM +0800, Jiufu Guo wrote: >> Sometimes, a complicated constant is built via 3(or more) >> instructions. Generally speaking, it would not be as fast >> as loading it from the constant pool (as the discussions in >> PR63281): >> "ld" is one instruction. If consider "address/toc" adjust, >> we may count it as 2 instructions. And "pld" may need fewer >> cycles. > > Yeah, three insns to build up a constant always was about as fast/slow Yeah. > as loading a constant from memory. When you need two related constants > loading from memory is slower, so we preferred building them up. You may talking about the functionality "const_anchor" which is enabled in cse1 for two related constants. Like the test case: const_anchors.c, the first one would be loaded from memory if it is complicated, and the second would be computed based on the first one. > >> As testing(SPEC2017), it could get better/stable runtime >> if set the threshold as "> 2" (compare with "> 3"). > > Interesting! About how much speedup did you see? 0.1%? I did not see the proven speedup from SPEC2017. On some benchs(including perlbench), I saw some runs with speedup >1%. But with deeper checking, it would not directly benefit of this functionality, since it was not hit by the hot functions; and the changed constants do not contribute the runtime visibly. Saying "better/stable", because with different bases(interval weeks/months), ">2" introduce less waving; and with different options (-O2, -O3, -Ofast), ">2" seems more stable. > >> As known, because the constant is load from memory by this >> patch, so this functionality may affect the cache missing. >> Also there may be possible side effects on icach. While, >> IMHO, this patch would be still do the right thing. > > Yeah, but almost every codegen patch has *some* icache effect. Your > benchmarks show a net positive effect, that is all that matters in the > end. Yes. :) As using the small bench (like the cases mentioned in comment20 of PR63281), the result shows the prefer choise would be buiding by "2 insns" > "loading from rodata/toc" > "3 insns", and building by "4/5insn" would be the last choise. I mentioned this here, because in one recent run, for one benchmark, I saw 'caching missing' raising, and 'cache missing' was waving with size of the code (not directly related to this patch.) > >> gcc/ChangeLog: >> >> * config/rs6000/rs6000.cc (rs6000_emit_set_const): Update to split >> complicate constant to memory. > > (complicated) > > And don't write "update", every patch is an update :-) > > There are more conditions here, mention those as well? Thanks! This patch only split complicated constants(>2insns) to memory for -m64 under TARGET_ELF. "Split complicate constant to memory for -m64." > > But, why does base_reg_operand matter? And, why for TARGET_ELF only? > That is the only place you tested I'm sure, but make an educated guess > for the rest, why would it not be useful for other binary formats? Thanks for these very insightful questions! "base_reg_operand" is used to avoid to instruction like "r0=[OFF(r0)]" for the case like "r0=0xXXXXX" (I would have a recheck to see if this is really needed before RA.) Yes, I only tested with ELF. I add ELF checking, just because not sure how profitable it is on other targets. In theory, this functionality would be useful for other targets, at least for the 4/5 insns (witout the aspect of parallel execution in the binary). There is another heuristic code in rs6000_emit_move, it is using: ```(TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)``` This does not check ELF. Levearging this code, it would be ok to accept all targets. This heuristic exist a long time ago, I guess it would be before ppc64le (even more history), it may more for -m32. For -m64, as I tested, ">2" would be suitable. > >> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc >> index e4dc629ddcc..f448df289a0 100644 >> --- a/gcc/config/rs6000/rs6000.cc >> +++ b/gcc/config/rs6000/rs6000.cc >> @@ -10240,6 +10240,21 @@ rs6000_emit_set_const (rtx dest, rtx source) >> c = sext_hwi (c, 32); >> emit_move_insn (lo, GEN_INT (c)); >> } >> + >> + else if (base_reg_operand (dest, mode) && TARGET_64BIT >> + && TARGET_ELF && num_insns_constant (source, mode) > 2) >> + { >> + rtx sym = force_const_mem (mode, source); >> + if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0)) >> + && use_toc_relative_ref (XEXP (sym, 0), mode)) >> + { >> + rtx toc = create_TOC_reference (XEXP (sym, 0), copy_rtx (dest)); >> + sym = gen_const_mem (mode, toc); >> + set_mem_alias_set (sym, get_TOC_alias_set ()); >> + } >> + >> + emit_move_insn (dest, sym); >> + } >> else >> rs6000_emit_set_long_const (dest, c); >> break; > > Is there no utility function to put constants like this in memory? > Like, "output_toc" for example? I guess you mean "force_const_mem" which puts the constant value to constant pool and create "LC" label for it. With "emit_move_insn (dest, force_const_mem (mode, source));" could implement the part of this functionality. While adding code "if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0))..." would save one more instructions. So, the code is drafted like this patch. > >> diff --git a/gcc/testsuite/gcc.target/powerpc/const_anchors.c b/gcc/testsuite/gcc.target/powerpc/const_anchors.c >> index 542e2674b12..f33c9a83f5e 100644 >> --- a/gcc/testsuite/gcc.target/powerpc/const_anchors.c >> +++ b/gcc/testsuite/gcc.target/powerpc/const_anchors.c >> @@ -1,5 +1,5 @@ >> /* { dg-do compile { target has_arch_ppc64 } } */ >> -/* { dg-options "-O2" } */ >> +/* { dg-options "-O2 -fdump-rtl-final" } */ >> >> #define C1 0x2351847027482577ULL >> #define C2 0x2351847027482578ULL >> @@ -17,4 +17,5 @@ void __attribute__ ((noinline)) foo1 (long long *a, long long b) >> *a++ = C2; >> } >> >> -/* { dg-final { scan-assembler-times {\maddi\M} 2 } } */ >> +/* Checking "final" instead checking "asm" output to avoid noise. */ >> +/* { dg-final { scan-rtl-dump-times {\madddi3\M} 2 "final" } } */ > > So, in the RTL dump it finds a named pattern adddi3, while in the asm > you find random other addi insns? Hardly worth a comment in the Yes, other "addi"s for memory address occur in asm. Thanks for your kind review. I would remove this no-go comment. > testcase itself, but heh. The point is that you check for the RTL > pattern name instead of the machine insn. The compiler pass you check > in isn't changed even! > >> --- a/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c >> +++ b/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c >> @@ -6,11 +6,18 @@ >> /* { dg-final { scan-assembler-times {\mori\M} 4 } } */ >> /* { dg-final { scan-assembler-times {\mrldimi\M} 2 } } */ >> >> +/* The below macro helps to avoid loading constant from memory. */ >> +#define CONST_AVOID_BASE_REG(DEST, CST) \ >> + { \ >> + register long long d asm ("r0") = CST; \ >> + asm volatile ("std %1, %0" : : "m"(DEST), "r"(d)); \ >> + } > > This needs an actual changelog, not just "update". Something as simple > as "New macro to force constant into a reg" is fine, but more than > "update" :-) Thanks a lot for point this out! > >> +/* The below macro helps to avoid loading constant from memory. */ >> +#define CONST_AVOID_BASE_REG(DEST, CST) \ >> + { \ >> + register long long d asm ("r0") = CST; \ >> + asm volatile ("std %1, %0" : : "m"(DEST), "r"(d)); \ >> + } > > "Forces the constant into a register", yes. OK, thanks. > >> --- a/gcc/testsuite/gcc.target/powerpc/pr106550_1.c >> +++ b/gcc/testsuite/gcc.target/powerpc/pr106550_1.c >> @@ -4,17 +4,19 @@ >> /* { dg-options "-O2 -mdejagnu-cpu=power10 -fdisable-rtl-split1" } */ >> /* force the constant splitter run after RA: -fdisable-rtl-split1. */ >> >> +/* The below marco helps to avoid using paddi and avoid loading from memory. */ > > (macro) OK. > >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c >> @@ -0,0 +1,11 @@ >> +/* Check loading constant from memory pool. */ >> +/* { dg-options "-O2 -mpowerpc64" } */ >> + >> +void >> +foo (unsigned long long *a) >> +{ >> + *a++ = 0x2351847027482577ULL; >> +} >> + >> +/* { dg-final { scan-assembler-times {\mp?ld\M} 1 { target { lp64 } } } } */ > > Why target lp64 only? You have -mpowerpc64 already, does that not make > us use ld here always? For -m32, we need other code to support it. And the patch for "-m32 -mpowerpc" is the second patch of this series. Here 'lp64' is safe guard for "-m64", in case there is concerns for "-m32" patch. Thanks again for your review!! BR, Jeff(Jiufu) Guo. > > > Segher