Date: Tue, 14 Jun 2022 17:34:21 -0500
From: Segher Boessenkool <segher@kernel.crashing.org>
To: Jiufu Guo
Cc: gcc-patches@gcc.gnu.org, dje.gcc@gmail.com, linkw@gcc.gnu.org, amodra@gmail.com
Subject: Re: [PATCH V2]rs6000: Store complicated constant into pool
Message-ID: <20220614223421.GB25951@gate.crashing.org>
References: <20220614132355.117887-1-guojiufu@linux.ibm.com>
In-Reply-To: <20220614132355.117887-1-guojiufu@linux.ibm.com>

Hi!

On Tue, Jun 14, 2022 at 09:23:55PM +0800, Jiufu Guo wrote:
> This patch reduces the threshold on the number of instructions for
> storing a constant in the pool, and updates the costs for constant
> building and memory accesses.  If building the constant needs more
> than 2 instructions (or more than 1 instruction on P10), prefer to
> load it from the constant pool.

Have you tried different limits?  And p10 is a red herring: what you
actually test is whether prefixed insns are used.

> 	* config/rs6000/rs6000.cc (rs6000_cannot_force_const_mem):
> 	Exclude rtx with code 'HIGH'.

> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -9706,8 +9706,9 @@ rs6000_init_stack_protect_guard (void)
>  static bool
>  rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
>  {
> -  if (GET_CODE (x) == HIGH
> -      && GET_CODE (XEXP (x, 0)) == UNSPEC)
> +  /* Exclude CONSTANT HIGH part.  e.g.
> +     (high:DI (symbol_ref:DI ("var") [flags 0xc0] )).  */
> +  if (GET_CODE (x) == HIGH)
>      return true;

So, why is this?  You didn't explain.

> @@ -11139,7 +11140,7 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
>  		   && FP_REGNO_P (REGNO (operands[0])))
>  	       || !CONST_INT_P (operands[1])
>  	       || (num_insns_constant (operands[1], mode)
> -		   > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)))
> +		   > (TARGET_PREFIXED ? 1 : 2)))
>  	   && !toc_relative_expr_p (operands[1], false, NULL, NULL)
>  	   && (TARGET_CMODEL == CMODEL_SMALL
>  	       || can_create_pseudo_p ()

This is the more obvious part.
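To make the limit concrete (this is my own illustration, not something
taken from your patch or its testsuite): take an arbitrary 64-bit
constant with no exploitable structure, e.g.

  /* Hypothetical example, not from the patch's testsuite: an arbitrary
     64-bit constant with no exploitable structure.  */
  unsigned long long
  full64 (void)
  {
    return 0x1234567887654321ULL;
  }

Without prefixed insns this needs, as far as I can see, the full
five-instruction lis/ori/sldi/oris/ori sequence, so with your new limit
(more than 2 insns, or more than 1 insn when prefixed insns are
available) it presumably would be loaded from the constant pool with ld
or pld instead.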
> @@ -22101,6 +22102,14 @@ rs6000_rtx_costs (rtx x, machine_mode mode, int outer_code,
> 
>      case CONST_DOUBLE:
>      case CONST_WIDE_INT:
> +      /* It may needs a few insns for const to SET. -1 for outer SET code.  */
> +      if (outer_code == SET)
> +	{
> +	  *total = COSTS_N_INSNS (num_insns_constant (x, mode)) - 1;
> +	  return true;
> +	}
> +      /* FALLTHRU */
> +
>      case CONST:
>      case HIGH:
>      case SYMBOL_REF:

But this again isn't an obvious improvement at all, and it needs
performance testing -- separately from the other changes.

> @@ -22110,8 +22119,12 @@ rs6000_rtx_costs (rtx x, machine_mode mode, int outer_code,
>      case MEM:
>        /* When optimizing for size, MEM should be slightly more expensive
>  	  than generating address, e.g., (plus (reg) (const)).
> -	  L1 cache latency is about two instructions.  */
> -      *total = !speed ? COSTS_N_INSNS (1) + 1 : COSTS_N_INSNS (2);
> +	  L1 cache latency is about two instructions.
> +	  For prefixed load (pld), we would set it slightly faster than
> +	  than two instructions.  */
> +      *total = !speed
> +		 ? COSTS_N_INSNS (1) + 1
> +		 : TARGET_PREFIXED ? COSTS_N_INSNS (2) - 1 : COSTS_N_INSNS (2);
>        if (rs6000_slow_unaligned_access (mode, MEM_ALIGN (x)))
>  	*total += COSTS_N_INSNS (100);
>        return true;

And this is completely independent of the rest as well.  Costs of 5 or
7 (COSTS_N_INSNS (n) is 4*n, so COSTS_N_INSNS (1) + 1 is 5 and
COSTS_N_INSNS (2) - 1 is 7) are completely random numbers; why did you
pick these?  Does it work better than 8, or 4, etc.?

> --- a/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> +++ b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> @@ -1,7 +1,7 @@
>  /* { dg-do compile { target { powerpc*-*-* } } } */
>  /* { dg-require-effective-target lp64 } */
>  /* { dg-options "-O" } */
> -/* { dg-final { scan-assembler-not "\\+4611686018427387904" } } */
> +/* { dg-final { scan-assembler-times {\msldi|pld\M} 1 } } */

Why?  This constant is still better generated in code, no?  It should
never be loaded from a constant pool: it is hex 4000_0000_0000_0000,
easy to construct with just one or two insns (li followed by sldi, for
example).

> --- a/gcc/testsuite/gcc.target/powerpc/pr93012.c
> +++ b/gcc/testsuite/gcc.target/powerpc/pr93012.c
> @@ -10,4 +10,4 @@ unsigned long long mskh1() { return 0xffff9234ffff9234ULL; }
>  unsigned long long mskl1() { return 0x2bcdffff2bcdffffULL; }
>  unsigned long long mskse() { return 0xffff1234ffff1234ULL; }
>  
> -/* { dg-final { scan-assembler-times {\mrldimi\M} 7 } } */
> +/* { dg-final { scan-assembler-times {\mrldimi|ld|pld\M} 7 } } */

Please make this the exact number of times you want to see rldimi, plus
a separate count for the loads you expect (see the p.s. for a sketch).
Btw, you need to write \m(?:rldimi|ld|pld)\M, or it will mean "\mrldimi"
or "ld" or "pld\M" (and that bare "ld" will match anything that "pld\M"
will match, of course).

So no doubt this will improve things, but we need testing of each part
separately.  Also look at code size, or at differences in the generated
code in general: that is much more sensitive than performance, and it
is not itself sensitive to things like system load, so it a) is easier
to measure, and b) has more useful output, output that tells more of
the whole story.


Segher
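p.s. For the pr93012.c check, something along these lines is what I am
asking for -- the counts below are just placeholders, the real numbers
have to match whatever your patch actually generates for those
functions:

  /* Placeholder counts; replace them with the exact expected numbers.  */
  /* { dg-final { scan-assembler-times {\mrldimi\M} 5 } } */
  /* { dg-final { scan-assembler-times {\m(?:ld|pld)\M} 2 } } */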