From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from gate.crashing.org (gate.crashing.org [63.228.1.57]) by sourceware.org (Postfix) with ESMTP id 0E7F23858C56; Thu, 1 Sep 2022 21:53:34 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 0E7F23858C56 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=kernel.crashing.org Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=kernel.crashing.org Received: from gate.crashing.org (localhost.localdomain [127.0.0.1]) by gate.crashing.org (8.14.1/8.14.1) with ESMTP id 281LqXoL008327; Thu, 1 Sep 2022 16:52:34 -0500 Received: (from segher@localhost) by gate.crashing.org (8.14.1/8.14.1/Submit) id 281LqXWi008326; Thu, 1 Sep 2022 16:52:33 -0500 X-Authentication-Warning: gate.crashing.org: segher set sender to segher@kernel.crashing.org using -f Date: Thu, 1 Sep 2022 16:52:33 -0500 From: Segher Boessenkool To: Jiufu Guo Cc: gcc-patches@gcc.gnu.org, dje.gcc@gmail.com, linkw@gcc.gnu.org, meissner@linux.ibm.com Subject: Re: [PATCH 1/2] Using pli(paddi) and rotate to build 64bit constants Message-ID: <20220901215233.GJ25951@gate.crashing.org> References: <20220901032400.23692-1-guojiufu@linux.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20220901032400.23692-1-guojiufu@linux.ibm.com> User-Agent: Mutt/1.4.2.3i X-Spam-Status: No, score=-3.1 required=5.0 tests=BAYES_00,JMQ_SPF_NEUTRAL,KAM_DMARC_STATUS,KAM_SHORT,SPF_HELO_PASS,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Hi! This patch is a clear improvement :-) On Thu, Sep 01, 2022 at 11:24:00AM +0800, Jiufu Guo wrote: > As mentioned in PR106550, since pli could support 34bits immediate, we could > use less instructions(3insn would be ok) to build 64bits constant with pli. > For example, for constant 0x020805006106003, we could generate it with: > asm code1: > pli 9,101736451 (0x6106003) > sldi 9,9,32 > paddi 9,9, 2130000 (0x0208050) 3 insns, 2 insns dependent on the previous, each. > or asm code2: > pli 10, 2130000 > pli 9, 101736451 > rldimi 9, 10, 32, 0 3 insns, 1 insn dependent on both others. > Testing with simple cases as below, run them a lot of times: > f1.c > long __attribute__ ((noinline)) foo (long *arg,long *,long*) > { > *arg = 0x2351847027482577; > } > 5insns: base > pli+sldi+paddi: similar -0.08% > pli+pli+rldimi: faster +0.66% This mostly tests how well this micro-benchmark is scheduled. More time is spent in the looping and function calls (not shown)! > f2.c > long __attribute__ ((noinline)) foo (long *arg, long *arg2, long *arg3) > { > *arg = 0x2351847027482577; > *arg2 = 0x3257845024384680; > *arg3 = 0x1245abcef9240dec; > } > 5nisns: base > pli+sldi+paddi: faster +1.35% > pli+pli+rldimi: faster +5.49% > > f2.c would be more meaningful. Because 'sched passes' are effective for > f2.c, but 'scheds' do less thing for f1.c. It still is a too small example to mean much without looking at a pipeview, or at the very least perf. But the results show a solid improvement as expected ;-) > gcc/ChangeLog: > PR target/106550 > * config/rs6000/rs6000.cc (rs6000_emit_set_long_const): Add 'pli' for > constant building. "Use pli." ? > gcc/testsuite/ChangeLog: > PR target/106550 > * gcc.target/powerpc/pr106550.c: New test. > + else if (TARGET_PREFIXED) > + { > + /* pli 9,high32 + pli 10,low32 + rldimi 9,10,32,0. */ But not just 9 and 10. Use A and B or X and Y or H and L or something like that? The comment goes... > + if (can_create_pseudo_p ()) > + { ... here. > + temp = gen_reg_rtx (DImode); > + rtx temp1 = gen_reg_rtx (DImode); > + emit_move_insn (copy_rtx (temp), GEN_INT ((ud4 << 16) | ud3)); > + emit_move_insn (copy_rtx (temp1), GEN_INT ((ud2 << 16) | ud1)); > + > + emit_insn (gen_rotldi3_insert_3 (dest, temp, GEN_INT (32), temp1, > + GEN_INT (0xffffffff))); > + } > + No blank line here please. > + /* pli 9,high32 + sldi 9,32 + paddi 9,9,low32. */ > + else > + { The comment goes here, in the block it refers to. Comments for a block are the first thing *in* the block. > + emit_move_insn (copy_rtx (dest), GEN_INT ((ud4 << 16) | ud3)); > + > + emit_move_insn (copy_rtx (dest), > + gen_rtx_ASHIFT (DImode, copy_rtx (dest), > + GEN_INT (32))); > + > + bool can_use_paddi = REGNO (dest) != FIRST_GPR_REGNO; There should be a test that we so the right thing (or *a* right thing, anyway; a working thing; but hopefully a reasonably fast thing) for !can_use_paddi. > --- /dev/null > +++ b/gcc/testsuite/gcc.target/powerpc/pr106550.c > @@ -0,0 +1,14 @@ > +/* PR target/106550 */ > +/* { dg-options "-O2 -std=c99 -mdejagnu-cpu=power10" } */ > + > +void > +foo (unsigned long long *a) > +{ > + *a++ = 0x020805006106003; > + *a++ = 0x2351847027482577; > +} > + > +/* 3 insns for each constant: pli+sldi+paddi or pli+pli+rldimi. > + And 3 additional insns: std+std+blr: 9 insns totally. */ > +/* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9 } } */ Also test the expected insns separately please? The std's (just with \mstd so it will catch all variations as well), the blr, the pli's and the rldimi etc.? We also should test all special cases as well. Especially those that do not happen all over the place, we will notice something is broken there easy enough. But unlikely cases can take years to show up. Okay for trunk with the formatting fixed. Thank you! Segher