From: Richard Biener
Date: Mon, 15 Aug 2022 10:07:52 +0200
Subject: Re: [RFC]rs6000: split complicated constant to memory
To: Jiufu Guo
Cc: GCC Patches, David Edelsohn, Segher Boessenkool, linkw@gcc.gnu.org
In-Reply-To: <20220815052519.194582-1-guojiufu@linux.ibm.com>
List-Id: Gcc-patches mailing list

On Mon, Aug 15, 2022 at 7:26 AM Jiufu Guo via Gcc-patches wrote:
>
> Hi,
>
> This patch tries to put the constant into constant pool if building the
> constant requires 3 or more instructions.
>
> But there is a concern: I'm wondering if this patch is really profitable.
>
> Because, as I tested, 1. for simple case, if instructions are not been run
> in parallel, loading constant from memory maybe faster; but 2. if there
> are some instructions could run in parallel, loading constant from memory
> are not win comparing with building constant.  As below examples.
>
> For f1.c and f3.c, 'loading' constant would be acceptable in runtime aspect;
> for f2.c and f4.c, 'loading' constant are visibly slower.
>
> For real-world cases, both kinds of code sequences exist.
>
> So, I'm not sure if we need to push this patch.
>
> Run a lot of times (1000000000) below functions to check runtime.
> f1.c:
> long foo (long *arg, long*, long *)
> {
>   *arg = 0x1234567800000000;
> }
> asm building constant:
>         lis 10,0x1234
>         ori 10,10,0x5678
>         sldi 10,10,32
> vs.  asm loading
>         addis 10,2,.LC0@toc@ha
>         ld 10,.LC0@toc@l(10)
> The runtime between 'building' and 'loading' are similar: some times the
> 'building' is faster; sometimes 'loading' is faster. And the difference is
> slight.

I wonder if it is possible to decide this during scheduling - choose the
variant that, when the result is needed, is cheaper?
Post-RA might be a bit difficult (I see the load from memory needs the TOC,
but then when the TOC is not available we could just always emit the build
form), and pre-reload precision might not be good enough to make this worth
the experiment?  Of course the scheduler might fall short on the technical
side as well.

>
> f2.c
> long foo (long *arg, long *arg2, long *arg3)
> {
>   *arg = 0x1234567800000000;
>   *arg2 = 0x7965234700000000;
>   *arg3 = 0x4689123700000000;
> }
> asm building constant:
>         lis 7,0x1234
>         lis 10,0x7965
>         lis 9,0x4689
>         ori 7,7,0x5678
>         ori 10,10,0x2347
>         ori 9,9,0x1237
>         sldi 7,7,32
>         sldi 10,10,32
>         sldi 9,9,32
> vs. loading
>         addis 7,2,.LC0@toc@ha
>         addis 10,2,.LC1@toc@ha
>         addis 9,2,.LC2@toc@ha
>         ld 7,.LC0@toc@l(7)
>         ld 10,.LC1@toc@l(10)
>         ld 9,.LC2@toc@l(9)
> For this case, 'loading' is always slower than 'building' (>15%).
>
> f3.c
> long foo (long *arg, long *, long *)
> {
>   *arg = 384307168202282325;
> }
>         lis 10,0x555
>         ori 10,10,0x5555
>         sldi 10,10,32
>         oris 10,10,0x5555
>         ori 10,10,0x5555
> For this case, 'building' (through 5 instructions) are slower, and 'loading'
> is faster ~5%;
>
> f4.c
> long foo (long *arg, long *arg2, long *arg3)
> {
>   *arg = 384307168202282325;
>   *arg2 = -6148914691236517205;
>   *arg3 = 768614336404564651;
> }
>         lis 7,0x555
>         lis 10,0xaaaa
>         lis 9,0xaaa
>         ori 7,7,0x5555
>         ori 10,10,0xaaaa
>         ori 9,9,0xaaaa
>         sldi 7,7,32
>         sldi 10,10,32
>         sldi 9,9,32
>         oris 7,7,0x5555
>         oris 10,10,0xaaaa
>         oris 9,9,0xaaaa
>         ori 7,7,0x5555
>         ori 10,10,0xaaab
>         ori 9,9,0xaaab
> For this cases, since 'building' constant are parallel, 'loading' is slower:
> ~8%. On p10, 'loading'(through 'pld') is also slower >4%.
>
>
> BR,
> Jeff(Jiufu)
>
> ---
>  gcc/config/rs6000/rs6000.cc                | 14 ++++++++++++++
>  gcc/testsuite/gcc.target/powerpc/pr63281.c | 11 +++++++++++
>  2 files changed, 25 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
>
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 4b727d2a500..3798e11bdbc 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -10098,6 +10098,20 @@ rs6000_emit_set_const (rtx dest, rtx source)
>           c = ((c & 0xffffffff) ^ 0x80000000) - 0x80000000;
>           emit_move_insn (lo, GEN_INT (c));
>         }
> +      else if (base_reg_operand (dest, mode)
> +              && num_insns_constant (source, mode) > 2)
> +       {
> +         rtx sym = force_const_mem (mode, source);
> +         if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0))
> +             && use_toc_relative_ref (XEXP (sym, 0), mode))
> +           {
> +             rtx toc = create_TOC_reference (XEXP (sym, 0), copy_rtx (dest));
> +             sym = gen_const_mem (mode, toc);
> +             set_mem_alias_set (sym, get_TOC_alias_set ());
> +           }
> +
> +         emit_insn (gen_rtx_SET (dest, sym));
> +       }
>        else
>         rs6000_emit_set_long_const (dest, c);
>        break;
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c b/gcc/testsuite/gcc.target/powerpc/pr63281.c
> new file mode 100644
> index 00000000000..469a8f64400
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
> @@ -0,0 +1,11 @@
> +/* PR target/63281 */
> +/* { dg-do compile { target lp64 } } */
> +/* { dg-options "-O2 -std=c99" } */
> +
> +void
> +foo (unsigned long long *a)
> +{
> +  *a = 0x020805006106003;
> +}
> +
> +/* { dg-final { scan-assembler-times {\mp?ld\M} 1 } } */
> --
> 2.17.1
>
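
For reference, the 'building' sequences quoted for f1.c, f3.c and f4.c can be
checked in plain C.  This is only a minimal sketch: the helpers lis, ori,
oris, sldi, build3, build5 and self_check are hypothetical C models written
for this illustration and are not GCC functions.  They assume the usual
semantics of the instructions (lis sign-extends its 16-bit immediate before
shifting it left by 16; ori and oris zero-extend theirs).  The sketch shows
only that the sequences produce the stated constants; it says nothing about
which variant is faster.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C models of the instructions in the quoted sequences.
   Each works on a full 64-bit register value. */

/* lis rD,imm: sign-extend the 16-bit immediate, then shift left by 16. */
static uint64_t lis(uint16_t imm)
{
    uint64_t sext = ((uint64_t)imm ^ 0x8000u) - 0x8000u;
    return sext << 16;
}

/* ori rD,rS,imm: OR in the zero-extended immediate. */
static uint64_t ori(uint64_t rs, uint16_t imm)
{
    return rs | imm;
}

/* oris rD,rS,imm: OR in the immediate shifted left by 16. */
static uint64_t oris(uint64_t rs, uint16_t imm)
{
    return rs | ((uint64_t)imm << 16);
}

/* sldi rD,rS,sh: shift left by sh bits. */
static uint64_t sldi(uint64_t rs, unsigned sh)
{
    return rs << sh;
}

/* Three-instruction form used in f1.c and f2.c: lis, ori, sldi 32. */
static uint64_t build3(uint16_t h3, uint16_t h2)
{
    return sldi(ori(lis(h3), h2), 32);
}

/* Five-instruction form used in f3.c and f4.c: lis, ori, sldi 32, oris, ori. */
static uint64_t build5(uint16_t h3, uint16_t h2, uint16_t h1, uint16_t h0)
{
    uint64_t r = build3(h3, h2);
    r = oris(r, h1);
    r = ori(r, h0);
    return r;
}

/* Compare each modelled sequence with the constant stored in the test case. */
void self_check(void)
{
    assert(build3(0x1234, 0x5678) == UINT64_C(0x1234567800000000));
    assert(build5(0x0555, 0x5555, 0x5555, 0x5555)
           == UINT64_C(384307168202282325));
    assert(build5(0xaaaa, 0xaaaa, 0xaaaa, 0xaaab)
           == (uint64_t)-INT64_C(6148914691236517205));
    assert(build5(0x0aaa, 0xaaaa, 0xaaaa, 0xaaab)
           == UINT64_C(768614336404564651));
}
```

Calling self_check() from a small main is enough to run it; an assertion
aborts the program if a modelled sequence does not match its constant.  The
0xaaaa case is the interesting one, since the sign extension done by lis is
discarded again by the 32-bit shift.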