From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <guojiufu@linux.ibm.com>
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com
 [148.163.156.1])
 by sourceware.org (Postfix) with ESMTPS id ADAF53858C2D;
 Tue, 16 Aug 2022 06:45:25 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org ADAF53858C2D
Received: from pps.filterd (m0098404.ppops.net [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27G6STmJ025281;
 Tue, 16 Aug 2022 06:45:24 GMT
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j060erd9u-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 16 Aug 2022 06:45:23 +0000
Received: from m0098404.ppops.net (m0098404.ppops.net [127.0.0.1])
 by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27G6SuDj027166;
 Tue, 16 Aug 2022 06:45:23 GMT
Received: from ppma01wdc.us.ibm.com (fd.55.37a9.ip4.static.sl-reverse.com
 [169.55.85.253])
 by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j060erd96-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 16 Aug 2022 06:45:23 +0000
Received: from pps.filterd (ppma01wdc.us.ibm.com [127.0.0.1])
 by ppma01wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27G6iPiR024062;
 Tue, 16 Aug 2022 06:45:22 GMT
Received: from b01cxnp22034.gho.pok.ibm.com (b01cxnp22034.gho.pok.ibm.com
 [9.57.198.24]) by ppma01wdc.us.ibm.com with ESMTP id 3hx3k98gcw-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 16 Aug 2022 06:45:22 +0000
Received: from b01ledav001.gho.pok.ibm.com (b01ledav001.gho.pok.ibm.com
 [9.57.199.106])
 by b01cxnp22034.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 27G6jLGf57016776
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Tue, 16 Aug 2022 06:45:21 GMT
Received: from b01ledav001.gho.pok.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id BCE3628058;
 Tue, 16 Aug 2022 06:45:21 +0000 (GMT)
Received: from b01ledav001.gho.pok.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 7740A2805C;
 Tue, 16 Aug 2022 06:45:21 +0000 (GMT)
Received: from pike (unknown [9.5.12.127])
 by b01ledav001.gho.pok.ibm.com (Postfix) with ESMTPS;
 Tue, 16 Aug 2022 06:45:21 +0000 (GMT)
From: Jiufu Guo <guojiufu@linux.ibm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>, David Edelsohn <dje.gcc@gmail.com>, 
 Segher Boessenkool <segher@kernel.crashing.org>, linkw@gcc.gnu.org
Subject: Re: [RFC]rs6000: split complicated constant to memory
References: <20220815052519.194582-1-guojiufu@linux.ibm.com>
 <CAFiYyc0vqQyLzzov8ghFXiz1VrLRBDfGfddws1BxQXAJJXKg0Q@mail.gmail.com>
 <7ek078ludi.fsf@pike.rch.stglabs.ibm.com>
Date: Tue, 16 Aug 2022 14:45:12 +0800
In-Reply-To: <7ek078ludi.fsf@pike.rch.stglabs.ibm.com> (Jiufu Guo's message of
 "Tue, 16 Aug 2022 11:50:17 +0800")
Message-ID: <7ev8qsk7pj.fsf@pike.rch.stglabs.ibm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-TM-AS-GCONF: 00
X-Proofpoint-ORIG-GUID: Flo_aBp8-WOzs_gmCYt-uY0Adq3QYrNm
X-Proofpoint-GUID: V_Ue1HyhTEcYHx-J8VNx3F6lFRLetVlT
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1
 definitions=2022-08-16_04,2022-08-16_01,2022-06-22_01
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 impostorscore=0
 lowpriorityscore=0 adultscore=0 clxscore=1015 malwarescore=0
 suspectscore=0 spamscore=0 priorityscore=1501 phishscore=0 bulkscore=0
 mlxscore=0 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2207270000 definitions=main-2208160024
X-Spam-Status: No, score=-11.4 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, KAM_STOCKGEN,
 RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 16 Aug 2022 06:45:28 -0000

Jiufu Guo <guojiufu@linux.ibm.com> writes:

> Hi,
>
> Richard Biener <richard.guenther@gmail.com> writes:
>
>> On Mon, Aug 15, 2022 at 7:26 AM Jiufu Guo via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>>>
>>> Hi,
>>>
>>> This patch tries to put the constant into constant pool if building the
>>> constant requires 3 or more instructions.
>>>
>>> But there is a concern: I'm wondering if this patch is really profitable.
>>>
>>> Because, as I tested, 1. for simple cases, if instructions are not run
>>> in parallel, loading the constant from memory may be faster; but 2. if
>>> some instructions can run in parallel, loading the constant from memory
>>> is not a win compared with building the constant.  See the examples below.
>>>
>>> For f1.c and f3.c, 'loading' the constant is acceptable in terms of runtime;
>>> for f2.c and f4.c, 'loading' the constant is visibly slower.
>>>
>>> For real-world cases, both kinds of code sequences exist.
>>>
>>> So, I'm not sure if we need to push this patch.
>>>
>>> To check the runtime, each function below was run many times (1000000000).
>>> f1.c:
>>> long foo (long *arg, long*, long *)
>>> {
>>>   *arg = 0x1234567800000000;
>>> }
>>> asm building constant:
>>>         lis 10,0x1234
>>>         ori 10,10,0x5678
>>>         sldi 10,10,32
>>> vs.  asm loading
>>>         addis 10,2,.LC0@toc@ha
>>>         ld 10,.LC0@toc@l(10)
>>> The runtimes of 'building' and 'loading' are similar: sometimes
>>> 'building' is faster; sometimes 'loading' is faster. The difference is
>>> slight.
>>
>> I wonder if it is possible to decide this during scheduling - choose the
>> variant that, when the result is needed, is cheaper?  Post-RA might
>> be a bit difficult (I see the load from memory needs the TOC, but then
>> when the TOC is not available we could just always emit the build form),
>> and pre-reload precision might be not good enough to make this worth
>> the experiment?
> Thanks a lot for your comments!
>
> Yes, Post-RA may not handle all cases.
> If there is no TOC available, we are not able to load the constant through
> the TOC.  As Segher pointed out, crtl->uses_const_pool may be a way to
> approximate this.
> The sched2 pass could optimize some cases (e.g. f2.c and f4.c), but in
> some cases it may not distribute those 'building' instructions.
>
> So, maybe we could add a peephole after sched2.  If the five instructions
> that build the constant are still consecutive, then replace them with a
> 'load' (after checking that the TOC is available).
> But I'm not sure it is worthwhile.

Oh, after checking the object files (from GCC bootstrap and spec), it is rare
that the five instructions are consecutive.  Often 1 (or 2) insns are
scheduled elsewhere, and the other 4 (or 3) instructions stay consecutive.
So a peephole may not be very helpful.
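
To make the five-instruction sequence concrete, here is a minimal C sketch
that emulates the lis/ori/sldi/oris/ori sequence quoted for f3.c and f4.c
on a 64-bit value.  The helper name build5 is made up for illustration; it
only models the register arithmetic (including the sign extension done by
lis), not timing or scheduling.

```c
#include <stdint.h>

/* Hypothetical helper: emulate
     lis  r,hh ; ori r,r,hl ; sldi r,r,32 ; oris r,r,lh ; ori r,r,ll
   on a 64-bit register and return the resulting value.  */
static uint64_t
build5 (uint16_t hh, uint16_t hl, uint16_t lh, uint16_t ll)
{
  /* lis: the 16-bit immediate is sign-extended, then shifted left by 16.  */
  int64_t imm = hh;
  if (imm & 0x8000)
    imm -= 0x10000;
  uint64_t r = (uint64_t) (imm * 65536);

  r |= hl;                      /* ori:  OR into bits 0..15 */
  r <<= 32;                     /* sldi: move the 32 bits built so far up */
  r |= (uint64_t) lh << 16;     /* oris: OR into bits 16..31 */
  r |= ll;                      /* ori:  OR into bits 0..15 */
  return r;
}
```

With the immediates from f3.c, build5 (0x555, 0x5555, 0x5555, 0x5555)
yields 0x0555555555555555, which is 384307168202282325.  The three
sequences in f4.c are independent of each other, which is why they can
be interleaved as shown in that listing.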

BR,
Jeff(Jiufu)

>
>>
>> Of course the scheduler might lack on the technical side as well.
>
>
> BR,
> Jeff(Jiufu)
>
>>
>>>
>>> f2.c
>>> long foo (long *arg, long *arg2, long *arg3)
>>> {
>>>   *arg = 0x1234567800000000;
>>>   *arg2 = 0x7965234700000000;
>>>   *arg3 = 0x4689123700000000;
>>> }
>>> asm building constant:
>>>         lis 7,0x1234
>>>         lis 10,0x7965
>>>         lis 9,0x4689
>>>         ori 7,7,0x5678
>>>         ori 10,10,0x2347
>>>         ori 9,9,0x1237
>>>         sldi 7,7,32
>>>         sldi 10,10,32
>>>         sldi 9,9,32
>>> vs. loading
>>>         addis 7,2,.LC0@toc@ha
>>>         addis 10,2,.LC1@toc@ha
>>>         addis 9,2,.LC2@toc@ha
>>>         ld 7,.LC0@toc@l(7)
>>>         ld 10,.LC1@toc@l(10)
>>>         ld 9,.LC2@toc@l(9)
>>> For this case, 'loading' is always slower than 'building' (>15%).
>>>
>>> f3.c
>>> long foo (long *arg, long *, long *)
>>> {
>>>   *arg = 384307168202282325;
>>> }
>>>         lis 10,0x555
>>>         ori 10,10,0x5555
>>>         sldi 10,10,32
>>>         oris 10,10,0x5555
>>>         ori 10,10,0x5555
>>> For this case, 'building' (through 5 instructions) is slower, and 'loading'
>>> is ~5% faster.
>>>
>>> f4.c
>>> long foo (long *arg, long *arg2, long *arg3)
>>> {
>>>   *arg = 384307168202282325;
>>>   *arg2 = -6148914691236517205;
>>>   *arg3 = 768614336404564651;
>>> }
>>>         lis 7,0x555
>>>         lis 10,0xaaaa
>>>         lis 9,0xaaa
>>>         ori 7,7,0x5555
>>>         ori 10,10,0xaaaa
>>>         ori 9,9,0xaaaa
>>>         sldi 7,7,32
>>>         sldi 10,10,32
>>>         sldi 9,9,32
>>>         oris 7,7,0x5555
>>>         oris 10,10,0xaaaa
>>>         oris 9,9,0xaaaa
>>>         ori 7,7,0x5555
>>>         ori 10,10,0xaaab
>>>         ori 9,9,0xaaab
>>> For this case, since the 'building' instructions run in parallel, 'loading'
>>> is slower: ~8%. On p10, 'loading' (through 'pld') is also slower: >4%.
>>>
>>>
>>> BR,
>>> Jeff(Jiufu)
>>>
>>> ---
>>>  gcc/config/rs6000/rs6000.cc                | 14 ++++++++++++++
>>>  gcc/testsuite/gcc.target/powerpc/pr63281.c | 11 +++++++++++
>>>  2 files changed, 25 insertions(+)
>>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
>>>
>>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>>> index 4b727d2a500..3798e11bdbc 100644
>>> --- a/gcc/config/rs6000/rs6000.cc
>>> +++ b/gcc/config/rs6000/rs6000.cc
>>> @@ -10098,6 +10098,20 @@ rs6000_emit_set_const (rtx dest, rtx source)
>>>           c = ((c & 0xffffffff) ^ 0x80000000) - 0x80000000;
>>>           emit_move_insn (lo, GEN_INT (c));
>>>         }
>>> +      else if (base_reg_operand (dest, mode)
>>> +              && num_insns_constant (source, mode) > 2)
>>> +       {
>>> +         rtx sym = force_const_mem (mode, source);
>>> +         if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0))
>>> +             && use_toc_relative_ref (XEXP (sym, 0), mode))
>>> +           {
>>> +             rtx toc = create_TOC_reference (XEXP (sym, 0), copy_rtx (dest));
>>> +             sym = gen_const_mem (mode, toc);
>>> +             set_mem_alias_set (sym, get_TOC_alias_set ());
>>> +           }
>>> +
>>> +         emit_insn (gen_rtx_SET (dest, sym));
>>> +       }
>>>        else
>>>         rs6000_emit_set_long_const (dest, c);
>>>        break;
>>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c b/gcc/testsuite/gcc.target/powerpc/pr63281.c
>>> new file mode 100644
>>> index 00000000000..469a8f64400
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
>>> @@ -0,0 +1,11 @@
>>> +/* PR target/63281 */
>>> +/* { dg-do compile { target lp64 } } */
>>> +/* { dg-options "-O2 -std=c99" } */
>>> +
>>> +void
>>> +foo (unsigned long long *a)
>>> +{
>>> +  *a = 0x020805006106003;
>>> +}
>>> +
>>> +/* { dg-final { scan-assembler-times {\mp?ld\M} 1 } } */
>>> --
>>> 2.17.1
>>>