From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <guojiufu@linux.ibm.com>
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com
 [148.163.156.1])
 by sourceware.org (Postfix) with ESMTPS id 845D23856DFE;
 Thu, 16 Jun 2022 07:47:56 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 845D23856DFE
Received: from pps.filterd (m0098409.ppops.net [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 25G7Nr5E013320;
 Thu, 16 Jun 2022 07:47:55 GMT
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gqk06aupg-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Thu, 16 Jun 2022 07:47:55 +0000
Received: from m0098409.ppops.net (m0098409.ppops.net [127.0.0.1])
 by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 25G7bJNB015946;
 Thu, 16 Jun 2022 07:47:54 GMT
Received: from ppma03dal.us.ibm.com (b.bd.3ea9.ip4.static.sl-reverse.com
 [169.62.189.11])
 by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gqk06aung-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Thu, 16 Jun 2022 07:47:54 +0000
Received: from pps.filterd (ppma03dal.us.ibm.com [127.0.0.1])
 by ppma03dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 25G7ZOFG011079;
 Thu, 16 Jun 2022 07:47:53 GMT
Received: from b01cxnp22036.gho.pok.ibm.com (b01cxnp22036.gho.pok.ibm.com
 [9.57.198.26]) by ppma03dal.us.ibm.com with ESMTP id 3gmjpa8f9e-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Thu, 16 Jun 2022 07:47:53 +0000
Received: from b01ledav001.gho.pok.ibm.com (b01ledav001.gho.pok.ibm.com
 [9.57.199.106])
 by b01cxnp22036.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 25G7lrlE10551880
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Thu, 16 Jun 2022 07:47:53 GMT
Received: from b01ledav001.gho.pok.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id EFAC728059;
 Thu, 16 Jun 2022 07:47:52 +0000 (GMT)
Received: from b01ledav001.gho.pok.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 9FAA428058;
 Thu, 16 Jun 2022 07:47:52 +0000 (GMT)
Received: from pike (unknown [9.5.12.127])
 by b01ledav001.gho.pok.ibm.com (Postfix) with ESMTPS;
 Thu, 16 Jun 2022 07:47:52 +0000 (GMT)
From: Jiufu Guo <guojiufu@linux.ibm.com>
To: Segher Boessenkool <segher@kernel.crashing.org>
Cc: gcc-patches@gcc.gnu.org, dje.gcc@gmail.com, linkw@gcc.gnu.org,
 amodra@gmail.com
Subject: Re: [PATCH V2]rs6000: Store complicated constant into pool
References: <20220614132355.117887-1-guojiufu@linux.ibm.com>
 <20220614223421.GB25951@gate.crashing.org>
 <h48tu8me3tu.fsf@genoa.aus.stglabs.ibm.com>
 <20220615174441.GD25951@gate.crashing.org>
Date: Thu, 16 Jun 2022 15:47:49 +0800
In-Reply-To: <20220615174441.GD25951@gate.crashing.org> (Segher Boessenkool's
 message of "Wed, 15 Jun 2022 12:44:41 -0500")
Message-ID: <7eczf99hi2.fsf@pike.rch.stglabs.ibm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: -CKfAAagmBPrZUYYrGFslP52SyFXqxLy
X-Proofpoint-ORIG-GUID: vmnE0iZLEsJqYROZCDdk4blQL_1o_-lm
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514
 definitions=2022-06-16_04,2022-06-15_01,2022-02-23_01
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 clxscore=1015
 lowpriorityscore=0 malwarescore=0 impostorscore=0 bulkscore=0 phishscore=0
 mlxlogscore=999 suspectscore=0 priorityscore=1501 adultscore=0 spamscore=0
 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2204290000 definitions=main-2206160028
X-Spam-Status: No, score=-5.4 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, KAM_SHORT, KAM_STOCKGEN, RCVD_IN_MSPIKE_H2,
 SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Thu, 16 Jun 2022 07:47:58 -0000

Segher Boessenkool <segher@kernel.crashing.org> writes:

Hi Segher!

Thanks for your comments and suggestions!!

> Hi!
>
> On Wed, Jun 15, 2022 at 04:19:41PM +0800, Jiufu Guo wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> > Have you tried with different limits?
>> I drafted cases(C code, and updated asm code) to test runtime costs for
>> different code sequences:  building constants with 5 insns; with 3
>> insns and 2 insns; and loading from rodata.
>
> I'm not asking about trivial examples here.  Have you tried the
> resulting compiler, on some big real-life code, with various choices for
> these essentially random choices, on different cpus?

I tested the patch with spec2017 on p9 and p10.  I only tested combined
variations: 'threshold 2 insns' + 'constant cost COSTS_N_INSNS(N)
- 1' (or do not -1).  And I find only few benchmarks are affected
noticeablely. 
I will test more variations.  Any suggestions about workloads?

>
> Huge changes like this need quite a bit of experimental data to back it
> up, or some convincing other evidence.  That is the only way to move
> forward: anything else will resemble Brownian motion :-(

Understood! I would collect more data to see if it is good in general.

>
>> >  And, p10 is a red herring, you
>> > actually test if prefixed insns are used.
>> 
>> 'pld' is the instruction which we care about instead target p10 :-). So
>> in patch, 'TARGET_PREFIXED' is tested.
>
> Huh.  I would think "pli" is the most important thing here!  Which is
> not a load instruction at all, notwithstanding its name.

"pli" could help a lot on constant building!  When loading constant from
memory, "pld" would also be good for some cases. :-)

>
>> >> --- a/gcc/config/rs6000/rs6000.cc
>> >> +++ b/gcc/config/rs6000/rs6000.cc
>> >> @@ -9706,8 +9706,9 @@ rs6000_init_stack_protect_guard (void)
>> >>  static bool
>> >>  rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
>> >>  {
>> >> -  if (GET_CODE (x) == HIGH
>> >> -      && GET_CODE (XEXP (x, 0)) == UNSPEC)
>> >> +  /* Exclude CONSTANT HIGH part.  e.g.
>> >> +     (high:DI (symbol_ref:DI ("var") [flags 0xc0] <var_decl>)).  */
>> >> +  if (GET_CODE (x) == HIGH)
>> >>      return true;
>> >
>> > So, why is this?  You didn't explain.
>> 
>> In the function rs6000_cannot_force_const_mem, we already excluded
>> "HIGH with UNSPEC" like: "high:P (unspec:P [symbol_ref..".
>> I find, "high:DI (symbol_ref:DI" is similar, which may also need to
>> exclude.  Half high part of address would be invalid for constant pool.
>> 
>> Or maybe, we could use below code to exclude it.
>>   /* high part of a symbol_ref could not be put into constant pool.  */
>>   if (GET_CODE (x) == HIGH
>>       && (GET_CODE (XEXP (x, 0)) == UNSPEC || SYMBOL_REF_P (XEXP (x, 0))))
>> 
>> One interesting thing is how the rtx "(high:DI (symbol_ref:DI (var" is
>> generated.  It seems in the middle of optimization, it is checked to
>> see if ok to store in constant memory.
>
> It probably is fine to not allow the HIGH of *anything* to be put in a
> constant pool, sure.  But again, that is a change independent of other
> things, and it could use some supporting evidence.

In code, I will add comments about these examples of high half part
address that need to be excluded.

/* The high part address is invalid for constant pool. The form of
address high part would be: "high:P (unspec:P [symbol_ref".  In the
middle of one pass, the form could also be "high:DI (symbol_ref".
Returns true for rtx with HIGH code. */

>
>> >>      case CONST_DOUBLE:
>> >>      case CONST_WIDE_INT:
>> >> +      /* It may needs a few insns for const to SET. -1 for outer SET code.  */
>> >> +      if (outer_code == SET)
>> >> +	{
>> >> +	  *total = COSTS_N_INSNS (num_insns_constant (x, mode)) - 1;
>> >> +	  return true;
>> >> +	}
>> >> +      /* FALLTHRU */
>> >> +
>> >>      case CONST:
>> >>      case HIGH:
>> >>      case SYMBOL_REF:
>> >
>> > But this again isn't an obvious improvement at all, and needs
>> > performance testing -- separately of the other changes.
>> 
>> This code updates the cost of constant assigning, we need this to avoid
>> CSE to replace the constant loading with constant assigning.
>
> No.  This is rtx_cost, which makes a cost estimate of random RTL, not
> typically corresponding to existing RTL insns at all.  We do not use it
> a lot anymore thankfully (we use insn_cost in most places), but things
> like CSE still use it.

This change increases rtx_cost for constant on SET. Because CSE is using
the rtx_cost.  Then with this, we could avoid CSE get lower cost
incorrectly on complicated constant assigning. 

>
>> The above change (the threshold for storing const in the pool) depends
>> on this code.
>
> Yes, and a gazillion other places as well still (or about 50 really :-) )
>
>> > And this is completely independent of the rest as well.  Cost 5 or 7 are
>> > completely random numbers, why did you pick these?  Does it work better
>> > than 8, or 4, etc.?
>> For 'pld', asm code would be "pld %r9,.LC0@pcrel"; as tests, it is
>> faster than 2insns (like, "li+oris"), but still slower than one
>> instruction (e.g. li, pli).  So, "COSTS_N_INSNS (2) - 1" is selected.
>> And with this, cse will not replace 'pld' with 'li + oris'. 
>
> (That is WinNT assembler syntax btw.  We never use that.  We write
> either "pld 9,.LC0@pcrel" or "pld r9,.LC0@pcrel".)
Thanks.
We used to see asm code: "pld 9,.LC0@pcrel".  Which I want to show is
"pld r9,.LC0@pcrel" :-).  I did not find a way to dump this asm code.
"-mregnames" generates "%r9".

>
> COSTS_N_INSNS(1) means 4, always.  The costs are not additive at all in
> reality of course, so you cannot simply add them and get any sensible
> result :-(
>
> Why did you choose 7?  Your explanation makes 5 and 6 good candidates
> as well.  My money is on 6 performing best here, but there is no way to
> know without actually trying out things.

Thanks for sugguestion! 
As tests, it would be a value between 4-8. I also selected
"5"(slightly slower than 4--one instruction), selected "7" because it
may means close to the cost of two instructions.
"6" maybe better, I will have a test.

>
>> >> --- a/gcc/testsuite/gcc.target/powerpc/medium_offset.c
>> >> +++ b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
>> >> @@ -1,7 +1,7 @@
>> >>  /* { dg-do compile { target { powerpc*-*-* } } } */
>> >>  /* { dg-require-effective-target lp64 } */
>> >>  /* { dg-options "-O" } */
>> >> -/* { dg-final { scan-assembler-not "\\+4611686018427387904" } } */
>> >> +/* { dg-final { scan-assembler-times {\msldi|pld\M} 1 } } */
>> >
>> > Why?  This is still better generated in code, no?  It should never be
>> > loaded from a constant pool (it is hex 4000_0000_0000_0000, easy to
>> > construct with just one or two insns).
>> 
>> For p8/9, two insns "lis 0x4000+sldi 32" are used:
>>         addis %r3,%r2,.LANCHOR0@toc@ha
>>         addi %r3,%r3,.LANCHOR0@toc@l
>>         lis %r9,0x4000
>>         sldi %r9,%r9,32
>>         add %r3,%r3,%r9
>> 	blr
>
> That does not mean putting this constant in the constant pool is a good
> idea at all, of course.
>
>> On p10, as expected, 'pld' would be better than 'lis+sldi'.
>
> Is it?

With simple cases, it shows 'pld' seems better. For perlbench, it may
also indicate this. But I did not test this part separately.
As you suggested, I will collect more data to check this change.

>
>> And for this case, it also saves other instructions(addis/addi):
>>         pld %r3,.LC1@pcrel
>>         blr
>
> This explanation then belongs in your commit message :-)
>
>> > Btw, you need to write
>> >   \m(?:rldimi|ld|pld)\M
>> Oh, thanks!  I did not aware of this.
>
> (?:xxx) is just the same as (xxx), but the parentheses in the latter are
> capturing, which messes up scan-assembler-times, so you need
> non-capturing grouping.
Thanks.
>
>> Maybe I should separate out two cases one for "-mdejagnu-cpu=power10",
>> and one for "-mdejagnu-cpu=power7", because the load instructions
>> the number would be different.
>
> Yeah, it is probably best to never allow both a load and non-load insns
> in the same check.
OK, thanks.
>
>> > So no doubt this will improve things, but we need testing of each part
>> > separately.  Also look at code size, or differences in the generated
>> > code in general: this is much more sensitive to detect than performance,
>> > and is not itself sensitive to things like system load, so a) is easier
>> > to measure, and b) has more useful outputs, outputs that tell more of
>> > the whole story.
>> Thanks for your suggestions!
>> I also feel it is very sensitive when I test performance.
>
> Yeah, there are constants *everywhere* in typical compiler output, so a
> small change will easily move the needle already.
>
>> The change
>> (e.g. cost of constant) may affect other optimization passes indirectly.
>
> That too, yes.
>
>> I will also check the object changes (object size or differences). I may
>> use objects from GCC bootstrap and GCC test suite.
>
> Good choice :-)
>
> Looking forward to it,

Any comments or suggestions, thanks for point out!

BR,
Jiufu Guo

>
>
> Segher