From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <guojiufu@linux.ibm.com>
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com
 [148.163.156.1])
 by sourceware.org (Postfix) with ESMTPS id 1DD9838582B4;
 Wed, 15 Jun 2022 08:19:51 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 1DD9838582B4
Received: from pps.filterd (m0098396.ppops.net [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 25F8FMt3029612;
 Wed, 15 Jun 2022 08:19:50 GMT
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gpr32p0b8-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Wed, 15 Jun 2022 08:19:49 +0000
Received: from m0098396.ppops.net (m0098396.ppops.net [127.0.0.1])
 by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 25F8IXID009223;
 Wed, 15 Jun 2022 08:19:49 GMT
Received: from ppma03dal.us.ibm.com (b.bd.3ea9.ip4.static.sl-reverse.com
 [169.62.189.11])
 by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gpr32p0ak-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Wed, 15 Jun 2022 08:19:49 +0000
Received: from pps.filterd (ppma03dal.us.ibm.com [127.0.0.1])
 by ppma03dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 25F85KNY023516;
 Wed, 15 Jun 2022 08:19:48 GMT
Received: from b03cxnp07028.gho.boulder.ibm.com
 (b03cxnp07028.gho.boulder.ibm.com [9.17.130.15])
 by ppma03dal.us.ibm.com with ESMTP id 3gmjpa05hh-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Wed, 15 Jun 2022 08:19:48 +0000
Received: from b03ledav005.gho.boulder.ibm.com
 (b03ledav005.gho.boulder.ibm.com [9.17.130.236])
 by b03cxnp07028.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 25F8JlPr34603400
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Wed, 15 Jun 2022 08:19:47 GMT
Received: from b03ledav005.gho.boulder.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 22853BE054;
 Wed, 15 Jun 2022 08:19:47 +0000 (GMT)
Received: from b03ledav005.gho.boulder.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id CE835BE053;
 Wed, 15 Jun 2022 08:19:46 +0000 (GMT)
Received: from genoa (unknown [9.40.192.157])
 by b03ledav005.gho.boulder.ibm.com (Postfix) with ESMTPS;
 Wed, 15 Jun 2022 08:19:46 +0000 (GMT)
From: Jiufu Guo <guojiufu@linux.ibm.com>
To: Segher Boessenkool <segher@kernel.crashing.org>
Cc: gcc-patches@gcc.gnu.org, dje.gcc@gmail.com, linkw@gcc.gnu.org,
 amodra@gmail.com
Subject: Re: [PATCH V2]rs6000: Store complicated constant into pool
In-Reply-To: <20220614223421.GB25951@gate.crashing.org> (Segher Boessenkool's
 message of "Tue, 14 Jun 2022 17:34:21 -0500")
References: <20220614132355.117887-1-guojiufu@linux.ibm.com>
 <20220614223421.GB25951@gate.crashing.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
Date: Wed, 15 Jun 2022 16:19:41 +0800
Message-ID: <h48tu8me3tu.fsf@genoa.aus.stglabs.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: MVPFp43U8qUKXOdYAa4eCcxAXVjeKfRX
X-Proofpoint-ORIG-GUID: oCztZ-ntLJIe1fOGoNIfp57EASCJ0su8
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514
 definitions=2022-06-15_03,2022-06-13_01,2022-02-23_01
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 mlxlogscore=999
 suspectscore=0 malwarescore=0 phishscore=0 impostorscore=0 adultscore=0
 clxscore=1015 bulkscore=0 lowpriorityscore=0 priorityscore=1501 mlxscore=0
 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2204290000 definitions=main-2206150029
X-Spam-Status: No, score=-4.5 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, KAM_SHORT, KAM_STOCKGEN, RCVD_IN_MSPIKE_H2,
 SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 15 Jun 2022 08:19:53 -0000

Segher Boessenkool <segher@kernel.crashing.org> writes:

Hi Segher,

Thanks so much for your review and sugguestions!

> Hi!
>
> On Tue, Jun 14, 2022 at 09:23:55PM +0800, Jiufu Guo wrote:
>> This patch reduces the threshold of instruction number for storing
>> constant to pool and update cost for constant and mem accessing.
>> And then if building the constant needs more than 2 instructions (or
>> more than 1 instruction on P10), then prefer to load it from constant
>> pool.
>
> Have you tried with different limits?  And, p10 is a red herring, you
> actually test if prefixed insns are used.

I drafted cases(C code, and updated asm code) to test runtime costs for
different code sequences:  building constants with 5 insns; with 3
insns and 2 insns; and loading from rodata.
'pld' is the instruction which we care about instead target p10 :-). So
in patch, 'TARGET_PREFIXED' is tested.

>
>> 	* config/rs6000/rs6000.cc (rs6000_cannot_force_const_mem):
>> 	Exclude rtx with code 'HIGH'.
>
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -9706,8 +9706,9 @@ rs6000_init_stack_protect_guard (void)
>>  static bool
>>  rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
>>  {
>> -  if (GET_CODE (x) == HIGH
>> -      && GET_CODE (XEXP (x, 0)) == UNSPEC)
>> +  /* Exclude CONSTANT HIGH part.  e.g.
>> +     (high:DI (symbol_ref:DI ("var") [flags 0xc0] <var_decl>)).  */
>> +  if (GET_CODE (x) == HIGH)
>>      return true;
>
> So, why is this?  You didn't explain.

In the function rs6000_cannot_force_const_mem, we already excluded
"HIGH with UNSPEC" like: "high:P (unspec:P [symbol_ref..".
I find, "high:DI (symbol_ref:DI" is similar, which may also need to
exclude.  Half high part of address would be invalid for constant pool.

Or maybe, we could use below code to exclude it.
  /* high part of a symbol_ref could not be put into constant pool.  */
  if (GET_CODE (x) == HIGH
      && (GET_CODE (XEXP (x, 0)) == UNSPEC || SYMBOL_REF_P (XEXP (x, 0))))

One interesting thing is how the rtx "(high:DI (symbol_ref:DI (var" is
generated.  It seems in the middle of optimization, it is checked to
see if ok to store in constant memory.
  
>
>> @@ -11139,7 +11140,7 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
>>  		    && FP_REGNO_P (REGNO (operands[0])))
>>  		   || !CONST_INT_P (operands[1])
>>  		   || (num_insns_constant (operands[1], mode)
>> -		       > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)))
>> +		       > (TARGET_PREFIXED ? 1 : 2)))
>>  	       && !toc_relative_expr_p (operands[1], false, NULL, NULL)
>>  	       && (TARGET_CMODEL == CMODEL_SMALL
>>  		   || can_create_pseudo_p ()
>
> This is the more obvious part.
>
>> @@ -22101,6 +22102,14 @@ rs6000_rtx_costs (rtx x, machine_mode mode, int outer_code,
>>  
>>      case CONST_DOUBLE:
>>      case CONST_WIDE_INT:
>> +      /* It may needs a few insns for const to SET. -1 for outer SET code.  */
>> +      if (outer_code == SET)
>> +	{
>> +	  *total = COSTS_N_INSNS (num_insns_constant (x, mode)) - 1;
>> +	  return true;
>> +	}
>> +      /* FALLTHRU */
>> +
>>      case CONST:
>>      case HIGH:
>>      case SYMBOL_REF:
>
> But this again isn't an obvious improvement at all, and needs
> performance testing -- separately of the other changes.

This code updates the cost of constant assigning, we need this to avoid
CSE to replace the constant loading with constant assigning.
The above change (the threshold for storing const in the pool) depends
on this code.
This part does not depend on other part, so this part could be tested to
tune separately.

>
>> @@ -22110,8 +22119,12 @@ rs6000_rtx_costs (rtx x, machine_mode mode, int outer_code,
>>      case MEM:
>>        /* When optimizing for size, MEM should be slightly more expensive
>>  	 than generating address, e.g., (plus (reg) (const)).
>> -	 L1 cache latency is about two instructions.  */
>> -      *total = !speed ? COSTS_N_INSNS (1) + 1 : COSTS_N_INSNS (2);
>> +	 L1 cache latency is about two instructions.
>> +	 For prefixed load (pld), we would set it slightly faster than
>> +	 than two instructions. */
>> +      *total = !speed
>> +		 ? COSTS_N_INSNS (1) + 1
>> +		 : TARGET_PREFIXED ? COSTS_N_INSNS (2) - 1 : COSTS_N_INSNS (2);
>>        if (rs6000_slow_unaligned_access (mode, MEM_ALIGN (x)))
>>  	*total += COSTS_N_INSNS (100);
>>        return true;
>
> And this is completely independent of the rest as well.  Cost 5 or 7 are
> completely random numbers, why did you pick these?  Does it work better
> than 8, or 4, etc.?
For 'pld', asm code would be "pld %r9,.LC0@pcrel"; as tests, it is
faster than 2insns (like, "li+oris"), but still slower than one
instruction (e.g. li, pli).  So, "COSTS_N_INSNS (2) - 1" is selected.
And with this, cse will not replace 'pld' with 'li + oris'. 

>
>
>> --- a/gcc/testsuite/gcc.target/powerpc/medium_offset.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
>> @@ -1,7 +1,7 @@
>>  /* { dg-do compile { target { powerpc*-*-* } } } */
>>  /* { dg-require-effective-target lp64 } */
>>  /* { dg-options "-O" } */
>> -/* { dg-final { scan-assembler-not "\\+4611686018427387904" } } */
>> +/* { dg-final { scan-assembler-times {\msldi|pld\M} 1 } } */
>
> Why?  This is still better generated in code, no?  It should never be
> loaded from a constant pool (it is hex 4000_0000_0000_0000, easy to
> construct with just one or two insns).

For p8/9, two insns "lis 0x4000+sldi 32" are used:
        addis %r3,%r2,.LANCHOR0@toc@ha
        addi %r3,%r3,.LANCHOR0@toc@l
        lis %r9,0x4000
        sldi %r9,%r9,32
        add %r3,%r3,%r9
	blr

On p10, as expected, 'pld' would be better than 'lis+sldi'.
And for this case, it also saves other instructions(addis/addi):
        pld %r3,.LC1@pcrel
        blr
....
.LC1:
     	.quad   .LANCHOR0+4611686018427387904
....
        .set    .LANCHOR0,. + 0
        .type   x, @object
        .size   x, 4
x:
        .zero   4
        
>
>> --- a/gcc/testsuite/gcc.target/powerpc/pr93012.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr93012.c
>> @@ -10,4 +10,4 @@ unsigned long long mskh1() { return 0xffff9234ffff9234ULL; }
>>  unsigned long long mskl1() { return 0x2bcdffff2bcdffffULL; }
>>  unsigned long long mskse() { return 0xffff1234ffff1234ULL; }
>>  
>> -/* { dg-final { scan-assembler-times {\mrldimi\M} 7 } } */
>> +/* { dg-final { scan-assembler-times {\mrldimi|ld|pld\M} 7 } } */
>
> Please make this the exact number of times you want to see rldimi and
> the number of times you want a load.
>
> Btw, you need to write
>   \m(?:rldimi|ld|pld)\M
Oh, thanks!  I did not aware of this.

> or it will mean
>   \mrldimi
> or
>   ld
> or
>   pld\M
> (and that "ld" will match anything that "pld$" will match of course).
Maybe I should separate out two cases one for "-mdejagnu-cpu=power10",
and one for "-mdejagnu-cpu=power7", because the load instructions
the number would be different.

>
>
> So no doubt this will improve things, but we need testing of each part
> separately.  Also look at code size, or differences in the generated
> code in general: this is much more sensitive to detect than performance,
> and is not itself sensitive to things like system load, so a) is easier
> to measure, and b) has more useful outputs, outputs that tell more of
> the whole story.
Thanks for your suggestions!
I also feel it is very sensitive when I test performance. The change
(e.g. cost of constant) may affect other optimization passes indirectly.
I will also check the object changes (object size or differences). I may
use objects from GCC bootstrap and GCC test suite.

BR,
Jiufu Guo

>
>
> Segher