From: Jiufu Guo <guojiufu@linux.ibm.com>
To: Segher Boessenkool <segher@kernel.crashing.org>
Cc: gcc-patches@gcc.gnu.org, dje.gcc@gmail.com, linkw@gcc.gnu.org,
	amodra@gmail.com
Subject: Re: [PATCH V2]rs6000: Store complicated constant into pool
Date: Wed, 15 Jun 2022 16:19:41 +0800	[thread overview]
Message-ID: <h48tu8me3tu.fsf@genoa.aus.stglabs.ibm.com> (raw)
In-Reply-To: <20220614223421.GB25951@gate.crashing.org> (Segher Boessenkool's message of "Tue, 14 Jun 2022 17:34:21 -0500")

Segher Boessenkool <segher@kernel.crashing.org> writes:

Hi Segher,

Thanks so much for your review and suggestions!

> Hi!
>
> On Tue, Jun 14, 2022 at 09:23:55PM +0800, Jiufu Guo wrote:
>> This patch reduces the threshold of instruction number for storing
>> constant to pool and update cost for constant and mem accessing.
>> And then if building the constant needs more than 2 instructions (or
>> more than 1 instruction on P10), then prefer to load it from constant
>> pool.
>
> Have you tried with different limits?  And, p10 is a red herring, you
> actually test if prefixed insns are used.

I drafted test cases (C code plus hand-modified asm) to measure the
runtime cost of the different code sequences: building constants with
5 insns, with 3 insns, with 2 insns, and loading them from rodata (a
rough sketch of the benchmark is below).
'pld' is the instruction we actually care about, rather than the p10
target itself :-).  That is why the patch tests 'TARGET_PREFIXED'.
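
A minimal sketch of the kind of micro-benchmark I used (the constant,
loop count, and function name are only illustrative; the 3-insn,
2-insn, and pld variants were produced by hand-editing the generated
asm of build_const before assembling and timing):

  /* Illustrative sketch: compile with -O2 -S, hand-edit the
     constant-building sequence in build_const to the variant under
     test (5/3/2 insns or a pld from .rodata), then assemble and time.  */
  unsigned long long __attribute__ ((noinline))
  build_const (void)
  {
    /* An arbitrary constant that needs several insns to materialize.  */
    return 0x1234567856789abcULL;
  }

  int
  main (void)
  {
    unsigned long long sum = 0;
    for (long i = 0; i < 1000000000L; i++)
      sum += build_const ();
    return sum != 0 ? 0 : 1;
  }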

>
>> 	* config/rs6000/rs6000.cc (rs6000_cannot_force_const_mem):
>> 	Exclude rtx with code 'HIGH'.
>
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -9706,8 +9706,9 @@ rs6000_init_stack_protect_guard (void)
>>  static bool
>>  rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
>>  {
>> -  if (GET_CODE (x) == HIGH
>> -      && GET_CODE (XEXP (x, 0)) == UNSPEC)
>> +  /* Exclude CONSTANT HIGH part.  e.g.
>> +     (high:DI (symbol_ref:DI ("var") [flags 0xc0] <var_decl>)).  */
>> +  if (GET_CODE (x) == HIGH)
>>      return true;
>
> So, why is this?  You didn't explain.

In rs6000_cannot_force_const_mem, we already exclude "HIGH of UNSPEC",
e.g. "high:P (unspec:P [symbol_ref..".
I found that "high:DI (symbol_ref:DI" is similar and may also need to
be excluded: the high half of an address on its own is not valid
content for the constant pool.

Or maybe we could use the code below to exclude it:
  /* The high part of a symbol_ref cannot be put into the constant pool.  */
  if (GET_CODE (x) == HIGH
      && (GET_CODE (XEXP (x, 0)) == UNSPEC || SYMBOL_REF_P (XEXP (x, 0))))
    return true;

One interesting thing is how the rtx "(high:DI (symbol_ref:DI (var" is
generated in the first place.  It seems that in the middle of
optimization it is checked to see whether it is OK to store it in
constant memory.
  
>
>> @@ -11139,7 +11140,7 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
>>  		    && FP_REGNO_P (REGNO (operands[0])))
>>  		   || !CONST_INT_P (operands[1])
>>  		   || (num_insns_constant (operands[1], mode)
>> -		       > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)))
>> +		       > (TARGET_PREFIXED ? 1 : 2)))
>>  	       && !toc_relative_expr_p (operands[1], false, NULL, NULL)
>>  	       && (TARGET_CMODEL == CMODEL_SMALL
>>  		   || can_create_pseudo_p ()
>
> This is the more obvious part.
>
>> @@ -22101,6 +22102,14 @@ rs6000_rtx_costs (rtx x, machine_mode mode, int outer_code,
>>  
>>      case CONST_DOUBLE:
>>      case CONST_WIDE_INT:
>> +      /* It may need a few insns to SET the const.  -1 for the outer SET code.  */
>> +      if (outer_code == SET)
>> +	{
>> +	  *total = COSTS_N_INSNS (num_insns_constant (x, mode)) - 1;
>> +	  return true;
>> +	}
>> +      /* FALLTHRU */
>> +
>>      case CONST:
>>      case HIGH:
>>      case SYMBOL_REF:
>
> But this again isn't an obvious improvement at all, and needs
> performance testing -- separately of the other changes.

This code updates the cost of setting a register to a constant; we need
it so that CSE does not replace the load from the constant pool with the
multi-insn constant build.
The change above (the threshold for storing the constant in the pool)
depends on this code.
This part does not depend on the other parts, so it can be tested and
tuned separately.
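
As a purely illustrative example (not taken from the patch): for

  unsigned long long
  big_const (void)
  {
    return 0x1234567856789abcULL;  /* needs ~5 insns to build inline */
  }

the two candidate sequences are roughly a lis/ori/sldi/oris/ori build
versus a single "pld rN,.LC0@pcrel" plus a .quad in .rodata.  Without a
realistic cost for the constant SET, CSE treats the inline build as
cheap and can undo the decision to load the constant from the pool.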

>
>> @@ -22110,8 +22119,12 @@ rs6000_rtx_costs (rtx x, machine_mode mode, int outer_code,
>>      case MEM:
>>        /* When optimizing for size, MEM should be slightly more expensive
>>  	 than generating address, e.g., (plus (reg) (const)).
>> -	 L1 cache latency is about two instructions.  */
>> -      *total = !speed ? COSTS_N_INSNS (1) + 1 : COSTS_N_INSNS (2);
>> +	 L1 cache latency is about two instructions.
>> +	 For a prefixed load (pld), we set it slightly faster than
>> +	 two instructions.  */
>> +      *total = !speed
>> +		 ? COSTS_N_INSNS (1) + 1
>> +		 : TARGET_PREFIXED ? COSTS_N_INSNS (2) - 1 : COSTS_N_INSNS (2);
>>        if (rs6000_slow_unaligned_access (mode, MEM_ALIGN (x)))
>>  	*total += COSTS_N_INSNS (100);
>>        return true;
>
> And this is completely independent of the rest as well.  Cost 5 or 7 are
> completely random numbers, why did you pick these?  Does it work better
> than 8, or 4, etc.?
For 'pld', the asm code would be "pld %r9,.LC0@pcrel"; in my tests it is
faster than two insns (e.g. "li+oris"), but still slower than a single
instruction (e.g. li, pli).  So "COSTS_N_INSNS (2) - 1" was selected.
With this cost, CSE will not replace 'pld' with 'li + oris'.
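
To spell out the arithmetic behind that choice (assuming the default
COSTS_N_INSNS (N) == (N) * 4 scaling, and that a constant needing n
insns costs COSTS_N_INSNS (n) - 1 as a SET per the hunk above; this is
my reading, not a measurement):

  pld from the pool:            COSTS_N_INSNS (2) - 1 = 7
  2-insn build (e.g. li+oris):  COSTS_N_INSNS (2) - 1 = 7
  1-insn build (e.g. li, pli):  COSTS_N_INSNS (1) - 1 = 3
  5-insn build:                 COSTS_N_INSNS (5) - 1 = 19

So a single-insn constant still wins over 'pld', while constants that
need two or more insns are at least as cheap from the pool, and CSE has
no cost incentive to turn the 'pld' back into a multi-insn build.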

>
>
>> --- a/gcc/testsuite/gcc.target/powerpc/medium_offset.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
>> @@ -1,7 +1,7 @@
>>  /* { dg-do compile { target { powerpc*-*-* } } } */
>>  /* { dg-require-effective-target lp64 } */
>>  /* { dg-options "-O" } */
>> -/* { dg-final { scan-assembler-not "\\+4611686018427387904" } } */
>> +/* { dg-final { scan-assembler-times {\msldi|pld\M} 1 } } */
>
> Why?  This is still better generated in code, no?  It should never be
> loaded from a constant pool (it is hex 4000_0000_0000_0000, easy to
> construct with just one or two insns).

For p8/p9, two insns ("lis 0x4000" + "sldi 32") are used:
        addis %r3,%r2,.LANCHOR0@toc@ha
        addi %r3,%r3,.LANCHOR0@toc@l
        lis %r9,0x4000
        sldi %r9,%r9,32
        add %r3,%r3,%r9
	blr

On p10, as expected, 'pld' is better than 'lis+sldi'.
For this case it also saves the other instructions (addis/addi):
        pld %r3,.LC1@pcrel
        blr
....
.LC1:
     	.quad   .LANCHOR0+4611686018427387904
....
        .set    .LANCHOR0,. + 0
        .type   x, @object
        .size   x, 4
x:
        .zero   4
        
>
>> --- a/gcc/testsuite/gcc.target/powerpc/pr93012.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr93012.c
>> @@ -10,4 +10,4 @@ unsigned long long mskh1() { return 0xffff9234ffff9234ULL; }
>>  unsigned long long mskl1() { return 0x2bcdffff2bcdffffULL; }
>>  unsigned long long mskse() { return 0xffff1234ffff1234ULL; }
>>  
>> -/* { dg-final { scan-assembler-times {\mrldimi\M} 7 } } */
>> +/* { dg-final { scan-assembler-times {\mrldimi|ld|pld\M} 7 } } */
>
> Please make this the exact number of times you want to see rldimi and
> the number of times you want a load.
>
> Btw, you need to write
>   \m(?:rldimi|ld|pld)\M
Oh, thanks!  I was not aware of this.

> or it will mean
>   \mrldimi
> or
>   ld
> or
>   pld\M
> (and that "ld" will match anything that "pld$" will match of course).
Maybe I should split this into two test cases, one for
"-mdejagnu-cpu=power10" and one for "-mdejagnu-cpu=power7", because the
number of load instructions would differ between them; a rough sketch of
the split is below.
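
A rough sketch (the file names are hypothetical, the dg-options should
follow whatever the existing pr93012.c already uses, and the <N> counts
are placeholders that still need to be taken from the actual generated
code):

  /* pr93012-p7.c */
  /* { dg-options "-O2 -mdejagnu-cpu=power7" } */
  /* { dg-final { scan-assembler-times {\mrldimi\M} <N1> } } */

  /* pr93012-p10.c */
  /* { dg-options "-O2 -mdejagnu-cpu=power10" } */
  /* { dg-final { scan-assembler-times {\mrldimi\M} <N2> } } */
  /* { dg-final { scan-assembler-times {\mpld\M} <N3> } } */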

>
>
> So no doubt this will improve things, but we need testing of each part
> separately.  Also look at code size, or differences in the generated
> code in general: this is much more sensitive to detect than performance,
> and is not itself sensitive to things like system load, so a) is easier
> to measure, and b) has more useful outputs, outputs that tell more of
> the whole story.
Thanks for your suggestions!
I also find performance very sensitive to measure; a change (e.g. to the
cost of a constant) may affect other optimization passes indirectly.
I will also check the generated objects (object size and code
differences).  I may use the objects from a GCC bootstrap and from the
GCC testsuite.

BR,
Jiufu Guo

>
>
> Segher
