public inbox for gcc-patches@gcc.gnu.org
From: "Kewen.Lin" <linkw@linux.ibm.com>
To: Ajit Agarwal <aagarwa1@linux.ibm.com>,
	Richard Biener <richard.guenther@gmail.com>,
	Peter Bergner <bergner@linux.ibm.com>
Cc: Vladimir Makarov <vmakarov.gcc@gmail.com>,
	Michael Meissner <meissner@linux.ibm.com>,
	Segher Boessenkool <segher@kernel.crashing.org>,
	David Edelsohn <dje.gcc@gmail.com>,
	gcc-patches <gcc-patches@gcc.gnu.org>,
	Richard Sandiford <richard.sandiford@arm.com>,
	alex.coplan@arm.com
Subject: Re: [PATCH V1] rs6000: New pass for replacement of adjacent (load) lxv with lxvp
Date: Wed, 17 Jan 2024 15:02:01 +0800	[thread overview]
Message-ID: <f66709f8-39cb-024c-1fcb-416a3a3efc90@linux.ibm.com> (raw)
In-Reply-To: <0efa1db8-323e-4cdb-9b95-bf6cee25d03d@linux.ibm.com>

on 2024/1/16 06:22, Ajit Agarwal wrote:
> Hello Richard:
> 
> On 15/01/24 6:25 pm, Ajit Agarwal wrote:
>>
>>
>> On 15/01/24 6:14 pm, Ajit Agarwal wrote:
>>> Hello Richard:
>>>
>>> On 15/01/24 3:03 pm, Richard Biener wrote:
>>>> On Sun, Jan 14, 2024 at 4:29 PM Ajit Agarwal <aagarwa1@linux.ibm.com> wrote:
>>>>>
>>>>> Hello All:
>>>>>
>>>>> This patch adds the vecload pass to replace adjacent memory accesses
>>>>> (lxv) with lxvp instructions.  This pass is added before the ira pass.
>>>>>
>>>>> The vecload pass removes one of the two adjacent defined lxv (load)
>>>>> instructions and replaces the pair with lxvp.  Due to the removal of
>>>>> one of the defined loads, the allocno has only uses but no defs.
>>>>>
>>>>> Due to this, the IRA pass doesn't assign the pair sequential
>>>>> registers.  Changes are made in the IRA register allocator to assign
>>>>> sequential registers to adjacent loads.
>>>>>
>>>>> Some of the registers are cleared and are not marked as profitable
>>>>> registers because a zero cost is greater than a negative cost, and
>>>>> checks are added to compare positive costs.
>>>>>
>>>>> LRA is changed not to reassign them to different registers, keeping
>>>>> the sequential register pairs intact.
>>>>>
>>>>> contrib/check_GNU_style.sh run on patch looks good.
>>>>>
>>>>> Bootstrapped and regtested for powerpc64-linux-gnu.
>>>>>
>>>>> SPEC2017 benchmarks were run and I see impressive benefits for some
>>>>> of the FP benchmarks.
>>>> I want to point out the aarch64 target recently got a ld/st fusion
>>>> pass which sounds related.  It would be nice to have at least common
>>>> infrastructure for this (the aarch64 one also looks quite a bit more
>>>> powerful).

Thanks Richi for pointing out this pass.  Yeah, it would be nice if we
could share something common.  CC'ing the author Alex as well in case he
has more insightful comments.

>>>
>>> The load/store fusion pass in aarch64 is scheduled before the peephole2
>>> pass and after the register allocator pass. In our case, if we do it
>>> after the register allocator, then we should keep the register assigned
>>> to the lower-offset load, and the other load, adjacent to the previous
>>> load with an offset difference of 16, is removed.
>>>
>>> Then we are left with one load with the lower offset, and the register
>>> assigned by the register allocator for the lower-offset load should be
>>> lower than that of the other adjacent load. If not, we need to change it
>>> to the lower register and propagate it to all the uses of the variable.
>>> Similarly, for the other adjacent load that we are removing, the
>>> register needs to be propagated to all its uses.
>>>
>>> In that case we are doing the work of the register allocator. In most
>>> of our example testcases the lower-offset load is assigned a greater
>>> register than the other adjacent load by the register allocator, and
>>> hence we are always left propagating them, almost redoing the register
>>> allocator's work.
>>>
>>> Is it okay to use the load/store fusion pass as on aarch64 for our
>>> cases, considering the above scenario?
>>>
>>> Please let me know what you think.
> 
> I have gone through the implementation of ld/st fusion in aarch64.
> 
> Here is my understanding:
> 
> First of all, it was my mistake to mention in my earlier mail that
> this pass runs before peephole2 after the RA pass.
> 
> This pass runs both before the RA pass (early, before early-remat)
> and after the RA pass (before peephole2).
> 
> This pass fuses 2 ldr instructions with adjacent accesses
> into a ldp instruction.
> 
> The assembly syntax of ldp instruction is
> 
> ldp w3, w7, [x0]
> 
> It loads [x0] into w3 and [x0+4] into w7.
> 
> Both registers that form the pair are named in the ldp instruction
> and need not be in sequential order, i.e. if the first register is
> w3, the next register need not be w3+1.
> 
> That's why the pass works before the RA pass: it has both defs, and
> the registers are not required to be in sequential order like
> first_reg and then first_reg+1. They can be any valid registers.
> 
> 
> But in lxvp instructions:
> 
> lxv vs32, 0(r2)
> lxv vs45, 16(r2)
> 
> When we combine the above lxv instructions into lxvp, the lxvp
> instruction becomes
> 
> lxvp vs32, 0(r2)
> 
> wherein r2+0 is loaded into vs32 and r2+16 is loaded into the vs33
> register (sequential registers). vs33 is implicit in the lxvp
> instruction. This is a mandatory requirement of the lxvp instruction
> and cannot be any other sequence; the register number difference must
> be 1.

Note that the first register number in the pair must be even, that is,
the so-called sequential order is X, X + 1 where X is even.  This is
also the reason why we prefer this pairing to be done before RA (it can
catch more opportunities).
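To make the constraint concrete, the pairing condition can be sketched abstractly (plain C, not GCC internals; the function name, the register numbering, and the 16-byte stride are illustrative assumptions):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the lxvp pairing constraint: the lower-offset load must
   land in an even VSX register X, the higher-offset load in X + 1,
   and the two accesses must be exactly 16 bytes apart.  */
bool can_pair_lxvp (int reg_lo, int off_lo, int reg_hi, int off_hi)
{
  return (reg_lo & 1) == 0          /* first register must be even */
         && reg_hi == reg_lo + 1    /* strictly sequential pair */
         && off_hi - off_lo == 16;  /* adjacent 16-byte accesses */
}
```

For example, vs32/vs33 at offsets 0/16 would pair, while vs33/vs34 would not, since the first register is odd.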

> 
> All the uses of r45 has to be propagated with r33.

I think you meant s/r45/vs45/ and s/r33/vs33/.

> 
> And also the register allocator can allocate the two lxv instructions
> to the following registers.
> 
> lxv vs33, 0(r2)
> lxv vs32, 16(r2)
> 
> To generate lxvp for the above lxv instructions:
> 
> lxvp vs32, 0(r2)
> 
> all uses of vs33 have to be propagated with vs32, and vs32 has to be
> propagated with vs33, if we do the vecload pass after the RA pass.
> 
> If we do it before the RA pass, IRA and LRA cannot assign the two
> registers with a difference of 1; the difference can be anything
> positive.

This sounds unexpected.  IMHO if you adopt OOmode for the paired load,
RA should be able to allocate two sequential vsx registers with the
first one even, since OOmode is only valid for even vsx registers and
its size makes it take two consecutive vsx registers.

Hi Peter, is my understanding correct?

> 
> IRA allocated one in vs32 and the other in vs45.
> 
> In the vecload pass we remove one lxv of the 2 lxv instructions: the
> 2nd lxv, with offset 16, is removed, so the uses of the 2nd lxv's
> register have no defs and the IRA pass cannot allocate them in order
> with a difference of 1.

With Peter's patch to allow subreg from OOmode, I'd expect that we
replace all uses of the first lxv (from the low part address) with
(subreg:VnX <the result of lxvp> <low offset>) and all uses of the
second lxv (from the high address) with (subreg:VnX <the result of
lxvp> <high offset>).  Currently after vecload, with the associated
vecload.C, we transform from:

(insn 8 4 10 2 (set (reg:V16QI 124 [ *ptr_4(D) ])
        (mem:V16QI (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) 1186 {vsx_movv16qi_64bit}
     (nil))
(insn 10 8 9 2 (set (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
        (mem:V16QI (plus:DI (reg/v/f:DI 122 [ ptr ])
                (const_int 16 [0x10])) [0 MEM[(__vector unsigned char *)ptr_4(D) + 16B]+0 S16 A128])) 1186 {vsx_movv16qi_64bit}
     (expr_list:REG_DEAD (reg/v/f:DI 122 [ ptr ])
        (nil)))
(insn 9 10 11 2 (set (reg:XO 119 [ _7 ])
        (unspec:XO [
                (reg/v:V16QI 123 [ src ])
                (reg:V16QI 124 [ *ptr_4(D) ])
            ] UNSPEC_MMA_XVF32GER)) 2203 {mma_xvf32ger}
     (expr_list:REG_DEAD (reg:V16QI 124 [ *ptr_4(D) ])
        (nil)))
(insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
        (unspec:XO [
                (reg:XO 119 [ _7 ])
                (reg/v:V16QI 123 [ src ])
                (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
            ] UNSPEC_MMA_XVF32GERPP)) 2217 {mma_xvf32gerpp}
     (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
        (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
            (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
                (nil)))))

to:

(insn 19 4 9 2 (set (reg:OO 124 [ *ptr_4(D) ])
        (mem:OO (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) -1
     (nil))
(insn 9 19 11 2 (set (reg:XO 119 [ _7 ])
        (unspec:XO [
                (reg/v:V16QI 123 [ src ])
                (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
            ] UNSPEC_MMA_XVF32GER)) 2203 {mma_xvf32ger}
     (expr_list:REG_DEAD (reg:OO 124 [ *ptr_4(D) ])
        (nil)))
(insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
        (unspec:XO [
                (reg:XO 119 [ _7 ])
                (reg/v:V16QI 123 [ src ])
                (reg:OO 124 [ *ptr_4(D) ])
            ] UNSPEC_MMA_XVF32GERPP)) 2217 {mma_xvf32gerpp}
     (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
        (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
            (expr_list:REG_DEAD (reg:XO 119 [ _7 ])


After vecload, IMHO this code sequence doesn't look valid: no insn
defines pseudo 125, and pseudo 124 has OOmode but is used to replace a
use of V16QI.  Both are expected to be subregs of pseudo 124 (it would
be better to use a new pseudo for the paired load).
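A hedged sketch of what valid post-vecload RTL could look like with a fresh pseudo for the paired load (the pseudo number 126 and the 0/16 subreg byte offsets are illustrative; which half is "low" would also depend on endianness):

```
(set (reg:OO 126) (mem:OO (reg/v/f:DI 122)))  ;; single paired load
;; uses of old pseudo 124 become (subreg:V16QI (reg:OO 126) 0)
;; uses of old pseudo 125 become (subreg:V16QI (reg:OO 126) 16)
```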

Without Peter's patch, UNSPEC_MMA_EXTRACT can be used for the
extracting; I think it should only result in sub-optimal code with
possibly extra register moves, but not cause any correctness issue.

This patch can NOT be bootstrapped on x86_64-redhat-linux; I guess
that's caused by the proposed RA changes.  It also does NOT pass
regression testing on Power10 LE, with some go failures (note that I
configured with --enable-languages="c,c++,fortran,objc,obj-c++,go"),
nor on Power9 BE and LE, with a few failures.

btw, I think Mike wants to test this pairing support on both adjacent
loads and stores, so could you also extend the pairing to cover stores?
It could be guarded by a separate flag and disabled by default, given
the known issue with paired stores; then Mike can test whether the
pairing satisfies what he's looking for.

BR,
Kewen



Thread overview: 11+ messages
2024-01-14 15:25 Ajit Agarwal
2024-01-15  9:03 ` Ajit Agarwal
2024-01-15  9:33 ` Richard Biener
2024-01-15 12:44   ` Ajit Agarwal
2024-01-15 12:55     ` Ajit Agarwal
2024-01-15 22:22       ` Ajit Agarwal
2024-01-17  7:02         ` Kewen.Lin [this message]
2024-01-17  9:34           ` Ajit Agarwal
2024-01-17 14:28             ` Michael Matz
2024-01-18 12:17               ` Ajit Agarwal
2024-01-19  4:19       ` Michael Meissner
