Re: [PATCH v2 2/2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: xionghu luo <luoxhu@linux.ibm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: Segher Boessenkool <segher@kernel.crashing.org>,
	GCC Patches <gcc-patches@gcc.gnu.org>,
	David Edelsohn <dje.gcc@gmail.com>,
	Bill Schmidt <wschmidt@linux.ibm.com>,
	Jiufu Guo <guojiufu@linux.ibm.com>,
	linkw@gcc.gnu.org, Richard Sandiford <richard.sandiford@arm.com>
Subject: Re: [PATCH v2 2/2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
Date: Fri, 25 Sep 2020 11:50:40 +0800	[thread overview]
Message-ID: <6d2ccf37-669f-c80d-2efd-1ceb086f14f0@linux.ibm.com> (raw)
In-Reply-To: <CAFiYyc2O+dj-ApbaDODW7i9Jmc0_pdrA3JXAoxQXuwHvqRmAeA@mail.gmail.com>

Hi,

On 2020/9/24 21:27, Richard Biener wrote:
> On Thu, Sep 24, 2020 at 10:21 AM xionghu luo <luoxhu@linux.ibm.com> wrote:
> 
> I'll just comment that
> 
>          xxperm 34,34,33
>          xxinsertw 34,0,12
>          xxperm 34,34,32
> 
> doesn't look like a variable-position insert instruction but
> this is a variable whole-vector rotate plus an insert at index zero
> followed by a variable whole-vector rotate.  I'm not fluend in
> ppc assembly but
> 
>          rlwinm 6,6,2,28,29
>          mtvsrwz 0,5
>          lvsr 1,0,6
>          lvsl 0,0,6
> 
> possibly computes the shift masks for r33/r32?  though
> I do not see those registers mentioned...

For V4SI:
       rlwinm 6,6,2,28,29      // r6*4
       mtvsrwz 0,5             // vs0   <- r5  (0xfe)
       lvsr 1,0,6              // vs33  <- lvsr[r6]
       lvsl 0,0,6              // vs32  <- lvsl[r6] 
       xxperm 34,34,33       
       xxinsertw 34,0,12
       xxperm 34,34,32
       blr


idx = idx * 4; 
0    0      0x4000000030000000200000001   xxperm:0x4000000030000000200000001   vs33:0x101112131415161718191a1b1c1d1e1f  vs32:0x102030405060708090a0b0c0d0e0f
1    4      0x4000000030000000200000001   xxperm:0x1000000040000000300000002   vs33:0xc0d0e0f101112131415161718191a1b   vs32:0x405060708090a0b0c0d0e0f10111213
2    8      0x4000000030000000200000001   xxperm:0x2000000010000000400000003   vs33:0x8090a0b0c0d0e0f1011121314151617   vs32:0x8090a0b0c0d0e0f1011121314151617
3    12     0x4000000030000000200000001   xxperm:0x3000000020000000100000004   vs33:0x405060708090a0b0c0d0e0f10111213   vs32:0xc0d0e0f101112131415161718191a1b

vs34:
 0x40000000300000002000000fe
 0x400000003000000fe00000001
 0x4000000fe0000000200000001
0xfe000000030000000200000001


"xxinsertw 34,0,12" will always insert vs0[32:63] content to the forth word of
target vector, bits[96:127].  Then the second xxperm rotate the modified vector
back. 

All the instructions are register based operation, as Segher replied, power9
supports only fixed position inserts, so we need do some trick here to support
it instead of generate short store wide load instructions.


> 
> This might be a generic viable expansion strathegy btw,
> which is why I asked before whether the CPU supports
> inserts at a variable position ...  the building blocks are
> already there with vec_set at constant zero position
> plus vec_perm_const for the rotates.
> 
> But well, I did ask this question.  Multiple times.
> 
> ppc does _not_ have a VSX instruction
> like xxinsertw r34, r8, r12 where r8 denotes
> the vector element (or byte position or whatever).
> 
> So I don't think vec_set with a variable index is the
> best approach.
> Xionghu - you said even without the patch the stack
> storage is eventually elided but
> 
>          addi 9,1,-16
>          rldic 6,6,2,60
>          stxv 34,-16(1)
>          stwx 5,9,6
>          lxv 34,-16(1)
> 
> still shows stack(?) store/load with a bad STLF penalty.


Sorry that if I didn't describe clearly and misunderstood you, I mean if insert many
instructions(tested with a loop inserted) between "stwx 5,9,6" and "lxv 34,-16(1)",
the store hit load performance issue could be elided, but this is not the solution
we want.

I also changed your test as below and build for X86, seems it also generates
inefficient code?  What my patch does maybe different usage from your pasted
case? 

#define N 32
typedef int T;
typedef T V __attribute__((vector_size(N)));
  V setg3 (V v, int idx, T val)
{
    v[idx&31] = val;
    return v;
}

-O2 -S -mavx -march=znver2:

setg3:
        push    rbp
        and     edi, 31
        mov     rbp, rsp
        and     rsp, -32
        vmovdqa YMMWORD PTR [rsp-32], ymm0
        mov     DWORD PTR [rsp-32+rdi*4], esi
        vmovdqa ymm0, YMMWORD PTR [rsp-32]
        leave
        ret


While idx is constant: 

setg3:
        vpinsrd xmm1, xmm0, esi, 3
        vinserti128     ymm0, ymm0, xmm1, 0x0
        ret

And ARM with -O2 -S -march=armv8.2-a+sve (N change to 16): 

setg3:
        sub     sp, sp, #16
        and     x0, x0, 15
        str     q0, [sp]
        str     w1, [sp, x0, lsl 2]
        ldr     q0, [sp]
        add     sp, sp, 16
        ret

While idx is constant: 

setg3:
        ins     v0.s[3], w1
        ret


Though I've no idea how to optimize this on X86 and ARM with vector instructions
to avoid short store with wide load followed on stack.


Thanks,
Xionghu

next prev parent reply	other threads:[~2020-09-25  3:50 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-18  6:17 [PATCH v2 1/2] IFN: Implement IFN_VEC_SET for ARRAY_REF with VIEW_CONVERT_EXPR Xiong Hu Luo
2020-09-18  6:17 ` [PATCH v2 2/2] rs6000: Expand vec_insert in expander instead of gimple [PR79251] Xiong Hu Luo
2020-09-18 20:19   ` Segher Boessenkool
2020-09-24  8:21     ` xionghu luo
2020-09-24 13:27       ` Richard Biener
2020-09-24 14:55         ` Richard Biener
2020-09-24 19:36           ` Segher Boessenkool
2020-09-25  6:58             ` Richard Biener
2020-09-25 12:21               ` Richard Sandiford
2020-09-25 22:39               ` Segher Boessenkool
2020-09-28  8:00                 ` Richard Biener
2020-09-24 19:02         ` Segher Boessenkool
2020-09-25  3:50         ` xionghu luo [this message]
2020-09-25  5:00           ` Richard Biener
2020-09-18 18:20 ` [PATCH v2 1/2] IFN: Implement IFN_VEC_SET for ARRAY_REF with VIEW_CONVERT_EXPR Segher Boessenkool
2020-09-21  8:31 ` Richard Biener
2020-09-22  3:55   ` [PATCH v3 " xionghu luo
2020-09-23 11:33     ` Richard Biener
2020-09-24  3:24       ` xionghu luo
2020-09-24 12:39         ` Richard Sandiford
2020-09-25  6:51           ` [PATCH v4 1/3] " xionghu luo
2020-09-25 11:32             ` Richard Biener
2020-09-25 13:28             ` Richard Sandiford
2020-09-27  5:45               ` xionghu luo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6d2ccf37-669f-c80d-2efd-1ceb086f14f0@linux.ibm.com \
    --to=luoxhu@linux.ibm.com \
    --cc=dje.gcc@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=guojiufu@linux.ibm.com \
    --cc=linkw@gcc.gnu.org \
    --cc=richard.guenther@gmail.com \
    --cc=richard.sandiford@arm.com \
    --cc=segher@kernel.crashing.org \
    --cc=wschmidt@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).