From: Richard Sandiford <richard.sandiford@arm.com>
To: Richard Biener <rguenther@suse.de>
Cc: Hongtao Liu <crazylht@gmail.com>,
Hongtao Liu <hongtao.liu@intel.com>,
GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH 2/2][RFC] Add loop masking support for x86
Date: Thu, 15 Jul 2021 15:57:26 +0100 [thread overview]
Message-ID: <mptv95baall.fsf@arm.com> (raw)
In-Reply-To: <nycvar.YFH.7.76.2107151320060.10711@zhemvz.fhfr.qr> (Richard Biener's message of "Thu, 15 Jul 2021 13:48:38 +0200 (CEST)")
Richard Biener <rguenther@suse.de> writes:
> On Thu, 15 Jul 2021, Hongtao Liu wrote:
>
>> On Thu, Jul 15, 2021 at 6:45 PM Richard Biener via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>> >
>> > On Thu, Jul 15, 2021 at 12:30 PM Richard Biener <rguenther@suse.de> wrote:
>> > >
>> > > The following extends the existing loop masking support using
>> > > SVE WHILE_ULT to x86 by providing an alternate way to produce the
>> > > mask using VEC_COND_EXPRs. So with --param vect-partial-vector-usage
>> > > you can now enable masked vectorized epilogues (=1) or fully
>> > > masked vector loops (=2).
>> > >
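For reference, the WHILE_ULT being emulated here is just a per-lane
unsigned comparison of the vector IV against the trip count.  A minimal
scalar-C sketch of the mask computation (vf, iv and n are illustrative
names, not taken from the patch):

  /* Sketch only: lane LANE of the loop mask is "iv + lane < n",
     unsigned.  SVE's whilelo computes this in a single instruction;
     the vec_cmp fallback has to open-code it as broadcast/add/compare.  */
  static void
  emulated_while_ult (int *mask, unsigned iv, unsigned n, int vf)
  {
    for (int lane = 0; lane < vf; ++lane)
      mask[lane] = (iv + lane < n) ? -1 : 0;
  }
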
>> > > What's missing is using a scalar IV for the loop control
>> > > (but in principle AVX512 can use the mask here - just the patch
>> > > doesn't seem to work for AVX512 yet for some reason - likely
>> > > expand_vec_cond_expr_p doesn't work there). What's also missing
>> > > is providing more support for predicated operations in the case
>> > > of reductions either via VEC_COND_EXPRs or via implementing
>> > > some of the .COND_{ADD,SUB,MUL...} internal functions as mapping
>> > > to masked AVX512 operations.
>> > >
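And a .COND_ADD is, per lane, a select between the addition result and
a fallback operand, which is what a merge-masked AVX512 vpaddd provides
directly.  A sketch with illustrative names:

  /* Sketch of .COND_ADD semantics: lanes with a zero mask keep the
     fallback value instead of the sum.  */
  static void
  cond_add_sketch (unsigned *res, const int *mask, const unsigned *a,
                   const unsigned *b, const unsigned *fallback, int vf)
  {
    for (int i = 0; i < vf; ++i)
      res[i] = mask[i] ? a[i] + b[i] : fallback[i];
  }
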
>> > > For AVX2 and
>> > >
>> > > int foo (unsigned *a, unsigned * __restrict b, int n)
>> > > {
>> > >   unsigned sum = 1;
>> > >   for (int i = 0; i < n; ++i)
>> > >     b[i] += a[i];
>> > >   return sum;
>> > > }
>> > >
>> > > we get
>> > >
>> > > .L3:
>> > > vpmaskmovd (%rsi,%rax), %ymm0, %ymm3
>> > > vpmaskmovd (%rdi,%rax), %ymm0, %ymm1
>> > > addl $8, %edx
>> > > vpaddd %ymm3, %ymm1, %ymm1
>> > > vpmaskmovd %ymm1, %ymm0, (%rsi,%rax)
>> > > vmovd %edx, %xmm1
>> > > vpsubd %ymm15, %ymm2, %ymm0
>> > > addq $32, %rax
>> > > vpbroadcastd %xmm1, %ymm1
>> > > vpaddd %ymm4, %ymm1, %ymm1
>> > > vpsubd %ymm15, %ymm1, %ymm1
>> > > vpcmpgtd %ymm1, %ymm0, %ymm0
>> > > vptest %ymm0, %ymm0
>> > > jne .L3
>> > >
>> > > for the fully masked loop body and for the masked epilogue
>> > > we see
>> > >
>> > > .L4:
>> > > vmovdqu (%rsi,%rax), %ymm3
>> > > vpaddd (%rdi,%rax), %ymm3, %ymm0
>> > > vmovdqu %ymm0, (%rsi,%rax)
>> > > addq $32, %rax
>> > > cmpq %rax, %rcx
>> > > jne .L4
>> > > movl %edx, %eax
>> > > andl $-8, %eax
>> > > testb $7, %dl
>> > > je .L11
>> > > .L3:
>> > > subl %eax, %edx
>> > > vmovdqa .LC0(%rip), %ymm1
>> > > salq $2, %rax
>> > > vmovd %edx, %xmm0
>> > > movl $-2147483648, %edx
>> > > addq %rax, %rsi
>> > > vmovd %edx, %xmm15
>> > > vpbroadcastd %xmm0, %ymm0
>> > > vpbroadcastd %xmm15, %ymm15
>> > > vpsubd %ymm15, %ymm1, %ymm1
>> > > vpsubd %ymm15, %ymm0, %ymm0
>> > > vpcmpgtd %ymm1, %ymm0, %ymm0
>> > > vpmaskmovd (%rsi), %ymm0, %ymm1
>> > > vpmaskmovd (%rdi,%rax), %ymm0, %ymm2
>> > > vpaddd %ymm2, %ymm1, %ymm1
>> > > vpmaskmovd %ymm1, %ymm0, (%rsi)
>> > > .L11:
>> > > vzeroupper
>> > >
>> > > compared to
>> > >
>> > > .L3:
>> > > movl %edx, %r8d
>> > > subl %eax, %r8d
>> > > leal -1(%r8), %r9d
>> > > cmpl $2, %r9d
>> > > jbe .L6
>> > > leaq (%rcx,%rax,4), %r9
>> > > vmovdqu (%rdi,%rax,4), %xmm2
>> > > movl %r8d, %eax
>> > > andl $-4, %eax
>> > > vpaddd (%r9), %xmm2, %xmm0
>> > > addl %eax, %esi
>> > > andl $3, %r8d
>> > > vmovdqu %xmm0, (%r9)
>> > > je .L2
>> > > .L6:
>> > > movslq %esi, %r8
>> > > leaq 0(,%r8,4), %rax
>> > > movl (%rdi,%r8,4), %r8d
>> > > addl %r8d, (%rcx,%rax)
>> > > leal 1(%rsi), %r8d
>> > > cmpl %r8d, %edx
>> > > jle .L2
>> > > addl $2, %esi
>> > > movl 4(%rdi,%rax), %r8d
>> > > addl %r8d, 4(%rcx,%rax)
>> > > cmpl %esi, %edx
>> > > jle .L2
>> > > movl 8(%rdi,%rax), %edx
>> > > addl %edx, 8(%rcx,%rax)
>> > > .L2:
>> > >
>> > > I'm giving this a little testing right now but will dig on why
>> > > I don't get masked loops when AVX512 is enabled.
>> >
>> > Ah, a simple thinko - rgroup_controls vectypes seem to be
>> > always VECTOR_BOOLEAN_TYPE_P and thus we can
>> > use expand_vec_cmp_expr_p. The AVX512 fully masked
>> > loop then looks like
>> >
>> > .L3:
>> > vmovdqu32 (%rsi,%rax,4), %ymm2{%k1}
>> > vmovdqu32 (%rdi,%rax,4), %ymm1{%k1}
>> > vpaddd %ymm2, %ymm1, %ymm0
>> > vmovdqu32 %ymm0, (%rsi,%rax,4){%k1}
>> > addq $8, %rax
>> > vpbroadcastd %eax, %ymm0
>> > vpaddd %ymm4, %ymm0, %ymm0
>> > vpcmpud $6, %ymm0, %ymm3, %k1
>> > kortestb %k1, %k1
>> > jne .L3
>> >
>> > I guess for x86 it's not worth preserving the VEC_COND_EXPR
>> > mask generation but other archs may not provide all required vec_cmp
>> > expanders.
>>
>> For the main loop, the fully-masked loop's codegen seems much worse:
>> we need at least 4 instructions to do what while_ult does on arm.
>>
>> vpbroadcastd %eax, %ymm0
>> vpaddd %ymm4, %ymm0, %ymm0
>> vpcmpud $6, %ymm0, %ymm3, %k1
>> kortestb %k1, %k1
>> vs
>> whilelo (or some other while<op>)
>>
>> More instructions are needed for AVX2 since there's no direct
>> instruction for .COND_{ADD,SUB,...}.
>>
>> original
>> .L4:
>> vmovdqu (%rcx,%rax), %ymm1
>> vpaddd (%rdi,%rax), %ymm1, %ymm0
>> vmovdqu %ymm0, (%rcx,%rax)
>> addq $32, %rax
>> cmpq %rax, %rsi
>> jne .L4
>>
>> vs
>> avx512 full-masked loop
>> .L3:
>> vmovdqu32 (%rsi,%rax,4), %ymm2{%k1}
>> vmovdqu32 (%rdi,%rax,4), %ymm1{%k1}
>> vpaddd %ymm2, %ymm1, %ymm0
>> vmovdqu32 %ymm0, (%rsi,%rax,4){%k1}
>> addq $8, %rax
>> vpbroadcastd %eax, %ymm0
>> vpaddd %ymm4, %ymm0, %ymm0
>> vpcmpud $6, %ymm0, %ymm3, %k1
>> kortestb %k1, %k1
>> jne .L3
>>
>> vs
>> avx2 full-masked loop
>> .L3:
>> vpmaskmovd (%rsi,%rax), %ymm0, %ymm3
>> vpmaskmovd (%rdi,%rax), %ymm0, %ymm1
>> addl $8, %edx
>> vpaddd %ymm3, %ymm1, %ymm1
>> vpmaskmovd %ymm1, %ymm0, (%rsi,%rax)
>> vmovd %edx, %xmm1
>> vpsubd %ymm15, %ymm2, %ymm0
>> addq $32, %rax
>> vpbroadcastd %xmm1, %ymm1
>> vpaddd %ymm4, %ymm1, %ymm1
>> vpsubd %ymm15, %ymm1, %ymm1
>> vpcmpgtd %ymm1, %ymm0, %ymm0
>> vptest %ymm0, %ymm0
>> jne .L3
>>
>> vs arm64's code
>>
>> .L3:
>> ld1w z1.s, p0/z, [x1, x3, lsl 2]
>> ld1w z0.s, p0/z, [x0, x3, lsl 2]
>> add z0.s, z0.s, z1.s
>> st1w z0.s, p0, [x1, x3, lsl 2]
>> add x3, x3, x4
>> whilelo p0.s, w3, w2
>> b.any .L3
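(A side note on the AVX2 sequence above: the paired vpsubd %ymm15
instructions are the usual bias trick.  AVX2 has no unsigned vector
compare, so both operands are offset by 0x80000000, kept broadcast in
%ymm15, after which signed vpcmpgtd gives the unsigned comparison.
As a scalar sketch, assuming GCC's wrapping unsigned-to-signed
conversion:)

  /* Sketch: unsigned a > b computed with only a signed compare,
     as AVX2 must do it.  */
  static int
  gtu_via_signed (unsigned a, unsigned b)
  {
    return (int) (a - 0x80000000u) > (int) (b - 0x80000000u);
  }
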
>
> Yes, that's true - it might still be OK for vectorizing epilogues
> and thus --param vect-partial-vector-usage=1
>
> Can AVX512 do any better than this?
>
> vpbroadcastd %eax, %ymm0
> vpaddd %ymm4, %ymm0, %ymm0
> vpcmpud $6, %ymm0, %ymm3, %k1
>
> Note with multiple types involved things get even worse
> since you need masks for each vector mode. But as far as
> I can see that's the same for SVE (but there we have the
> single-instruction whilelo). I guess we'll also generate
> wrong code at the moment for the case where we need
> multiple vectors to hold the full mask - vect_gen_while
> doesn't seem to be prepared for this?
>
> So with
>
> int foo (unsigned long *a, unsigned * __restrict b, int n)
> {
>   unsigned sum = 1;
>   for (int i = 0; i < n; ++i)
>     {
>       b[i] += a[i];
>     }
>   return sum;
> }
>
> SVE uses
>
> .L3:
> ld1d z0.d, p0/z, [x1, x3, lsl 3]
> ld1d z1.d, p0/z, [x0, x3, lsl 3]
> adr z0.d, [z0.d, z1.d, lsl 2]
> st1d z0.d, p0, [x1, x3, lsl 3]
> add x3, x3, x4
> whilelo p0.d, w3, w2
> b.any .L3
>
> so p0 vs. p0/z, whatever that means and it looks like
> the vector add can somehow concatenate z0.d and z1.d.
> Truly fascinating ;)
It looks like this is from a version in which “b” was also long.
For the loop above I get:
.L3:
ld1d z0.d, p0/z, [x0, x3, lsl 3]
ld1w z1.d, p0/z, [x1, x3, lsl 2]
add z0.s, z0.s, z1.s
st1w z0.d, p0, [x1, x3, lsl 2]
add x3, x3, x4
whilelo p0.d, w3, w2
b.any .L3
where the ld1w is an extending load and the st1w is a truncating store.
But yeah, there's a hard-coded assumption that a mask for 2 SIs
can also control 1 DI, etc. For example:
void foo (unsigned long *a, unsigned * __restrict b, int n)
{
  for (int i = 0; i < n; ++i)
    {
      a[i * 2] += 1;
      a[i * 2 + 1] += 2;
      b[i * 4] += 1;
      b[i * 4 + 1] += 2;
      b[i * 4 + 2] += 3;
      b[i * 4 + 3] += 4;
    }
}
becomes:
.L3:
ld1d z0.d, p0/z, [x0]
add z0.d, z0.d, z2.d
st1d z0.d, p0, [x0]
ld1w z0.s, p0/z, [x1, x3, lsl 2]
add z0.s, z0.s, z1.s
st1w z0.s, p0, [x1, x3, lsl 2]
add x0, x0, x4
add x3, x3, x5
whilelo p0.s, x3, x2
b.any .L3
This is explained more in the big comment above rgroup_controls
(guess you've already seen it).
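Since the mask produced by WHILE_ULT is always a leading prefix of
active lanes, reusing an SImode mask for DImode elements amounts to
taking every other lane.  As a sketch (illustrative names):

  /* Sketch: each DI lane is controlled by the predicate bit of the
     first of its two SI halves.  Only valid for prefix-style masks
     such as those WHILE_ULT produces.  */
  static void
  reuse_si_mask_for_di (int *mask_di, const int *mask_si, int vf_di)
  {
    for (int j = 0; j < vf_di; ++j)
      mask_di[j] = mask_si[2 * j];
  }
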
> It looks like --param vect-partial-vector-usage defaults to 2;
> Power forces it to 1 (on power10) or 0 otherwise.
>
> I think we'd need a target hook that toggles this per mode
> so we could tune this dependent on AVX512 vectorization vs. AVX2.
Yeah, keying it off the architecture made sense for Power because the
only purpose of len_load/store is to support partial vectorisation.
But I agree a hook is needed now that we have MASK_LOAD/STORE for
general use and are emulating WHILE_ULT.
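For illustration, the hook could be as simple as a per-mode variant of
the existing --param (purely hypothetical: neither the hook name nor
its return convention exists anywhere yet):

  /* Hypothetical sketch, not an existing GCC interface: return the
     preferred vect-partial-vector-usage level (0, 1 or 2) for loops
     whose main vector mode is MODE.  */
  static unsigned
  ix86_preferred_partial_vector_usage (machine_mode mode)
  {
    /* E.g. fully masked loops only for 512-bit modes, where AVX512
       mask registers make the loop control cheap; masked epilogues
       elsewhere.  */
    return GET_MODE_SIZE (mode) == 64 ? 2 : 1;
  }
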
Thanks,
Richard