From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 15 Jul 2021 17:31:20 +0200 (CEST)
From: Richard Biener
To: Richard Sandiford
cc: Hongtao Liu, Hongtao Liu, GCC Patches
Subject: Re: [PATCH 2/2][RFC] Add loop masking support for x86
References: <73rrp0p-859r-oq2n-pss7-6744807s3qr5@fhfr.qr>
User-Agent: Alpine 2.21 (LSU 202 2017-01-01)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
List-Id: Gcc-patches mailing list

On Thu, 15 Jul 2021, Richard Biener wrote:

> On Thu, 15 Jul 2021, Richard Sandiford wrote:
> 
> > Richard Biener writes:
> > > On Thu, 15 Jul 2021, Hongtao Liu wrote:
> > > 
> > >> On Thu, Jul 15, 2021 at 6:45 PM Richard Biener via Gcc-patches
> > >> wrote:
> > >> >
> > >> > On Thu, Jul 15, 2021 at 12:30 PM Richard Biener wrote:
> > >> > >
> > >> > > The following extends the existing loop masking support using
> > >> > > SVE WHILE_ULT to x86 by providing an alternate way to produce the
> > >> > > mask using VEC_COND_EXPRs.  So with --param vect-partial-vector-usage
> > >> > > you can now enable masked vectorized epilogues (=1) or fully
> > >> > > masked vector loops (=2).
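
For illustration, here is a rough scalar C model of what the two --param
values mean for the "b[i] += a[i]" loop used in the examples below (this
is not the patch itself; VF, the function names and the per-lane inner
loops are made up for exposition):

/* Illustrative model only.  VF stands for the vectorization factor;
   each inner loop over l models one (possibly predicated) vector
   statement.  */
enum { VF = 8 };

/* --param vect-partial-vector-usage=2: fully masked loop, every
   vector iteration runs under a lane mask, no scalar epilogue.  */
void fully_masked (unsigned *b, const unsigned *a, int n)
{
  for (int i = 0; i < n; i += VF)
    for (int l = 0; l < VF; ++l)
      if (i + l < n)              /* the per-lane loop mask */
        b[i + l] += a[i + l];
}

/* --param vect-partial-vector-usage=1: unmasked main loop plus one
   masked vector epilogue iteration instead of a scalar tail.  */
void masked_epilogue (unsigned *b, const unsigned *a, int n)
{
  int i = 0;
  for (; i + VF <= n; i += VF)    /* unmasked vector body */
    for (int l = 0; l < VF; ++l)
      b[i + l] += a[i + l];
  for (int l = 0; l < VF; ++l)    /* single masked tail iteration */
    if (i + l < n)
      b[i + l] += a[i + l];
}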
> > >> > > 
> > >> > > What's missing is using a scalar IV for the loop control
> > >> > > (but in principle AVX512 can use the mask here - just the patch
> > >> > > doesn't seem to work for AVX512 yet for some reason - likely
> > >> > > expand_vec_cond_expr_p doesn't work there).  What's also missing
> > >> > > is providing more support for predicated operations in the case
> > >> > > of reductions either via VEC_COND_EXPRs or via implementing
> > >> > > some of the .COND_{ADD,SUB,MUL...} internal functions as mapping
> > >> > > to masked AVX512 operations.
> > >> > > 
> > >> > > For AVX2 and
> > >> > > 
> > >> > > int foo (unsigned *a, unsigned * __restrict b, int n)
> > >> > > {
> > >> > >   unsigned sum = 1;
> > >> > >   for (int i = 0; i < n; ++i)
> > >> > >     b[i] += a[i];
> > >> > >   return sum;
> > >> > > }
> > >> > > 
> > >> > > we get
> > >> > > 
> > >> > > .L3:
> > >> > >         vpmaskmovd (%rsi,%rax), %ymm0, %ymm3
> > >> > >         vpmaskmovd (%rdi,%rax), %ymm0, %ymm1
> > >> > >         addl $8, %edx
> > >> > >         vpaddd %ymm3, %ymm1, %ymm1
> > >> > >         vpmaskmovd %ymm1, %ymm0, (%rsi,%rax)
> > >> > >         vmovd %edx, %xmm1
> > >> > >         vpsubd %ymm15, %ymm2, %ymm0
> > >> > >         addq $32, %rax
> > >> > >         vpbroadcastd %xmm1, %ymm1
> > >> > >         vpaddd %ymm4, %ymm1, %ymm1
> > >> > >         vpsubd %ymm15, %ymm1, %ymm1
> > >> > >         vpcmpgtd %ymm1, %ymm0, %ymm0
> > >> > >         vptest %ymm0, %ymm0
> > >> > >         jne .L3
> > >> > > 
> > >> > > for the fully masked loop body and for the masked epilogue
> > >> > > we see
> > >> > > 
> > >> > > .L4:
> > >> > >         vmovdqu (%rsi,%rax), %ymm3
> > >> > >         vpaddd (%rdi,%rax), %ymm3, %ymm0
> > >> > >         vmovdqu %ymm0, (%rsi,%rax)
> > >> > >         addq $32, %rax
> > >> > >         cmpq %rax, %rcx
> > >> > >         jne .L4
> > >> > >         movl %edx, %eax
> > >> > >         andl $-8, %eax
> > >> > >         testb $7, %dl
> > >> > >         je .L11
> > >> > > .L3:
> > >> > >         subl %eax, %edx
> > >> > >         vmovdqa .LC0(%rip), %ymm1
> > >> > >         salq $2, %rax
> > >> > >         vmovd %edx, %xmm0
> > >> > >         movl $-2147483648, %edx
> > >> > >         addq %rax, %rsi
> > >> > >         vmovd %edx, %xmm15
> > >> > >         vpbroadcastd %xmm0, %ymm0
> > >> > >         vpbroadcastd %xmm15, %ymm15
> > >> > >         vpsubd %ymm15, %ymm1, %ymm1
> > >> > >         vpsubd %ymm15, %ymm0, %ymm0
> > >> > >         vpcmpgtd %ymm1, %ymm0, %ymm0
> > >> > >         vpmaskmovd (%rsi), %ymm0, %ymm1
> > >> > >         vpmaskmovd (%rdi,%rax), %ymm0, %ymm2
> > >> > >         vpaddd %ymm2, %ymm1, %ymm1
> > >> > >         vpmaskmovd %ymm1, %ymm0, (%rsi)
> > >> > > .L11:
> > >> > >         vzeroupper
> > >> > > 
> > >> > > compared to
> > >> > > 
> > >> > > .L3:
> > >> > >         movl %edx, %r8d
> > >> > >         subl %eax, %r8d
> > >> > >         leal -1(%r8), %r9d
> > >> > >         cmpl $2, %r9d
> > >> > >         jbe .L6
> > >> > >         leaq (%rcx,%rax,4), %r9
> > >> > >         vmovdqu (%rdi,%rax,4), %xmm2
> > >> > >         movl %r8d, %eax
> > >> > >         andl $-4, %eax
> > >> > >         vpaddd (%r9), %xmm2, %xmm0
> > >> > >         addl %eax, %esi
> > >> > >         andl $3, %r8d
> > >> > >         vmovdqu %xmm0, (%r9)
> > >> > >         je .L2
> > >> > > .L6:
> > >> > >         movslq %esi, %r8
> > >> > >         leaq 0(,%r8,4), %rax
> > >> > >         movl (%rdi,%r8,4), %r8d
> > >> > >         addl %r8d, (%rcx,%rax)
> > >> > >         leal 1(%rsi), %r8d
> > >> > >         cmpl %r8d, %edx
> > >> > >         jle .L2
> > >> > >         addl $2, %esi
> > >> > >         movl 4(%rdi,%rax), %r8d
> > >> > >         addl %r8d, 4(%rcx,%rax)
> > >> > >         cmpl %esi, %edx
> > >> > >         jle .L2
> > >> > >         movl 8(%rdi,%rax), %edx
> > >> > >         addl %edx, 8(%rcx,%rax)
> > >> > > .L2:
> > >> > > 
> > >> > > I'm giving this a little testing right now but will dig on why
> > >> > > I don't get masked loops when AVX512 is enabled.
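
To make the AVX2 mask generation above easier to follow, here is a
rough intrinsics sketch of the same idea (illustrative only, not the
vectorizer's output; the function name and the plain counted loop are
made up).  AVX2 has no unsigned vector compare, so both the lane
indices and the bound are biased by INT_MIN and compared with a signed
vpcmpgtd, and the resulting vector doubles as the control operand of
vpmaskmovd:

#include <immintrin.h>
#include <limits.h>

void add_loop_avx2_masked (unsigned *b, const unsigned *a, int n)
{
  const __m256i lanes = _mm256_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7);
  const __m256i bias  = _mm256_set1_epi32 (INT_MIN);
  const __m256i bound = _mm256_sub_epi32 (_mm256_set1_epi32 (n), bias);

  for (int i = 0; i < n; i += 8)
    {
      /* Lane l is active iff (unsigned)(i + l) < (unsigned)n; the
         unsigned compare is emulated by biasing both sides and doing
         a signed greater-than (the vpsubd + vpcmpgtd pair above).  */
      __m256i iv   = _mm256_add_epi32 (_mm256_set1_epi32 (i), lanes);
      __m256i mask = _mm256_cmpgt_epi32 (bound,
                                         _mm256_sub_epi32 (iv, bias));

      /* The same mask guards the loads and the store; the real fully
         masked loop also feeds it to vptest for the back edge.  */
      __m256i va = _mm256_maskload_epi32 ((const int *) (a + i), mask);
      __m256i vb = _mm256_maskload_epi32 ((const int *) (b + i), mask);
      _mm256_maskstore_epi32 ((int *) (b + i), mask,
                              _mm256_add_epi32 (va, vb));
    }
}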
> > >> > 
> > >> > Ah, a simple thinko - rgroup_controls vectypes seem to be
> > >> > always VECTOR_BOOLEAN_TYPE_P and thus we can
> > >> > use expand_vec_cmp_expr_p.  The AVX512 fully masked
> > >> > loop then looks like
> > >> > 
> > >> > .L3:
> > >> >         vmovdqu32 (%rsi,%rax,4), %ymm2{%k1}
> > >> >         vmovdqu32 (%rdi,%rax,4), %ymm1{%k1}
> > >> >         vpaddd %ymm2, %ymm1, %ymm0
> > >> >         vmovdqu32 %ymm0, (%rsi,%rax,4){%k1}
> > >> >         addq $8, %rax
> > >> >         vpbroadcastd %eax, %ymm0
> > >> >         vpaddd %ymm4, %ymm0, %ymm0
> > >> >         vpcmpud $6, %ymm0, %ymm3, %k1
> > >> >         kortestb %k1, %k1
> > >> >         jne .L3
> > >> > 
> > >> > I guess for x86 it's not worth preserving the VEC_COND_EXPR
> > >> > mask generation but other archs may not provide all required vec_cmp
> > >> > expanders.
> > >> 
> > >> For the main loop, the full-masked loop's codegen seems much worse.
> > >> Basically, we need at least 4 instructions to do what while_ult in arm does.
> > >> 
> > >>         vpbroadcastd %eax, %ymm0
> > >>         vpaddd %ymm4, %ymm0, %ymm0
> > >>         vpcmpud $6, %ymm0, %ymm3, %k1
> > >>         kortestb %k1, %k1
> > >> vs
> > >> whilelo (or some other while)
> > >> 
> > >> more instructions are needed for avx2 since there's no direct
> > >> instruction for .COND_{ADD,SUB..}
> > >> 
> > >> original
> > >> .L4:
> > >>         vmovdqu (%rcx,%rax), %ymm1
> > >>         vpaddd (%rdi,%rax), %ymm1, %ymm0
> > >>         vmovdqu %ymm0, (%rcx,%rax)
> > >>         addq $32, %rax
> > >>         cmpq %rax, %rsi
> > >>         jne .L4
> > >> 
> > >> vs
> > >> avx512 full-masked loop
> > >> .L3:
> > >>         vmovdqu32 (%rsi,%rax,4), %ymm2{%k1}
> > >>         vmovdqu32 (%rdi,%rax,4), %ymm1{%k1}
> > >>         vpaddd %ymm2, %ymm1, %ymm0
> > >>         vmovdqu32 %ymm0, (%rsi,%rax,4){%k1}
> > >>         addq $8, %rax
> > >>         vpbroadcastd %eax, %ymm0
> > >>         vpaddd %ymm4, %ymm0, %ymm0
> > >>         vpcmpud $6, %ymm0, %ymm3, %k1
> > >>         kortestb %k1, %k1
> > >>         jne .L3
> > >> 
> > >> vs
> > >> avx2 full-masked loop
> > >> .L3:
> > >>         vpmaskmovd (%rsi,%rax), %ymm0, %ymm3
> > >>         vpmaskmovd (%rdi,%rax), %ymm0, %ymm1
> > >>         addl $8, %edx
> > >>         vpaddd %ymm3, %ymm1, %ymm1
> > >>         vpmaskmovd %ymm1, %ymm0, (%rsi,%rax)
> > >>         vmovd %edx, %xmm1
> > >>         vpsubd %ymm15, %ymm2, %ymm0
> > >>         addq $32, %rax
> > >>         vpbroadcastd %xmm1, %ymm1
> > >>         vpaddd %ymm4, %ymm1, %ymm1
> > >>         vpsubd %ymm15, %ymm1, %ymm1
> > >>         vpcmpgtd %ymm1, %ymm0, %ymm0
> > >>         vptest %ymm0, %ymm0
> > >>         jne .L3
> > >> 
> > >> vs arm64's code
> > >> 
> > >> .L3:
> > >>         ld1w z1.s, p0/z, [x1, x3, lsl 2]
> > >>         ld1w z0.s, p0/z, [x0, x3, lsl 2]
> > >>         add z0.s, z0.s, z1.s
> > >>         st1w z0.s, p0, [x1, x3, lsl 2]
> > >>         add x3, x3, x4
> > >>         whilelo p0.s, w3, w2
> > >>         b.any .L3
> > > 
> > > Yes, that's true - it might still be OK for vectorizing epilogues
> > > and thus --param vect-partial-vector-usage=1
> > > 
> > > Can AVX512 do any better than this?
> > > 
> > >         vpbroadcastd %eax, %ymm0
> > >         vpaddd %ymm4, %ymm0, %ymm0
> > >         vpcmpud $6, %ymm0, %ymm3, %k1
> > > 
> > > Note with multiple types involved things get even worse
> > > since you need masks for each vector mode.  But as far as
> > > I can see that's the same for SVE (but there we have the
> > > single-instruction whilelo).  I guess we'll also generate
> > > wrong code at the moment for the case where we need
> > > multiple vectors to hold the full mask - vect_gen_while
> > > doesn't seem to be prepared for this?
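
For reference, the mask-producing part of the four-instruction AVX512
sequence quoted above (vpbroadcastd / vpaddd / vpcmpud; kortestb only
feeds the branch) corresponds roughly to the following intrinsics
sketch of a WHILE_ULT-style predicate (illustrative only; it assumes
AVX512F plus AVX512VL, and the helper name is made up):

#include <immintrin.h>

/* Mask for lanes i, i+1, ..., i+7 of a V8SI loop with n iterations,
   i.e. an "index < niter" test producing a %k register.  */
static inline __mmask8
while_ult_v8si (unsigned i, unsigned n)
{
  const __m256i lanes = _mm256_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7);
  __m256i iv = _mm256_add_epi32 (_mm256_set1_epi32 ((int) i), lanes);
  /* The code above uses vpcmpud with predicate 6 ("not less or
     equal", i.e. unsigned >) and the operands swapped, which is the
     same condition as this unsigned less-than.  */
  return _mm256_cmplt_epu32_mask (iv, _mm256_set1_epi32 ((int) n));
}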
> > > 
> > > So with
> > > 
> > > int foo (unsigned long *a, unsigned * __restrict b, int n)
> > > {
> > >   unsigned sum = 1;
> > >   for (int i = 0; i < n; ++i)
> > >     {
> > >       b[i] += a[i];
> > >     }
> > >   return sum;
> > > }
> > > 
> > > SVE uses
> > > 
> > > .L3:
> > >         ld1d z0.d, p0/z, [x1, x3, lsl 3]
> > >         ld1d z1.d, p0/z, [x0, x3, lsl 3]
> > >         adr z0.d, [z0.d, z1.d, lsl 2]
> > >         st1d z0.d, p0, [x1, x3, lsl 3]
> > >         add x3, x3, x4
> > >         whilelo p0.d, w3, w2
> > >         b.any .L3
> > > 
> > > so p0 vs. p0/z, whatever that means, and it looks like
> > > the vector add can somehow concatenate z0.d and z1.d.
> > > Truly fascinating ;)
> > 
> > It looks like this is from a version in which “b” was also long.
> > For the loop above I get:
> > 
> > .L3:
> >         ld1d z0.d, p0/z, [x0, x3, lsl 3]
> >         ld1w z1.d, p0/z, [x1, x3, lsl 2]
> >         add z0.s, z0.s, z1.s
> >         st1w z0.d, p0, [x1, x3, lsl 2]
> >         add x3, x3, x4
> >         whilelo p0.d, w3, w2
> >         b.any .L3
> > 
> > where the ld1w is an extending load and the st1w is a truncating store.
> > 
> > But yeah, there's a hard-coded assumption that a mask for 2 SIs
> > can also control 1 DI, etc.  For example:
> > 
> > void foo (unsigned long *a, unsigned * __restrict b, int n)
> > {
> >   for (int i = 0; i < n; ++i)
> >     {
> >       a[i * 2] += 1;
> >       a[i * 2 + 1] += 2;
> >       b[i * 4] += 1;
> >       b[i * 4 + 1] += 2;
> >       b[i * 4 + 2] += 3;
> >       b[i * 4 + 3] += 4;
> >     }
> > }
> > 
> > becomes:
> > 
> > .L3:
> >         ld1d z0.d, p0/z, [x0]
> >         add z0.d, z0.d, z2.d
> >         st1d z0.d, p0, [x0]
> >         ld1w z0.s, p0/z, [x1, x3, lsl 2]
> >         add z0.s, z0.s, z1.s
> >         st1w z0.s, p0, [x1, x3, lsl 2]
> >         add x0, x0, x4
> >         add x3, x3, x5
> >         whilelo p0.s, x3, x2
> >         b.any .L3
> > 
> > This is explained more in the big comment above rgroup_controls
> > (guess you've already seen it).
> 
> OK, guess I was more looking at
> 
> #define N 32
> int foo (unsigned long *a, unsigned long * __restrict b,
>          unsigned int *c, unsigned int * __restrict d,
>          int n)
> {
>   unsigned sum = 1;
>   for (int i = 0; i < n; ++i)
>     {
>       b[i] += a[i];
>       d[i] += c[i];
>     }
>   return sum;
> }
> 
> where on x86 with AVX512 we vectorize with V8DI and V16SI and we
> generate two masks for the two copies of V8DI (VF is 16) and one
> mask for V16SI.  With SVE I see
> 
>         punpklo p1.h, p0.b
>         punpkhi p2.h, p0.b
> 
> that's something I expected to see for AVX512 as well, using the V16SI
> mask and unpacking that to two V8DI ones.  But I see
> 
>         vpbroadcastd %eax, %ymm0
>         vpaddd %ymm12, %ymm0, %ymm0
>         vpcmpud $6, %ymm0, %ymm11, %k3
>         vpbroadcastd %eax, %xmm0
>         vpaddd %xmm10, %xmm0, %xmm0
>         vpcmpud $1, %xmm7, %xmm0, %k1
>         vpcmpud $6, %xmm0, %xmm8, %k2
>         kortestb %k1, %k1
>         jne .L3
> 
> so three %k masks are generated by vpcmpud.  I'll have to look at what
> the magic for SVE is and why that doesn't trigger for x86 here.

So, answering myself: vect_maybe_permute_loop_masks looks for
vec_unpacku_hi/lo_optab, but with AVX512 the vector bools have QImode,
so that doesn't play well here.  Not sure if there are proper mask
instructions to use (I guess there's a shift, and the low part is
free) - this is a QI:8 to two QI:4 (bits) mask conversion.  Not sure
how to better ask the target here - again, VnBImode might have been
easier here.

Something makes the loop not be handled (masked) at all with AVX2.

Richard.

> Richard.
> 
> > > It looks like --param vect-partial-vector-usage defaults to 2,
> > > power forces it to 1 (power10) or 0 otherwise.
> > > 
> > > I think we'd need a target hook that toggles this per mode
> > > so we could tune this dependent on AVX512 vectorization vs. AVX2.
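
As a rough illustration of the "shift, and the low part is free" remark
above: AVX512 loop masks are just small integers in %k registers
(__mmask8, __mmask16, ...), so getting the two V8DI masks out of a
V16SI mask needs no punpklo/punpkhi-style unpack at all.  The sketch
below shows that case with plain integer operations on the mask types
(the helper name is made up, and the concrete problem mentioned above
is the analogous QI:8 to two QI:4 split, for which no 4-bit mask type
exists):

#include <immintrin.h>

/* Split the mask of a V16SI rgroup into the two masks needed for the
   corresponding pair of V8DI vectors.  */
static inline void
split_v16si_mask (__mmask16 m, __mmask8 *lo, __mmask8 *hi)
{
  *lo = (__mmask8) m;         /* low 8 mask bits: free, just reuse them */
  *hi = (__mmask8) (m >> 8);  /* high 8 mask bits: one shift (kshiftrw) */
}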
> > 
> > Yeah, keying it off the architecture made sense for Power because the
> > only purpose of len_load/store is to support partial vectorisation.
> > But I agree a hook is needed now that we have MASK_LOAD/STORE for
> > general use and are emulating WHILE_ULT.
> > 
> > Thanks,
> > Richard

-- 
Richard Biener
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)