From: Hongtao Liu
Date: Tue, 20 Jul 2021 12:20:12 +0800
Subject: Re: [PATCH 2/2][RFC] Add loop masking support for x86
To: Richard Biener
Cc: Richard Sandiford, Hongtao Liu, GCC Patches

On Fri, Jul 16, 2021 at 5:11 PM Richard Biener wrote:
>
> On Thu, 15 Jul 2021, Richard Biener wrote:
>
> > On Thu, 15 Jul 2021, Richard Biener wrote:
> >
> > > OK, guess I was more looking at
> > >
> > > #define N 32
> > > int foo (unsigned long *a, unsigned long * __restrict b,
> > >          unsigned int *c, unsigned int * __restrict d,
> > >          int n)
> > > {
> > >   unsigned sum = 1;
> > >   for (int i = 0; i < n; ++i)
> > >     {
> > >       b[i] += a[i];
> > >       d[i] += c[i];
> > >     }
> > >   return sum;
> > > }
> > >
> > > where on x86 with AVX512 we vectorize with V8DI and V16SI and
> > > generate two masks for the two copies of V8DI (VF is 16) and one
> > > mask for V16SI.  With SVE I see
> > >
> > >         punpklo p1.h, p0.b
> > >         punpkhi p2.h, p0.b
> > >
> > > which is something I expected to see for AVX512 as well, using the
> > > V16SI mask and unpacking that to two V8DI ones.  But I see
> > >
> > >         vpbroadcastd    %eax, %ymm0
> > >         vpaddd  %ymm12, %ymm0, %ymm0
> > >         vpcmpud $6, %ymm0, %ymm11, %k3
> > >         vpbroadcastd    %eax, %xmm0
> > >         vpaddd  %xmm10, %xmm0, %xmm0
> > >         vpcmpud $1, %xmm7, %xmm0, %k1
> > >         vpcmpud $6, %xmm0, %xmm8, %k2
> > >         kortestb        %k1, %k1
> > >         jne     .L3
> > >
> > > so three %k masks generated by vpcmpud.  I'll have to look at what
> > > the magic for SVE is and why that doesn't trigger for x86 here.
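What punpklo/punpkhi give SVE corresponds, for AVX512 kmasks, to
re-using the low half of the V16SI mask and shifting it right for the
high half.  A rough, untested sketch of that split using the AVX512F
mask intrinsics from immintrin.h (split_mask16 is just a hypothetical
helper for illustration):

  #include <immintrin.h>

  /* Hypothetical helper: derive the two V8DI masks from the single
     V16SI mask instead of materializing three compares.  */
  static inline void
  split_mask16 (__mmask16 m16, __mmask8 *lo, __mmask8 *hi)
  {
    /* The low part is free - just reinterpret the low 8 bits.  */
    *lo = (__mmask8) m16;
    /* The high part needs a mask shift right by 8 (KSHIFTRW).  */
    *hi = (__mmask8) _kshiftri_mask16 (m16, 8);
  }
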
> > So answer myself: vect_maybe_permute_loop_masks looks for
> > vec_unpacku_hi/lo_optab, but with AVX512 the vector bools have
> > QImode, so that doesn't play well here.  Not sure if there
> > are proper mask instructions to use (I guess there's a shift
> > and the lopart is free).  This is a QI:8 to two QI:4 (bits)
> > mask conversion.

Yes, for 16-bit and wider masks we have KUNPCKBW/D/Q, but for the
8-bit unpack_lo/hi there is only a shift.

> > Not sure how to better ask the target here - again,
> > VnBImode might have been easier here.
>
> So I've managed to "emulate" the unpack_lo/hi for the case of
> !VECTOR_MODE_P masks by using sub-vector select (we're asking
> to turn a vector(8) <signed-boolean:1> into two
> vector(4) <signed-boolean:1>) via BIT_FIELD_REF.  That then
> produces the desired single mask producer and
>
>   loop_mask_38 = VIEW_CONVERT_EXPR<vector(4) <signed-boolean:1>>(loop_mask_54);
>   loop_mask_37 = BIT_FIELD_REF <loop_mask_54, 4, 4>;
>
> note that for the lowpart we can just view-convert away the excess
> bits, fully re-using the mask.  We generate surprisingly "good" code:
>
>         kmovb   %k1, %edi
>         shrb    $4, %dil
>         kmovb   %edi, %k2
>
> besides the lack of using kshiftrb.  I guess we're just lacking
> a mask register alternative for

Yes, we can do it similarly to what was done for kor/kand/kxor.

> (insn 22 20 25 4 (parallel [
>             (set (reg:QI 94 [ loop_mask_37 ])
>                 (lshiftrt:QI (reg:QI 98 [ loop_mask_54 ])
>                     (const_int 4 [0x4])))
>             (clobber (reg:CC 17 flags))
>         ]) 724 {*lshrqi3_1}
>      (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (nil)))
>
> and so we reload.  For the above cited loop, the AVX512 vectorization
> with --param vect-partial-vector-usage=1 does look quite sensible
> to me.  Instead of an SSE vectorized epilogue plus a scalar
> epilogue we get a single fully masked AVX512 "iteration" for both.
> I suppose it's still mostly a code-size optimization (384 bytes
> with the masked epilogue vs. 474 bytes with trunk) since it will
> likely be slower for very low iteration counts, but it's good for
> icache usage and for putting less pressure on the branch predictor.
>
> That said, I have to set up SPEC on an AVX512 machine to do
> any meaningful measurements (I suspect with just AVX2 we're not
> going to see any benefit from masking).

Has the patch landed on trunk already?  I can run a test on CLX.

> Hints/help on how to fix the missing kshiftrb appreciated.
>
> Oh, and if there's only V4DImode and V16HImode data then we don't
> go down the vect_maybe_permute_loop_masks path - that is, we don't
> generate the (unused) intermediate mask but end up generating two
> while_ult parts.
>
> Thanks,
> Richard.

--
BR,
Hongtao