Date: Tue, 20 Jul 2021 09:38:57 +0200 (CEST)
From: Richard Biener
To: Hongtao Liu
Cc: Richard Sandiford, Hongtao Liu, GCC Patches
Subject: Re: [PATCH 2/2][RFC] Add loop masking support for x86
On Tue, 20 Jul 2021, Hongtao Liu wrote:

> On Fri, Jul 16, 2021 at 5:11 PM Richard Biener wrote:
> >
> > On Thu, 15 Jul 2021, Richard Biener wrote:
> >
> > > On Thu, 15 Jul 2021, Richard Biener wrote:
> > >
> > > > OK, guess I was more looking at
> > > >
> > > > #define N 32
> > > > int foo (unsigned long *a, unsigned long * __restrict b,
> > > >          unsigned int *c, unsigned int * __restrict d,
> > > >          int n)
> > > > {
> > > >   unsigned sum = 1;
> > > >   for (int i = 0; i < n; ++i)
> > > >     {
> > > >       b[i] += a[i];
> > > >       d[i] += c[i];
> > > >     }
> > > >   return sum;
> > > > }
> > > >
> > > > where we on x86 AVX512 vectorize with V8DI and V16SI and we
> > > > generate two masks for the two copies of V8DI (VF is 16) and one
> > > > mask for V16SI.  With SVE I see
> > > >
> > > >         punpklo p1.h, p0.b
> > > >         punpkhi p2.h, p0.b
> > > >
> > > > that's sth I expected to see for AVX512 as well, using the V16SI
> > > > mask and unpacking that to two V8DI ones.
> > > > But I see
> > > >
> > > >         vpbroadcastd    %eax, %ymm0
> > > >         vpaddd  %ymm12, %ymm0, %ymm0
> > > >         vpcmpud $6, %ymm0, %ymm11, %k3
> > > >         vpbroadcastd    %eax, %xmm0
> > > >         vpaddd  %xmm10, %xmm0, %xmm0
> > > >         vpcmpud $1, %xmm7, %xmm0, %k1
> > > >         vpcmpud $6, %xmm0, %xmm8, %k2
> > > >         kortestb        %k1, %k1
> > > >         jne     .L3
> > > >
> > > > so three %k masks generated by vpcmpud.  I'll have to look what's
> > > > the magic for SVE and why that doesn't trigger for x86 here.
> > >
> > > So answering myself, vect_maybe_permute_loop_masks looks for
> > > vec_unpacku_hi/lo_optab, but with AVX512 the vector bools have
> > > QImode so that doesn't play well here.  Not sure if there
> > > are proper mask instructions to use (I guess there's a shift
> > > and the low part is free).  This is a QI:8 to two QI:4 (bits) mask
> Yes, for 16bit and more, we have KUNPCKBW/D/Q.  But for 8bit
> unpack_lo/hi, only a shift.
> > > conversion.  Not sure how to better ask the target here - again
> > > VnBImode might have been easier here.
> >
> > So I've managed to "emulate" the unpack_lo/hi for the case of
> > !VECTOR_MODE_P masks by using sub-vector select (we're asking
> > to turn vector(8) <signed-boolean:1> into two
> > vector(4) <signed-boolean:1>) via BIT_FIELD_REF.  That then
> > produces the desired single mask producer and
> >
> >   loop_mask_38 = VIEW_CONVERT_EXPR<vector(4) <signed-boolean:1>>(loop_mask_54);
> >   loop_mask_37 = BIT_FIELD_REF <loop_mask_54, 4, 4>;
> >
> > note for the lowpart we can just view-convert away the excess bits,
> > fully re-using the mask.  We generate surprisingly "good" code:
> >
> >         kmovb   %k1, %edi
> >         shrb    $4, %dil
> >         kmovb   %edi, %k2
> >
> > besides the lack of using kshiftrb.  I guess we're just lacking
> > a mask register alternative for
> Yes, we can do it similar to kor/kand/kxor.
> >
> > (insn 22 20 25 4 (parallel [
> >             (set (reg:QI 94 [ loop_mask_37 ])
> >                 (lshiftrt:QI (reg:QI 98 [ loop_mask_54 ])
> >                     (const_int 4 [0x4])))
> >             (clobber (reg:CC 17 flags))
> >         ]) 724 {*lshrqi3_1}
> >      (expr_list:REG_UNUSED (reg:CC 17 flags)
> >         (nil)))
> >
> > and so we reload.
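[Editor's note: the QI:8 to two QI:4 split above can be modeled in plain
C -- a sketch of the semantics only, not vectorizer code; mask_lo and
mask_hi are made-up names:]

```c
#include <stdint.h>

/* Scalar model of splitting one 8-bit AVX512 predicate (QImode %k
   register holding a V8DI-pair mask) into two 4-bit halves: the low
   half is a pure view-convert of the low bits (no instruction
   needed), the high half is one logical right shift -- ideally a
   single kshiftrb instead of the kmovb/shrb/kmovb GPR round-trip
   quoted above.  */
static uint8_t mask_lo (uint8_t k) { return (uint8_t) (k & 0x0f); }
static uint8_t mask_hi (uint8_t k) { return (uint8_t) (k >> 4); }
```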
> > For the above cited loop the AVX512 vectorization
> > with --param vect-partial-vector-usage=1 does look quite sensible
> > to me.  Instead of a SSE vectorized epilogue plus a scalar
> > epilogue we get a single fully masked AVX512 "iteration" for both.
> > I suppose it's still mostly a code-size optimization (384 bytes
> > with the masked epilogue vs. 474 bytes with trunk) since it will
> > likely be slower for very low iteration counts, but it's good
> > for icache usage then and good for less branch predictor usage.
> >
> > That said, I have to set up SPEC on a AVX512 machine to do
> Did the patch land in trunk already?  I can have a test on CLX.

I'm still experimenting a bit right now but hope to get something
trunk ready at the end of this week or the beginning of next week.
Since it's disabled by default we can work on improving it during
stage1 then.

I'm mostly struggling with the GIMPLE IL to be used for the mask
unpacking since we currently reject both the BIT_FIELD_REF and the
VIEW_CONVERT we generate (why do AVX512 masks not all have SImode but
sometimes QImode and sometimes HImode ...).  Unfortunately we've
dropped whole-vector shifts in favor of VEC_PERM but that doesn't
work well either for integer mode vectors.  So I'm still playing with
my options here and looking for something that doesn't require too
much surgery on the RTL side to recover good mask register code ...
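[Editor's note: the effect of the single fully masked "iteration" can
be sketched in scalar C -- a model of the semantics, not the emitted
code; masked_tail is an illustrative name and 16 lanes matches the
V16SI VF of the cited loop:]

```c
/* One fully masked 16-lane pass replacing both the SSE vectorized
   epilogue and the scalar remainder loop: lane l is active iff
   i + l < n, i.e. the while_ult-style predicate the vpbroadcastd/
   vpaddd/vpcmpud sequence materializes into a %k register.  */
static void masked_tail (unsigned long *b, const unsigned long *a,
                         unsigned int *d, const unsigned int *c,
                         int i, int n)
{
  for (int l = 0; l < 16; ++l)
    if (i + l < n)              /* the loop mask bit for this lane */
      {
        b[i + l] += a[i + l];   /* two V8DI copies, split masks */
        d[i + l] += c[i + l];   /* one V16SI copy, full mask */
      }
}
```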
Another part missing is expanders for the various cond_* patterns

OPTAB_D (cond_add_optab, "cond_add$a")
OPTAB_D (cond_sub_optab, "cond_sub$a")
OPTAB_D (cond_smul_optab, "cond_mul$a")
OPTAB_D (cond_sdiv_optab, "cond_div$a")
OPTAB_D (cond_smod_optab, "cond_mod$a")
OPTAB_D (cond_udiv_optab, "cond_udiv$a")
OPTAB_D (cond_umod_optab, "cond_umod$a")
OPTAB_D (cond_and_optab, "cond_and$a")
OPTAB_D (cond_ior_optab, "cond_ior$a")
OPTAB_D (cond_xor_optab, "cond_xor$a")
OPTAB_D (cond_ashl_optab, "cond_ashl$a")
OPTAB_D (cond_ashr_optab, "cond_ashr$a")
OPTAB_D (cond_lshr_optab, "cond_lshr$a")
OPTAB_D (cond_smin_optab, "cond_smin$a")
OPTAB_D (cond_smax_optab, "cond_smax$a")
OPTAB_D (cond_umin_optab, "cond_umin$a")
OPTAB_D (cond_umax_optab, "cond_umax$a")
OPTAB_D (cond_fma_optab, "cond_fma$a")
OPTAB_D (cond_fms_optab, "cond_fms$a")
OPTAB_D (cond_fnma_optab, "cond_fnma$a")
OPTAB_D (cond_fnms_optab, "cond_fnms$a")

I think the most useful are those for possibly trapping ops (will be
used by if-conversion) and those for reduction operations
(add, min, max) which would enable a masked reduction epilogue.

The good thing is that I've been able to get my hands on a
Cascade Lake system so I can at least test things for correctness.

Richard.

> > any meaningful measurements (I suspect with just AVX2 we're not
> > going to see any benefit from masking).  Hints/help how to fix
> > the missing kshiftrb appreciated.
> >
> > Oh, and if there's only V4DImode and V16HImode data then
> > we don't go the vect_maybe_permute_loop_masks path - that is,
> > we don't generate the (not used) intermediate mask but end up
> > generating two while_ult parts.
> >
> > Thanks,
> > Richard.
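[Editor's note: as a point of reference for the cond_* expanders, the
per-lane semantics of e.g. cond_add can be written in scalar C;
cond_add_lane is an illustrative name, not GCC code:]

```c
/* Per-lane semantics of the cond_add optab: where the mask bit is
   set the lane computes a + b, otherwise the fallback value passes
   through unchanged and -- crucially for possibly trapping ops --
   the operation is not executed at all for inactive lanes.  On
   AVX512 this corresponds to a masked vpaddd merging from the
   "else" operand.  */
static unsigned int cond_add_lane (int mask, unsigned int a,
                                   unsigned int b,
                                   unsigned int fallback)
{
  return mask ? a + b : fallback;
}
```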