Date: Tue, 20 Jul 2021 13:09:54 +0200 (CEST)
From: Richard Biener
To: Hongtao Liu
cc: Richard Sandiford, Hongtao Liu, GCC Patches
Subject: Re: [PATCH 2/2][RFC] Add loop masking support for x86

On Tue, 20 Jul 2021, Hongtao Liu wrote:

> On Tue, Jul 20, 2021 at 3:38 PM Richard Biener wrote:
> >
> > On Tue, 20 Jul 2021, Hongtao Liu wrote:
> >
> > > On Fri, Jul 16, 2021 at 5:11 PM Richard Biener wrote:
> > > >
> > > > On Thu, 15 Jul 2021, Richard Biener wrote:
> > > >
> > > > > On Thu, 15 Jul 2021, Richard Biener wrote:
> > > > >
> > > > > > OK, guess I was more looking at
> > > > > >
> > > > > > #define N 32
> > > > > > int foo (unsigned long *a, unsigned long * __restrict b,
> > > > > >          unsigned int *c, unsigned int * __restrict d,
> > > > > >          int n)
> > > > > > {
> > > > > >   unsigned sum = 1;
> > > > > >   for (int i = 0; i < n; ++i)
> > > > > >     {
> > > > > >       b[i] += a[i];
> > > > > >       d[i] += c[i];
> > > > > >     }
> > > > > >   return sum;
> > > > > > }
> > > > > >
> > > > > > where on x86 AVX512 we vectorize with V8DI and V16SI and
> > > > > > generate two masks for the two copies of V8DI (the VF is 16)
> > > > > > and one mask for V16SI.  With SVE I see
> > > > > >
> > > > > >         punpklo p1.h, p0.b
> > > > > >         punpkhi p2.h, p0.b
> > > > > >
> > > > > > which is something I expected to see for AVX512 as well,
> > > > > > using the V16SI mask and unpacking that to two V8DI ones.
> > > > > > But I see
> > > > > >
> > > > > >         vpbroadcastd    %eax, %ymm0
> > > > > >         vpaddd          %ymm12, %ymm0, %ymm0
> > > > > >         vpcmpud         $6, %ymm0, %ymm11, %k3
> > > > > >         vpbroadcastd    %eax, %xmm0
> > > > > >         vpaddd          %xmm10, %xmm0, %xmm0
> > > > > >         vpcmpud         $1, %xmm7, %xmm0, %k1
> > > > > >         vpcmpud         $6, %xmm0, %xmm8, %k2
> > > > > >         kortestb        %k1, %k1
> > > > > >         jne             .L3
> > > > > >
> > > > > > so three %k masks generated by vpcmpud.  I'll have to look
> > > > > > at what the magic for SVE is and why that doesn't trigger
> > > > > > for x86 here.
> > > > >
> > > > > Answering myself: vect_maybe_permute_loop_masks looks for
> > > > > vec_unpacku_hi/lo_optab, but with AVX512 the vector bools have
> > > > > QImode, so that doesn't play well here.  Not sure if there are
> > > > > proper mask instructions to use (I guess there's a shift, and
> > > > > the low part is free).  This is a QI:8 to two QI:4 (bits)
> > > Yes, for 16 bits and more we have KUNPCKBW/D/Q, but for 8-bit
> > > unpack_lo/hi there is only shift.
> > > > > mask conversion.  Not sure how to better ask the target here -
> > > > > again, VnBImode might have been easier.
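[Illustration, not from the thread: the V16SI-to-two-V8DI unpack being
discussed, written as AVX512 intrinsics.  A __mmask16 is a plain
16-bit integer, so the low half of a mask is a free re-interpretation
and the high half is a single shift (kshiftrw); the function and names
below are invented for this sketch.

  #include <immintrin.h>

  /* Split the single V16SI loop mask into the two V8DI masks.  */
  static inline void
  split_mask16 (__mmask16 m16, __mmask8 *lo, __mmask8 *hi)
  {
    *lo = (__mmask8) m16;         /* low part: just re-use the bits  */
    *hi = (__mmask8) (m16 >> 8);  /* high part: one mask shift       */
  }
]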
> > > >
> > > > So I've managed to "emulate" the unpack_lo/hi for the case of
> > > > !VECTOR_MODE_P masks by using sub-vector select (we're asking to
> > > > turn vector(8) <signed-boolean:1> into two
> > > > vector(4) <signed-boolean:1>) via BIT_FIELD_REF.  That then
> > > > produces the desired single mask producer and
> > > >
> > > >   loop_mask_38 = VIEW_CONVERT_EXPR<vector(4) <signed-boolean:1>>(loop_mask_54);
> > > >   loop_mask_37 = BIT_FIELD_REF <loop_mask_54, 4, 4>;
> > > >
> > > > Note that for the low part we can just view-convert away the
> > > > excess bits, fully re-using the mask.  We generate surprisingly
> > > > "good" code:
> > > >
> > > >         kmovb   %k1, %edi
> > > >         shrb    $4, %dil
> > > >         kmovb   %edi, %k2
> > > >
> > > > besides the lack of using kshiftrb.  I guess we're just lacking
> > > > a mask register alternative for
> > > Yes, we can do it similarly to kor/kand/kxor.
> > > >
> > > > (insn 22 20 25 4 (parallel [
> > > >             (set (reg:QI 94 [ loop_mask_37 ])
> > > >                 (lshiftrt:QI (reg:QI 98 [ loop_mask_54 ])
> > > >                     (const_int 4 [0x4])))
> > > >             (clobber (reg:CC 17 flags))
> > > >         ]) 724 {*lshrqi3_1}
> > > >      (expr_list:REG_UNUSED (reg:CC 17 flags)
> > > >         (nil)))
> > > >
> > > > and so we reload.  For the above cited loop, the AVX512
> > > > vectorization with --param vect-partial-vector-usage=1 does look
> > > > quite sensible to me.  Instead of an SSE vectorized epilogue
> > > > plus a scalar epilogue we get a single fully masked AVX512
> > > > "iteration" for both.  I suppose it's still mostly a code-size
> > > > optimization (384 bytes with the masked epilogue vs. 474 bytes
> > > > with trunk) since it will likely be slower for very low
> > > > iteration counts, but then it's good for icache usage and puts
> > > > less pressure on the branch predictor.
> > > >
> > > > That said, I have to set up SPEC on an AVX512 machine to do
> > > Did the patch land in trunk already?  I can have a test on CLX.
> >
> > I'm still experimenting a bit right now but hope to get something
> > trunk-ready at the end of this week or the beginning of next.
> > Since it's disabled by default we can work on improving it during
> > stage1 then.
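[Illustration, not from the thread: roughly what the single fully
masked AVX512 "iteration" computes for the unsigned int half of the
example loop's tail, written by hand as intrinsics.  The function,
its names, and the mask computation are invented for this sketch,
which assumes 0 < n - i <= 16.

  #include <immintrin.h>

  /* One masked step for the d[i] += c[i] tail; it stands in for both
     the SSE vectorized epilogue and the scalar epilogue.  */
  static void
  masked_tail (unsigned int *d, const unsigned int *c, int i, int n)
  {
    __mmask16 k = (__mmask16) ((1u << (n - i)) - 1); /* active lanes */
    __m512i vc = _mm512_maskz_loadu_epi32 (k, c + i);
    __m512i vd = _mm512_maskz_loadu_epi32 (k, d + i);
    _mm512_mask_storeu_epi32 (d + i, k, _mm512_add_epi32 (vd, vc));
  }
]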
> >
> > I'm mostly struggling with the GIMPLE IL to be used for the mask
> > unpacking, since we currently reject both the BIT_FIELD_REF and
> > the VIEW_CONVERT we generate (why do AVX512 masks not all have
> > SImode but sometimes QImode and sometimes HImode ...).  Unfortunately
> We have instructions like ktestb which only care about the low 8
> bits; if we use SImode for all masks, the code can become complex.
> > we've dropped whole-vector shifts in favor of VEC_PERM, but that
> > doesn't work well either for integer mode vectors.  So I'm still
> > playing with my options here and looking for something that doesn't
> > require too much surgery on the RTL side to recover good mask
> > register code ...
> >
> > Another part missing is expanders for the various cond_* patterns:
> >
> > OPTAB_D (cond_add_optab, "cond_add$a")
> > OPTAB_D (cond_sub_optab, "cond_sub$a")
> > OPTAB_D (cond_smul_optab, "cond_mul$a")
> > OPTAB_D (cond_sdiv_optab, "cond_div$a")
> > OPTAB_D (cond_smod_optab, "cond_mod$a")
> > OPTAB_D (cond_udiv_optab, "cond_udiv$a")
> > OPTAB_D (cond_umod_optab, "cond_umod$a")
> > OPTAB_D (cond_and_optab, "cond_and$a")
> > OPTAB_D (cond_ior_optab, "cond_ior$a")
> > OPTAB_D (cond_xor_optab, "cond_xor$a")
> > OPTAB_D (cond_ashl_optab, "cond_ashl$a")
> > OPTAB_D (cond_ashr_optab, "cond_ashr$a")
> > OPTAB_D (cond_lshr_optab, "cond_lshr$a")
> > OPTAB_D (cond_smin_optab, "cond_smin$a")
> > OPTAB_D (cond_smax_optab, "cond_smax$a")
> > OPTAB_D (cond_umin_optab, "cond_umin$a")
> > OPTAB_D (cond_umax_optab, "cond_umax$a")
> > OPTAB_D (cond_fma_optab, "cond_fma$a")
> > OPTAB_D (cond_fms_optab, "cond_fms$a")
> > OPTAB_D (cond_fnma_optab, "cond_fnma$a")
> > OPTAB_D (cond_fnms_optab, "cond_fnms$a")
> I guess there's no need for scalar modes; although AVX512 masks
> support scalar instructions, it's a bit awkward to generate a mask
> from scalar operands (we need to compare, set the flag into a GPR,
> and move the GPR to a mask register).

Yes, I think if-conversion has all the "scalar" if-converted code
dominated by an if (.IFN_VECTORIZED ()) conditional, and thus it will
only be used vectorized.  See ifcvt_can_predicate, where it checks

  internal_fn cond_fn = get_conditional_internal_fn (code);
  return (cond_fn != IFN_LAST
          && vectorized_internal_fn_supported_p (cond_fn, lhs_type));

As said, this is really orthogonal to the fully masked loop/epilogue
and would benefit vectorization when FP ops can trap.

Richard.
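[Illustration, not from the thread: what a cond_add expander amounts
to semantically.  A cond_* operation performs the arithmetic only on
the active lanes and passes an "else" value through on the inactive
ones, which is what lets if-conversion predicate FP operations that
may trap.  On AVX512 this maps onto a masked add; the function name
below is invented for this sketch.

  #include <immintrin.h>

  /* res[l] = k[l] ? a[l] + b[l] : else_vals[l]; the FP add is only
     performed (and can only trap) on lanes enabled in k.  */
  static inline __m512d
  cond_add_v8df (__mmask8 k, __m512d else_vals, __m512d a, __m512d b)
  {
    return _mm512_mask_add_pd (else_vals, k, a, b);
  }
]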