From: Hongtao Liu
Date: Tue, 20 Jul 2021 12:20:12 +0800
Subject: Re: [PATCH 2/2][RFC] Add loop masking support for x86
To: Richard Biener
Cc: Richard Sandiford, Hongtao Liu, GCC Patches

On Fri, Jul 16, 2021 at 5:11 PM Richard Biener wrote:
>
> On Thu, 15 Jul 2021, Richard Biener wrote:
>
> > On Thu, 15 Jul 2021, Richard Biener wrote:
> >
> > > OK, guess I was more looking at
> > >
> > > #define N 32
> > > int foo (unsigned long *a, unsigned long * __restrict b,
> > >          unsigned int *c, unsigned int * __restrict d,
> > >          int n)
> > > {
> > >   unsigned sum = 1;
> > >   for (int i = 0; i < n; ++i)
> > >     {
> > >       b[i] += a[i];
> > >       d[i] += c[i];
> > >     }
> > >   return sum;
> > > }
> > >
> > > where on x86 with AVX512 we vectorize with V8DI and V16SI and
> > > generate two masks for the two copies of V8DI (VF is 16) and one
> > > mask for V16SI.  With SVE I see
> > >
> > >         punpklo p1.h, p0.b
> > >         punpkhi p2.h, p0.b
> > >
> > > which is something I expected to see for AVX512 as well, using the
> > > V16SI mask and unpacking that to two V8DI ones.  But I see
> > >
> > >         vpbroadcastd    %eax, %ymm0
> > >         vpaddd  %ymm12, %ymm0, %ymm0
> > >         vpcmpud $6, %ymm0, %ymm11, %k3
> > >         vpbroadcastd    %eax, %xmm0
> > >         vpaddd  %xmm10, %xmm0, %xmm0
> > >         vpcmpud $1, %xmm7, %xmm0, %k1
> > >         vpcmpud $6, %xmm0, %xmm8, %k2
> > >         kortestb        %k1, %k1
> > >         jne     .L3
> > >
> > > so three %k masks generated by vpcmpud.  I'll have to look at what
> > > the magic for SVE is and why that doesn't trigger for x86 here.
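What punpklo/punpkhi give SVE corresponds, for AVX512 kmasks, to
re-using the low half of the V16SI mask and shifting it right for the
high half.  A rough, untested sketch of that split using the AVX512F
mask intrinsics from immintrin.h (split_mask16 is just a hypothetical
helper for illustration):

  #include <immintrin.h>

  /* Hypothetical helper: derive the two V8DI masks from the single
     V16SI mask instead of materializing three compares.  */
  static inline void
  split_mask16 (__mmask16 m16, __mmask8 *lo, __mmask8 *hi)
  {
    /* The low part is free - just reinterpret the low 8 bits.  */
    *lo = (__mmask8) m16;
    /* The high part needs a mask shift right by 8 (KSHIFTRW).  */
    *hi = (__mmask8) _kshiftri_mask16 (m16, 8);
  }
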
> > So answer myself: vect_maybe_permute_loop_masks looks for
> > vec_unpacku_hi/lo_optab, but with AVX512 the vector bools have
> > QImode, so that doesn't play well here.  Not sure if there
> > are proper mask instructions to use (I guess there's a shift
> > and the lopart is free).  This is a QI:8 to two QI:4 (bits)
> > mask conversion.

Yes, for 16-bit and wider masks we have KUNPCKBW/D/Q, but for the
8-bit unpack_lo/hi there is only a shift.

> > Not sure how to better ask the target here - again,
> > VnBImode might have been easier here.
>
> So I've managed to "emulate" the unpack_lo/hi for the case of
> !VECTOR_MODE_P masks by using sub-vector select (we're asking
> to turn a vector(8) <signed-boolean:1> into two
> vector(4) <signed-boolean:1>) via BIT_FIELD_REF.  That then
> produces the desired single mask producer and
>
>   loop_mask_38 = VIEW_CONVERT_EXPR<vector(4) <signed-boolean:1>>(loop_mask_54);
>   loop_mask_37 = BIT_FIELD_REF <loop_mask_54, 4, 4>;
>
> note that for the lowpart we can just view-convert away the excess
> bits, fully re-using the mask.  We generate surprisingly "good" code:
>
>         kmovb   %k1, %edi
>         shrb    $4, %dil
>         kmovb   %edi, %k2
>
> besides the lack of using kshiftrb.  I guess we're just lacking
> a mask register alternative for

Yes, we can do it similarly to what was done for kor/kand/kxor.

> (insn 22 20 25 4 (parallel [
>             (set (reg:QI 94 [ loop_mask_37 ])
>                 (lshiftrt:QI (reg:QI 98 [ loop_mask_54 ])
>                     (const_int 4 [0x4])))
>             (clobber (reg:CC 17 flags))
>         ]) 724 {*lshrqi3_1}
>      (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (nil)))
>
> and so we reload.  For the above cited loop, the AVX512 vectorization
> with --param vect-partial-vector-usage=1 does look quite sensible
> to me.  Instead of an SSE vectorized epilogue plus a scalar
> epilogue we get a single fully masked AVX512 "iteration" for both.
> I suppose it's still mostly a code-size optimization (384 bytes
> with the masked epilogue vs. 474 bytes with trunk) since it will
> likely be slower for very low iteration counts, but it's good for
> icache usage and for putting less pressure on the branch predictor.
>
> That said, I have to set up SPEC on an AVX512 machine to do
> any meaningful measurements (I suspect with just AVX2 we're not
> going to see any benefit from masking).

Has the patch landed on trunk already?  I can run a test on CLX.

> Hints/help on how to fix the missing kshiftrb appreciated.
>
> Oh, and if there's only V4DImode and V16HImode data then we don't
> go down the vect_maybe_permute_loop_masks path - that is, we don't
> generate the (unused) intermediate mask but end up generating two
> while_ult parts.
>
> Thanks,
> Richard.

--
BR,
Hongtao