From: Richard Biener
To: Richard Sandiford
Cc: Hongtao Liu, Hongtao Liu, GCC Patches
Date: Fri, 16 Jul 2021 11:11:53 +0200 (CEST)
Subject: Re: [PATCH 2/2][RFC] Add loop masking support for x86

On Thu, 15 Jul 2021, Richard Biener wrote:

> On Thu, 15 Jul 2021, Richard Biener wrote:
>
> > OK, I guess I was looking more at
> >
> > #define N 32
> > int foo (unsigned long *a, unsigned long * __restrict b,
> >          unsigned int *c, unsigned int * __restrict d,
> >          int n)
> > {
> >   unsigned sum = 1;
> >   for (int i = 0; i < n; ++i)
> >     {
> >       b[i] += a[i];
> >       d[i] += c[i];
> >     }
> >   return sum;
> > }
> >
> > where on x86 with AVX512 we vectorize with V8DI and V16SI and
> > generate two masks for the two copies of V8DI (the VF is 16) and
> > one mask for V16SI.  With SVE I see
> >
> >         punpklo p1.h, p0.b
> >         punpkhi p2.h, p0.b
> >
> > which is something I expected to see for AVX512 as well, using the
> > V16SI mask and unpacking it into two V8DI ones.
> > But I see
> >
> >         vpbroadcastd    %eax, %ymm0
> >         vpaddd  %ymm12, %ymm0, %ymm0
> >         vpcmpud $6, %ymm0, %ymm11, %k3
> >         vpbroadcastd    %eax, %xmm0
> >         vpaddd  %xmm10, %xmm0, %xmm0
> >         vpcmpud $1, %xmm7, %xmm0, %k1
> >         vpcmpud $6, %xmm0, %xmm8, %k2
> >         kortestb        %k1, %k1
> >         jne     .L3
> >
> > so three %k masks generated by vpcmpud.  I'll have to look at what
> > the magic is for SVE and why it doesn't trigger for x86 here.
>
> So, answering myself: vect_maybe_permute_loop_masks looks for
> vec_unpacku_hi/lo_optab, but with AVX512 the vector bools have
> QImode, so that doesn't play well here.  I'm not sure whether there
> are proper mask instructions to use (I guess there's a shift, and
> the low part is free).  This is a QI:8 to two QI:4 (bits) mask
> conversion.  Not sure how to better ask the target here - again,
> VnBImode might have been easier.

So I've managed to "emulate" the unpack_lo/hi for the case of
!VECTOR_MODE_P masks by using a sub-vector select (we're asking to
turn a vector(8) <signed-boolean:1> into two
vector(4) <signed-boolean:1>) via BIT_FIELD_REF.  That then produces
the desired single mask producer and

  loop_mask_38 = VIEW_CONVERT_EXPR<vector(4) <signed-boolean:1>>(loop_mask_54);
  loop_mask_37 = BIT_FIELD_REF <loop_mask_54, 4, 4>;

Note that for the low part we can just view-convert away the excess
bits, fully re-using the mask.  We generate surprisingly "good" code:

        kmovb   %k1, %edi
        shrb    $4, %dil
        kmovb   %edi, %k2

apart from not using kshiftrb.  I guess we're just lacking a mask
register alternative for

(insn 22 20 25 4 (parallel [
            (set (reg:QI 94 [ loop_mask_37 ])
                (lshiftrt:QI (reg:QI 98 [ loop_mask_54 ])
                    (const_int 4 [0x4])))
            (clobber (reg:CC 17 flags))
        ]) 724 {*lshrqi3_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))

and so we reload.  For the loop cited above, the AVX512 vectorization
with --param vect-partial-vector-usage=1 does look quite sensible to
me.  Instead of an SSE-vectorized epilogue plus a scalar epilogue we
get a single fully masked AVX512 "iteration" covering both.  I suppose
it's still mostly a code-size optimization (384 bytes with the masked
epilogue vs. 474 bytes with trunk), since it will likely be slower for
very low iteration counts, but it's good for icache usage and puts
less pressure on the branch predictor.

That said, I still have to set up SPEC on an AVX512 machine to do any
meaningful measurements (I suspect that with just AVX2 we're not going
to see any benefit from masking).  Hints/help on how to fix the
missing kshiftrb are appreciated.

Oh, and if there's only V4DImode and V16HImode data then we don't go
down the vect_maybe_permute_loop_masks path - that is, we don't
generate the (unused) intermediate mask but end up generating two
while_ult parts.

Thanks,
Richard.
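
P.S. For illustration only, not part of the patch: a minimal
hand-written intrinsics sketch of the QI:8 to two QI:4 mask split
discussed above, assuming AVX512VL and AVX512DQ are available; the
function and variable names are made up.  The intent is that both
derived masks stay in %k registers, with kshiftrb producing the high
half instead of the kmovb/shrb/kmovb round trip shown above.

#include <immintrin.h>

/* Process 8 unsigned ints as one masked V8SI operation and 8 unsigned
   longs as two masked V4DI operations, deriving the V4DI masks from
   the single 8-bit mask M.  The high half uses _kshiftri_mask8, which
   maps to kshiftrb.  */
void
masked_tail (unsigned long *b, const unsigned long *a,
             unsigned int *d, const unsigned int *c, __mmask8 m)
{
  __mmask8 m_lo = m;                       /* only the low 4 bits are consulted */
  __mmask8 m_hi = _kshiftri_mask8 (m, 4);  /* kshiftrb $4 for the high half */

  /* b[i] += a[i] for the two V4DI halves.  */
  _mm256_mask_storeu_epi64 (b, m_lo,
      _mm256_add_epi64 (_mm256_maskz_loadu_epi64 (m_lo, b),
                        _mm256_maskz_loadu_epi64 (m_lo, a)));
  _mm256_mask_storeu_epi64 (b + 4, m_hi,
      _mm256_add_epi64 (_mm256_maskz_loadu_epi64 (m_hi, b + 4),
                        _mm256_maskz_loadu_epi64 (m_hi, a + 4)));

  /* d[i] += c[i] for the single V8SI copy, using M unchanged.  */
  _mm256_mask_storeu_epi32 (d, m,
      _mm256_add_epi32 (_mm256_maskz_loadu_epi32 (m, d),
                        _mm256_maskz_loadu_epi32 (m, c)));
}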