From: Richard Sandiford
To: Richard Biener
Cc: "juzhe.zhong\@rivai.ai", gcc-patches, jeffreyalaw, rdapp@linux.ibm.com, linkw@linux.ibm.com
Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization
Date: Wed, 12 Apr 2023 12:17:54 +0100
References: <20230407014741.139387-1-juzhe.zhong@rivai.ai> <63723855B0BF2130+2023041120125573846623@rivai.ai> <139DA38AFC9CA5B5+2023041216004591287739@rivai.ai>
In-Reply-To: (Richard Biener's message of "Wed, 12 Apr 2023 09:29:50 +0000 (UTC)")

Richard Biener writes:
> On Wed, 12 Apr 2023, juzhe.zhong@rivai.ai wrote:
>
>>
>> >> Thanks for the detailed explanation.
>> >> Just to clarify - with RVV
>> >> there's only a single mask register, v0.t, or did you want to
>> >> say an instruction can only specify a single mask register?
>>
>> RVV has 32 (v0~v31) vector registers in total.
>> We can store a vector data value or a mask value in any of them.
>> We also have mask-logic instructions, for example mask-and between
>> any vector registers.
>>
>> However, any vector operation, for example vadd.vv, can only be
>> predicated by v0 (in asm it is v0.t), which is the first vector
>> register.
>> We can't predicate vadd.vv with v1 - v31.
>>
>> So, you can imagine that every time we want to use a mask to
>> predicate a vector operation, we should always first store the mask
>> value into v0.
>>
>> So, we can write an instruction sequence like this:
>>
>> vmseq v0,v8,v9 (store mask value to v0)
>> vmslt v1,v10,v11 (store mask value to v1)
>> vmand v0,v0,v1
>> vadd.vv ...v0.t (predicate mask should always be v0).
>
> Ah, I see - that explains it well.
>
>> >> ARM SVE would have a loop control mask and a separate mask
>> >> for the if (cond[i]) which would be combined with a mask-and
>> >> instruction to a third mask which is then used on the
>> >> predicated instructions.
>>
>> Yeah, I know it. The ARM SVE way is more elegant than what RVV does.
>> However, for RVV, we can't follow this flow.
>> We don't have a "whilelo" instruction to generate a loop control mask.
>
> Yep.  Similar for AVX512 where I have to use a vector compare.  I'm
> currently using
>
>  { 0, 1, 2 ... } < { remaining_len, remaining_len, ... }
>
> and careful updating of remaining_len (we know it will either
> be adjusted by the full constant vector length or updated to zero).
>
>> We can only do loop control with the length generated by vsetvl.
>> And we can only use "v0" to predicate vadd.vv, and the mask value can
>> only be generated by comparison or mask logical instructions.
>>
>> >> PowerPC and s390x might be able to use WHILE_LEN as well (though
>> >> they only have LEN variants of loads and stores) - of course
>> >> only "simulating it".  For the fixed-vector-length ISAs the
>> >> predicated vector loop IMHO makes most sense for the epilogue to
>> >> handle low-trip loops better.
>>
>> Yeah, I wonder how they do the flow control (if (cond[i])).
>> For RVV, you can imagine I will need to add a pattern
>> LEN_MASK_LOAD/LEN_MASK_STORE (length generated by WHILE_LEN and mask
>> generated by comparison)
>>
>> I think we can CC IBM folks to see whether we can make WHILE_LEN work
>> for both IBM and RVV?
>
> I've CCed them.  Adding WHILE_LEN support to rs6000/s390x would be
> mainly the "easy" way to get len-masked (epilog) loop support.

I think that already works for them (could be misremembering).
However, IIUC, they have no special instruction to calculate the
length (unlike for RVV), and so it's open-coded using vect_get_len.

I suppose my two questions are:

(1) How easy would it be to express WHILE_LEN in normal gimple?
    I haven't thought about this at all, so the answer might be
    "very hard".  But it reminds me a little of UQDEC on AArch64,
    which we open-code using MAX_EXPR and MINUS_EXPR (see
    vect_set_loop_controls_directly).

    I'm not saying WHILE_LEN is the same operation, just that it seems
    like it might be open-codeable in a similar way.

    Even if we can open-code it, we'd still need some way for the
    target to select the "RVV way" from the "s390/PowerPC way".

(2) What effect does using a variable IV step (the result of
    the WHILE_LEN) have on ivopts?  I remember experimenting with
    something similar once (can't remember the context) and not
    having a constant step prevented ivopts from making good
    addressing-mode choices.

Thanks,
Richard
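
To make question (1) concrete, here is a rough scalar sketch in C of what an open-coded WHILE_LEN with a decrementing IV might look like.  It is only an illustration, not the patch's implementation: the names while_len, sat_dec and add_arrays are hypothetical, and it assumes the simplest possible semantics, len = MIN (remaining, vf).  A real RVV vsetvl is allowed to return other values in some cases, so this may not match the hardware behaviour exactly.

```c
#include <stddef.h>

/* Hypothetical stand-in for WHILE_LEN.  ASSUMPTION: the result is simply
   MIN (remaining, vf); a real vsetvl may choose differently.  */
static size_t
while_len (size_t remaining, size_t vf)
{
  return remaining < vf ? remaining : vf;
}

/* Saturating decrement written in the MAX_EXPR/MINUS_EXPR style that the
   email mentions for UQDEC: MAX (remaining, vf) - vf.  Under the MIN
   assumption above this equals remaining - while_len (remaining, vf).  */
static size_t
sat_dec (size_t remaining, size_t vf)
{
  return (remaining > vf ? remaining : vf) - vf;
}

/* Scalar model of a length-controlled vector loop.  VF is the number of
   elements in a full vector.  Returns the number of "vector" iterations.  */
static size_t
add_arrays (int *dst, const int *a, const int *b, size_t n, size_t vf)
{
  size_t iterations = 0;
  size_t remaining = n;

  while (remaining > 0)
    {
      size_t len = while_len (remaining, vf);

      /* Models one length-controlled load/add/store of LEN elements.  */
      for (size_t i = 0; i < len; i++)
        dst[i] = a[i] + b[i];

      /* The pointer IVs advance by LEN, which is not a compile-time
         constant; this is the variable step that question (2) asks
         about.  */
      dst += len;
      a += len;
      b += len;

      /* Decrementing IV; equivalent to remaining -= len here.  */
      remaining = sat_dec (remaining, vf);
      iterations++;
    }

  return iterations;
}
```

For example, calling add_arrays on 10 elements with vf = 4 processes chunks of 4, 4 and 2 elements.  The pointer updates by len are where a constant step (as in the AVX512 scheme quoted above, which adjusts by the full vector length or drops to zero) would differ.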