From: Richard Sandiford
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org, hongtao.liu@intel.com
Subject: Re: [PATCH 2/2][RFC] Add loop masking support for x86
Date: Thu, 15 Jul 2021 14:49:01 +0100

Richard Biener writes:
> The following extends the existing loop masking support using
> SVE WHILE_ULT to x86 by providing an alternate way to produce the
> mask using VEC_COND_EXPRs.  So with --param vect-partial-vector-usage
> you can now enable masked vectorized epilogues (=1) or fully
> masked vector loops (=2).

As mentioned on IRC, WHILE_ULT is supposed to ensure that every element
after the first zero is also zero.  That happens naturally for
power-of-2 vectors if the start index is a multiple of the VF.
(And at the moment, variable-length vectors are the only way of
supporting non-power-of-2 vectors.)

This probably works fine for =2 and =1 as things stand, since the
vector IVs always start at zero.  But if in future we have a single IV
counting scalar iterations, and use it even for peeled prologue
iterations, we could end up with a situation where the approximation
is no longer safe.

E.g. suppose we had a uint32_t scalar IV with a limit of (uint32_t)-3.
If we peeled 2 iterations for alignment and then had a VF of 8, the
final vector would have a start index of (uint32_t)-6 and the vector
would be { -1, -1, -1, 0, 0, 0, -1, -1 }.

So I think it would be safer to handle this as an alternative to using
WHILE_ULT, rather than as a direct emulation, so that we can take the
extra restrictions into account.  Alternatively, we could probably do
{ 0, 1, 2, ... } < { end - start, end - start, ... }.
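FWIW, here's a throwaway scalar sketch of that wrap-around case,
comparing the plain element-wise compare with the
{ 0, 1, 2, ... } < { end - start, ... } variant (the constants just
recreate the peeled example above; it's not meant to be vectoriser
code):

  #include <stdio.h>
  #include <stdint.h>

  int
  main (void)
  {
    /* 2 peeled iterations, VF 8, limit (uint32_t)-3, so the last
       vector starts at (uint32_t)-6.  */
    uint32_t start = (uint32_t)-6, end = (uint32_t)-3;
    for (uint32_t i = 0; i < 8; ++i)
      {
        /* The plain compare wraps at element 6, giving
           { -1, -1, -1, 0, 0, 0, -1, -1 }.  */
        int naive = (start + i < end) ? -1 : 0;
        /* Subtracting first keeps the mask monotonic:
           { -1, -1, -1, 0, 0, 0, 0, 0 }.  */
        int alt = (i < end - start) ? -1 : 0;
        printf ("%2d %2d\n", naive, alt);
      }
    return 0;
  }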
Thanks,
Richard

> What's missing is using a scalar IV for the loop control
> (but in principle AVX512 can use the mask here - just the patch
> doesn't seem to work for AVX512 yet for some reason - likely
> expand_vec_cond_expr_p doesn't work there).  What's also missing
> is providing more support for predicated operations in the case
> of reductions either via VEC_COND_EXPRs or via implementing
> some of the .COND_{ADD,SUB,MUL...} internal functions as mapping
> to masked AVX512 operations.
>
> For AVX2 and
>
>   int foo (unsigned *a, unsigned * __restrict b, int n)
>   {
>     unsigned sum = 1;
>     for (int i = 0; i < n; ++i)
>       b[i] += a[i];
>     return sum;
>   }
>
> we get
>
> .L3:
>         vpmaskmovd      (%rsi,%rax), %ymm0, %ymm3
>         vpmaskmovd      (%rdi,%rax), %ymm0, %ymm1
>         addl    $8, %edx
>         vpaddd  %ymm3, %ymm1, %ymm1
>         vpmaskmovd      %ymm1, %ymm0, (%rsi,%rax)
>         vmovd   %edx, %xmm1
>         vpsubd  %ymm15, %ymm2, %ymm0
>         addq    $32, %rax
>         vpbroadcastd    %xmm1, %ymm1
>         vpaddd  %ymm4, %ymm1, %ymm1
>         vpsubd  %ymm15, %ymm1, %ymm1
>         vpcmpgtd        %ymm1, %ymm0, %ymm0
>         vptest  %ymm0, %ymm0
>         jne     .L3
>
> for the fully masked loop body and for the masked epilogue
> we see
>
> .L4:
>         vmovdqu (%rsi,%rax), %ymm3
>         vpaddd  (%rdi,%rax), %ymm3, %ymm0
>         vmovdqu %ymm0, (%rsi,%rax)
>         addq    $32, %rax
>         cmpq    %rax, %rcx
>         jne     .L4
>         movl    %edx, %eax
>         andl    $-8, %eax
>         testb   $7, %dl
>         je      .L11
> .L3:
>         subl    %eax, %edx
>         vmovdqa .LC0(%rip), %ymm1
>         salq    $2, %rax
>         vmovd   %edx, %xmm0
>         movl    $-2147483648, %edx
>         addq    %rax, %rsi
>         vmovd   %edx, %xmm15
>         vpbroadcastd    %xmm0, %ymm0
>         vpbroadcastd    %xmm15, %ymm15
>         vpsubd  %ymm15, %ymm1, %ymm1
>         vpsubd  %ymm15, %ymm0, %ymm0
>         vpcmpgtd        %ymm1, %ymm0, %ymm0
>         vpmaskmovd      (%rsi), %ymm0, %ymm1
>         vpmaskmovd      (%rdi,%rax), %ymm0, %ymm2
>         vpaddd  %ymm2, %ymm1, %ymm1
>         vpmaskmovd      %ymm1, %ymm0, (%rsi)
> .L11:
>         vzeroupper
>
> compared to
>
> .L3:
>         movl    %edx, %r8d
>         subl    %eax, %r8d
>         leal    -1(%r8), %r9d
>         cmpl    $2, %r9d
>         jbe     .L6
>         leaq    (%rcx,%rax,4), %r9
>         vmovdqu (%rdi,%rax,4), %xmm2
>         movl    %r8d, %eax
>         andl    $-4, %eax
>         vpaddd  (%r9), %xmm2, %xmm0
>         addl    %eax, %esi
>         andl    $3, %r8d
>         vmovdqu %xmm0, (%r9)
>         je      .L2
> .L6:
>         movslq  %esi, %r8
>         leaq    0(,%r8,4), %rax
>         movl    (%rdi,%r8,4), %r8d
>         addl    %r8d, (%rcx,%rax)
>         leal    1(%rsi), %r8d
>         cmpl    %r8d, %edx
>         jle     .L2
>         addl    $2, %esi
>         movl    4(%rdi,%rax), %r8d
>         addl    %r8d, 4(%rcx,%rax)
>         cmpl    %esi, %edx
>         jle     .L2
>         movl    8(%rdi,%rax), %edx
>         addl    %edx, 8(%rcx,%rax)
> .L2:
>
> I'm giving this a little testing right now but will dig into why
> I don't get masked loops when AVX512 is enabled.
>
> Still, comments are appreciated.
>
> Thanks,
> Richard.
>
> 2021-07-15  Richard Biener
>
>         * tree-vect-loop.c (can_produce_all_loop_masks_p): We can also
>         produce masks with VEC_COND_EXPRs.
>         * tree-vect-stmts.c (vect_gen_while): Generate the mask
>         with a VEC_COND_EXPR in case WHILE_ULT is not supported.
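A side note on the AVX2 sequences quoted above, in case it helps when
reading the asm: the vpsubd of %ymm15 (which the masked-epilogue code
loads with a broadcast -2147483648) ahead of the vpcmpgtd is there
because AVX2 only has signed vector compares, so the unsigned
"index < bound" test is done by flipping the sign bit of both operands
first.  A minimal scalar sketch of that identity (the test values are
arbitrary):

  #include <stdint.h>
  #include <assert.h>

  /* Subtracting 0x80000000 flips the sign bit, turning an unsigned
     ordering into the equivalent signed one (this relies on the usual
     modulo conversion to int32_t, as GCC defines it).  */
  static int32_t
  bias (uint32_t x)
  {
    return (int32_t) (x - 0x80000000u);
  }

  int
  main (void)
  {
    uint32_t vals[] = { 0, 1, 7, 8, 0x7fffffffu, 0x80000000u, 0xfffffffdu };
    for (unsigned i = 0; i < sizeof vals / sizeof vals[0]; ++i)
      for (unsigned j = 0; j < sizeof vals / sizeof vals[0]; ++j)
        assert ((vals[i] < vals[j]) == (bias (vals[i]) < bias (vals[j])));
    return 0;
  }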
> ---
>  gcc/tree-vect-loop.c  |  8 ++++++-
>  gcc/tree-vect-stmts.c | 50 ++++++++++++++++++++++++++++++++++---------
>  2 files changed, 47 insertions(+), 11 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index fc3dab0d143..2214ed11dfb 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -975,11 +975,17 @@ can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type)
>  {
>    rgroup_controls *rgm;
>    unsigned int i;
> +  tree cmp_vectype;
>    FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
>      if (rgm->type != NULL_TREE
>         && !direct_internal_fn_supported_p (IFN_WHILE_ULT,
>                                             cmp_type, rgm->type,
> -                                           OPTIMIZE_FOR_SPEED))
> +                                           OPTIMIZE_FOR_SPEED)
> +       && ((cmp_vectype
> +              = truth_type_for (build_vector_type
> +                                  (cmp_type, TYPE_VECTOR_SUBPARTS (rgm->type)))),
> +           true)
> +       && !expand_vec_cond_expr_p (rgm->type, cmp_vectype, LT_EXPR))
>        return false;
>    return true;
>  }
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 6a25d661800..216986399b1 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -12007,16 +12007,46 @@ vect_gen_while (gimple_seq *seq, tree mask_type, tree start_index,
>                 tree end_index, const char *name)
>  {
>    tree cmp_type = TREE_TYPE (start_index);
> -  gcc_checking_assert (direct_internal_fn_supported_p (IFN_WHILE_ULT,
> -                                                      cmp_type, mask_type,
> -                                                      OPTIMIZE_FOR_SPEED));
> -  gcall *call = gimple_build_call_internal (IFN_WHILE_ULT, 3,
> -                                           start_index, end_index,
> -                                           build_zero_cst (mask_type));
> -  tree tmp = make_temp_ssa_name (mask_type, NULL, name);
> -  gimple_call_set_lhs (call, tmp);
> -  gimple_seq_add_stmt (seq, call);
> -  return tmp;
> +  if (direct_internal_fn_supported_p (IFN_WHILE_ULT,
> +                                     cmp_type, mask_type,
> +                                     OPTIMIZE_FOR_SPEED))
> +    {
> +      gcall *call = gimple_build_call_internal (IFN_WHILE_ULT, 3,
> +                                               start_index, end_index,
> +                                               build_zero_cst (mask_type));
> +      tree tmp = make_temp_ssa_name (mask_type, NULL, name);
> +      gimple_call_set_lhs (call, tmp);
> +      gimple_seq_add_stmt (seq, call);
> +      return tmp;
> +    }
> +  else
> +    {
> +      /* Generate
> +          _1 = { start_index, start_index, ... };
> +          _2 = { end_index, end_index, ... };
> +          _3 = _1 + { 0, 1, 2 ... };
> +          _4 = _3 < _2;
> +          _5 = VEC_COND_EXPR <_4, { -1, -1, ... } : { 0, 0, ... }>;  */
> +      tree cvectype = build_vector_type (cmp_type,
> +                                        TYPE_VECTOR_SUBPARTS (mask_type));
> +      tree si = make_ssa_name (cvectype);
> +      gassign *ass = gimple_build_assign
> +         (si, build_vector_from_val (cvectype, start_index));
> +      gimple_seq_add_stmt (seq, ass);
> +      tree ei = make_ssa_name (cvectype);
> +      ass = gimple_build_assign (ei,
> +                                build_vector_from_val (cvectype, end_index));
> +      gimple_seq_add_stmt (seq, ass);
> +      tree incr = build_vec_series (cvectype, build_zero_cst (cmp_type),
> +                                   build_one_cst (cmp_type));
> +      si = gimple_build (seq, PLUS_EXPR, cvectype, si, incr);
> +      tree cmp = gimple_build (seq, LT_EXPR, truth_type_for (cvectype),
> +                              si, ei);
> +      tree mask = gimple_build (seq, VEC_COND_EXPR, mask_type, cmp,
> +                               build_all_ones_cst (mask_type),
> +                               build_zero_cst (mask_type));
> +      return mask;
> +    }
>  }
>
>  /* Generate a vector mask of type MASK_TYPE for which index I is false iff
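For anyone skimming the patch, here's a rough per-lane C model of what
the new fallback in vect_gen_while computes (NUNITS and uint32_t are
just stand-ins for TYPE_VECTOR_SUBPARTS (mask_type) and cmp_type).  As
discussed above, it only preserves WHILE_ULT's
everything-after-the-first-zero-is-zero property when start_index + i
doesn't wrap around within the vector:

  #include <stdio.h>
  #include <stdint.h>

  #define NUNITS 8   /* stand-in for TYPE_VECTOR_SUBPARTS (mask_type) */

  /* Per-lane view of the emitted sequence:
       _3 = { start, ... } + { 0, 1, 2, ... };
       _4 = _3 < { end, ... };
       _5 = _4 ? { -1, ... } : { 0, ... };  */
  static void
  model_vect_gen_while (int32_t mask[NUNITS], uint32_t start, uint32_t end)
  {
    for (unsigned i = 0; i < NUNITS; ++i)
      {
        uint32_t lane = start + i;
        mask[i] = (lane < end) ? -1 : 0;
      }
  }

  int
  main (void)
  {
    int32_t mask[NUNITS];
    /* With start a multiple of the VF and no wrapping, e.g. start = 16,
       end = 19, this gives { -1, -1, -1, 0, 0, 0, 0, 0 }, the same
       result WHILE_ULT (16, 19) would produce.  */
    model_vect_gen_while (mask, 16, 19);
    for (unsigned i = 0; i < NUNITS; ++i)
      printf ("%d ", mask[i]);
    printf ("\n");
    return 0;
  }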