From: Hongtao Liu
Date: Thu, 15 Jul 2021 19:20:01 +0800
Subject: Re: [PATCH 2/2][RFC] Add loop masking support for x86
To: Richard Biener
Cc: Richard Biener, Richard Sandiford, Hongtao Liu, GCC Patches

On Thu, Jul 15, 2021 at 6:45 PM Richard Biener via Gcc-patches wrote:
>
> On Thu, Jul 15, 2021 at 12:30 PM Richard Biener wrote:
> >
> > The following extends the existing loop masking support using
> > SVE WHILE_ULT to x86 by providing an alternate way to produce the
> > mask using VEC_COND_EXPRs.  So with --param vect-partial-vector-usage
> > you can now enable masked vectorized epilogues (=1) or fully
> > masked vector loops (=2).
> >
> > What's missing is using a scalar IV for the loop control
> > (but in principle AVX512 can use the mask here - just the patch
> > doesn't seem to work for AVX512 yet for some reason - likely
> > expand_vec_cond_expr_p doesn't work there).  What's also missing
> > is providing more support for predicated operations in the case
> > of reductions either via VEC_COND_EXPRs or via implementing
> > some of the .COND_{ADD,SUB,MUL...} internal functions as mapping
> > to masked AVX512 operations.
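For reference, the per-lane semantics that both the WHILE_ULT path and the
VEC_COND_EXPR fallback added by the patch below are meant to compute can be
written as the scalar sketch that follows; this is illustrative only, the
function name is made up and it is not code from the patch:

static void
model_vect_gen_while (unsigned start_index, unsigned end_index,
                      unsigned nunits, int *mask)
{
  /* Lane i of the loop mask is all-ones while start_index + i is still
     below end_index (unsigned compare), and zero otherwise.  */
  for (unsigned i = 0; i < nunits; ++i)
    mask[i] = (start_index + i < end_index) ? -1 : 0;
}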
> >
> > For AVX2 and
> >
> > int foo (unsigned *a, unsigned * __restrict b, int n)
> > {
> >   unsigned sum = 1;
> >   for (int i = 0; i < n; ++i)
> >     b[i] += a[i];
> >   return sum;
> > }
> >
> > we get
> >
> > .L3:
> >         vpmaskmovd      (%rsi,%rax), %ymm0, %ymm3
> >         vpmaskmovd      (%rdi,%rax), %ymm0, %ymm1
> >         addl    $8, %edx
> >         vpaddd  %ymm3, %ymm1, %ymm1
> >         vpmaskmovd      %ymm1, %ymm0, (%rsi,%rax)
> >         vmovd   %edx, %xmm1
> >         vpsubd  %ymm15, %ymm2, %ymm0
> >         addq    $32, %rax
> >         vpbroadcastd    %xmm1, %ymm1
> >         vpaddd  %ymm4, %ymm1, %ymm1
> >         vpsubd  %ymm15, %ymm1, %ymm1
> >         vpcmpgtd        %ymm1, %ymm0, %ymm0
> >         vptest  %ymm0, %ymm0
> >         jne     .L3
> >
> > for the fully masked loop body and for the masked epilogue
> > we see
> >
> > .L4:
> >         vmovdqu (%rsi,%rax), %ymm3
> >         vpaddd  (%rdi,%rax), %ymm3, %ymm0
> >         vmovdqu %ymm0, (%rsi,%rax)
> >         addq    $32, %rax
> >         cmpq    %rax, %rcx
> >         jne     .L4
> >         movl    %edx, %eax
> >         andl    $-8, %eax
> >         testb   $7, %dl
> >         je      .L11
> > .L3:
> >         subl    %eax, %edx
> >         vmovdqa .LC0(%rip), %ymm1
> >         salq    $2, %rax
> >         vmovd   %edx, %xmm0
> >         movl    $-2147483648, %edx
> >         addq    %rax, %rsi
> >         vmovd   %edx, %xmm15
> >         vpbroadcastd    %xmm0, %ymm0
> >         vpbroadcastd    %xmm15, %ymm15
> >         vpsubd  %ymm15, %ymm1, %ymm1
> >         vpsubd  %ymm15, %ymm0, %ymm0
> >         vpcmpgtd        %ymm1, %ymm0, %ymm0
> >         vpmaskmovd      (%rsi), %ymm0, %ymm1
> >         vpmaskmovd      (%rdi,%rax), %ymm0, %ymm2
> >         vpaddd  %ymm2, %ymm1, %ymm1
> >         vpmaskmovd      %ymm1, %ymm0, (%rsi)
> > .L11:
> >         vzeroupper
> >
> > compared to
> >
> > .L3:
> >         movl    %edx, %r8d
> >         subl    %eax, %r8d
> >         leal    -1(%r8), %r9d
> >         cmpl    $2, %r9d
> >         jbe     .L6
> >         leaq    (%rcx,%rax,4), %r9
> >         vmovdqu (%rdi,%rax,4), %xmm2
> >         movl    %r8d, %eax
> >         andl    $-4, %eax
> >         vpaddd  (%r9), %xmm2, %xmm0
> >         addl    %eax, %esi
> >         andl    $3, %r8d
> >         vmovdqu %xmm0, (%r9)
> >         je      .L2
> > .L6:
> >         movslq  %esi, %r8
> >         leaq    0(,%r8,4), %rax
> >         movl    (%rdi,%r8,4), %r8d
> >         addl    %r8d, (%rcx,%rax)
> >         leal    1(%rsi), %r8d
> >         cmpl    %r8d, %edx
> >         jle     .L2
> >         addl    $2, %esi
> >         movl    4(%rdi,%rax), %r8d
> >         addl    %r8d, 4(%rcx,%rax)
> >         cmpl    %esi, %edx
> >         jle     .L2
> >         movl    8(%rdi,%rax), %edx
> >         addl    %edx, 8(%rcx,%rax)
> > .L2:
> >
> > I'm giving this a little testing right now but will dig on why
> > I don't get masked loops when AVX512 is enabled.
>
> Ah, a simple thinko - rgroup_controls vectypes seem to be
> always VECTOR_BOOLEAN_TYPE_P and thus we can
> use expand_vec_cmp_expr_p.  The AVX512 fully masked
> loop then looks like
>
> .L3:
>         vmovdqu32       (%rsi,%rax,4), %ymm2{%k1}
>         vmovdqu32       (%rdi,%rax,4), %ymm1{%k1}
>         vpaddd  %ymm2, %ymm1, %ymm0
>         vmovdqu32       %ymm0, (%rsi,%rax,4){%k1}
>         addq    $8, %rax
>         vpbroadcastd    %eax, %ymm0
>         vpaddd  %ymm4, %ymm0, %ymm0
>         vpcmpud $6, %ymm0, %ymm3, %k1
>         kortestb        %k1, %k1
>         jne     .L3
>
> I guess for x86 it's not worth preserving the VEC_COND_EXPR
> mask generation but other archs may not provide all required vec_cmp
> expanders.

For the main loop, the fully masked loop's codegen seems much worse.
Basically, we need at least 4 instructions to do what while_ult does on arm:

        vpbroadcastd    %eax, %ymm0
        vpaddd  %ymm4, %ymm0, %ymm0
        vpcmpud $6, %ymm0, %ymm3, %k1
        kortestb        %k1, %k1

vs

        whilelo (or some other while instruction)

More instructions are needed for avx2 since there's no direct
instruction for .COND_{ADD,SUB,...}.

The original (unmasked) loop:

.L4:
        vmovdqu (%rcx,%rax), %ymm1
        vpaddd  (%rdi,%rax), %ymm1, %ymm0
        vmovdqu %ymm0, (%rcx,%rax)
        addq    $32, %rax
        cmpq    %rax, %rsi
        jne     .L4

vs the avx512 fully masked loop:

.L3:
        vmovdqu32       (%rsi,%rax,4), %ymm2{%k1}
        vmovdqu32       (%rdi,%rax,4), %ymm1{%k1}
        vpaddd  %ymm2, %ymm1, %ymm0
        vmovdqu32       %ymm0, (%rsi,%rax,4){%k1}
        addq    $8, %rax
        vpbroadcastd    %eax, %ymm0
        vpaddd  %ymm4, %ymm0, %ymm0
        vpcmpud $6, %ymm0, %ymm3, %k1
        kortestb        %k1, %k1
        jne     .L3

vs the avx2 fully masked loop:

.L3:
        vpmaskmovd      (%rsi,%rax), %ymm0, %ymm3
        vpmaskmovd      (%rdi,%rax), %ymm0, %ymm1
        addl    $8, %edx
        vpaddd  %ymm3, %ymm1, %ymm1
        vpmaskmovd      %ymm1, %ymm0, (%rsi,%rax)
        vmovd   %edx, %xmm1
        vpsubd  %ymm15, %ymm2, %ymm0
        addq    $32, %rax
        vpbroadcastd    %xmm1, %ymm1
        vpaddd  %ymm4, %ymm1, %ymm1
        vpsubd  %ymm15, %ymm1, %ymm1
        vpcmpgtd        %ymm1, %ymm0, %ymm0
        vptest  %ymm0, %ymm0
        jne     .L3

vs arm64's code:

.L3:
        ld1w    z1.s, p0/z, [x1, x3, lsl 2]
        ld1w    z0.s, p0/z, [x0, x3, lsl 2]
        add     z0.s, z0.s, z1.s
        st1w    z0.s, p0, [x1, x3, lsl 2]
        add     x3, x3, x4
        whilelo p0.s, w3, w2
        b.any   .L3

> Richard.
>
> > Still comments are appreciated.
> >
> > Thanks,
> > Richard.
> >
> > 2021-07-15  Richard Biener
> >
> >         * tree-vect-loop.c (can_produce_all_loop_masks_p): We
> >         also can produce masks with VEC_COND_EXPRs.
> >         * tree-vect-stmts.c (vect_gen_while): Generate the mask
> >         with a VEC_COND_EXPR in case WHILE_ULT is not supported.
> > ---
> >  gcc/tree-vect-loop.c  |  8 ++++++-
> >  gcc/tree-vect-stmts.c | 50 ++++++++++++++++++++++++++++++++++---------
> >  2 files changed, 47 insertions(+), 11 deletions(-)
> >
> > diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> > index fc3dab0d143..2214ed11dfb 100644
> > --- a/gcc/tree-vect-loop.c
> > +++ b/gcc/tree-vect-loop.c
> > @@ -975,11 +975,17 @@ can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type)
> >  {
> >    rgroup_controls *rgm;
> >    unsigned int i;
> > +  tree cmp_vectype;
> >    FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
> >      if (rgm->type != NULL_TREE
> >         && !direct_internal_fn_supported_p (IFN_WHILE_ULT,
> >                                             cmp_type, rgm->type,
> > -                                           OPTIMIZE_FOR_SPEED))
> > +                                           OPTIMIZE_FOR_SPEED)
> > +       && ((cmp_vectype
> > +              = truth_type_for (build_vector_type
> > +                                  (cmp_type, TYPE_VECTOR_SUBPARTS (rgm->type)))),
> > +           true)
> > +       && !expand_vec_cond_expr_p (rgm->type, cmp_vectype, LT_EXPR))
> >        return false;
> >    return true;
> >  }
> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> > index 6a25d661800..216986399b1 100644
> > --- a/gcc/tree-vect-stmts.c
> > +++ b/gcc/tree-vect-stmts.c
> > @@ -12007,16 +12007,46 @@ vect_gen_while (gimple_seq *seq, tree mask_type, tree start_index,
> >                 tree end_index, const char *name)
> >  {
> >    tree cmp_type = TREE_TYPE (start_index);
> > -  gcc_checking_assert (direct_internal_fn_supported_p (IFN_WHILE_ULT,
> > -                                                       cmp_type, mask_type,
> > -                                                       OPTIMIZE_FOR_SPEED));
> > -  gcall *call = gimple_build_call_internal (IFN_WHILE_ULT, 3,
> > -                                            start_index, end_index,
> > -                                            build_zero_cst (mask_type));
> > -  tree tmp = make_temp_ssa_name (mask_type, NULL, name);
> > -  gimple_call_set_lhs (call, tmp);
> > -  gimple_seq_add_stmt (seq, call);
> > -  return tmp;
> > +  if (direct_internal_fn_supported_p (IFN_WHILE_ULT,
> > +                                      cmp_type, mask_type,
> > +                                      OPTIMIZE_FOR_SPEED))
> > +    {
> > +      gcall *call = gimple_build_call_internal (IFN_WHILE_ULT, 3,
> > +                                                start_index, end_index,
> > +                                                build_zero_cst (mask_type));
> > +      tree tmp = make_temp_ssa_name (mask_type, NULL, name);
> > +      gimple_call_set_lhs (call, tmp);
> > +      gimple_seq_add_stmt (seq, call);
> > +      return tmp;
> > +    }
> > +  else
> > +    {
> > +      /* Generate
> > +           _1 = { start_index, start_index, ... };
> > +           _2 = { end_index, end_index, ... };
> > +           _3 = _1 + { 0, 1, 2 ... };
> > +           _4 = _3 < _2;
> > +           _5 = VEC_COND_EXPR <_4, { -1, -1, ... } : { 0, 0, ... }>;  */
> > +      tree cvectype = build_vector_type (cmp_type,
> > +                                         TYPE_VECTOR_SUBPARTS (mask_type));
> > +      tree si = make_ssa_name (cvectype);
> > +      gassign *ass = gimple_build_assign
> > +                       (si, build_vector_from_val (cvectype, start_index));
> > +      gimple_seq_add_stmt (seq, ass);
> > +      tree ei = make_ssa_name (cvectype);
> > +      ass = gimple_build_assign (ei,
> > +                                 build_vector_from_val (cvectype, end_index));
> > +      gimple_seq_add_stmt (seq, ass);
> > +      tree incr = build_vec_series (cvectype, build_zero_cst (cmp_type),
> > +                                    build_one_cst (cmp_type));
> > +      si = gimple_build (seq, PLUS_EXPR, cvectype, si, incr);
> > +      tree cmp = gimple_build (seq, LT_EXPR, truth_type_for (cvectype),
> > +                               si, ei);
> > +      tree mask = gimple_build (seq, VEC_COND_EXPR, mask_type, cmp,
> > +                                build_all_ones_cst (mask_type),
> > +                                build_zero_cst (mask_type));
> > +      return mask;
> > +    }
> >  }
> >
> >  /* Generate a vector mask of type MASK_TYPE for which index I is false iff
> > --
> > 2.26.2

--
BR,
Hongtao
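A note on the vpsubd/vpcmpgtd pairs in the AVX2 sequences quoted above: AVX2
has no unsigned vector compare, so the unsigned "index < bound" test is done
by biasing both operands by 0x80000000 and then comparing signed with
vpcmpgtd.  The hand-written C sketch below illustrates that trick; it is not
compiler output, and the function name and fixed 8-lane shape are assumptions
for illustration:

#include <immintrin.h>

/* Build an 8-lane AVX2 mask whose lane l is all-ones iff i + l < n
   (unsigned), using the same bias-then-signed-compare trick as the
   quoted code.  */
static __m256i
avx2_lane_mask (unsigned i, unsigned n)
{
  const __m256i lanes = _mm256_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7);
  const __m256i bias  = _mm256_set1_epi32 ((int) 0x80000000u);
  __m256i idx = _mm256_add_epi32 (_mm256_set1_epi32 ((int) i), lanes);
  __m256i lhs = _mm256_sub_epi32 (idx, bias);                 /* idx - 0x80000000 */
  __m256i rhs = _mm256_sub_epi32 (_mm256_set1_epi32 ((int) n), bias);
  return _mm256_cmpgt_epi32 (rhs, lhs);   /* -1 where (unsigned) (i + lane) < n */
}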