From: Hongtao Liu
Date: Thu, 15 Jul 2021 19:20:01 +0800
Subject: Re: [PATCH 2/2][RFC] Add loop masking support for x86
To: Richard Biener
Cc: Richard Biener, Richard Sandiford, Hongtao Liu, GCC Patches

On Thu, Jul 15, 2021 at 6:45 PM Richard Biener via Gcc-patches wrote:
>
> On Thu, Jul 15, 2021 at 12:30 PM Richard Biener wrote:
> >
> > The following extends the existing loop masking support using
> > SVE WHILE_ULT to x86 by providing an alternate way to produce the
> > mask using VEC_COND_EXPRs.  So with --param vect-partial-vector-usage
> > you can now enable masked vectorized epilogues (=1) or fully
> > masked vector loops (=2).
> >
> > What's missing is using a scalar IV for the loop control
> > (but in principle AVX512 can use the mask here - just the patch
> > doesn't seem to work for AVX512 yet for some reason - likely
> > expand_vec_cond_expr_p doesn't work there).  What's also missing
> > is providing more support for predicated operations in the case
> > of reductions either via VEC_COND_EXPRs or via implementing
> > some of the .COND_{ADD,SUB,MUL...} internal functions as mapping
> > to masked AVX512 operations.
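For reference, the per-lane semantics that both the WHILE_ULT path and the
VEC_COND_EXPR fallback added by the patch below are meant to compute can be
written as the scalar sketch that follows; this is illustrative only, the
function name is made up and it is not code from the patch:

static void
model_vect_gen_while (unsigned start_index, unsigned end_index,
                      unsigned nunits, int *mask)
{
  /* Lane i of the loop mask is all-ones while start_index + i is still
     below end_index (unsigned compare), and zero otherwise.  */
  for (unsigned i = 0; i < nunits; ++i)
    mask[i] = (start_index + i < end_index) ? -1 : 0;
}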
> >
> > For AVX2 and
> >
> > int foo (unsigned *a, unsigned * __restrict b, int n)
> > {
> >   unsigned sum = 1;
> >   for (int i = 0; i < n; ++i)
> >     b[i] += a[i];
> >   return sum;
> > }
> >
> > we get
> >
> > .L3:
> >         vpmaskmovd      (%rsi,%rax), %ymm0, %ymm3
> >         vpmaskmovd      (%rdi,%rax), %ymm0, %ymm1
> >         addl    $8, %edx
> >         vpaddd  %ymm3, %ymm1, %ymm1
> >         vpmaskmovd      %ymm1, %ymm0, (%rsi,%rax)
> >         vmovd   %edx, %xmm1
> >         vpsubd  %ymm15, %ymm2, %ymm0
> >         addq    $32, %rax
> >         vpbroadcastd    %xmm1, %ymm1
> >         vpaddd  %ymm4, %ymm1, %ymm1
> >         vpsubd  %ymm15, %ymm1, %ymm1
> >         vpcmpgtd        %ymm1, %ymm0, %ymm0
> >         vptest  %ymm0, %ymm0
> >         jne     .L3
> >
> > for the fully masked loop body and for the masked epilogue
> > we see
> >
> > .L4:
> >         vmovdqu (%rsi,%rax), %ymm3
> >         vpaddd  (%rdi,%rax), %ymm3, %ymm0
> >         vmovdqu %ymm0, (%rsi,%rax)
> >         addq    $32, %rax
> >         cmpq    %rax, %rcx
> >         jne     .L4
> >         movl    %edx, %eax
> >         andl    $-8, %eax
> >         testb   $7, %dl
> >         je      .L11
> > .L3:
> >         subl    %eax, %edx
> >         vmovdqa .LC0(%rip), %ymm1
> >         salq    $2, %rax
> >         vmovd   %edx, %xmm0
> >         movl    $-2147483648, %edx
> >         addq    %rax, %rsi
> >         vmovd   %edx, %xmm15
> >         vpbroadcastd    %xmm0, %ymm0
> >         vpbroadcastd    %xmm15, %ymm15
> >         vpsubd  %ymm15, %ymm1, %ymm1
> >         vpsubd  %ymm15, %ymm0, %ymm0
> >         vpcmpgtd        %ymm1, %ymm0, %ymm0
> >         vpmaskmovd      (%rsi), %ymm0, %ymm1
> >         vpmaskmovd      (%rdi,%rax), %ymm0, %ymm2
> >         vpaddd  %ymm2, %ymm1, %ymm1
> >         vpmaskmovd      %ymm1, %ymm0, (%rsi)
> > .L11:
> >         vzeroupper
> >
> > compared to
> >
> > .L3:
> >         movl    %edx, %r8d
> >         subl    %eax, %r8d
> >         leal    -1(%r8), %r9d
> >         cmpl    $2, %r9d
> >         jbe     .L6
> >         leaq    (%rcx,%rax,4), %r9
> >         vmovdqu (%rdi,%rax,4), %xmm2
> >         movl    %r8d, %eax
> >         andl    $-4, %eax
> >         vpaddd  (%r9), %xmm2, %xmm0
> >         addl    %eax, %esi
> >         andl    $3, %r8d
> >         vmovdqu %xmm0, (%r9)
> >         je      .L2
> > .L6:
> >         movslq  %esi, %r8
> >         leaq    0(,%r8,4), %rax
> >         movl    (%rdi,%r8,4), %r8d
> >         addl    %r8d, (%rcx,%rax)
> >         leal    1(%rsi), %r8d
> >         cmpl    %r8d, %edx
> >         jle     .L2
> >         addl    $2, %esi
> >         movl    4(%rdi,%rax), %r8d
> >         addl    %r8d, 4(%rcx,%rax)
> >         cmpl    %esi, %edx
> >         jle     .L2
> >         movl    8(%rdi,%rax), %edx
> >         addl    %edx, 8(%rcx,%rax)
> > .L2:
> >
> > I'm giving this a little testing right now but will dig on why
> > I don't get masked loops when AVX512 is enabled.
>
> Ah, a simple thinko - rgroup_controls vectypes seem to be
> always VECTOR_BOOLEAN_TYPE_P and thus we can
> use expand_vec_cmp_expr_p.  The AVX512 fully masked
> loop then looks like
>
> .L3:
>         vmovdqu32       (%rsi,%rax,4), %ymm2{%k1}
>         vmovdqu32       (%rdi,%rax,4), %ymm1{%k1}
>         vpaddd  %ymm2, %ymm1, %ymm0
>         vmovdqu32       %ymm0, (%rsi,%rax,4){%k1}
>         addq    $8, %rax
>         vpbroadcastd    %eax, %ymm0
>         vpaddd  %ymm4, %ymm0, %ymm0
>         vpcmpud $6, %ymm0, %ymm3, %k1
>         kortestb        %k1, %k1
>         jne     .L3
>
> I guess for x86 it's not worth preserving the VEC_COND_EXPR
> mask generation but other archs may not provide all required vec_cmp
> expanders.

For the main loop, the fully masked loop's codegen seems much worse.
Basically, we need at least 4 instructions to do what while_ult does on arm:

        vpbroadcastd    %eax, %ymm0
        vpaddd  %ymm4, %ymm0, %ymm0
        vpcmpud $6, %ymm0, %ymm3, %k1
        kortestb        %k1, %k1

vs

        whilelo (or some other while instruction)

More instructions are needed for avx2 since there's no direct
instruction for .COND_{ADD,SUB,...}.

The original (unmasked) loop:

.L4:
        vmovdqu (%rcx,%rax), %ymm1
        vpaddd  (%rdi,%rax), %ymm1, %ymm0
        vmovdqu %ymm0, (%rcx,%rax)
        addq    $32, %rax
        cmpq    %rax, %rsi
        jne     .L4

vs the avx512 fully masked loop:

.L3:
        vmovdqu32       (%rsi,%rax,4), %ymm2{%k1}
        vmovdqu32       (%rdi,%rax,4), %ymm1{%k1}
        vpaddd  %ymm2, %ymm1, %ymm0
        vmovdqu32       %ymm0, (%rsi,%rax,4){%k1}
        addq    $8, %rax
        vpbroadcastd    %eax, %ymm0
        vpaddd  %ymm4, %ymm0, %ymm0
        vpcmpud $6, %ymm0, %ymm3, %k1
        kortestb        %k1, %k1
        jne     .L3

vs the avx2 fully masked loop:

.L3:
        vpmaskmovd      (%rsi,%rax), %ymm0, %ymm3
        vpmaskmovd      (%rdi,%rax), %ymm0, %ymm1
        addl    $8, %edx
        vpaddd  %ymm3, %ymm1, %ymm1
        vpmaskmovd      %ymm1, %ymm0, (%rsi,%rax)
        vmovd   %edx, %xmm1
        vpsubd  %ymm15, %ymm2, %ymm0
        addq    $32, %rax
        vpbroadcastd    %xmm1, %ymm1
        vpaddd  %ymm4, %ymm1, %ymm1
        vpsubd  %ymm15, %ymm1, %ymm1
        vpcmpgtd        %ymm1, %ymm0, %ymm0
        vptest  %ymm0, %ymm0
        jne     .L3

vs arm64's code:

.L3:
        ld1w    z1.s, p0/z, [x1, x3, lsl 2]
        ld1w    z0.s, p0/z, [x0, x3, lsl 2]
        add     z0.s, z0.s, z1.s
        st1w    z0.s, p0, [x1, x3, lsl 2]
        add     x3, x3, x4
        whilelo p0.s, w3, w2
        b.any   .L3

> Richard.
>
> > Still comments are appreciated.
> >
> > Thanks,
> > Richard.
> >
> > 2021-07-15  Richard Biener
> >
> >         * tree-vect-loop.c (can_produce_all_loop_masks_p): We
> >         also can produce masks with VEC_COND_EXPRs.
> >         * tree-vect-stmts.c (vect_gen_while): Generate the mask
> >         with a VEC_COND_EXPR in case WHILE_ULT is not supported.
> > ---
> >  gcc/tree-vect-loop.c  |  8 ++++++-
> >  gcc/tree-vect-stmts.c | 50 ++++++++++++++++++++++++++++++++++---------
> >  2 files changed, 47 insertions(+), 11 deletions(-)
> >
> > diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> > index fc3dab0d143..2214ed11dfb 100644
> > --- a/gcc/tree-vect-loop.c
> > +++ b/gcc/tree-vect-loop.c
> > @@ -975,11 +975,17 @@ can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type)
> >  {
> >    rgroup_controls *rgm;
> >    unsigned int i;
> > +  tree cmp_vectype;
> >    FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
> >      if (rgm->type != NULL_TREE
> >         && !direct_internal_fn_supported_p (IFN_WHILE_ULT,
> >                                             cmp_type, rgm->type,
> > -                                           OPTIMIZE_FOR_SPEED))
> > +                                           OPTIMIZE_FOR_SPEED)
> > +       && ((cmp_vectype
> > +              = truth_type_for (build_vector_type
> > +                                  (cmp_type, TYPE_VECTOR_SUBPARTS (rgm->type)))),
> > +           true)
> > +       && !expand_vec_cond_expr_p (rgm->type, cmp_vectype, LT_EXPR))
> >        return false;
> >    return true;
> >  }
> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> > index 6a25d661800..216986399b1 100644
> > --- a/gcc/tree-vect-stmts.c
> > +++ b/gcc/tree-vect-stmts.c
> > @@ -12007,16 +12007,46 @@ vect_gen_while (gimple_seq *seq, tree mask_type, tree start_index,
> >                 tree end_index, const char *name)
> >  {
> >    tree cmp_type = TREE_TYPE (start_index);
> > -  gcc_checking_assert (direct_internal_fn_supported_p (IFN_WHILE_ULT,
> > -                                                       cmp_type, mask_type,
> > -                                                       OPTIMIZE_FOR_SPEED));
> > -  gcall *call = gimple_build_call_internal (IFN_WHILE_ULT, 3,
> > -                                            start_index, end_index,
> > -                                            build_zero_cst (mask_type));
> > -  tree tmp = make_temp_ssa_name (mask_type, NULL, name);
> > -  gimple_call_set_lhs (call, tmp);
> > -  gimple_seq_add_stmt (seq, call);
> > -  return tmp;
> > +  if (direct_internal_fn_supported_p (IFN_WHILE_ULT,
> > +                                      cmp_type, mask_type,
> > +                                      OPTIMIZE_FOR_SPEED))
> > +    {
> > +      gcall *call = gimple_build_call_internal (IFN_WHILE_ULT, 3,
> > +                                                start_index, end_index,
> > +                                                build_zero_cst (mask_type));
> > +      tree tmp = make_temp_ssa_name (mask_type, NULL, name);
> > +      gimple_call_set_lhs (call, tmp);
> > +      gimple_seq_add_stmt (seq, call);
> > +      return tmp;
> > +    }
> > +  else
> > +    {
> > +      /* Generate
> > +           _1 = { start_index, start_index, ... };
> > +           _2 = { end_index, end_index, ... };
> > +           _3 = _1 + { 0, 1, 2 ... };
> > +           _4 = _3 < _2;
> > +           _5 = VEC_COND_EXPR <_4, { -1, -1, ... } : { 0, 0, ... }>;  */
> > +      tree cvectype = build_vector_type (cmp_type,
> > +                                         TYPE_VECTOR_SUBPARTS (mask_type));
> > +      tree si = make_ssa_name (cvectype);
> > +      gassign *ass = gimple_build_assign
> > +                       (si, build_vector_from_val (cvectype, start_index));
> > +      gimple_seq_add_stmt (seq, ass);
> > +      tree ei = make_ssa_name (cvectype);
> > +      ass = gimple_build_assign (ei,
> > +                                 build_vector_from_val (cvectype, end_index));
> > +      gimple_seq_add_stmt (seq, ass);
> > +      tree incr = build_vec_series (cvectype, build_zero_cst (cmp_type),
> > +                                    build_one_cst (cmp_type));
> > +      si = gimple_build (seq, PLUS_EXPR, cvectype, si, incr);
> > +      tree cmp = gimple_build (seq, LT_EXPR, truth_type_for (cvectype),
> > +                               si, ei);
> > +      tree mask = gimple_build (seq, VEC_COND_EXPR, mask_type, cmp,
> > +                                build_all_ones_cst (mask_type),
> > +                                build_zero_cst (mask_type));
> > +      return mask;
> > +    }
> >  }
> >
> >  /* Generate a vector mask of type MASK_TYPE for which index I is false iff
> > --
> > 2.26.2

--
BR,
Hongtao
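A note on the vpsubd/vpcmpgtd pairs in the AVX2 sequences quoted above: AVX2
has no unsigned vector compare, so the unsigned "index < bound" test is done
by biasing both operands by 0x80000000 and then comparing signed with
vpcmpgtd.  The hand-written C sketch below illustrates that trick; it is not
compiler output, and the function name and fixed 8-lane shape are assumptions
for illustration:

#include <immintrin.h>

/* Build an 8-lane AVX2 mask whose lane l is all-ones iff i + l < n
   (unsigned), using the same bias-then-signed-compare trick as the
   quoted code.  */
static __m256i
avx2_lane_mask (unsigned i, unsigned n)
{
  const __m256i lanes = _mm256_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7);
  const __m256i bias  = _mm256_set1_epi32 ((int) 0x80000000u);
  __m256i idx = _mm256_add_epi32 (_mm256_set1_epi32 ((int) i), lanes);
  __m256i lhs = _mm256_sub_epi32 (idx, bias);                 /* idx - 0x80000000 */
  __m256i rhs = _mm256_sub_epi32 (_mm256_set1_epi32 ((int) n), bias);
  return _mm256_cmpgt_epi32 (rhs, lhs);   /* -1 where (unsigned) (i + lane) < n */
}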