Date: Fri, 16 Jun 2023 09:04:16 +0000 (UTC)
From: Richard Biener
To: Ju-Zhe Zhong
cc: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com, rdapp.gcc@gmail.com
Subject: Re: [PATCH V4] VECT: Support LEN_MASK_{LOAD,STORE} ifn && optabs
In-Reply-To: <20230615131435.10323-1-juzhe.zhong@rivai.ai>

On Thu, 15 Jun 2023, juzhe.zhong@rivai.ai wrote:

> From: Ju-Zhe Zhong
>
> This patch bootstraps and passes on x86. OK for trunk?

OK with me, please give Richard S. a chance to comment before pushing.

Thanks,
Richard.

> According to comments from Richi, I split the first patch so that it only
> adds the ifn and optabs for LEN_MASK_{LOAD,STORE}; they are not yet used
> by the vectorizer in this patch. A BIAS argument is also added for
> possible future use by s390.
>
> The descriptions of the patterns in the documentation come from Robin.
>
> After this patch is approved, I will send a second patch that applies the
> len_mask_* patterns in the vectorizer.
>
> Targets like ARM SVE in GCC have an elegant way to handle both loop control
> and flow control simultaneously:
>
> loop_control_mask = WHILE_ULT
> flow_control_mask = comparison
> control_mask = loop_control_mask & flow_control_mask;
> MASK_LOAD (control_mask)
> MASK_STORE (control_mask)
>
> However, targets like RVV (RISC-V Vector) cannot use this approach in
> auto-vectorization since RVV uses a length for loop control.
>
> This patch adds LEN_MASK_{LOAD,STORE} to support flow control for targets
> like RISC-V that use a length for loop control.
> Loads and stores are normalized into LEN_MASK_{LOAD,STORE} as long as either
> the length or the mask is valid. The length is the outcome of SELECT_VL or
> MIN_EXPR; the mask is the outcome of a comparison.
>
> The LEN_MASK_{LOAD,STORE} formats are defined as follows:
> 1) LEN_MASK_LOAD (ptr, align, length, mask)
> 2) LEN_MASK_STORE (ptr, align, length, mask, vec)
>
> Consider the following 4 cases:
>
> VLA: Variable-length auto-vectorization
> VLS: Specific-length auto-vectorization
>
> Case 1 (VLS): -mrvv-vector-bits=128   IR (does not use LEN_MASK_*):
> Code:                                 v1 = MEM (...)
>   for (int i = 0; i < 4; i++)         v2 = MEM (...)
>     a[i] = b[i] + c[i];               v3 = v1 + v2
>                                       MEM[...] = v3
>
> Case 2 (VLS): -mrvv-vector-bits=128   IR (LEN_MASK_* with length = VF, mask = comparison):
> Code:                                 mask = comparison
>   for (int i = 0; i < 4; i++)         v1 = LEN_MASK_LOAD (length = VF, mask)
>     if (cond[i])                      v2 = LEN_MASK_LOAD (length = VF, mask)
>       a[i] = b[i] + c[i];             v3 = v1 + v2
>                                       LEN_MASK_STORE (length = VF, mask, v3)
>
> Case 3 (VLA):
> Code:                                 loop_len = SELECT_VL or MIN
>   for (int i = 0; i < n; i++)         v1 = LEN_MASK_LOAD (length = loop_len, mask = {-1,-1,...})
>     a[i] = b[i] + c[i];               v2 = LEN_MASK_LOAD (length = loop_len, mask = {-1,-1,...})
>                                       v3 = v1 + v2
>                                       LEN_MASK_STORE (length = loop_len, mask = {-1,-1,...}, v3)
>
> Case 4 (VLA):
> Code:                                 loop_len = SELECT_VL or MIN
>   for (int i = 0; i < n; i++)         mask = comparison
>     if (cond[i])                      v1 = LEN_MASK_LOAD (length = loop_len, mask)
>       a[i] = b[i] + c[i];             v2 = LEN_MASK_LOAD (length = loop_len, mask)
>                                       v3 = v1 + v2
>                                       LEN_MASK_STORE (length = loop_len, mask, v3)
>
> Co-authored-by: Robin Dapp
>
> gcc/ChangeLog:
>
> 	* doc/md.texi: Add len_mask{load,store}.
> 	* genopinit.cc (main): Ditto.
> 	(CMP_NAME): Ditto.
> 	* internal-fn.cc (len_maskload_direct): Ditto.
> 	(len_maskstore_direct): Ditto.
> 	(expand_call_mem_ref): Ditto.
> 	(expand_partial_load_optab_fn): Ditto.
> 	(expand_len_maskload_optab_fn): Ditto.
> 	(expand_partial_store_optab_fn): Ditto.
> 	(expand_len_maskstore_optab_fn): Ditto.
> 	(direct_len_maskload_optab_supported_p): Ditto.
> 	(direct_len_maskstore_optab_supported_p): Ditto.
> 	* internal-fn.def (LEN_MASK_LOAD): Ditto.
> 	(LEN_MASK_STORE): Ditto.
> 	* optabs.def (OPTAB_CD): Ditto.
>
> ---
>  gcc/doc/md.texi     | 46 +++++++++++++++++++++++++++++++++++++++++++++
>  gcc/genopinit.cc    |  6 ++++--
>  gcc/internal-fn.cc  | 43 ++++++++++++++++++++++++++++++++++++++----
>  gcc/internal-fn.def |  4 ++++
>  gcc/optabs.def      |  2 ++
>  5 files changed, 95 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index a43fd65a2b2..af23ec938d6 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5136,6 +5136,52 @@ of @code{QI} elements.
>
>  This pattern is not allowed to @code{FAIL}.
>
> +@cindex @code{len_maskload@var{m}@var{n}} instruction pattern
> +@item @samp{len_maskload@var{m}@var{n}}
> +Perform a masked load of (operand 2 - operand 4) elements from vector memory
> +operand 1 into vector register operand 0, setting the other elements of
> +operand 0 to undefined values. This is a combination of len_load and maskload.
> +Operands 0 and 1 have mode @var{m}, which must be a vector mode. Operand 2
> +has whichever integer mode the target prefers. A secondary mask is specified in
> +operand 3, which must be of mode @var{n}. Operand 4 conceptually has mode @code{QI}.
> +
> +Operand 2 can be a variable or a constant amount. Operand 4 specifies a
> +constant bias: it is either a constant 0 or a constant -1. The predicate on
> +operand 4 must only accept the bias values that the target actually supports.
> +GCC handles a bias of 0 more efficiently than a bias of -1.
> +
> +If (operand 2 - operand 4) exceeds the number of elements in mode
> +@var{m}, the behavior is undefined.
> +
> +If the target prefers the length to be measured in bytes
> +rather than elements, it should only implement this pattern for vectors
> +of @code{QI} elements.
> +
> +This pattern is not allowed to @code{FAIL}.
> +
> +@cindex @code{len_maskstore@var{m}@var{n}} instruction pattern
> +@item @samp{len_maskstore@var{m}@var{n}}
> +Perform a masked store of (operand 2 - operand 4) vector elements from vector register
> +operand 1 into memory operand 0, leaving the other elements of operand 0 unchanged.
> +This is a combination of len_store and maskstore.
> +Operands 0 and 1 have mode @var{m}, which must be a vector mode. Operand 2 has whichever
> +integer mode the target prefers. A secondary mask is specified in operand 3, which must be
> +of mode @var{n}. Operand 4 conceptually has mode @code{QI}.
> +
> +Operand 2 can be a variable or a constant amount. Operand 4 specifies a
> +constant bias: it is either a constant 0 or a constant -1. The predicate on
> +operand 4 must only accept the bias values that the target actually supports.
> +GCC handles a bias of 0 more efficiently than a bias of -1.
> +
> +If (operand 2 - operand 4) exceeds the number of elements in mode
> +@var{m}, the behavior is undefined.
> +
> +If the target prefers the length to be measured in bytes
> +rather than elements, it should only implement this pattern for vectors
> +of @code{QI} elements.
> +
> +This pattern is not allowed to @code{FAIL}.
> +
>  @cindex @code{vec_perm@var{m}} instruction pattern
>  @item @samp{vec_perm@var{m}}
>  Output a (variable) vector permutation. Operand 0 is the destination
> diff --git a/gcc/genopinit.cc b/gcc/genopinit.cc
> index 0c1b6859ca0..6bd8858a1d9 100644
> --- a/gcc/genopinit.cc
> +++ b/gcc/genopinit.cc
> @@ -376,7 +376,8 @@ main (int argc, const char **argv)
>
>    fprintf (s_file,
>             "/* Returns TRUE if the target supports any of the partial vector\n"
> -           " optabs: while_ult_optab, len_load_optab or len_store_optab,\n"
> +           " optabs: while_ult_optab, len_load_optab, len_store_optab,\n"
> +           " len_maskload_optab or len_maskstore_optab,\n"
>             " for any mode. */\n"
>             "bool\npartial_vectors_supported_p (void)\n{\n");
>    bool any_match = false;
> @@ -386,7 +387,8 @@ main (int argc, const char **argv)
>      {
>  #define CMP_NAME(N) !strncmp (p->name, (N), strlen ((N)))
>        if (CMP_NAME("while_ult") || CMP_NAME ("len_load")
> -          || CMP_NAME ("len_store"))
> +          || CMP_NAME ("len_store")|| CMP_NAME ("len_maskload")
> +          || CMP_NAME ("len_maskstore"))
>          {
>            if (first)
>              fprintf (s_file, " HAVE_%s", p->name);
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 208bdf497eb..c911ae790cb 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -165,6 +165,7 @@ init_internal_fns ()
>  #define mask_load_lanes_direct { -1, -1, false }
>  #define gather_load_direct { 3, 1, false }
>  #define len_load_direct { -1, -1, false }
> +#define len_maskload_direct { -1, 3, false }
>  #define mask_store_direct { 3, 2, false }
>  #define store_lanes_direct { 0, 0, false }
>  #define mask_store_lanes_direct { 0, 0, false }
> @@ -172,6 +173,7 @@ init_internal_fns ()
>  #define vec_cond_direct { 2, 0, false }
>  #define scatter_store_direct { 3, 1, false }
>  #define len_store_direct { 3, 3, false }
> +#define len_maskstore_direct { 4, 3, false }
>  #define vec_set_direct { 3, 3, false }
>  #define unary_direct { 0, 0, true }
>  #define unary_convert_direct { -1, 0, true }
> @@ -2873,12 +2875,13 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
>    return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
>  }
>
> -/* Expand MASK_LOAD{,_LANES} or LEN_LOAD call STMT using optab OPTAB. */
> +/* Expand MASK_LOAD{,_LANES}, LEN_MASK_LOAD or LEN_LOAD call STMT using optab
> + * OPTAB. */
>
>  static void
>  expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>  {
> -  class expand_operand ops[4];
> +  class expand_operand ops[5];
>    tree type, lhs, rhs, maskt, biast;
>    rtx mem, target, mask, bias;
>    insn_code icode;
> @@ -2913,6 +2916,20 @@ expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>        create_input_operand (&ops[3], bias, QImode);
>        expand_insn (icode, 4, ops);
>      }
> +  else if (optab == len_maskload_optab)
> +    {
> +      create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
> +                                   TYPE_UNSIGNED (TREE_TYPE (maskt)));
> +      maskt = gimple_call_arg (stmt, 3);
> +      mask = expand_normal (maskt);
> +      create_input_operand (&ops[3], mask, TYPE_MODE (TREE_TYPE (maskt)));
> +      icode = convert_optab_handler (optab, TYPE_MODE (type),
> +                                     TYPE_MODE (TREE_TYPE (maskt)));
> +      biast = gimple_call_arg (stmt, 4);
> +      bias = expand_normal (biast);
> +      create_input_operand (&ops[4], bias, QImode);
> +      expand_insn (icode, 5, ops);
> +    }
>    else
>      {
>        create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
> @@ -2926,13 +2943,15 @@ expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>  #define expand_mask_load_optab_fn expand_partial_load_optab_fn
>  #define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
>  #define expand_len_load_optab_fn expand_partial_load_optab_fn
> +#define expand_len_maskload_optab_fn expand_partial_load_optab_fn
>
> -/* Expand MASK_STORE{,_LANES} or LEN_STORE call STMT using optab OPTAB. */
> +/* Expand MASK_STORE{,_LANES}, LEN_MASK_STORE or LEN_STORE call STMT using optab
> + * OPTAB. */
>
>  static void
>  expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>  {
> -  class expand_operand ops[4];
> +  class expand_operand ops[5];
>    tree type, lhs, rhs, maskt, biast;
>    rtx mem, reg, mask, bias;
>    insn_code icode;
> @@ -2965,6 +2984,19 @@ expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>        create_input_operand (&ops[3], bias, QImode);
>        expand_insn (icode, 4, ops);
>      }
> +  else if (optab == len_maskstore_optab)
> +    {
> +      create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
> +                                   TYPE_UNSIGNED (TREE_TYPE (maskt)));
> +      maskt = gimple_call_arg (stmt, 3);
> +      mask = expand_normal (maskt);
> +      create_input_operand (&ops[3], mask, TYPE_MODE (TREE_TYPE (maskt)));
> +      biast = gimple_call_arg (stmt, 4);
> +      bias = expand_normal (biast);
> +      create_input_operand (&ops[4], bias, QImode);
> +      icode = convert_optab_handler (optab, TYPE_MODE (type), GET_MODE (mask));
> +      expand_insn (icode, 5, ops);
> +    }
>    else
>      {
>        create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
> @@ -2975,6 +3007,7 @@ expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>  #define expand_mask_store_optab_fn expand_partial_store_optab_fn
>  #define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
>  #define expand_len_store_optab_fn expand_partial_store_optab_fn
> +#define expand_len_maskstore_optab_fn expand_partial_store_optab_fn
>
>  /* Expand VCOND, VCONDU and VCONDEQ optab internal functions.
>     The expansion of STMT happens based on OPTAB table associated. */
> @@ -3928,6 +3961,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
>  #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
>  #define direct_gather_load_optab_supported_p convert_optab_supported_p
>  #define direct_len_load_optab_supported_p direct_optab_supported_p
> +#define direct_len_maskload_optab_supported_p convert_optab_supported_p
>  #define direct_mask_store_optab_supported_p convert_optab_supported_p
>  #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
>  #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
> @@ -3935,6 +3969,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
>  #define direct_vec_cond_optab_supported_p convert_optab_supported_p
>  #define direct_scatter_store_optab_supported_p convert_optab_supported_p
>  #define direct_len_store_optab_supported_p direct_optab_supported_p
> +#define direct_len_maskstore_optab_supported_p convert_optab_supported_p
>  #define direct_while_optab_supported_p convert_optab_supported_p
>  #define direct_fold_extract_optab_supported_p direct_optab_supported_p
>  #define direct_fold_left_optab_supported_p direct_optab_supported_p
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 9da5f31636e..bc947c0fde7 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -50,12 +50,14 @@ along with GCC; see the file COPYING3. If not see
>     - mask_load_lanes: currently just vec_mask_load_lanes
>     - gather_load: used for {mask_,}gather_load
>     - len_load: currently just len_load
> +   - len_maskload: currently just len_maskload
>
>     - mask_store: currently just maskstore
>     - store_lanes: currently just vec_store_lanes
>     - mask_store_lanes: currently just vec_mask_store_lanes
>     - scatter_store: used for {mask_,}scatter_store
>     - len_store: currently just len_store
> +   - len_maskstore: currently just len_maskstore
>
>     - unary: a normal unary optab, such as vec_reverse_
>     - binary: a normal binary optab, such as vec_interleave_lo_
> @@ -157,6 +159,7 @@ DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
>                         mask_gather_load, gather_load)
>
>  DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
> +DEF_INTERNAL_OPTAB_FN (LEN_MASK_LOAD, ECF_PURE, len_maskload, len_maskload)
>
>  DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
>  DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
> @@ -175,6 +178,7 @@ DEF_INTERNAL_OPTAB_FN (VCOND_MASK, 0, vcond_mask, vec_cond_mask)
>  DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
>
>  DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
> +DEF_INTERNAL_OPTAB_FN (LEN_MASK_STORE, 0, len_maskstore, len_maskstore)
>
>  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
>  DEF_INTERNAL_OPTAB_FN (SELECT_VL, ECF_CONST | ECF_NOTHROW, select_vl, binary)
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 22b31be0f72..9533eb11565 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -91,6 +91,8 @@ OPTAB_CD(vec_cmpu_optab, "vec_cmpu$a$b")
>  OPTAB_CD(vec_cmpeq_optab, "vec_cmpeq$a$b")
>  OPTAB_CD(maskload_optab, "maskload$a$b")
>  OPTAB_CD(maskstore_optab, "maskstore$a$b")
> +OPTAB_CD(len_maskload_optab, "len_maskload$a$b")
> +OPTAB_CD(len_maskstore_optab, "len_maskstore$a$b")
>  OPTAB_CD(gather_load_optab, "gather_load$a$b")
>  OPTAB_CD(mask_gather_load_optab, "mask_gather_load$a$b")
>  OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
> --

Richard Biener
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)