From: juzhe.zhong@rivai.ai
To: gcc-patches@gcc.gnu.org
Cc: richard.sandiford@arm.com, rguenther@suse.de, Ju-Zhe Zhong
Subject: [PATCH V3] VECT: Support loop len control on EXTRACT_LAST vectorization
Date: Fri, 11 Aug 2023 14:38:17 +0800
Message-Id: <20230811063817.491547-1-juzhe.zhong@rivai.ai>

From: Ju-Zhe Zhong

Hi, Richard and Richi.
This patch adds support for live vectorization via VEC_EXTRACT for LEN loop
control.

Consider the following case:

#include <stdint.h>

#define EXTRACT_LAST(TYPE)                      \
  TYPE __attribute__ ((noinline, noclone))      \
  test_##TYPE (TYPE *x, int n, TYPE value)      \
  {                                             \
    TYPE last;                                  \
    for (int j = 0; j < n; ++j)                 \
      {                                         \
        last = x[j];                            \
        x[j] = last * value;                    \
      }                                         \
    return last;                                \
  }

#define TEST_ALL(T)                             \
  T (uint8_t)

TEST_ALL (EXTRACT_LAST)

ARM SVE IR:

Preheader:
  max_mask_34 = .WHILE_ULT (0, bnd.5_6, { 0, ... });

Loop:
  ...
  # loop_mask_22 = PHI ...
  vect_last_12.8_23 = .MASK_LOAD (_7, 8B, loop_mask_22);
  vect__4.9_27 = vect_last_12.8_23 * vect_cst__26;
  .MASK_STORE (_7, 8B, loop_mask_22, vect__4.9_27);
  ...
  next_mask_35 = .WHILE_ULT (_1, bnd.5_6, { 0, ... });
  ...

Epilogue:
  _25 = .EXTRACT_LAST (loop_mask_22, vect_last_12.8_23);

For RVV, since we prefer len in loop control, after this patch we get:

Loop:
  ...
  loop_len_22 = SELECT_VL;
  vect_last_12.8_23 = .MASK_LOAD (_7, 8B, loop_len_22);
  vect__4.9_27 = vect_last_12.8_23 * vect_cst__26;
  .MASK_STORE (_7, 8B, loop_len_22, vect__4.9_27);
  ...

Epilogue:
  _25 = .VEC_EXTRACT (loop_len_22 + bias - 1, vect_last_12.8_23);

Details of this approach:

1. Step 1 - Add 'vect_can_vectorize_extract_last_with_len_p' to enable live
   vectorization for LEN loop control.

   This function checks whether the target supports:
   - Using LEN as the loop control.
   - The VEC_EXTRACT optab.

2. Step 2 - Record LEN for loop control if
   'vect_can_vectorize_extract_last_with_len_p' is true.

3. Step 3 - Generate VEC_EXTRACT (v, LEN + BIAS - 1).

The only difference between mask and len is that len uses the length
generated by SELECT_VL together with the VEC_EXTRACT pattern.  The rest of
the live vectorization is exactly the same as for ARM SVE.

Bootstrap and regression tests on x86 passed.  Tested on ARM QEMU.

Ok for trunk?

gcc/ChangeLog:

	* tree-vect-loop.cc (vect_can_vectorize_extract_last_with_len_p): New
	function.
	(vectorizable_live_operation): Add loop len control.

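As a rough scalar illustration of Step 3 above (not part of the patch), the
sketch below models a LEN-controlled loop and reads the live-out value from
lane LEN + BIAS - 1 of the last vector.  The names VF, BIAS, select_vl,
extract_last_scalar and extract_last_len are hypothetical, and BIAS is
assumed to be 0 here; targets with a nonzero partial load/store bias are not
modelled.

```c
#include <assert.h>
#include <stdint.h>

#define VF 4    /* Hypothetical number of lanes per vector.  */
#define BIAS 0  /* Assumption: partial load/store bias of 0.  */

/* Scalar reference, same shape as test_uint8_t in the description.  */
static uint8_t
extract_last_scalar (uint8_t *x, int n, uint8_t value)
{
  uint8_t last = 0;
  for (int j = 0; j < n; ++j)
    {
      last = x[j];
      x[j] = last * value;
    }
  return last;
}

/* Hypothetical stand-in for SELECT_VL: lanes to process this iteration.  */
static int
select_vl (int remaining)
{
  return remaining < VF ? remaining : VF;
}

/* Model of the LEN-controlled loop: only the first LEN lanes are active
   in each iteration, and the live-out value is lane LEN + BIAS - 1 of the
   vector produced by the final iteration.  */
static uint8_t
extract_last_len (uint8_t *x, int n, uint8_t value)
{
  uint8_t vect_last[VF] = { 0 };
  int len = 0;

  if (n <= 0)
    return 0;

  for (int i = 0; i < n; i += len)
    {
      len = select_vl (n - i);
      for (int k = 0; k < len; ++k)
        {
          vect_last[k] = x[i + k];         /* Length-controlled load.  */
          x[i + k] = vect_last[k] * value; /* Length-controlled store.  */
        }
    }

  /* Epilogue: VEC_EXTRACT (vect_last, LEN + BIAS - 1).  */
  return vect_last[len + BIAS - 1];
}
```

Calling both functions on copies of the same array should return the same
value and leave both arrays identical, whether or not n is a multiple of VF.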
---
 gcc/tree-vect-loop.cc | 76 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 70 insertions(+), 6 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index bf8d677b584..809b73b966c 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8963,6 +8963,27 @@ vect_can_vectorize_without_simd_p (code_helper code)
 	  && vect_can_vectorize_without_simd_p (tree_code (code)));
 }
 
+/* Return true if target supports extract last vectorization with LEN.  */
+
+static bool
+vect_can_vectorize_extract_last_with_len_p (tree vectype)
+{
+  /* Return false if target doesn't support LEN in loop control.  */
+  machine_mode vmode;
+  machine_mode vec_mode = TYPE_MODE (vectype);
+  if (!VECTOR_MODE_P (vec_mode))
+    return false;
+  if (!get_len_load_store_mode (vec_mode, true).exists (&vmode)
+      || !get_len_load_store_mode (vec_mode, false).exists (&vmode))
+    return false;
+
+  /* Target needs to support VEC_EXTRACT to extract the last active element.  */
+  return convert_optab_handler (vec_extract_optab,
+				vec_mode,
+				TYPE_MODE (TREE_TYPE (vectype)))
+	 != CODE_FOR_nothing;
+}
+
 /* Create vector init for vectorized iv.  */
 static tree
 vect_create_nonlinear_iv_init (gimple_seq* stmts, tree init_expr,
@@ -10279,7 +10300,8 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
       if (loop_vinfo && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  if (!direct_internal_fn_supported_p (IFN_EXTRACT_LAST, vectype,
-					       OPTIMIZE_FOR_SPEED))
+					       OPTIMIZE_FOR_SPEED)
+	      && !vect_can_vectorize_extract_last_with_len_p (vectype))
 	    {
 	      if (dump_enabled_p ())
 		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -10308,9 +10330,14 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
 	  else
 	    {
 	      gcc_assert (ncopies == 1 && !slp_node);
-	      vect_record_loop_mask (loop_vinfo,
-				     &LOOP_VINFO_MASKS (loop_vinfo),
-				     1, vectype, NULL);
+	      if (vect_can_vectorize_extract_last_with_len_p (vectype))
+		vect_record_loop_len (loop_vinfo,
+				      &LOOP_VINFO_LENS (loop_vinfo),
+				      1, vectype, 1);
+	      else
+		vect_record_loop_mask (loop_vinfo,
+				       &LOOP_VINFO_MASKS (loop_vinfo),
+				       1, vectype, NULL);
 	    }
 	}
       /* ??? Enable for loop costing as well.  */
@@ -10336,7 +10363,9 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
   gimple *vec_stmt;
   if (slp_node)
     {
-      gcc_assert (!loop_vinfo || !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
+      gcc_assert (!loop_vinfo
+		  || (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+		      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)));
 
       /* Get the correct slp vectorized stmt.  */
       vec_lhs = SLP_TREE_VEC_DEFS (slp_node)[vec_entry];
@@ -10380,7 +10409,42 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
 
       gimple_seq stmts = NULL;
       tree new_tree;
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+	{
+	  /* Emit:
+
+	       SCALAR_RES = VEC_EXTRACT <VEC_LHS, LEN + BIAS - 1>
+
+	     where VEC_LHS is the vectorized live-out result and LEN is
+	     the loop length for the final iteration.  */
+	  gcc_assert (ncopies == 1 && !slp_node);
+	  gimple_seq tem = NULL;
+	  gimple_stmt_iterator gsi = gsi_last (tem);
+	  tree len
+	    = vect_get_loop_len (loop_vinfo, &gsi,
+				 &LOOP_VINFO_LENS (loop_vinfo),
+				 1, vectype, 0, 0);
+
+	  /* BIAS - 1.  */
+	  signed char biasval = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
+	  tree bias_minus_one
+	    = int_const_binop (MINUS_EXPR,
+			       build_int_cst (TREE_TYPE (len), biasval),
+			       build_one_cst (TREE_TYPE (len)));
+
+	  /* LAST_INDEX = LEN + (BIAS - 1).  */
+	  tree last_index = gimple_build (&stmts, PLUS_EXPR, TREE_TYPE (len),
+					  len, bias_minus_one);
+
+	  /* SCALAR_RES = VEC_EXTRACT <VEC_LHS, LEN + BIAS - 1>.  */
+	  tree scalar_res
+	    = gimple_build (&stmts, CFN_VEC_EXTRACT, TREE_TYPE (vectype),
+			    vec_lhs_phi, last_index);
+
+	  /* Convert the extracted vector element to the scalar type.  */
+	  new_tree = gimple_convert (&stmts, lhs_type, scalar_res);
+	}
+      else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
 	{
 	  /* Emit:
 
-- 
2.36.3