From: juzhe.zhong@rivai.ai
To: gcc-patches@gcc.gnu.org
Cc: richard.sandiford@arm.com, rguenther@suse.de, Ju-Zhe Zhong
Subject: [PATCH V2] VECT: Support loop len control on EXTRACT_LAST vectorization
Date: Thu, 10 Aug 2023 15:49:09 +0800
Message-Id: <20230810074909.492039-1-juzhe.zhong@rivai.ai>

From: Ju-Zhe Zhong

Hi, Richard and Richi.

This patch adds support for live vectorization via VEC_EXTRACT under LEN
loop control.

Consider the following case:

#include <stdint.h>

#define EXTRACT_LAST(TYPE)                      \
  TYPE __attribute__ ((noinline, noclone))      \
  test_##TYPE (TYPE *x, int n, TYPE value)      \
  {                                             \
    TYPE last;                                  \
    for (int j = 0; j < n; ++j)                 \
      {                                         \
        last = x[j];                            \
        x[j] = last * value;                    \
      }                                         \
    return last;                                \
  }

#define TEST_ALL(T)                             \
  T (uint8_t)

TEST_ALL (EXTRACT_LAST)

ARM SVE IR:

Preheader:
  max_mask_34 = .WHILE_ULT (0, bnd.5_6, { 0, ... });

Loop:
  ...
  # loop_mask_22 = PHI ...
  vect_last_12.8_23 = .MASK_LOAD (_7, 8B, loop_mask_22);
  vect__4.9_27 = vect_last_12.8_23 * vect_cst__26;
  .MASK_STORE (_7, 8B, loop_mask_22, vect__4.9_27);
  ...
  next_mask_35 = .WHILE_ULT (_1, bnd.5_6, { 0, ... });
  ...

Epilogue:
  _25 = .EXTRACT_LAST (loop_mask_22, vect_last_12.8_23);

For RVV, since we prefer LEN for loop control, the IR after this patch is:

Loop:
  ...
  loop_len_22 = SELECT_VL;
  vect_last_12.8_23 = .MASK_LOAD (_7, 8B, loop_len_22);
  vect__4.9_27 = vect_last_12.8_23 * vect_cst__26;
  .MASK_STORE (_7, 8B, loop_len_22, vect__4.9_27);
  ...

Epilogue:
  _25 = .VEC_EXTRACT (vect_last_12.8_23, loop_len_22 - 1 - bias);

Details of this approach:

1. Step 1 - Add 'vect_can_vectorize_extract_last_with_len_p' to enable live
            vectorization for LEN loop control.

   This function checks whether the target supports:
     - Using LEN as the loop control.
     - The VEC_EXTRACT optab.

2. Step 2 - Record LEN for loop control if
            'vect_can_vectorize_extract_last_with_len_p' is true.

3. Step 3 - Generate VEC_EXTRACT (v, LEN - 1 - BIAS).

NOTE: This patch sets 'vinfo->any_known_not_updated_vssa = true;' because the
      original STMT is a simple assignment, whereas VEC_EXTRACT is neither a
      pure nor a const function according to internal-fn.def:

        DEF_INTERNAL_OPTAB_FN (VEC_EXTRACT, 0, vec_extract, vec_extract)

      If we don't set 'vinfo->any_known_not_updated_vssa' to true, it will
      cause an ICE in:

        if (need_ssa_update_p (cfun))
          {
            gcc_assert (loop_vinfo->any_known_not_updated_vssa);  ----> The assertion fails here.
            fun->gimple_df->ssa_renaming_needed = false;
            todo |= TODO_update_ssa_only_virtuals;
          }

I see two places that set 'vinfo->any_known_not_updated_vssa' to true:

  - One is in 'vectorizable_simd_clone_call':

        /* When the original call is pure or const but the SIMD ABI dictates
           an aggregate return we will have to use a virtual definition and
           in a loop eventually even need to add a virtual PHI.  That's
           not straight-forward so allow to fix this up via renaming.  */
        if (gimple_call_lhs (stmt)
            && !gimple_vdef (stmt)
            && TREE_CODE (TREE_TYPE (TREE_TYPE (bestn->decl))) == ARRAY_TYPE)
          vinfo->any_known_not_updated_vssa = true;

  - The other is in 'vectorizable_load':

        if (memory_access_type == VMAT_LOAD_STORE_LANES)
          vinfo->any_known_not_updated_vssa = true;

These seem to be set for the same reason as what I am doing in
'vectorizable_live_operation'.
Feel free to correct me if I am wrong.

Bootstrap and regression testing on X86 passed.

gcc/ChangeLog:

        * tree-vect-loop.cc (vect_can_vectorize_extract_last_with_len_p): New function.
        (vectorizable_live_operation): Add loop LEN control.

---
 gcc/tree-vect-loop.cc | 74 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 68 insertions(+), 6 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 00058c3c13e..208918f53fb 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8964,6 +8964,24 @@ vect_can_vectorize_without_simd_p (code_helper code)
           && vect_can_vectorize_without_simd_p (tree_code (code)));
 }
 
+/* Return true if target supports extract last vectorization with LEN.  */
+
+static bool
+vect_can_vectorize_extract_last_with_len_p (tree vectype)
+{
+  /* Return false if target doesn't support LEN in loop control.  */
+  machine_mode vmode;
+  if (!get_len_load_store_mode (TYPE_MODE (vectype), true).exists (&vmode)
+      || !get_len_load_store_mode (TYPE_MODE (vectype), false).exists (&vmode))
+    return false;
+
+  /* Target need to support VEC_EXTRACT to extract the last active element.  */
+  return convert_optab_handler (vec_extract_optab,
+                                TYPE_MODE (vectype),
+                                TYPE_MODE (TREE_TYPE (vectype)))
+         != CODE_FOR_nothing;
+}
+
 /* Create vector init for vectorized iv.  */
 static tree
 vect_create_nonlinear_iv_init (gimple_seq* stmts, tree init_expr,
@@ -10282,7 +10300,8 @@ vectorizable_live_operation (vec_info *vinfo,
       if (loop_vinfo && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
         {
           if (!direct_internal_fn_supported_p (IFN_EXTRACT_LAST, vectype,
-                                               OPTIMIZE_FOR_SPEED))
+                                               OPTIMIZE_FOR_SPEED)
+              && !vect_can_vectorize_extract_last_with_len_p (vectype))
             {
               if (dump_enabled_p ())
                 dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -10311,9 +10330,14 @@ vectorizable_live_operation (vec_info *vinfo,
           else
             {
               gcc_assert (ncopies == 1 && !slp_node);
-              vect_record_loop_mask (loop_vinfo,
-                                     &LOOP_VINFO_MASKS (loop_vinfo),
-                                     1, vectype, NULL);
+              if (vect_can_vectorize_extract_last_with_len_p (vectype))
+                vect_record_loop_len (loop_vinfo,
+                                      &LOOP_VINFO_LENS (loop_vinfo),
+                                      1, vectype, 1);
+              else
+                vect_record_loop_mask (loop_vinfo,
+                                       &LOOP_VINFO_MASKS (loop_vinfo),
+                                       1, vectype, NULL);
             }
         }
       /* ??? Enable for loop costing as well.  */
@@ -10339,7 +10363,9 @@ vectorizable_live_operation (vec_info *vinfo,
   gimple *vec_stmt;
   if (slp_node)
     {
-      gcc_assert (!loop_vinfo || !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
+      gcc_assert (!loop_vinfo
+                  || (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+                      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)));
 
       /* Get the correct slp vectorized stmt.  */
       vec_lhs = SLP_TREE_VEC_DEFS (slp_node)[vec_entry];
@@ -10383,7 +10409,43 @@ vectorizable_live_operation (vec_info *vinfo,
 
       gimple_seq stmts = NULL;
       tree new_tree;
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+        {
+          /* Emit:
+
+               SCALAR_RES = VEC_EXTRACT
+
+             where VEC_LHS is the vectorized live-out result and MASK is
+             the loop mask for the final iteration.  */
+          gcc_assert (ncopies == 1 && !slp_node);
+          tree scalar_type = TREE_TYPE (STMT_VINFO_VECTYPE (stmt_info));
+          tree len
+            = vect_get_loop_len (loop_vinfo, gsi, &LOOP_VINFO_LENS (loop_vinfo),
+                                 1, vectype, 0, 0);
+
+          /* BIAS + 1.  */
+          signed char biasval = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
+          tree bias_one
+            = size_binop (PLUS_EXPR, build_int_cst (TREE_TYPE (len), biasval),
+                          build_one_cst (TREE_TYPE (len)));
+
+          /* LAST_INDEX = LEN - (BIAS + 1).  */
+          tree last_index
+            = gimple_build (&stmts, MINUS_EXPR, TREE_TYPE (len), len, bias_one);
+
+          tree scalar_res = gimple_build (&stmts, CFN_VEC_EXTRACT, scalar_type,
+                                          vec_lhs_phi, last_index);
+
+          /* Convert the extracted vector element to the scalar type.  */
+          new_tree = gimple_convert (&stmts, lhs_type, scalar_res);
+          /* When the original stmt is an assignment but VEC_EXTRACT is not pure
+             or const since it may return a memory result.  We will have to use
+             a virtual definition and in a loop eventually even need to add a
+             virtual PHI. That's not straight-forward so allow to fix this up
+             via renaming.  */
+          vinfo->any_known_not_updated_vssa = true;
+        }
+      else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
         {
           /* Emit:
 
-- 
2.36.3
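
For readers who want to check the index arithmetic, here is a minimal scalar
sketch of what the LEN-controlled loop and the epilogue VEC_EXTRACT are meant
to compute for the uint8_t test case above. It is not part of the patch. The
names 'model_extract_last_with_len', 'model_len_loop' and the lane count
'MODEL_VF' are hypothetical, SELECT_VL is approximated by a plain
min (remaining, MODEL_VF), and the index formula LEN - 1 - BIAS is taken
as written in the description rather than re-derived. The loop model assumes
n > 0 and a bias of 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_VF 4 /* Hypothetical number of lanes per vector.  */

/* Model of the epilogue: return lane LEN - 1 - BIAS of VEC, i.e. the last
   lane that was active in the final iteration.  */
static uint8_t
model_extract_last_with_len (const uint8_t *vec, size_t len, int bias)
{
  ptrdiff_t last_index = (ptrdiff_t) len - 1 - bias;
  assert (last_index >= 0 && last_index < MODEL_VF);
  return vec[last_index];
}

/* Model of the LEN-controlled loop for the test case: each iteration
   handles LEN lanes, and the lanes loaded in the final iteration are kept
   so the epilogue can extract the live-out value.  Assumes n > 0.  */
static uint8_t
model_len_loop (uint8_t *x, int n, uint8_t value)
{
  uint8_t vect_last[MODEL_VF] = { 0 };
  size_t len = 0;

  for (int i = 0; i < n; i += (int) len)
    {
      /* Stand-in for SELECT_VL: process at most MODEL_VF lanes.  */
      size_t remaining = (size_t) (n - i);
      len = remaining < MODEL_VF ? remaining : MODEL_VF;

      for (size_t lane = 0; lane < len; ++lane)
        {
          vect_last[lane] = x[i + lane];                     /* load  */
          x[i + lane] = (uint8_t) (vect_last[lane] * value); /* store */
        }
    }

  /* Bias is assumed to be 0 in this model.  */
  return model_extract_last_with_len (vect_last, len, 0);
}
```

With n = 6 and MODEL_VF = 4, the final iteration has LEN = 2, so the value
extracted is lane 1 of the last vector, which is the original x[5]. That
matches what the scalar loop returns in 'last'.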