From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp-out1.suse.de (smtp-out1.suse.de [IPv6:2001:67c:2178:6::1c]) by sourceware.org (Postfix) with ESMTPS id F2C0C3858D20; Fri, 11 Aug 2023 10:21:36 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org F2C0C3858D20 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=suse.de Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de Received: from relay2.suse.de (relay2.suse.de [149.44.160.134]) by smtp-out1.suse.de (Postfix) with ESMTP id 4A9D121873; Fri, 11 Aug 2023 10:21:35 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1691749295; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=rtha7/XgiNoHUk6T5amHFMVY4BmynEQQvnAXEobut+Q=; b=LTj9gAeykduK21h8vhSP5BZcm2P3UhAvVfJ5sFEf/NXqDaEvUlCRIVUL0At6zZbpG0CBru q/Lpmg2YkZhLfNZ3fwEwo+/c7wnUs2/g5yHnEh8e/1bpVBK90pISA+fh5PqhNN8vbZlzeM kC2slv6aUhQh+hVhoTKkVN/eZgk54Xc= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1691749295; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=rtha7/XgiNoHUk6T5amHFMVY4BmynEQQvnAXEobut+Q=; b=tgZlDPjYt4wTsdzJ8+FCdlsuQfLkykKK43WoJgpz0f6PgYqYjbukWuhwx7BgDVCv2sbgF5 k4ns0O+oQfkCC7Cg== Received: from wotan.suse.de (wotan.suse.de [10.160.0.1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by relay2.suse.de (Postfix) with ESMTPS id 1768B2C142; Fri, 11 Aug 2023 10:21:35 +0000 (UTC) Date: Fri, 11 Aug 2023 10:21:35 +0000 (UTC) From: Richard Biener To: "juzhe.zhong@rivai.ai" cc: gcc-patches , "richard.sandiford" , linkw , krebbel Subject: Re: Re: [PATCH V3] VECT: Support loop len control on EXTRACT_LAST vectorization In-Reply-To: <0CF24057688529E6+2023081115284214396319@rivai.ai> Message-ID: References: <20230811063817.491547-1-juzhe.zhong@rivai.ai>, <0CF24057688529E6+2023081115284214396319@rivai.ai> User-Agent: Alpine 2.22 (LSU 394 2020-01-19) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Fri, 11 Aug 2023, juzhe.zhong@rivai.ai wrote: > Hi, Richi. > > >> So how can we resolve the issue when a non-VL operation like > >> .VEC_EXTRACT is used for _len support? > > Do you mean non-VL extract last operation (I am sorry that not sure whether I understand your question correctly)? > If yes, the answer is for RVV, we are reusing the same flow as ARM SVE (BIT_FILED_REF approach), see the example below: > > https://godbolt.org/z/cqrWrY8q4 > > #define EXTRACT_LAST(TYPE) \ > TYPE __attribute__ ((noinline, noclone)) \ > test_##TYPE (TYPE *x, int n, TYPE value) \ > { \ > TYPE last; \ > for (int j = 0; j < 64; ++j) \ > { \ > last = x[j]; \ > x[j] = last * value; \ > } \ > return last; \ > } > > #define TEST_ALL(T) \ > T (uint8_t) \ > > TEST_ALL (EXTRACT_LAST) > > vect_cst__22 = {value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D), value_12(D)}; > vect_last_11.6_3 = MEM [(uint8_t *)x_10(D)]; > vect__4.7_23 = vect_last_11.6_3 * vect_cst__22; > MEM [(uint8_t *)x_10(D)] = vect__4.7_23; > _21 = BIT_FIELD_REF ; > > This approach works perfectly for both RVV and ARM SVE for non-VL and non-MASK EXTRACT_LAST operation. > > >> So, why do we test for get_len_load_store_mode and not just for > >> VEC_EXTRACT? > > Before answer this question, let me first elaborate how ARM SVE is doing with MASK EXTRACT_LAST. > > Here is the example: > https://godbolt.org/z/8cTv1jqMb > > ARM SVE IR: > > [local count: 955630224]: > # ivtmp_31 = PHI > > # loop_mask_22 = PHI -----> For RVV, we want this to be loop_len = SELECT_VL; > > _7 = &MEM [(uint8_t *)x_11(D) + ivtmp_31 * 1]; > vect_last_12.8_23 = .MASK_LOAD (_7, 8B, loop_mask_22); > vect__4.9_27 = vect_last_12.8_23 * vect_cst__26; > .MASK_STORE (_7, 8B, loop_mask_22, vect__4.9_27); > ivtmp_32 = ivtmp_31 + POLY_INT_CST [16, 16]; > _1 = (unsigned int) ivtmp_32; > > next_mask_35 = .WHILE_ULT (_1, bnd.5_6, { 0, ... }); > > if (next_mask_35 != { 0, ... }) > goto ; [89.00%] > else > goto ; [11.00%] > > [local count: 105119324]: > > _25 = .EXTRACT_LAST (loop_mask_22, vect_last_12.8_23); [tail call] ----> Use the last mask generated in BB 4, so for RVV, we are using the loop_len. > > So this patch is trying to optimize the codegen with simulating same flow as ARM SVE but with replacing 'loop_mask_22' (This is generated in BB4) into 'loop_len'. > > For ARM SVE, they only check whether target support EXTRACT_LAST pattern, this pattern is supported means: > > 1. Target is using loop MASK as the partial vector loop control. I don't think it checks for this? > 2. extract_last optab is enabled in the backend. > > So for RVV, we are also checking same conditions: > > 1. Target is using loop LEN as the partial vector loop control (I use get_len_load_store_mode to check whether target is using loop LEN as the partial vector loop control). But we don't really know this at this point? The only thing we know is that nothing set LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P to false. > 2. vec_extract optab is enabled in the backend. > > An alternative approach is that we can adding EXTRACT_LAST_LEN internal FN, then we can only check this like ARM SVE only check EXTRACT_LAST. I think it should work to change the direct_internal_fn_supported_p check for IFN_EXTRACT_LAST to a "poitive" one guarding gcc_assert (ncopies == 1 && !slp_node); vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo), 1, vectype, NULL); and in the else branch check for VEC_EXTRACT support and if present record a loop len. Just in this case this particular order would be important. > >> can we double-check this on powerpc and s390? > > Sure, I hope it can be beneficial to powerpc and s390. > And, I think Richard's comments are also very important so I am gonna wait for it. Yeah, just to double-check the bias stuff works correctly. Richard. > Thanks. > > > juzhe.zhong@rivai.ai > > From: Richard Biener > Date: 2023-08-11 15:01 > To: Ju-Zhe Zhong > CC: gcc-patches; richard.sandiford; linkw; krebbel > Subject: Re: [PATCH V3] VECT: Support loop len control on EXTRACT_LAST vectorization > On Fri, 11 Aug 2023, juzhe.zhong@rivai.ai wrote: > > > From: Ju-Zhe Zhong > > > > Hi, Richard and Richi. > > > > This patch add support live vectorization by VEC_EXTRACT for LEN loop control. > > > > Consider this following case: > > > > #include > > > > #define EXTRACT_LAST(TYPE) \ > > TYPE __attribute__ ((noinline, noclone)) \ > > test_##TYPE (TYPE *x, int n, TYPE value) \ > > { \ > > TYPE last; \ > > for (int j = 0; j < n; ++j) \ > > { \ > > last = x[j]; \ > > x[j] = last * value; \ > > } \ > > return last; \ > > } > > > > #define TEST_ALL(T) \ > > T (uint8_t) \ > > > > TEST_ALL (EXTRACT_LAST) > > > > ARM SVE IR: > > > > Preheader: > > max_mask_34 = .WHILE_ULT (0, bnd.5_6, { 0, ... }); > > > > Loop: > > ... > > # loop_mask_22 = PHI > > ... > > vect_last_12.8_23 = .MASK_LOAD (_7, 8B, loop_mask_22); > > vect__4.9_27 = vect_last_12.8_23 * vect_cst__26; > > .MASK_STORE (_7, 8B, loop_mask_22, vect__4.9_27); > > ... > > next_mask_35 = .WHILE_ULT (_1, bnd.5_6, { 0, ... }); > > ... > > > > Epilogue: > > _25 = .EXTRACT_LAST (loop_mask_22, vect_last_12.8_23); > > > > For RVV since we prefer len in loop control, after this patch for RVV: > > > > Loop: > > ... > > loop_len_22 = SELECT_VL; > > vect_last_12.8_23 = .MASK_LOAD (_7, 8B, loop_len_22); > > vect__4.9_27 = vect_last_12.8_23 * vect_cst__26; > > .MASK_STORE (_7, 8B, loop_len_22, vect__4.9_27); > > ... > > > > Epilogue: > > _25 = .VEC_EXTRACT (loop_len_22 + bias - 1, vect_last_12.8_23); > > > > Details of this approach: > > > > 1. Step 1 - Add 'vect_can_vectorize_extract_last_with_len_p' to enable live vectorization > > for LEN loop control. > > > > This function we check whether target support: > > - Use LEN as the loop control. > > - Support VEC_EXTRACT optab. > > > > 2. Step 2 - Record LEN for loop control if 'vect_can_vectorize_extract_last_with_len_p' is true. > > > > 3. Step 3 - Gerenate VEC_EXTRACT (v, LEN + BIAS - 1). > > > > The only difference between mask and len is that len is using length generated by SELECT_VL and > > use VEC_EXTRACT pattern. The rest of the live vectorization is totally the same ARM SVE. > > > > Bootstrap and Regression on X86 passed. > > > > Tested on ARM QEMU. > > > > Ok for trunk? > > > > gcc/ChangeLog: > > > > * tree-vect-loop.cc (vect_can_vectorize_extract_last_with_len_p): New function. > > (vectorizable_live_operation): Add loop len control. > > > > --- > > gcc/tree-vect-loop.cc | 76 +++++++++++++++++++++++++++++++++++++++---- > > 1 file changed, 70 insertions(+), 6 deletions(-) > > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > > index bf8d677b584..809b73b966c 100644 > > --- a/gcc/tree-vect-loop.cc > > +++ b/gcc/tree-vect-loop.cc > > @@ -8963,6 +8963,27 @@ vect_can_vectorize_without_simd_p (code_helper code) > > && vect_can_vectorize_without_simd_p (tree_code (code))); > > } > > > > +/* Return true if target supports extract last vectorization with LEN. */ > > + > > +static bool > > +vect_can_vectorize_extract_last_with_len_p (tree vectype) > > +{ > > + /* Return false if target doesn't support LEN in loop control. */ > > + machine_mode vmode; > > + machine_mode vec_mode = TYPE_MODE (vectype); > > + if (!VECTOR_MODE_P (vec_mode)) > > + return false; > > + if (!get_len_load_store_mode (vec_mode, true).exists (&vmode) > > + || !get_len_load_store_mode (vec_mode, false).exists (&vmode)) > > + return false; > > So this "hidden" bit in the end decides whether to ... > > > + /* Target need to support VEC_EXTRACT to extract the last active element. */ > > + return convert_optab_handler (vec_extract_optab, > > + vec_mode, > > + TYPE_MODE (TREE_TYPE (vectype))) > > + != CODE_FOR_nothing; > > +} > > + > > /* Create vector init for vectorized iv. */ > > static tree > > vect_create_nonlinear_iv_init (gimple_seq* stmts, tree init_expr, > > @@ -10279,7 +10300,8 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info, > > if (loop_vinfo && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) > > { > > if (!direct_internal_fn_supported_p (IFN_EXTRACT_LAST, vectype, > > - OPTIMIZE_FOR_SPEED)) > > + OPTIMIZE_FOR_SPEED) > > + && !vect_can_vectorize_extract_last_with_len_p (vectype)) > > { > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > @@ -10308,9 +10330,14 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info, > > else > > { > > gcc_assert (ncopies == 1 && !slp_node); > > - vect_record_loop_mask (loop_vinfo, > > - &LOOP_VINFO_MASKS (loop_vinfo), > > - 1, vectype, NULL); > > + if (vect_can_vectorize_extract_last_with_len_p (vectype)) > > + vect_record_loop_len (loop_vinfo, > > + &LOOP_VINFO_LENS (loop_vinfo), > > + 1, vectype, 1); > > .. record a loop_len here. I think powerpc at least has .VEC_EXTRACT as > well but of course .VEC_EXTRACT support itself doesn't have anything to > do with 'len' support. > > x86 has .VEC_SET but not yet .VEC_EXTRACT, if it gets .VEC_EXTRACT > its partial vector support still wants masks, not lens (and once > we record both we fail). > > So how can we resolve the issue when a non-VL operation like > .VEC_EXTRACT is used for _len support? > > Note x86 doens't yet support IFN_EXTRACT_LAST either. > > So, why do we test for get_len_load_store_mode and not just for > VEC_EXTRACT? > > > + else > > + vect_record_loop_mask (loop_vinfo, > > + &LOOP_VINFO_MASKS (loop_vinfo), > > + 1, vectype, NULL); > > } > > } > > /* ??? Enable for loop costing as well. */ > > @@ -10336,7 +10363,9 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info, > > gimple *vec_stmt; > > if (slp_node) > > { > > - gcc_assert (!loop_vinfo || !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)); > > + gcc_assert (!loop_vinfo > > + || (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) > > + && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))); > > > > /* Get the correct slp vectorized stmt. */ > > vec_lhs = SLP_TREE_VEC_DEFS (slp_node)[vec_entry]; > > @@ -10380,7 +10409,42 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info, > > > > gimple_seq stmts = NULL; > > tree new_tree; > > - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > > + if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) > > + { > > + /* Emit: > > + > > + SCALAR_RES = VEC_EXTRACT > > + > > + where VEC_LHS is the vectorized live-out result and MASK is > > + the loop mask for the final iteration. */ > > + gcc_assert (ncopies == 1 && !slp_node); > > + gimple_seq tem = NULL; > > + gimple_stmt_iterator gsi = gsi_last (tem); > > + tree len > > + = vect_get_loop_len (loop_vinfo, &gsi, > > + &LOOP_VINFO_LENS (loop_vinfo), > > + 1, vectype, 0, 0); > > + > > + /* BIAS - 1. */ > > + signed char biasval = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo); > > + tree bias_minus_one > > + = int_const_binop (MINUS_EXPR, > > + build_int_cst (TREE_TYPE (len), biasval), > > + build_one_cst (TREE_TYPE (len))); > > + > > + /* LAST_INDEX = LEN + (BIAS - 1). */ > > + tree last_index = gimple_build (&stmts, PLUS_EXPR, TREE_TYPE (len), > > + len, bias_minus_one); > > + > > + /* SCALAR_RES = VEC_EXTRACT . */ > > + tree scalar_res > > + = gimple_build (&stmts, CFN_VEC_EXTRACT, TREE_TYPE (vectype), > > + vec_lhs_phi, last_index); > > + > > can we double-check this on powerpc and s390? > > Thanks, > Richard. > > > + /* Convert the extracted vector element to the scalar type. */ > > + new_tree = gimple_convert (&stmts, lhs_type, scalar_res); > > + } > > + else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > > { > > /* Emit: > > > > -- Richard Biener SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)