From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by sourceware.org (Postfix) with ESMTPS id A42393858D38 for ; Tue, 21 Jul 2020 05:52:10 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org A42393858D38 Received: from pps.filterd (m0098421.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 06L5V7Gx152672; Tue, 21 Jul 2020 01:52:08 -0400 Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com with ESMTP id 32d5k0m354-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Tue, 21 Jul 2020 01:52:07 -0400 Received: from m0098421.ppops.net (m0098421.ppops.net [127.0.0.1]) by pps.reinject (8.16.0.36/8.16.0.36) with SMTP id 06L5VfVg154049; Tue, 21 Jul 2020 01:52:07 -0400 Received: from ppma03ams.nl.ibm.com (62.31.33a9.ip4.static.sl-reverse.com [169.51.49.98]) by mx0a-001b2d01.pphosted.com with ESMTP id 32d5k0m34m-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Tue, 21 Jul 2020 01:52:07 -0400 Received: from pps.filterd (ppma03ams.nl.ibm.com [127.0.0.1]) by ppma03ams.nl.ibm.com (8.16.0.42/8.16.0.42) with SMTP id 06L5p51G021172; Tue, 21 Jul 2020 05:52:05 GMT Received: from b06cxnps3075.portsmouth.uk.ibm.com (d06relay10.portsmouth.uk.ibm.com [9.149.109.195]) by ppma03ams.nl.ibm.com with ESMTP id 32brq7kg3n-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Tue, 21 Jul 2020 05:52:05 +0000 Received: from d06av21.portsmouth.uk.ibm.com (d06av21.portsmouth.uk.ibm.com [9.149.105.232]) by b06cxnps3075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 06L5q3UZ17760592 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Tue, 21 Jul 2020 05:52:03 GMT Received: from d06av21.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 6D1695204E; Tue, 21 Jul 2020 05:52:03 +0000 (GMT) Received: from KewenLins-MacBook-Pro.local (unknown [9.200.48.212]) by d06av21.portsmouth.uk.ibm.com (Postfix) with ESMTP id 0D21652054; Tue, 21 Jul 2020 05:52:00 +0000 (GMT) To: GCC Patches Cc: Bill Schmidt , Segher Boessenkool , Richard Sandiford , Richard Biener From: "Kewen.Lin" Subject: [PATCH] vect: Support vector with length cost modeling Message-ID: <419f1fad-05be-115c-1a53-cb710ae7b2dc@linux.ibm.com> Date: Tue, 21 Jul 2020 13:51:58 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:68.0) Gecko/20100101 Thunderbird/68.9.0 MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------------BFEB2E4545496AA5B8BA4480" Content-Language: en-US X-TM-AS-GCONF: 00 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235, 18.0.687 definitions=2020-07-21_01:2020-07-21, 2020-07-21 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 bulkscore=0 phishscore=0 mlxlogscore=999 suspectscore=9 priorityscore=1501 mlxscore=0 clxscore=1015 lowpriorityscore=0 malwarescore=0 impostorscore=0 spamscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2007210035 X-Spam-Status: No, score=-10.9 required=5.0 tests=BAYES_00, GIT_PATCH_0, KAM_DMARC_STATUS, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 21 Jul 2020 05:52:12 -0000 This is a multi-part message in MIME format. --------------BFEB2E4545496AA5B8BA4480 Content-Type: text/plain; charset=gbk Content-Transfer-Encoding: 7bit Hi, This patch is to add the cost modeling for vector with length, it mainly follows what we generate for vector with length in functions vect_set_loop_controls_directly and vect_gen_len at the worst case. For Power, the length is expected to be in bits 0-7 (high bits), we have to model the cost of shifting bits. To allow other targets not suffer this, I used one target hook to describe this extra cost, I'm not sure if it's a correct way. Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit param vect-partial-vector-usage=1. Any comments/suggestions are highly appreciated! BR, Kewen ----- gcc/ChangeLog: * config/rs6000/rs6000.c (TARGET_VECTORIZE_EXTRA_LENGTH_COST): New macro. * doc/tm.texi: Regenerate. * doc/tm.texi.in (TARGET_VECTORIZE_EXTRA_LENGTH_COST): New target hook. * target.def (extra_length_cost): Likewise. * tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost modeling for vector with length. --------------BFEB2E4545496AA5B8BA4480 Content-Type: text/plain; charset=UTF-8; x-mac-type="0"; x-mac-creator="0"; name="cost_v1.diff" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="cost_v1.diff" diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c index 5a4f07d5810..1c5f02796f5 100644 --- a/gcc/config/rs6000/rs6000.c +++ b/gcc/config/rs6000/rs6000.c @@ -1668,6 +1668,9 @@ static const struct attribute_spec rs6000_attribute_table[] = #undef TARGET_DOLOOP_COST_FOR_ADDRESS #define TARGET_DOLOOP_COST_FOR_ADDRESS 1000000000 +#undef TARGET_VECTORIZE_EXTRA_LENGTH_COST +#define TARGET_VECTORIZE_EXTRA_LENGTH_COST 1 + #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index 6e7d9dc54a9..ef37b5c1d6d 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -6106,6 +6106,14 @@ This hook should complete calculations of the cost of vectorizing a loop or basi This hook should release @var{data} and any related data structures allocated by TARGET_VECTORIZE_INIT_COST. The default releases the accumulator. @end deftypefn +@deftypevr {Target Hook} unsigned TARGET_VECTORIZE_EXTRA_LENGTH_COST +For loop vectorization using length-based partial vectors, some targets +need extra operations for length preparation, like one shift operation is +required on Power to make length be encoded in bits 0-7. This hook is to +provide a way for this kind of extra cost. +The default value is zero. +@end deftypevr + @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_GATHER (const_tree @var{mem_vectype}, const_tree @var{index_type}, int @var{scale}) Target builtin that implements vector gather operation. @var{mem_vectype} is the vector type of the load and @var{index_type} is scalar type of diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index 3be984bbd5c..ae9a513f529 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -4191,6 +4191,8 @@ address; but often a machine-dependent strategy can generate better code. @hook TARGET_VECTORIZE_DESTROY_COST_DATA +@hook TARGET_VECTORIZE_EXTRA_LENGTH_COST + @hook TARGET_VECTORIZE_BUILTIN_GATHER @hook TARGET_VECTORIZE_BUILTIN_SCATTER diff --git a/gcc/target.def b/gcc/target.def index 07059a87caf..3134be7ea7b 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -2058,6 +2058,16 @@ DEFHOOK (void *data), default_destroy_cost_data) +/* For target-specific cost on length-based vectorization. */ +DEFHOOKPOD +(extra_length_cost, + "For loop vectorization using length-based partial vectors, some targets\n\ +need extra operations for length preparation, like one shift operation is\n\ +required on Power to make length be encoded in bits 0-7. This hook is to\n\ +provide a way for this kind of extra cost.\n\ +The default value is zero.", + unsigned, 0) + HOOK_VECTOR_END (vectorize) #undef HOOK_PREFIX diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index e933441b922..294a445afac 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -3652,7 +3652,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, TODO: Build an expression that represents peel_iters for prologue and epilogue to be used in a run-time test. */ - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) { peel_iters_prologue = 0; peel_iters_epilogue = 0; @@ -3663,45 +3663,149 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, peel_iters_epilogue += 1; stmt_info_for_cost *si; int j; - FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), - j, si) + FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, + si) (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count, si->kind, si->stmt_info, si->vectype, si->misalign, vect_epilogue); } - /* Calculate how many masks we need to generate. */ - unsigned int num_masks = 0; - rgroup_controls *rgm; - unsigned int num_vectors_m1; - FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm) - if (rgm->type) - num_masks += num_vectors_m1 + 1; - gcc_assert (num_masks > 0); - - /* In the worst case, we need to generate each mask in the prologue - and in the loop body. One of the loop body mask instructions - replaces the comparison in the scalar loop, and since we don't - count the scalar comparison against the scalar body, we shouldn't - count that vector instruction against the vector body either. - - Sometimes we can use unpacks instead of generating prologue - masks and sometimes the prologue mask will fold to a constant, - so the actual prologue cost might be smaller. However, it's - simpler and safer to use the worst-case cost; if this ends up - being the tie-breaker between vectorizing or not, then it's - probably better not to vectorize. */ - (void) add_stmt_cost (loop_vinfo, - target_cost_data, num_masks, vector_stmt, - NULL, NULL_TREE, 0, vect_prologue); - (void) add_stmt_cost (loop_vinfo, - target_cost_data, num_masks - 1, vector_stmt, - NULL, NULL_TREE, 0, vect_body); - } - else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) - { - peel_iters_prologue = 0; - peel_iters_epilogue = 0; + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + { + /* Calculate how many masks we need to generate. */ + unsigned int num_masks = 0; + rgroup_controls *rgm; + unsigned int num_vectors_m1; + FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm) + if (rgm->type) + num_masks += num_vectors_m1 + 1; + gcc_assert (num_masks > 0); + + /* In the worst case, we need to generate each mask in the prologue + and in the loop body. One of the loop body mask instructions + replaces the comparison in the scalar loop, and since we don't + count the scalar comparison against the scalar body, we shouldn't + count that vector instruction against the vector body either. + + Sometimes we can use unpacks instead of generating prologue + masks and sometimes the prologue mask will fold to a constant, + so the actual prologue cost might be smaller. However, it's + simpler and safer to use the worst-case cost; if this ends up + being the tie-breaker between vectorizing or not, then it's + probably better not to vectorize. */ + (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks, + vector_stmt, NULL, NULL_TREE, 0, vect_prologue); + (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1, + vector_stmt, NULL, NULL_TREE, 0, vect_body); + } + else + { + gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)); + + /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT. */ + if (npeel < 0) + { + peel_iters_prologue = assumed_vf / 2; + /* See below, if peeled iterations are unknown, count a taken + branch and a not taken branch per peeled loop. */ + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, + cond_branch_taken, NULL, NULL_TREE, 0, + vect_prologue); + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, + cond_branch_not_taken, NULL, NULL_TREE, 0, + vect_prologue); + } + else + { + peel_iters_prologue = npeel; + if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) + /* See vect_get_known_peeling_cost, if peeled iterations are + known but number of scalar loop iterations are unknown, count + a taken branch per peeled loop. */ + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, + cond_branch_taken, NULL, NULL_TREE, 0, + vect_prologue); + } + + stmt_info_for_cost *si; + int j; + FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, + si) + (void) add_stmt_cost (loop_vinfo, target_cost_data, + si->count * peel_iters_prologue, si->kind, + si->stmt_info, si->vectype, si->misalign, + vect_prologue); + + /* Refer to the functions vect_set_loop_condition_partial_vectors + and vect_set_loop_controls_directly, we need to generate each + length in the prologue and in the loop body if required. Although + there are some possible optimization, we consider the worst case + here. */ + + /* For now we only operate length-based partial vectors on Power, + which has constant VF all the time, we need some tweakings below + if it doesn't hold in future. */ + gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()); + + /* For wrap around checking. */ + tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo); + unsigned int compare_precision = TYPE_PRECISION (compare_type); + widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo); + + bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo); + bool need_iterate_p + = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo) + && !vect_known_niters_smaller_than_vf (loop_vinfo)); + + /* Init min/max, shift and minus cost relative to single scalar_stmt. + FIXME: For now we only use length-based partial vectors on Power, + target specific costs may be needed for other ports. */ + unsigned int min_max_cost = 2; + unsigned int shift_cost = 1, minus_cost = 1; + + /* Init cost relative to single scalar_stmt. */ + unsigned int prol_cnt = 0; + unsigned int body_cnt = 0; + + rgroup_controls *rgc; + unsigned int num_vectors_m1; + FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc) + if (rgc->type) + { + unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor; + + /* Need one shift for niters_total computation. */ + if (!niters_known_p && nitems != 1) + prol_cnt += shift_cost; + + /* Need to handle wrap around. */ + if (iv_limit == -1 + || (wi::min_precision (iv_limit * nitems, UNSIGNED) + > compare_precision)) + prol_cnt += (min_max_cost + minus_cost); + + /* Need to handle batch limit excepting for the 1st one. */ + prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1; + + unsigned int num_vectors = num_vectors_m1 + 1; + /* Need to set up lengths in prologue, only one MIN required + since start index is zero. */ + prol_cnt += min_max_cost * num_vectors; + + /* Probably need some extra cost to transform this length, like + shift on Power. */ + body_cnt += targetm.vectorize.extra_length_cost * num_vectors; + + /* Need to update lengths in body for next iteration. */ + if (need_iterate_p) + body_cnt += (2 * min_max_cost + minus_cost) * num_vectors; + } + + (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, + scalar_stmt, NULL, NULL_TREE, 0, vect_prologue); + (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, + scalar_stmt, NULL, NULL_TREE, 0, vect_body); + } } else if (npeel < 0) { @@ -3913,8 +4017,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, } /* ??? The "if" arm is written to handle all cases; see below for what - we would do for !LOOP_VINFO_FULLY_MASKED_P. */ - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P. */ + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) { /* Rewriting the condition above in terms of the number of vector iterations (vniters) rather than the number of @@ -3941,7 +4045,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, dump_printf (MSG_NOTE, " Minimum number of vector iterations: %d\n", min_vec_niters); - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) { /* Now that we know the minimum number of vector iterations, find the minimum niters for which the scalar cost is larger: @@ -3996,6 +4100,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, && min_profitable_iters < (assumed_vf + peel_iters_prologue)) /* We want the vectorized loop to execute at least once. */ min_profitable_iters = assumed_vf + peel_iters_prologue; + else if (min_profitable_iters < peel_iters_prologue) + /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the + vectorized loop to execute at least once. */ + min_profitable_iters = peel_iters_prologue; if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, @@ -4013,7 +4121,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, if (vec_outside_cost <= 0) min_profitable_estimate = 0; - else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) { /* This is a repeat of the code above, but with + SOC rather than - SOC. */ @@ -4025,7 +4133,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, if (outside_overhead > 0) min_vec_niters = outside_overhead / saving_per_viter + 1; - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) { int threshold = (vec_inside_cost * min_vec_niters + vec_outside_cost --------------BFEB2E4545496AA5B8BA4480--