From: "Kewen.Lin" <linkw@linux.ibm.com>
To: GCC Patches <gcc-patches@gcc.gnu.org>
Cc: Bill Schmidt <wschmidt@linux.ibm.com>,
Segher Boessenkool <segher@kernel.crashing.org>,
Richard Sandiford <richard.sandiford@arm.com>,
Richard Biener <richard.guenther@gmail.com>
Subject: [PATCH] vect: Support vector with length cost modeling
Date: Tue, 21 Jul 2020 13:51:58 +0800 [thread overview]
Message-ID: <419f1fad-05be-115c-1a53-cb710ae7b2dc@linux.ibm.com> (raw)
[-- Attachment #1: Type: text/plain, Size: 955 bytes --]
Hi,
This patch is to add the cost modeling for vector with length,
it mainly follows what we generate for vector with length in
functions vect_set_loop_controls_directly and vect_gen_len
at the worst case.
For Power, the length is expected to be in bits 0-7 (high bits),
we have to model the cost of shifting bits. To allow other targets
not suffer this, I used one target hook to describe this extra cost,
I'm not sure if it's a correct way.
Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
param vect-partial-vector-usage=1.
Any comments/suggestions are highly appreciated!
BR,
Kewen
-----
gcc/ChangeLog:
* config/rs6000/rs6000.c (TARGET_VECTORIZE_EXTRA_LENGTH_COST): New
macro.
* doc/tm.texi: Regenerate.
* doc/tm.texi.in (TARGET_VECTORIZE_EXTRA_LENGTH_COST): New target hook.
* target.def (extra_length_cost): Likewise.
* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
modeling for vector with length.
[-- Attachment #2: cost_v1.diff --]
[-- Type: text/plain, Size: 13038 bytes --]
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 5a4f07d5810..1c5f02796f5 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1668,6 +1668,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
#undef TARGET_DOLOOP_COST_FOR_ADDRESS
#define TARGET_DOLOOP_COST_FOR_ADDRESS 1000000000
+#undef TARGET_VECTORIZE_EXTRA_LENGTH_COST
+#define TARGET_VECTORIZE_EXTRA_LENGTH_COST 1
+
#undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
#define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 6e7d9dc54a9..ef37b5c1d6d 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6106,6 +6106,14 @@ This hook should complete calculations of the cost of vectorizing a loop or basi
This hook should release @var{data} and any related data structures allocated by TARGET_VECTORIZE_INIT_COST. The default releases the accumulator.
@end deftypefn
+@deftypevr {Target Hook} unsigned TARGET_VECTORIZE_EXTRA_LENGTH_COST
+For loop vectorization using length-based partial vectors, some targets
+need extra operations for length preparation, like one shift operation is
+required on Power to make length be encoded in bits 0-7. This hook is to
+provide a way for this kind of extra cost.
+The default value is zero.
+@end deftypevr
+
@deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_GATHER (const_tree @var{mem_vectype}, const_tree @var{index_type}, int @var{scale})
Target builtin that implements vector gather operation. @var{mem_vectype}
is the vector type of the load and @var{index_type} is scalar type of
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3be984bbd5c..ae9a513f529 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4191,6 +4191,8 @@ address; but often a machine-dependent strategy can generate better code.
@hook TARGET_VECTORIZE_DESTROY_COST_DATA
+@hook TARGET_VECTORIZE_EXTRA_LENGTH_COST
+
@hook TARGET_VECTORIZE_BUILTIN_GATHER
@hook TARGET_VECTORIZE_BUILTIN_SCATTER
diff --git a/gcc/target.def b/gcc/target.def
index 07059a87caf..3134be7ea7b 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2058,6 +2058,16 @@ DEFHOOK
(void *data),
default_destroy_cost_data)
+/* For target-specific cost on length-based vectorization. */
+DEFHOOKPOD
+(extra_length_cost,
+ "For loop vectorization using length-based partial vectors, some targets\n\
+need extra operations for length preparation, like one shift operation is\n\
+required on Power to make length be encoded in bits 0-7. This hook is to\n\
+provide a way for this kind of extra cost.\n\
+The default value is zero.",
+ unsigned, 0)
+
HOOK_VECTOR_END (vectorize)
#undef HOOK_PREFIX
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index e933441b922..294a445afac 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3652,7 +3652,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
TODO: Build an expression that represents peel_iters for prologue and
epilogue to be used in a run-time test. */
- if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+ if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
{
peel_iters_prologue = 0;
peel_iters_epilogue = 0;
@@ -3663,45 +3663,149 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
peel_iters_epilogue += 1;
stmt_info_for_cost *si;
int j;
- FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
- j, si)
+ FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
+ si)
(void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
si->kind, si->stmt_info, si->vectype,
si->misalign, vect_epilogue);
}
- /* Calculate how many masks we need to generate. */
- unsigned int num_masks = 0;
- rgroup_controls *rgm;
- unsigned int num_vectors_m1;
- FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
- if (rgm->type)
- num_masks += num_vectors_m1 + 1;
- gcc_assert (num_masks > 0);
-
- /* In the worst case, we need to generate each mask in the prologue
- and in the loop body. One of the loop body mask instructions
- replaces the comparison in the scalar loop, and since we don't
- count the scalar comparison against the scalar body, we shouldn't
- count that vector instruction against the vector body either.
-
- Sometimes we can use unpacks instead of generating prologue
- masks and sometimes the prologue mask will fold to a constant,
- so the actual prologue cost might be smaller. However, it's
- simpler and safer to use the worst-case cost; if this ends up
- being the tie-breaker between vectorizing or not, then it's
- probably better not to vectorize. */
- (void) add_stmt_cost (loop_vinfo,
- target_cost_data, num_masks, vector_stmt,
- NULL, NULL_TREE, 0, vect_prologue);
- (void) add_stmt_cost (loop_vinfo,
- target_cost_data, num_masks - 1, vector_stmt,
- NULL, NULL_TREE, 0, vect_body);
- }
- else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
- {
- peel_iters_prologue = 0;
- peel_iters_epilogue = 0;
+ if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+ {
+ /* Calculate how many masks we need to generate. */
+ unsigned int num_masks = 0;
+ rgroup_controls *rgm;
+ unsigned int num_vectors_m1;
+ FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
+ if (rgm->type)
+ num_masks += num_vectors_m1 + 1;
+ gcc_assert (num_masks > 0);
+
+ /* In the worst case, we need to generate each mask in the prologue
+ and in the loop body. One of the loop body mask instructions
+ replaces the comparison in the scalar loop, and since we don't
+ count the scalar comparison against the scalar body, we shouldn't
+ count that vector instruction against the vector body either.
+
+ Sometimes we can use unpacks instead of generating prologue
+ masks and sometimes the prologue mask will fold to a constant,
+ so the actual prologue cost might be smaller. However, it's
+ simpler and safer to use the worst-case cost; if this ends up
+ being the tie-breaker between vectorizing or not, then it's
+ probably better not to vectorize. */
+ (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
+ vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
+ (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
+ vector_stmt, NULL, NULL_TREE, 0, vect_body);
+ }
+ else
+ {
+ gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
+
+ /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT. */
+ if (npeel < 0)
+ {
+ peel_iters_prologue = assumed_vf / 2;
+ /* See below, if peeled iterations are unknown, count a taken
+ branch and a not taken branch per peeled loop. */
+ (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+ cond_branch_taken, NULL, NULL_TREE, 0,
+ vect_prologue);
+ (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+ cond_branch_not_taken, NULL, NULL_TREE, 0,
+ vect_prologue);
+ }
+ else
+ {
+ peel_iters_prologue = npeel;
+ if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+ /* See vect_get_known_peeling_cost, if peeled iterations are
+ known but number of scalar loop iterations are unknown, count
+ a taken branch per peeled loop. */
+ (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+ cond_branch_taken, NULL, NULL_TREE, 0,
+ vect_prologue);
+ }
+
+ stmt_info_for_cost *si;
+ int j;
+ FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
+ si)
+ (void) add_stmt_cost (loop_vinfo, target_cost_data,
+ si->count * peel_iters_prologue, si->kind,
+ si->stmt_info, si->vectype, si->misalign,
+ vect_prologue);
+
+ /* Refer to the functions vect_set_loop_condition_partial_vectors
+ and vect_set_loop_controls_directly, we need to generate each
+ length in the prologue and in the loop body if required. Although
+ there are some possible optimization, we consider the worst case
+ here. */
+
+ /* For now we only operate length-based partial vectors on Power,
+ which has constant VF all the time, we need some tweakings below
+ if it doesn't hold in future. */
+ gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
+
+ /* For wrap around checking. */
+ tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+ unsigned int compare_precision = TYPE_PRECISION (compare_type);
+ widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+ bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+ bool need_iterate_p
+ = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+ && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+ /* Init min/max, shift and minus cost relative to single scalar_stmt.
+ FIXME: For now we only use length-based partial vectors on Power,
+ target specific costs may be needed for other ports. */
+ unsigned int min_max_cost = 2;
+ unsigned int shift_cost = 1, minus_cost = 1;
+
+ /* Init cost relative to single scalar_stmt. */
+ unsigned int prol_cnt = 0;
+ unsigned int body_cnt = 0;
+
+ rgroup_controls *rgc;
+ unsigned int num_vectors_m1;
+ FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+ if (rgc->type)
+ {
+ unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
+
+ /* Need one shift for niters_total computation. */
+ if (!niters_known_p && nitems != 1)
+ prol_cnt += shift_cost;
+
+ /* Need to handle wrap around. */
+ if (iv_limit == -1
+ || (wi::min_precision (iv_limit * nitems, UNSIGNED)
+ > compare_precision))
+ prol_cnt += (min_max_cost + minus_cost);
+
+ /* Need to handle batch limit excepting for the 1st one. */
+ prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
+
+ unsigned int num_vectors = num_vectors_m1 + 1;
+ /* Need to set up lengths in prologue, only one MIN required
+ since start index is zero. */
+ prol_cnt += min_max_cost * num_vectors;
+
+ /* Probably need some extra cost to transform this length, like
+ shift on Power. */
+ body_cnt += targetm.vectorize.extra_length_cost * num_vectors;
+
+ /* Need to update lengths in body for next iteration. */
+ if (need_iterate_p)
+ body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
+ }
+
+ (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt,
+ scalar_stmt, NULL, NULL_TREE, 0, vect_prologue);
+ (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt,
+ scalar_stmt, NULL, NULL_TREE, 0, vect_body);
+ }
}
else if (npeel < 0)
{
@@ -3913,8 +4017,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
}
/* ??? The "if" arm is written to handle all cases; see below for what
- we would do for !LOOP_VINFO_FULLY_MASKED_P. */
- if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+ we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P. */
+ if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
{
/* Rewriting the condition above in terms of the number of
vector iterations (vniters) rather than the number of
@@ -3941,7 +4045,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
dump_printf (MSG_NOTE, " Minimum number of vector iterations: %d\n",
min_vec_niters);
- if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+ if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
{
/* Now that we know the minimum number of vector iterations,
find the minimum niters for which the scalar cost is larger:
@@ -3996,6 +4100,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
&& min_profitable_iters < (assumed_vf + peel_iters_prologue))
/* We want the vectorized loop to execute at least once. */
min_profitable_iters = assumed_vf + peel_iters_prologue;
+ else if (min_profitable_iters < peel_iters_prologue)
+ /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
+ vectorized loop to execute at least once. */
+ min_profitable_iters = peel_iters_prologue;
if (dump_enabled_p ())
dump_printf_loc (MSG_NOTE, vect_location,
@@ -4013,7 +4121,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
if (vec_outside_cost <= 0)
min_profitable_estimate = 0;
- else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+ else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
{
/* This is a repeat of the code above, but with + SOC rather
than - SOC. */
@@ -4025,7 +4133,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
if (outside_overhead > 0)
min_vec_niters = outside_overhead / saving_per_viter + 1;
- if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+ if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
{
int threshold = (vec_inside_cost * min_vec_niters
+ vec_outside_cost
next reply other threads:[~2020-07-21 5:52 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-07-21 5:51 Kewen.Lin [this message]
2020-07-21 7:57 ` Richard Biener
2020-07-22 1:26 ` [PATCH v2] vect/rs6000: " Kewen.Lin
2020-07-22 6:38 ` Richard Biener
2020-07-22 7:08 ` Kewen.Lin
2020-07-22 9:11 ` Richard Sandiford
2020-07-22 15:48 ` [PATCH v3] " Kewen.Lin
2020-07-22 16:25 ` Kewen.Lin
2020-07-24 16:21 ` Richard Sandiford
2020-07-27 3:58 ` [PATCH v4] " Kewen.Lin
2020-07-27 13:40 ` Richard Sandiford
2020-07-28 8:36 ` Kewen.Lin
2020-07-31 11:03 ` Richard Sandiford
2020-07-31 11:20 ` Richard Biener
2020-07-31 12:37 ` Kewen.Lin
2020-07-31 13:01 ` Richard Biener
2020-07-31 13:21 ` Kewen.Lin
2020-07-31 14:51 ` [PATCH v5] " Kewen.Lin
2020-08-05 7:27 ` Richard Sandiford
2020-08-05 14:06 ` Segher Boessenkool
2020-08-06 6:47 ` Kewen.Lin
2020-07-22 17:49 ` [PATCH v2] " Segher Boessenkool
2020-07-27 3:44 ` Kewen.Lin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=419f1fad-05be-115c-1a53-cb710ae7b2dc@linux.ibm.com \
--to=linkw@linux.ibm.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=richard.guenther@gmail.com \
--cc=richard.sandiford@arm.com \
--cc=segher@kernel.crashing.org \
--cc=wschmidt@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).