* [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts
@ 2019-04-24 8:49 Kewen.Lin
  2019-04-24 9:08 ` Jakub Jelinek
  ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Kewen.Lin @ 2019-04-24 8:49 UTC (permalink / raw)
To: GCC Patches
Cc: bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther, jakub

Hi all,

As described in PR80791 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80791), on targets that
support low-overhead loops the related compare-type IV uses can become dead and eventually be
removed.  However, the current IVOPTs cost modeling doesn't consider this possible elimination,
which leads to suboptimal IV candidate decisions.

To improve this, this patch introduces a target hook that gives a target supporting low-overhead
loops a way to teach IVOPTs which compare-type IV uses can be ignored in cost modeling.

Since the low-overhead loop transformation is done on RTL, some of its checks are hard to imitate
on GIMPLE, so it isn't possible to predict exactly in the middle end whether the current loop will
be transformed.  But predicting most loops correctly would already be helpful; it highly depends
on fine-tuning of the target hook.  It's acceptable to miss some loops that could be transformed
into low-overhead loops, but we should try our best to avoid predicting a loop as a low-overhead
loop when it isn't one.

Bootstrapped and regression tested on powerpc64le.

One failure was found in regression testing, which is:
gcc.target/powerpc/20050830-1.c

I did some investigation and found: with this patch, one new iv cand is chosen for exit condition,
which is used to rewrite the compare type use.  Later loop iv analysis in RTL can NOT determine
the loop iteration number is finite, which causes doloop_optimize not transform the loop.  The
check doesn't find the expected pattern.

I'm still investigating how to solve this failure, either by enhancing the loop IV analysis or by
checking some special condition and adding a guard in IVOPTs.  Any suggestions are highly welcome.

Btw, this is for GCC10.

---------------------------------------------

gcc/ChangeLog

2019-04-24  Kewen Lin  <linkw@gcc.gnu.org>

        PR middle-end/80791
        * target.def (predict_doloop_p): New.
        * targhooks.h (default_predict_doloop_p): New.
        * targhooks.c (default_predict_doloop_p): Likewise.
        * doc/tm.texi.in (TARGET_PREDICT_DOLOOP_P): New.
        * doc/tm.texi: Regenerate.
        * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New.
          (costly_iter_for_doloop_p): New.
          (rs6000_predict_doloop_p): New.
        * gcc/expr.c (produce_memory_decl_rtl): New.
          (prepare_decl_rtl): New.
        * gcc/expr.h (produce_memory_decl_rtl): Declare.
          (prepare_decl_rtl): Declare.
        * gcc/tree-ssa-loop-ivopts.c (tree_ssa_iv_optimize_loop): Call
          predict_doloop_p hook.
          (tailor_cmp_uses): New.
          (preserve_ivs_for_use): New.
          (computation_cost): Call refactored prepare_decl_rtl, consider
          zero cost iv use.
          (dump_use): Dump zero_cost_p field.
          (record_use): Init zero_cost_p field.
          (produce_memory_decl_rtl): Remove.
          (prepare_decl_rtl): Remove.

gcc/testsuite/ChangeLog

2019-04-24  Kewen Lin  <linkw@gcc.gnu.org>

        PR middle-end/80791
        * gcc.dg/tree-ssa/ivopts-lt.c : Adjust.
--- gcc/config/rs6000/rs6000.c | 174 ++++++++++++++++++++- gcc/doc/tm.texi | 8 + gcc/doc/tm.texi.in | 2 + gcc/expr.c | 91 +++++++++++ gcc/expr.h | 16 +- gcc/target.def | 9 ++ gcc/targhooks.c | 13 ++ gcc/targhooks.h | 1 + gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c | 7 +- gcc/tree-ssa-loop-ivopts.c | 247 +++++++++++++++++++----------- 10 files changed, 472 insertions(+), 96 deletions(-) diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c index a21f4f7..756f914 100644 --- a/gcc/config/rs6000/rs6000.c +++ b/gcc/config/rs6000/rs6000.c @@ -83,6 +83,9 @@ #include "tree-ssa-propagate.h" #include "tree-vrp.h" #include "tree-ssanames.h" +#include "tree-ssa-loop-niter.h" +#include "tree-cfg.h" +#include "tree-scalar-evolution.h" /* This file should be included last. */ #include "target-def.h" @@ -1914,6 +1917,9 @@ static const struct attribute_spec rs6000_attribute_table[] = #undef TARGET_CAN_USE_DOLOOP_P #define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost +#undef TARGET_PREDICT_DOLOOP_P +#define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p + #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv @@ -39436,7 +39442,173 @@ rs6000_mangle_decl_assembler_name (tree decl, tree id) return id; } - +/* Check whether there are some instructions preventing doloop transformation + inside loop body, mainly for instructions which are possible to kill CTR. + + Return true if some invalid insn exits, otherwise return false. */ + +static bool +invalid_insn_for_doloop_p (struct loop *loop) +{ + basic_block *body = get_loop_body (loop); + unsigned num_nodes = loop->num_nodes; + gimple_stmt_iterator gsi; + unsigned i; + + for (i = 0; i < num_nodes; i++) + for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi)) + { + gimple *stmt = gsi_stmt (gsi); + if (is_gimple_call (stmt) && !gimple_call_internal_p (stmt) + && !is_inexpensive_builtin (gimple_call_fndecl (stmt))) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + "predict doloop failure due to finding call.\n"); + return true; + } + if (computed_goto_p (stmt)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf ( + dump_file, + "predict doloop failure due to finding computed jump.\n"); + return true; + } + /* TODO: Now this hook is expected to be called in ivopts, which is + before switchlower1/switchlower2. Checking for SWITCH at this point + will eliminate some good candidates. But since there are only a few + cases having _a_ switch statement in the innermost loop, it can be a low + priority enhancement. */ + + if (gimple_code (stmt) == GIMPLE_SWITCH) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + "predict doloop failure due to finding switch.\n"); + return true; + } + } + + return false; +} + +/* Check whether number of iteration computation is too costly for doloop + transformation. It expands the gimple sequence to equivalent RTL insn + sequence, then evaluate the cost. + + Return true if it's costly, otherwise return false. 
*/ + +static bool +costly_iter_for_doloop_p (struct loop *loop, tree niters) +{ + tree type = TREE_TYPE (niters); + unsigned cost = 0, i; + tree obj; + bool speed = optimize_loop_for_speed_p (loop); + int regno = LAST_VIRTUAL_REGISTER + 1; + vec<tree> tvec; + tvec.create (20); + decl_rtl_data tdata; + tdata.regno = ®no; + tdata.treevec = &tvec; + walk_tree (&niters, prepare_decl_rtl, &tdata, NULL); + start_sequence (); + expand_expr (niters, NULL_RTX, TYPE_MODE (type), EXPAND_NORMAL); + rtx_insn *seq = get_insns (); + end_sequence (); + FOR_EACH_VEC_ELT (tvec, i, obj) + SET_DECL_RTL (obj, NULL_RTX); + tvec.release (); + + for (; seq; seq = NEXT_INSN (seq)) + { + if (!INSN_P (seq)) + continue; + rtx body = PATTERN (seq); + if (GET_CODE (body) == SET) + { + rtx set_val = XEXP (body, 1); + enum rtx_code code = GET_CODE (set_val); + enum rtx_class cls = GET_RTX_CLASS (code); + /* For now, we only consider these two RTX classes, to match what we + get in doloop_optimize, excluding operations like zero/sign extend. */ + if (cls == RTX_BIN_ARITH || cls == RTX_COMM_ARITH) + cost += set_src_cost (set_val, GET_MODE (set_val), speed); + } + } + unsigned max_cost + = COSTS_N_INSNS (PARAM_VALUE (PARAM_MAX_ITERATIONS_COMPUTATION_COST)); + if (cost > max_cost) + return true; + + return false; +} + +/* + Predict whether the given loop will be transformed in the RTL + doloop_optimize pass. This is for use by the IVOPTS middle-end pass. + Attempt to duplicate as many doloop_optimize checks as possible. + + Some RTL specific checks seems unable to be checked here, if any + new checks or easy checks _are_ missing here. */ + +static bool +rs6000_predict_doloop_p (struct loop *loop) +{ + gcc_assert (loop); + + /* On rs6000, targetm.can_use_doloop_p is actually + can_use_doloop_if_innermost. Just ensure it's innermost. */ + if (loop->inner != NULL) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "predict doloop failure due to no innermost.\n"); + return false; + } + + /* number_of_latch_executions is not so costly, so we don't use + number_of_iterations_exit for iteration description. */ + tree niters = number_of_latch_executions (loop); + if (niters == NULL_TREE || niters == chrec_dont_know) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "predict doloop failure due to unexpected niters.\n"); + return false; + } + + /* Similar to doloop_optimize, check whether iteration count too small + and not profitable. */ + HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop); + if (est_niter == -1) + est_niter = get_likely_max_loop_iterations_int (loop); + if (est_niter >= 0 && est_niter < 3) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + "predict doloop failure due to too few iterations (%d).\n", + (unsigned int) est_niter); + return false; + } + + /* Similar to doloop_optimize, check whether number of iterations too costly + to compute. */ + if (costly_iter_for_doloop_p (loop, niters)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + "predict doloop failure due to costly niter computation.\n"); + return false; + } + + /* Similar to doloop_optimize targetm.invalid_within_doloop, check for + CALL, tablejump, computed_jump. 
*/ + if (invalid_insn_for_doloop_p (loop)) + return false; + + return true; +} + struct gcc_target targetm = TARGET_INITIALIZER; #include "gt-rs6000.h" diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index 8c8978b..d3ad208 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -11603,6 +11603,14 @@ function version at run-time for a given set of function versions. body must be generated. @end deftypefn +@deftypefn {Target Hook} bool TARGET_PREDICT_DOLOOP_P (struct loop *@var{loop}) +Return true if we can predict it is possible to use low-overhead loops +for a particular loop. The parameter @var{loop} is a pointer to the loop +which is going to be checked. This target hook is required only when the +target supports low-overhead loops, and will help some earlier middle-end +passes to make some decisions. +@end deftypefn + @deftypefn {Target Hook} bool TARGET_CAN_USE_DOLOOP_P (const widest_int @var{&iterations}, const widest_int @var{&iterations_max}, unsigned int @var{loop_depth}, bool @var{entered_at_top}) Return true if it is possible to use low-overhead loops (@code{doloop_end} and @code{doloop_begin}) for a particular loop. @var{iterations} gives the diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index fe1194e..e047734 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -7942,6 +7942,8 @@ to by @var{ce_info}. @hook TARGET_GENERATE_VERSION_DISPATCHER_BODY +@hook TARGET_PREDICT_DOLOOP_P + @hook TARGET_CAN_USE_DOLOOP_P @hook TARGET_INVALID_WITHIN_DOLOOP diff --git a/gcc/expr.c b/gcc/expr.c index 9ff5e5f..1f2ad45 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -12539,3 +12539,94 @@ int_expr_size (tree exp) return tree_to_shwi (size); } + +/* Produce DECL_RTL for object obj so it looks like it is stored in memory. */ + +rtx +produce_memory_decl_rtl (tree obj, int *regno) +{ + addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (obj)); + machine_mode address_mode = targetm.addr_space.address_mode (as); + rtx x; + + gcc_assert (obj); + if (TREE_STATIC (obj) || DECL_EXTERNAL (obj)) + { + const char *name = IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (obj)); + x = gen_rtx_SYMBOL_REF (address_mode, name); + SET_SYMBOL_REF_DECL (x, obj); + x = gen_rtx_MEM (DECL_MODE (obj), x); + set_mem_addr_space (x, as); + targetm.encode_section_info (obj, x, true); + } + else + { + x = gen_raw_REG (address_mode, (*regno)++); + x = gen_rtx_MEM (DECL_MODE (obj), x); + set_mem_addr_space (x, as); + } + + return x; +} + +/* Prepares decl_rtl for variables referred in *EXPR_P. Callback for + walk_tree. DATA contains the actual fake register number. */ + +tree +prepare_decl_rtl (tree *expr_p, int *ws, void *data) +{ + tree obj = NULL_TREE; + rtx x = NULL_RTX; + decl_rtl_data *info = (decl_rtl_data *) data; + int *regno = info->regno; + vec<tree> *treevec = info->treevec; + + switch (TREE_CODE (*expr_p)) + { + case ADDR_EXPR: + for (expr_p = &TREE_OPERAND (*expr_p, 0); handled_component_p (*expr_p); + expr_p = &TREE_OPERAND (*expr_p, 0)) + continue; + obj = *expr_p; + if (DECL_P (obj) && HAS_RTL_P (obj) && !DECL_RTL_SET_P (obj)) + x = produce_memory_decl_rtl (obj, regno); + break; + + case SSA_NAME: + *ws = 0; + obj = SSA_NAME_VAR (*expr_p); + /* Defer handling of anonymous SSA_NAMEs to the expander. 
*/ + if (!obj) + return NULL_TREE; + if (!DECL_RTL_SET_P (obj)) + x = gen_raw_REG (DECL_MODE (obj), (*regno)++); + break; + + case VAR_DECL: + case PARM_DECL: + case RESULT_DECL: + *ws = 0; + obj = *expr_p; + + if (DECL_RTL_SET_P (obj)) + break; + + if (DECL_MODE (obj) == BLKmode) + x = produce_memory_decl_rtl (obj, regno); + else + x = gen_raw_REG (DECL_MODE (obj), (*regno)++); + + break; + + default: + break; + } + + if (x) + { + treevec->safe_push (obj); + SET_DECL_RTL (obj, x); + } + + return NULL_TREE; +} diff --git a/gcc/expr.h b/gcc/expr.h index 17c3962..b1894a6b 100644 --- a/gcc/expr.h +++ b/gcc/expr.h @@ -53,7 +53,21 @@ typedef struct separate_ops tree type; tree op0, op1, op2; } *sepops; - + +/* This structure is used to pass information to tree walker function + prepare_decl_rtl. */ +typedef struct data_for_decl_rtl +{ + int *regno; + vec<tree> *treevec; +} decl_rtl_data; + +/* Produce decl_rtl for object so it looks like it is stored in memory. */ +rtx produce_memory_decl_rtl (tree, int *); + +/* Prepares decl_rtl for variables referred. Callback for walk_tree. */ +tree prepare_decl_rtl (tree *, int *, void *); + /* This is run during target initialization to set up which modes can be used directly in memory and to initialize the block move optab. */ extern void init_expr_target (void); diff --git a/gcc/target.def b/gcc/target.def index 66cee07..2f28e80 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -4227,6 +4227,15 @@ DEFHOOK rtx, (machine_mode mode, rtx result, rtx val, rtx failval), default_speculation_safe_value) +DEFHOOK +(predict_doloop_p, + "Return true if we can predict it is possible to use low-overhead loops\n\ +for a particular loop. The parameter @var{loop} is a pointer to the loop\n\ +which is going to be checked. This target hook is required only when the\n\ +target supports low-overhead loops, and will help some earlier middle-end\n\ +passes to make some decisions.", + bool, (struct loop *loop), + default_predict_doloop_p) DEFHOOK (can_use_doloop_p, diff --git a/gcc/targhooks.c b/gcc/targhooks.c index 318f7e9..56ed4cc 100644 --- a/gcc/targhooks.c +++ b/gcc/targhooks.c @@ -643,6 +643,19 @@ default_has_ifunc_p (void) return HAVE_GNU_INDIRECT_FUNCTION; } +/* True if we can predict this loop is possible to be transformed to a + low-overhead loop, otherwise returns false. + + By default, false is returned, as this hook's applicability should be + verified for each target. Target maintainers should re-define the hook + if the target can take advantage of it. */ + +bool +default_predict_doloop_p (struct loop *loop ATTRIBUTE_UNUSED) +{ + return false; +} + /* NULL if INSN insn is valid within a low-overhead loop, otherwise returns an error message. 
diff --git a/gcc/targhooks.h b/gcc/targhooks.h index 5943627..70bfb17 100644 --- a/gcc/targhooks.h +++ b/gcc/targhooks.h @@ -85,6 +85,7 @@ extern bool default_fixed_point_supported_p (void); extern bool default_has_ifunc_p (void); +extern bool default_predict_doloop_p (struct loop *); extern const char * default_invalid_within_doloop (const rtx_insn *); extern tree default_builtin_vectorized_function (unsigned int, tree, tree); diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c index 171c85a..f61288c 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c @@ -17,6 +17,7 @@ f1 (char *p, uintptr_t i, uintptr_t n) while (i < n); } -/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" } } */ -/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts"} } */ -/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" } } */ +/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" { target { ! powerpc*-*-* } } } } */ +/* { dg-final { scan-tree-dump-times "PHI" 2 "ivopts" { target { powerpc*-*-* } } } } */ +/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts" { target { ! powerpc*-*-* } } } } */ +/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" { target { ! powerpc*-*-* } } } } */ diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c index a44b4cb..148a48d 100644 --- a/gcc/tree-ssa-loop-ivopts.c +++ b/gcc/tree-ssa-loop-ivopts.c @@ -109,6 +109,7 @@ along with GCC; see the file COPYING3. If not see #include "builtins.h" #include "tree-vectorizer.h" + /* FIXME: Expressions are expanded to RTL in this pass to determine the cost of different addressing modes. This should be moved to a TBD interface between the GIMPLE and RTL worlds. */ @@ -376,6 +377,7 @@ struct iv_use tree addr_base; /* Base address with const offset stripped. */ poly_uint64_pod addr_offset; /* Const offset stripped from base address. */ + bool zero_cost_p; /* This iv use takes zero cost, only compare type. */ }; /* Group of uses. */ @@ -691,6 +693,8 @@ static vec<tree> decl_rtl_to_reset; static comp_cost force_expr_to_var_cost (tree, bool); +static void preserve_ivs_for_use (struct ivopts_data*, struct iv_use*); + /* The single loop exit if it dominates the latch, NULL otherwise. */ edge @@ -765,6 +769,8 @@ dump_use (FILE *file, struct iv_use *use) fprintf (file, " At pos:\t"); if (use->op_p) print_generic_expr (file, *use->op_p, TDF_SLIM); + if (use->zero_cost_p) + fprintf (file, "\n Zero cost"); fprintf (file, "\n"); dump_iv (file, use->iv, false, 2); } @@ -1549,6 +1555,7 @@ record_use (struct iv_group *group, tree *use_p, struct iv *iv, use->op_p = use_p; use->addr_base = addr_base; use->addr_offset = addr_offset; + use->zero_cost_p = false; group->vuses.safe_push (use); return use; @@ -3687,94 +3694,6 @@ get_group_iv_cost (struct ivopts_data *data, struct iv_group *group, return NULL; } -/* Produce DECL_RTL for object obj so it looks like it is stored in memory. 
*/ -static rtx -produce_memory_decl_rtl (tree obj, int *regno) -{ - addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (obj)); - machine_mode address_mode = targetm.addr_space.address_mode (as); - rtx x; - - gcc_assert (obj); - if (TREE_STATIC (obj) || DECL_EXTERNAL (obj)) - { - const char *name = IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (obj)); - x = gen_rtx_SYMBOL_REF (address_mode, name); - SET_SYMBOL_REF_DECL (x, obj); - x = gen_rtx_MEM (DECL_MODE (obj), x); - set_mem_addr_space (x, as); - targetm.encode_section_info (obj, x, true); - } - else - { - x = gen_raw_REG (address_mode, (*regno)++); - x = gen_rtx_MEM (DECL_MODE (obj), x); - set_mem_addr_space (x, as); - } - - return x; -} - -/* Prepares decl_rtl for variables referred in *EXPR_P. Callback for - walk_tree. DATA contains the actual fake register number. */ - -static tree -prepare_decl_rtl (tree *expr_p, int *ws, void *data) -{ - tree obj = NULL_TREE; - rtx x = NULL_RTX; - int *regno = (int *) data; - - switch (TREE_CODE (*expr_p)) - { - case ADDR_EXPR: - for (expr_p = &TREE_OPERAND (*expr_p, 0); - handled_component_p (*expr_p); - expr_p = &TREE_OPERAND (*expr_p, 0)) - continue; - obj = *expr_p; - if (DECL_P (obj) && HAS_RTL_P (obj) && !DECL_RTL_SET_P (obj)) - x = produce_memory_decl_rtl (obj, regno); - break; - - case SSA_NAME: - *ws = 0; - obj = SSA_NAME_VAR (*expr_p); - /* Defer handling of anonymous SSA_NAMEs to the expander. */ - if (!obj) - return NULL_TREE; - if (!DECL_RTL_SET_P (obj)) - x = gen_raw_REG (DECL_MODE (obj), (*regno)++); - break; - - case VAR_DECL: - case PARM_DECL: - case RESULT_DECL: - *ws = 0; - obj = *expr_p; - - if (DECL_RTL_SET_P (obj)) - break; - - if (DECL_MODE (obj) == BLKmode) - x = produce_memory_decl_rtl (obj, regno); - else - x = gen_raw_REG (DECL_MODE (obj), (*regno)++); - - break; - - default: - break; - } - - if (x) - { - decl_rtl_to_reset.safe_push (obj); - SET_DECL_RTL (obj, x); - } - - return NULL_TREE; -} /* Determines cost of the computation of EXPR. */ @@ -3792,7 +3711,10 @@ computation_cost (tree expr, bool speed) node->frequency = NODE_FREQUENCY_NORMAL; crtl->maybe_hot_insn_p = speed; - walk_tree (&expr, prepare_decl_rtl, ®no, NULL); + decl_rtl_data data; + data.regno = ®no; + data.treevec = &decl_rtl_to_reset; + walk_tree (&expr, prepare_decl_rtl, &data, NULL); start_sequence (); rslt = expand_expr (expr, NULL_RTX, TYPE_MODE (type), EXPAND_NORMAL); seq = get_insns (); @@ -5284,6 +5206,8 @@ determine_group_iv_cost_cond (struct ivopts_data *data, inv_exprs = BITMAP_ALLOC (NULL); bitmap_set_bit (inv_exprs, inv_expr->id); } + if (use->zero_cost_p) + cost.cost = 0; set_group_iv_cost (data, group, cand, cost, inv_vars, bound, comp, inv_exprs); @@ -7202,6 +7126,11 @@ rewrite_use_compare (struct ivopts_data *data, return; } + if (use->zero_cost_p) { + preserve_ivs_for_use(data, use); + return; + } + /* The induction variable elimination failed; just express the original giv. */ comp = get_computation_at (data->current_loop, use->stmt, use, cand); @@ -7248,8 +7177,8 @@ rewrite_groups (struct ivopts_data *data) for (j = 0; j < group->vuses.length (); j++) { - rewrite_use_compare (data, group->vuses[j], cand); - update_stmt (group->vuses[j]->stmt); + rewrite_use_compare (data, group->vuses[j], cand); + update_stmt (group->vuses[j]->stmt); } } } @@ -7527,12 +7456,134 @@ loop_body_includes_call (basic_block *body, unsigned num_nodes) return false; } +/* For a comparison iv_use that we've chosen to ignore, we have to ensure + that its dependent IVs are preserved here. 
Otherwise remove_unused_ivs + will remove those IVs, leading this use to have no feeding IV. */ + +static void +preserve_ivs_for_use (struct ivopts_data *data, struct iv_use *use) +{ + struct loop *loop = data->current_loop; + struct iv *iv = use->iv; + auto_vec<tree> work_list; + work_list.safe_push (iv->ssa_name); + bitmap visited = BITMAP_ALLOC (NULL); + + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "Preserving IVs: "); + + while (!work_list.is_empty ()) + { + tree op = work_list.pop (); + gcc_assert (TREE_CODE (op) == SSA_NAME); + gcc_assert (name_info (data, op)); + if (bitmap_bit_p (visited, SSA_NAME_VERSION (op))) + continue; + + struct version_info *vi = name_info (data, op); + vi->preserve_biv = true; + bitmap_set_bit (visited, SSA_NAME_VERSION (iv->ssa_name)); + if (dump_file && (dump_flags & TDF_DETAILS)) + { + print_generic_expr (dump_file, op, TDF_DETAILS); + fprintf (dump_file, " "); + } + + gimple *def = SSA_NAME_DEF_STMT (op); + /* Ensure the incr_iv preserved as well, for some case + like: + i2 = phi(i0, i1); + i1 = i2 + ...; + cmp i2 ... ; + */ + if (gimple_code (def) == GIMPLE_PHI && gimple_bb (def) == loop->header) + { + tree incr_iv = PHI_ARG_DEF_FROM_EDGE (def, loop_latch_edge (loop)); + if (bitmap_bit_p (visited, SSA_NAME_VERSION (incr_iv))) + continue; + name_info (data, incr_iv)->preserve_biv = true; + if (dump_file && (dump_flags & TDF_DETAILS)) + { + print_generic_expr (dump_file, incr_iv, TDF_DETAILS); + fprintf (dump_file, " "); + } + continue; + } + + /* Only need to check assign for iv chain. */ + if (gimple_code (def) != GIMPLE_ASSIGN) + continue; + + ssa_op_iter iter; + use_operand_p use_p; + tree use_op; + struct iv *use_iv; + FOR_EACH_PHI_OR_STMT_USE (use_p, def, iter, SSA_OP_USE) + { + use_op = USE_FROM_PTR (use_p); + if ((use_iv = name_info (data, use_op)->iv) + && !integer_zerop (use_iv->step)) + work_list.safe_push (use_op); + } + } + + BITMAP_FREE (visited); + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "\n\n"); +} + +/* Doloop optimization RTL pass can make the related comparison computation + become dead and get eliminated, then these comparison IV uses should NOT + be considered in optimal IVs determination, mark them as ignored and taken + as zero cost in cost modeling. */ + +static void +tailor_cmp_uses (struct ivopts_data *data) +{ + unsigned i; + struct iv_group *group; + struct loop *loop = data->current_loop; + + for (i = 0; i < data->vgroups.length (); i++) + { + group = data->vgroups[i]; + if (group->type == USE_COMPARE) + { + gcc_assert (group->vuses.length () == 1); + struct iv_use *use = group->vuses[0]; + gimple *stmt = use->stmt; + if (gimple_code (stmt) == GIMPLE_COND) + { + basic_block bb = gimple_bb (stmt); + edge true_edge, false_edge; + extract_true_false_edges_from_block (bb, &true_edge, &false_edge); + + /* This comparison is used for loop latch. Require latch is empty + for now. */ + if ((loop->latch == true_edge->dest + || loop->latch == false_edge->dest) + && gsi_end_p (gsi_last_bb (loop->latch))) + { + use->zero_cost_p = true; + /* Dump for this use. */ + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, "Zeroing cost for use: "); + print_gimple_stmt (dump_file, stmt, TDF_DETAILS); + } + } + } + } + } +} + /* Optimizes the LOOP. Returns true if anything changed. 
*/ static bool tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop) { bool changed = false; + bool tailor_cmp_p = false; struct iv_ca *iv_ca; edge exit = single_dom_exit (loop); basic_block *body; @@ -7578,6 +7629,20 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop) if (data->vgroups.length () > MAX_CONSIDERED_GROUPS) goto finish; + tailor_cmp_p = targetm.predict_doloop_p (loop); + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, + "Predict loop %d can %sperform doloop optimization later.\n", + loop->num, tailor_cmp_p ? "" : "not "); + flow_loop_dump (loop, dump_file, NULL, 1); + } + + /* Some compare iv_use is probably useless once the doloop optimization + performs. */ + if (tailor_cmp_p) + tailor_cmp_uses (data); + /* Finds candidates for the induction variables (item 2). */ find_iv_candidates (data); -- 2.7.4 ^ permalink raw reply [flat|nested] 22+ messages in thread
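For readers skimming the thread, a minimal, purely illustrative example of the kind of loop involved (not taken from the patch, the PR, or the failing test): on a target such as Power, doloop_optimize can turn the exit test of a counted loop into a decrement of the count register plus a bdnz branch, so the compare-type IV use that IVOPTs sees on GIMPLE ends up dead after the RTL transformation.  The hook above is meant to let IVOPTs anticipate that and treat such a use as (close to) free.

/* Illustrative only; the function and variable names are made up.  */
void
clear_elements (int *a, unsigned long n)
{
  /* "i < n" is the compare-type IV use; after doloop_optimize the
     comparison is replaced by a hardware count-down (e.g. bdnz).  */
  for (unsigned long i = 0; i < n; i++)
    a[i] = 0;
}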
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-24 8:49 [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts Kewen.Lin @ 2019-04-24 9:08 ` Jakub Jelinek 2019-04-24 9:25 ` Kewen.Lin 2019-04-25 12:45 ` Segher Boessenkool 2019-04-27 3:45 ` Bin.Cheng 2 siblings, 1 reply; 22+ messages in thread From: Jakub Jelinek @ 2019-04-24 9:08 UTC (permalink / raw) To: Kewen.Lin Cc: GCC Patches, bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther On Wed, Apr 24, 2019 at 04:41:01PM +0800, Kewen.Lin wrote: > gcc/ChangeLog Not a review, just ChangeLog nits. > 2019-04-24 Kewen Lin <linkw@gcc.gnu.org> > > PR middle-end/80791 > * target.def (predict_doloop_p): New. > * targhooks.h (default_predict_doloop_p): New. > * targhooks.c (default_predict_doloop_p): Likewise. > * doc/tm.texi.in (TARGET_PREDICT_DOLOOP_P): New. > * doc/tm.texi: Regenerate. > * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New. > (costly_iter_for_doloop_p): New. > (rs6000_predict_doloop_p): New. There should be no leading spaces after the tab on the lines that don't contain a filename, so: * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New. (costly_iter_for_doloop_p): New. (rs6000_predict_doloop_p): New. instead (and instead of saying just New. usually we write what kind of new thing it is, so New function., New declaration. (or Declare.), New macro. (or Define.), New method. etc. > * gcc/expr.c (produce_memory_decl_rtl): New. The gcc/ prefixes don't belong into gcc/ChangeLog, the filenames are always relative to the ChangeLog file referencing those. Jakub ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-24 9:08 ` Jakub Jelinek @ 2019-04-24 9:25 ` Kewen.Lin 0 siblings, 0 replies; 22+ messages in thread From: Kewen.Lin @ 2019-04-24 9:25 UTC (permalink / raw) To: Jakub Jelinek; +Cc: GCC Patches Thanks for catching, Jakub! The update listed below, will integrate it with next revision patch. gcc/ChangeLog 2019-04-24 Kewen Lin <linkw@gcc.gnu.org> PR middle-end/80791 * target.def (predict_doloop_p): New hook. * targhooks.h (default_predict_doloop_p): New declaration. * targhooks.c (default_predict_doloop_p): New function. * doc/tm.texi.in (TARGET_PREDICT_DOLOOP_P): New hook. * doc/tm.texi: Regenerate. * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New function. (costly_iter_for_doloop_p): Likewise. (rs6000_predict_doloop_p): Likewise. (TARGET_PREDICT_DOLOOP_P): New macro. * expr.c (produce_memory_decl_rtl): New function. (prepare_decl_rtl): Likewise. * expr.h (produce_memory_decl_rtl): New declaration. (prepare_decl_rtl): Likewise. * tree-ssa-loop-ivopts.c (tree_ssa_iv_optimize_loop): Call predict_doloop_p hook. (tailor_cmp_uses): New function. (preserve_ivs_for_use): New function. (computation_cost): Call refactored prepare_decl_rtl, consider zero cost iv use. (dump_use): Dump zero_cost_p field. (record_use): Init zero_cost_p field. (produce_memory_decl_rtl): Remove. (prepare_decl_rtl): Remove. gcc/testsuite/ChangeLog 2019-04-24 Kewen Lin <linkw@gcc.gnu.org> PR middle-end/80791 * gcc.dg/tree-ssa/ivopts-lt.c : Adjust. on 2019/4/24 脧脗脦莽4:48, Jakub Jelinek wrote: > On Wed, Apr 24, 2019 at 04:41:01PM +0800, Kewen.Lin wrote: >> gcc/ChangeLog > > Not a review, just ChangeLog nits. > >> 2019-04-24 Kewen Lin <linkw@gcc.gnu.org> >> >> PR middle-end/80791 >> * target.def (predict_doloop_p): New. >> * targhooks.h (default_predict_doloop_p): New. >> * targhooks.c (default_predict_doloop_p): Likewise. >> * doc/tm.texi.in (TARGET_PREDICT_DOLOOP_P): New. >> * doc/tm.texi: Regenerate. >> * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New. >> (costly_iter_for_doloop_p): New. >> (rs6000_predict_doloop_p): New. > > There should be no leading spaces after the tab on the lines that don't contain > a filename, so: > * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New. > (costly_iter_for_doloop_p): New. > (rs6000_predict_doloop_p): New. > instead (and instead of saying just New. usually we write what kind of > new thing it is, so New function., New declaration. (or Declare.), New > macro. (or Define.), New method. etc. > >> * gcc/expr.c (produce_memory_decl_rtl): New. > > The gcc/ prefixes don't belong into gcc/ChangeLog, the filenames are always > relative to the ChangeLog file referencing those. > > Jakub > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-24 8:49 [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts Kewen.Lin 2019-04-24 9:08 ` Jakub Jelinek @ 2019-04-25 12:45 ` Segher Boessenkool 2019-04-26 6:58 ` Kewen.Lin 2019-04-27 3:45 ` Bin.Cheng 2 siblings, 1 reply; 22+ messages in thread From: Segher Boessenkool @ 2019-04-25 12:45 UTC (permalink / raw) To: Kewen.Lin; +Cc: GCC Patches, bin.cheng, Bill Schmidt, Richard Guenther, jakub Hi Kewen, Cool stuff. On Wed, Apr 24, 2019 at 04:41:01PM +0800, Kewen.Lin wrote: > One failure was found in regression testing, which is: > gcc.target/powerpc/20050830-1.c > > I did some investigation and found: with this patch, one new iv cand > is chosen for exit condition, which is used to rewrite the compare > type use. Later loop iv analysis in RTL can NOT determine the loop > iteration number is finite, which causes doloop_optimize not transform > the loop. The check doesn't find the expected pattern. Does it create worse code now? What we have before your patch isn't so super either (it has an sldi in the loop, it has two mtctr too). Maybe you can show the generated code? > Btw, this is for GCC10. *Phew* ;-) Some trivial comments: > +static bool > +invalid_insn_for_doloop_p (struct loop *loop) > +{ > + basic_block *body = get_loop_body (loop); > + unsigned num_nodes = loop->num_nodes; > + gimple_stmt_iterator gsi; > + unsigned i; > + > + for (i = 0; i < num_nodes; i++) for (unsigned i = 0; i < num_nodes; i++) (and maybe you can just say loop->num_nodes here; I don't know if that generates worse code, or if that even matters). > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf ( > + dump_file, > + "predict doloop failure due to finding computed jump.\n"); We don't normally end lines in (. There are other solutions to why you did that here; you can use string pasting, to break the string into two, or factor out (some part of) the loop body to reduce indentation. Segher ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-25 12:45 ` Segher Boessenkool @ 2019-04-26 6:58 ` Kewen.Lin 2019-04-26 7:59 ` Richard Biener ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Kewen.Lin @ 2019-04-26 6:58 UTC (permalink / raw) To: Segher Boessenkool Cc: GCC Patches, bin.cheng, Bill Schmidt, Richard Guenther, jakub Hi Segher, Thanks a lot for your comments! on 2019/4/25 脧脗脦莽8:16, Segher Boessenkool wrote: > Does it create worse code now? What we have before your patch isn't > so super either (it has an sldi in the loop, it has two mtctr too). > Maybe you can show the generated code? It's a good question! From the generated codes for the core loop, the code after my patch doesn't have bdnz to leverage hardware CTR, it has extra cmpld and branch instead, looks worse. But I wrote a tiny case to invoke the foo and evaluated the running time, they are equal. * Measured time: After: real 199.47 user 198.35 sys 1.11 Before: real 199.19 user 198.56 sys 0.62 * Command I used to compile: ${local_gcc} ${case_src}/20050830-1.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -ffat-lto-objects -fno-ident -c * main file (main.c): extern void foo(); #define LOOPNUM 10000 #define N 1000000*256 int a[N]; int main() { for(int i = 0; i < LOOPNUM; i++) { foo(N); } } * Generated code sequence: Before my patch: cmpwi 0,3,511 blelr 0 addi 10,3,-256 addi 9,3,-512 addis 8,2,.LC0@toc@ha ld 8,.LC0@toc@l(8) cmpwi 0,10,256 rldicl 9,9,56,40 sldi 3,3,2 addi 9,9,1 mtctr 9 add 8,3,8 li 10,42 blt 0,.L7 # jump to L7 it's less than 256 .L3: # core loop stw 10,0(8) addi 8,8,-1024 bdnz .L3 blr .L7: li 9,1 # special case, iteration only 1 mtctr 9 b .L3 After my patch: cmpwi 0,3,511 blelr 0 addis 7,2,.LC0@toc@ha ld 7,.LC0@toc@l(7) addi 10,3,-512 sldi 9,3,2 rlwinm 10,10,0,0,23 li 8,42 subf 10,10,3 sldi 10,10,2 addi 6,7,-1024 add 9,9,7 add 10,10,6 .p2align 4,,15 .L3: # core loop stw 8,0(9) addi 9,9,-1024 cmpld 0,9,10 # cmp beqlr 0 # if eq, return stw 8,0(9) addi 9,9,-1024 cmpld 0,9,10 # cmp again bne 0,.L3 # if ne, jump to L3. blr ------------------------ I practiced whether we can adjust the decision made in ivopts. For one compare iv use with zero cost, if one iv cand whose base is from memory has costly elim_cost before adjust_setup_cost, it's possible to make later analysis unable to find it's finite, then we try to avoid it. The diff looks like: --- a/gcc/tree-ssa-loop-ivopts.c +++ b/gcc/tree-ssa-loop-ivopts.c @@ -5126,6 +5126,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, tree *control_var, *bound_cst; enum tree_code comp = ERROR_MARK; struct iv_use *use = group->vuses[0]; + bool dont_elim_p = false; /* Extract condition operands. */ rewrite_type = extract_cond_operands (data, use->stmt, &control_var, @@ -5152,6 +5153,16 @@ determine_group_iv_cost_cond (struct ivopts_data *data, inv_expr_elim = get_loop_invariant_expr (data, bound); bitmap_clear (inv_vars_elim); } + + /* zero cost use makes it easier to select memory based iv cand + for replacement of non memory based iv and its use. But if + the setup sequence are too costly, loop iv analysis can NOT + easily figure out it's finite, it's possible to stop the + low-overhead loop transformation and get unexpected code. */ + if (use->zero_cost_p && cand->iv->base_object && !use->iv->base_object + && elim_cost.cost >= 30) + dont_elim_p = true; + /* The bound is a loop invariant, so it will be only computed once. 
*/ elim_cost.cost = adjust_setup_cost (data, elim_cost.cost); @@ -5184,7 +5195,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, express_cost += bound_cost; /* Choose the better approach, preferring the eliminated IV. */ - if (elim_cost <= express_cost) + if (elim_cost <= express_cost && !dont_elim_p) { I was thinking whether this zero cost change just exposed an existing problem, then this kind of check should be for all cases not only for zero cost use, similar to expression_expensive_p? But why doesn't it happen before? Need more investigation. > >> Btw, this is for GCC10. > > *Phew* ;-) > > > Some trivial comments: > >> +static bool >> +invalid_insn_for_doloop_p (struct loop *loop) >> +{ >> + basic_block *body = get_loop_body (loop); >> + unsigned num_nodes = loop->num_nodes; >> + gimple_stmt_iterator gsi; >> + unsigned i; >> + >> + for (i = 0; i < num_nodes; i++) > > for (unsigned i = 0; i < num_nodes; i++) > > (and maybe you can just say loop->num_nodes here; I don't know if that > generates worse code, or if that even matters). Good idea, will fix it. > >> + if (dump_file && (dump_flags & TDF_DETAILS)) >> + fprintf ( >> + dump_file, >> + "predict doloop failure due to finding computed jump.\n"); > > We don't normally end lines in (. There are other solutions to why you > did that here; you can use string pasting, to break the string into two, > or factor out (some part of) the loop body to reduce indentation. > Will adjust it. > > Segher > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 6:58 ` Kewen.Lin @ 2019-04-26 7:59 ` Richard Biener 2019-04-26 14:32 ` Kewen.Lin 2019-04-26 16:44 ` Segher Boessenkool 2019-04-27 4:13 ` Bin.Cheng 2 siblings, 1 reply; 22+ messages in thread From: Richard Biener @ 2019-04-26 7:59 UTC (permalink / raw) To: Kewen.Lin; +Cc: Segher Boessenkool, GCC Patches, bin.cheng, Bill Schmidt, jakub [-- Attachment #1: Type: text/plain, Size: 6318 bytes --] On Fri, 26 Apr 2019, Kewen.Lin wrote: > Hi Segher, > > Thanks a lot for your comments! > > on 2019/4/25 ä¸å8:16, Segher Boessenkool wrote: > > > Does it create worse code now? What we have before your patch isn't > > so super either (it has an sldi in the loop, it has two mtctr too). > > Maybe you can show the generated code? > > It's a good question! From the generated codes for the core loop, the > code after my patch doesn't have bdnz to leverage hardware CTR, it has > extra cmpld and branch instead, looks worse. But I wrote a tiny case > to invoke the foo and evaluated the running time, they are equal. > > * Measured time: > After: > real 199.47 > user 198.35 > sys 1.11 > Before: > real 199.19 > user 198.56 > sys 0.62 > > * Command I used to compile: > ${local_gcc} ${case_src}/20050830-1.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -ffat-lto-objects -fno-ident -c > > * main file (main.c): > extern void foo(); > #define LOOPNUM 10000 > #define N 1000000*256 > int a[N]; > int main() { > for(int i = 0; i < LOOPNUM; i++) { > foo(N); > } > } > > > * Generated code sequence: > > Before my patch: > > cmpwi 0,3,511 > blelr 0 > addi 10,3,-256 > addi 9,3,-512 > addis 8,2,.LC0@toc@ha > ld 8,.LC0@toc@l(8) > cmpwi 0,10,256 > rldicl 9,9,56,40 > sldi 3,3,2 > addi 9,9,1 > mtctr 9 > add 8,3,8 > li 10,42 > blt 0,.L7 # jump to L7 it's less than 256 > .L3: # core loop > stw 10,0(8) > addi 8,8,-1024 > bdnz .L3 > blr > .L7: > li 9,1 # special case, iteration only 1 > mtctr 9 > b .L3 > > After my patch: > > cmpwi 0,3,511 > blelr 0 > addis 7,2,.LC0@toc@ha > ld 7,.LC0@toc@l(7) > addi 10,3,-512 > sldi 9,3,2 > rlwinm 10,10,0,0,23 > li 8,42 > subf 10,10,3 > sldi 10,10,2 > addi 6,7,-1024 > add 9,9,7 > add 10,10,6 > .p2align 4,,15 > .L3: # core loop > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp > beqlr 0 # if eq, return > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp again > bne 0,.L3 # if ne, jump to L3. > blr > > ------------------------ > > I practiced whether we can adjust the decision made in ivopts. > For one compare iv use with zero cost, if one iv cand whose base > is from memory has costly elim_cost before adjust_setup_cost, > it's possible to make later analysis unable to find it's finite, > then we try to avoid it. > > The diff looks like: > > --- a/gcc/tree-ssa-loop-ivopts.c > +++ b/gcc/tree-ssa-loop-ivopts.c > @@ -5126,6 +5126,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > tree *control_var, *bound_cst; > enum tree_code comp = ERROR_MARK; > struct iv_use *use = group->vuses[0]; > + bool dont_elim_p = false; > > /* Extract condition operands. */ > rewrite_type = extract_cond_operands (data, use->stmt, &control_var, > @@ -5152,6 +5153,16 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > inv_expr_elim = get_loop_invariant_expr (data, bound); > bitmap_clear (inv_vars_elim); > } > + > + /* zero cost use makes it easier to select memory based iv cand > + for replacement of non memory based iv and its use. 
But if > + the setup sequence are too costly, loop iv analysis can NOT > + easily figure out it's finite, it's possible to stop the > + low-overhead loop transformation and get unexpected code. */ > + if (use->zero_cost_p && cand->iv->base_object && !use->iv->base_object > + && elim_cost.cost >= 30) > + dont_elim_p = true; > + > /* The bound is a loop invariant, so it will be only computed > once. */ > elim_cost.cost = adjust_setup_cost (data, elim_cost.cost); > @@ -5184,7 +5195,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > express_cost += bound_cost; > > /* Choose the better approach, preferring the eliminated IV. */ > - if (elim_cost <= express_cost) > + if (elim_cost <= express_cost && !dont_elim_p) > { > > > I was thinking whether this zero cost change just exposed > an existing problem, then this kind of check should be for all > cases not only for zero cost use, similar to > expression_expensive_p? But why doesn't it happen before? > Need more investigation. We should think about possible ways of encoding doloop at IVOPTs time rather than trying to re-analyze at RTL. I suppose we can easily set a flag in struct loop but I'm not familiar enough with the doloop pass to tell whether that is good enough. Also we can maybe move doloop to GIMPLE completely? I remember some targets have loop size constraints here as well, so I guess that isn't easily possible. Richard. > > > >> Btw, this is for GCC10. > > > > *Phew* ;-) > > > > > > Some trivial comments: > > > >> +static bool > >> +invalid_insn_for_doloop_p (struct loop *loop) > >> +{ > >> + basic_block *body = get_loop_body (loop); > >> + unsigned num_nodes = loop->num_nodes; > >> + gimple_stmt_iterator gsi; > >> + unsigned i; > >> + > >> + for (i = 0; i < num_nodes; i++) > > > > for (unsigned i = 0; i < num_nodes; i++) > > > > (and maybe you can just say loop->num_nodes here; I don't know if that > > generates worse code, or if that even matters). > > Good idea, will fix it. > > > > >> + if (dump_file && (dump_flags & TDF_DETAILS)) > >> + fprintf ( > >> + dump_file, > >> + "predict doloop failure due to finding computed jump.\n"); > > > > We don't normally end lines in (. There are other solutions to why you > > did that here; you can use string pasting, to break the string into two, > > or factor out (some part of) the loop body to reduce indentation. > > > > Will adjust it. > > > > > Segher > > > > -- Richard Biener <rguenther@suse.de> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 7:59 ` Richard Biener @ 2019-04-26 14:32 ` Kewen.Lin 2019-04-26 17:06 ` Segher Boessenkool 0 siblings, 1 reply; 22+ messages in thread From: Kewen.Lin @ 2019-04-26 14:32 UTC (permalink / raw) To: Richard Biener Cc: Segher Boessenkool, GCC Patches, bin.cheng, Bill Schmidt, jakub on 2019/4/26 ä¸å3:16, Richard Biener wrote: > On Fri, 26 Apr 2019, Kewen.Lin wrote: > >> I was thinking whether this zero cost change just exposed >> an existing problem, then this kind of check should be for all >> cases not only for zero cost use, similar to >> expression_expensive_p? But why doesn't it happen before? >> Need more investigation. > > We should think about possible ways of encoding doloop at IVOPTs > time rather than trying to re-analyze at RTL. I suppose we can > easily set a flag in struct loop but I'm not familiar enough with > the doloop pass to tell whether that is good enough. Also we > can maybe move doloop to GIMPLE completely? I remember some > targets have loop size constraints here as well, so I guess > that isn't easily possible. > > Richard. Thanks Richard! It's a very good point, but I'm afraid that it's impossible to move the whole doloop analysis pass to GIMPLE. When I implemented this hook rs6000_predict_doloop_p, I went through all the checks in doloop_optimize. I found it looks impossible to imitate all of them at GIMPLE, especially for gen_doloop_end check, jump insn check and register liveness clobber check. Even if we can make hook to check expanded RTL insn sequence in GIMPLE, but it happens too early, some information may not be exact enough, many following passes could update what the analysis is based on. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 14:32 ` Kewen.Lin @ 2019-04-26 17:06 ` Segher Boessenkool 2019-04-26 18:23 ` Richard Biener 2019-05-05 3:42 ` Kewen.Lin 0 siblings, 2 replies; 22+ messages in thread From: Segher Boessenkool @ 2019-04-26 17:06 UTC (permalink / raw) To: Kewen.Lin; +Cc: Richard Biener, GCC Patches, bin.cheng, Bill Schmidt, jakub On Fri, Apr 26, 2019 at 10:26:52PM +0800, Kewen.Lin wrote: > on 2019/4/26 ä¸å3:16, Richard Biener wrote: > > We should think about possible ways of encoding doloop at IVOPTs > > time rather than trying to re-analyze at RTL. I suppose we can > > easily set a flag in struct loop but I'm not familiar enough with > > the doloop pass to tell whether that is good enough. Also we > > can maybe move doloop to GIMPLE completely? I remember some > > targets have loop size constraints here as well, so I guess > > that isn't easily possible. > It's a very good point, but I'm afraid that it's impossible to move the > whole doloop analysis pass to GIMPLE. When I implemented this hook > rs6000_predict_doloop_p, I went through all the checks in doloop_optimize. > I found it looks impossible to imitate all of them at GIMPLE, especially > for gen_doloop_end check, jump insn check and register liveness clobber > check. Even if we can make hook to check expanded RTL insn sequence in > GIMPLE, but it happens too early, some information may not be exact enough, > many following passes could update what the analysis is based on. But you could set a flag in gimple, and have the RTL pass only do the final mechanics of making things a counted loop -- including generating the code for when it cannot do the doloop, which indeed will happen for many different reasons, many target-specific, but it is probably pretty easy to predict when we *can* use one, and we can do that optimistically, it's not so very much worse code to have to do it with a few instructions instead of e.g. a bdnz; there are many optimisation passes after this that can still improve the code (cprop, cse, combine). So, make it a doloop in gimple, and still have the rtl pass, but that only reverts it to a non-doloop if it turns out it has to. Does that sound like a good plan? Segher ^ permalink raw reply [flat|nested] 22+ messages in thread
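To make the proposed division of labor concrete, a rough sketch in pseudo-GCC terms (predict_doloop_p is the hook added by this patch; the predicted_doloop flag and the other helpers are hypothetical names that do not exist in GCC):

/* Hypothetical sketch only.  */

/* At GIMPLE time (e.g. in IVOPTs): optimistically mark the loop when
   the target predicts it can become a low-overhead loop.  */
if (targetm.predict_doloop_p (loop))
  loop->predicted_doloop = 1;    /* assumed new flag in struct loop */

/* In the RTL doloop pass: only do the final mechanics, and fall back
   to an ordinary compare-and-branch exit if some target-specific
   check fails after all.  */
if (loop->predicted_doloop && !rtl_doloop_checks_pass_p (loop))
  revert_to_plain_exit_test (loop);  /* assumed helpers, illustration only */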
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 17:06 ` Segher Boessenkool @ 2019-04-26 18:23 ` Richard Biener 2019-04-26 18:58 ` Segher Boessenkool 2019-05-05 3:42 ` Kewen.Lin 1 sibling, 1 reply; 22+ messages in thread From: Richard Biener @ 2019-04-26 18:23 UTC (permalink / raw) To: Segher Boessenkool, Kewen.Lin; +Cc: GCC Patches, bin.cheng, Bill Schmidt, jakub On April 26, 2019 6:59:43 PM GMT+02:00, Segher Boessenkool <segher@kernel.crashing.org> wrote: >On Fri, Apr 26, 2019 at 10:26:52PM +0800, Kewen.Lin wrote: >> on 2019/4/26 下午3:16, Richard Biener wrote: >> > We should think about possible ways of encoding doloop at IVOPTs >> > time rather than trying to re-analyze at RTL. I suppose we can >> > easily set a flag in struct loop but I'm not familiar enough with >> > the doloop pass to tell whether that is good enough. Also we >> > can maybe move doloop to GIMPLE completely? I remember some >> > targets have loop size constraints here as well, so I guess >> > that isn't easily possible. > >> It's a very good point, but I'm afraid that it's impossible to move >the >> whole doloop analysis pass to GIMPLE. When I implemented this hook >> rs6000_predict_doloop_p, I went through all the checks in >doloop_optimize. >> I found it looks impossible to imitate all of them at GIMPLE, >especially >> for gen_doloop_end check, jump insn check and register liveness >clobber >> check. Even if we can make hook to check expanded RTL insn sequence >in >> GIMPLE, but it happens too early, some information may not be exact >enough, >> many following passes could update what the analysis is based on. > >But you could set a flag in gimple, and have the RTL pass only do the >final mechanics of making things a counted loop -- including generating >the code for when it cannot do the doloop, which indeed will happen for >many different reasons, many target-specific, but it is probably pretty >easy to predict when we *can* use one, and we can do that >optimistically, >it's not so very much worse code to have to do it with a few >instructions >instead of e.g. a bdnz; there are many optimisation passes after this >that can still improve the code (cprop, cse, combine). > >So, make it a doloop in gimple, and still have the rtl pass, but that >only reverts it to a non-doloop if it turns out it has to. Does that >sound like a good plan? Yes, that's more or less what I had in mind. Note that you'd have a doloop early in RTL, sth we might not expect? Maybe we already have a GIMPLE way of saying decrement and test zero (one of the overflow builtins?), if not we'd need to add such to have Flag_1 = IFN_decandtest (i_2); If (Flag_1! = 0) ... As 'doloop-end' on GIMPLE. Just adjusting RTL expansion to catch a regular Dec and test would be possible As well of course. Richard. > >Segher ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 18:23 ` Richard Biener @ 2019-04-26 18:58 ` Segher Boessenkool 0 siblings, 0 replies; 22+ messages in thread From: Segher Boessenkool @ 2019-04-26 18:58 UTC (permalink / raw) To: Richard Biener; +Cc: Kewen.Lin, GCC Patches, bin.cheng, Bill Schmidt, jakub On Fri, Apr 26, 2019 at 07:51:05PM +0200, Richard Biener wrote: > On April 26, 2019 6:59:43 PM GMT+02:00, Segher Boessenkool <segher@kernel.crashing.org> wrote: > >So, make it a doloop in gimple, and still have the rtl pass, but that > >only reverts it to a non-doloop if it turns out it has to. Does that > >sound like a good plan? > > Yes, that's more or less what I had in mind. Note that you'd have a doloop early in RTL, sth we might not expect? I don't think there are any fundamental reasons to not have that. Maybe some adjustments are needed, sure. > Maybe we already have a GIMPLE way of saying decrement and test zero (one of the overflow builtins?), if not we'd need to add such to have > > Flag_1 = IFN_decandtest (i_2); > If (Flag_1! = 0) > ... > > As 'doloop-end' on GIMPLE. Just adjusting RTL expansion to catch a regular Dec and test would be possible > As well of course. I don't know if we should have any special instructions in GIMPLE for this. Maybe it is enough to just mark up the loop? It would be a lot simpler, if it *works* that is :-) Segher ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 17:06 ` Segher Boessenkool 2019-04-26 18:23 ` Richard Biener @ 2019-05-05 3:42 ` Kewen.Lin 1 sibling, 0 replies; 22+ messages in thread From: Kewen.Lin @ 2019-05-05 3:42 UTC (permalink / raw) To: gcc-patches on 2019/4/27 12:59 AM, Segher Boessenkool wrote: > On Fri, Apr 26, 2019 at 10:26:52PM +0800, Kewen.Lin wrote: >> on 2019/4/26 3:16 PM, Richard Biener wrote: >>> We should think about possible ways of encoding doloop at IVOPTs >>> time rather than trying to re-analyze at RTL. I suppose we can >>> easily set a flag in struct loop but I'm not familiar enough with >>> the doloop pass to tell whether that is good enough. Also we >>> can maybe move doloop to GIMPLE completely? I remember some >>> targets have loop size constraints here as well, so I guess >>> that isn't easily possible. > >> It's a very good point, but I'm afraid that it's impossible to move the >> whole doloop analysis pass to GIMPLE. When I implemented this hook >> rs6000_predict_doloop_p, I went through all the checks in doloop_optimize. >> I found it looks impossible to imitate all of them at GIMPLE, especially >> for gen_doloop_end check, jump insn check and register liveness clobber >> check. Even if we can make hook to check expanded RTL insn sequence in >> GIMPLE, but it happens too early, some information may not be exact enough, >> many following passes could update what the analysis is based on. > > But you could set a flag in gimple, and have the RTL pass only do the > final mechanics of making things a counted loop -- including generating > the code for when it cannot do the doloop, which indeed will happen for > many different reasons, many target-specific, but it is probably pretty > easy to predict when we *can* use one, and we can do that optimistically, > it's not so very much worse code to have to do it with a few instructions > instead of e.g. a bdnz; there are many optimisation passes after this > that can still improve the code (cprop, cse, combine). > Yes, we can set a flag to indicate the loop was predicted as a doloop, which implicitly says we have checked the doloop criteria that can be checked in the middle end. It can be well defined later. One question in my mind is whether there are other passes that currently suffer from not knowing whether a loop can be transformed into a doloop and therefore miss some optimization. I guess that determines the value and priority of making this non-trivial change. Do you happen to know of some existing PRs? > So, make it a doloop in gimple, and still have the rtl pass, but that > only reverts it to a non-doloop if it turns out it has to. Does that > sound like a good plan? > Sounds good! Thanks Segher! > > Segher > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 6:58 ` Kewen.Lin 2019-04-26 7:59 ` Richard Biener @ 2019-04-26 16:44 ` Segher Boessenkool 2019-04-27 4:13 ` Bin.Cheng 2 siblings, 0 replies; 22+ messages in thread From: Segher Boessenkool @ 2019-04-26 16:44 UTC (permalink / raw) To: Kewen.Lin; +Cc: GCC Patches, bin.cheng, Bill Schmidt, Richard Guenther, jakub On Fri, Apr 26, 2019 at 02:43:44PM +0800, Kewen.Lin wrote: > > Does it create worse code now? What we have before your patch isn't > > so super either (it has an sldi in the loop, it has two mtctr too). > > Maybe you can show the generated code? > > It's a good question! From the generated codes for the core loop, the > code after my patch doesn't have bdnz to leverage hardware CTR, it has > extra cmpld and branch instead, looks worse. But I wrote a tiny case > to invoke the foo and evaluated the running time, they are equal. > > * Measured time: > After: > real 199.47 > user 198.35 > sys 1.11 > Before: > real 199.19 > user 198.56 > sys 0.62 Before: > .L3: # core loop > stw 10,0(8) > addi 8,8,-1024 > bdnz .L3 So it didn't use an update instruction here, although it could. Not that that changes anything: it would still be three cycles per iteration (that's the minimum for any loop: instruction fetch is the bottleneck). After: > .L3: # core loop > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp > beqlr 0 # if eq, return > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp again > bne 0,.L3 # if ne, jump to L3. This is unrolled a factor 2. It should be faster, unfortunately it updates r9 twice per unrolled loop, making the dependency chain too long. The bdnz loop could be like 0: stw 10,-1024(8) bdzlr stwu 10,-2048(8) bdnz 0b or similar. There are multiple problems before we can get that :-) (The important one is that the pointer (r8 here) should be updated only once per unrolled loop iteration; just like in the version without bdnz. Using bdzlr and stwu is just niceties, compared to that). > I practiced whether we can adjust the decision made in ivopts. [ snip ] > Need more investigation. Yeah. Segher ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 6:58 ` Kewen.Lin 2019-04-26 7:59 ` Richard Biener 2019-04-26 16:44 ` Segher Boessenkool @ 2019-04-27 4:13 ` Bin.Cheng 2019-05-05 5:26 ` Kewen.Lin 2 siblings, 1 reply; 22+ messages in thread From: Bin.Cheng @ 2019-04-27 4:13 UTC (permalink / raw) To: Kewen.Lin Cc: Segher Boessenkool, GCC Patches, bin.cheng, Bill Schmidt, Richard Guenther, Jakub Jelinek On Fri, Apr 26, 2019 at 2:44 PM Kewen.Lin <linkw@linux.ibm.com> wrote: > > Hi Segher, > > Thanks a lot for your comments! > > on 2019/4/25 下午8:16, Segher Boessenkool wrote: > > > Does it create worse code now? What we have before your patch isn't > > so super either (it has an sldi in the loop, it has two mtctr too). > > Maybe you can show the generated code? > > It's a good question! From the generated codes for the core loop, the > code after my patch doesn't have bdnz to leverage hardware CTR, it has > extra cmpld and branch instead, looks worse. But I wrote a tiny case > to invoke the foo and evaluated the running time, they are equal. > > * Measured time: > After: > real 199.47 > user 198.35 > sys 1.11 > Before: > real 199.19 > user 198.56 > sys 0.62 > > * Command I used to compile: > ${local_gcc} ${case_src}/20050830-1.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -ffat-lto-objects -fno-ident -c > > * main file (main.c): > extern void foo(); > #define LOOPNUM 10000 > #define N 1000000*256 > int a[N]; > int main() { > for(int i = 0; i < LOOPNUM; i++) { > foo(N); > } > } > > > * Generated code sequence: > > Before my patch: > > cmpwi 0,3,511 > blelr 0 > addi 10,3,-256 > addi 9,3,-512 > addis 8,2,.LC0@toc@ha > ld 8,.LC0@toc@l(8) > cmpwi 0,10,256 > rldicl 9,9,56,40 > sldi 3,3,2 > addi 9,9,1 > mtctr 9 > add 8,3,8 > li 10,42 > blt 0,.L7 # jump to L7 it's less than 256 > .L3: # core loop > stw 10,0(8) > addi 8,8,-1024 > bdnz .L3 > blr > .L7: > li 9,1 # special case, iteration only 1 > mtctr 9 > b .L3 > > After my patch: > > cmpwi 0,3,511 > blelr 0 > addis 7,2,.LC0@toc@ha > ld 7,.LC0@toc@l(7) > addi 10,3,-512 > sldi 9,3,2 > rlwinm 10,10,0,0,23 > li 8,42 > subf 10,10,3 > sldi 10,10,2 > addi 6,7,-1024 > add 9,9,7 > add 10,10,6 > .p2align 4,,15 > .L3: # core loop > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp > beqlr 0 # if eq, return > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp again > bne 0,.L3 # if ne, jump to L3. > blr > > ------------------------ > > I practiced whether we can adjust the decision made in ivopts. > For one compare iv use with zero cost, if one iv cand whose base > is from memory has costly elim_cost before adjust_setup_cost, > it's possible to make later analysis unable to find it's finite, > then we try to avoid it. > > The diff looks like: > > --- a/gcc/tree-ssa-loop-ivopts.c > +++ b/gcc/tree-ssa-loop-ivopts.c > @@ -5126,6 +5126,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > tree *control_var, *bound_cst; > enum tree_code comp = ERROR_MARK; > struct iv_use *use = group->vuses[0]; > + bool dont_elim_p = false; > > /* Extract condition operands. */ > rewrite_type = extract_cond_operands (data, use->stmt, &control_var, > @@ -5152,6 +5153,16 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > inv_expr_elim = get_loop_invariant_expr (data, bound); > bitmap_clear (inv_vars_elim); > } > + > + /* zero cost use makes it easier to select memory based iv cand > + for replacement of non memory based iv and its use. 
But if > + the setup sequence are too costly, loop iv analysis can NOT > + easily figure out it's finite, it's possible to stop the > + low-overhead loop transformation and get unexpected code. */ > + if (use->zero_cost_p && cand->iv->base_object && !use->iv->base_object > + && elim_cost.cost >= 30) > + dont_elim_p = true; No, we'd like to avoid such things in general. The conditions look like a hack to me. elim_cost is compared to express_cost, adding another check on it at different place isn't really good, especially 30 itself is a magic number. It's most likely improvement in some cases, deterioration in others. Also it punishes one pass (IVOPTs here) because of other pass' (RTL) problem. It does't mean we can't do such transformations, only it has to be as precise/conservative as possible. For example, if RTL loop iv is improved to handle the case in the future, who would remember to come back and adjust this? GCC lacks the capability passing information to later passes. Gimple analyzer worked hard collecting various information but discards it entering RTL or earlier. Other examples are like runtime alias information, non-wrapping information for specific operations, etc. IMHO, this is what needs to be done. As for this case, it could be finite loop info, or non-wrapping info of the iv_var's increment operation. By passing more information, RTL passes can be simplified too. Thanks, bin > + > /* The bound is a loop invariant, so it will be only computed > once. */ > elim_cost.cost = adjust_setup_cost (data, elim_cost.cost); > @@ -5184,7 +5195,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > express_cost += bound_cost; > > /* Choose the better approach, preferring the eliminated IV. */ > - if (elim_cost <= express_cost) > + if (elim_cost <= express_cost && !dont_elim_p) > { > > > I was thinking whether this zero cost change just exposed > an existing problem, then this kind of check should be for all > cases not only for zero cost use, similar to > expression_expensive_p? But why doesn't it happen before? > Need more investigation. > > > > >> Btw, this is for GCC10. > > > > *Phew* ;-) > > > > > > Some trivial comments: > > > >> +static bool > >> +invalid_insn_for_doloop_p (struct loop *loop) > >> +{ > >> + basic_block *body = get_loop_body (loop); > >> + unsigned num_nodes = loop->num_nodes; > >> + gimple_stmt_iterator gsi; > >> + unsigned i; > >> + > >> + for (i = 0; i < num_nodes; i++) > > > > for (unsigned i = 0; i < num_nodes; i++) > > > > (and maybe you can just say loop->num_nodes here; I don't know if that > > generates worse code, or if that even matters). > > Good idea, will fix it. > > > > >> + if (dump_file && (dump_flags & TDF_DETAILS)) > >> + fprintf ( > >> + dump_file, > >> + "predict doloop failure due to finding computed jump.\n"); > > > > We don't normally end lines in (. There are other solutions to why you > > did that here; you can use string pasting, to break the string into two, > > or factor out (some part of) the loop body to reduce indentation. > > > > Will adjust it. > > > > > Segher > > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-27 4:13 ` Bin.Cheng @ 2019-05-05 5:26 ` Kewen.Lin 2019-05-06 10:21 ` Richard Biener 0 siblings, 1 reply; 22+ messages in thread From: Kewen.Lin @ 2019-05-05 5:26 UTC (permalink / raw) To: Bin.Cheng Cc: Segher Boessenkool, GCC Patches, bin.cheng, Bill Schmidt, Richard Guenther, Jakub Jelinek on 2019/4/27 ä¸å11:44, Bin.Cheng wrote: > On Fri, Apr 26, 2019 at 2:44 PM Kewen.Lin <linkw@linux.ibm.com> wrote: >> + >> + /* zero cost use makes it easier to select memory based iv cand >> + for replacement of non memory based iv and its use. But if >> + the setup sequence are too costly, loop iv analysis can NOT >> + easily figure out it's finite, it's possible to stop the >> + low-overhead loop transformation and get unexpected code. */ >> + if (use->zero_cost_p && cand->iv->base_object && !use->iv->base_object >> + && elim_cost.cost >= 30) >> + dont_elim_p = true; > No, we'd like to avoid such things in general. The conditions look > like a hack to me. elim_cost is compared to express_cost, adding > another check on it at different place isn't really good, especially > 30 itself is a magic number. It's most likely improvement in some > cases, deterioration in others. > Yes, I agree it's too hacky and unacceptable as a formal fix. And I tried to investigate more whether it's a general issue but never got exposed. > Also it punishes one pass (IVOPTs here) because of other pass' (RTL) > problem. It does't mean we can't do such transformations, only it has > to be as precise/conservative as possible. For example, if RTL loop > iv is improved to handle the case in the future, who would remember to > come back and adjust this? > Good question! > GCC lacks the capability passing information to later passes. Gimple > analyzer worked hard collecting various information but discards it > entering RTL or earlier. Other examples are like runtime alias > information, non-wrapping information for specific operations, etc. > IMHO, this is what needs to be done. As for this case, it could be > finite loop info, or non-wrapping info of the iv_var's increment > operation. By passing more information, RTL passes can be simplified > too. > Thanks for the information! Is there any under development work for this? That would be fine if we can pass down those information to downstream passes based on upcoming feature. > Thanks, > bin >> + >> /* The bound is a loop invariant, so it will be only computed >> once. */ >> elim_cost.cost = adjust_setup_cost (data, elim_cost.cost); >> @@ -5184,7 +5195,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, >> express_cost += bound_cost; >> >> /* Choose the better approach, preferring the eliminated IV. */ >> - if (elim_cost <= express_cost) >> + if (elim_cost <= express_cost && !dont_elim_p) >> { >> >> >> I was thinking whether this zero cost change just exposed >> an existing problem, then this kind of check should be for all >> cases not only for zero cost use, similar to >> expression_expensive_p? But why doesn't it happen before? >> Need more investigation. >> >>> >>>> Btw, this is for GCC10. 
>>> >>> *Phew* ;-) >>> >>> >>> Some trivial comments: >>> >>>> +static bool >>>> +invalid_insn_for_doloop_p (struct loop *loop) >>>> +{ >>>> + basic_block *body = get_loop_body (loop); >>>> + unsigned num_nodes = loop->num_nodes; >>>> + gimple_stmt_iterator gsi; >>>> + unsigned i; >>>> + >>>> + for (i = 0; i < num_nodes; i++) >>> >>> for (unsigned i = 0; i < num_nodes; i++) >>> >>> (and maybe you can just say loop->num_nodes here; I don't know if that >>> generates worse code, or if that even matters). >> >> Good idea, will fix it. >> >>> >>>> + if (dump_file && (dump_flags & TDF_DETAILS)) >>>> + fprintf ( >>>> + dump_file, >>>> + "predict doloop failure due to finding computed jump.\n"); >>> >>> We don't normally end lines in (. There are other solutions to why you >>> did that here; you can use string pasting, to break the string into two, >>> or factor out (some part of) the loop body to reduce indentation. >>> >> >> Will adjust it. >> >>> >>> Segher >>> >> > ^ permalink raw reply [flat|nested] 22+ messages in thread
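For reference, the string-pasting style Segher suggests above would look roughly like this (adjacent string literals are concatenated, so the line no longer ends in an open parenthesis):

    if (dump_file && (dump_flags & TDF_DETAILS))
      fprintf (dump_file,
               "predict doloop failure due to finding "
               "computed jump.\n");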
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-05-05 5:26 ` Kewen.Lin @ 2019-05-06 10:21 ` Richard Biener 0 siblings, 0 replies; 22+ messages in thread From: Richard Biener @ 2019-05-06 10:21 UTC (permalink / raw) To: Kewen.Lin Cc: Bin.Cheng, Segher Boessenkool, GCC Patches, bin.cheng, Bill Schmidt, Jakub Jelinek On Sun, 5 May 2019, Kewen.Lin wrote: > on 2019/4/27 11:44, Bin.Cheng wrote: > > GCC lacks the capability passing information to later passes. Gimple > > analyzer worked hard collecting various information but discards it > > entering RTL or earlier. Other examples are like runtime alias > > information, non-wrapping information for specific operations, etc. > > IMHO, this is what needs to be done. As for this case, it could be > > finite loop info, or non-wrapping info of the iv_var's increment > > operation. By passing more information, RTL passes can be simplified > > too. > > > > Thanks for the information! Is there any under development work for this? > That would be fine if we can pass down those information to downstream > passes based on upcoming feature. GCC keeps several bits of information across passes, notably everything stored in struct loop. Richard. ^ permalink raw reply [flat|nested] 22+ messages in thread
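As a rough illustration of Richard's point, some of the information that already persists in struct loop across GIMPLE and RTL passes looks approximately like the excerpt below (an abridged paraphrase of cfgloop.h from around this time, not a verbatim copy):

    struct loop
    {
      /* ...  */
      /* Number of latch executions, if known (set by niter analysis).  */
      tree nb_iterations;
      /* Upper bound on the number of latch executions, and whether the
         bound is valid.  */
      widest_int nb_iterations_upper_bound;
      bool any_upper_bound;
      /* Hints such as #pragma omp simd safelen and loop attributes that
         later passes can consult.  */
      int safelen;
      bool force_vectorize;
      bool dont_vectorize;
      /* ...  */
    };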
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-24 8:49 [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts Kewen.Lin 2019-04-24 9:08 ` Jakub Jelinek 2019-04-25 12:45 ` Segher Boessenkool @ 2019-04-27 3:45 ` Bin.Cheng 2019-05-05 3:23 ` Kewen.Lin 2 siblings, 1 reply; 22+ messages in thread From: Bin.Cheng @ 2019-04-27 3:45 UTC (permalink / raw) To: Kewen.Lin Cc: GCC Patches, bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther, Jakub Jelinek Thanks very much for working on this. On Wed, Apr 24, 2019 at 4:41 PM Kewen.Lin <linkw@linux.ibm.com> wrote: > > Hi all, > > As PR80791 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80791, on > some targets which support low-overhead loop, the related compare > type ivs use is possible to become dead and removed eventually. > However, the current ivopt cost modeling doesn't consider this > possible elimination, it leads suboptimal iv candidate decision. > > To make it better, this patch is to introduce one target hook, to > provide target which supports low-overhead loop a way to teach > ivopt to know which compare type iv can be ignored in modeling. > > Since the low-overhead loop optimize transformation is based on RTL, > some of those checks are hard to be imitated on gimple, so it's not > possible to predict the current loop will be transformed exactly > in middle-end. But if we can have most loop predicted precisely, > it would be helpful. It highly depends on target hook fine tuning. > It's acceptable to have some loops which can be transformed to low- > overhead loop but we don't catch. But we should try our best to > avoid to predict some loop as low-overhead loop but which isn't. > > Bootstrapped and regression testing on powerpc64le. > > One failure was found in regression testing, which is: > gcc.target/powerpc/20050830-1.c > > I did some investigation and found: with this patch, one new iv cand > is chosen for exit condition, which is used to rewrite the compare > type use. Later loop iv analysis in RTL can NOT determine the loop > iteration number is finite, which causes doloop_optimize not transform > the loop. The check doesn't find the expected pattern. > > Still investigating how to solve this failure, like either to enhance > loop iv analysis or to check some special condition and guard in ivopt. > Any suggestions are highly welcomed. > > Btw, this is for GCC10. > > --------------------------------------------- > > gcc/ChangeLog > > 2019-04-24 Kewen Lin <linkw@gcc.gnu.org> > > PR middle-end/80791 > * target.def (predict_doloop_p): New. > * targhooks.h (default_predict_doloop_p): New. > * targhooks.c (default_predict_doloop_p): Likewise. > * doc/tm.texi.in (TARGET_PREDICT_DOLOOP_P): New. > * doc/tm.texi: Regenerate. > * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New. > (costly_iter_for_doloop_p): New. > (rs6000_predict_doloop_p): New. > * gcc/expr.c (produce_memory_decl_rtl): New. > (prepare_decl_rtl): New. > * gcc/expr.h (produce_memory_decl_rtl): Declare. > (prepare_decl_rtl): Declare. > * gcc/tree-ssa-loop-ivopts.c (tree_ssa_iv_optimize_loop): Call > predict_doloop_p hook. > (tailor_cmp_uses): New. > (preserve_ivs_for_use): New. > (computation_cost): Call refactored prepare_decl_rtl, consider > zero cost iv use. > (dump_use): Dump zero_cost_p field. > (record_use): Init zero_cost_p field. > (produce_memory_decl_rtl): Remove. > (prepare_decl_rtl): Remove. > > gcc/testsuite/ChangeLog > > 2019-04-24 Kewen Lin <linkw@gcc.gnu.org> > > PR middle-end/80791 > * gcc.dg/tree-ssa/ivopts-lt.c : Adjust. 
> > --- > gcc/config/rs6000/rs6000.c | 174 ++++++++++++++++++++- > gcc/doc/tm.texi | 8 + > gcc/doc/tm.texi.in | 2 + > gcc/expr.c | 91 +++++++++++ > gcc/expr.h | 16 +- > gcc/target.def | 9 ++ > gcc/targhooks.c | 13 ++ > gcc/targhooks.h | 1 + > gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c | 7 +- > gcc/tree-ssa-loop-ivopts.c | 247 +++++++++++++++++++----------- > 10 files changed, 472 insertions(+), 96 deletions(-) > > diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c > index a21f4f7..756f914 100644 > --- a/gcc/config/rs6000/rs6000.c > +++ b/gcc/config/rs6000/rs6000.c For such non-trivial patch, we can improve review process by splitting to smaller patches which can be reviewed/approved independently. Below are some general comments specific to IVOPTs change. > @@ -83,6 +83,9 @@ > #include "tree-ssa-propagate.h" > #include "tree-vrp.h" > #include "tree-ssanames.h" > +#include "tree-ssa-loop-niter.h" > +#include "tree-cfg.h" > +#include "tree-scalar-evolution.h" > > /* This file should be included last. */ > #include "target-def.h" > @@ -1914,6 +1917,9 @@ static const struct attribute_spec rs6000_attribute_table[] = > #undef TARGET_CAN_USE_DOLOOP_P > #define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost > > +#undef TARGET_PREDICT_DOLOOP_P > +#define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p > + > #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV > #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv > > @@ -39436,7 +39442,173 @@ rs6000_mangle_decl_assembler_name (tree decl, tree id) > return id; > } > > - > +/* Check whether there are some instructions preventing doloop transformation > + inside loop body, mainly for instructions which are possible to kill CTR. > + > + Return true if some invalid insn exits, otherwise return false. */ > + > +static bool > +invalid_insn_for_doloop_p (struct loop *loop) > +{ > + basic_block *body = get_loop_body (loop); > + unsigned num_nodes = loop->num_nodes; > + gimple_stmt_iterator gsi; > + unsigned i; > + > + for (i = 0; i < num_nodes; i++) > + for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi)) > + { > + gimple *stmt = gsi_stmt (gsi); > + if (is_gimple_call (stmt) && !gimple_call_internal_p (stmt) > + && !is_inexpensive_builtin (gimple_call_fndecl (stmt))) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "predict doloop failure due to finding call.\n"); > + return true; > + } > + if (computed_goto_p (stmt)) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf ( > + dump_file, > + "predict doloop failure due to finding computed jump.\n"); > + return true; > + } > + /* TODO: Now this hook is expected to be called in ivopts, which is > + before switchlower1/switchlower2. Checking for SWITCH at this point > + will eliminate some good candidates. But since there are only a few > + cases having _a_ switch statement in the innermost loop, it can be a low > + priority enhancement. */ > + > + if (gimple_code (stmt) == GIMPLE_SWITCH) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "predict doloop failure due to finding switch.\n"); > + return true; > + } > + } > + > + return false; > +} > + > +/* Check whether number of iteration computation is too costly for doloop > + transformation. It expands the gimple sequence to equivalent RTL insn > + sequence, then evaluate the cost. > + > + Return true if it's costly, otherwise return false. 
*/ > + > +static bool > +costly_iter_for_doloop_p (struct loop *loop, tree niters) > +{ > + tree type = TREE_TYPE (niters); > + unsigned cost = 0, i; > + tree obj; > + bool speed = optimize_loop_for_speed_p (loop); > + int regno = LAST_VIRTUAL_REGISTER + 1; > + vec<tree> tvec; > + tvec.create (20); > + decl_rtl_data tdata; > + tdata.regno = ®no; > + tdata.treevec = &tvec; > + walk_tree (&niters, prepare_decl_rtl, &tdata, NULL); > + start_sequence (); > + expand_expr (niters, NULL_RTX, TYPE_MODE (type), EXPAND_NORMAL); > + rtx_insn *seq = get_insns (); > + end_sequence (); > + FOR_EACH_VEC_ELT (tvec, i, obj) > + SET_DECL_RTL (obj, NULL_RTX); > + tvec.release (); > + > + for (; seq; seq = NEXT_INSN (seq)) > + { > + if (!INSN_P (seq)) > + continue; > + rtx body = PATTERN (seq); > + if (GET_CODE (body) == SET) > + { > + rtx set_val = XEXP (body, 1); > + enum rtx_code code = GET_CODE (set_val); > + enum rtx_class cls = GET_RTX_CLASS (code); > + /* For now, we only consider these two RTX classes, to match what we > + get in doloop_optimize, excluding operations like zero/sign extend. */ > + if (cls == RTX_BIN_ARITH || cls == RTX_COMM_ARITH) > + cost += set_src_cost (set_val, GET_MODE (set_val), speed); > + } > + } > + unsigned max_cost > + = COSTS_N_INSNS (PARAM_VALUE (PARAM_MAX_ITERATIONS_COMPUTATION_COST)); > + if (cost > max_cost) > + return true; > + > + return false; > +} > + > +/* > + Predict whether the given loop will be transformed in the RTL > + doloop_optimize pass. This is for use by the IVOPTS middle-end pass. > + Attempt to duplicate as many doloop_optimize checks as possible. > + > + Some RTL specific checks seems unable to be checked here, if any > + new checks or easy checks _are_ missing here. */ > + > +static bool > +rs6000_predict_doloop_p (struct loop *loop) > +{ > + gcc_assert (loop); > + > + /* On rs6000, targetm.can_use_doloop_p is actually > + can_use_doloop_if_innermost. Just ensure it's innermost. */ > + if (loop->inner != NULL) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "predict doloop failure due to no innermost.\n"); > + return false; > + } > + > + /* number_of_latch_executions is not so costly, so we don't use > + number_of_iterations_exit for iteration description. */ > + tree niters = number_of_latch_executions (loop); > + if (niters == NULL_TREE || niters == chrec_dont_know) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "predict doloop failure due to unexpected niters.\n"); > + return false; > + } > + > + /* Similar to doloop_optimize, check whether iteration count too small > + and not profitable. */ > + HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop); > + if (est_niter == -1) > + est_niter = get_likely_max_loop_iterations_int (loop); > + if (est_niter >= 0 && est_niter < 3) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "predict doloop failure due to too few iterations (%d).\n", > + (unsigned int) est_niter); > + return false; > + } > + > + /* Similar to doloop_optimize, check whether number of iterations too costly > + to compute. */ > + if (costly_iter_for_doloop_p (loop, niters)) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "predict doloop failure due to costly niter computation.\n"); > + return false; > + } > + > + /* Similar to doloop_optimize targetm.invalid_within_doloop, check for > + CALL, tablejump, computed_jump. 
*/ > + if (invalid_insn_for_doloop_p (loop)) > + return false; > + > + return true; > +} > + > struct gcc_target targetm = TARGET_INITIALIZER; > > #include "gt-rs6000.h" > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi > index 8c8978b..d3ad208 100644 > --- a/gcc/doc/tm.texi > +++ b/gcc/doc/tm.texi > @@ -11603,6 +11603,14 @@ function version at run-time for a given set of function versions. > body must be generated. > @end deftypefn > > +@deftypefn {Target Hook} bool TARGET_PREDICT_DOLOOP_P (struct loop *@var{loop}) > +Return true if we can predict it is possible to use low-overhead loops > +for a particular loop. The parameter @var{loop} is a pointer to the loop > +which is going to be checked. This target hook is required only when the > +target supports low-overhead loops, and will help some earlier middle-end > +passes to make some decisions. > +@end deftypefn > + > @deftypefn {Target Hook} bool TARGET_CAN_USE_DOLOOP_P (const widest_int @var{&iterations}, const widest_int @var{&iterations_max}, unsigned int @var{loop_depth}, bool @var{entered_at_top}) > Return true if it is possible to use low-overhead loops (@code{doloop_end} > and @code{doloop_begin}) for a particular loop. @var{iterations} gives the > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in > index fe1194e..e047734 100644 > --- a/gcc/doc/tm.texi.in > +++ b/gcc/doc/tm.texi.in > @@ -7942,6 +7942,8 @@ to by @var{ce_info}. > > @hook TARGET_GENERATE_VERSION_DISPATCHER_BODY > > +@hook TARGET_PREDICT_DOLOOP_P > + > @hook TARGET_CAN_USE_DOLOOP_P > > @hook TARGET_INVALID_WITHIN_DOLOOP > diff --git a/gcc/expr.c b/gcc/expr.c > index 9ff5e5f..1f2ad45 100644 > --- a/gcc/expr.c > +++ b/gcc/expr.c > @@ -12539,3 +12539,94 @@ int_expr_size (tree exp) > > return tree_to_shwi (size); > } > + > +/* Produce DECL_RTL for object obj so it looks like it is stored in memory. */ > + > +rtx > +produce_memory_decl_rtl (tree obj, int *regno) > +{ > + addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (obj)); > + machine_mode address_mode = targetm.addr_space.address_mode (as); > + rtx x; > + > + gcc_assert (obj); > + if (TREE_STATIC (obj) || DECL_EXTERNAL (obj)) > + { > + const char *name = IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (obj)); > + x = gen_rtx_SYMBOL_REF (address_mode, name); > + SET_SYMBOL_REF_DECL (x, obj); > + x = gen_rtx_MEM (DECL_MODE (obj), x); > + set_mem_addr_space (x, as); > + targetm.encode_section_info (obj, x, true); > + } > + else > + { > + x = gen_raw_REG (address_mode, (*regno)++); > + x = gen_rtx_MEM (DECL_MODE (obj), x); > + set_mem_addr_space (x, as); > + } > + > + return x; > +} > + > +/* Prepares decl_rtl for variables referred in *EXPR_P. Callback for > + walk_tree. DATA contains the actual fake register number. */ > + > +tree > +prepare_decl_rtl (tree *expr_p, int *ws, void *data) > +{ > + tree obj = NULL_TREE; > + rtx x = NULL_RTX; > + decl_rtl_data *info = (decl_rtl_data *) data; > + int *regno = info->regno; > + vec<tree> *treevec = info->treevec; > + > + switch (TREE_CODE (*expr_p)) > + { > + case ADDR_EXPR: > + for (expr_p = &TREE_OPERAND (*expr_p, 0); handled_component_p (*expr_p); > + expr_p = &TREE_OPERAND (*expr_p, 0)) > + continue; > + obj = *expr_p; > + if (DECL_P (obj) && HAS_RTL_P (obj) && !DECL_RTL_SET_P (obj)) > + x = produce_memory_decl_rtl (obj, regno); > + break; > + > + case SSA_NAME: > + *ws = 0; > + obj = SSA_NAME_VAR (*expr_p); > + /* Defer handling of anonymous SSA_NAMEs to the expander. 
*/ > + if (!obj) > + return NULL_TREE; > + if (!DECL_RTL_SET_P (obj)) > + x = gen_raw_REG (DECL_MODE (obj), (*regno)++); > + break; > + > + case VAR_DECL: > + case PARM_DECL: > + case RESULT_DECL: > + *ws = 0; > + obj = *expr_p; > + > + if (DECL_RTL_SET_P (obj)) > + break; > + > + if (DECL_MODE (obj) == BLKmode) > + x = produce_memory_decl_rtl (obj, regno); > + else > + x = gen_raw_REG (DECL_MODE (obj), (*regno)++); > + > + break; > + > + default: > + break; > + } > + > + if (x) > + { > + treevec->safe_push (obj); > + SET_DECL_RTL (obj, x); > + } > + > + return NULL_TREE; > +} > diff --git a/gcc/expr.h b/gcc/expr.h > index 17c3962..b1894a6b 100644 > --- a/gcc/expr.h > +++ b/gcc/expr.h > @@ -53,7 +53,21 @@ typedef struct separate_ops > tree type; > tree op0, op1, op2; > } *sepops; > - > + > +/* This structure is used to pass information to tree walker function > + prepare_decl_rtl. */ > +typedef struct data_for_decl_rtl > +{ > + int *regno; > + vec<tree> *treevec; > +} decl_rtl_data; > + > +/* Produce decl_rtl for object so it looks like it is stored in memory. */ > +rtx produce_memory_decl_rtl (tree, int *); > + > +/* Prepares decl_rtl for variables referred. Callback for walk_tree. */ > +tree prepare_decl_rtl (tree *, int *, void *); > + > /* This is run during target initialization to set up which modes can be > used directly in memory and to initialize the block move optab. */ > extern void init_expr_target (void); > diff --git a/gcc/target.def b/gcc/target.def > index 66cee07..2f28e80 100644 > --- a/gcc/target.def > +++ b/gcc/target.def > @@ -4227,6 +4227,15 @@ DEFHOOK > rtx, (machine_mode mode, rtx result, rtx val, rtx failval), > default_speculation_safe_value) > > +DEFHOOK > +(predict_doloop_p, > + "Return true if we can predict it is possible to use low-overhead loops\n\ > +for a particular loop. The parameter @var{loop} is a pointer to the loop\n\ > +which is going to be checked. This target hook is required only when the\n\ > +target supports low-overhead loops, and will help some earlier middle-end\n\ > +passes to make some decisions.", > + bool, (struct loop *loop), > + default_predict_doloop_p) > > DEFHOOK > (can_use_doloop_p, > diff --git a/gcc/targhooks.c b/gcc/targhooks.c > index 318f7e9..56ed4cc 100644 > --- a/gcc/targhooks.c > +++ b/gcc/targhooks.c > @@ -643,6 +643,19 @@ default_has_ifunc_p (void) > return HAVE_GNU_INDIRECT_FUNCTION; > } > > +/* True if we can predict this loop is possible to be transformed to a > + low-overhead loop, otherwise returns false. > + > + By default, false is returned, as this hook's applicability should be > + verified for each target. Target maintainers should re-define the hook > + if the target can take advantage of it. */ > + > +bool > +default_predict_doloop_p (struct loop *loop ATTRIBUTE_UNUSED) > +{ > + return false; > +} > + > /* NULL if INSN insn is valid within a low-overhead loop, otherwise returns > an error message. 
> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h > index 5943627..70bfb17 100644 > --- a/gcc/targhooks.h > +++ b/gcc/targhooks.h > @@ -85,6 +85,7 @@ extern bool default_fixed_point_supported_p (void); > > extern bool default_has_ifunc_p (void); > > +extern bool default_predict_doloop_p (struct loop *); > extern const char * default_invalid_within_doloop (const rtx_insn *); > > extern tree default_builtin_vectorized_function (unsigned int, tree, tree); > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c > index 171c85a..f61288c 100644 > --- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c > @@ -17,6 +17,7 @@ f1 (char *p, uintptr_t i, uintptr_t n) > while (i < n); > } > > -/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" } } */ > -/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts"} } */ > -/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" } } */ > +/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" { target { ! powerpc*-*-* } } } } */ > +/* { dg-final { scan-tree-dump-times "PHI" 2 "ivopts" { target { powerpc*-*-* } } } } */ > +/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts" { target { ! powerpc*-*-* } } } } */ > +/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" { target { ! powerpc*-*-* } } } } */ > diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c > index a44b4cb..148a48d 100644 > --- a/gcc/tree-ssa-loop-ivopts.c > +++ b/gcc/tree-ssa-loop-ivopts.c > @@ -109,6 +109,7 @@ along with GCC; see the file COPYING3. If not see > #include "builtins.h" > #include "tree-vectorizer.h" > > + > /* FIXME: Expressions are expanded to RTL in this pass to determine the > cost of different addressing modes. This should be moved to a TBD > interface between the GIMPLE and RTL worlds. */ > @@ -376,6 +377,7 @@ struct iv_use > tree addr_base; /* Base address with const offset stripped. */ > poly_uint64_pod addr_offset; > /* Const offset stripped from base address. */ > + bool zero_cost_p; /* This iv use takes zero cost, only compare type. */ > }; > > /* Group of uses. */ > @@ -691,6 +693,8 @@ static vec<tree> decl_rtl_to_reset; > > static comp_cost force_expr_to_var_cost (tree, bool); > > +static void preserve_ivs_for_use (struct ivopts_data*, struct iv_use*); > + > /* The single loop exit if it dominates the latch, NULL otherwise. */ > > edge > @@ -765,6 +769,8 @@ dump_use (FILE *file, struct iv_use *use) > fprintf (file, " At pos:\t"); > if (use->op_p) > print_generic_expr (file, *use->op_p, TDF_SLIM); > + if (use->zero_cost_p) > + fprintf (file, "\n Zero cost"); > fprintf (file, "\n"); > dump_iv (file, use->iv, false, 2); > } > @@ -1549,6 +1555,7 @@ record_use (struct iv_group *group, tree *use_p, struct iv *iv, > use->op_p = use_p; > use->addr_base = addr_base; > use->addr_offset = addr_offset; > + use->zero_cost_p = false; > > group->vuses.safe_push (use); > return use; > @@ -3687,94 +3694,6 @@ get_group_iv_cost (struct ivopts_data *data, struct iv_group *group, > return NULL; > } > > -/* Produce DECL_RTL for object obj so it looks like it is stored in memory. 
*/ > -static rtx > -produce_memory_decl_rtl (tree obj, int *regno) > -{ > - addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (obj)); > - machine_mode address_mode = targetm.addr_space.address_mode (as); > - rtx x; > - > - gcc_assert (obj); > - if (TREE_STATIC (obj) || DECL_EXTERNAL (obj)) > - { > - const char *name = IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (obj)); > - x = gen_rtx_SYMBOL_REF (address_mode, name); > - SET_SYMBOL_REF_DECL (x, obj); > - x = gen_rtx_MEM (DECL_MODE (obj), x); > - set_mem_addr_space (x, as); > - targetm.encode_section_info (obj, x, true); > - } > - else > - { > - x = gen_raw_REG (address_mode, (*regno)++); > - x = gen_rtx_MEM (DECL_MODE (obj), x); > - set_mem_addr_space (x, as); > - } > - > - return x; > -} > - > -/* Prepares decl_rtl for variables referred in *EXPR_P. Callback for > - walk_tree. DATA contains the actual fake register number. */ > - > -static tree > -prepare_decl_rtl (tree *expr_p, int *ws, void *data) > -{ > - tree obj = NULL_TREE; > - rtx x = NULL_RTX; > - int *regno = (int *) data; > - > - switch (TREE_CODE (*expr_p)) > - { > - case ADDR_EXPR: > - for (expr_p = &TREE_OPERAND (*expr_p, 0); > - handled_component_p (*expr_p); > - expr_p = &TREE_OPERAND (*expr_p, 0)) > - continue; > - obj = *expr_p; > - if (DECL_P (obj) && HAS_RTL_P (obj) && !DECL_RTL_SET_P (obj)) > - x = produce_memory_decl_rtl (obj, regno); > - break; > - > - case SSA_NAME: > - *ws = 0; > - obj = SSA_NAME_VAR (*expr_p); > - /* Defer handling of anonymous SSA_NAMEs to the expander. */ > - if (!obj) > - return NULL_TREE; > - if (!DECL_RTL_SET_P (obj)) > - x = gen_raw_REG (DECL_MODE (obj), (*regno)++); > - break; > - > - case VAR_DECL: > - case PARM_DECL: > - case RESULT_DECL: > - *ws = 0; > - obj = *expr_p; > - > - if (DECL_RTL_SET_P (obj)) > - break; > - > - if (DECL_MODE (obj) == BLKmode) > - x = produce_memory_decl_rtl (obj, regno); > - else > - x = gen_raw_REG (DECL_MODE (obj), (*regno)++); > - > - break; > - > - default: > - break; > - } > - > - if (x) > - { > - decl_rtl_to_reset.safe_push (obj); > - SET_DECL_RTL (obj, x); > - } > - > - return NULL_TREE; > -} > > /* Determines cost of the computation of EXPR. */ > > @@ -3792,7 +3711,10 @@ computation_cost (tree expr, bool speed) > > node->frequency = NODE_FREQUENCY_NORMAL; > crtl->maybe_hot_insn_p = speed; > - walk_tree (&expr, prepare_decl_rtl, ®no, NULL); > + decl_rtl_data data; > + data.regno = ®no; > + data.treevec = &decl_rtl_to_reset; > + walk_tree (&expr, prepare_decl_rtl, &data, NULL); > start_sequence (); > rslt = expand_expr (expr, NULL_RTX, TYPE_MODE (type), EXPAND_NORMAL); > seq = get_insns (); > @@ -5284,6 +5206,8 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > inv_exprs = BITMAP_ALLOC (NULL); > bitmap_set_bit (inv_exprs, inv_expr->id); > } > + if (use->zero_cost_p) > + cost.cost = 0; > set_group_iv_cost (data, group, cand, cost, > inv_vars, bound, comp, inv_exprs); > > @@ -7202,6 +7126,11 @@ rewrite_use_compare (struct ivopts_data *data, > return; > } > > + if (use->zero_cost_p) { > + preserve_ivs_for_use(data, use); > + return; > + } > + > /* The induction variable elimination failed; just express the original > giv. 
*/ > comp = get_computation_at (data->current_loop, use->stmt, use, cand); > @@ -7248,8 +7177,8 @@ rewrite_groups (struct ivopts_data *data) > > for (j = 0; j < group->vuses.length (); j++) > { > - rewrite_use_compare (data, group->vuses[j], cand); > - update_stmt (group->vuses[j]->stmt); > + rewrite_use_compare (data, group->vuses[j], cand); > + update_stmt (group->vuses[j]->stmt); Wrong formatting change? Also we can run contrib/check_GNU_style.sh for coding style issues. > } > } > } > @@ -7527,12 +7456,134 @@ loop_body_includes_call (basic_block *body, unsigned num_nodes) > return false; > } > > +/* For a comparison iv_use that we've chosen to ignore, we have to ensure > + that its dependent IVs are preserved here. Otherwise remove_unused_ivs > + will remove those IVs, leading this use to have no feeding IV. */ > + > +static void > +preserve_ivs_for_use (struct ivopts_data *data, struct iv_use *use) > +{ > + struct loop *loop = data->current_loop; > + struct iv *iv = use->iv; > + auto_vec<tree> work_list; > + work_list.safe_push (iv->ssa_name); > + bitmap visited = BITMAP_ALLOC (NULL); > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "Preserving IVs: "); > + > + while (!work_list.is_empty ()) > + { > + tree op = work_list.pop (); > + gcc_assert (TREE_CODE (op) == SSA_NAME); > + gcc_assert (name_info (data, op)); > + if (bitmap_bit_p (visited, SSA_NAME_VERSION (op))) > + continue; > + > + struct version_info *vi = name_info (data, op); > + vi->preserve_biv = true; > + bitmap_set_bit (visited, SSA_NAME_VERSION (iv->ssa_name)); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + print_generic_expr (dump_file, op, TDF_DETAILS); > + fprintf (dump_file, " "); > + } > + > + gimple *def = SSA_NAME_DEF_STMT (op); > + /* Ensure the incr_iv preserved as well, for some case > + like: > + i2 = phi(i0, i1); > + i1 = i2 + ...; > + cmp i2 ... ; > + */ > + if (gimple_code (def) == GIMPLE_PHI && gimple_bb (def) == loop->header) > + { > + tree incr_iv = PHI_ARG_DEF_FROM_EDGE (def, loop_latch_edge (loop)); > + if (bitmap_bit_p (visited, SSA_NAME_VERSION (incr_iv))) > + continue; > + name_info (data, incr_iv)->preserve_biv = true; > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + print_generic_expr (dump_file, incr_iv, TDF_DETAILS); > + fprintf (dump_file, " "); > + } > + continue; > + } > + > + /* Only need to check assign for iv chain. */ > + if (gimple_code (def) != GIMPLE_ASSIGN) > + continue; > + > + ssa_op_iter iter; > + use_operand_p use_p; > + tree use_op; > + struct iv *use_iv; > + FOR_EACH_PHI_OR_STMT_USE (use_p, def, iter, SSA_OP_USE) > + { > + use_op = USE_FROM_PTR (use_p); > + if ((use_iv = name_info (data, use_op)->iv) > + && !integer_zerop (use_iv->step)) > + work_list.safe_push (use_op); > + } > + } > + > + BITMAP_FREE (visited); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "\n\n"); > +} > + > +/* Doloop optimization RTL pass can make the related comparison computation > + become dead and get eliminated, then these comparison IV uses should NOT > + be considered in optimal IVs determination, mark them as ignored and taken > + as zero cost in cost modeling. 
*/ > + > +static void > +tailor_cmp_uses (struct ivopts_data *data) > +{ > + unsigned i; > + struct iv_group *group; > + struct loop *loop = data->current_loop; > + > + for (i = 0; i < data->vgroups.length (); i++) > + { > + group = data->vgroups[i]; > + if (group->type == USE_COMPARE) > + { > + gcc_assert (group->vuses.length () == 1); > + struct iv_use *use = group->vuses[0]; > + gimple *stmt = use->stmt; > + if (gimple_code (stmt) == GIMPLE_COND) > + { > + basic_block bb = gimple_bb (stmt); > + edge true_edge, false_edge; > + extract_true_false_edges_from_block (bb, &true_edge, &false_edge); > + > + /* This comparison is used for loop latch. Require latch is empty > + for now. */ > + if ((loop->latch == true_edge->dest > + || loop->latch == false_edge->dest) > + && gsi_end_p (gsi_last_bb (loop->latch))) > + { > + use->zero_cost_p = true; > + /* Dump for this use. */ > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, "Zeroing cost for use: "); > + print_gimple_stmt (dump_file, stmt, TDF_DETAILS); > + } > + } > + } > + } > + } > +} > + > /* Optimizes the LOOP. Returns true if anything changed. */ > > static bool > tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop) > { > bool changed = false; > + bool tailor_cmp_p = false; > struct iv_ca *iv_ca; > edge exit = single_dom_exit (loop); > basic_block *body; > @@ -7578,6 +7629,20 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop) > if (data->vgroups.length () > MAX_CONSIDERED_GROUPS) > goto finish; > > + tailor_cmp_p = targetm.predict_doloop_p (loop); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, > + "Predict loop %d can %sperform doloop optimization later.\n", > + loop->num, tailor_cmp_p ? "" : "not "); > + flow_loop_dump (loop, dump_file, NULL, 1); > + } > + > + /* Some compare iv_use is probably useless once the doloop optimization > + performs. */ > + if (tailor_cmp_p) > + tailor_cmp_uses (data); Function tailor_cmp_uses sets iv_use->zero_cost_p under some conditions. Once the flag is set, though the iv_use participates cost computation in determine_group_iv_cost_cond, the result cost is hard-wired to ZERO (which means cost computation for such iv_use is waste of time). Also iv_use rewriting process is skipped for related ivs preserved explicitly by preserve_ivs_for_use. Note IVOPTs already adds candidate for original ivs. So why not detecting doloop iv_use, adjust its cost with the corresponding original iv candidate, then let the general cost model make the decision? I believe this better fits existing infrastructure and requires less change, for example, preserve_ivs_for_use won't be necessary. It has other advantages too, for example, 1) candidate of original iv can be preferred for other iv_uses with doloop cost tuning; 2) the doloop decision can still be canceled by cost model if it's really not beneficial. With current patch, it can't be undo once the decision is made (at very early stage in IVOPTs.). Sorry didn't bring out this earlier, I only realized this after reading your code. > + > /* Finds candidates for the induction variables (item 2). */ > find_iv_candidates (data); > > -- > 2.7.4 > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-27 3:45 ` Bin.Cheng @ 2019-05-05 3:23 ` Kewen.Lin 2019-05-05 4:04 ` Bin.Cheng 0 siblings, 1 reply; 22+ messages in thread From: Kewen.Lin @ 2019-05-05 3:23 UTC (permalink / raw) To: Bin.Cheng Cc: GCC Patches, bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther, Jakub Jelinek Hi Bin, Sorry for late response (just back from vacation). Thanks very much for your comments. on 2019/4/27 ä¸å11:20, Bin.Cheng wrote: > For such non-trivial patch, we can improve review process by splitting > to smaller patches which can be reviewed/approved independently. Good idea! I'll split it. > Below are some general comments specific to IVOPTs change. >> for (j = 0; j < group->vuses.length (); j++) >> { >> - rewrite_use_compare (data, group->vuses[j], cand); >> - update_stmt (group->vuses[j]->stmt); >> + rewrite_use_compare (data, group->vuses[j], cand); >> + update_stmt (group->vuses[j]->stmt); > Wrong formatting change? Also we can run contrib/check_GNU_style.sh > for coding style issues. > Good catch! >> + /* Some compare iv_use is probably useless once the doloop optimization >> + performs. */ >> + if (tailor_cmp_p) >> + tailor_cmp_uses (data); > Function tailor_cmp_uses sets iv_use->zero_cost_p under some > conditions. Once the flag is set, though the iv_use participates cost > computation in determine_group_iv_cost_cond, the result cost is > hard-wired to ZERO (which means cost computation for such iv_use is > waste of time). Yes, it can be improved by some early check and return. But it's still helpful to make it call with may_eliminate_iv. gcc.dg/no-strict-overflow-6.c is one example, with may_eliminate_iv consideration it exposes more opportunities for downstream optimization. > Also iv_use rewriting process is skipped for related > ivs preserved explicitly by preserve_ivs_for_use. > Note IVOPTs already adds candidate for original ivs. So why not > detecting doloop iv_use, adjust its cost with the corresponding > original iv candidate, then let the general cost model make the > decision? I believe this better fits existing infrastructure and > requires less change, for example, preserve_ivs_for_use won't be > necessary. I agree adjusting the cost of original iv candidate of the iv_use requires less change, but on one hand, I thought to remove interest cmp iv use or make it zero cost is close to the fact. Since it's eliminated eventually in doloop optimization, it should not considered in cost modeling. This way looks more exact. One the other hand, I assumed your suggestion is to increase the cost for the pair (original iv cand, cmp iv use), the increase cost seems to be a heuristic value? It seems it's possible to sacrifice performance for some worst cases? Although it's integrated to the existing framework, it probably introduces other cost adjust issues. For example, the original iv cand is inclined not to be selected after this change, but without considering the cmp iv use, it's better to be selected. It looks highly depend on the cost tuning? Does my understanding above make sense? But I will follow your suggestion to update and check the regression testing result etc. I just updated the code to increase the cost for the pair (original iv cand, cmp iv use) with 5 (to balance the original iv 4 vs other 5), the failure case in PR80791 was fixed. I assumed IP_ORIGINAL for cand->pos to check original iv cand is enough. 
> It has other advantages too, for example, 1) candidate of > original iv can be preferred for other iv_uses with doloop cost I think the patch way is not intrusive either. > tuning; 2) the doloop decision can still be canceled by cost model if > it's really not beneficial. With current patch, it can't be undo once > the decision is made (at very early stage in IVOPTs.). I can't really follow this. If it's predicted to be transformed to doloop, I think it should not be undone any more, since it's useless to consider this cmp iv use. Whatever IVOPTS does, the compare at the loop close should not be changed (although it's possible to use another iv), right? Do I miss something? > Sorry didn't bring out this earlier, I only realized this after > reading your code. It doesn't matter indeed! Thanks again! ^ permalink raw reply [flat|nested] 22+ messages in thread
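A hedged sketch of the experiment Kewen describes above, i.e. penalizing the pair of the original iv candidate and the exit-compare use when the loop is predicted to become a doloop. The predicted_doloop_p flag and the constant 5 are taken from the discussion and are assumptions for this sketch, not committed code:

    /* In determine_group_iv_cost_cond, after the normal elimination /
       expression costs have been computed.  data->predicted_doloop_p
       would be set from targetm.predict_doloop_p earlier in the pass
       (hypothetical field).  */
    if (data->predicted_doloop_p
        && group->type == USE_COMPARE
        && cand->pos == IP_ORIGINAL)
      /* Discourage keeping the original iv only for the exit compare,
         since that compare is expected to die in doloop_optimize.  */
      cost.cost += 5;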
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-05-05 3:23 ` Kewen.Lin @ 2019-05-05 4:04 ` Bin.Cheng 2019-05-05 6:02 ` Kewen.Lin 2019-05-05 16:33 ` Segher Boessenkool 0 siblings, 2 replies; 22+ messages in thread From: Bin.Cheng @ 2019-05-05 4:04 UTC (permalink / raw) To: Kewen.Lin Cc: GCC Patches, bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther, Jakub Jelinek On Sun, May 5, 2019 at 11:23 AM Kewen.Lin <linkw@linux.ibm.com> wrote: > > Hi Bin, > > Sorry for late response (just back from vacation). > > Thanks very much for your comments. > > on 2019/4/27 上午11:20, Bin.Cheng wrote: > > For such non-trivial patch, we can improve review process by splitting > > to smaller patches which can be reviewed/approved independently. > > Good idea! I'll split it. > > > Below are some general comments specific to IVOPTs change. > > >> for (j = 0; j < group->vuses.length (); j++) > >> { > >> - rewrite_use_compare (data, group->vuses[j], cand); > >> - update_stmt (group->vuses[j]->stmt); > >> + rewrite_use_compare (data, group->vuses[j], cand); > >> + update_stmt (group->vuses[j]->stmt); > > Wrong formatting change? Also we can run contrib/check_GNU_style.sh > > for coding style issues. > > > > Good catch! > > >> + /* Some compare iv_use is probably useless once the doloop optimization > >> + performs. */ > >> + if (tailor_cmp_p) > >> + tailor_cmp_uses (data); > > Function tailor_cmp_uses sets iv_use->zero_cost_p under some > > conditions. Once the flag is set, though the iv_use participates cost > > computation in determine_group_iv_cost_cond, the result cost is > > hard-wired to ZERO (which means cost computation for such iv_use is > > waste of time). > > Yes, it can be improved by some early check and return. > But it's still helpful to make it call with may_eliminate_iv. > gcc.dg/no-strict-overflow-6.c is one example, with may_eliminate_iv > consideration it exposes more opportunities for downstream optimization. Hmm, I wasn't suggesting early check and return, on the contrary, we can better integrate doloop/cost stuff in the overall model. See more in following comments. > > > Also iv_use rewriting process is skipped for related > > ivs preserved explicitly by preserve_ivs_for_use. > > Note IVOPTs already adds candidate for original ivs. So why not > > detecting doloop iv_use, adjust its cost with the corresponding > > original iv candidate, then let the general cost model make the > > decision? I believe this better fits existing infrastructure and > > requires less change, for example, preserve_ivs_for_use won't be > > necessary. > I agree adjusting the cost of original iv candidate of the iv_use > requires less change, but on one hand, I thought to remove interest > cmp iv use or make it zero cost is close to the fact. Since it's > eliminated eventually in doloop optimization, it should not > considered in cost modeling. This way looks more exact. Whether doloop transformation should be considered or be bypassed in cost model isn't a problem. Actually we can bind doloop iv_cand to cmp iv_use in order to force the transformation. My concern is the patch specially handles doloop by setting the special flag, then checking it. We generally avoid such special-case handling since it hurts long time maintenance. The pass was very unstable in the pass because of such issues. > One the other hand, I assumed your suggestion is to increase the > cost for the pair (original iv cand, cmp iv use), the increase cost > seems to be a heuristic value? 
It seems it's possible to sacrifice Decrease the cost so that the iv_cand is preferred? The comment wasn't made on top of implementing doloop in ivopts. Anyway, you can still bypass cost model by binding the "correct" iv_cand to cmp iv_use. > performance for some worst cases? Although it's integrated to the > existing framework, it probably introduces other cost adjust issues. > For example, the original iv cand is inclined not to be selected after > this change, but without considering the cmp iv use, it's better to > be selected. It looks highly depend on the cost tuning? > > Does my understanding above make sense? > > But I will follow your suggestion to update and check the regression > testing result etc. I just updated the code to increase the cost for > the pair (original iv cand, cmp iv use) with 5 (to balance the original > iv 4 vs other 5), the failure case in PR80791 was fixed. I assumed > IP_ORIGINAL for cand->pos to check original iv cand is enough. > > > > It has other advantages too, for example, 1) candidate of > > original iv can be preferred for other iv_uses with doloop cost > > I think the patch way is not intrusive either. > > > tuning; 2) the doloop decision can still be canceled by cost model if > > it's really not beneficial. With current patch, it can't be undo once > > the decision is made (at very early stage in IVOPTs.). > > I can't really follow this. If it's predicted to be transformed to doloop, > I think it should not be undoed any more, since it's useless to consider > this cmp iv use. Whatever IVOPTS does, the comp at loop closing should not > be changed (although possible to use other iv), right? Do I miss something? As mentioned, the previous comment wasn't made on top of implementing doloop in ivopts. That would be nice but a different story. Before we can do that, it'd better be conservative and only makes (doloop) decision in ivopts when you are sure. As you mentioned, it's hard to do the same checks at gimple as RTL, right? In this case, making it a (conservative) heuristic capturing certain beneficial cases sounds better than capturing all cases but fail in later RTL passes. Thanks, bin > > > Sorry didn't bring out this earlier, I only realized this after > > reading your code. > > It doesn't matter indeed! Thanks again! > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-05-05 4:04 ` Bin.Cheng @ 2019-05-05 6:02 ` Kewen.Lin 2019-05-06 1:50 ` Bin.Cheng 2019-05-05 16:33 ` Segher Boessenkool 1 sibling, 1 reply; 22+ messages in thread From: Kewen.Lin @ 2019-05-05 6:02 UTC (permalink / raw) To: Bin.Cheng Cc: GCC Patches, bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther, Jakub Jelinek on 2019/5/5 ä¸å12:04, Bin.Cheng wrote: > On Sun, May 5, 2019 at 11:23 AM Kewen.Lin <linkw@linux.ibm.com> wrote: >>>> + /* Some compare iv_use is probably useless once the doloop optimization >>>> + performs. */ >>>> + if (tailor_cmp_p) >>>> + tailor_cmp_uses (data); >>> Function tailor_cmp_uses sets iv_use->zero_cost_p under some >>> conditions. Once the flag is set, though the iv_use participates cost >>> computation in determine_group_iv_cost_cond, the result cost is >>> hard-wired to ZERO (which means cost computation for such iv_use is >>> waste of time). >> >> Yes, it can be improved by some early check and return. >> But it's still helpful to make it call with may_eliminate_iv. >> gcc.dg/no-strict-overflow-6.c is one example, with may_eliminate_iv >> consideration it exposes more opportunities for downstream optimization. > Hmm, I wasn't suggesting early check and return, on the contrary, we > can better integrate doloop/cost stuff in the overall model. See more > in following comments. Sorry, I didn't claim it clearly, the previous comment is to claim the call with may_eliminate_iv is not completely "waste of time", and early return can make it save time. :) And yes, it's not an issue any more with your proposed idea. >> >>> Also iv_use rewriting process is skipped for related >>> ivs preserved explicitly by preserve_ivs_for_use. >>> Note IVOPTs already adds candidate for original ivs. So why not >>> detecting doloop iv_use, adjust its cost with the corresponding >>> original iv candidate, then let the general cost model make the >>> decision? I believe this better fits existing infrastructure and >>> requires less change, for example, preserve_ivs_for_use won't be >>> necessary. >> I agree adjusting the cost of original iv candidate of the iv_use >> requires less change, but on one hand, I thought to remove interest >> cmp iv use or make it zero cost is close to the fact. Since it's >> eliminated eventually in doloop optimization, it should not >> considered in cost modeling. This way looks more exact. > Whether doloop transformation should be considered or be bypassed in > cost model isn't a problem. Actually we can bind doloop iv_cand to > cmp iv_use in order to force the transformation. My concern is the > patch specially handles doloop by setting the special flag, then > checking it. We generally avoid such special-case handling since it > hurts long time maintenance. The pass was very unstable in the pass > because of such issues. > OK, I understand your concerns now. Thanks for explanation! >> One the other hand, I assumed your suggestion is to increase the >> cost for the pair (original iv cand, cmp iv use), the increase cost >> seems to be a heuristic value? It seems it's possible to sacrifice > Decrease the cost so that the iv_cand is preferred? The comment > wasn't made on top of implementing doloop in ivopts. Anyway, you can > still bypass cost model by binding the "correct" iv_cand to cmp > iv_use. 
> To decrease the cost isn't workable for this case: it makes the original iv cand even more preferred over the other iv cand for the memory iv use, so the desirable memory based iv cand won't be chosen. If we increase the cost, one basic iv cand is chosen for the cmp use and a memory based iv cand is chosen for the memory use, instead of the original iv for both. Could you help to explain the "binding" more? Does it mean the cost model decision making can simply bypass the cmp iv_use (we have artificially assigned one "correct" iv cand to it) and also not count the "correct" iv cand's cost into the total iv cost? Is my understanding correct? >>> tuning; 2) the doloop decision can still be canceled by cost model if >>> it's really not beneficial. With current patch, it can't be undo once >>> the decision is made (at very early stage in IVOPTs.). >> >> I can't really follow this. If it's predicted to be transformed to doloop, >> I think it should not be undoed any more, since it's useless to consider >> this cmp iv use. Whatever IVOPTS does, the comp at loop closing should not >> be changed (although possible to use other iv), right? Do I miss something? > As mentioned, the previous comment wasn't made on top of implementing > doloop in ivopts. That would be nice but a different story. > Before we can do that, it'd better be conservative and only makes > (doloop) decision in ivopts when you are sure. As you mentioned, it's > hard to do the same checks at gimple as RTL, right? In this case, > making it a (conservative) heuristic capturing certain beneficial > cases sounds better than capturing all cases but fail in later RTL > passes. > Yes, I agree we should be conservative. But it's hard to determine which is better in practice; even when capturing all cases, we still try our best to be conservative, excluding any suspicious factor that could make it fail in later RTL checking. One example is that the patch won't predict a loop can be a doloop once it finds a switch statement. It depends on how many "bad" cases we fail to catch, how serious their impact is, and how easy they are to improve. Thanks Kewen.Lin ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-05-05 6:02 ` Kewen.Lin @ 2019-05-06 1:50 ` Bin.Cheng 2019-05-06 7:31 ` Kewen.Lin 0 siblings, 1 reply; 22+ messages in thread From: Bin.Cheng @ 2019-05-06 1:50 UTC (permalink / raw) To: Kewen.Lin Cc: GCC Patches, bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther, Jakub Jelinek On Sun, May 5, 2019 at 2:02 PM Kewen.Lin <linkw@linux.ibm.com> wrote: > > on 2019/5/5 下午12:04, Bin.Cheng wrote: > > On Sun, May 5, 2019 at 11:23 AM Kewen.Lin <linkw@linux.ibm.com> wrote: > >>>> + /* Some compare iv_use is probably useless once the doloop optimization > >>>> + performs. */ > >>>> + if (tailor_cmp_p) > >>>> + tailor_cmp_uses (data); > >>> Function tailor_cmp_uses sets iv_use->zero_cost_p under some > >>> conditions. Once the flag is set, though the iv_use participates cost > >>> computation in determine_group_iv_cost_cond, the result cost is > >>> hard-wired to ZERO (which means cost computation for such iv_use is > >>> waste of time). > >> > >> Yes, it can be improved by some early check and return. > >> But it's still helpful to make it call with may_eliminate_iv. > >> gcc.dg/no-strict-overflow-6.c is one example, with may_eliminate_iv > >> consideration it exposes more opportunities for downstream optimization. > > Hmm, I wasn't suggesting early check and return, on the contrary, we > > can better integrate doloop/cost stuff in the overall model. See more > > in following comments. > > Sorry, I didn't claim it clearly, the previous comment is to claim the > call with may_eliminate_iv is not completely "waste of time", and early > return can make it save time. :) > > And yes, it's not an issue any more with your proposed idea. > > >> > >>> Also iv_use rewriting process is skipped for related > >>> ivs preserved explicitly by preserve_ivs_for_use. > >>> Note IVOPTs already adds candidate for original ivs. So why not > >>> detecting doloop iv_use, adjust its cost with the corresponding > >>> original iv candidate, then let the general cost model make the > >>> decision? I believe this better fits existing infrastructure and > >>> requires less change, for example, preserve_ivs_for_use won't be > >>> necessary. > >> I agree adjusting the cost of original iv candidate of the iv_use > >> requires less change, but on one hand, I thought to remove interest > >> cmp iv use or make it zero cost is close to the fact. Since it's > >> eliminated eventually in doloop optimization, it should not > >> considered in cost modeling. This way looks more exact. > > Whether doloop transformation should be considered or be bypassed in > > cost model isn't a problem. Actually we can bind doloop iv_cand to > > cmp iv_use in order to force the transformation. My concern is the > > patch specially handles doloop by setting the special flag, then > > checking it. We generally avoid such special-case handling since it > > hurts long time maintenance. The pass was very unstable in the pass > > because of such issues. > > > > OK, I understand your concerns now. Thanks for explanation! > > >> One the other hand, I assumed your suggestion is to increase the > >> cost for the pair (original iv cand, cmp iv use), the increase cost > >> seems to be a heuristic value? It seems it's possible to sacrifice > > Decrease the cost so that the iv_cand is preferred? The comment > > wasn't made on top of implementing doloop in ivopts. Anyway, you can > > still bypass cost model by binding the "correct" iv_cand to cmp > > iv_use. 
> > > > To decrease the cost isn't workable for this case, it make the original > iv cand is preferred more and over the other iv cand for memory iv use, > then the desirable memory based iv cand won't be chosen. > If increase the cost, one basic iv cand is chosen for cmp use, memory > based iv cand is chosen for memory use, instead of original iv for both. Sorry for the mistake, I meant to decrease the use cost of whichever "correct" iv_cand for the cmp iv_use could enable the doloop optimization; it doesn't have to be the original iv_cand. > > Could you help to explain the "binding" more? Does it mean cost modeling > decision making can simply bypass the cmp iv_use (we have artificially > assigned one "correct" cand iv to it) and also doesn't count the "correct" > iv cand cost into total iv cost? Is my understanding correct? For example, if the heuristic finds the "correct" doloop iv_cand, we can force choosing that iv_cand for the cmp iv_use and bypass the candidate-choosing algorithm: struct iv_group { //... struct iv_cand *bind_cand; }; then set this bind_cand directly in struct iv_ca::cand_for_group. As a result, the iv_use rewriting code takes care of everything, and no special handling (such as preserve_ivs_for_use) is needed. Whether to let the cost model decide the "correct" iv_cand or to bind it yourself depends on how good your heuristic check is. It's your call. :) > > >>> tuning; 2) the doloop decision can still be canceled by cost model if > >>> it's really not beneficial. With current patch, it can't be undo once > >>> the decision is made (at very early stage in IVOPTs.). > >> > >> I can't really follow this. If it's predicted to be transformed to doloop, > >> I think it should not be undoed any more, since it's useless to consider > >> this cmp iv use. Whatever IVOPTS does, the comp at loop closing should not > >> be changed (although possible to use other iv), right? Do I miss something? > > As mentioned, the previous comment wasn't made on top of implementing > > doloop in ivopts. That would be nice but a different story. > > Before we can do that, it'd better be conservative and only makes > > (doloop) decision in ivopts when you are sure. As you mentioned, it's > > hard to do the same checks at gimple as RTL, right? In this case, > > making it a (conservative) heuristic capturing certain beneficial > > cases sounds better than capturing all cases but fail in later RTL > > passes. > > > > Yes, I agree we should be conservative. But it's hard to determine which is > better in practice, even for capturing all cases, we are still trying our best > to be more conservative, excluding any suspicious factor which is possible to > make it fail in later RTL checking, one example is that the patch won't predict > it can be doloop once finding switch statement. It depends on how much "bad" > cases we don't catch and how serious its impact is and whether easy to improve. Sure, I don't know ppc, so I have full trust in your decision here. Thanks for your patience. Thanks, bin ^ permalink raw reply [flat|nested] 22+ messages in thread
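To make the quoted suggestion concrete, here is a toy illustration of the binding idea; choose_cand_for_group and the cut-down structs are hypothetical names used only for this sketch, whereas in the real pass the check would sit where iv_ca::cand_for_group is filled in:

/* Cut-down structs for illustration only; the real iv_group/iv_cand
   live in tree-ssa-loop-ivopts.c and carry far more state.  */
struct iv_cand { int id; };
struct iv_group
{
  int id;
  struct iv_cand *bind_cand;  /* Forced candidate for this group, or NULL.  */
};

/* Hypothetical selection helper: honor an existing binding (e.g. the
   compare group of a predicted doloop) and skip the normal
   candidate-selection algorithm; otherwise fall back to whatever the
   cost model picked.  */
struct iv_cand *
choose_cand_for_group (struct iv_group *group, struct iv_cand *best_by_cost)
{
  if (group->bind_cand)
    return group->bind_cand;
  return best_by_cost;
}

With such a binding in place, the existing iv_use rewriting code handles the rest, which is why no separate preserve_ivs_for_use path would be needed.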
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-05-06 1:50 ` Bin.Cheng @ 2019-05-06 7:31 ` Kewen.Lin 0 siblings, 0 replies; 22+ messages in thread From: Kewen.Lin @ 2019-05-06 7:31 UTC (permalink / raw) To: Bin.Cheng Cc: GCC Patches, bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther, Jakub Jelinek on 2019/5/6 上午9:50, Bin.Cheng wrote: > On Sun, May 5, 2019 at 2:02 PM Kewen.Lin <linkw@linux.ibm.com> wrote: >> >> To decrease the cost isn't workable for this case, it make the original >> iv cand is preferred more and over the other iv cand for memory iv use, >> then the desirable memory based iv cand won't be chosen. >> If increase the cost, one basic iv cand is chosen for cmp use, memory >> based iv cand is chosen for memory use, instead of original iv for both. > Sorry for the mistake, I meant to decrease use cost of whatever "correct" > iv_cand for cmp iv_use that could enable doloop optimization, it doesn't > has to the original iv_cand. > Thanks for your clarification. As mentioned, we are trying to make the predict doloop hook as conservative as possible, so under the hypothesis that the predicted cmp iv use is going to be eliminated later, I'd prefer to bind the iv_cand artificially. Whichever iv_cand is selected as the "correct" one and has its cost increased or decreased, we would still be accounting for a cmp iv use that is about to be eliminated (it doesn't require any iv cand eventually), and that will *probably* affect the optimal candidate set decision. >> Could you help to explain the "binding" more? Does it mean cost modeling >> decision making can simply bypass the cmp iv_use (we have artificially >> assigned one "correct" cand iv to it) and also doesn't count the "correct" >> iv cand cost into total iv cost? Is my understanding correct? > For example, if the heuristic finds out the "correct" doloop iv_cand, we can > force choosing that iv_cand for cmp iv_use and bypass the candidate choosing > algorithm: > struct iv_group { > //... > struct iv_cand *bind_cand; > }; > then set this bind_cand directly in struct iv_ca::cand_for_group. As a result, > iv_use rewriting code takes care of everything, no special handling (such as > preserve_ivs_for_use) is needed. > > Whether letting cost model decide the "correct" iv_cand or bind it by yourself > depends on how good your heuristic check is. It's your call. :) > Thanks a lot for your detailed explanation! As mentioned above, I prefer to use "bind" for it. I'll modify the patch to use the "bind cand" way: initially set it to the relevant iv cand originally used in that cmp iv use, then update it if it can be updated with IV_ELIM. >> Yes, I agree we should be conservative. But it's hard to determine which is >> better in practice, even for capturing all cases, we are still trying our best >> to be more conservative, excluding any suspicious factor which is possible to >> make it fail in later RTL checking, one example is that the patch won't predict >> it can be doloop once finding switch statement. It depends on how much "bad" >> cases we don't catch and how serious its impact is and whether easy to improve. > Sure, I don't know ppc so have all the trust in your decision here. > > Thanks for your patience. > Thank you too!!! Kewen > Thanks, > bin > ^ permalink raw reply [flat|nested] 22+ messages in thread
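A rough sketch of that binding plan; cand_for_original_iv and cand_allowing_elimination are placeholder names invented for illustration, not real ivopts interfaces:

/* Placeholder declarations for illustration only; bind_doloop_cmp_group
   shows the intended two-step flow.  */
struct iv_cand;
struct iv_group { struct iv_cand *bind_cand; };

extern struct iv_cand *cand_for_original_iv (struct iv_group *);
extern struct iv_cand *cand_allowing_elimination (struct iv_group *);

void
bind_doloop_cmp_group (struct iv_group *cmp_group)
{
  /* Start from the candidate based on the iv originally used in the
     compare, so a counter suitable for the doloop form is always bound.  */
  cmp_group->bind_cand = cand_for_original_iv (cmp_group);

  /* Then, if elimination analysis (the IV_ELIM case) finds a candidate
     that can replace the exit test, rebind to that one.  */
  struct iv_cand *elim = cand_allowing_elimination (cmp_group);
  if (elim)
    cmp_group->bind_cand = elim;
}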
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-05-05 4:04 ` Bin.Cheng 2019-05-05 6:02 ` Kewen.Lin @ 2019-05-05 16:33 ` Segher Boessenkool 1 sibling, 0 replies; 22+ messages in thread From: Segher Boessenkool @ 2019-05-05 16:33 UTC (permalink / raw) To: Bin.Cheng Cc: Kewen.Lin, GCC Patches, bin.cheng, Bill Schmidt, Richard Guenther, Jakub Jelinek On Sun, May 05, 2019 at 12:04:00PM +0800, Bin.Cheng wrote: > On Sun, May 5, 2019 at 11:23 AM Kewen.Lin <linkw@linux.ibm.com> wrote: > > I can't really follow this. If it's predicted to be transformed to doloop, > > I think it should not be undoed any more, since it's useless to consider > > this cmp iv use. Whatever IVOPTS does, the comp at loop closing should not > > be changed (although possible to use other iv), right? Do I miss something? > As mentioned, the previous comment wasn't made on top of implementing > doloop in ivopts. That would be nice but a different story. > Before we can do that, it'd better be conservative and only makes > (doloop) decision in ivopts when you are sure. As you mentioned, it's > hard to do the same checks at gimple as RTL, right? In this case, > making it a (conservative) heuristic capturing certain beneficial > cases sounds better than capturing all cases but fail in later RTL > passes. But not *overly* conservative. If some situation almost never happens, it's better to have ivopts guess wrong some of the time than it is to just optimise less aggressively. If ivopts makes a non-optimal decision you still end up with valid code :-) Segher ^ permalink raw reply [flat|nested] 22+ messages in thread