* [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts
@ 2019-04-24 8:49 Kewen.Lin
  2019-04-24 9:08 ` Jakub Jelinek
  ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Kewen.Lin @ 2019-04-24 8:49 UTC (permalink / raw)
To: GCC Patches
Cc: bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther, jakub

Hi all,

As described in PR80791 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80791), on targets that
support low-overhead loops the related compare-type IV uses can become dead and eventually be
removed.  However, the current IVOPTs cost modeling doesn't consider this possible elimination,
which leads to suboptimal IV candidate decisions.

To improve this, this patch introduces a target hook that gives a target supporting low-overhead
loops a way to teach IVOPTs which compare-type IV uses can be ignored in cost modeling.

Since the low-overhead loop transformation is done on RTL, some of its checks are hard to imitate
on GIMPLE, so it isn't possible to predict exactly in the middle end whether the current loop will
be transformed.  But predicting most loops correctly would already be helpful; it highly depends
on fine-tuning of the target hook.  It's acceptable to miss some loops that could be transformed
into low-overhead loops, but we should try our best to avoid predicting a loop as a low-overhead
loop when it isn't one.

Bootstrapped and regression tested on powerpc64le.

One failure was found in regression testing, which is:
gcc.target/powerpc/20050830-1.c

I did some investigation and found: with this patch, one new iv cand is chosen for exit condition,
which is used to rewrite the compare type use.  Later loop iv analysis in RTL can NOT determine
the loop iteration number is finite, which causes doloop_optimize not transform the loop.  The
check doesn't find the expected pattern.

I'm still investigating how to solve this failure, either by enhancing the loop IV analysis or by
checking some special condition and adding a guard in IVOPTs.  Any suggestions are highly welcome.

Btw, this is for GCC10.

---------------------------------------------

gcc/ChangeLog

2019-04-24  Kewen Lin  <linkw@gcc.gnu.org>

        PR middle-end/80791
        * target.def (predict_doloop_p): New.
        * targhooks.h (default_predict_doloop_p): New.
        * targhooks.c (default_predict_doloop_p): Likewise.
        * doc/tm.texi.in (TARGET_PREDICT_DOLOOP_P): New.
        * doc/tm.texi: Regenerate.
        * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New.
          (costly_iter_for_doloop_p): New.
          (rs6000_predict_doloop_p): New.
        * gcc/expr.c (produce_memory_decl_rtl): New.
          (prepare_decl_rtl): New.
        * gcc/expr.h (produce_memory_decl_rtl): Declare.
          (prepare_decl_rtl): Declare.
        * gcc/tree-ssa-loop-ivopts.c (tree_ssa_iv_optimize_loop): Call
          predict_doloop_p hook.
          (tailor_cmp_uses): New.
          (preserve_ivs_for_use): New.
          (computation_cost): Call refactored prepare_decl_rtl, consider
          zero cost iv use.
          (dump_use): Dump zero_cost_p field.
          (record_use): Init zero_cost_p field.
          (produce_memory_decl_rtl): Remove.
          (prepare_decl_rtl): Remove.

gcc/testsuite/ChangeLog

2019-04-24  Kewen Lin  <linkw@gcc.gnu.org>

        PR middle-end/80791
        * gcc.dg/tree-ssa/ivopts-lt.c : Adjust.
--- gcc/config/rs6000/rs6000.c | 174 ++++++++++++++++++++- gcc/doc/tm.texi | 8 + gcc/doc/tm.texi.in | 2 + gcc/expr.c | 91 +++++++++++ gcc/expr.h | 16 +- gcc/target.def | 9 ++ gcc/targhooks.c | 13 ++ gcc/targhooks.h | 1 + gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c | 7 +- gcc/tree-ssa-loop-ivopts.c | 247 +++++++++++++++++++----------- 10 files changed, 472 insertions(+), 96 deletions(-) diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c index a21f4f7..756f914 100644 --- a/gcc/config/rs6000/rs6000.c +++ b/gcc/config/rs6000/rs6000.c @@ -83,6 +83,9 @@ #include "tree-ssa-propagate.h" #include "tree-vrp.h" #include "tree-ssanames.h" +#include "tree-ssa-loop-niter.h" +#include "tree-cfg.h" +#include "tree-scalar-evolution.h" /* This file should be included last. */ #include "target-def.h" @@ -1914,6 +1917,9 @@ static const struct attribute_spec rs6000_attribute_table[] = #undef TARGET_CAN_USE_DOLOOP_P #define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost +#undef TARGET_PREDICT_DOLOOP_P +#define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p + #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv @@ -39436,7 +39442,173 @@ rs6000_mangle_decl_assembler_name (tree decl, tree id) return id; } - +/* Check whether there are some instructions preventing doloop transformation + inside loop body, mainly for instructions which are possible to kill CTR. + + Return true if some invalid insn exits, otherwise return false. */ + +static bool +invalid_insn_for_doloop_p (struct loop *loop) +{ + basic_block *body = get_loop_body (loop); + unsigned num_nodes = loop->num_nodes; + gimple_stmt_iterator gsi; + unsigned i; + + for (i = 0; i < num_nodes; i++) + for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi)) + { + gimple *stmt = gsi_stmt (gsi); + if (is_gimple_call (stmt) && !gimple_call_internal_p (stmt) + && !is_inexpensive_builtin (gimple_call_fndecl (stmt))) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + "predict doloop failure due to finding call.\n"); + return true; + } + if (computed_goto_p (stmt)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf ( + dump_file, + "predict doloop failure due to finding computed jump.\n"); + return true; + } + /* TODO: Now this hook is expected to be called in ivopts, which is + before switchlower1/switchlower2. Checking for SWITCH at this point + will eliminate some good candidates. But since there are only a few + cases having _a_ switch statement in the innermost loop, it can be a low + priority enhancement. */ + + if (gimple_code (stmt) == GIMPLE_SWITCH) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + "predict doloop failure due to finding switch.\n"); + return true; + } + } + + return false; +} + +/* Check whether number of iteration computation is too costly for doloop + transformation. It expands the gimple sequence to equivalent RTL insn + sequence, then evaluate the cost. + + Return true if it's costly, otherwise return false. 
*/ + +static bool +costly_iter_for_doloop_p (struct loop *loop, tree niters) +{ + tree type = TREE_TYPE (niters); + unsigned cost = 0, i; + tree obj; + bool speed = optimize_loop_for_speed_p (loop); + int regno = LAST_VIRTUAL_REGISTER + 1; + vec<tree> tvec; + tvec.create (20); + decl_rtl_data tdata; + tdata.regno = ®no; + tdata.treevec = &tvec; + walk_tree (&niters, prepare_decl_rtl, &tdata, NULL); + start_sequence (); + expand_expr (niters, NULL_RTX, TYPE_MODE (type), EXPAND_NORMAL); + rtx_insn *seq = get_insns (); + end_sequence (); + FOR_EACH_VEC_ELT (tvec, i, obj) + SET_DECL_RTL (obj, NULL_RTX); + tvec.release (); + + for (; seq; seq = NEXT_INSN (seq)) + { + if (!INSN_P (seq)) + continue; + rtx body = PATTERN (seq); + if (GET_CODE (body) == SET) + { + rtx set_val = XEXP (body, 1); + enum rtx_code code = GET_CODE (set_val); + enum rtx_class cls = GET_RTX_CLASS (code); + /* For now, we only consider these two RTX classes, to match what we + get in doloop_optimize, excluding operations like zero/sign extend. */ + if (cls == RTX_BIN_ARITH || cls == RTX_COMM_ARITH) + cost += set_src_cost (set_val, GET_MODE (set_val), speed); + } + } + unsigned max_cost + = COSTS_N_INSNS (PARAM_VALUE (PARAM_MAX_ITERATIONS_COMPUTATION_COST)); + if (cost > max_cost) + return true; + + return false; +} + +/* + Predict whether the given loop will be transformed in the RTL + doloop_optimize pass. This is for use by the IVOPTS middle-end pass. + Attempt to duplicate as many doloop_optimize checks as possible. + + Some RTL specific checks seems unable to be checked here, if any + new checks or easy checks _are_ missing here. */ + +static bool +rs6000_predict_doloop_p (struct loop *loop) +{ + gcc_assert (loop); + + /* On rs6000, targetm.can_use_doloop_p is actually + can_use_doloop_if_innermost. Just ensure it's innermost. */ + if (loop->inner != NULL) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "predict doloop failure due to no innermost.\n"); + return false; + } + + /* number_of_latch_executions is not so costly, so we don't use + number_of_iterations_exit for iteration description. */ + tree niters = number_of_latch_executions (loop); + if (niters == NULL_TREE || niters == chrec_dont_know) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "predict doloop failure due to unexpected niters.\n"); + return false; + } + + /* Similar to doloop_optimize, check whether iteration count too small + and not profitable. */ + HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop); + if (est_niter == -1) + est_niter = get_likely_max_loop_iterations_int (loop); + if (est_niter >= 0 && est_niter < 3) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + "predict doloop failure due to too few iterations (%d).\n", + (unsigned int) est_niter); + return false; + } + + /* Similar to doloop_optimize, check whether number of iterations too costly + to compute. */ + if (costly_iter_for_doloop_p (loop, niters)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + "predict doloop failure due to costly niter computation.\n"); + return false; + } + + /* Similar to doloop_optimize targetm.invalid_within_doloop, check for + CALL, tablejump, computed_jump. 
*/ + if (invalid_insn_for_doloop_p (loop)) + return false; + + return true; +} + struct gcc_target targetm = TARGET_INITIALIZER; #include "gt-rs6000.h" diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index 8c8978b..d3ad208 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -11603,6 +11603,14 @@ function version at run-time for a given set of function versions. body must be generated. @end deftypefn +@deftypefn {Target Hook} bool TARGET_PREDICT_DOLOOP_P (struct loop *@var{loop}) +Return true if we can predict it is possible to use low-overhead loops +for a particular loop. The parameter @var{loop} is a pointer to the loop +which is going to be checked. This target hook is required only when the +target supports low-overhead loops, and will help some earlier middle-end +passes to make some decisions. +@end deftypefn + @deftypefn {Target Hook} bool TARGET_CAN_USE_DOLOOP_P (const widest_int @var{&iterations}, const widest_int @var{&iterations_max}, unsigned int @var{loop_depth}, bool @var{entered_at_top}) Return true if it is possible to use low-overhead loops (@code{doloop_end} and @code{doloop_begin}) for a particular loop. @var{iterations} gives the diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index fe1194e..e047734 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -7942,6 +7942,8 @@ to by @var{ce_info}. @hook TARGET_GENERATE_VERSION_DISPATCHER_BODY +@hook TARGET_PREDICT_DOLOOP_P + @hook TARGET_CAN_USE_DOLOOP_P @hook TARGET_INVALID_WITHIN_DOLOOP diff --git a/gcc/expr.c b/gcc/expr.c index 9ff5e5f..1f2ad45 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -12539,3 +12539,94 @@ int_expr_size (tree exp) return tree_to_shwi (size); } + +/* Produce DECL_RTL for object obj so it looks like it is stored in memory. */ + +rtx +produce_memory_decl_rtl (tree obj, int *regno) +{ + addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (obj)); + machine_mode address_mode = targetm.addr_space.address_mode (as); + rtx x; + + gcc_assert (obj); + if (TREE_STATIC (obj) || DECL_EXTERNAL (obj)) + { + const char *name = IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (obj)); + x = gen_rtx_SYMBOL_REF (address_mode, name); + SET_SYMBOL_REF_DECL (x, obj); + x = gen_rtx_MEM (DECL_MODE (obj), x); + set_mem_addr_space (x, as); + targetm.encode_section_info (obj, x, true); + } + else + { + x = gen_raw_REG (address_mode, (*regno)++); + x = gen_rtx_MEM (DECL_MODE (obj), x); + set_mem_addr_space (x, as); + } + + return x; +} + +/* Prepares decl_rtl for variables referred in *EXPR_P. Callback for + walk_tree. DATA contains the actual fake register number. */ + +tree +prepare_decl_rtl (tree *expr_p, int *ws, void *data) +{ + tree obj = NULL_TREE; + rtx x = NULL_RTX; + decl_rtl_data *info = (decl_rtl_data *) data; + int *regno = info->regno; + vec<tree> *treevec = info->treevec; + + switch (TREE_CODE (*expr_p)) + { + case ADDR_EXPR: + for (expr_p = &TREE_OPERAND (*expr_p, 0); handled_component_p (*expr_p); + expr_p = &TREE_OPERAND (*expr_p, 0)) + continue; + obj = *expr_p; + if (DECL_P (obj) && HAS_RTL_P (obj) && !DECL_RTL_SET_P (obj)) + x = produce_memory_decl_rtl (obj, regno); + break; + + case SSA_NAME: + *ws = 0; + obj = SSA_NAME_VAR (*expr_p); + /* Defer handling of anonymous SSA_NAMEs to the expander. 
*/ + if (!obj) + return NULL_TREE; + if (!DECL_RTL_SET_P (obj)) + x = gen_raw_REG (DECL_MODE (obj), (*regno)++); + break; + + case VAR_DECL: + case PARM_DECL: + case RESULT_DECL: + *ws = 0; + obj = *expr_p; + + if (DECL_RTL_SET_P (obj)) + break; + + if (DECL_MODE (obj) == BLKmode) + x = produce_memory_decl_rtl (obj, regno); + else + x = gen_raw_REG (DECL_MODE (obj), (*regno)++); + + break; + + default: + break; + } + + if (x) + { + treevec->safe_push (obj); + SET_DECL_RTL (obj, x); + } + + return NULL_TREE; +} diff --git a/gcc/expr.h b/gcc/expr.h index 17c3962..b1894a6b 100644 --- a/gcc/expr.h +++ b/gcc/expr.h @@ -53,7 +53,21 @@ typedef struct separate_ops tree type; tree op0, op1, op2; } *sepops; - + +/* This structure is used to pass information to tree walker function + prepare_decl_rtl. */ +typedef struct data_for_decl_rtl +{ + int *regno; + vec<tree> *treevec; +} decl_rtl_data; + +/* Produce decl_rtl for object so it looks like it is stored in memory. */ +rtx produce_memory_decl_rtl (tree, int *); + +/* Prepares decl_rtl for variables referred. Callback for walk_tree. */ +tree prepare_decl_rtl (tree *, int *, void *); + /* This is run during target initialization to set up which modes can be used directly in memory and to initialize the block move optab. */ extern void init_expr_target (void); diff --git a/gcc/target.def b/gcc/target.def index 66cee07..2f28e80 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -4227,6 +4227,15 @@ DEFHOOK rtx, (machine_mode mode, rtx result, rtx val, rtx failval), default_speculation_safe_value) +DEFHOOK +(predict_doloop_p, + "Return true if we can predict it is possible to use low-overhead loops\n\ +for a particular loop. The parameter @var{loop} is a pointer to the loop\n\ +which is going to be checked. This target hook is required only when the\n\ +target supports low-overhead loops, and will help some earlier middle-end\n\ +passes to make some decisions.", + bool, (struct loop *loop), + default_predict_doloop_p) DEFHOOK (can_use_doloop_p, diff --git a/gcc/targhooks.c b/gcc/targhooks.c index 318f7e9..56ed4cc 100644 --- a/gcc/targhooks.c +++ b/gcc/targhooks.c @@ -643,6 +643,19 @@ default_has_ifunc_p (void) return HAVE_GNU_INDIRECT_FUNCTION; } +/* True if we can predict this loop is possible to be transformed to a + low-overhead loop, otherwise returns false. + + By default, false is returned, as this hook's applicability should be + verified for each target. Target maintainers should re-define the hook + if the target can take advantage of it. */ + +bool +default_predict_doloop_p (struct loop *loop ATTRIBUTE_UNUSED) +{ + return false; +} + /* NULL if INSN insn is valid within a low-overhead loop, otherwise returns an error message. 
diff --git a/gcc/targhooks.h b/gcc/targhooks.h index 5943627..70bfb17 100644 --- a/gcc/targhooks.h +++ b/gcc/targhooks.h @@ -85,6 +85,7 @@ extern bool default_fixed_point_supported_p (void); extern bool default_has_ifunc_p (void); +extern bool default_predict_doloop_p (struct loop *); extern const char * default_invalid_within_doloop (const rtx_insn *); extern tree default_builtin_vectorized_function (unsigned int, tree, tree); diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c index 171c85a..f61288c 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c @@ -17,6 +17,7 @@ f1 (char *p, uintptr_t i, uintptr_t n) while (i < n); } -/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" } } */ -/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts"} } */ -/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" } } */ +/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" { target { ! powerpc*-*-* } } } } */ +/* { dg-final { scan-tree-dump-times "PHI" 2 "ivopts" { target { powerpc*-*-* } } } } */ +/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts" { target { ! powerpc*-*-* } } } } */ +/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" { target { ! powerpc*-*-* } } } } */ diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c index a44b4cb..148a48d 100644 --- a/gcc/tree-ssa-loop-ivopts.c +++ b/gcc/tree-ssa-loop-ivopts.c @@ -109,6 +109,7 @@ along with GCC; see the file COPYING3. If not see #include "builtins.h" #include "tree-vectorizer.h" + /* FIXME: Expressions are expanded to RTL in this pass to determine the cost of different addressing modes. This should be moved to a TBD interface between the GIMPLE and RTL worlds. */ @@ -376,6 +377,7 @@ struct iv_use tree addr_base; /* Base address with const offset stripped. */ poly_uint64_pod addr_offset; /* Const offset stripped from base address. */ + bool zero_cost_p; /* This iv use takes zero cost, only compare type. */ }; /* Group of uses. */ @@ -691,6 +693,8 @@ static vec<tree> decl_rtl_to_reset; static comp_cost force_expr_to_var_cost (tree, bool); +static void preserve_ivs_for_use (struct ivopts_data*, struct iv_use*); + /* The single loop exit if it dominates the latch, NULL otherwise. */ edge @@ -765,6 +769,8 @@ dump_use (FILE *file, struct iv_use *use) fprintf (file, " At pos:\t"); if (use->op_p) print_generic_expr (file, *use->op_p, TDF_SLIM); + if (use->zero_cost_p) + fprintf (file, "\n Zero cost"); fprintf (file, "\n"); dump_iv (file, use->iv, false, 2); } @@ -1549,6 +1555,7 @@ record_use (struct iv_group *group, tree *use_p, struct iv *iv, use->op_p = use_p; use->addr_base = addr_base; use->addr_offset = addr_offset; + use->zero_cost_p = false; group->vuses.safe_push (use); return use; @@ -3687,94 +3694,6 @@ get_group_iv_cost (struct ivopts_data *data, struct iv_group *group, return NULL; } -/* Produce DECL_RTL for object obj so it looks like it is stored in memory. 
*/ -static rtx -produce_memory_decl_rtl (tree obj, int *regno) -{ - addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (obj)); - machine_mode address_mode = targetm.addr_space.address_mode (as); - rtx x; - - gcc_assert (obj); - if (TREE_STATIC (obj) || DECL_EXTERNAL (obj)) - { - const char *name = IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (obj)); - x = gen_rtx_SYMBOL_REF (address_mode, name); - SET_SYMBOL_REF_DECL (x, obj); - x = gen_rtx_MEM (DECL_MODE (obj), x); - set_mem_addr_space (x, as); - targetm.encode_section_info (obj, x, true); - } - else - { - x = gen_raw_REG (address_mode, (*regno)++); - x = gen_rtx_MEM (DECL_MODE (obj), x); - set_mem_addr_space (x, as); - } - - return x; -} - -/* Prepares decl_rtl for variables referred in *EXPR_P. Callback for - walk_tree. DATA contains the actual fake register number. */ - -static tree -prepare_decl_rtl (tree *expr_p, int *ws, void *data) -{ - tree obj = NULL_TREE; - rtx x = NULL_RTX; - int *regno = (int *) data; - - switch (TREE_CODE (*expr_p)) - { - case ADDR_EXPR: - for (expr_p = &TREE_OPERAND (*expr_p, 0); - handled_component_p (*expr_p); - expr_p = &TREE_OPERAND (*expr_p, 0)) - continue; - obj = *expr_p; - if (DECL_P (obj) && HAS_RTL_P (obj) && !DECL_RTL_SET_P (obj)) - x = produce_memory_decl_rtl (obj, regno); - break; - - case SSA_NAME: - *ws = 0; - obj = SSA_NAME_VAR (*expr_p); - /* Defer handling of anonymous SSA_NAMEs to the expander. */ - if (!obj) - return NULL_TREE; - if (!DECL_RTL_SET_P (obj)) - x = gen_raw_REG (DECL_MODE (obj), (*regno)++); - break; - - case VAR_DECL: - case PARM_DECL: - case RESULT_DECL: - *ws = 0; - obj = *expr_p; - - if (DECL_RTL_SET_P (obj)) - break; - - if (DECL_MODE (obj) == BLKmode) - x = produce_memory_decl_rtl (obj, regno); - else - x = gen_raw_REG (DECL_MODE (obj), (*regno)++); - - break; - - default: - break; - } - - if (x) - { - decl_rtl_to_reset.safe_push (obj); - SET_DECL_RTL (obj, x); - } - - return NULL_TREE; -} /* Determines cost of the computation of EXPR. */ @@ -3792,7 +3711,10 @@ computation_cost (tree expr, bool speed) node->frequency = NODE_FREQUENCY_NORMAL; crtl->maybe_hot_insn_p = speed; - walk_tree (&expr, prepare_decl_rtl, ®no, NULL); + decl_rtl_data data; + data.regno = ®no; + data.treevec = &decl_rtl_to_reset; + walk_tree (&expr, prepare_decl_rtl, &data, NULL); start_sequence (); rslt = expand_expr (expr, NULL_RTX, TYPE_MODE (type), EXPAND_NORMAL); seq = get_insns (); @@ -5284,6 +5206,8 @@ determine_group_iv_cost_cond (struct ivopts_data *data, inv_exprs = BITMAP_ALLOC (NULL); bitmap_set_bit (inv_exprs, inv_expr->id); } + if (use->zero_cost_p) + cost.cost = 0; set_group_iv_cost (data, group, cand, cost, inv_vars, bound, comp, inv_exprs); @@ -7202,6 +7126,11 @@ rewrite_use_compare (struct ivopts_data *data, return; } + if (use->zero_cost_p) { + preserve_ivs_for_use(data, use); + return; + } + /* The induction variable elimination failed; just express the original giv. */ comp = get_computation_at (data->current_loop, use->stmt, use, cand); @@ -7248,8 +7177,8 @@ rewrite_groups (struct ivopts_data *data) for (j = 0; j < group->vuses.length (); j++) { - rewrite_use_compare (data, group->vuses[j], cand); - update_stmt (group->vuses[j]->stmt); + rewrite_use_compare (data, group->vuses[j], cand); + update_stmt (group->vuses[j]->stmt); } } } @@ -7527,12 +7456,134 @@ loop_body_includes_call (basic_block *body, unsigned num_nodes) return false; } +/* For a comparison iv_use that we've chosen to ignore, we have to ensure + that its dependent IVs are preserved here. 
Otherwise remove_unused_ivs + will remove those IVs, leading this use to have no feeding IV. */ + +static void +preserve_ivs_for_use (struct ivopts_data *data, struct iv_use *use) +{ + struct loop *loop = data->current_loop; + struct iv *iv = use->iv; + auto_vec<tree> work_list; + work_list.safe_push (iv->ssa_name); + bitmap visited = BITMAP_ALLOC (NULL); + + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "Preserving IVs: "); + + while (!work_list.is_empty ()) + { + tree op = work_list.pop (); + gcc_assert (TREE_CODE (op) == SSA_NAME); + gcc_assert (name_info (data, op)); + if (bitmap_bit_p (visited, SSA_NAME_VERSION (op))) + continue; + + struct version_info *vi = name_info (data, op); + vi->preserve_biv = true; + bitmap_set_bit (visited, SSA_NAME_VERSION (iv->ssa_name)); + if (dump_file && (dump_flags & TDF_DETAILS)) + { + print_generic_expr (dump_file, op, TDF_DETAILS); + fprintf (dump_file, " "); + } + + gimple *def = SSA_NAME_DEF_STMT (op); + /* Ensure the incr_iv preserved as well, for some case + like: + i2 = phi(i0, i1); + i1 = i2 + ...; + cmp i2 ... ; + */ + if (gimple_code (def) == GIMPLE_PHI && gimple_bb (def) == loop->header) + { + tree incr_iv = PHI_ARG_DEF_FROM_EDGE (def, loop_latch_edge (loop)); + if (bitmap_bit_p (visited, SSA_NAME_VERSION (incr_iv))) + continue; + name_info (data, incr_iv)->preserve_biv = true; + if (dump_file && (dump_flags & TDF_DETAILS)) + { + print_generic_expr (dump_file, incr_iv, TDF_DETAILS); + fprintf (dump_file, " "); + } + continue; + } + + /* Only need to check assign for iv chain. */ + if (gimple_code (def) != GIMPLE_ASSIGN) + continue; + + ssa_op_iter iter; + use_operand_p use_p; + tree use_op; + struct iv *use_iv; + FOR_EACH_PHI_OR_STMT_USE (use_p, def, iter, SSA_OP_USE) + { + use_op = USE_FROM_PTR (use_p); + if ((use_iv = name_info (data, use_op)->iv) + && !integer_zerop (use_iv->step)) + work_list.safe_push (use_op); + } + } + + BITMAP_FREE (visited); + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "\n\n"); +} + +/* Doloop optimization RTL pass can make the related comparison computation + become dead and get eliminated, then these comparison IV uses should NOT + be considered in optimal IVs determination, mark them as ignored and taken + as zero cost in cost modeling. */ + +static void +tailor_cmp_uses (struct ivopts_data *data) +{ + unsigned i; + struct iv_group *group; + struct loop *loop = data->current_loop; + + for (i = 0; i < data->vgroups.length (); i++) + { + group = data->vgroups[i]; + if (group->type == USE_COMPARE) + { + gcc_assert (group->vuses.length () == 1); + struct iv_use *use = group->vuses[0]; + gimple *stmt = use->stmt; + if (gimple_code (stmt) == GIMPLE_COND) + { + basic_block bb = gimple_bb (stmt); + edge true_edge, false_edge; + extract_true_false_edges_from_block (bb, &true_edge, &false_edge); + + /* This comparison is used for loop latch. Require latch is empty + for now. */ + if ((loop->latch == true_edge->dest + || loop->latch == false_edge->dest) + && gsi_end_p (gsi_last_bb (loop->latch))) + { + use->zero_cost_p = true; + /* Dump for this use. */ + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, "Zeroing cost for use: "); + print_gimple_stmt (dump_file, stmt, TDF_DETAILS); + } + } + } + } + } +} + /* Optimizes the LOOP. Returns true if anything changed. 
*/ static bool tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop) { bool changed = false; + bool tailor_cmp_p = false; struct iv_ca *iv_ca; edge exit = single_dom_exit (loop); basic_block *body; @@ -7578,6 +7629,20 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop) if (data->vgroups.length () > MAX_CONSIDERED_GROUPS) goto finish; + tailor_cmp_p = targetm.predict_doloop_p (loop); + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, + "Predict loop %d can %sperform doloop optimization later.\n", + loop->num, tailor_cmp_p ? "" : "not "); + flow_loop_dump (loop, dump_file, NULL, 1); + } + + /* Some compare iv_use is probably useless once the doloop optimization + performs. */ + if (tailor_cmp_p) + tailor_cmp_uses (data); + /* Finds candidates for the induction variables (item 2). */ find_iv_candidates (data); -- 2.7.4 ^ permalink raw reply [flat|nested] 22+ messages in thread
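For readers skimming the thread, a minimal, purely illustrative example of the kind of loop involved (not taken from the patch, the PR, or the failing test): on a target such as Power, doloop_optimize can turn the exit test of a counted loop into a decrement of the count register plus a bdnz branch, so the compare-type IV use that IVOPTs sees on GIMPLE ends up dead after the RTL transformation.  The hook above is meant to let IVOPTs anticipate that and treat such a use as (close to) free.

/* Illustrative only; the function and variable names are made up.  */
void
clear_elements (int *a, unsigned long n)
{
  /* "i < n" is the compare-type IV use; after doloop_optimize the
     comparison is replaced by a hardware count-down (e.g. bdnz).  */
  for (unsigned long i = 0; i < n; i++)
    a[i] = 0;
}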
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-24 8:49 [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts Kewen.Lin @ 2019-04-24 9:08 ` Jakub Jelinek 2019-04-24 9:25 ` Kewen.Lin 2019-04-25 12:45 ` Segher Boessenkool 2019-04-27 3:45 ` Bin.Cheng 2 siblings, 1 reply; 22+ messages in thread From: Jakub Jelinek @ 2019-04-24 9:08 UTC (permalink / raw) To: Kewen.Lin Cc: GCC Patches, bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther On Wed, Apr 24, 2019 at 04:41:01PM +0800, Kewen.Lin wrote: > gcc/ChangeLog Not a review, just ChangeLog nits. > 2019-04-24 Kewen Lin <linkw@gcc.gnu.org> > > PR middle-end/80791 > * target.def (predict_doloop_p): New. > * targhooks.h (default_predict_doloop_p): New. > * targhooks.c (default_predict_doloop_p): Likewise. > * doc/tm.texi.in (TARGET_PREDICT_DOLOOP_P): New. > * doc/tm.texi: Regenerate. > * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New. > (costly_iter_for_doloop_p): New. > (rs6000_predict_doloop_p): New. There should be no leading spaces after the tab on the lines that don't contain a filename, so: * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New. (costly_iter_for_doloop_p): New. (rs6000_predict_doloop_p): New. instead (and instead of saying just New. usually we write what kind of new thing it is, so New function., New declaration. (or Declare.), New macro. (or Define.), New method. etc. > * gcc/expr.c (produce_memory_decl_rtl): New. The gcc/ prefixes don't belong into gcc/ChangeLog, the filenames are always relative to the ChangeLog file referencing those. Jakub ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-24 9:08 ` Jakub Jelinek @ 2019-04-24 9:25 ` Kewen.Lin 0 siblings, 0 replies; 22+ messages in thread From: Kewen.Lin @ 2019-04-24 9:25 UTC (permalink / raw) To: Jakub Jelinek; +Cc: GCC Patches Thanks for catching, Jakub! The update listed below, will integrate it with next revision patch. gcc/ChangeLog 2019-04-24 Kewen Lin <linkw@gcc.gnu.org> PR middle-end/80791 * target.def (predict_doloop_p): New hook. * targhooks.h (default_predict_doloop_p): New declaration. * targhooks.c (default_predict_doloop_p): New function. * doc/tm.texi.in (TARGET_PREDICT_DOLOOP_P): New hook. * doc/tm.texi: Regenerate. * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New function. (costly_iter_for_doloop_p): Likewise. (rs6000_predict_doloop_p): Likewise. (TARGET_PREDICT_DOLOOP_P): New macro. * expr.c (produce_memory_decl_rtl): New function. (prepare_decl_rtl): Likewise. * expr.h (produce_memory_decl_rtl): New declaration. (prepare_decl_rtl): Likewise. * tree-ssa-loop-ivopts.c (tree_ssa_iv_optimize_loop): Call predict_doloop_p hook. (tailor_cmp_uses): New function. (preserve_ivs_for_use): New function. (computation_cost): Call refactored prepare_decl_rtl, consider zero cost iv use. (dump_use): Dump zero_cost_p field. (record_use): Init zero_cost_p field. (produce_memory_decl_rtl): Remove. (prepare_decl_rtl): Remove. gcc/testsuite/ChangeLog 2019-04-24 Kewen Lin <linkw@gcc.gnu.org> PR middle-end/80791 * gcc.dg/tree-ssa/ivopts-lt.c : Adjust. on 2019/4/24 脧脗脦莽4:48, Jakub Jelinek wrote: > On Wed, Apr 24, 2019 at 04:41:01PM +0800, Kewen.Lin wrote: >> gcc/ChangeLog > > Not a review, just ChangeLog nits. > >> 2019-04-24 Kewen Lin <linkw@gcc.gnu.org> >> >> PR middle-end/80791 >> * target.def (predict_doloop_p): New. >> * targhooks.h (default_predict_doloop_p): New. >> * targhooks.c (default_predict_doloop_p): Likewise. >> * doc/tm.texi.in (TARGET_PREDICT_DOLOOP_P): New. >> * doc/tm.texi: Regenerate. >> * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New. >> (costly_iter_for_doloop_p): New. >> (rs6000_predict_doloop_p): New. > > There should be no leading spaces after the tab on the lines that don't contain > a filename, so: > * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New. > (costly_iter_for_doloop_p): New. > (rs6000_predict_doloop_p): New. > instead (and instead of saying just New. usually we write what kind of > new thing it is, so New function., New declaration. (or Declare.), New > macro. (or Define.), New method. etc. > >> * gcc/expr.c (produce_memory_decl_rtl): New. > > The gcc/ prefixes don't belong into gcc/ChangeLog, the filenames are always > relative to the ChangeLog file referencing those. > > Jakub > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-24 8:49 [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts Kewen.Lin 2019-04-24 9:08 ` Jakub Jelinek @ 2019-04-25 12:45 ` Segher Boessenkool 2019-04-26 6:58 ` Kewen.Lin 2019-04-27 3:45 ` Bin.Cheng 2 siblings, 1 reply; 22+ messages in thread From: Segher Boessenkool @ 2019-04-25 12:45 UTC (permalink / raw) To: Kewen.Lin; +Cc: GCC Patches, bin.cheng, Bill Schmidt, Richard Guenther, jakub Hi Kewen, Cool stuff. On Wed, Apr 24, 2019 at 04:41:01PM +0800, Kewen.Lin wrote: > One failure was found in regression testing, which is: > gcc.target/powerpc/20050830-1.c > > I did some investigation and found: with this patch, one new iv cand > is chosen for exit condition, which is used to rewrite the compare > type use. Later loop iv analysis in RTL can NOT determine the loop > iteration number is finite, which causes doloop_optimize not transform > the loop. The check doesn't find the expected pattern. Does it create worse code now? What we have before your patch isn't so super either (it has an sldi in the loop, it has two mtctr too). Maybe you can show the generated code? > Btw, this is for GCC10. *Phew* ;-) Some trivial comments: > +static bool > +invalid_insn_for_doloop_p (struct loop *loop) > +{ > + basic_block *body = get_loop_body (loop); > + unsigned num_nodes = loop->num_nodes; > + gimple_stmt_iterator gsi; > + unsigned i; > + > + for (i = 0; i < num_nodes; i++) for (unsigned i = 0; i < num_nodes; i++) (and maybe you can just say loop->num_nodes here; I don't know if that generates worse code, or if that even matters). > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf ( > + dump_file, > + "predict doloop failure due to finding computed jump.\n"); We don't normally end lines in (. There are other solutions to why you did that here; you can use string pasting, to break the string into two, or factor out (some part of) the loop body to reduce indentation. Segher ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-25 12:45 ` Segher Boessenkool @ 2019-04-26 6:58 ` Kewen.Lin 2019-04-26 7:59 ` Richard Biener ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Kewen.Lin @ 2019-04-26 6:58 UTC (permalink / raw) To: Segher Boessenkool Cc: GCC Patches, bin.cheng, Bill Schmidt, Richard Guenther, jakub Hi Segher, Thanks a lot for your comments! on 2019/4/25 脧脗脦莽8:16, Segher Boessenkool wrote: > Does it create worse code now? What we have before your patch isn't > so super either (it has an sldi in the loop, it has two mtctr too). > Maybe you can show the generated code? It's a good question! From the generated codes for the core loop, the code after my patch doesn't have bdnz to leverage hardware CTR, it has extra cmpld and branch instead, looks worse. But I wrote a tiny case to invoke the foo and evaluated the running time, they are equal. * Measured time: After: real 199.47 user 198.35 sys 1.11 Before: real 199.19 user 198.56 sys 0.62 * Command I used to compile: ${local_gcc} ${case_src}/20050830-1.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -ffat-lto-objects -fno-ident -c * main file (main.c): extern void foo(); #define LOOPNUM 10000 #define N 1000000*256 int a[N]; int main() { for(int i = 0; i < LOOPNUM; i++) { foo(N); } } * Generated code sequence: Before my patch: cmpwi 0,3,511 blelr 0 addi 10,3,-256 addi 9,3,-512 addis 8,2,.LC0@toc@ha ld 8,.LC0@toc@l(8) cmpwi 0,10,256 rldicl 9,9,56,40 sldi 3,3,2 addi 9,9,1 mtctr 9 add 8,3,8 li 10,42 blt 0,.L7 # jump to L7 it's less than 256 .L3: # core loop stw 10,0(8) addi 8,8,-1024 bdnz .L3 blr .L7: li 9,1 # special case, iteration only 1 mtctr 9 b .L3 After my patch: cmpwi 0,3,511 blelr 0 addis 7,2,.LC0@toc@ha ld 7,.LC0@toc@l(7) addi 10,3,-512 sldi 9,3,2 rlwinm 10,10,0,0,23 li 8,42 subf 10,10,3 sldi 10,10,2 addi 6,7,-1024 add 9,9,7 add 10,10,6 .p2align 4,,15 .L3: # core loop stw 8,0(9) addi 9,9,-1024 cmpld 0,9,10 # cmp beqlr 0 # if eq, return stw 8,0(9) addi 9,9,-1024 cmpld 0,9,10 # cmp again bne 0,.L3 # if ne, jump to L3. blr ------------------------ I practiced whether we can adjust the decision made in ivopts. For one compare iv use with zero cost, if one iv cand whose base is from memory has costly elim_cost before adjust_setup_cost, it's possible to make later analysis unable to find it's finite, then we try to avoid it. The diff looks like: --- a/gcc/tree-ssa-loop-ivopts.c +++ b/gcc/tree-ssa-loop-ivopts.c @@ -5126,6 +5126,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, tree *control_var, *bound_cst; enum tree_code comp = ERROR_MARK; struct iv_use *use = group->vuses[0]; + bool dont_elim_p = false; /* Extract condition operands. */ rewrite_type = extract_cond_operands (data, use->stmt, &control_var, @@ -5152,6 +5153,16 @@ determine_group_iv_cost_cond (struct ivopts_data *data, inv_expr_elim = get_loop_invariant_expr (data, bound); bitmap_clear (inv_vars_elim); } + + /* zero cost use makes it easier to select memory based iv cand + for replacement of non memory based iv and its use. But if + the setup sequence are too costly, loop iv analysis can NOT + easily figure out it's finite, it's possible to stop the + low-overhead loop transformation and get unexpected code. */ + if (use->zero_cost_p && cand->iv->base_object && !use->iv->base_object + && elim_cost.cost >= 30) + dont_elim_p = true; + /* The bound is a loop invariant, so it will be only computed once. 
*/ elim_cost.cost = adjust_setup_cost (data, elim_cost.cost); @@ -5184,7 +5195,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, express_cost += bound_cost; /* Choose the better approach, preferring the eliminated IV. */ - if (elim_cost <= express_cost) + if (elim_cost <= express_cost && !dont_elim_p) { I was thinking whether this zero cost change just exposed an existing problem, then this kind of check should be for all cases not only for zero cost use, similar to expression_expensive_p? But why doesn't it happen before? Need more investigation. > >> Btw, this is for GCC10. > > *Phew* ;-) > > > Some trivial comments: > >> +static bool >> +invalid_insn_for_doloop_p (struct loop *loop) >> +{ >> + basic_block *body = get_loop_body (loop); >> + unsigned num_nodes = loop->num_nodes; >> + gimple_stmt_iterator gsi; >> + unsigned i; >> + >> + for (i = 0; i < num_nodes; i++) > > for (unsigned i = 0; i < num_nodes; i++) > > (and maybe you can just say loop->num_nodes here; I don't know if that > generates worse code, or if that even matters). Good idea, will fix it. > >> + if (dump_file && (dump_flags & TDF_DETAILS)) >> + fprintf ( >> + dump_file, >> + "predict doloop failure due to finding computed jump.\n"); > > We don't normally end lines in (. There are other solutions to why you > did that here; you can use string pasting, to break the string into two, > or factor out (some part of) the loop body to reduce indentation. > Will adjust it. > > Segher > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 6:58 ` Kewen.Lin @ 2019-04-26 7:59 ` Richard Biener 2019-04-26 14:32 ` Kewen.Lin 2019-04-26 16:44 ` Segher Boessenkool 2019-04-27 4:13 ` Bin.Cheng 2 siblings, 1 reply; 22+ messages in thread From: Richard Biener @ 2019-04-26 7:59 UTC (permalink / raw) To: Kewen.Lin; +Cc: Segher Boessenkool, GCC Patches, bin.cheng, Bill Schmidt, jakub [-- Attachment #1: Type: text/plain, Size: 6318 bytes --] On Fri, 26 Apr 2019, Kewen.Lin wrote: > Hi Segher, > > Thanks a lot for your comments! > > on 2019/4/25 ä¸å8:16, Segher Boessenkool wrote: > > > Does it create worse code now? What we have before your patch isn't > > so super either (it has an sldi in the loop, it has two mtctr too). > > Maybe you can show the generated code? > > It's a good question! From the generated codes for the core loop, the > code after my patch doesn't have bdnz to leverage hardware CTR, it has > extra cmpld and branch instead, looks worse. But I wrote a tiny case > to invoke the foo and evaluated the running time, they are equal. > > * Measured time: > After: > real 199.47 > user 198.35 > sys 1.11 > Before: > real 199.19 > user 198.56 > sys 0.62 > > * Command I used to compile: > ${local_gcc} ${case_src}/20050830-1.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -ffat-lto-objects -fno-ident -c > > * main file (main.c): > extern void foo(); > #define LOOPNUM 10000 > #define N 1000000*256 > int a[N]; > int main() { > for(int i = 0; i < LOOPNUM; i++) { > foo(N); > } > } > > > * Generated code sequence: > > Before my patch: > > cmpwi 0,3,511 > blelr 0 > addi 10,3,-256 > addi 9,3,-512 > addis 8,2,.LC0@toc@ha > ld 8,.LC0@toc@l(8) > cmpwi 0,10,256 > rldicl 9,9,56,40 > sldi 3,3,2 > addi 9,9,1 > mtctr 9 > add 8,3,8 > li 10,42 > blt 0,.L7 # jump to L7 it's less than 256 > .L3: # core loop > stw 10,0(8) > addi 8,8,-1024 > bdnz .L3 > blr > .L7: > li 9,1 # special case, iteration only 1 > mtctr 9 > b .L3 > > After my patch: > > cmpwi 0,3,511 > blelr 0 > addis 7,2,.LC0@toc@ha > ld 7,.LC0@toc@l(7) > addi 10,3,-512 > sldi 9,3,2 > rlwinm 10,10,0,0,23 > li 8,42 > subf 10,10,3 > sldi 10,10,2 > addi 6,7,-1024 > add 9,9,7 > add 10,10,6 > .p2align 4,,15 > .L3: # core loop > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp > beqlr 0 # if eq, return > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp again > bne 0,.L3 # if ne, jump to L3. > blr > > ------------------------ > > I practiced whether we can adjust the decision made in ivopts. > For one compare iv use with zero cost, if one iv cand whose base > is from memory has costly elim_cost before adjust_setup_cost, > it's possible to make later analysis unable to find it's finite, > then we try to avoid it. > > The diff looks like: > > --- a/gcc/tree-ssa-loop-ivopts.c > +++ b/gcc/tree-ssa-loop-ivopts.c > @@ -5126,6 +5126,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > tree *control_var, *bound_cst; > enum tree_code comp = ERROR_MARK; > struct iv_use *use = group->vuses[0]; > + bool dont_elim_p = false; > > /* Extract condition operands. */ > rewrite_type = extract_cond_operands (data, use->stmt, &control_var, > @@ -5152,6 +5153,16 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > inv_expr_elim = get_loop_invariant_expr (data, bound); > bitmap_clear (inv_vars_elim); > } > + > + /* zero cost use makes it easier to select memory based iv cand > + for replacement of non memory based iv and its use. 
But if > + the setup sequence are too costly, loop iv analysis can NOT > + easily figure out it's finite, it's possible to stop the > + low-overhead loop transformation and get unexpected code. */ > + if (use->zero_cost_p && cand->iv->base_object && !use->iv->base_object > + && elim_cost.cost >= 30) > + dont_elim_p = true; > + > /* The bound is a loop invariant, so it will be only computed > once. */ > elim_cost.cost = adjust_setup_cost (data, elim_cost.cost); > @@ -5184,7 +5195,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > express_cost += bound_cost; > > /* Choose the better approach, preferring the eliminated IV. */ > - if (elim_cost <= express_cost) > + if (elim_cost <= express_cost && !dont_elim_p) > { > > > I was thinking whether this zero cost change just exposed > an existing problem, then this kind of check should be for all > cases not only for zero cost use, similar to > expression_expensive_p? But why doesn't it happen before? > Need more investigation. We should think about possible ways of encoding doloop at IVOPTs time rather than trying to re-analyze at RTL. I suppose we can easily set a flag in struct loop but I'm not familiar enough with the doloop pass to tell whether that is good enough. Also we can maybe move doloop to GIMPLE completely? I remember some targets have loop size constraints here as well, so I guess that isn't easily possible. Richard. > > > >> Btw, this is for GCC10. > > > > *Phew* ;-) > > > > > > Some trivial comments: > > > >> +static bool > >> +invalid_insn_for_doloop_p (struct loop *loop) > >> +{ > >> + basic_block *body = get_loop_body (loop); > >> + unsigned num_nodes = loop->num_nodes; > >> + gimple_stmt_iterator gsi; > >> + unsigned i; > >> + > >> + for (i = 0; i < num_nodes; i++) > > > > for (unsigned i = 0; i < num_nodes; i++) > > > > (and maybe you can just say loop->num_nodes here; I don't know if that > > generates worse code, or if that even matters). > > Good idea, will fix it. > > > > >> + if (dump_file && (dump_flags & TDF_DETAILS)) > >> + fprintf ( > >> + dump_file, > >> + "predict doloop failure due to finding computed jump.\n"); > > > > We don't normally end lines in (. There are other solutions to why you > > did that here; you can use string pasting, to break the string into two, > > or factor out (some part of) the loop body to reduce indentation. > > > > Will adjust it. > > > > > Segher > > > > -- Richard Biener <rguenther@suse.de> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 7:59 ` Richard Biener @ 2019-04-26 14:32 ` Kewen.Lin 2019-04-26 17:06 ` Segher Boessenkool 0 siblings, 1 reply; 22+ messages in thread From: Kewen.Lin @ 2019-04-26 14:32 UTC (permalink / raw) To: Richard Biener Cc: Segher Boessenkool, GCC Patches, bin.cheng, Bill Schmidt, jakub on 2019/4/26 ä¸å3:16, Richard Biener wrote: > On Fri, 26 Apr 2019, Kewen.Lin wrote: > >> I was thinking whether this zero cost change just exposed >> an existing problem, then this kind of check should be for all >> cases not only for zero cost use, similar to >> expression_expensive_p? But why doesn't it happen before? >> Need more investigation. > > We should think about possible ways of encoding doloop at IVOPTs > time rather than trying to re-analyze at RTL. I suppose we can > easily set a flag in struct loop but I'm not familiar enough with > the doloop pass to tell whether that is good enough. Also we > can maybe move doloop to GIMPLE completely? I remember some > targets have loop size constraints here as well, so I guess > that isn't easily possible. > > Richard. Thanks Richard! It's a very good point, but I'm afraid that it's impossible to move the whole doloop analysis pass to GIMPLE. When I implemented this hook rs6000_predict_doloop_p, I went through all the checks in doloop_optimize. I found it looks impossible to imitate all of them at GIMPLE, especially for gen_doloop_end check, jump insn check and register liveness clobber check. Even if we can make hook to check expanded RTL insn sequence in GIMPLE, but it happens too early, some information may not be exact enough, many following passes could update what the analysis is based on. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 14:32 ` Kewen.Lin @ 2019-04-26 17:06 ` Segher Boessenkool 2019-04-26 18:23 ` Richard Biener 2019-05-05 3:42 ` Kewen.Lin 0 siblings, 2 replies; 22+ messages in thread From: Segher Boessenkool @ 2019-04-26 17:06 UTC (permalink / raw) To: Kewen.Lin; +Cc: Richard Biener, GCC Patches, bin.cheng, Bill Schmidt, jakub On Fri, Apr 26, 2019 at 10:26:52PM +0800, Kewen.Lin wrote: > on 2019/4/26 ä¸å3:16, Richard Biener wrote: > > We should think about possible ways of encoding doloop at IVOPTs > > time rather than trying to re-analyze at RTL. I suppose we can > > easily set a flag in struct loop but I'm not familiar enough with > > the doloop pass to tell whether that is good enough. Also we > > can maybe move doloop to GIMPLE completely? I remember some > > targets have loop size constraints here as well, so I guess > > that isn't easily possible. > It's a very good point, but I'm afraid that it's impossible to move the > whole doloop analysis pass to GIMPLE. When I implemented this hook > rs6000_predict_doloop_p, I went through all the checks in doloop_optimize. > I found it looks impossible to imitate all of them at GIMPLE, especially > for gen_doloop_end check, jump insn check and register liveness clobber > check. Even if we can make hook to check expanded RTL insn sequence in > GIMPLE, but it happens too early, some information may not be exact enough, > many following passes could update what the analysis is based on. But you could set a flag in gimple, and have the RTL pass only do the final mechanics of making things a counted loop -- including generating the code for when it cannot do the doloop, which indeed will happen for many different reasons, many target-specific, but it is probably pretty easy to predict when we *can* use one, and we can do that optimistically, it's not so very much worse code to have to do it with a few instructions instead of e.g. a bdnz; there are many optimisation passes after this that can still improve the code (cprop, cse, combine). So, make it a doloop in gimple, and still have the rtl pass, but that only reverts it to a non-doloop if it turns out it has to. Does that sound like a good plan? Segher ^ permalink raw reply [flat|nested] 22+ messages in thread
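To make the proposed division of labor concrete, a rough sketch in pseudo-GCC terms (predict_doloop_p is the hook added by this patch; the predicted_doloop flag and the other helpers are hypothetical names that do not exist in GCC):

/* Hypothetical sketch only.  */

/* At GIMPLE time (e.g. in IVOPTs): optimistically mark the loop when
   the target predicts it can become a low-overhead loop.  */
if (targetm.predict_doloop_p (loop))
  loop->predicted_doloop = 1;    /* assumed new flag in struct loop */

/* In the RTL doloop pass: only do the final mechanics, and fall back
   to an ordinary compare-and-branch exit if some target-specific
   check fails after all.  */
if (loop->predicted_doloop && !rtl_doloop_checks_pass_p (loop))
  revert_to_plain_exit_test (loop);  /* assumed helpers, illustration only */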
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 17:06 ` Segher Boessenkool @ 2019-04-26 18:23 ` Richard Biener 2019-04-26 18:58 ` Segher Boessenkool 2019-05-05 3:42 ` Kewen.Lin 1 sibling, 1 reply; 22+ messages in thread From: Richard Biener @ 2019-04-26 18:23 UTC (permalink / raw) To: Segher Boessenkool, Kewen.Lin; +Cc: GCC Patches, bin.cheng, Bill Schmidt, jakub On April 26, 2019 6:59:43 PM GMT+02:00, Segher Boessenkool <segher@kernel.crashing.org> wrote: >On Fri, Apr 26, 2019 at 10:26:52PM +0800, Kewen.Lin wrote: >> on 2019/4/26 下午3:16, Richard Biener wrote: >> > We should think about possible ways of encoding doloop at IVOPTs >> > time rather than trying to re-analyze at RTL. I suppose we can >> > easily set a flag in struct loop but I'm not familiar enough with >> > the doloop pass to tell whether that is good enough. Also we >> > can maybe move doloop to GIMPLE completely? I remember some >> > targets have loop size constraints here as well, so I guess >> > that isn't easily possible. > >> It's a very good point, but I'm afraid that it's impossible to move >the >> whole doloop analysis pass to GIMPLE. When I implemented this hook >> rs6000_predict_doloop_p, I went through all the checks in >doloop_optimize. >> I found it looks impossible to imitate all of them at GIMPLE, >especially >> for gen_doloop_end check, jump insn check and register liveness >clobber >> check. Even if we can make hook to check expanded RTL insn sequence >in >> GIMPLE, but it happens too early, some information may not be exact >enough, >> many following passes could update what the analysis is based on. > >But you could set a flag in gimple, and have the RTL pass only do the >final mechanics of making things a counted loop -- including generating >the code for when it cannot do the doloop, which indeed will happen for >many different reasons, many target-specific, but it is probably pretty >easy to predict when we *can* use one, and we can do that >optimistically, >it's not so very much worse code to have to do it with a few >instructions >instead of e.g. a bdnz; there are many optimisation passes after this >that can still improve the code (cprop, cse, combine). > >So, make it a doloop in gimple, and still have the rtl pass, but that >only reverts it to a non-doloop if it turns out it has to. Does that >sound like a good plan? Yes, that's more or less what I had in mind. Note that you'd have a doloop early in RTL, sth we might not expect? Maybe we already have a GIMPLE way of saying decrement and test zero (one of the overflow builtins?), if not we'd need to add such to have Flag_1 = IFN_decandtest (i_2); If (Flag_1! = 0) ... As 'doloop-end' on GIMPLE. Just adjusting RTL expansion to catch a regular Dec and test would be possible As well of course. Richard. > >Segher ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 18:23 ` Richard Biener @ 2019-04-26 18:58 ` Segher Boessenkool 0 siblings, 0 replies; 22+ messages in thread From: Segher Boessenkool @ 2019-04-26 18:58 UTC (permalink / raw) To: Richard Biener; +Cc: Kewen.Lin, GCC Patches, bin.cheng, Bill Schmidt, jakub On Fri, Apr 26, 2019 at 07:51:05PM +0200, Richard Biener wrote: > On April 26, 2019 6:59:43 PM GMT+02:00, Segher Boessenkool <segher@kernel.crashing.org> wrote: > >So, make it a doloop in gimple, and still have the rtl pass, but that > >only reverts it to a non-doloop if it turns out it has to. Does that > >sound like a good plan? > > Yes, that's more or less what I had in mind. Note that you'd have a doloop early in RTL, sth we might not expect? I don't think there are any fundamental reasons to not have that. Maybe some adjustments are needed, sure. > Maybe we already have a GIMPLE way of saying decrement and test zero (one of the overflow builtins?), if not we'd need to add such to have > > Flag_1 = IFN_decandtest (i_2); > If (Flag_1! = 0) > ... > > As 'doloop-end' on GIMPLE. Just adjusting RTL expansion to catch a regular Dec and test would be possible > As well of course. I don't know if we should have any special instructions in GIMPLE for this. Maybe it is enough to just mark up the loop? It would be a lot simpler, if it *works* that is :-) Segher ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 17:06 ` Segher Boessenkool 2019-04-26 18:23 ` Richard Biener @ 2019-05-05 3:42 ` Kewen.Lin 1 sibling, 0 replies; 22+ messages in thread From: Kewen.Lin @ 2019-05-05 3:42 UTC (permalink / raw) To: gcc-patches on 2019/4/27 12:59 AM, Segher Boessenkool wrote: > On Fri, Apr 26, 2019 at 10:26:52PM +0800, Kewen.Lin wrote: >> on 2019/4/26 3:16 PM, Richard Biener wrote: >>> We should think about possible ways of encoding doloop at IVOPTs >>> time rather than trying to re-analyze at RTL. I suppose we can >>> easily set a flag in struct loop but I'm not familiar enough with >>> the doloop pass to tell whether that is good enough. Also we >>> can maybe move doloop to GIMPLE completely? I remember some >>> targets have loop size constraints here as well, so I guess >>> that isn't easily possible. > >> It's a very good point, but I'm afraid that it's impossible to move the >> whole doloop analysis pass to GIMPLE. When I implemented this hook >> rs6000_predict_doloop_p, I went through all the checks in doloop_optimize. >> I found it looks impossible to imitate all of them at GIMPLE, especially >> for gen_doloop_end check, jump insn check and register liveness clobber >> check. Even if we can make hook to check expanded RTL insn sequence in >> GIMPLE, but it happens too early, some information may not be exact enough, >> many following passes could update what the analysis is based on. > > But you could set a flag in gimple, and have the RTL pass only do the > final mechanics of making things a counted loop -- including generating > the code for when it cannot do the doloop, which indeed will happen for > many different reasons, many target-specific, but it is probably pretty > easy to predict when we *can* use one, and we can do that optimistically, > it's not so very much worse code to have to do it with a few instructions > instead of e.g. a bdnz; there are many optimisation passes after this > that can still improve the code (cprop, cse, combine). > Yes, we can set a flag to indicate the loop was predicted as a doloop, which implicitly says we have checked the doloop criteria that can be checked in the middle end. It can be well defined later. One question in my mind is whether there are other passes that currently suffer from not knowing whether a loop can be transformed into a doloop and therefore miss some optimization. I guess that determines the value and priority of making this non-trivial change. Do you happen to know of some existing PRs? > So, make it a doloop in gimple, and still have the rtl pass, but that > only reverts it to a non-doloop if it turns out it has to. Does that > sound like a good plan? > Sounds good! Thanks Segher! > > Segher > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 6:58 ` Kewen.Lin 2019-04-26 7:59 ` Richard Biener @ 2019-04-26 16:44 ` Segher Boessenkool 2019-04-27 4:13 ` Bin.Cheng 2 siblings, 0 replies; 22+ messages in thread From: Segher Boessenkool @ 2019-04-26 16:44 UTC (permalink / raw) To: Kewen.Lin; +Cc: GCC Patches, bin.cheng, Bill Schmidt, Richard Guenther, jakub On Fri, Apr 26, 2019 at 02:43:44PM +0800, Kewen.Lin wrote: > > Does it create worse code now? What we have before your patch isn't > > so super either (it has an sldi in the loop, it has two mtctr too). > > Maybe you can show the generated code? > > It's a good question! From the generated codes for the core loop, the > code after my patch doesn't have bdnz to leverage hardware CTR, it has > extra cmpld and branch instead, looks worse. But I wrote a tiny case > to invoke the foo and evaluated the running time, they are equal. > > * Measured time: > After: > real 199.47 > user 198.35 > sys 1.11 > Before: > real 199.19 > user 198.56 > sys 0.62 Before: > .L3: # core loop > stw 10,0(8) > addi 8,8,-1024 > bdnz .L3 So it didn't use an update instruction here, although it could. Not that that changes anything: it would still be three cycles per iteration (that's the minimum for any loop: instruction fetch is the bottleneck). After: > .L3: # core loop > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp > beqlr 0 # if eq, return > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp again > bne 0,.L3 # if ne, jump to L3. This is unrolled a factor 2. It should be faster, unfortunately it updates r9 twice per unrolled loop, making the dependency chain too long. The bdnz loop could be like 0: stw 10,-1024(8) bdzlr stwu 10,-2048(8) bdnz 0b or similar. There are multiple problems before we can get that :-) (The important one is that the pointer (r8 here) should be updated only once per unrolled loop iteration; just like in the version without bdnz. Using bdzlr and stwu is just niceties, compared to that). > I practiced whether we can adjust the decision made in ivopts. [ snip ] > Need more investigation. Yeah. Segher ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-26 6:58 ` Kewen.Lin 2019-04-26 7:59 ` Richard Biener 2019-04-26 16:44 ` Segher Boessenkool @ 2019-04-27 4:13 ` Bin.Cheng 2019-05-05 5:26 ` Kewen.Lin 2 siblings, 1 reply; 22+ messages in thread From: Bin.Cheng @ 2019-04-27 4:13 UTC (permalink / raw) To: Kewen.Lin Cc: Segher Boessenkool, GCC Patches, bin.cheng, Bill Schmidt, Richard Guenther, Jakub Jelinek On Fri, Apr 26, 2019 at 2:44 PM Kewen.Lin <linkw@linux.ibm.com> wrote: > > Hi Segher, > > Thanks a lot for your comments! > > on 2019/4/25 下午8:16, Segher Boessenkool wrote: > > > Does it create worse code now? What we have before your patch isn't > > so super either (it has an sldi in the loop, it has two mtctr too). > > Maybe you can show the generated code? > > It's a good question! From the generated codes for the core loop, the > code after my patch doesn't have bdnz to leverage hardware CTR, it has > extra cmpld and branch instead, looks worse. But I wrote a tiny case > to invoke the foo and evaluated the running time, they are equal. > > * Measured time: > After: > real 199.47 > user 198.35 > sys 1.11 > Before: > real 199.19 > user 198.56 > sys 0.62 > > * Command I used to compile: > ${local_gcc} ${case_src}/20050830-1.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -ffat-lto-objects -fno-ident -c > > * main file (main.c): > extern void foo(); > #define LOOPNUM 10000 > #define N 1000000*256 > int a[N]; > int main() { > for(int i = 0; i < LOOPNUM; i++) { > foo(N); > } > } > > > * Generated code sequence: > > Before my patch: > > cmpwi 0,3,511 > blelr 0 > addi 10,3,-256 > addi 9,3,-512 > addis 8,2,.LC0@toc@ha > ld 8,.LC0@toc@l(8) > cmpwi 0,10,256 > rldicl 9,9,56,40 > sldi 3,3,2 > addi 9,9,1 > mtctr 9 > add 8,3,8 > li 10,42 > blt 0,.L7 # jump to L7 it's less than 256 > .L3: # core loop > stw 10,0(8) > addi 8,8,-1024 > bdnz .L3 > blr > .L7: > li 9,1 # special case, iteration only 1 > mtctr 9 > b .L3 > > After my patch: > > cmpwi 0,3,511 > blelr 0 > addis 7,2,.LC0@toc@ha > ld 7,.LC0@toc@l(7) > addi 10,3,-512 > sldi 9,3,2 > rlwinm 10,10,0,0,23 > li 8,42 > subf 10,10,3 > sldi 10,10,2 > addi 6,7,-1024 > add 9,9,7 > add 10,10,6 > .p2align 4,,15 > .L3: # core loop > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp > beqlr 0 # if eq, return > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp again > bne 0,.L3 # if ne, jump to L3. > blr > > ------------------------ > > I practiced whether we can adjust the decision made in ivopts. > For one compare iv use with zero cost, if one iv cand whose base > is from memory has costly elim_cost before adjust_setup_cost, > it's possible to make later analysis unable to find it's finite, > then we try to avoid it. > > The diff looks like: > > --- a/gcc/tree-ssa-loop-ivopts.c > +++ b/gcc/tree-ssa-loop-ivopts.c > @@ -5126,6 +5126,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > tree *control_var, *bound_cst; > enum tree_code comp = ERROR_MARK; > struct iv_use *use = group->vuses[0]; > + bool dont_elim_p = false; > > /* Extract condition operands. */ > rewrite_type = extract_cond_operands (data, use->stmt, &control_var, > @@ -5152,6 +5153,16 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > inv_expr_elim = get_loop_invariant_expr (data, bound); > bitmap_clear (inv_vars_elim); > } > + > + /* zero cost use makes it easier to select memory based iv cand > + for replacement of non memory based iv and its use. 
But if > + the setup sequence are too costly, loop iv analysis can NOT > + easily figure out it's finite, it's possible to stop the > + low-overhead loop transformation and get unexpected code. */ > + if (use->zero_cost_p && cand->iv->base_object && !use->iv->base_object > + && elim_cost.cost >= 30) > + dont_elim_p = true; No, we'd like to avoid such things in general. The conditions look like a hack to me. elim_cost is compared to express_cost, adding another check on it at different place isn't really good, especially 30 itself is a magic number. It's most likely improvement in some cases, deterioration in others. Also it punishes one pass (IVOPTs here) because of other pass' (RTL) problem. It does't mean we can't do such transformations, only it has to be as precise/conservative as possible. For example, if RTL loop iv is improved to handle the case in the future, who would remember to come back and adjust this? GCC lacks the capability passing information to later passes. Gimple analyzer worked hard collecting various information but discards it entering RTL or earlier. Other examples are like runtime alias information, non-wrapping information for specific operations, etc. IMHO, this is what needs to be done. As for this case, it could be finite loop info, or non-wrapping info of the iv_var's increment operation. By passing more information, RTL passes can be simplified too. Thanks, bin > + > /* The bound is a loop invariant, so it will be only computed > once. */ > elim_cost.cost = adjust_setup_cost (data, elim_cost.cost); > @@ -5184,7 +5195,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > express_cost += bound_cost; > > /* Choose the better approach, preferring the eliminated IV. */ > - if (elim_cost <= express_cost) > + if (elim_cost <= express_cost && !dont_elim_p) > { > > > I was thinking whether this zero cost change just exposed > an existing problem, then this kind of check should be for all > cases not only for zero cost use, similar to > expression_expensive_p? But why doesn't it happen before? > Need more investigation. > > > > >> Btw, this is for GCC10. > > > > *Phew* ;-) > > > > > > Some trivial comments: > > > >> +static bool > >> +invalid_insn_for_doloop_p (struct loop *loop) > >> +{ > >> + basic_block *body = get_loop_body (loop); > >> + unsigned num_nodes = loop->num_nodes; > >> + gimple_stmt_iterator gsi; > >> + unsigned i; > >> + > >> + for (i = 0; i < num_nodes; i++) > > > > for (unsigned i = 0; i < num_nodes; i++) > > > > (and maybe you can just say loop->num_nodes here; I don't know if that > > generates worse code, or if that even matters). > > Good idea, will fix it. > > > > >> + if (dump_file && (dump_flags & TDF_DETAILS)) > >> + fprintf ( > >> + dump_file, > >> + "predict doloop failure due to finding computed jump.\n"); > > > > We don't normally end lines in (. There are other solutions to why you > > did that here; you can use string pasting, to break the string into two, > > or factor out (some part of) the loop body to reduce indentation. > > > > Will adjust it. > > > > > Segher > > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-27 4:13 ` Bin.Cheng @ 2019-05-05 5:26 ` Kewen.Lin 2019-05-06 10:21 ` Richard Biener 0 siblings, 1 reply; 22+ messages in thread From: Kewen.Lin @ 2019-05-05 5:26 UTC (permalink / raw) To: Bin.Cheng Cc: Segher Boessenkool, GCC Patches, bin.cheng, Bill Schmidt, Richard Guenther, Jakub Jelinek on 2019/4/27 ä¸å11:44, Bin.Cheng wrote: > On Fri, Apr 26, 2019 at 2:44 PM Kewen.Lin <linkw@linux.ibm.com> wrote: >> + >> + /* zero cost use makes it easier to select memory based iv cand >> + for replacement of non memory based iv and its use. But if >> + the setup sequence are too costly, loop iv analysis can NOT >> + easily figure out it's finite, it's possible to stop the >> + low-overhead loop transformation and get unexpected code. */ >> + if (use->zero_cost_p && cand->iv->base_object && !use->iv->base_object >> + && elim_cost.cost >= 30) >> + dont_elim_p = true; > No, we'd like to avoid such things in general. The conditions look > like a hack to me. elim_cost is compared to express_cost, adding > another check on it at different place isn't really good, especially > 30 itself is a magic number. It's most likely improvement in some > cases, deterioration in others. > Yes, I agree it's too hacky and unacceptable as a formal fix. And I tried to investigate more whether it's a general issue but never got exposed. > Also it punishes one pass (IVOPTs here) because of other pass' (RTL) > problem. It does't mean we can't do such transformations, only it has > to be as precise/conservative as possible. For example, if RTL loop > iv is improved to handle the case in the future, who would remember to > come back and adjust this? > Good question! > GCC lacks the capability passing information to later passes. Gimple > analyzer worked hard collecting various information but discards it > entering RTL or earlier. Other examples are like runtime alias > information, non-wrapping information for specific operations, etc. > IMHO, this is what needs to be done. As for this case, it could be > finite loop info, or non-wrapping info of the iv_var's increment > operation. By passing more information, RTL passes can be simplified > too. > Thanks for the information! Is there any under development work for this? That would be fine if we can pass down those information to downstream passes based on upcoming feature. > Thanks, > bin >> + >> /* The bound is a loop invariant, so it will be only computed >> once. */ >> elim_cost.cost = adjust_setup_cost (data, elim_cost.cost); >> @@ -5184,7 +5195,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, >> express_cost += bound_cost; >> >> /* Choose the better approach, preferring the eliminated IV. */ >> - if (elim_cost <= express_cost) >> + if (elim_cost <= express_cost && !dont_elim_p) >> { >> >> >> I was thinking whether this zero cost change just exposed >> an existing problem, then this kind of check should be for all >> cases not only for zero cost use, similar to >> expression_expensive_p? But why doesn't it happen before? >> Need more investigation. >> >>> >>>> Btw, this is for GCC10. 
>>> >>> *Phew* ;-) >>> >>> >>> Some trivial comments: >>> >>>> +static bool >>>> +invalid_insn_for_doloop_p (struct loop *loop) >>>> +{ >>>> + basic_block *body = get_loop_body (loop); >>>> + unsigned num_nodes = loop->num_nodes; >>>> + gimple_stmt_iterator gsi; >>>> + unsigned i; >>>> + >>>> + for (i = 0; i < num_nodes; i++) >>> >>> for (unsigned i = 0; i < num_nodes; i++) >>> >>> (and maybe you can just say loop->num_nodes here; I don't know if that >>> generates worse code, or if that even matters). >> >> Good idea, will fix it. >> >>> >>>> + if (dump_file && (dump_flags & TDF_DETAILS)) >>>> + fprintf ( >>>> + dump_file, >>>> + "predict doloop failure due to finding computed jump.\n"); >>> >>> We don't normally end lines in (. There are other solutions to why you >>> did that here; you can use string pasting, to break the string into two, >>> or factor out (some part of) the loop body to reduce indentation. >>> >> >> Will adjust it. >> >>> >>> Segher >>> >> > ^ permalink raw reply [flat|nested] 22+ messages in thread
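For reference, the string-pasting style Segher suggests above would look roughly like this (adjacent string literals are concatenated, so the line no longer ends in an open parenthesis):

    if (dump_file && (dump_flags & TDF_DETAILS))
      fprintf (dump_file,
               "predict doloop failure due to finding "
               "computed jump.\n");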
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-05-05 5:26 ` Kewen.Lin @ 2019-05-06 10:21 ` Richard Biener 0 siblings, 0 replies; 22+ messages in thread From: Richard Biener @ 2019-05-06 10:21 UTC (permalink / raw) To: Kewen.Lin Cc: Bin.Cheng, Segher Boessenkool, GCC Patches, bin.cheng, Bill Schmidt, Jakub Jelinek On Sun, 5 May 2019, Kewen.Lin wrote: > on 2019/4/27 11:44, Bin.Cheng wrote: > > GCC lacks the capability passing information to later passes. Gimple > > analyzer worked hard collecting various information but discards it > > entering RTL or earlier. Other examples are like runtime alias > > information, non-wrapping information for specific operations, etc. > > IMHO, this is what needs to be done. As for this case, it could be > > finite loop info, or non-wrapping info of the iv_var's increment > > operation. By passing more information, RTL passes can be simplified > > too. > > > > Thanks for the information! Is there any under development work for this? > That would be fine if we can pass down those information to downstream > passes based on upcoming feature. GCC keeps several bits of information across passes, notably everything stored in struct loop. Richard. ^ permalink raw reply [flat|nested] 22+ messages in thread
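As a rough illustration of Richard's point, some of the information that already persists in struct loop across GIMPLE and RTL passes looks approximately like the excerpt below (an abridged paraphrase of cfgloop.h from around this time, not a verbatim copy):

    struct loop
    {
      /* ...  */
      /* Number of latch executions, if known (set by niter analysis).  */
      tree nb_iterations;
      /* Upper bound on the number of latch executions, and whether the
         bound is valid.  */
      widest_int nb_iterations_upper_bound;
      bool any_upper_bound;
      /* Hints such as #pragma omp simd safelen and loop attributes that
         later passes can consult.  */
      int safelen;
      bool force_vectorize;
      bool dont_vectorize;
      /* ...  */
    };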
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-24 8:49 [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts Kewen.Lin 2019-04-24 9:08 ` Jakub Jelinek 2019-04-25 12:45 ` Segher Boessenkool @ 2019-04-27 3:45 ` Bin.Cheng 2019-05-05 3:23 ` Kewen.Lin 2 siblings, 1 reply; 22+ messages in thread From: Bin.Cheng @ 2019-04-27 3:45 UTC (permalink / raw) To: Kewen.Lin Cc: GCC Patches, bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther, Jakub Jelinek Thanks very much for working on this. On Wed, Apr 24, 2019 at 4:41 PM Kewen.Lin <linkw@linux.ibm.com> wrote: > > Hi all, > > As PR80791 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80791, on > some targets which support low-overhead loop, the related compare > type ivs use is possible to become dead and removed eventually. > However, the current ivopt cost modeling doesn't consider this > possible elimination, it leads suboptimal iv candidate decision. > > To make it better, this patch is to introduce one target hook, to > provide target which supports low-overhead loop a way to teach > ivopt to know which compare type iv can be ignored in modeling. > > Since the low-overhead loop optimize transformation is based on RTL, > some of those checks are hard to be imitated on gimple, so it's not > possible to predict the current loop will be transformed exactly > in middle-end. But if we can have most loop predicted precisely, > it would be helpful. It highly depends on target hook fine tuning. > It's acceptable to have some loops which can be transformed to low- > overhead loop but we don't catch. But we should try our best to > avoid to predict some loop as low-overhead loop but which isn't. > > Bootstrapped and regression testing on powerpc64le. > > One failure was found in regression testing, which is: > gcc.target/powerpc/20050830-1.c > > I did some investigation and found: with this patch, one new iv cand > is chosen for exit condition, which is used to rewrite the compare > type use. Later loop iv analysis in RTL can NOT determine the loop > iteration number is finite, which causes doloop_optimize not transform > the loop. The check doesn't find the expected pattern. > > Still investigating how to solve this failure, like either to enhance > loop iv analysis or to check some special condition and guard in ivopt. > Any suggestions are highly welcomed. > > Btw, this is for GCC10. > > --------------------------------------------- > > gcc/ChangeLog > > 2019-04-24 Kewen Lin <linkw@gcc.gnu.org> > > PR middle-end/80791 > * target.def (predict_doloop_p): New. > * targhooks.h (default_predict_doloop_p): New. > * targhooks.c (default_predict_doloop_p): Likewise. > * doc/tm.texi.in (TARGET_PREDICT_DOLOOP_P): New. > * doc/tm.texi: Regenerate. > * config/rs6000/rs6000.c (invalid_insn_for_doloop_p): New. > (costly_iter_for_doloop_p): New. > (rs6000_predict_doloop_p): New. > * gcc/expr.c (produce_memory_decl_rtl): New. > (prepare_decl_rtl): New. > * gcc/expr.h (produce_memory_decl_rtl): Declare. > (prepare_decl_rtl): Declare. > * gcc/tree-ssa-loop-ivopts.c (tree_ssa_iv_optimize_loop): Call > predict_doloop_p hook. > (tailor_cmp_uses): New. > (preserve_ivs_for_use): New. > (computation_cost): Call refactored prepare_decl_rtl, consider > zero cost iv use. > (dump_use): Dump zero_cost_p field. > (record_use): Init zero_cost_p field. > (produce_memory_decl_rtl): Remove. > (prepare_decl_rtl): Remove. > > gcc/testsuite/ChangeLog > > 2019-04-24 Kewen Lin <linkw@gcc.gnu.org> > > PR middle-end/80791 > * gcc.dg/tree-ssa/ivopts-lt.c : Adjust. 
> > --- > gcc/config/rs6000/rs6000.c | 174 ++++++++++++++++++++- > gcc/doc/tm.texi | 8 + > gcc/doc/tm.texi.in | 2 + > gcc/expr.c | 91 +++++++++++ > gcc/expr.h | 16 +- > gcc/target.def | 9 ++ > gcc/targhooks.c | 13 ++ > gcc/targhooks.h | 1 + > gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c | 7 +- > gcc/tree-ssa-loop-ivopts.c | 247 +++++++++++++++++++----------- > 10 files changed, 472 insertions(+), 96 deletions(-) > > diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c > index a21f4f7..756f914 100644 > --- a/gcc/config/rs6000/rs6000.c > +++ b/gcc/config/rs6000/rs6000.c For such non-trivial patch, we can improve review process by splitting to smaller patches which can be reviewed/approved independently. Below are some general comments specific to IVOPTs change. > @@ -83,6 +83,9 @@ > #include "tree-ssa-propagate.h" > #include "tree-vrp.h" > #include "tree-ssanames.h" > +#include "tree-ssa-loop-niter.h" > +#include "tree-cfg.h" > +#include "tree-scalar-evolution.h" > > /* This file should be included last. */ > #include "target-def.h" > @@ -1914,6 +1917,9 @@ static const struct attribute_spec rs6000_attribute_table[] = > #undef TARGET_CAN_USE_DOLOOP_P > #define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost > > +#undef TARGET_PREDICT_DOLOOP_P > +#define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p > + > #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV > #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv > > @@ -39436,7 +39442,173 @@ rs6000_mangle_decl_assembler_name (tree decl, tree id) > return id; > } > > - > +/* Check whether there are some instructions preventing doloop transformation > + inside loop body, mainly for instructions which are possible to kill CTR. > + > + Return true if some invalid insn exits, otherwise return false. */ > + > +static bool > +invalid_insn_for_doloop_p (struct loop *loop) > +{ > + basic_block *body = get_loop_body (loop); > + unsigned num_nodes = loop->num_nodes; > + gimple_stmt_iterator gsi; > + unsigned i; > + > + for (i = 0; i < num_nodes; i++) > + for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi)) > + { > + gimple *stmt = gsi_stmt (gsi); > + if (is_gimple_call (stmt) && !gimple_call_internal_p (stmt) > + && !is_inexpensive_builtin (gimple_call_fndecl (stmt))) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "predict doloop failure due to finding call.\n"); > + return true; > + } > + if (computed_goto_p (stmt)) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf ( > + dump_file, > + "predict doloop failure due to finding computed jump.\n"); > + return true; > + } > + /* TODO: Now this hook is expected to be called in ivopts, which is > + before switchlower1/switchlower2. Checking for SWITCH at this point > + will eliminate some good candidates. But since there are only a few > + cases having _a_ switch statement in the innermost loop, it can be a low > + priority enhancement. */ > + > + if (gimple_code (stmt) == GIMPLE_SWITCH) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "predict doloop failure due to finding switch.\n"); > + return true; > + } > + } > + > + return false; > +} > + > +/* Check whether number of iteration computation is too costly for doloop > + transformation. It expands the gimple sequence to equivalent RTL insn > + sequence, then evaluate the cost. > + > + Return true if it's costly, otherwise return false. 
*/ > + > +static bool > +costly_iter_for_doloop_p (struct loop *loop, tree niters) > +{ > + tree type = TREE_TYPE (niters); > + unsigned cost = 0, i; > + tree obj; > + bool speed = optimize_loop_for_speed_p (loop); > + int regno = LAST_VIRTUAL_REGISTER + 1; > + vec<tree> tvec; > + tvec.create (20); > + decl_rtl_data tdata; > + tdata.regno = ®no; > + tdata.treevec = &tvec; > + walk_tree (&niters, prepare_decl_rtl, &tdata, NULL); > + start_sequence (); > + expand_expr (niters, NULL_RTX, TYPE_MODE (type), EXPAND_NORMAL); > + rtx_insn *seq = get_insns (); > + end_sequence (); > + FOR_EACH_VEC_ELT (tvec, i, obj) > + SET_DECL_RTL (obj, NULL_RTX); > + tvec.release (); > + > + for (; seq; seq = NEXT_INSN (seq)) > + { > + if (!INSN_P (seq)) > + continue; > + rtx body = PATTERN (seq); > + if (GET_CODE (body) == SET) > + { > + rtx set_val = XEXP (body, 1); > + enum rtx_code code = GET_CODE (set_val); > + enum rtx_class cls = GET_RTX_CLASS (code); > + /* For now, we only consider these two RTX classes, to match what we > + get in doloop_optimize, excluding operations like zero/sign extend. */ > + if (cls == RTX_BIN_ARITH || cls == RTX_COMM_ARITH) > + cost += set_src_cost (set_val, GET_MODE (set_val), speed); > + } > + } > + unsigned max_cost > + = COSTS_N_INSNS (PARAM_VALUE (PARAM_MAX_ITERATIONS_COMPUTATION_COST)); > + if (cost > max_cost) > + return true; > + > + return false; > +} > + > +/* > + Predict whether the given loop will be transformed in the RTL > + doloop_optimize pass. This is for use by the IVOPTS middle-end pass. > + Attempt to duplicate as many doloop_optimize checks as possible. > + > + Some RTL specific checks seems unable to be checked here, if any > + new checks or easy checks _are_ missing here. */ > + > +static bool > +rs6000_predict_doloop_p (struct loop *loop) > +{ > + gcc_assert (loop); > + > + /* On rs6000, targetm.can_use_doloop_p is actually > + can_use_doloop_if_innermost. Just ensure it's innermost. */ > + if (loop->inner != NULL) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "predict doloop failure due to no innermost.\n"); > + return false; > + } > + > + /* number_of_latch_executions is not so costly, so we don't use > + number_of_iterations_exit for iteration description. */ > + tree niters = number_of_latch_executions (loop); > + if (niters == NULL_TREE || niters == chrec_dont_know) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "predict doloop failure due to unexpected niters.\n"); > + return false; > + } > + > + /* Similar to doloop_optimize, check whether iteration count too small > + and not profitable. */ > + HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop); > + if (est_niter == -1) > + est_niter = get_likely_max_loop_iterations_int (loop); > + if (est_niter >= 0 && est_niter < 3) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "predict doloop failure due to too few iterations (%d).\n", > + (unsigned int) est_niter); > + return false; > + } > + > + /* Similar to doloop_optimize, check whether number of iterations too costly > + to compute. */ > + if (costly_iter_for_doloop_p (loop, niters)) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "predict doloop failure due to costly niter computation.\n"); > + return false; > + } > + > + /* Similar to doloop_optimize targetm.invalid_within_doloop, check for > + CALL, tablejump, computed_jump. 
*/ > + if (invalid_insn_for_doloop_p (loop)) > + return false; > + > + return true; > +} > + > struct gcc_target targetm = TARGET_INITIALIZER; > > #include "gt-rs6000.h" > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi > index 8c8978b..d3ad208 100644 > --- a/gcc/doc/tm.texi > +++ b/gcc/doc/tm.texi > @@ -11603,6 +11603,14 @@ function version at run-time for a given set of function versions. > body must be generated. > @end deftypefn > > +@deftypefn {Target Hook} bool TARGET_PREDICT_DOLOOP_P (struct loop *@var{loop}) > +Return true if we can predict it is possible to use low-overhead loops > +for a particular loop. The parameter @var{loop} is a pointer to the loop > +which is going to be checked. This target hook is required only when the > +target supports low-overhead loops, and will help some earlier middle-end > +passes to make some decisions. > +@end deftypefn > + > @deftypefn {Target Hook} bool TARGET_CAN_USE_DOLOOP_P (const widest_int @var{&iterations}, const widest_int @var{&iterations_max}, unsigned int @var{loop_depth}, bool @var{entered_at_top}) > Return true if it is possible to use low-overhead loops (@code{doloop_end} > and @code{doloop_begin}) for a particular loop. @var{iterations} gives the > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in > index fe1194e..e047734 100644 > --- a/gcc/doc/tm.texi.in > +++ b/gcc/doc/tm.texi.in > @@ -7942,6 +7942,8 @@ to by @var{ce_info}. > > @hook TARGET_GENERATE_VERSION_DISPATCHER_BODY > > +@hook TARGET_PREDICT_DOLOOP_P > + > @hook TARGET_CAN_USE_DOLOOP_P > > @hook TARGET_INVALID_WITHIN_DOLOOP > diff --git a/gcc/expr.c b/gcc/expr.c > index 9ff5e5f..1f2ad45 100644 > --- a/gcc/expr.c > +++ b/gcc/expr.c > @@ -12539,3 +12539,94 @@ int_expr_size (tree exp) > > return tree_to_shwi (size); > } > + > +/* Produce DECL_RTL for object obj so it looks like it is stored in memory. */ > + > +rtx > +produce_memory_decl_rtl (tree obj, int *regno) > +{ > + addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (obj)); > + machine_mode address_mode = targetm.addr_space.address_mode (as); > + rtx x; > + > + gcc_assert (obj); > + if (TREE_STATIC (obj) || DECL_EXTERNAL (obj)) > + { > + const char *name = IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (obj)); > + x = gen_rtx_SYMBOL_REF (address_mode, name); > + SET_SYMBOL_REF_DECL (x, obj); > + x = gen_rtx_MEM (DECL_MODE (obj), x); > + set_mem_addr_space (x, as); > + targetm.encode_section_info (obj, x, true); > + } > + else > + { > + x = gen_raw_REG (address_mode, (*regno)++); > + x = gen_rtx_MEM (DECL_MODE (obj), x); > + set_mem_addr_space (x, as); > + } > + > + return x; > +} > + > +/* Prepares decl_rtl for variables referred in *EXPR_P. Callback for > + walk_tree. DATA contains the actual fake register number. */ > + > +tree > +prepare_decl_rtl (tree *expr_p, int *ws, void *data) > +{ > + tree obj = NULL_TREE; > + rtx x = NULL_RTX; > + decl_rtl_data *info = (decl_rtl_data *) data; > + int *regno = info->regno; > + vec<tree> *treevec = info->treevec; > + > + switch (TREE_CODE (*expr_p)) > + { > + case ADDR_EXPR: > + for (expr_p = &TREE_OPERAND (*expr_p, 0); handled_component_p (*expr_p); > + expr_p = &TREE_OPERAND (*expr_p, 0)) > + continue; > + obj = *expr_p; > + if (DECL_P (obj) && HAS_RTL_P (obj) && !DECL_RTL_SET_P (obj)) > + x = produce_memory_decl_rtl (obj, regno); > + break; > + > + case SSA_NAME: > + *ws = 0; > + obj = SSA_NAME_VAR (*expr_p); > + /* Defer handling of anonymous SSA_NAMEs to the expander. 
*/ > + if (!obj) > + return NULL_TREE; > + if (!DECL_RTL_SET_P (obj)) > + x = gen_raw_REG (DECL_MODE (obj), (*regno)++); > + break; > + > + case VAR_DECL: > + case PARM_DECL: > + case RESULT_DECL: > + *ws = 0; > + obj = *expr_p; > + > + if (DECL_RTL_SET_P (obj)) > + break; > + > + if (DECL_MODE (obj) == BLKmode) > + x = produce_memory_decl_rtl (obj, regno); > + else > + x = gen_raw_REG (DECL_MODE (obj), (*regno)++); > + > + break; > + > + default: > + break; > + } > + > + if (x) > + { > + treevec->safe_push (obj); > + SET_DECL_RTL (obj, x); > + } > + > + return NULL_TREE; > +} > diff --git a/gcc/expr.h b/gcc/expr.h > index 17c3962..b1894a6b 100644 > --- a/gcc/expr.h > +++ b/gcc/expr.h > @@ -53,7 +53,21 @@ typedef struct separate_ops > tree type; > tree op0, op1, op2; > } *sepops; > - > + > +/* This structure is used to pass information to tree walker function > + prepare_decl_rtl. */ > +typedef struct data_for_decl_rtl > +{ > + int *regno; > + vec<tree> *treevec; > +} decl_rtl_data; > + > +/* Produce decl_rtl for object so it looks like it is stored in memory. */ > +rtx produce_memory_decl_rtl (tree, int *); > + > +/* Prepares decl_rtl for variables referred. Callback for walk_tree. */ > +tree prepare_decl_rtl (tree *, int *, void *); > + > /* This is run during target initialization to set up which modes can be > used directly in memory and to initialize the block move optab. */ > extern void init_expr_target (void); > diff --git a/gcc/target.def b/gcc/target.def > index 66cee07..2f28e80 100644 > --- a/gcc/target.def > +++ b/gcc/target.def > @@ -4227,6 +4227,15 @@ DEFHOOK > rtx, (machine_mode mode, rtx result, rtx val, rtx failval), > default_speculation_safe_value) > > +DEFHOOK > +(predict_doloop_p, > + "Return true if we can predict it is possible to use low-overhead loops\n\ > +for a particular loop. The parameter @var{loop} is a pointer to the loop\n\ > +which is going to be checked. This target hook is required only when the\n\ > +target supports low-overhead loops, and will help some earlier middle-end\n\ > +passes to make some decisions.", > + bool, (struct loop *loop), > + default_predict_doloop_p) > > DEFHOOK > (can_use_doloop_p, > diff --git a/gcc/targhooks.c b/gcc/targhooks.c > index 318f7e9..56ed4cc 100644 > --- a/gcc/targhooks.c > +++ b/gcc/targhooks.c > @@ -643,6 +643,19 @@ default_has_ifunc_p (void) > return HAVE_GNU_INDIRECT_FUNCTION; > } > > +/* True if we can predict this loop is possible to be transformed to a > + low-overhead loop, otherwise returns false. > + > + By default, false is returned, as this hook's applicability should be > + verified for each target. Target maintainers should re-define the hook > + if the target can take advantage of it. */ > + > +bool > +default_predict_doloop_p (struct loop *loop ATTRIBUTE_UNUSED) > +{ > + return false; > +} > + > /* NULL if INSN insn is valid within a low-overhead loop, otherwise returns > an error message. 
> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h > index 5943627..70bfb17 100644 > --- a/gcc/targhooks.h > +++ b/gcc/targhooks.h > @@ -85,6 +85,7 @@ extern bool default_fixed_point_supported_p (void); > > extern bool default_has_ifunc_p (void); > > +extern bool default_predict_doloop_p (struct loop *); > extern const char * default_invalid_within_doloop (const rtx_insn *); > > extern tree default_builtin_vectorized_function (unsigned int, tree, tree); > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c > index 171c85a..f61288c 100644 > --- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c > @@ -17,6 +17,7 @@ f1 (char *p, uintptr_t i, uintptr_t n) > while (i < n); > } > > -/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" } } */ > -/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts"} } */ > -/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" } } */ > +/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" { target { ! powerpc*-*-* } } } } */ > +/* { dg-final { scan-tree-dump-times "PHI" 2 "ivopts" { target { powerpc*-*-* } } } } */ > +/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts" { target { ! powerpc*-*-* } } } } */ > +/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" { target { ! powerpc*-*-* } } } } */ > diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c > index a44b4cb..148a48d 100644 > --- a/gcc/tree-ssa-loop-ivopts.c > +++ b/gcc/tree-ssa-loop-ivopts.c > @@ -109,6 +109,7 @@ along with GCC; see the file COPYING3. If not see > #include "builtins.h" > #include "tree-vectorizer.h" > > + > /* FIXME: Expressions are expanded to RTL in this pass to determine the > cost of different addressing modes. This should be moved to a TBD > interface between the GIMPLE and RTL worlds. */ > @@ -376,6 +377,7 @@ struct iv_use > tree addr_base; /* Base address with const offset stripped. */ > poly_uint64_pod addr_offset; > /* Const offset stripped from base address. */ > + bool zero_cost_p; /* This iv use takes zero cost, only compare type. */ > }; > > /* Group of uses. */ > @@ -691,6 +693,8 @@ static vec<tree> decl_rtl_to_reset; > > static comp_cost force_expr_to_var_cost (tree, bool); > > +static void preserve_ivs_for_use (struct ivopts_data*, struct iv_use*); > + > /* The single loop exit if it dominates the latch, NULL otherwise. */ > > edge > @@ -765,6 +769,8 @@ dump_use (FILE *file, struct iv_use *use) > fprintf (file, " At pos:\t"); > if (use->op_p) > print_generic_expr (file, *use->op_p, TDF_SLIM); > + if (use->zero_cost_p) > + fprintf (file, "\n Zero cost"); > fprintf (file, "\n"); > dump_iv (file, use->iv, false, 2); > } > @@ -1549,6 +1555,7 @@ record_use (struct iv_group *group, tree *use_p, struct iv *iv, > use->op_p = use_p; > use->addr_base = addr_base; > use->addr_offset = addr_offset; > + use->zero_cost_p = false; > > group->vuses.safe_push (use); > return use; > @@ -3687,94 +3694,6 @@ get_group_iv_cost (struct ivopts_data *data, struct iv_group *group, > return NULL; > } > > -/* Produce DECL_RTL for object obj so it looks like it is stored in memory. 
*/ > -static rtx > -produce_memory_decl_rtl (tree obj, int *regno) > -{ > - addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (obj)); > - machine_mode address_mode = targetm.addr_space.address_mode (as); > - rtx x; > - > - gcc_assert (obj); > - if (TREE_STATIC (obj) || DECL_EXTERNAL (obj)) > - { > - const char *name = IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (obj)); > - x = gen_rtx_SYMBOL_REF (address_mode, name); > - SET_SYMBOL_REF_DECL (x, obj); > - x = gen_rtx_MEM (DECL_MODE (obj), x); > - set_mem_addr_space (x, as); > - targetm.encode_section_info (obj, x, true); > - } > - else > - { > - x = gen_raw_REG (address_mode, (*regno)++); > - x = gen_rtx_MEM (DECL_MODE (obj), x); > - set_mem_addr_space (x, as); > - } > - > - return x; > -} > - > -/* Prepares decl_rtl for variables referred in *EXPR_P. Callback for > - walk_tree. DATA contains the actual fake register number. */ > - > -static tree > -prepare_decl_rtl (tree *expr_p, int *ws, void *data) > -{ > - tree obj = NULL_TREE; > - rtx x = NULL_RTX; > - int *regno = (int *) data; > - > - switch (TREE_CODE (*expr_p)) > - { > - case ADDR_EXPR: > - for (expr_p = &TREE_OPERAND (*expr_p, 0); > - handled_component_p (*expr_p); > - expr_p = &TREE_OPERAND (*expr_p, 0)) > - continue; > - obj = *expr_p; > - if (DECL_P (obj) && HAS_RTL_P (obj) && !DECL_RTL_SET_P (obj)) > - x = produce_memory_decl_rtl (obj, regno); > - break; > - > - case SSA_NAME: > - *ws = 0; > - obj = SSA_NAME_VAR (*expr_p); > - /* Defer handling of anonymous SSA_NAMEs to the expander. */ > - if (!obj) > - return NULL_TREE; > - if (!DECL_RTL_SET_P (obj)) > - x = gen_raw_REG (DECL_MODE (obj), (*regno)++); > - break; > - > - case VAR_DECL: > - case PARM_DECL: > - case RESULT_DECL: > - *ws = 0; > - obj = *expr_p; > - > - if (DECL_RTL_SET_P (obj)) > - break; > - > - if (DECL_MODE (obj) == BLKmode) > - x = produce_memory_decl_rtl (obj, regno); > - else > - x = gen_raw_REG (DECL_MODE (obj), (*regno)++); > - > - break; > - > - default: > - break; > - } > - > - if (x) > - { > - decl_rtl_to_reset.safe_push (obj); > - SET_DECL_RTL (obj, x); > - } > - > - return NULL_TREE; > -} > > /* Determines cost of the computation of EXPR. */ > > @@ -3792,7 +3711,10 @@ computation_cost (tree expr, bool speed) > > node->frequency = NODE_FREQUENCY_NORMAL; > crtl->maybe_hot_insn_p = speed; > - walk_tree (&expr, prepare_decl_rtl, ®no, NULL); > + decl_rtl_data data; > + data.regno = ®no; > + data.treevec = &decl_rtl_to_reset; > + walk_tree (&expr, prepare_decl_rtl, &data, NULL); > start_sequence (); > rslt = expand_expr (expr, NULL_RTX, TYPE_MODE (type), EXPAND_NORMAL); > seq = get_insns (); > @@ -5284,6 +5206,8 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > inv_exprs = BITMAP_ALLOC (NULL); > bitmap_set_bit (inv_exprs, inv_expr->id); > } > + if (use->zero_cost_p) > + cost.cost = 0; > set_group_iv_cost (data, group, cand, cost, > inv_vars, bound, comp, inv_exprs); > > @@ -7202,6 +7126,11 @@ rewrite_use_compare (struct ivopts_data *data, > return; > } > > + if (use->zero_cost_p) { > + preserve_ivs_for_use(data, use); > + return; > + } > + > /* The induction variable elimination failed; just express the original > giv. 
*/ > comp = get_computation_at (data->current_loop, use->stmt, use, cand); > @@ -7248,8 +7177,8 @@ rewrite_groups (struct ivopts_data *data) > > for (j = 0; j < group->vuses.length (); j++) > { > - rewrite_use_compare (data, group->vuses[j], cand); > - update_stmt (group->vuses[j]->stmt); > + rewrite_use_compare (data, group->vuses[j], cand); > + update_stmt (group->vuses[j]->stmt); Wrong formatting change? Also we can run contrib/check_GNU_style.sh for coding style issues. > } > } > } > @@ -7527,12 +7456,134 @@ loop_body_includes_call (basic_block *body, unsigned num_nodes) > return false; > } > > +/* For a comparison iv_use that we've chosen to ignore, we have to ensure > + that its dependent IVs are preserved here. Otherwise remove_unused_ivs > + will remove those IVs, leading this use to have no feeding IV. */ > + > +static void > +preserve_ivs_for_use (struct ivopts_data *data, struct iv_use *use) > +{ > + struct loop *loop = data->current_loop; > + struct iv *iv = use->iv; > + auto_vec<tree> work_list; > + work_list.safe_push (iv->ssa_name); > + bitmap visited = BITMAP_ALLOC (NULL); > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "Preserving IVs: "); > + > + while (!work_list.is_empty ()) > + { > + tree op = work_list.pop (); > + gcc_assert (TREE_CODE (op) == SSA_NAME); > + gcc_assert (name_info (data, op)); > + if (bitmap_bit_p (visited, SSA_NAME_VERSION (op))) > + continue; > + > + struct version_info *vi = name_info (data, op); > + vi->preserve_biv = true; > + bitmap_set_bit (visited, SSA_NAME_VERSION (iv->ssa_name)); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + print_generic_expr (dump_file, op, TDF_DETAILS); > + fprintf (dump_file, " "); > + } > + > + gimple *def = SSA_NAME_DEF_STMT (op); > + /* Ensure the incr_iv preserved as well, for some case > + like: > + i2 = phi(i0, i1); > + i1 = i2 + ...; > + cmp i2 ... ; > + */ > + if (gimple_code (def) == GIMPLE_PHI && gimple_bb (def) == loop->header) > + { > + tree incr_iv = PHI_ARG_DEF_FROM_EDGE (def, loop_latch_edge (loop)); > + if (bitmap_bit_p (visited, SSA_NAME_VERSION (incr_iv))) > + continue; > + name_info (data, incr_iv)->preserve_biv = true; > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + print_generic_expr (dump_file, incr_iv, TDF_DETAILS); > + fprintf (dump_file, " "); > + } > + continue; > + } > + > + /* Only need to check assign for iv chain. */ > + if (gimple_code (def) != GIMPLE_ASSIGN) > + continue; > + > + ssa_op_iter iter; > + use_operand_p use_p; > + tree use_op; > + struct iv *use_iv; > + FOR_EACH_PHI_OR_STMT_USE (use_p, def, iter, SSA_OP_USE) > + { > + use_op = USE_FROM_PTR (use_p); > + if ((use_iv = name_info (data, use_op)->iv) > + && !integer_zerop (use_iv->step)) > + work_list.safe_push (use_op); > + } > + } > + > + BITMAP_FREE (visited); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "\n\n"); > +} > + > +/* Doloop optimization RTL pass can make the related comparison computation > + become dead and get eliminated, then these comparison IV uses should NOT > + be considered in optimal IVs determination, mark them as ignored and taken > + as zero cost in cost modeling. 
*/ > + > +static void > +tailor_cmp_uses (struct ivopts_data *data) > +{ > + unsigned i; > + struct iv_group *group; > + struct loop *loop = data->current_loop; > + > + for (i = 0; i < data->vgroups.length (); i++) > + { > + group = data->vgroups[i]; > + if (group->type == USE_COMPARE) > + { > + gcc_assert (group->vuses.length () == 1); > + struct iv_use *use = group->vuses[0]; > + gimple *stmt = use->stmt; > + if (gimple_code (stmt) == GIMPLE_COND) > + { > + basic_block bb = gimple_bb (stmt); > + edge true_edge, false_edge; > + extract_true_false_edges_from_block (bb, &true_edge, &false_edge); > + > + /* This comparison is used for loop latch. Require latch is empty > + for now. */ > + if ((loop->latch == true_edge->dest > + || loop->latch == false_edge->dest) > + && gsi_end_p (gsi_last_bb (loop->latch))) > + { > + use->zero_cost_p = true; > + /* Dump for this use. */ > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, "Zeroing cost for use: "); > + print_gimple_stmt (dump_file, stmt, TDF_DETAILS); > + } > + } > + } > + } > + } > +} > + > /* Optimizes the LOOP. Returns true if anything changed. */ > > static bool > tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop) > { > bool changed = false; > + bool tailor_cmp_p = false; > struct iv_ca *iv_ca; > edge exit = single_dom_exit (loop); > basic_block *body; > @@ -7578,6 +7629,20 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop) > if (data->vgroups.length () > MAX_CONSIDERED_GROUPS) > goto finish; > > + tailor_cmp_p = targetm.predict_doloop_p (loop); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, > + "Predict loop %d can %sperform doloop optimization later.\n", > + loop->num, tailor_cmp_p ? "" : "not "); > + flow_loop_dump (loop, dump_file, NULL, 1); > + } > + > + /* Some compare iv_use is probably useless once the doloop optimization > + performs. */ > + if (tailor_cmp_p) > + tailor_cmp_uses (data); Function tailor_cmp_uses sets iv_use->zero_cost_p under some conditions. Once the flag is set, though the iv_use participates cost computation in determine_group_iv_cost_cond, the result cost is hard-wired to ZERO (which means cost computation for such iv_use is waste of time). Also iv_use rewriting process is skipped for related ivs preserved explicitly by preserve_ivs_for_use. Note IVOPTs already adds candidate for original ivs. So why not detecting doloop iv_use, adjust its cost with the corresponding original iv candidate, then let the general cost model make the decision? I believe this better fits existing infrastructure and requires less change, for example, preserve_ivs_for_use won't be necessary. It has other advantages too, for example, 1) candidate of original iv can be preferred for other iv_uses with doloop cost tuning; 2) the doloop decision can still be canceled by cost model if it's really not beneficial. With current patch, it can't be undo once the decision is made (at very early stage in IVOPTs.). Sorry didn't bring out this earlier, I only realized this after reading your code. > + > /* Finds candidates for the induction variables (item 2). */ > find_iv_candidates (data); > > -- > 2.7.4 > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-04-27 3:45 ` Bin.Cheng @ 2019-05-05 3:23 ` Kewen.Lin 2019-05-05 4:04 ` Bin.Cheng 0 siblings, 1 reply; 22+ messages in thread From: Kewen.Lin @ 2019-05-05 3:23 UTC (permalink / raw) To: Bin.Cheng Cc: GCC Patches, bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther, Jakub Jelinek Hi Bin, Sorry for late response (just back from vacation). Thanks very much for your comments. on 2019/4/27 ä¸å11:20, Bin.Cheng wrote: > For such non-trivial patch, we can improve review process by splitting > to smaller patches which can be reviewed/approved independently. Good idea! I'll split it. > Below are some general comments specific to IVOPTs change. >> for (j = 0; j < group->vuses.length (); j++) >> { >> - rewrite_use_compare (data, group->vuses[j], cand); >> - update_stmt (group->vuses[j]->stmt); >> + rewrite_use_compare (data, group->vuses[j], cand); >> + update_stmt (group->vuses[j]->stmt); > Wrong formatting change? Also we can run contrib/check_GNU_style.sh > for coding style issues. > Good catch! >> + /* Some compare iv_use is probably useless once the doloop optimization >> + performs. */ >> + if (tailor_cmp_p) >> + tailor_cmp_uses (data); > Function tailor_cmp_uses sets iv_use->zero_cost_p under some > conditions. Once the flag is set, though the iv_use participates cost > computation in determine_group_iv_cost_cond, the result cost is > hard-wired to ZERO (which means cost computation for such iv_use is > waste of time). Yes, it can be improved by some early check and return. But it's still helpful to make it call with may_eliminate_iv. gcc.dg/no-strict-overflow-6.c is one example, with may_eliminate_iv consideration it exposes more opportunities for downstream optimization. > Also iv_use rewriting process is skipped for related > ivs preserved explicitly by preserve_ivs_for_use. > Note IVOPTs already adds candidate for original ivs. So why not > detecting doloop iv_use, adjust its cost with the corresponding > original iv candidate, then let the general cost model make the > decision? I believe this better fits existing infrastructure and > requires less change, for example, preserve_ivs_for_use won't be > necessary. I agree adjusting the cost of original iv candidate of the iv_use requires less change, but on one hand, I thought to remove interest cmp iv use or make it zero cost is close to the fact. Since it's eliminated eventually in doloop optimization, it should not considered in cost modeling. This way looks more exact. One the other hand, I assumed your suggestion is to increase the cost for the pair (original iv cand, cmp iv use), the increase cost seems to be a heuristic value? It seems it's possible to sacrifice performance for some worst cases? Although it's integrated to the existing framework, it probably introduces other cost adjust issues. For example, the original iv cand is inclined not to be selected after this change, but without considering the cmp iv use, it's better to be selected. It looks highly depend on the cost tuning? Does my understanding above make sense? But I will follow your suggestion to update and check the regression testing result etc. I just updated the code to increase the cost for the pair (original iv cand, cmp iv use) with 5 (to balance the original iv 4 vs other 5), the failure case in PR80791 was fixed. I assumed IP_ORIGINAL for cand->pos to check original iv cand is enough. 
> It has other advantages too, for example, 1) candidate of > original iv can be preferred for other iv_uses with doloop cost I think the patch way is not intrusive either. > tuning; 2) the doloop decision can still be canceled by cost model if > it's really not beneficial. With current patch, it can't be undo once > the decision is made (at very early stage in IVOPTs.). I can't really follow this. If it's predicted to be transformed to doloop, I think it should not be undone any more, since it's useless to consider this cmp iv use. Whatever IVOPTS does, the compare at the loop close should not be changed (although it's possible to use another iv), right? Do I miss something? > Sorry didn't bring out this earlier, I only realized this after > reading your code. It doesn't matter indeed! Thanks again! ^ permalink raw reply [flat|nested] 22+ messages in thread
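A hedged sketch of the experiment Kewen describes above, i.e. penalizing the pair of the original iv candidate and the exit-compare use when the loop is predicted to become a doloop. The predicted_doloop_p flag and the constant 5 are taken from the discussion and are assumptions for this sketch, not committed code:

    /* In determine_group_iv_cost_cond, after the normal elimination /
       expression costs have been computed.  data->predicted_doloop_p
       would be set from targetm.predict_doloop_p earlier in the pass
       (hypothetical field).  */
    if (data->predicted_doloop_p
        && group->type == USE_COMPARE
        && cand->pos == IP_ORIGINAL)
      /* Discourage keeping the original iv only for the exit compare,
         since that compare is expected to die in doloop_optimize.  */
      cost.cost += 5;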
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-05-05 3:23 ` Kewen.Lin @ 2019-05-05 4:04 ` Bin.Cheng 2019-05-05 6:02 ` Kewen.Lin 2019-05-05 16:33 ` Segher Boessenkool 0 siblings, 2 replies; 22+ messages in thread From: Bin.Cheng @ 2019-05-05 4:04 UTC (permalink / raw) To: Kewen.Lin Cc: GCC Patches, bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther, Jakub Jelinek On Sun, May 5, 2019 at 11:23 AM Kewen.Lin <linkw@linux.ibm.com> wrote: > > Hi Bin, > > Sorry for late response (just back from vacation). > > Thanks very much for your comments. > > on 2019/4/27 上午11:20, Bin.Cheng wrote: > > For such non-trivial patch, we can improve review process by splitting > > to smaller patches which can be reviewed/approved independently. > > Good idea! I'll split it. > > > Below are some general comments specific to IVOPTs change. > > >> for (j = 0; j < group->vuses.length (); j++) > >> { > >> - rewrite_use_compare (data, group->vuses[j], cand); > >> - update_stmt (group->vuses[j]->stmt); > >> + rewrite_use_compare (data, group->vuses[j], cand); > >> + update_stmt (group->vuses[j]->stmt); > > Wrong formatting change? Also we can run contrib/check_GNU_style.sh > > for coding style issues. > > > > Good catch! > > >> + /* Some compare iv_use is probably useless once the doloop optimization > >> + performs. */ > >> + if (tailor_cmp_p) > >> + tailor_cmp_uses (data); > > Function tailor_cmp_uses sets iv_use->zero_cost_p under some > > conditions. Once the flag is set, though the iv_use participates cost > > computation in determine_group_iv_cost_cond, the result cost is > > hard-wired to ZERO (which means cost computation for such iv_use is > > waste of time). > > Yes, it can be improved by some early check and return. > But it's still helpful to make it call with may_eliminate_iv. > gcc.dg/no-strict-overflow-6.c is one example, with may_eliminate_iv > consideration it exposes more opportunities for downstream optimization. Hmm, I wasn't suggesting early check and return, on the contrary, we can better integrate doloop/cost stuff in the overall model. See more in following comments. > > > Also iv_use rewriting process is skipped for related > > ivs preserved explicitly by preserve_ivs_for_use. > > Note IVOPTs already adds candidate for original ivs. So why not > > detecting doloop iv_use, adjust its cost with the corresponding > > original iv candidate, then let the general cost model make the > > decision? I believe this better fits existing infrastructure and > > requires less change, for example, preserve_ivs_for_use won't be > > necessary. > I agree adjusting the cost of original iv candidate of the iv_use > requires less change, but on one hand, I thought to remove interest > cmp iv use or make it zero cost is close to the fact. Since it's > eliminated eventually in doloop optimization, it should not > considered in cost modeling. This way looks more exact. Whether doloop transformation should be considered or be bypassed in cost model isn't a problem. Actually we can bind doloop iv_cand to cmp iv_use in order to force the transformation. My concern is the patch specially handles doloop by setting the special flag, then checking it. We generally avoid such special-case handling since it hurts long time maintenance. The pass was very unstable in the pass because of such issues. > One the other hand, I assumed your suggestion is to increase the > cost for the pair (original iv cand, cmp iv use), the increase cost > seems to be a heuristic value? 
It seems it's possible to sacrifice Decrease the cost so that the iv_cand is preferred? The comment wasn't made on top of implementing doloop in ivopts. Anyway, you can still bypass cost model by binding the "correct" iv_cand to cmp iv_use. > performance for some worst cases? Although it's integrated to the > existing framework, it probably introduces other cost adjust issues. > For example, the original iv cand is inclined not to be selected after > this change, but without considering the cmp iv use, it's better to > be selected. It looks highly depend on the cost tuning? > > Does my understanding above make sense? > > But I will follow your suggestion to update and check the regression > testing result etc. I just updated the code to increase the cost for > the pair (original iv cand, cmp iv use) with 5 (to balance the original > iv 4 vs other 5), the failure case in PR80791 was fixed. I assumed > IP_ORIGINAL for cand->pos to check original iv cand is enough. > > > > It has other advantages too, for example, 1) candidate of > > original iv can be preferred for other iv_uses with doloop cost > > I think the patch way is not intrusive either. > > > tuning; 2) the doloop decision can still be canceled by cost model if > > it's really not beneficial. With current patch, it can't be undo once > > the decision is made (at very early stage in IVOPTs.). > > I can't really follow this. If it's predicted to be transformed to doloop, > I think it should not be undoed any more, since it's useless to consider > this cmp iv use. Whatever IVOPTS does, the comp at loop closing should not > be changed (although possible to use other iv), right? Do I miss something? As mentioned, the previous comment wasn't made on top of implementing doloop in ivopts. That would be nice but a different story. Before we can do that, it'd better be conservative and only makes (doloop) decision in ivopts when you are sure. As you mentioned, it's hard to do the same checks at gimple as RTL, right? In this case, making it a (conservative) heuristic capturing certain beneficial cases sounds better than capturing all cases but fail in later RTL passes. Thanks, bin > > > Sorry didn't bring out this earlier, I only realized this after > > reading your code. > > It doesn't matter indeed! Thanks again! > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-05-05 4:04 ` Bin.Cheng @ 2019-05-05 6:02 ` Kewen.Lin 2019-05-06 1:50 ` Bin.Cheng 2019-05-05 16:33 ` Segher Boessenkool 1 sibling, 1 reply; 22+ messages in thread From: Kewen.Lin @ 2019-05-05 6:02 UTC (permalink / raw) To: Bin.Cheng Cc: GCC Patches, bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther, Jakub Jelinek on 2019/5/5 ä¸å12:04, Bin.Cheng wrote: > On Sun, May 5, 2019 at 11:23 AM Kewen.Lin <linkw@linux.ibm.com> wrote: >>>> + /* Some compare iv_use is probably useless once the doloop optimization >>>> + performs. */ >>>> + if (tailor_cmp_p) >>>> + tailor_cmp_uses (data); >>> Function tailor_cmp_uses sets iv_use->zero_cost_p under some >>> conditions. Once the flag is set, though the iv_use participates cost >>> computation in determine_group_iv_cost_cond, the result cost is >>> hard-wired to ZERO (which means cost computation for such iv_use is >>> waste of time). >> >> Yes, it can be improved by some early check and return. >> But it's still helpful to make it call with may_eliminate_iv. >> gcc.dg/no-strict-overflow-6.c is one example, with may_eliminate_iv >> consideration it exposes more opportunities for downstream optimization. > Hmm, I wasn't suggesting early check and return, on the contrary, we > can better integrate doloop/cost stuff in the overall model. See more > in following comments. Sorry, I didn't claim it clearly, the previous comment is to claim the call with may_eliminate_iv is not completely "waste of time", and early return can make it save time. :) And yes, it's not an issue any more with your proposed idea. >> >>> Also iv_use rewriting process is skipped for related >>> ivs preserved explicitly by preserve_ivs_for_use. >>> Note IVOPTs already adds candidate for original ivs. So why not >>> detecting doloop iv_use, adjust its cost with the corresponding >>> original iv candidate, then let the general cost model make the >>> decision? I believe this better fits existing infrastructure and >>> requires less change, for example, preserve_ivs_for_use won't be >>> necessary. >> I agree adjusting the cost of original iv candidate of the iv_use >> requires less change, but on one hand, I thought to remove interest >> cmp iv use or make it zero cost is close to the fact. Since it's >> eliminated eventually in doloop optimization, it should not >> considered in cost modeling. This way looks more exact. > Whether doloop transformation should be considered or be bypassed in > cost model isn't a problem. Actually we can bind doloop iv_cand to > cmp iv_use in order to force the transformation. My concern is the > patch specially handles doloop by setting the special flag, then > checking it. We generally avoid such special-case handling since it > hurts long time maintenance. The pass was very unstable in the pass > because of such issues. > OK, I understand your concerns now. Thanks for explanation! >> One the other hand, I assumed your suggestion is to increase the >> cost for the pair (original iv cand, cmp iv use), the increase cost >> seems to be a heuristic value? It seems it's possible to sacrifice > Decrease the cost so that the iv_cand is preferred? The comment > wasn't made on top of implementing doloop in ivopts. Anyway, you can > still bypass cost model by binding the "correct" iv_cand to cmp > iv_use. 
> To decrease the cost isn't workable for this case: it makes the original iv cand even more preferred over the other iv cand for the memory iv use, so the desirable memory based iv cand won't be chosen. If we increase the cost, one basic iv cand is chosen for the cmp use and a memory based iv cand is chosen for the memory use, instead of the original iv for both. Could you help to explain the "binding" more? Does it mean the cost model decision making can simply bypass the cmp iv_use (we have artificially assigned one "correct" iv cand to it) and also not count the "correct" iv cand's cost into the total iv cost? Is my understanding correct? >>> tuning; 2) the doloop decision can still be canceled by cost model if >>> it's really not beneficial. With current patch, it can't be undo once >>> the decision is made (at very early stage in IVOPTs.). >> >> I can't really follow this. If it's predicted to be transformed to doloop, >> I think it should not be undoed any more, since it's useless to consider >> this cmp iv use. Whatever IVOPTS does, the comp at loop closing should not >> be changed (although possible to use other iv), right? Do I miss something? > As mentioned, the previous comment wasn't made on top of implementing > doloop in ivopts. That would be nice but a different story. > Before we can do that, it'd better be conservative and only makes > (doloop) decision in ivopts when you are sure. As you mentioned, it's > hard to do the same checks at gimple as RTL, right? In this case, > making it a (conservative) heuristic capturing certain beneficial > cases sounds better than capturing all cases but fail in later RTL > passes. > Yes, I agree we should be conservative. But it's hard to determine which is better in practice; even when capturing all cases, we still try our best to be conservative, excluding any suspicious factor that could make it fail in later RTL checking. One example is that the patch won't predict a loop can be a doloop once it finds a switch statement. It depends on how many "bad" cases we fail to catch, how serious their impact is, and how easy they are to improve. Thanks Kewen.Lin ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-05-05 6:02 ` Kewen.Lin @ 2019-05-06 1:50 ` Bin.Cheng 2019-05-06 7:31 ` Kewen.Lin 0 siblings, 1 reply; 22+ messages in thread From: Bin.Cheng @ 2019-05-06 1:50 UTC (permalink / raw) To: Kewen.Lin Cc: GCC Patches, bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther, Jakub Jelinek On Sun, May 5, 2019 at 2:02 PM Kewen.Lin <linkw@linux.ibm.com> wrote: > > on 2019/5/5 下午12:04, Bin.Cheng wrote: > > On Sun, May 5, 2019 at 11:23 AM Kewen.Lin <linkw@linux.ibm.com> wrote: > >>>> + /* Some compare iv_use is probably useless once the doloop optimization > >>>> + performs. */ > >>>> + if (tailor_cmp_p) > >>>> + tailor_cmp_uses (data); > >>> Function tailor_cmp_uses sets iv_use->zero_cost_p under some > >>> conditions. Once the flag is set, though the iv_use participates cost > >>> computation in determine_group_iv_cost_cond, the result cost is > >>> hard-wired to ZERO (which means cost computation for such iv_use is > >>> waste of time). > >> > >> Yes, it can be improved by some early check and return. > >> But it's still helpful to make it call with may_eliminate_iv. > >> gcc.dg/no-strict-overflow-6.c is one example, with may_eliminate_iv > >> consideration it exposes more opportunities for downstream optimization. > > Hmm, I wasn't suggesting early check and return, on the contrary, we > > can better integrate doloop/cost stuff in the overall model. See more > > in following comments. > > Sorry, I didn't claim it clearly, the previous comment is to claim the > call with may_eliminate_iv is not completely "waste of time", and early > return can make it save time. :) > > And yes, it's not an issue any more with your proposed idea. > > >> > >>> Also iv_use rewriting process is skipped for related > >>> ivs preserved explicitly by preserve_ivs_for_use. > >>> Note IVOPTs already adds candidate for original ivs. So why not > >>> detecting doloop iv_use, adjust its cost with the corresponding > >>> original iv candidate, then let the general cost model make the > >>> decision? I believe this better fits existing infrastructure and > >>> requires less change, for example, preserve_ivs_for_use won't be > >>> necessary. > >> I agree adjusting the cost of original iv candidate of the iv_use > >> requires less change, but on one hand, I thought to remove interest > >> cmp iv use or make it zero cost is close to the fact. Since it's > >> eliminated eventually in doloop optimization, it should not > >> considered in cost modeling. This way looks more exact. > > Whether doloop transformation should be considered or be bypassed in > > cost model isn't a problem. Actually we can bind doloop iv_cand to > > cmp iv_use in order to force the transformation. My concern is the > > patch specially handles doloop by setting the special flag, then > > checking it. We generally avoid such special-case handling since it > > hurts long time maintenance. The pass was very unstable in the pass > > because of such issues. > > > > OK, I understand your concerns now. Thanks for explanation! > > >> One the other hand, I assumed your suggestion is to increase the > >> cost for the pair (original iv cand, cmp iv use), the increase cost > >> seems to be a heuristic value? It seems it's possible to sacrifice > > Decrease the cost so that the iv_cand is preferred? The comment > > wasn't made on top of implementing doloop in ivopts. Anyway, you can > > still bypass cost model by binding the "correct" iv_cand to cmp > > iv_use. 
> > > > To decrease the cost isn't workable for this case, it make the original > iv cand is preferred more and over the other iv cand for memory iv use, > then the desirable memory based iv cand won't be chosen. > If increase the cost, one basic iv cand is chosen for cmp use, memory > based iv cand is chosen for memory use, instead of original iv for both. Sorry for the mistake, I meant to decrease the use cost of whichever "correct" iv_cand for the cmp iv_use could enable the doloop optimization; it doesn't have to be the original iv_cand. > > Could you help to explain the "binding" more? Does it mean cost modeling > decision making can simply bypass the cmp iv_use (we have artificially > assigned one "correct" cand iv to it) and also doesn't count the "correct" > iv cand cost into total iv cost? Is my understanding correct? For example, if the heuristic finds the "correct" doloop iv_cand, we can force choosing that iv_cand for the cmp iv_use and bypass the candidate-choosing algorithm: struct iv_group { //... struct iv_cand *bind_cand; }; then set this bind_cand directly in struct iv_ca::cand_for_group. As a result, the iv_use rewriting code takes care of everything, and no special handling (such as preserve_ivs_for_use) is needed. Whether to let the cost model decide the "correct" iv_cand or to bind it yourself depends on how good your heuristic check is. It's your call. :) > > >>> tuning; 2) the doloop decision can still be canceled by cost model if > >>> it's really not beneficial. With current patch, it can't be undo once > >>> the decision is made (at very early stage in IVOPTs.). > >> > >> I can't really follow this. If it's predicted to be transformed to doloop, > >> I think it should not be undoed any more, since it's useless to consider > >> this cmp iv use. Whatever IVOPTS does, the comp at loop closing should not > >> be changed (although possible to use other iv), right? Do I miss something? > > As mentioned, the previous comment wasn't made on top of implementing > > doloop in ivopts. That would be nice but a different story. > > Before we can do that, it'd better be conservative and only makes > > (doloop) decision in ivopts when you are sure. As you mentioned, it's > > hard to do the same checks at gimple as RTL, right? In this case, > > making it a (conservative) heuristic capturing certain beneficial > > cases sounds better than capturing all cases but fail in later RTL > > passes. > > > > Yes, I agree we should be conservative. But it's hard to determine which is > better in practice, even for capturing all cases, we are still trying our best > to be more conservative, excluding any suspicious factor which is possible to > make it fail in later RTL checking, one example is that the patch won't predict > it can be doloop once finding switch statement. It depends on how much "bad" > cases we don't catch and how serious its impact is and whether easy to improve. Sure, I don't know ppc, so I have full trust in your decision here. Thanks for your patience. Thanks, bin ^ permalink raw reply [flat|nested] 22+ messages in thread
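To make the quoted suggestion concrete, here is a toy illustration of the binding idea; choose_cand_for_group and the cut-down structs are hypothetical names used only for this sketch, whereas in the real pass the check would sit where iv_ca::cand_for_group is filled in:

/* Cut-down structs for illustration only; the real iv_group/iv_cand
   live in tree-ssa-loop-ivopts.c and carry far more state.  */
struct iv_cand { int id; };
struct iv_group
{
  int id;
  struct iv_cand *bind_cand;  /* Forced candidate for this group, or NULL.  */
};

/* Hypothetical selection helper: honor an existing binding (e.g. the
   compare group of a predicted doloop) and skip the normal
   candidate-selection algorithm; otherwise fall back to whatever the
   cost model picked.  */
struct iv_cand *
choose_cand_for_group (struct iv_group *group, struct iv_cand *best_by_cost)
{
  if (group->bind_cand)
    return group->bind_cand;
  return best_by_cost;
}

With such a binding in place, the existing iv_use rewriting code handles the rest, which is why no separate preserve_ivs_for_use path would be needed.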
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-05-06 1:50 ` Bin.Cheng @ 2019-05-06 7:31 ` Kewen.Lin 0 siblings, 0 replies; 22+ messages in thread From: Kewen.Lin @ 2019-05-06 7:31 UTC (permalink / raw) To: Bin.Cheng Cc: GCC Patches, bin.cheng, Segher Boessenkool, Bill Schmidt, Richard Guenther, Jakub Jelinek on 2019/5/6 上午9:50, Bin.Cheng wrote: > On Sun, May 5, 2019 at 2:02 PM Kewen.Lin <linkw@linux.ibm.com> wrote: >> >> To decrease the cost isn't workable for this case, it make the original >> iv cand is preferred more and over the other iv cand for memory iv use, >> then the desirable memory based iv cand won't be chosen. >> If increase the cost, one basic iv cand is chosen for cmp use, memory >> based iv cand is chosen for memory use, instead of original iv for both. > Sorry for the mistake, I meant to decrease use cost of whatever "correct" > iv_cand for cmp iv_use that could enable doloop optimization, it doesn't > has to the original iv_cand. > Thanks for your clarification. As mentioned, we are trying to make the predict doloop hook as conservative as possible, so under the hypothesis that the predicted cmp iv use is going to be eliminated later, I'd prefer to bind the iv_cand artificially. Whichever iv_cand is selected as the "correct" one and has its cost increased or decreased, we would still be accounting for a cmp iv use that is about to be eliminated (it doesn't require any iv cand eventually), and that will *probably* affect the optimal candidate set decision. >> Could you help to explain the "binding" more? Does it mean cost modeling >> decision making can simply bypass the cmp iv_use (we have artificially >> assigned one "correct" cand iv to it) and also doesn't count the "correct" >> iv cand cost into total iv cost? Is my understanding correct? > For example, if the heuristic finds out the "correct" doloop iv_cand, we can > force choosing that iv_cand for cmp iv_use and bypass the candidate choosing > algorithm: > struct iv_group { > //... > struct iv_cand *bind_cand; > }; > then set this bind_cand directly in struct iv_ca::cand_for_group. As a result, > iv_use rewriting code takes care of everything, no special handling (such as > preserve_ivs_for_use) is needed. > > Whether letting cost model decide the "correct" iv_cand or bind it by yourself > depends on how good your heuristic check is. It's your call. :) > Thanks a lot for your detailed explanation! As mentioned above, I prefer to use "bind" for it. I'll modify the patch to use the "bind cand" way: initially set it to the relevant iv cand originally used in that cmp iv use, then update it if it can be updated with IV_ELIM. >> Yes, I agree we should be conservative. But it's hard to determine which is >> better in practice, even for capturing all cases, we are still trying our best >> to be more conservative, excluding any suspicious factor which is possible to >> make it fail in later RTL checking, one example is that the patch won't predict >> it can be doloop once finding switch statement. It depends on how much "bad" >> cases we don't catch and how serious its impact is and whether easy to improve. > Sure, I don't know ppc so have all the trust in your decision here. > > Thanks for your patience. > Thank you too!!! Kewen > Thanks, > bin > ^ permalink raw reply [flat|nested] 22+ messages in thread
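A rough sketch of that binding plan; cand_for_original_iv and cand_allowing_elimination are placeholder names invented for illustration, not real ivopts interfaces:

/* Placeholder declarations for illustration only; bind_doloop_cmp_group
   shows the intended two-step flow.  */
struct iv_cand;
struct iv_group { struct iv_cand *bind_cand; };

extern struct iv_cand *cand_for_original_iv (struct iv_group *);
extern struct iv_cand *cand_allowing_elimination (struct iv_group *);

void
bind_doloop_cmp_group (struct iv_group *cmp_group)
{
  /* Start from the candidate based on the iv originally used in the
     compare, so a counter suitable for the doloop form is always bound.  */
  cmp_group->bind_cand = cand_for_original_iv (cmp_group);

  /* Then, if elimination analysis (the IV_ELIM case) finds a candidate
     that can replace the exit test, rebind to that one.  */
  struct iv_cand *elim = cand_allowing_elimination (cmp_group);
  if (elim)
    cmp_group->bind_cand = elim;
}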
* Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts 2019-05-05 4:04 ` Bin.Cheng 2019-05-05 6:02 ` Kewen.Lin @ 2019-05-05 16:33 ` Segher Boessenkool 1 sibling, 0 replies; 22+ messages in thread From: Segher Boessenkool @ 2019-05-05 16:33 UTC (permalink / raw) To: Bin.Cheng Cc: Kewen.Lin, GCC Patches, bin.cheng, Bill Schmidt, Richard Guenther, Jakub Jelinek On Sun, May 05, 2019 at 12:04:00PM +0800, Bin.Cheng wrote: > On Sun, May 5, 2019 at 11:23 AM Kewen.Lin <linkw@linux.ibm.com> wrote: > > I can't really follow this. If it's predicted to be transformed to doloop, > > I think it should not be undoed any more, since it's useless to consider > > this cmp iv use. Whatever IVOPTS does, the comp at loop closing should not > > be changed (although possible to use other iv), right? Do I miss something? > As mentioned, the previous comment wasn't made on top of implementing > doloop in ivopts. That would be nice but a different story. > Before we can do that, it'd better be conservative and only makes > (doloop) decision in ivopts when you are sure. As you mentioned, it's > hard to do the same checks at gimple as RTL, right? In this case, > making it a (conservative) heuristic capturing certain beneficial > cases sounds better than capturing all cases but fail in later RTL > passes. But not *overly* conservative. If some situation almost never happens, it's better to have ivopts guess wrong some of the time than it is to just optimise less aggressively. If ivopts makes a non-optimal decision you still end up with valid code :-) Segher ^ permalink raw reply [flat|nested] 22+ messages in thread