public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
@ 2020-01-16  9:41 Kewen.Lin
  2020-01-16  9:43 ` [PATCH 1/4 GCC11] Add middle-end unroll factor estimation Kewen.Lin
                   ` (4 more replies)
  0 siblings, 5 replies; 45+ messages in thread
From: Kewen.Lin @ 2020-01-16  9:41 UTC (permalink / raw)
  To: GCC Patches; +Cc: Segher Boessenkool, Bill Schmidt, bin.cheng, Richard Guenther

Hi,

As we discussed in the thread
https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
(original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html),
I'm working on teaching IVOPTs to consider D-form group accesses during unrolling.
The difference between D-form and the other forms during unrolling is that we
can fold the stride into the displacement field and avoid additional step
increments, e.g.:

With X-form (uf step increments):
  ...
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  ...

With D-form (one step increment for each base):
  ...
  LD A = baseA, OFF
  LD B = baseB, OFF
  ST C = baseC, OFF
  LD A = baseA, OFF+stride
  LD B = baseB, OFF+stride
  ST C = baseC, OFF+stride
  LD A = baseA, OFF+2*stride
  LD B = baseB, OFF+2*stride
  ST C = baseC, OFF+2*stride
  ...
  baseA += stride * uf
  baseB += stride * uf
  baseC += stride * uf

Imagine the loop gets unrolled 8 times: that means 3 step updates with D-form
vs. 8 step updates with X-form.  Here we only need to check that the stride
meets the D-form displacement field requirement, since if OFF doesn't, we can
construct a new base baseA' = baseA + OFF.
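
As a concrete illustration (a made-up example, not taken from the patch),
the kind of loop this targets looks like:

  /* Hypothetical example: with -O2 -funroll-loops the unrolled copies of the
     body can address a[i], a[i+1], ... with reg+displacement (D-form)
     accesses, so only the three base pointers need one update per unrolled
     iteration.  */
  void
  vadd (double *restrict c, const double *restrict a,
        const double *restrict b, unsigned long n)
  {
    for (unsigned long i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }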

This patch set consists of four parts:
     
  [PATCH 1/4 GCC11] Add middle-end unroll factor estimation

     Add unroll factor estimation in the middle end.  It mainly follows the
     current RTL unroll factor determination in function decide_unrolling and
     its subroutines.  As Richard B. suggested, we could probably force the
     unroll factor with this and avoid duplicated unroll factor calculation,
     but I think that needs more benchmarking work and should be handled
     separately.

  [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p 

     Add a target hook to determine whether a memory access with the given
     mode, stride and other flags has D-form support available.
     
  [PATCH 3/4 GCC11] IVOPTs Consider cost_step on different forms during unrolling

     Teach IVOPTs to identify address-type IV groups for which D-form is
     preferred, and set the dform_p flag on their derived IV candidates.
     Taking the unroll factor into account, increase the IV cost by
     (uf - 1) * cost_step if the candidate is not a D-form one (see the
     worked example after this list).
     
  [PATCH 4/4 GCC11] rs6000: P9 D-form test cases

     Add some test cases, mainly copied from Kelvin's patch.
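
As a worked example for the 3/4 cost adjustment (made-up numbers): if the
estimated unroll factor is 8 and cost_step is 1, a non-D-form IV candidate
gets its cost increased by (8 - 1) * 1 = 7, while a D-form candidate keeps
its base cost, so the D-form candidate becomes more likely to be chosen for
the final IV set.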

Bootstrapped and regression tested on powerpc64le-linux-gnu.
I'll be on leave for two weeks soon, so please expect delayed responses.
Thanks a lot in advance!

BR,
Kewen

------------

 gcc/cfgloop.h                                       |   3 +
 gcc/config/rs6000/rs6000.c                          |  56 ++++++++++++++++-
 gcc/doc/tm.texi                                     |  14 +++++
 gcc/doc/tm.texi.in                                  |   4 ++
 gcc/target.def                                      |  21 ++++++-
 gcc/testsuite/gcc.target/powerpc/p9-dform-0.c       |  43 +++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-1.c       |  55 +++++++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-2.c       |  12 ++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-3.c       |  15 +++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-4.c       |  12 ++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h |  34 +++++++++++
 gcc/tree-ssa-loop-ivopts.c                          |  84 +++++++++++++++++++++++++-
 gcc/tree-ssa-loop-manip.c                           | 254 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/tree-ssa-loop-manip.h                           |   3 +-
 gcc/tree-ssa-loop.c                                 |  33 ++++++++++
 gcc/tree-ssa-loop.h                                 |   2 +
 16 files changed, 640 insertions(+), 5 deletions(-)


* [PATCH 1/4 GCC11] Add middle-end unroll factor estimation
  2020-01-16  9:41 [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling Kewen.Lin
@ 2020-01-16  9:43 ` Kewen.Lin
  2020-01-20 13:12   ` Segher Boessenkool
  2020-01-16 10:02 ` [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p Kewen.Lin
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 45+ messages in thread
From: Kewen.Lin @ 2020-01-16  9:43 UTC (permalink / raw)
  To: GCC Patches; +Cc: Segher Boessenkool, Bill Schmidt, bin.cheng, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 731 bytes --]

gcc/ChangeLog

2020-01-16  Kewen Lin  <linkw@gcc.gnu.org>

	* cfgloop.h (struct loop): New field estimated_uf.
	* config/rs6000/rs6000.c (TARGET_LOOP_UNROLL_ADJUST_TREE): New macro.
	(rs6000_loop_unroll_adjust_tree): New function.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in (TARGET_LOOP_UNROLL_ADJUST_TREE): New hook.
	* target.def (loop_unroll_adjust_tree): New hook.
	* tree-ssa-loop-manip.c (decide_uf_const_iter): New function.
	(decide_uf_runtime_iter): Likewise.
	(decide_uf_stupid): Likewise.
	(estimate_unroll_factor): Likewise.
	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declaration.
	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
	* tree-ssa-loop.h (tree_average_num_loop_insns): New declaration.

[-- Attachment #2: 0001.patch --]
[-- Type: text/plain, Size: 15184 bytes --]

 gcc/cfgloop.h              |   3 +
 gcc/config/rs6000/rs6000.c |  16 ++-
 gcc/doc/tm.texi            |   6 ++
 gcc/doc/tm.texi.in         |   2 +
 gcc/target.def             |   8 ++
 gcc/tree-ssa-loop-manip.c  | 254 +++++++++++++++++++++++++++++++++++++++++++++
 gcc/tree-ssa-loop-manip.h  |   3 +-
 gcc/tree-ssa-loop.c        |  33 ++++++
 gcc/tree-ssa-loop.h        |   2 +
 9 files changed, 324 insertions(+), 3 deletions(-)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index e3590d7..feceed6 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -232,6 +232,9 @@ public:
      Other values means unroll with the given unrolling factor.  */
   unsigned short unroll;
 
+  /* Like unroll field above, but it's estimated in middle-end.  */
+  unsigned short estimated_uf;
+
   /* If this loop was inlined the main clique of the callee which does
      not need remapping when copying the loop body.  */
   unsigned short owned_clique;
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 2995348..0dabaa6 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1431,6 +1431,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_LOOP_UNROLL_ADJUST
 #define TARGET_LOOP_UNROLL_ADJUST rs6000_loop_unroll_adjust
 
+#undef TARGET_LOOP_UNROLL_ADJUST_TREE
+#define TARGET_LOOP_UNROLL_ADJUST_TREE rs6000_loop_unroll_adjust_tree
+
 #undef TARGET_INIT_BUILTINS
 #define TARGET_INIT_BUILTINS rs6000_init_builtins
 #undef TARGET_BUILTIN_DECL
@@ -5090,7 +5093,8 @@ rs6000_destroy_cost_data (void *data)
   free (data);
 }
 
-/* Implement targetm.loop_unroll_adjust.  */
+/* Implement targetm.loop_unroll_adjust.  Don't forget to update
+   loop_unroll_adjust_tree for any changes.  */
 
 static unsigned
 rs6000_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
@@ -5109,6 +5113,16 @@ rs6000_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
   return nunroll;
 }
 
+/* Implement targetm.loop_unroll_adjust_tree, which strictly follows
+   targetm.loop_unroll_adjust.  */
+
+static unsigned
+rs6000_loop_unroll_adjust_tree (unsigned nunroll, struct loop *loop)
+{
+  /* For now loop_unroll_adjust is simple, just invoke directly.  */
+  return rs6000_loop_unroll_adjust (nunroll, loop);
+}
+
 /* Handler for the Mathematical Acceleration Subsystem (mass) interface to a
    library with vectorized intrinsics.  */
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 2244df4..86ad278 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11875,6 +11875,12 @@ is required only when the target has special constraints like maximum
 number of memory accesses.
 @end deftypefn
 
+@deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST_TREE (unsigned @var{nunroll}, class loop *@var{loop})
+This target hook is the same as @code{loop_unroll_adjust}, but it's for
+middle-end unroll factor estimation computation. See
+@code{loop_unroll_adjust} for the function description.
+@end deftypefn
+
 @defmac POWI_MAX_MULTS
 If defined, this macro is interpreted as a signed integer C expression
 that specifies the maximum number of floating point multiplications
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 52cd603..fd9769e 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -8008,6 +8008,8 @@ lists.
 
 @hook TARGET_LOOP_UNROLL_ADJUST
 
+@hook TARGET_LOOP_UNROLL_ADJUST_TREE
+
 @defmac POWI_MAX_MULTS
 If defined, this macro is interpreted as a signed integer C expression
 that specifies the maximum number of floating point multiplications
diff --git a/gcc/target.def b/gcc/target.def
index e705c5d..f61c831 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2725,6 +2725,14 @@ number of memory accesses.",
  unsigned, (unsigned nunroll, class loop *loop),
  NULL)
 
+DEFHOOK
+(loop_unroll_adjust_tree,
+ "This target hook is the same as @code{loop_unroll_adjust}, but it's for\n\
+middle-end unroll factor estimation computation. See\n\
+@code{loop_unroll_adjust} for the function description.",
+ unsigned, (unsigned nunroll, class loop *loop),
+ NULL)
+
 /* True if X is a legitimate MODE-mode immediate operand.  */
 DEFHOOK
 (legitimate_constant_p,
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index a79912a..db7f6e6 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "system.h"
 #include "coretypes.h"
 #include "backend.h"
+#include "target.h"
 #include "tree.h"
 #include "gimple.h"
 #include "cfghooks.h"
@@ -42,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "tree-scalar-evolution.h"
 #include "tree-inline.h"
+#include "wide-int.h"
 
 /* All bitmaps for rewriting into loop-closed SSA go on this obstack,
    so that we can free them all at once.  */
@@ -1592,3 +1594,255 @@ canonicalize_loop_ivs (class loop *loop, tree *nit, bool bump_in_latch)
 
   return var_before;
 }
+
+/* Try to determine estimated unroll factor for given LOOP with constant number
+   of iterations, mainly referring to decide_unroll_constant_iterations.
+    - NITER_DESC holds number of iteration description if it isn't NULL.
+    - NUNROLL holds an unroll factor value computed from instruction counts.
+    - ITER holds estimated or likely max loop iterations.
+   Return true if it succeeds, also update estimated_uf.  */
+
+static bool
+decide_uf_const_iter (class loop *loop, const tree_niter_desc *niter_desc,
+		      unsigned nunroll, const widest_int *iter)
+{
+  /* Skip big loops.  */
+  if (nunroll <= 1)
+    return false;
+
+  gcc_assert (niter_desc && niter_desc->assumptions);
+
+  /* Check number of iterations is constant.  */
+  if ((niter_desc->may_be_zero && !integer_zerop (niter_desc->may_be_zero))
+      || !tree_fits_uhwi_p (niter_desc->niter))
+    return false;
+
+  unsigned HOST_WIDE_INT const_niter = tree_to_uhwi (niter_desc->niter);
+
+  /* Check for an explicit unrolling factor.  */
+  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
+    {
+      /* It should have been peeled instead.  */
+      if (const_niter == 0 || (unsigned) loop->unroll > const_niter - 1)
+	loop->estimated_uf = 1;
+      else
+	loop->estimated_uf = loop->unroll;
+      return true;
+    }
+
+  /* Check whether the loop rolls enough to consider.  */
+  if (const_niter < 2 * nunroll || wi::ltu_p (*iter, 2 * nunroll))
+    return false;
+
+  /* Success; now compute number of iterations to unroll.  */
+  unsigned best_unroll = 0, n_copies = 0;
+  unsigned best_copies = 2 * nunroll + 10;
+  unsigned i = 2 * nunroll + 2;
+
+  if (i > const_niter - 2)
+    i = const_niter - 2;
+
+  for (; i >= nunroll - 1; i--)
+    {
+      unsigned exit_mod = const_niter % (i + 1);
+
+      if (!empty_block_p (loop->latch))
+	n_copies = exit_mod + i + 1;
+      else if (exit_mod != i)
+	n_copies = exit_mod + i + 2;
+      else
+	n_copies = i + 1;
+
+      if (n_copies < best_copies)
+	{
+	  best_copies = n_copies;
+	  best_unroll = i;
+	}
+    }
+
+  loop->estimated_uf = best_unroll + 1;
+  return true;
+}
+
+/* Try to determine estimated unroll factor for given LOOP with countable but
+   non-constant number of iterations, mainly referring to
+   decide_unroll_runtime_iterations.
+    - NITER_DESC holds number of iteration description if it isn't NULL.
+    - NUNROLL_IN holds an unroll factor value computed from instruction counts.
+    - ITER holds estimated or likely max loop iterations.
+   Return true if it succeeds, also update estimated_uf.  */
+
+static bool
+decide_uf_runtime_iter (class loop *loop, const tree_niter_desc *niter_desc,
+			unsigned nunroll_in, const widest_int *iter)
+{
+  unsigned nunroll = nunroll_in;
+  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
+    nunroll = loop->unroll;
+
+  /* Skip big loops.  */
+  if (nunroll <= 1)
+    return false;
+
+  gcc_assert (niter_desc && niter_desc->assumptions);
+
+  /* Skip constant number of iterations.  */
+  if ((!niter_desc->may_be_zero || !integer_zerop (niter_desc->may_be_zero))
+      && tree_fits_uhwi_p (niter_desc->niter))
+    return false;
+
+  /* Check whether the loop rolls.  */
+  if (wi::ltu_p (*iter, 2 * nunroll))
+    return false;
+
+  /* Success; now force nunroll to be power of 2.  */
+  unsigned i;
+  for (i = 1; 2 * i <= nunroll; i *= 2)
+    continue;
+
+  loop->estimated_uf = i;
+  return true;
+}
+
+/* Try to determine estimated unroll factor for given LOOP with uncountable
+   number of iterations, mainly referring to decide_unroll_stupid.
+    - NITER_DESC holds number of iteration description if it isn't NULL.
+    - NUNROLL_IN holds an unroll factor value computed from instruction counts.
+    - ITER holds estimated or likely max loop iterations.
+   Return true if it succeeds, also update estimated_uf.  */
+
+static bool
+decide_uf_stupid (class loop *loop, const tree_niter_desc *niter_desc,
+		  unsigned nunroll_in, const widest_int *iter)
+{
+  if (!flag_unroll_all_loops && !loop->unroll)
+    return false;
+
+  unsigned nunroll = nunroll_in;
+  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
+    nunroll = loop->unroll;
+
+  /* Skip big loops.  */
+  if (nunroll <= 1)
+    return false;
+
+  gcc_assert (!niter_desc || !niter_desc->assumptions);
+
+  /* Skip loop with multiple branches for now.  */
+  if (num_loop_branches (loop) > 1)
+    return false;
+
+  /* Check whether the loop rolls.  */
+  if (wi::ltu_p (*iter, 2 * nunroll))
+    return false;
+
+  /* Success; now force nunroll to be power of 2.  */
+  unsigned i;
+  for (i = 1; 2 * i <= nunroll; i *= 2)
+    continue;
+
+  loop->estimated_uf = i;
+  return true;
+}
+
+/* Try to estimate whether this given LOOP can be unrolled or not, and compute
+   its estimated unroll factor if it can.  To avoid duplicated computation, you
+   can pass number of iterations information by DESC.  The heuristics mainly
+   refer to decide_unrolling in loop-unroll.c.  */
+
+void
+estimate_unroll_factor (class loop *loop, tree_niter_desc *desc)
+{
+  /* Return the existing estimated unroll factor.  */
+  if (loop->estimated_uf)
+    return;
+
+  /* Don't unroll explicitly.  */
+  if (loop->unroll == 1)
+    {
+      loop->estimated_uf = loop->unroll;
+      return;
+    }
+
+  /* Like decide_unrolling, don't unroll if:
+     1) the loop is cold.
+     2) the loop can't be manipulated.
+     3) the loop isn't innermost.  */
+  if (optimize_loop_for_size_p (loop)
+      || !can_duplicate_loop_p (loop)
+      || loop->inner != NULL)
+    {
+      loop->estimated_uf = 1;
+      return;
+    }
+
+  /* Don't unroll without explicit information.  */
+  if (!loop->unroll && !flag_unroll_loops && !flag_unroll_all_loops)
+    {
+      loop->estimated_uf = 1;
+      return;
+    }
+
+  /* Check for instruction number and average instruction number.  */
+  loop->ninsns = tree_num_loop_insns (loop, &eni_size_weights);
+  loop->av_ninsns = tree_average_num_loop_insns (loop, &eni_size_weights);
+  unsigned nunroll = param_max_unrolled_insns / loop->ninsns;
+  unsigned nunroll_by_av = param_max_average_unrolled_insns / loop->av_ninsns;
+
+  if (nunroll > nunroll_by_av)
+    nunroll = nunroll_by_av;
+  if (nunroll > (unsigned) param_max_unroll_times)
+    nunroll = param_max_unroll_times;
+
+  if (targetm.loop_unroll_adjust_tree)
+    nunroll = targetm.loop_unroll_adjust_tree (nunroll, loop);
+
+  tree_niter_desc *niter_desc = NULL;
+  bool desc_need_delete = false;
+
+  /* Compute the number of iterations if needed.  */
+  if (!desc)
+    {
+      /* For now, use single_dom_exit for simplicity. TODO: Support multiple
+	 exits like find_simple_exit if we find some profitable cases.  */
+      niter_desc = XNEW (class tree_niter_desc);
+      gcc_assert (niter_desc);
+      edge exit = single_dom_exit (loop);
+      if (!exit || !number_of_iterations_exit (loop, exit, niter_desc, true))
+	{
+	  XDELETE (niter_desc);
+	  niter_desc = NULL;
+	}
+      else
+	desc_need_delete = true;
+    }
+  else
+    niter_desc = desc;
+
+  /* For checking the loop rolls enough to consider, also consult loop bounds
+     and profile.  */
+  widest_int iterations;
+  if (!get_estimated_loop_iterations (loop, &iterations)
+      && !get_likely_max_loop_iterations (loop, &iterations))
+    iterations = 0;
+
+  if (niter_desc && niter_desc->assumptions)
+    {
+      /* For countable loops.  */
+      if (!decide_uf_const_iter (loop, niter_desc, nunroll, &iterations)
+	  && !decide_uf_runtime_iter (loop, niter_desc, nunroll, &iterations))
+	loop->estimated_uf = 1;
+    }
+  else
+    {
+      if (!decide_uf_stupid (loop, niter_desc, nunroll, &iterations))
+	loop->estimated_uf = 1;
+    }
+
+  if (desc_need_delete)
+    {
+      XDELETE (niter_desc);
+      niter_desc = NULL;
+    }
+}
+
diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
index 8263a67..46aec40 100644
--- a/gcc/tree-ssa-loop-manip.h
+++ b/gcc/tree-ssa-loop-manip.h
@@ -55,7 +55,6 @@ extern void tree_transform_and_unroll_loop (class loop *, unsigned,
 extern void tree_unroll_loop (class loop *, unsigned,
 			      edge, class tree_niter_desc *);
 extern tree canonicalize_loop_ivs (class loop *, tree *, bool);
-
-
+extern void estimate_unroll_factor (class loop *, tree_niter_desc *);
 
 #endif /* GCC_TREE_SSA_LOOP_MANIP_H */
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index fc9f083..d07422e 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "diagnostic-core.h"
 #include "stringpool.h"
 #include "attribs.h"
+#include "sreal.h"
 
 
 /* A pass making sure loops are fixed up.  */
@@ -790,5 +791,37 @@ tree_num_loop_insns (class loop *loop, eni_weights *weights)
   return size;
 }
 
+/* Computes an estimated number of insns on average per iteration in LOOP,
+   weighted by WEIGHTS.  Refer to function average_num_loop_insns.  */
 
+unsigned
+tree_average_num_loop_insns (class loop *loop, eni_weights *weights)
+{
+  basic_block *body = get_loop_body (loop);
+  gimple_stmt_iterator gsi;
+  unsigned bb_size, i;
+  sreal nsize = 0;
+
+  for (i = 0; i < loop->num_nodes; i++)
+    {
+      bb_size = 0;
+      for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi))
+	bb_size += estimate_num_insns (gsi_stmt (gsi), weights);
+      nsize += (sreal) bb_size
+	       * body[i]->count.to_sreal_scale (loop->header->count);
+      /* Avoid overflows.   */
+      if (nsize > 1000000)
+	{
+	  free (body);
+	  return 1000000;
+	}
+    }
+  free (body);
+
+  unsigned ret = nsize.to_int ();
+  if (!ret)
+    ret = 1; /* To avoid division by zero.  */
+
+  return ret;
+}
 
diff --git a/gcc/tree-ssa-loop.h b/gcc/tree-ssa-loop.h
index e523de2..7bf6ba7 100644
--- a/gcc/tree-ssa-loop.h
+++ b/gcc/tree-ssa-loop.h
@@ -67,6 +67,8 @@ public:
 extern bool for_each_index (tree *, bool (*) (tree, tree *, void *), void *);
 extern char *get_lsm_tmp_name (tree ref, unsigned n, const char *suffix = NULL);
 extern unsigned tree_num_loop_insns (class loop *, struct eni_weights *);
+extern unsigned tree_average_num_loop_insns (class loop *,
+					     struct eni_weights *);
 
 /* Returns the loop of the statement STMT.  */
 
-- 
2.7.4



* [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p
  2020-01-16  9:41 [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling Kewen.Lin
  2020-01-16  9:43 ` [PATCH 1/4 GCC11] Add middle-end unroll factor estimation Kewen.Lin
@ 2020-01-16 10:02 ` Kewen.Lin
  2020-01-20 10:53   ` Richard Sandiford
  2020-01-16 10:06 ` [PATCH 3/4 GCC11] IVOPTs Consider cost_step on different forms during unrolling Kewen.Lin
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 45+ messages in thread
From: Kewen.Lin @ 2020-01-16 10:02 UTC (permalink / raw)
  To: GCC Patches; +Cc: Segher Boessenkool, Bill Schmidt, bin.cheng, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 310 bytes --]


gcc/ChangeLog

2020-01-16  Kewen Lin  <linkw@gcc.gnu.org>

	* config/rs6000/rs6000.c (TARGET_STRIDE_DFORM_VALID_P): New macro.
	(rs6000_stride_dform_valid_p): New function.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in (TARGET_STRIDE_DFORM_VALID_P): New hook.
	* target.def (stride_dform_valid_p): New hook.


[-- Attachment #2: 0002.patch --]
[-- Type: text/plain, Size: 4308 bytes --]

 gcc/config/rs6000/rs6000.c | 40 ++++++++++++++++++++++++++++++++++++++++
 gcc/doc/tm.texi            |  8 ++++++++
 gcc/doc/tm.texi.in         |  2 ++
 gcc/target.def             | 13 ++++++++++++-
 4 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 0dabaa6..1e41fcf 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1657,6 +1657,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_PREDICT_DOLOOP_P
 #define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p
 
+#undef TARGET_STRIDE_DFORM_VALID_P
+#define TARGET_STRIDE_DFORM_VALID_P rs6000_stride_dform_valid_p
+
 #undef TARGET_HAVE_COUNT_REG_DECR_P
 #define TARGET_HAVE_COUNT_REG_DECR_P true
 
@@ -26272,6 +26275,43 @@ rs6000_predict_doloop_p (struct loop *loop)
   return true;
 }
 
+/* Return true if the memory access with mode MODE, signedness SIGNED_P and
+   store STORE_P with offset from 0 to (NUNROLL-1) * STRIDE are valid with
+   D-form instructions.  */
+
+static bool
+rs6000_stride_dform_valid_p (machine_mode mode, signed HOST_WIDE_INT stride,
+			  bool signed_p, bool store_p, unsigned nunroll)
+{
+  static const HOST_WIDE_INT max_bound = 0x7fff;
+  static const HOST_WIDE_INT min_bound = -0x8000;
+
+  if (!IN_RANGE ((nunroll - 1) * stride, min_bound, max_bound))
+    return false;
+
+  /* Check DQ-form for vector mode or float128 mode.  */
+  if (VECTOR_MODE_P (mode) || FLOAT128_VECTOR_P (mode))
+    {
+      if (mode_supports_dq_form (mode) && !(stride & 0xF))
+	return true;
+      else
+	return false;
+    }
+
+  /* Simply consider non VSX instructions.  */
+  if (mode == QImode || mode == HImode || mode == SFmode || mode == DFmode)
+    return true;
+
+  /* lwz/stw is D-form, but lwa is DS-form.  */
+  if (mode == SImode && (!signed_p || store_p || !(stride & 0x03)))
+    return true;
+
+  if (mode == DImode && !(stride & 0x03))
+    return true;
+
+  return false;
+}
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-rs6000.h"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 86ad278..0b8bc7c 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11669,6 +11669,14 @@ function version at run-time for a given set of function versions.
 body must be generated.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_STRIDE_DFORM_VALID_P (machine_mode @var{mode}, signed HOST_WIDE_INT @var{stride}, bool @var{signed_p}, bool @var{store_p}, unsigned @var{nunroll})
+For a given memory access, check whether it is valid to put 0, @var{stride}
+, 2 * @var{stride}, ... , (@var{nunroll} - 1) to the instruction D-form
+displacement, with mode @var{mode}, signedness @var{signed_p} and store
+@var{store_p}.  Return true if valid.
+The default version of this hook returns false.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_PREDICT_DOLOOP_P (class loop *@var{loop})
 Return true if we can predict it is possible to use a low-overhead loop
 for a particular loop.  The parameter @var{loop} is a pointer to the loop.
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index fd9769e..e90d020 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7953,6 +7953,8 @@ to by @var{ce_info}.
 
 @hook TARGET_GENERATE_VERSION_DISPATCHER_BODY
 
+@hook TARGET_STRIDE_DFORM_VALID_P
+
 @hook TARGET_PREDICT_DOLOOP_P
 
 @hook TARGET_HAVE_COUNT_REG_DECR_P
diff --git a/gcc/target.def b/gcc/target.def
index f61c831..ee19a8d 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4300,7 +4300,18 @@ DEFHOOK
  emits a @code{speculation_barrier} instruction if that is defined.",
 rtx, (machine_mode mode, rtx result, rtx val, rtx failval),
  default_speculation_safe_value)
- 
+
+DEFHOOK
+(stride_dform_valid_p,
+ "For a given memory access, check whether it is valid to put 0, @var{stride}\n\
+, 2 * @var{stride}, ... , (@var{nunroll} - 1) to the instruction D-form\n\
+displacement, with mode @var{mode}, signedness @var{signed_p} and store\n\
+@var{store_p}.  Return true if valid.\n\
+The default version of this hook returns false.",
+ bool, (machine_mode mode, signed HOST_WIDE_INT stride, bool signed_p,
+ bool store_p, unsigned nunroll),
+ NULL)
+
 DEFHOOK
 (predict_doloop_p,
  "Return true if we can predict it is possible to use a low-overhead loop\n\
-- 
2.7.4



* [PATCH 3/4 GCC11] IVOPTs Consider cost_step on different forms during unrolling
  2020-01-16  9:41 [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling Kewen.Lin
  2020-01-16  9:43 ` [PATCH 1/4 GCC11] Add middle-end unroll factor estimation Kewen.Lin
  2020-01-16 10:02 ` [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p Kewen.Lin
@ 2020-01-16 10:06 ` Kewen.Lin
  2020-02-25  9:48   ` [PATCH 3/4 V2 " Kewen.Lin
  2020-01-16 10:12 ` [PATCH 4/4 GCC11] rs6000: P9 D-form test cases Kewen.Lin
  2020-01-20 13:03 ` [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling Segher Boessenkool
  4 siblings, 1 reply; 45+ messages in thread
From: Kewen.Lin @ 2020-01-16 10:06 UTC (permalink / raw)
  To: GCC Patches; +Cc: Segher Boessenkool, Bill Schmidt, bin.cheng, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 544 bytes --]

gcc/ChangeLog

2020-01-16  Kewen Lin  <linkw@gcc.gnu.org>

	* tree-ssa-loop-ivopts.c (struct iv_group): New field dform_p.
	(struct iv_cand): New field dform_p.
	(struct ivopts_data): New field mark_dform_p.
	(record_group): Initialize dform_p.
	(mark_dform_groups): New function.
	(find_interesting_uses): Call mark_dform_groups.
	(add_candidate_1): Update dform_p if derived from dform_p group.
	(determine_iv_cost): Increase cost by considering unroll factor.
	(tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update mark_dform_p.


[-- Attachment #2: 0003.patch --]
[-- Type: text/plain, Size: 5667 bytes --]

 gcc/tree-ssa-loop-ivopts.c | 84 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 83 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index ab52cbe..a0d29bb 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -429,6 +429,8 @@ struct iv_group
   struct iv_cand *selected;
   /* To indicate this is a doloop use group.  */
   bool doloop_p;
+  /* To indicate this group is D-form preferred.  */
+  bool dform_p;
   /* Uses in the group.  */
   vec<struct iv_use *> vuses;
 };
@@ -470,6 +472,7 @@ struct iv_cand
   struct iv *orig_iv;	/* The original iv if this cand is added from biv with
 			   smaller type.  */
   bool doloop_p;	/* Whether this is a doloop candidate.  */
+  bool dform_p;		/* Derived from one D-form preferred group.  */
 };
 
 /* Hashtable entry for common candidate derived from iv uses.  */
@@ -650,6 +653,10 @@ struct ivopts_data
 
   /* Whether the loop has doloop comparison use.  */
   bool doloop_use_p;
+
+  /* Whether the loop is likely to unroll and need to check and mark
+     D-form group for better step cost modeling.  */
+  bool mark_dform_p;
 };
 
 /* An assignment of iv candidates to uses.  */
@@ -1575,6 +1582,7 @@ record_group (struct ivopts_data *data, enum use_type type)
   group->related_cands = BITMAP_ALLOC (NULL);
   group->vuses.create (1);
   group->doloop_p = false;
+  group->dform_p = false;
 
   data->vgroups.safe_push (group);
   return group;
@@ -2724,6 +2732,59 @@ split_address_groups (struct ivopts_data *data)
     }
 }
 
+/* Go through all address type groups, check and mark D-form preferred.  */
+static void
+mark_dform_groups (struct ivopts_data *data)
+{
+  if (!data->mark_dform_p)
+    return;
+
+  class loop *loop = data->current_loop;
+  bool dump_details = (dump_file && (dump_flags & TDF_DETAILS));
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+    {
+      struct iv_group *group = data->vgroups[i];
+      if (address_p (group->type))
+	{
+	  bool found = true;
+	  for (unsigned j = 0; j < group->vuses.length (); j++)
+	    {
+	      struct iv_use *use = group->vuses[j];
+	      gcc_assert (use->mem_type);
+	      /* Ensure the step fit into D-form field.  */
+	      if (TREE_CODE (use->iv->step) != INTEGER_CST
+		  || !tree_fits_shwi_p (use->iv->step))
+		{
+		  found = false;
+		  if (dump_details)
+		    fprintf (dump_file,
+			     " Group use %u.%u doesn't "
+			     "have constant step for D-form.\n",
+			     i, j);
+		  break;
+		}
+	      bool is_store
+		= TREE_CODE (gimple_assign_lhs (use->stmt)) != SSA_NAME;
+	      if (!targetm.stride_dform_valid_p (TYPE_MODE (use->mem_type),
+						 tree_to_shwi (use->iv->step),
+						 TYPE_UNSIGNED (use->mem_type),
+						 is_store, loop->estimated_uf))
+		{
+		  found = false;
+		  if (dump_details)
+		    fprintf (dump_file,
+			     " Group use %u.%u isn't "
+			     "suitable for D-form.\n",
+			     i, j);
+		  break;
+		}
+	    }
+	  if (found)
+	    group->dform_p = true;
+	}
+    }
+}
+
 /* Finds uses of the induction variables that are interesting.  */
 
 static void
@@ -2755,6 +2816,8 @@ find_interesting_uses (struct ivopts_data *data)
 
   split_address_groups (data);
 
+  mark_dform_groups (data);
+
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       fprintf (dump_file, "\n<IV Groups>:\n");
@@ -3137,6 +3200,7 @@ add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
       cand->important = important;
       cand->incremented_at = incremented_at;
       cand->doloop_p = doloop;
+      cand->dform_p = false;
       data->vcands.safe_push (cand);
 
       if (!poly_int_tree_p (step))
@@ -3173,7 +3237,11 @@ add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
 
   /* Relate candidate to the group for which it is added.  */
   if (use)
-    bitmap_set_bit (data->vgroups[use->group_id]->related_cands, i);
+    {
+      bitmap_set_bit (data->vgroups[use->group_id]->related_cands, i);
+      if (data->vgroups[use->group_id]->dform_p)
+	cand->dform_p = true;
+    }
 
   return cand;
 }
@@ -5867,6 +5935,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
 
+  /* Consider the stride update cost during unrolling.  */
+  if (!cand->dform_p)
+    cost += (data->current_loop->estimated_uf - 1) * cost_step;
+
   /* Prefer the original ivs unless we may gain something by replacing it.
      The reason is to make debugging simpler; so this is not relevant for
      artificial ivs created by other optimization passes.  */
@@ -7953,6 +8025,7 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop,
   data->current_loop = loop;
   data->loop_loc = find_loop_location (loop).get_location_t ();
   data->speed = optimize_loop_for_speed_p (loop);
+  data->mark_dform_p = false;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
@@ -7984,6 +8057,15 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop,
   if (!find_induction_variables (data))
     goto finish;
 
+  if (targetm.stride_dform_valid_p && exit)
+    {
+      tree_niter_desc *desc = niter_for_exit (data, exit);
+      estimate_unroll_factor (loop, desc);
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "estimated_uf:%u\n", loop->estimated_uf);
+      data->mark_dform_p = loop->estimated_uf > 1;
+    }
+
   /* Finds interesting uses (item 1).  */
   find_interesting_uses (data);
   if (data->vgroups.length () > MAX_CONSIDERED_GROUPS)
-- 
2.7.4



* [PATCH 4/4 GCC11] rs6000: P9 D-form test cases
  2020-01-16  9:41 [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling Kewen.Lin
                   ` (2 preceding siblings ...)
  2020-01-16 10:06 ` [PATCH 3/4 GCC11] IVOPTs Consider cost_step on different forms during unrolling Kewen.Lin
@ 2020-01-16 10:12 ` Kewen.Lin
  2020-01-20 13:37   ` Segher Boessenkool
  2020-01-20 13:03 ` [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling Segher Boessenkool
  4 siblings, 1 reply; 45+ messages in thread
From: Kewen.Lin @ 2020-01-16 10:12 UTC (permalink / raw)
  To: GCC Patches; +Cc: Segher Boessenkool, Bill Schmidt, bin.cheng, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 400 bytes --]


gcc/testsuite/ChangeLog

2020-01-16  Kelvin Nilsen  <kelvin@gcc.gnu.org>
            Kewen Lin  <linkw@gcc.gnu.org>

	* gcc.target/powerpc/p9-dform-0.c: New test.
	* gcc.target/powerpc/p9-dform-1.c: New test.
	* gcc.target/powerpc/p9-dform-2.c: New test.
	* gcc.target/powerpc/p9-dform-3.c: New test.
	* gcc.target/powerpc/p9-dform-4.c: New test.
	* gcc.target/powerpc/p9-dform-generic.h: New test.

[-- Attachment #2: 0004.patch --]
[-- Type: text/plain, Size: 7018 bytes --]

 gcc/testsuite/gcc.target/powerpc/p9-dform-0.c      | 43 +++++++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-1.c      | 55 ++++++++++++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-2.c      | 12 +++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-3.c      | 15 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-4.c      | 12 +++++
 .../gcc.target/powerpc/p9-dform-generic.h          | 34 +++++++++++++
 6 files changed, 171 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-0.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-3.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-4.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h

diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-0.c b/gcc/testsuite/gcc.target/powerpc/p9-dform-0.c
new file mode 100644
index 0000000..01f8b69
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-0.c
@@ -0,0 +1,43 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+/* This test confirms that the dform instructions are selected in the
+   translation of this main program.  */
+
+extern void first_dummy ();
+extern void dummy (double sacc, int n);
+extern void other_dummy ();
+
+extern float opt_value;
+extern char *opt_desc;
+
+#define M 128
+#define N 512
+
+double x [N];
+double y [N];
+
+int main (int argc, char *argv []) {
+  double sacc;
+
+  first_dummy ();
+  for (int j = 0; j < M; j++) {
+
+    sacc = 0.00;
+    for (unsigned long long int i = 0; i < N; i++) {
+      sacc += x[i] * y[i];
+    }
+    dummy (sacc, N);
+  }
+  opt_value = ((float) N) * 2 * ((float) M);
+  opt_desc = "flops";
+  other_dummy ();
+}
+
+/* At the time the dform optimization pass was merged with trunk, 12
+   lxv instructions were emitted in place of the same number of lxvx
+   instructions.  No need to require exactly this number, as it may
+   change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mlxv\M} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-1.c b/gcc/testsuite/gcc.target/powerpc/p9-dform-1.c
new file mode 100644
index 0000000..c6f1d76
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-1.c
@@ -0,0 +1,55 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+/* This test confirms that the dform instructions are selected in the
+   translation of this main program.  */
+
+extern void first_dummy ();
+extern void dummy (double sacc, int n);
+extern void other_dummy ();
+
+extern float opt_value;
+extern char *opt_desc;
+
+#define M 128
+#define N 512
+
+double x [N];
+double y [N];
+double z [N];
+
+int main (int argc, char *argv []) {
+  double sacc;
+
+  first_dummy ();
+  for (int j = 0; j < M; j++) {
+
+    sacc = 0.00;
+    for (unsigned long long int i = 0; i < N; i++) {
+      z[i] = x[i] * y[i];
+      sacc += z[i];
+    }
+    dummy (sacc, N);
+  }
+  opt_value = ((float) N) * 2 * ((float) M);
+  opt_desc = "flops";
+  other_dummy ();
+}
+
+
+
+/* At the time the dform optimization pass was merged with trunk, 12
+   lxv instructions were emitted in place of the same number of lxvx
+   instructions.  No need to require exactly this number, as it may
+   change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+
+/* At the time the dform optimization pass was merged with trunk, 6
+   stxv instructions were emitted in place of the same number of stxvx
+   instructions.  No need to require exactly this number, as it may
+   change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mstxv\M} } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-2.c b/gcc/testsuite/gcc.target/powerpc/p9-dform-2.c
new file mode 100644
index 0000000..8752f3d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-2.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+#define TYPE int
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-3.c b/gcc/testsuite/gcc.target/powerpc/p9-dform-3.c
new file mode 100644
index 0000000..df299a6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-3.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+#define TYPE double
+#include "p9-dform-generic.h"
+
+/* At the time the dform optimization pass was merged with trunk, 6
+   lxv instructions were emitted in place of the same number of lxvx
+   instructions and 8 stxv instructions replace the same number of
+   stxvx instructions.  No need to require exactly this number, as it
+   may change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-4.c b/gcc/testsuite/gcc.target/powerpc/p9-dform-4.c
new file mode 100644
index 0000000..d712958
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-4.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9 -funroll-loops -mfloat128" } */
+
+#define TYPE __float128
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h b/gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h
new file mode 100644
index 0000000..3caed25
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h
@@ -0,0 +1,34 @@
+
+#define ITERATIONS 1000000
+
+#define SIZE (16384/sizeof(TYPE))
+
+static TYPE x[SIZE] __attribute__ ((aligned (16)));
+static TYPE y[SIZE] __attribute__ ((aligned (16)));
+static TYPE a;
+
+void obfuscate(void *a, ...);
+
+static void __attribute__((noinline)) do_one(void)
+{
+  unsigned long i;
+
+  obfuscate(x, y, &a);
+
+  for (i = 0; i < SIZE; i++)
+    y[i] = a * x[i];
+
+  obfuscate(x, y, &a);
+
+}
+
+int main(void)
+{
+  unsigned long i;
+
+  for (i = 0; i < ITERATIONS; i++)
+    do_one();
+
+  return 0;
+
+}
-- 
2.7.4



* Re: [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p
  2020-01-16 10:02 ` [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p Kewen.Lin
@ 2020-01-20 10:53   ` Richard Sandiford
  2020-01-20 11:47     ` Richard Biener
  2020-01-20 13:20     ` Segher Boessenkool
  0 siblings, 2 replies; 45+ messages in thread
From: Richard Sandiford @ 2020-01-20 10:53 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Segher Boessenkool, Bill Schmidt, bin.cheng,
	Richard Guenther

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> gcc/ChangeLog
>
> 2020-01-16  Kewen Lin  <linkw@gcc.gnu.org>
>
> 	* config/rs6000/rs6000.c (TARGET_STRIDE_DFORM_VALID_P): New macro.
> 	(rs6000_stride_dform_valid_p): New function.
> 	* doc/tm.texi: Regenerate.
> 	* doc/tm.texi.in (TARGET_STRIDE_DFORM_VALID_P): New hook.
> 	* target.def (stride_dform_valid_p): New hook.

It looks like we should be able to derive this information from the normal
legitimate_address_p hook.

Also, "D-form" vs. "X-form" is AFAIK a PowerPC-specific classification.
It would be good to use a more generic term in target-independent code.

Thanks,
Richard



>
>  gcc/config/rs6000/rs6000.c | 40 ++++++++++++++++++++++++++++++++++++++++
>  gcc/doc/tm.texi            |  8 ++++++++
>  gcc/doc/tm.texi.in         |  2 ++
>  gcc/target.def             | 13 ++++++++++++-
>  4 files changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 0dabaa6..1e41fcf 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -1657,6 +1657,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
>  #undef TARGET_PREDICT_DOLOOP_P
>  #define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p
>  
> +#undef TARGET_STRIDE_DFORM_VALID_P
> +#define TARGET_STRIDE_DFORM_VALID_P rs6000_stride_dform_valid_p
> +
>  #undef TARGET_HAVE_COUNT_REG_DECR_P
>  #define TARGET_HAVE_COUNT_REG_DECR_P true
>  
> @@ -26272,6 +26275,43 @@ rs6000_predict_doloop_p (struct loop *loop)
>    return true;
>  }
>  
> +/* Return true if the memory access with mode MODE, signedness SIGNED_P and
> +   store STORE_P with offset from 0 to (NUNROLL-1) * STRIDE are valid with
> +   D-form instructions.  */
> +
> +static bool
> +rs6000_stride_dform_valid_p (machine_mode mode, signed HOST_WIDE_INT stride,
> +			  bool signed_p, bool store_p, unsigned nunroll)
> +{
> +  static const HOST_WIDE_INT max_bound = 0x7fff;
> +  static const HOST_WIDE_INT min_bound = -0x8000;
> +
> +  if (!IN_RANGE ((nunroll - 1) * stride, min_bound, max_bound))
> +    return false;
> +
> +  /* Check DQ-form for vector mode or float128 mode.  */
> +  if (VECTOR_MODE_P (mode) || FLOAT128_VECTOR_P (mode))
> +    {
> +      if (mode_supports_dq_form (mode) && !(stride & 0xF))
> +	return true;
> +      else
> +	return false;
> +    }
> +
> +  /* Simply consider non VSX instructions.  */
> +  if (mode == QImode || mode == HImode || mode == SFmode || mode == DFmode)
> +    return true;
> +
> +  /* lwz/stw is D-form, but lwa is DS-form.  */
> +  if (mode == SImode && (!signed_p || store_p || !(stride & 0x03)))
> +    return true;
> +
> +  if (mode == DImode && !(stride & 0x03))
> +    return true;
> +
> +  return false;
> +}
> +
>  struct gcc_target targetm = TARGET_INITIALIZER;
>  
>  #include "gt-rs6000.h"
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 86ad278..0b8bc7c 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -11669,6 +11669,14 @@ function version at run-time for a given set of function versions.
>  body must be generated.
>  @end deftypefn
>  
> +@deftypefn {Target Hook} bool TARGET_STRIDE_DFORM_VALID_P (machine_mode @var{mode}, signed HOST_WIDE_INT @var{stride}, bool @var{signed_p}, bool @var{store_p}, unsigned @var{nunroll})
> +For a given memory access, check whether it is valid to put 0, @var{stride}
> +, 2 * @var{stride}, ... , (@var{nunroll} - 1) to the instruction D-form
> +displacement, with mode @var{mode}, signedness @var{signed_p} and store
> +@var{store_p}.  Return true if valid.
> +The default version of this hook returns false.
> +@end deftypefn
> +
>  @deftypefn {Target Hook} bool TARGET_PREDICT_DOLOOP_P (class loop *@var{loop})
>  Return true if we can predict it is possible to use a low-overhead loop
>  for a particular loop.  The parameter @var{loop} is a pointer to the loop.
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index fd9769e..e90d020 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -7953,6 +7953,8 @@ to by @var{ce_info}.
>  
>  @hook TARGET_GENERATE_VERSION_DISPATCHER_BODY
>  
> +@hook TARGET_STRIDE_DFORM_VALID_P
> +
>  @hook TARGET_PREDICT_DOLOOP_P
>  
>  @hook TARGET_HAVE_COUNT_REG_DECR_P
> diff --git a/gcc/target.def b/gcc/target.def
> index f61c831..ee19a8d 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -4300,7 +4300,18 @@ DEFHOOK
>   emits a @code{speculation_barrier} instruction if that is defined.",
>  rtx, (machine_mode mode, rtx result, rtx val, rtx failval),
>   default_speculation_safe_value)
> - 
> +
> +DEFHOOK
> +(stride_dform_valid_p,
> + "For a given memory access, check whether it is valid to put 0, @var{stride}\n\
> +, 2 * @var{stride}, ... , (@var{nunroll} - 1) to the instruction D-form\n\
> +displacement, with mode @var{mode}, signedness @var{signed_p} and store\n\
> +@var{store_p}.  Return true if valid.\n\
> +The default version of this hook returns false.",
> + bool, (machine_mode mode, signed HOST_WIDE_INT stride, bool signed_p,
> + bool store_p, unsigned nunroll),
> + NULL)
> +
>  DEFHOOK
>  (predict_doloop_p,
>   "Return true if we can predict it is possible to use a low-overhead loop\n\


* Re: [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p
  2020-01-20 10:53   ` Richard Sandiford
@ 2020-01-20 11:47     ` Richard Biener
  2020-01-20 13:20     ` Segher Boessenkool
  1 sibling, 0 replies; 45+ messages in thread
From: Richard Biener @ 2020-01-20 11:47 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: Kewen.Lin, GCC Patches, Segher Boessenkool, Bill Schmidt, bin.cheng

On Mon, 20 Jan 2020, Richard Sandiford wrote:

> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> > gcc/ChangeLog
> >
> > 2020-01-16  Kewen Lin  <linkw@gcc.gnu.org>
> >
> > 	* config/rs6000/rs6000.c (TARGET_STRIDE_DFORM_VALID_P): New macro.
> > 	(rs6000_stride_dform_valid_p): New function.
> > 	* doc/tm.texi: Regenerate.
> > 	* doc/tm.texi.in (TARGET_STRIDE_DFORM_VALID_P): New hook.
> > 	* target.def (stride_dform_valid_p): New hook.
> 
> It looks like we should be able to derive this information from the normal
> legitimate_address_p hook.
> 
> Also, "D-form" vs. "X-form" is AFAIK a PowerPC-specific classification.
> It would be good to use a more generic term in target-independent code.

Note the whole series is also stage1 material.

Richard.


* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-01-16  9:41 [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling Kewen.Lin
                   ` (3 preceding siblings ...)
  2020-01-16 10:12 ` [PATCH 4/4 GCC11] rs6000: P9 D-form test cases Kewen.Lin
@ 2020-01-20 13:03 ` Segher Boessenkool
  2020-02-10  6:17   ` Kewen.Lin
  4 siblings, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2020-01-20 13:03 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, bin.cheng, Richard Guenther

Hi!

On Thu, Jan 16, 2020 at 05:36:52PM +0800, Kewen.Lin wrote:
> As we discussed in the thread
> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
> I'm working to teach IVOPTs to consider D-form group access during unrolling.
> The difference on D-form and other forms during unrolling is we can put the
> stride into displacement field to avoid additional step increment. eg:

<snip>

> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
> D-form vs. 8 step updates with X-form. Here we only need to check stride
> meet D-form field requirement, since if OFF doesn't meet, we can construct
> baseA' with baseA + OFF.

So why doesn't the existing code do this already?  Why does it make all
the extra induction variables?  Is the existing cost model bad, are our
target costs bad, or something like that?


Segher


* Re: [PATCH 1/4 GCC11] Add middle-end unroll factor estimation
  2020-01-16  9:43 ` [PATCH 1/4 GCC11] Add middle-end unroll factor estimation Kewen.Lin
@ 2020-01-20 13:12   ` Segher Boessenkool
  2020-02-10  6:20     ` [PATCH 1/4 v2 " Kewen.Lin
  0 siblings, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2020-01-20 13:12 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, bin.cheng, Richard Guenther

Hi!

On Thu, Jan 16, 2020 at 05:39:40PM +0800, Kewen.Lin wrote:
> --- a/gcc/cfgloop.h
> +++ b/gcc/cfgloop.h
> @@ -232,6 +232,9 @@ public:
>       Other values means unroll with the given unrolling factor.  */
>    unsigned short unroll;
>  
> +  /* Like unroll field above, but it's estimated in middle-end.  */
> +  unsigned short estimated_uf;

Please use full words?  "estimated_unroll" perhaps?  (Similar for other
new names).

> +/* Implement targetm.loop_unroll_adjust_tree, strictly refers to
> +   targetm.loop_unroll_adjust.  */
> +
> +static unsigned
> +rs6000_loop_unroll_adjust_tree (unsigned nunroll, struct loop *loop)
> +{
> +  /* For now loop_unroll_adjust is simple, just invoke directly.  */
> +  return rs6000_loop_unroll_adjust (nunroll, loop);
> +}

Since the two hooks have the same arguments as well, it should really
just be one hook, and an implementation can check whether
  current_pass->type == RTL_PASS
if it needs to do something special for RTL, etc.?  Or it can use some
more appropriate condition -- the point is you need no extra hook.
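
A minimal sketch of that single-hook idea (purely illustrative, assuming
tree-pass.h is in scope so current_pass and RTL_PASS are visible):

  static unsigned
  rs6000_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
  {
    /* Hypothetical single hook shared by the middle-end estimation and
       the RTL unroller.  */
    if (current_pass && current_pass->type != RTL_PASS)
      {
        /* Tweaks specific to the middle-end estimation, if any.  */
      }

    /* Common adjustments based on LOOP's contents, as in the existing
       implementation.  */
    return nunroll;
  }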

> +  /* Check number of iterations is constant.  */
> +  if ((niter_desc->may_be_zero && !integer_zerop (niter_desc->may_be_zero))
> +      || !tree_fits_uhwi_p (niter_desc->niter))
> +    return false;

Check, and do what?  It's easier to read if you say.

> +  /* Check for an explicit unrolling factor.  */
> +  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
> +    {
> +      /* It should have been peeled instead.  */
> +      if (const_niter == 0 || (unsigned) loop->unroll > const_niter - 1)
> +	loop->estimated_uf = 1;
> +      else
> +	loop->estimated_uf = loop->unroll;
> +      return true;
> +    }

"If loop->unroll is set, use that as loop->estimated_unroll"?


Segher


* Re: [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p
  2020-01-20 10:53   ` Richard Sandiford
  2020-01-20 11:47     ` Richard Biener
@ 2020-01-20 13:20     ` Segher Boessenkool
  2020-02-25  9:46       ` Kewen.Lin
  1 sibling, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2020-01-20 13:20 UTC (permalink / raw)
  To: Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng,
	Richard Guenther, richard.sandiford

Hi!

On Mon, Jan 20, 2020 at 10:42:12AM +0000, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> > gcc/ChangeLog
> >
> > 2020-01-16  Kewen Lin  <linkw@gcc.gnu.org>
> >
> > 	* config/rs6000/rs6000.c (TARGET_STRIDE_DFORM_VALID_P): New macro.
> > 	(rs6000_stride_dform_valid_p): New function.
> > 	* doc/tm.texi: Regenerate.
> > 	* doc/tm.texi.in (TARGET_STRIDE_DFORM_VALID_P): New hook.
> > 	* target.def (stride_dform_valid_p): New hook.
> 
> It looks like we should be able to derive this information from the normal
> legitimate_address_p hook.

Yes, probably.

> Also, "D-form" vs. "X-form" is AFAIK a PowerPC-specific classification.
> It would be good to use a more generic term in target-independent code.

Yeah.  X-form is [reg+reg] addressing; D-form is [reg+imm] addressing.
We can do simple [reg] addressing in either form as well.  Whether D-form
can be used for some access depends on many factors (ISA version, mode of
the datum, alignment, and how big the offset is of course).  But the usual
legitimate_address_p hook should do fine.  The ivopts code already has an
addr_offset_valid_p function, maybe that could be adjusted for this?
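
For illustration, a rough sketch of reusing it (hypothetical helper name,
assuming addr_offset_valid_p keeps its current iv_use/offset interface):

  /* Sketch only: return true if every offset 0, STRIDE, ...,
     (NUNROLL - 1) * STRIDE is a legitimate displacement for USE, going
     through the existing addr_offset_valid_p machinery in
     tree-ssa-loop-ivopts.c.  */
  static bool
  stride_offsets_valid_p (struct iv_use *use, HOST_WIDE_INT stride,
			  unsigned nunroll)
  {
    for (HOST_WIDE_INT i = 1; i < (HOST_WIDE_INT) nunroll; i++)
      if (!addr_offset_valid_p (use, i * stride))
	return false;
    return true;
  }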


Segher


* Re: [PATCH 4/4 GCC11] rs6000: P9 D-form test cases
  2020-01-16 10:12 ` [PATCH 4/4 GCC11] rs6000: P9 D-form test cases Kewen.Lin
@ 2020-01-20 13:37   ` Segher Boessenkool
  2020-02-10  6:25     ` [PATCH 4/4 v2 " Kewen.Lin
  0 siblings, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2020-01-20 13:37 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, bin.cheng, Richard Guenther

Hi!

On Thu, Jan 16, 2020 at 05:42:41PM +0800, Kewen.Lin wrote:
> +/* At the time the dform optimization pass was merged with trunk, 12
> +   lxv instructions were emitted in place of the same number of lxvx
> +   instructions.  No need to require exactly this number, as it may
> +   change when other optimization passes evolve.  */
> +
> +/* { dg-final { scan-assembler {\mlxv\M} } } */

Maybe you can also test that there are no lxvx insns generated?
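
For instance (illustrative directive only), something along the lines of:

  /* { dg-final { scan-assembler-not {\mlxvx\M} } } */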


Segher


* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-01-20 13:03 ` [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling Segher Boessenkool
@ 2020-02-10  6:17   ` Kewen.Lin
  2020-02-10 21:29     ` Segher Boessenkool
  0 siblings, 1 reply; 45+ messages in thread
From: Kewen.Lin @ 2020-02-10  6:17 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches, Bill Schmidt, bin.cheng, Richard Guenther

Hi Segher,

on 2020/1/20 8:33 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Jan 16, 2020 at 05:36:52PM +0800, Kewen.Lin wrote:
>> As we discussed in the thread
>> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
>> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
>> I'm working to teach IVOPTs to consider D-form group access during unrolling.
>> The difference on D-form and other forms during unrolling is we can put the
>> stride into displacement field to avoid additional step increment. eg:
> 
> <snip>
> 
>> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
>> D-form vs. 8 step updates with X-form. Here we only need to check stride
>> meet D-form field requirement, since if OFF doesn't meet, we can construct
>> baseA' with baseA + OFF.
> 
> So why doesn't the existing code do this already?  Why does it make all
> the extra induction variables?  Is the existing cost model bad, are our
> target costs bad, or something like that?
> 

I think the main cause is that IVOPTs runs before RTL unrolling: when it's
determining the IV sets, it can only take the normal step cost into account,
since its input isn't unrolled yet.  After unrolling, the X-form indexing
register has to be updated UF-1 more times, but with D-form we can probably
hide those updates in the displacement field.  The way I proposed here is to
adjust the IV cost with an additional cost_step according to the estimated
unroll factor.  It doesn't introduce new IV cands but can affect the final
optimal set.
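
In rough pseudo-C, the adjustment I have in mind for patch 3/4 amounts to
something like this (not the exact code; dform_p and estimated_unroll are the
new bits from patches 3/4 and 1/4):

  /* Sketch only: make a non-D-form IV candidate pay for the UF-1 extra
     step updates it will need once the loop is unrolled UF times.  */
  if (!cand->dform_p && loop->estimated_unroll > 1)
    cost += (loop->estimated_unroll - 1) * cand->cost_step;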

BR,
Kewen

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 1/4 v2 GCC11] Add middle-end unroll factor estimation
  2020-01-20 13:12   ` Segher Boessenkool
@ 2020-02-10  6:20     ` Kewen.Lin
  2020-02-10 23:34       ` Segher Boessenkool
  2020-02-11  2:15       ` [PATCH 1/4 v2 " Jiufu Guo
  0 siblings, 2 replies; 45+ messages in thread
From: Kewen.Lin @ 2020-02-10  6:20 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches, Bill Schmidt, bin.cheng, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 2228 bytes --]

Hi Segher,

Thanks for your comments!  Updated to v2 as below:

  1) Removed unnecessary hook loop_unroll_adjust_tree.
  2) Updated estimated_uf to estimated_unroll and some comments.

gcc/ChangeLog

2020-02-10  Kewen Lin  <linkw@gcc.gnu.org>

	* cfgloop.h (struct loop): New field estimated_unroll.
	* tree-ssa-loop-manip.c (decide_uf_const_iter): New function.
	(decide_uf_runtime_iter): Likewise.
	(decide_uf_stupid): Likewise.
	(estimate_unroll_factor): Likewise.
	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declare.
	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
	* tree-ssa-loop.h (tree_average_num_loop_insns): New declare.

BR,
Kewen

on 2020/1/20 9:02 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Jan 16, 2020 at 05:39:40PM +0800, Kewen.Lin wrote:
>> --- a/gcc/cfgloop.h
>> +++ b/gcc/cfgloop.h
>> @@ -232,6 +232,9 @@ public:
>>       Other values means unroll with the given unrolling factor.  */
>>    unsigned short unroll;
>>  
>> +  /* Like unroll field above, but it's estimated in middle-end.  */
>> +  unsigned short estimated_uf;
> 
> Please use full words?  "estimated_unroll" perhaps?  (Similar for other
> new names).
> 

Done.

>> +/* Implement targetm.loop_unroll_adjust_tree, strictly refers to
>> +   targetm.loop_unroll_adjust.  */
>> +
>> +static unsigned
>> +rs6000_loop_unroll_adjust_tree (unsigned nunroll, struct loop *loop)
>> +{
>> +  /* For now loop_unroll_adjust is simple, just invoke directly.  */
>> +  return rs6000_loop_unroll_adjust (nunroll, loop);
>> +}
> 
> Since the two hooks have the same arguments as well, it should really
> just be one hook, and an implementation can check whether
>   current_pass->type == RTL_PASS
> if it needs to do something special for RTL, etc.?  Or it can use some
> more appropriate condition -- the point is you need no extra hook.
> 

Good point, removed it.

>> +  /* Check number of iterations is constant.  */
>> +  if ((niter_desc->may_be_zero && !integer_zerop (niter_desc->may_be_zero))
>> +      || !tree_fits_uhwi_p (niter_desc->niter))
>> +    return false;
> 
> Check, and do what?  It's easier to read if you say.
> 

Done.

> 
> "If loop->unroll is set, use that as loop->estimated_unroll"?
> 

Done.


[-- Attachment #2: 0001_v2_remove.patch --]
[-- Type: text/plain, Size: 11910 bytes --]

---
 gcc/cfgloop.h             |   3 +
 gcc/tree-ssa-loop-manip.c | 253 ++++++++++++++++++++++++++++++++++++++++++++++
 gcc/tree-ssa-loop-manip.h |   3 +-
 gcc/tree-ssa-loop.c       |  33 ++++++
 gcc/tree-ssa-loop.h       |   2 +
 5 files changed, 292 insertions(+), 2 deletions(-)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 11378ca..c5bcca7 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -232,6 +232,9 @@ public:
      Other values means unroll with the given unrolling factor.  */
   unsigned short unroll;
 
+  /* Like unroll field above, but it's estimated in middle-end.  */
+  unsigned short estimated_unroll;
+
   /* If this loop was inlined the main clique of the callee which does
      not need remapping when copying the loop body.  */
   unsigned short owned_clique;
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 120b35b..72ac335 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "system.h"
 #include "coretypes.h"
 #include "backend.h"
+#include "target.h"
 #include "tree.h"
 #include "gimple.h"
 #include "cfghooks.h"
@@ -42,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "tree-scalar-evolution.h"
 #include "tree-inline.h"
+#include "wide-int.h"
 
 /* All bitmaps for rewriting into loop-closed SSA go on this obstack,
    so that we can free them all at once.  */
@@ -1592,3 +1594,254 @@ canonicalize_loop_ivs (class loop *loop, tree *nit, bool bump_in_latch)
 
   return var_before;
 }
+
+/* Try to determine the estimated unroll factor for the given LOOP with constant
+   number of iterations, mainly referring to decide_unroll_constant_iterations.
+    - NITER_DESC holds the number of iterations description if it isn't NULL.
+    - NUNROLL holds an unroll factor value computed from instruction counts.
+    - ITER holds estimated or likely max loop iterations.
+   Return true if it succeeds, also updating estimated_unroll.  */
+
+static bool
+decide_uf_const_iter (class loop *loop, const tree_niter_desc *niter_desc,
+		      unsigned nunroll, const widest_int *iter)
+{
+  /* Skip big loops.  */
+  if (nunroll <= 1)
+    return false;
+
+  gcc_assert (niter_desc && niter_desc->assumptions);
+
+  /* Check that the number of iterations is constant; return false if not.  */
+  if ((niter_desc->may_be_zero && !integer_zerop (niter_desc->may_be_zero))
+      || !tree_fits_uhwi_p (niter_desc->niter))
+    return false;
+
+  unsigned HOST_WIDE_INT const_niter = tree_to_uhwi (niter_desc->niter);
+
+  /* If unroll factor is set explicitly, use it as estimated_unroll.  */
+  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
+    {
+      /* It should have been peeled instead.  */
+      if (const_niter == 0 || (unsigned) loop->unroll > const_niter - 1)
+	loop->estimated_unroll = 1;
+      else
+	loop->estimated_unroll = loop->unroll;
+      return true;
+    }
+
+  /* Check whether the loop rolls enough to consider.  */
+  if (const_niter < 2 * nunroll || wi::ltu_p (*iter, 2 * nunroll))
+    return false;
+
+  /* Success; now compute number of iterations to unroll.  */
+  unsigned best_unroll = 0, n_copies = 0;
+  unsigned best_copies = 2 * nunroll + 10;
+  unsigned i = 2 * nunroll + 2;
+
+  if (i > const_niter - 2)
+    i = const_niter - 2;
+
+  for (; i >= nunroll - 1; i--)
+    {
+      unsigned exit_mod = const_niter % (i + 1);
+
+      if (!empty_block_p (loop->latch))
+	n_copies = exit_mod + i + 1;
+      else if (exit_mod != i)
+	n_copies = exit_mod + i + 2;
+      else
+	n_copies = i + 1;
+
+      if (n_copies < best_copies)
+	{
+	  best_copies = n_copies;
+	  best_unroll = i;
+	}
+    }
+
+  loop->estimated_unroll = best_unroll + 1;
+  return true;
+}
+
+/* Try to determine the estimated unroll factor for the given LOOP with a
+   countable but non-constant number of iterations, mainly referring to
+   decide_unroll_runtime_iterations.
+    - NITER_DESC holds the number of iterations description if it isn't NULL.
+    - NUNROLL_IN holds an unroll factor value computed from instruction counts.
+    - ITER holds estimated or likely max loop iterations.
+   Return true if it succeeds, also updating estimated_unroll.  */
+
+static bool
+decide_uf_runtime_iter (class loop *loop, const tree_niter_desc *niter_desc,
+			unsigned nunroll_in, const widest_int *iter)
+{
+  unsigned nunroll = nunroll_in;
+  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
+    nunroll = loop->unroll;
+
+  /* Skip big loops.  */
+  if (nunroll <= 1)
+    return false;
+
+  gcc_assert (niter_desc && niter_desc->assumptions);
+
+  /* Skip constant number of iterations.  */
+  if ((!niter_desc->may_be_zero || integer_zerop (niter_desc->may_be_zero))
+      && tree_fits_uhwi_p (niter_desc->niter))
+    return false;
+
+  /* Check whether the loop rolls.  */
+  if (wi::ltu_p (*iter, 2 * nunroll))
+    return false;
+
+  /* Success; now force nunroll to be power of 2.  */
+  unsigned i;
+  for (i = 1; 2 * i <= nunroll; i *= 2)
+    continue;
+
+  loop->estimated_unroll = i;
+  return true;
+}
+
+/* Try to determine the estimated unroll factor for the given LOOP with an
+   uncountable number of iterations, mainly referring to decide_unroll_stupid.
+    - NITER_DESC holds the number of iterations description if it isn't NULL.
+    - NUNROLL_IN holds an unroll factor value computed from instruction counts.
+    - ITER holds estimated or likely max loop iterations.
+   Return true if it succeeds, also updating estimated_unroll.  */
+
+static bool
+decide_uf_stupid (class loop *loop, const tree_niter_desc *niter_desc,
+		  unsigned nunroll_in, const widest_int *iter)
+{
+  if (!flag_unroll_all_loops && !loop->unroll)
+    return false;
+
+  unsigned nunroll = nunroll_in;
+  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
+    nunroll = loop->unroll;
+
+  /* Skip big loops.  */
+  if (nunroll <= 1)
+    return false;
+
+  gcc_assert (!niter_desc || !niter_desc->assumptions);
+
+  /* Skip loop with multiple branches for now.  */
+  if (num_loop_branches (loop) > 1)
+    return false;
+
+  /* Check whether the loop rolls.  */
+  if (wi::ltu_p (*iter, 2 * nunroll))
+    return false;
+
+  /* Success; now force nunroll to be power of 2.  */
+  unsigned i;
+  for (i = 1; 2 * i <= nunroll; i *= 2)
+    continue;
+
+  loop->estimated_unroll = i;
+  return true;
+}
+
+/* Try to estimate whether this given LOOP can be unrolled or not, and compute
+   its estimated unroll factor if it can.  To avoid duplicated computation, you
+   can pass number of iterations information by DESC.  The heuristics mainly
+   refer to decide_unrolling in loop-unroll.c.  */
+
+void
+estimate_unroll_factor (class loop *loop, tree_niter_desc *desc)
+{
+  /* Return the existing estimated unroll factor.  */
+  if (loop->estimated_unroll)
+    return;
+
+  /* Don't unroll explicitly.  */
+  if (loop->unroll == 1)
+    {
+      loop->estimated_unroll = loop->unroll;
+      return;
+    }
+
+  /* Like decide_unrolling, don't unroll if:
+     1) the loop is cold.
+     2) the loop can't be manipulated.
+     3) the loop isn't innermost.  */
+  if (optimize_loop_for_size_p (loop) || !can_duplicate_loop_p (loop)
+      || loop->inner != NULL)
+    {
+      loop->estimated_unroll = 1;
+      return;
+    }
+
+  /* Don't unroll without explicit information.  */
+  if (!loop->unroll && !flag_unroll_loops && !flag_unroll_all_loops)
+    {
+      loop->estimated_unroll = 1;
+      return;
+    }
+
+  /* Check for instruction number and average instruction number.  */
+  loop->ninsns = tree_num_loop_insns (loop, &eni_size_weights);
+  loop->av_ninsns = tree_average_num_loop_insns (loop, &eni_size_weights);
+  unsigned nunroll = param_max_unrolled_insns / loop->ninsns;
+  unsigned nunroll_by_av = param_max_average_unrolled_insns / loop->av_ninsns;
+
+  if (nunroll > nunroll_by_av)
+    nunroll = nunroll_by_av;
+  if (nunroll > (unsigned) param_max_unroll_times)
+    nunroll = param_max_unroll_times;
+
+  if (targetm.loop_unroll_adjust)
+    nunroll = targetm.loop_unroll_adjust (nunroll, loop);
+
+  tree_niter_desc *niter_desc = NULL;
+  bool desc_need_delete = false;
+
+  /* Compute the number of iterations if needed.  */
+  if (!desc)
+    {
+      /* For now, use single_dom_exit for simplicity.  TODO: Support multiple
+	 exits like find_simple_exit if we find some profitable cases.  */
+      niter_desc = XNEW (class tree_niter_desc);
+      gcc_assert (niter_desc);
+      edge exit = single_dom_exit (loop);
+      if (!exit || !number_of_iterations_exit (loop, exit, niter_desc, true))
+	{
+	  XDELETE (niter_desc);
+	  niter_desc = NULL;
+	}
+      else
+	desc_need_delete = true;
+    }
+  else
+    niter_desc = desc;
+
+  /* For checking the loop rolls enough to consider, also consult loop bounds
+     and profile.  */
+  widest_int iterations;
+  if (!get_estimated_loop_iterations (loop, &iterations)
+      && !get_likely_max_loop_iterations (loop, &iterations))
+    iterations = 0;
+
+  if (niter_desc && niter_desc->assumptions)
+    {
+      /* For countable loops.  */
+      if (!decide_uf_const_iter (loop, niter_desc, nunroll, &iterations)
+	  && !decide_uf_runtime_iter (loop, niter_desc, nunroll, &iterations))
+	loop->estimated_unroll = 1;
+    }
+  else
+    {
+      if (!decide_uf_stupid (loop, niter_desc, nunroll, &iterations))
+	loop->estimated_unroll = 1;
+    }
+
+  if (desc_need_delete)
+    {
+      XDELETE (niter_desc);
+      niter_desc = NULL;
+    }
+}
+
diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
index e789e4f..773a2b3 100644
--- a/gcc/tree-ssa-loop-manip.h
+++ b/gcc/tree-ssa-loop-manip.h
@@ -55,7 +55,6 @@ extern void tree_transform_and_unroll_loop (class loop *, unsigned,
 extern void tree_unroll_loop (class loop *, unsigned,
 			      edge, class tree_niter_desc *);
 extern tree canonicalize_loop_ivs (class loop *, tree *, bool);
-
-
+extern void estimate_unroll_factor (class loop *, tree_niter_desc *);
 
 #endif /* GCC_TREE_SSA_LOOP_MANIP_H */
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index 5e8365d..25320fb 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "diagnostic-core.h"
 #include "stringpool.h"
 #include "attribs.h"
+#include "sreal.h"
 
 
 /* A pass making sure loops are fixed up.  */
@@ -790,5 +791,37 @@ tree_num_loop_insns (class loop *loop, eni_weights *weights)
   return size;
 }
 
+/* Computes an estimated number of insns on average per iteration in LOOP,
+   weighted by WEIGHTS.  Refer to function average_num_loop_insns.  */
 
+unsigned
+tree_average_num_loop_insns (class loop *loop, eni_weights *weights)
+{
+  basic_block *body = get_loop_body (loop);
+  gimple_stmt_iterator gsi;
+  unsigned bb_size, i;
+  sreal nsize = 0;
+
+  for (i = 0; i < loop->num_nodes; i++)
+    {
+      bb_size = 0;
+      for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi))
+	bb_size += estimate_num_insns (gsi_stmt (gsi), weights);
+      nsize += (sreal) bb_size
+	       * body[i]->count.to_sreal_scale (loop->header->count);
+      /* Avoid overflows.   */
+      if (nsize > 1000000)
+	{
+	  free (body);
+	  return 1000000;
+	}
+    }
+  free (body);
+
+  unsigned ret = nsize.to_int ();
+  if (!ret)
+    ret = 1; /* To avoid division by zero.  */
+
+  return ret;
+}
 
diff --git a/gcc/tree-ssa-loop.h b/gcc/tree-ssa-loop.h
index 9e35125..af36177 100644
--- a/gcc/tree-ssa-loop.h
+++ b/gcc/tree-ssa-loop.h
@@ -67,6 +67,8 @@ public:
 extern bool for_each_index (tree *, bool (*) (tree, tree *, void *), void *);
 extern char *get_lsm_tmp_name (tree ref, unsigned n, const char *suffix = NULL);
 extern unsigned tree_num_loop_insns (class loop *, struct eni_weights *);
+extern unsigned tree_average_num_loop_insns (class loop *,
+					     struct eni_weights *);
 
 /* Returns the loop of the statement STMT.  */
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 4/4 v2 GCC11] rs6000: P9 D-form test cases
  2020-01-20 13:37   ` Segher Boessenkool
@ 2020-02-10  6:25     ` Kewen.Lin
  2020-02-10 23:51       ` Segher Boessenkool
  0 siblings, 1 reply; 45+ messages in thread
From: Kewen.Lin @ 2020-02-10  6:25 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches, Bill Schmidt, bin.cheng, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 1013 bytes --]

Hi Segher,

Updated as below according to your suggestion.

BR,
Kewen

--------

gcc/testsuite/ChangeLog

2020-02-10  Kelvin Nilsen  <kelvin@gcc.gnu.org>
            Kewen Lin  <linkw@gcc.gnu.org>

	* gcc.target/powerpc/p9-dform-0.c: New test.
	* gcc.target/powerpc/p9-dform-1.c: New test.
	* gcc.target/powerpc/p9-dform-2.c: New test.
	* gcc.target/powerpc/p9-dform-3.c: New test.
	* gcc.target/powerpc/p9-dform-4.c: New test.
	* gcc.target/powerpc/p9-dform-generic.h: New test.

on 2020/1/20 9:19 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Jan 16, 2020 at 05:42:41PM +0800, Kewen.Lin wrote:
>> +/* At the time the dform optimization pass was merged with trunk, 12
>> +   lxv instructions were emitted in place of the same number of lxvx
>> +   instructions.  No need to require exactly this number, as it may
>> +   change when other optimization passes evolve.  */
>> +
>> +/* { dg-final { scan-assembler {\mlxv\M} } } */
> 
> Maybe you can also test that there are no lxvx insns generated?
> 

Done, thanks!

[-- Attachment #2: 0004_v2_not.patch --]
[-- Type: text/plain, Size: 7515 bytes --]

---
 gcc/testsuite/gcc.target/powerpc/p9-dform-0.c      | 44 +++++++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-1.c      | 57 ++++++++++++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-2.c      | 14 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-3.c      | 17 +++++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-4.c      | 14 ++++++
 .../gcc.target/powerpc/p9-dform-generic.h          | 34 +++++++++++++
 6 files changed, 180 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-0.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-3.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-4.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h

diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-0.c b/gcc/testsuite/gcc.target/powerpc/p9-dform-0.c
new file mode 100644
index 0000000..68b0434
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-0.c
@@ -0,0 +1,44 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+/* This test confirms that the dform instructions are selected in the
+   translation of this main program.  */
+
+extern void first_dummy ();
+extern void dummy (double sacc, int n);
+extern void other_dummy ();
+
+extern float opt_value;
+extern char *opt_desc;
+
+#define M 128
+#define N 512
+
+double x [N];
+double y [N];
+
+int main (int argc, char *argv []) {
+  double sacc;
+
+  first_dummy ();
+  for (int j = 0; j < M; j++) {
+
+    sacc = 0.00;
+    for (unsigned long long int i = 0; i < N; i++) {
+      sacc += x[i] * y[i];
+    }
+    dummy (sacc, N);
+  }
+  opt_value = ((float) N) * 2 * ((float) M);
+  opt_desc = "flops";
+  other_dummy ();
+}
+
+/* At the time the dform optimization pass was merged with trunk, 12
+   lxv instructions were emitted in place of the same number of lxvx
+   instructions.  No need to require exactly this number, as it may
+   change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler-not {\mlxvx\M} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-1.c b/gcc/testsuite/gcc.target/powerpc/p9-dform-1.c
new file mode 100644
index 0000000..b80ffbc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-1.c
@@ -0,0 +1,57 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+/* This test confirms that the dform instructions are selected in the
+   translation of this main program.  */
+
+extern void first_dummy ();
+extern void dummy (double sacc, int n);
+extern void other_dummy ();
+
+extern float opt_value;
+extern char *opt_desc;
+
+#define M 128
+#define N 512
+
+double x [N];
+double y [N];
+double z [N];
+
+int main (int argc, char *argv []) {
+  double sacc;
+
+  first_dummy ();
+  for (int j = 0; j < M; j++) {
+
+    sacc = 0.00;
+    for (unsigned long long int i = 0; i < N; i++) {
+      z[i] = x[i] * y[i];
+      sacc += z[i];
+    }
+    dummy (sacc, N);
+  }
+  opt_value = ((float) N) * 2 * ((float) M);
+  opt_desc = "flops";
+  other_dummy ();
+}
+
+
+
+/* At the time the dform optimization pass was merged with trunk, 12
+   lxv instructions were emitted in place of the same number of lxvx
+   instructions.  No need to require exactly this number, as it may
+   change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler-not {\mlxvx\M} } } */
+
+/* At the time the dform optimization pass was merged with trunk, 6
+   stxv instructions were emitted in place of the same number of stxvx
+   instructions.  No need to require exactly this number, as it may
+   change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not {\mstxvx\M} } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-2.c b/gcc/testsuite/gcc.target/powerpc/p9-dform-2.c
new file mode 100644
index 0000000..51789d7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+#define TYPE int
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not {\mstxvx\M} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-3.c b/gcc/testsuite/gcc.target/powerpc/p9-dform-3.c
new file mode 100644
index 0000000..1a4b6c5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-3.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+#define TYPE double
+#include "p9-dform-generic.h"
+
+/* At the time the dform optimization pass was merged with trunk, 6
+   lxv instructions were emitted in place of the same number of lxvx
+   instructions and 8 stxv instructions replaced the same number of
+   stxvx instructions.  No need to require exactly this number, as it
+   may change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not {\mstxvx\M} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-4.c b/gcc/testsuite/gcc.target/powerpc/p9-dform-4.c
new file mode 100644
index 0000000..752f987
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-4.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9 -funroll-loops -mfloat128" } */
+
+#define TYPE __float128
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not {\mlxvx\M} } } */
+/* { dg-final { scan-assembler-not {\mstxvx\M} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h b/gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h
new file mode 100644
index 0000000..3caed25
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h
@@ -0,0 +1,34 @@
+
+#define ITERATIONS 1000000
+
+#define SIZE (16384/sizeof(TYPE))
+
+static TYPE x[SIZE] __attribute__ ((aligned (16)));
+static TYPE y[SIZE] __attribute__ ((aligned (16)));
+static TYPE a;
+
+void obfuscate(void *a, ...);
+
+static void __attribute__((noinline)) do_one(void)
+{
+  unsigned long i;
+
+  obfuscate(x, y, &a);
+
+  for (i = 0; i < SIZE; i++)
+    y[i] = a * x[i];
+
+  obfuscate(x, y, &a);
+
+}
+
+int main(void)
+{
+  unsigned long i;
+
+  for (i = 0; i < ITERATIONS; i++)
+    do_one();
+
+  return 0;
+
+}
-- 
2.7.4


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-10  6:17   ` Kewen.Lin
@ 2020-02-10 21:29     ` Segher Boessenkool
  2020-02-11  2:56       ` Kewen.Lin
  2020-02-11  7:34       ` Richard Biener
  0 siblings, 2 replies; 45+ messages in thread
From: Segher Boessenkool @ 2020-02-10 21:29 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, bin.cheng, Richard Guenther

Hi!

On Mon, Feb 10, 2020 at 02:17:04PM +0800, Kewen.Lin wrote:
> on 2020/1/20 8:33 PM, Segher Boessenkool wrote:
> > On Thu, Jan 16, 2020 at 05:36:52PM +0800, Kewen.Lin wrote:
> >> As we discussed in the thread
> >> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
> >> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
> >> I'm working to teach IVOPTs to consider D-form group access during unrolling.
> >> The difference on D-form and other forms during unrolling is we can put the
> >> stride into displacement field to avoid additional step increment. eg:
> > 
> > <snip>
> > 
> >> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
> >> D-form vs. 8 step updates with X-form. Here we only need to check stride
> >> meet D-form field requirement, since if OFF doesn't meet, we can construct
> >> baseA' with baseA + OFF.
> > 
> > So why doesn't the existing code do this already?  Why does it make all
> > the extra induction variables?  Is the existing cost model bad, are our
> > target costs bad, or something like that?
> > 
> 
> I think the main cause is that IVOPTs runs before RTL unrolling: when it's
> determining the IV sets, it can only take the normal step cost into account,
> since its input isn't unrolled yet.  After unrolling, the X-form indexing
> register has to be updated UF-1 more times, but with D-form we can probably
> hide those updates in the displacement field.  The way I proposed here is to
> adjust the IV cost with an additional cost_step according to the estimated
> unroll factor.  It doesn't introduce new IV cands but can affect the final
> optimal set.

Yes, we should decide how often we want to unroll things somewhere before
ivopts already, and just use that info here.

Or are there advantages to doing it *in* ivopts?  It sounds like doing
it there is probably expensive, but maybe not, and we need to do similar
analysis there anyway.


Segher

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/4 v2 GCC11] Add middle-end unroll factor estimation
  2020-02-10  6:20     ` [PATCH 1/4 v2 " Kewen.Lin
@ 2020-02-10 23:34       ` Segher Boessenkool
  2020-02-11  6:51         ` [PATCH 1/4 v3 " Kewen.Lin
  2020-02-11  2:15       ` [PATCH 1/4 v2 " Jiufu Guo
  1 sibling, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2020-02-10 23:34 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, bin.cheng, Richard Guenther

Hi!

On Mon, Feb 10, 2020 at 02:20:17PM +0800, Kewen.Lin wrote:
> 	* tree-ssa-loop-manip.c (decide_uf_const_iter): New function.
> 	(decide_uf_runtime_iter): Likewise.
> 	(decide_uf_stupid): Likewise.

These names still use "uf".  (Those are the last I see).

> 	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declare.

"New declaration."


Segher

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4 v2 GCC11] rs6000: P9 D-form test cases
  2020-02-10  6:25     ` [PATCH 4/4 v2 " Kewen.Lin
@ 2020-02-10 23:51       ` Segher Boessenkool
  0 siblings, 0 replies; 45+ messages in thread
From: Segher Boessenkool @ 2020-02-10 23:51 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, bin.cheng, Richard Guenther

On Mon, Feb 10, 2020 at 02:24:15PM +0800, Kewen.Lin wrote:
> 2020-02-10  Kelvin Nilsen  <kelvin@gcc.gnu.org>
>             Kewen Lin  <linkw@gcc.gnu.org>
> 
> 	* gcc.target/powerpc/p9-dform-0.c: New test.
> 	* gcc.target/powerpc/p9-dform-1.c: New test.
> 	* gcc.target/powerpc/p9-dform-2.c: New test.
> 	* gcc.target/powerpc/p9-dform-3.c: New test.
> 	* gcc.target/powerpc/p9-dform-4.c: New test.
> 	* gcc.target/powerpc/p9-dform-generic.h: New test.

This is fine for trunk (after the other patches are in so this no longer
fails, of course ;-) ).


Segher

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/4 v2 GCC11] Add middle-end unroll factor estimation
  2020-02-10  6:20     ` [PATCH 1/4 v2 " Kewen.Lin
  2020-02-10 23:34       ` Segher Boessenkool
@ 2020-02-11  2:15       ` Jiufu Guo
  2020-02-11  3:04         ` Kewen.Lin
  1 sibling, 1 reply; 45+ messages in thread
From: Jiufu Guo @ 2020-02-11  2:15 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Segher Boessenkool, GCC Patches, Bill Schmidt, bin.cheng,
	Richard Guenther

"Kewen.Lin" <linkw@linux.ibm.com> writes:

> Hi Segher,
>
> Thanks for your comments!  Updated to v2 as below:
>
>   1) Removed unnecessary hook loop_unroll_adjust_tree.
>   2) Updated estimated_uf to estimated_unroll and some comments.
>
> gcc/ChangeLog
>
> 2020-02-10  Kewen Lin  <linkw@gcc.gnu.org>
>
> 	* cfgloop.h (struct loop): New field estimated_unroll.
> 	* tree-ssa-loop-manip.c (decide_uf_const_iter): New function.
> 	(decide_uf_runtime_iter): Likewise.
> 	(decide_uf_stupid): Likewise.
> 	(estimate_unroll_factor): Likewise.
In the RTL unroller, target hooks are also involved when deciding the unroll
factor, so the factor these decide_uf_xx functions compute may not match the
final unroll factor used in RTL.  For example, at -O2 by default, small loops
will be unrolled 2 times.
> 	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declare.
> 	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
> 	* tree-ssa-loop.h (tree_average_num_loop_insns): New declare.
>
> BR,
> Kewen
>
> on 2020/1/20 9:02 PM, Segher Boessenkool wrote:
>> Hi!
>> 
>> On Thu, Jan 16, 2020 at 05:39:40PM +0800, Kewen.Lin wrote:
>>> --- a/gcc/cfgloop.h
>>> +++ b/gcc/cfgloop.h
>>> @@ -232,6 +232,9 @@ public:
>>>       Other values means unroll with the given unrolling factor.  */
>>>    unsigned short unroll;
>>>  
>>> +  /* Like unroll field above, but it's estimated in middle-end.  */
>>> +  unsigned short estimated_uf;
>> 
>> Please use full words?  "estimated_unroll" perhaps?  (Similar for other
>> new names).
>> 
>
> Done.
>
>>> +/* Implement targetm.loop_unroll_adjust_tree, strictly refers to
>>> +   targetm.loop_unroll_adjust.  */
>>> +
>>> +static unsigned
>>> +rs6000_loop_unroll_adjust_tree (unsigned nunroll, struct loop *loop)
>>> +{
>>> +  /* For now loop_unroll_adjust is simple, just invoke directly.  */
>>> +  return rs6000_loop_unroll_adjust (nunroll, loop);
>>> +}
>> 
>> Since the two hooks have the same arguments as well, it should really
>> just be one hook, and an implementation can check whether
>>   current_pass->type == RTL_PASS
>> if it needs to do something special for RTL, etc.?  Or it can use some
>> more appropriate condition -- the point is you need no extra hook.
>> 
>
> Good point, removed it.
>
>>> +  /* Check number of iterations is constant.  */
>>> +  if ((niter_desc->may_be_zero && !integer_zerop (niter_desc->may_be_zero))
>>> +      || !tree_fits_uhwi_p (niter_desc->niter))
>>> +    return false;
>> 
>> Check, and do what?  It's easier to read if you say.
>> 
>
> Done.
>
>> 
>> "If loop->unroll is set, use that as loop->estimated_unroll"?
>> 
>
> Done.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-10 21:29     ` Segher Boessenkool
@ 2020-02-11  2:56       ` Kewen.Lin
  2020-02-11  7:34       ` Richard Biener
  1 sibling, 0 replies; 45+ messages in thread
From: Kewen.Lin @ 2020-02-11  2:56 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches, Bill Schmidt, bin.cheng, Richard Guenther

on 2020/2/11 5:29 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Feb 10, 2020 at 02:17:04PM +0800, Kewen.Lin wrote:
>> on 2020/1/20 8:33 PM, Segher Boessenkool wrote:
>>> On Thu, Jan 16, 2020 at 05:36:52PM +0800, Kewen.Lin wrote:
>>>> As we discussed in the thread
>>>> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
>>>> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
>>>> I'm working to teach IVOPTs to consider D-form group access during unrolling.
>>>> The difference on D-form and other forms during unrolling is we can put the
>>>> stride into displacement field to avoid additional step increment. eg:
>>>
>>> <snip>
>>>
>>>> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
>>>> D-form vs. 8 step updates with X-form. Here we only need to check stride
>>>> meet D-form field requirement, since if OFF doesn't meet, we can construct
>>>> baseA' with baseA + OFF.
>>>
>>> So why doesn't the existing code do this already?  Why does it make all
>>> the extra induction variables?  Is the existing cost model bad, are our
>>> target costs bad, or something like that?
>>>
>>
>> I think the main cause is IVOPTs runs before RTL unroll, when it's determining
>> the IV sets, it can only take the normal step cost into account, since its
>> input isn't unrolled yet.  After unrolling, the x-form indexing register has to
>> play with more UF-1 times update, but we can probably hide them in d-form
>> displacement field.  The way I proposed here is to adjust IV cost with
>> additional cost_step according to estimated unroll.  It doesn't introduce new
>> IV cand but can affect the final optimal set.
> 
> Yes, we should decide how often we want to unroll things somewhere before
> ivopts already, and just use that info here.

Agreed!  If some pass earlier than IVOPTs is interested in this unroll factor
estimation, we can move the computation back to before that pass.  As in
patch 1/4, once it's set, later passes can just reuse that info.  As Richard B.
suggested, we can even skip the later RTL unroll factor determination.
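
Roughly, a later consumer would then only need something like (sketch only,
not part of the posted patches):

  /* estimate_unroll_factor returns early when the field is already set,
     so this computes the estimate at most once per loop.  */
  estimate_unroll_factor (loop, NULL);
  if (loop->estimated_unroll > 1)
    {
      /* ... reuse the estimate, e.g. as the RTL unroller's factor ... */
    }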

> 
> Or are there advantage to doing it *in* ivopts?  It sounds like doing
> it there is probably expensive, but maybe not, and we need to do similar
> analysis there anyway.
> 

Good question.  I hadn't considered that; the reason for putting it here is
that we need this information in IVOPTs for some cases.  :)

BR,
Kewen

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/4 v2 GCC11] Add middle-end unroll factor estimation
  2020-02-11  2:15       ` [PATCH 1/4 v2 " Jiufu Guo
@ 2020-02-11  3:04         ` Kewen.Lin
  0 siblings, 0 replies; 45+ messages in thread
From: Kewen.Lin @ 2020-02-11  3:04 UTC (permalink / raw)
  To: Jiufu Guo
  Cc: Segher Boessenkool, GCC Patches, Bill Schmidt, bin.cheng,
	Richard Guenther

Hi Jeff,

on 2020/2/11 10:14 AM, Jiufu Guo wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> 
>> Hi Segher,
>>
>> Thanks for your comments!  Updated to v2 as below:
>>
>>   1) Removed unnecessary hook loop_unroll_adjust_tree.
>>   2) Updated estimated_uf to estimated_unroll and some comments.
>>
>> gcc/ChangeLog
>>
>> 2020-02-10  Kewen Lin  <linkw@gcc.gnu.org>
>>
>> 	* cfgloop.h (struct loop): New field estimated_unroll.
>> 	* tree-ssa-loop-manip.c (decide_uf_const_iter): New function.
>> 	(decide_uf_runtime_iter): Likewise.
>> 	(decide_uf_stupid): Likewise.
>> 	(estimate_unroll_factor): Likewise.
> In RTL unroller, target hooks are also involved when decide unroll
> factor, so these decide_uf_xx functions may not same with final real
> unroll factor in RTL. For example, at O2 by default, small loops will be
> unrolled 2 times. 

I didn't quite follow your comment; the patch already invokes
targetm.loop_unroll_adjust in these decide_uf_xx functions, exactly the same
as what we have in loop-unroll.c (for the RTL unroller).

Or anything I missed?

BR,
Kewen

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 1/4 v3 GCC11] Add middle-end unroll factor estimation
  2020-02-10 23:34       ` Segher Boessenkool
@ 2020-02-11  6:51         ` Kewen.Lin
  2020-02-11  7:00           ` Segher Boessenkool
  0 siblings, 1 reply; 45+ messages in thread
From: Kewen.Lin @ 2020-02-11  6:51 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches, Bill Schmidt, bin.cheng, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 1451 bytes --]

Hi,

v3 changes:
  - Updated _uf to _unroll for some function names.

By the way, should I guard the current i386/s390 loop_unroll_adjust
early return with (current_pass->type != RTL_PASS)?  I'm inclined not
to, since this analysis isn't enabled by default: if those targets
want to adopt it, they will get an ICE immediately, so users will
notice and add the GIMPLE part of the check.  A guard would make the
ICE disappear, but then the hook would silently work poorly and users
would probably miss updating it.
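
I mean a guard like the below (sketch only, using the s390 hook as an
example; the i386 one would look the same):

  static unsigned
  s390_loop_unroll_adjust (unsigned nunroll, class loop *loop)
  {
    /* Sketch: the GIMPLE-level estimation would take this early return,
       leaving the existing RTL-only analysis below untouched.  */
    if (current_pass->type != RTL_PASS)
      return nunroll;

    /* ... existing RTL-specific adjustment ... */
    return nunroll;
  }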

BR,
Kewen

-----

gcc/ChangeLog

2020-02-11  Kewen Lin  <linkw@gcc.gnu.org>

	* cfgloop.h (struct loop): New field estimated_unroll.
	* tree-ssa-loop-manip.c (decide_unroll_const_iter): New function.
	(decide_unroll_runtime_iter): Likewise.
	(decide_unroll_stupid): Likewise.
	(estimate_unroll_factor): Likewise.
	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declaration.
	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
	* tree-ssa-loop.h (tree_average_num_loop_insns): New declaration.

on 2020/2/11 7:34 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Feb 10, 2020 at 02:20:17PM +0800, Kewen.Lin wrote:
>> 	* tree-ssa-loop-manip.c (decide_uf_const_iter): New function.
>> 	(decide_uf_runtime_iter): Likewise.
>> 	(decide_uf_stupid): Likewise.
> 
> These names still use "uf".  (Those are the last I see).
> 

Good catch!

>> 	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declare.
> 
> "New declaration."
>
Done.


[-- Attachment #2: 0001_v3.patch --]
[-- Type: text/plain, Size: 11934 bytes --]

---
 gcc/cfgloop.h             |   3 +
 gcc/tree-ssa-loop-manip.c | 253 ++++++++++++++++++++++++++++++++++++++++++++++
 gcc/tree-ssa-loop-manip.h |   3 +-
 gcc/tree-ssa-loop.c       |  33 ++++++
 gcc/tree-ssa-loop.h       |   2 +
 5 files changed, 292 insertions(+), 2 deletions(-)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 11378ca..c5bcca7 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -232,6 +232,9 @@ public:
      Other values means unroll with the given unrolling factor.  */
   unsigned short unroll;
 
+  /* Like unroll field above, but it's estimated in middle-end.  */
+  unsigned short estimated_unroll;
+
   /* If this loop was inlined the main clique of the callee which does
      not need remapping when copying the loop body.  */
   unsigned short owned_clique;
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 120b35b..8a5a1a9 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "system.h"
 #include "coretypes.h"
 #include "backend.h"
+#include "target.h"
 #include "tree.h"
 #include "gimple.h"
 #include "cfghooks.h"
@@ -42,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "tree-scalar-evolution.h"
 #include "tree-inline.h"
+#include "wide-int.h"
 
 /* All bitmaps for rewriting into loop-closed SSA go on this obstack,
    so that we can free them all at once.  */
@@ -1592,3 +1594,254 @@ canonicalize_loop_ivs (class loop *loop, tree *nit, bool bump_in_latch)
 
   return var_before;
 }
+
+/* Try to determine the estimated unroll factor for the given LOOP with constant
+   number of iterations, mainly referring to decide_unroll_constant_iterations.
+    - NITER_DESC holds the number of iterations description if it isn't NULL.
+    - NUNROLL holds an unroll factor value computed from instruction counts.
+    - ITER holds estimated or likely max loop iterations.
+   Return true if it succeeds, also updating estimated_unroll.  */
+
+static bool
+decide_unroll_const_iter (class loop *loop, const tree_niter_desc *niter_desc,
+		      unsigned nunroll, const widest_int *iter)
+{
+  /* Skip big loops.  */
+  if (nunroll <= 1)
+    return false;
+
+  gcc_assert (niter_desc && niter_desc->assumptions);
+
+  /* Check that the number of iterations is constant; return false if not.  */
+  if ((niter_desc->may_be_zero && !integer_zerop (niter_desc->may_be_zero))
+      || !tree_fits_uhwi_p (niter_desc->niter))
+    return false;
+
+  unsigned HOST_WIDE_INT const_niter = tree_to_uhwi (niter_desc->niter);
+
+  /* If unroll factor is set explicitly, use it as estimated_unroll.  */
+  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
+    {
+      /* It should have been peeled instead.  */
+      if (const_niter == 0 || (unsigned) loop->unroll > const_niter - 1)
+	loop->estimated_unroll = 1;
+      else
+	loop->estimated_unroll = loop->unroll;
+      return true;
+    }
+
+  /* Check whether the loop rolls enough to consider.  */
+  if (const_niter < 2 * nunroll || wi::ltu_p (*iter, 2 * nunroll))
+    return false;
+
+  /* Success; now compute number of iterations to unroll.  */
+  unsigned best_unroll = 0, n_copies = 0;
+  unsigned best_copies = 2 * nunroll + 10;
+  unsigned i = 2 * nunroll + 2;
+
+  if (i > const_niter - 2)
+    i = const_niter - 2;
+
+  for (; i >= nunroll - 1; i--)
+    {
+      unsigned exit_mod = const_niter % (i + 1);
+
+      if (!empty_block_p (loop->latch))
+	n_copies = exit_mod + i + 1;
+      else if (exit_mod != i)
+	n_copies = exit_mod + i + 2;
+      else
+	n_copies = i + 1;
+
+      if (n_copies < best_copies)
+	{
+	  best_copies = n_copies;
+	  best_unroll = i;
+	}
+    }
+
+  loop->estimated_unroll = best_unroll + 1;
+  return true;
+}
+
+/* Try to determine the estimated unroll factor for the given LOOP with a
+   countable but non-constant number of iterations, mainly referring to
+   decide_unroll_runtime_iterations.
+    - NITER_DESC holds the number of iterations description if it isn't NULL.
+    - NUNROLL_IN holds an unroll factor value computed from instruction counts.
+    - ITER holds estimated or likely max loop iterations.
+   Return true if it succeeds, also updating estimated_unroll.  */
+
+static bool
+decide_unroll_runtime_iter (class loop *loop, const tree_niter_desc *niter_desc,
+			unsigned nunroll_in, const widest_int *iter)
+{
+  unsigned nunroll = nunroll_in;
+  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
+    nunroll = loop->unroll;
+
+  /* Skip big loops.  */
+  if (nunroll <= 1)
+    return false;
+
+  gcc_assert (niter_desc && niter_desc->assumptions);
+
+  /* Skip constant number of iterations.  */
+  if ((!niter_desc->may_be_zero || !integer_zerop (niter_desc->may_be_zero))
+      && tree_fits_uhwi_p (niter_desc->niter))
+    return false;
+
+  /* Check whether the loop rolls.  */
+  if (wi::ltu_p (*iter, 2 * nunroll))
+    return false;
+
+  /* Success; now force nunroll to be power of 2.  */
+  unsigned i;
+  for (i = 1; 2 * i <= nunroll; i *= 2)
+    continue;
+
+  loop->estimated_unroll = i;
+  return true;
+}
+
+/* Try to determine the estimated unroll factor for the given LOOP with an
+   uncountable number of iterations, mainly referring to decide_unroll_stupid.
+    - NITER_DESC holds the number of iterations description if it isn't NULL.
+    - NUNROLL_IN holds an unroll factor value computed from instruction counts.
+    - ITER holds estimated or likely max loop iterations.
+   Return true if it succeeds, also updating estimated_unroll.  */
+
+static bool
+decide_unroll_stupid (class loop *loop, const tree_niter_desc *niter_desc,
+		  unsigned nunroll_in, const widest_int *iter)
+{
+  if (!flag_unroll_all_loops && !loop->unroll)
+    return false;
+
+  unsigned nunroll = nunroll_in;
+  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
+    nunroll = loop->unroll;
+
+  /* Skip big loops.  */
+  if (nunroll <= 1)
+    return false;
+
+  gcc_assert (!niter_desc || !niter_desc->assumptions);
+
+  /* Skip loop with multiple branches for now.  */
+  if (num_loop_branches (loop) > 1)
+    return false;
+
+  /* Check whether the loop rolls.  */
+  if (wi::ltu_p (*iter, 2 * nunroll))
+    return false;
+
+  /* Success; now force nunroll to be power of 2.  */
+  unsigned i;
+  for (i = 1; 2 * i <= nunroll; i *= 2)
+    continue;
+
+  loop->estimated_unroll = i;
+  return true;
+}
+
+/* Try to estimate whether this given LOOP can be unrolled or not, and compute
+   its estimated unroll factor if it can.  To avoid duplicated computation, you
+   can pass number of iterations information by DESC.  The heuristics mainly
+   refer to decide_unrolling in loop-unroll.c.  */
+
+void
+estimate_unroll_factor (class loop *loop, tree_niter_desc *desc)
+{
+  /* Return the existing estimated unroll factor.  */
+  if (loop->estimated_unroll)
+    return;
+
+  /* Don't unroll explicitly.  */
+  if (loop->unroll == 1)
+    {
+      loop->estimated_unroll = loop->unroll;
+      return;
+    }
+
+  /* Like decide_unrolling, don't unroll if:
+     1) the loop is cold.
+     2) the loop can't be manipulated.
+     3) the loop isn't innermost.  */
+  if (optimize_loop_for_size_p (loop) || !can_duplicate_loop_p (loop)
+      || loop->inner != NULL)
+    {
+      loop->estimated_unroll = 1;
+      return;
+    }
+
+  /* Don't unroll without explicit information.  */
+  if (!loop->unroll && !flag_unroll_loops && !flag_unroll_all_loops)
+    {
+      loop->estimated_unroll = 1;
+      return;
+    }
+
+  /* Check for instruction number and average instruction number.  */
+  loop->ninsns = tree_num_loop_insns (loop, &eni_size_weights);
+  loop->av_ninsns = tree_average_num_loop_insns (loop, &eni_size_weights);
+  unsigned nunroll = param_max_unrolled_insns / loop->ninsns;
+  unsigned nunroll_by_av = param_max_average_unrolled_insns / loop->av_ninsns;
+
+  if (nunroll > nunroll_by_av)
+    nunroll = nunroll_by_av;
+  if (nunroll > (unsigned) param_max_unroll_times)
+    nunroll = param_max_unroll_times;
+
+  if (targetm.loop_unroll_adjust)
+    nunroll = targetm.loop_unroll_adjust (nunroll, loop);
+
+  tree_niter_desc *niter_desc = NULL;
+  bool desc_need_delete = false;
+
+  /* Compute the number of iterations if needed.  */
+  if (!desc)
+    {
+      /* For now, use single_dom_exit for simplicity.  TODO: Support multiple
+	 exits like find_simple_exit if we find some profitable cases.  */
+      niter_desc = XNEW (class tree_niter_desc);
+      gcc_assert (niter_desc);
+      edge exit = single_dom_exit (loop);
+      if (!exit || !number_of_iterations_exit (loop, exit, niter_desc, true))
+	{
+	  XDELETE (niter_desc);
+	  niter_desc = NULL;
+	}
+      else
+	desc_need_delete = true;
+    }
+  else
+    niter_desc = desc;
+
+  /* For checking the loop rolls enough to consider, also consult loop bounds
+     and profile.  */
+  widest_int iterations;
+  if (!get_estimated_loop_iterations (loop, &iterations)
+      && !get_likely_max_loop_iterations (loop, &iterations))
+    iterations = 0;
+
+  if (niter_desc && niter_desc->assumptions)
+    {
+      /* For countable loops.  */
+      if (!decide_unroll_const_iter (loop, niter_desc, nunroll, &iterations)
+	  && !decide_unroll_runtime_iter (loop, niter_desc, nunroll, &iterations))
+	loop->estimated_unroll = 1;
+    }
+  else
+    {
+      if (!decide_unroll_stupid (loop, niter_desc, nunroll, &iterations))
+	loop->estimated_unroll = 1;
+    }
+
+  if (desc_need_delete)
+    {
+      XDELETE (niter_desc);
+      niter_desc = NULL;
+    }
+}
+
diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
index e789e4f..773a2b3 100644
--- a/gcc/tree-ssa-loop-manip.h
+++ b/gcc/tree-ssa-loop-manip.h
@@ -55,7 +55,6 @@ extern void tree_transform_and_unroll_loop (class loop *, unsigned,
 extern void tree_unroll_loop (class loop *, unsigned,
 			      edge, class tree_niter_desc *);
 extern tree canonicalize_loop_ivs (class loop *, tree *, bool);
-
-
+extern void estimate_unroll_factor (class loop *, tree_niter_desc *);
 
 #endif /* GCC_TREE_SSA_LOOP_MANIP_H */
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index 5e8365d..25320fb 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "diagnostic-core.h"
 #include "stringpool.h"
 #include "attribs.h"
+#include "sreal.h"
 
 
 /* A pass making sure loops are fixed up.  */
@@ -790,5 +791,37 @@ tree_num_loop_insns (class loop *loop, eni_weights *weights)
   return size;
 }
 
+/* Computes an estimated number of insns on average per iteration in LOOP,
+   weighted by WEIGHTS.  Refer to function average_num_loop_insns.  */
 
+unsigned
+tree_average_num_loop_insns (class loop *loop, eni_weights *weights)
+{
+  basic_block *body = get_loop_body (loop);
+  gimple_stmt_iterator gsi;
+  unsigned bb_size, i;
+  sreal nsize = 0;
+
+  for (i = 0; i < loop->num_nodes; i++)
+    {
+      bb_size = 0;
+      for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi))
+	bb_size += estimate_num_insns (gsi_stmt (gsi), weights);
+      nsize += (sreal) bb_size
+	       * body[i]->count.to_sreal_scale (loop->header->count);
+      /* Avoid overflows.   */
+      if (nsize > 1000000)
+	{
+	  free (body);
+	  return 1000000;
+	}
+    }
+  free (body);
+
+  unsigned ret = nsize.to_int ();
+  if (!ret)
+    ret = 1; /* To avoid division by zero.  */
+
+  return ret;
+}
 
diff --git a/gcc/tree-ssa-loop.h b/gcc/tree-ssa-loop.h
index 9e35125..af36177 100644
--- a/gcc/tree-ssa-loop.h
+++ b/gcc/tree-ssa-loop.h
@@ -67,6 +67,8 @@ public:
 extern bool for_each_index (tree *, bool (*) (tree, tree *, void *), void *);
 extern char *get_lsm_tmp_name (tree ref, unsigned n, const char *suffix = NULL);
 extern unsigned tree_num_loop_insns (class loop *, struct eni_weights *);
+extern unsigned tree_average_num_loop_insns (class loop *,
+					     struct eni_weights *);
 
 /* Returns the loop of the statement STMT.  */
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/4 v3 GCC11] Add middle-end unroll factor estimation
  2020-02-11  6:51         ` [PATCH 1/4 v3 " Kewen.Lin
@ 2020-02-11  7:00           ` Segher Boessenkool
  0 siblings, 0 replies; 45+ messages in thread
From: Segher Boessenkool @ 2020-02-11  7:00 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, bin.cheng, Richard Guenther

Hi!

On Tue, Feb 11, 2020 at 02:50:03PM +0800, Kewen.Lin wrote:
> v3 changes:
>   - Updated _uf to _unroll for some function names.

Thanks.

> By the way, should I guard the current i386/s390 loop_unroll_adjust
> early return with (current_pass->type != RTL_PASS)?  I'm inclined not
> to, since this analysis isn't enabled by default: if those targets
> want to adopt it, they will get an ICE immediately, so users will
> notice and add the GIMPLE part of the check.  A guard would make the
> ICE disappear, but then the hook would silently work poorly and users
> would probably miss updating it.

Just tell the maintainers of those ports about it?  They are responsive,
and of course they will know what they will want to do :-)


Segher

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-10 21:29     ` Segher Boessenkool
  2020-02-11  2:56       ` Kewen.Lin
@ 2020-02-11  7:34       ` Richard Biener
  2020-02-11  7:49         ` Segher Boessenkool
  1 sibling, 1 reply; 45+ messages in thread
From: Richard Biener @ 2020-02-11  7:34 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

[-- Attachment #1: Type: text/plain, Size: 2644 bytes --]

On Mon, 10 Feb 2020, Segher Boessenkool wrote:

> Hi!
> 
> On Mon, Feb 10, 2020 at 02:17:04PM +0800, Kewen.Lin wrote:
> > on 2020/1/20 8:33 PM, Segher Boessenkool wrote:
> > > On Thu, Jan 16, 2020 at 05:36:52PM +0800, Kewen.Lin wrote:
> > >> As we discussed in the thread
> > >> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
> > >> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
> > >> I'm working to teach IVOPTs to consider D-form group access during unrolling.
> > >> The difference on D-form and other forms during unrolling is we can put the
> > >> stride into displacement field to avoid additional step increment. eg:
> > > 
> > > <snip>
> > > 
> > >> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
> > >> D-form vs. 8 step updates with X-form. Here we only need to check stride
> > >> meet D-form field requirement, since if OFF doesn't meet, we can construct
> > >> baseA' with baseA + OFF.
> > > 
> > > So why doesn't the existing code do this already?  Why does it make all
> > > the extra induction variables?  Is the existing cost model bad, are our
> > > target costs bad, or something like that?
> > > 
> > 
> > I think the main cause is that IVOPTs runs before RTL unrolling: when it's
> > determining the IV sets, it can only take the normal step cost into account,
> > since its input isn't unrolled yet.  After unrolling, the X-form indexing
> > register has to be updated UF-1 more times, but with D-form we can probably
> > hide those updates in the displacement field.  The way I proposed here is to
> > adjust the IV cost with an additional cost_step according to the estimated
> > unroll factor.  It doesn't introduce new IV cands but can affect the final
> > optimal set.
> 
> Yes, we should decide how often we want to unroll things somewhere before
> ivopts already, and just use that info here.
> 
> Or are there advantages to doing it *in* ivopts?  It sounds like doing
> it there is probably expensive, but maybe not, and we need to do similar
> analysis there anyway.

Well, if the only benefit of doing the unrolling is that IVs get
cheaper then yes, IVOPTs should drive it.

But usually unrolling exposes redundancies (caught by predictive
commoning, which drives some unrolling) or it enables better use
of CPU resources via scheduling (only caught later in RTL).
For scheduling we have the additional complication that the RTL
side doesn't have as much of a fancy data dependence analysis
framework as on the GIMPLE side.  So I'd put my bet on trying to
move something like SMS to GIMPLE and combine it with unrolling
(IIRC SMS at most interleaves 1 1/2 loop iterations).

Richard.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-11  7:34       ` Richard Biener
@ 2020-02-11  7:49         ` Segher Boessenkool
  2020-02-11  8:01           ` Richard Biener
  0 siblings, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2020-02-11  7:49 UTC (permalink / raw)
  To: Richard Biener; +Cc: Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

On Tue, Feb 11, 2020 at 08:34:15AM +0100, Richard Biener wrote:
> On Mon, 10 Feb 2020, Segher Boessenkool wrote:
> > Yes, we should decide how often we want to unroll things somewhere before
> > ivopts already, and just use that info here.
> > 
> > Or are there advantages to doing it *in* ivopts?  It sounds like doing
> > it there is probably expensive, but maybe not, and we need to do similar
> > analysis there anyway.
> 
> Well, if the only benefit of doing the unrolling is that IVs get
> cheaper then yes, IVOPTs should drive it.

We need to know much earlier in the pass pipeline how often a loop will
be unrolled.  We don't have to *do* it early.

If we want to know it before ivopts, then obviously it has to be done
earlier.  Otherwise, maybe it is a good idea to do it in ivopts itself.
Or maybe not.  It's just an idea :-)

We know we do not want it *later*, ivopts needs to know this to make
good decisions of its own.

> But usually unrolling exposes redundancies (caught by predictive
> commoning, which drives some unrolling) or it enables better use
> of CPU resources via scheduling (only caught later in RTL).
> For scheduling we have the additional complication that the RTL
> side doesn't have as much of a fancy data dependence analysis
> framework as on the GIMPLE side.  So I'd put my bet on trying to
> move something like SMS to GIMPLE and combine it with unrolling
> (IIRC SMS at most interleaves 1 1/2 loop iterations).

SMS on RTL always was quite disappointing...  Do you expect it will be
more useful on Gimple?  Moving it there is a good idea in any case ;-)

I don't quite see the synergy between SMS and loop unrolling, but maybe
I need to look harder.


Segher

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-11  7:49         ` Segher Boessenkool
@ 2020-02-11  8:01           ` Richard Biener
  2020-02-11 12:46             ` Roman Zhuykov
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Biener @ 2020-02-11  8:01 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

On Tue, 11 Feb 2020, Segher Boessenkool wrote:

> On Tue, Feb 11, 2020 at 08:34:15AM +0100, Richard Biener wrote:
> > On Mon, 10 Feb 2020, Segher Boessenkool wrote:
> > > Yes, we should decide how often we want to unroll things somewhere before
> > > ivopts already, and just use that info here.
> > > 
> > > Or are there advantage to doing it *in* ivopts?  It sounds like doing
> > > it there is probably expensive, but maybe not, and we need to do similar
> > > analysis there anyway.
> > 
> > Well, if the only benefit of doing the unrolling is that IVs get
> > cheaper then yes, IVOPTs should drive it.
> 
> We need to know much earlier in the pass pipeline how often a loop will
> be unrolled.  We don't have to *do* it early.
> 
> If we want to know it before ivopts, then obviously it has to be done
> earlier.  Otherwise, maybe it is a good idea to do it in ivopts itself.
> Or maybe not.  It's just an idea :-)
> 
> We know we do not want it *later*, ivopts needs to know this to make
> good decisions of its own.
> 
> > But usually unrolling exposes redundancies (catched by predictive
> > commoning which drives some unrolling) or it enables better use
> > of CPU resources via scheduling (only catched later in RTL).
> > For scheduling we have the additional complication that the RTL
> > side doesn't have as much of a fancy data dependence analysis
> > framework as on the GIMPLE side.  So I'd put my bet on trying to
> > move something like SMS to GIMPLE and combine it with unrolling
> > (IIRC SMS at most interleaves 1 1/2 loop iterations).
> 
> SMS on RTL always was quite disappointing...

It originally came with "data dependence export from GIMPLE to RTL"
that never materialized so I'm not surprised ;)  It also relies
on doloop detection.

> Do you expect it will be more useful on Gimple?  Moving it there is a 
> good idea in any case ;-)
>
> I don't quite see the synergy between SMS and loop unrolling, but maybe
> I need to look harder.

As said elsewhere, I don't believe in the actual unrolling doing much good
but in removing data dependences in the CPU pipeline.  SMS rotates
the loop, peeling N iterations (and somehow I think for N > 1 that
should better mean unrolling the loop body).
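
Roughly, as a generic software-pipelining picture (not exactly what our SMS
implementation does), for a body  L(i); C(i); S(i)  (load, compute, store):

  L(0); C(0); L(1);           <- prologue: peeled pieces of early iterations
  loop:
    S(i); C(i+1); L(i+2);     <- kernel mixes parts of several iterations
  S(n-2); C(n-1); S(n-1);     <- epilogue drains the pipeline

so the kernel overlaps iterations without necessarily duplicating the whole
body the way classical unrolling does.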

Of course doing "scheduling" on GIMPLE is "interesting" in its own
but OTOH our pipeline DFAs are imprecise enough that one could even
devise some basic GIMPLE <-> "RTL" mapping to make use of it.  But
then scheduling without IVs or register pressure in mind is somewhat
pointless as well.

That said - if I had enough time I'd still think that investigating
"scheduling on GIMPLE" as a replacement for sched1 is an interesting
thing to do.

Richard.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-11  8:01           ` Richard Biener
@ 2020-02-11 12:46             ` Roman Zhuykov
  2020-02-11 13:58               ` Richard Biener
  2020-02-11 18:12               ` Segher Boessenkool
  0 siblings, 2 replies; 45+ messages in thread
From: Roman Zhuykov @ 2020-02-11 12:46 UTC (permalink / raw)
  To: Richard Biener, Segher Boessenkool
  Cc: Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

11.02.2020 11:01, Richard Biener wrote:
> On Tue, 11 Feb 2020, Segher Boessenkool wrote:
>
>> On Tue, Feb 11, 2020 at 08:34:15AM +0100, Richard Biener wrote:
>>> On Mon, 10 Feb 2020, Segher Boessenkool wrote:
>>>> Yes, we should decide how often we want to unroll things somewhere before
>>>> ivopts already, and just use that info here.
>>>>
>>>> Or are there advantage to doing it *in* ivopts?  It sounds like doing
>>>> it there is probably expensive, but maybe not, and we need to do similar
>>>> analysis there anyway.
>>> Well, if the only benefit of doing the unrolling is that IVs get
>>> cheaper then yes, IVOPTs should drive it.
>> We need to know much earlier in the pass pipeline how often a loop will
>> be unrolled.  We don't have to *do* it early.
>>
>> If we want to know it before ivopts, then obviously it has to be done
>> earlier.  Otherwise, maybe it is a good idea to do it in ivopts itself.
>> Or maybe not.  It's just an idea :-)
>>
>> We know we do not want it *later*, ivopts needs to know this to make
>> good decisions of its own.
>>
>>> But usually unrolling exposes redundancies (catched by predictive
>>> commoning which drives some unrolling) or it enables better use
>>> of CPU resources via scheduling (only catched later in RTL).
>>> For scheduling we have the additional complication that the RTL
>>> side doesn't have as much of a fancy data dependence analysis
>>> framework as on the GIMPLE side.  So I'd put my bet on trying to
>>> move something like SMS to GIMPLE and combine it with unrolling
>>> (IIRC SMS at most interleaves 1 1/2 loop iterations).
To clarify, without specifying -fmodulo-sched-allow-regmoves it only
interleaves 2 iterations.  With register moves enabled more iterations
can be considered.
> SMS on RTL always was quite disappointing...
Hmm, even when I tried to move it just a few passes earlier many years ago,
I got another opinion:
https://gcc.gnu.org/ml/gcc-patches/2011-10/msg01526.html
Although without such a move we still have annoying issues which RTL
folks can't solve, see e.g.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93264#c2
> It originally came with "data dependence export from GIMPLE to RTL"
> that never materialized so I'm not surprised ;)  It also relies
> on doloop detection.
My current attempt to drop the doloop dependency is still WIP; hopefully
I'll create a branch in refs/users/ in a month or so.  But older (gcc-7
and earlier) versions are available, see
https://gcc.gnu.org/ml/gcc-patches/2017-02/msg01647.html
Doloops are still supported for some kind of backward compatibility, but
many more loops (those which loop-iv can analyze) are considered in the new SMS.
>> Do you expect it will be more useful on Gimple?  Moving it there is a 
>> good idea in any case ;-)
>>
>> I don't quite see the synergy between SMS and loop unrolling, but maybe
>> I need to look harder.
> As said elsewhere I don't believe in actual unrolling doing much good
> but in removing data dependences in the CPU pipeline.  SMS rotates
> the loop, peeling N iterations (and somehow I think for N > 1 that
> should better mean unrolling the loop body).
Yes, this is what theory tells us.
> Of course doing "scheduling" on GIMPLE is "interesting" in its own
> but OTOH our pipeline DFAs are imprecise enough that one could even
> devise some basic GIMPLE <-> "RTL" mapping to make use of it.  But
> then scheduling without IVs or register pressure in mind is somewhat
> pointless as well.
Unfortunately, even with -fmodulo-sched-allow-regmoves it doesn't
interact much with register pressure.
> That said - if I had enough time I'd still thing that investigating
> "scheduling on GIMPLE" as replacement for sched1 is an interesting
> thing to do.
Sounds good, but IMHO the modulo scheduler is not the best choice for the
first step in implementing such a concept.

Roman

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-11 12:46             ` Roman Zhuykov
@ 2020-02-11 13:58               ` Richard Biener
  2020-02-11 18:00                 ` Segher Boessenkool
  2020-02-11 18:12               ` Segher Boessenkool
  1 sibling, 1 reply; 45+ messages in thread
From: Richard Biener @ 2020-02-11 13:58 UTC (permalink / raw)
  To: Roman Zhuykov
  Cc: Segher Boessenkool, Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

[-- Attachment #1: Type: text/plain, Size: 4302 bytes --]

On Tue, 11 Feb 2020, Roman Zhuykov wrote:

> 11.02.2020 11:01, Richard Biener wrote:
> > On Tue, 11 Feb 2020, Segher Boessenkool wrote:
> >
> >> On Tue, Feb 11, 2020 at 08:34:15AM +0100, Richard Biener wrote:
> >>> On Mon, 10 Feb 2020, Segher Boessenkool wrote:
> >>>> Yes, we should decide how often we want to unroll things somewhere before
> >>>> ivopts already, and just use that info here.
> >>>>
> >>>> Or are there advantage to doing it *in* ivopts?  It sounds like doing
> >>>> it there is probably expensive, but maybe not, and we need to do similar
> >>>> analysis there anyway.
> >>> Well, if the only benefit of doing the unrolling is that IVs get
> >>> cheaper then yes, IVOPTs should drive it.
> >> We need to know much earlier in the pass pipeline how often a loop will
> >> be unrolled.  We don't have to *do* it early.
> >>
> >> If we want to know it before ivopts, then obviously it has to be done
> >> earlier.  Otherwise, maybe it is a good idea to do it in ivopts itself.
> >> Or maybe not.  It's just an idea :-)
> >>
> >> We know we do not want it *later*, ivopts needs to know this to make
> >> good decisions of its own.
> >>
> >>> But usually unrolling exposes redundancies (catched by predictive
> >>> commoning which drives some unrolling) or it enables better use
> >>> of CPU resources via scheduling (only catched later in RTL).
> >>> For scheduling we have the additional complication that the RTL
> >>> side doesn't have as much of a fancy data dependence analysis
> >>> framework as on the GIMPLE side.  So I'd put my bet on trying to
> >>> move something like SMS to GIMPLE and combine it with unrolling
> >>> (IIRC SMS at most interleaves 1 1/2 loop iterations).
> To clarify, without specifying -fmodulo-sched-allow-regmoves it only
> interleaves 2 iterations.  With register moves enabled more iterations
> can be considered.
> > SMS on RTL always was quite disappointing...
> Hmm, even when trying to move it just few passes earlier many years ago,
> got another opinion:
> https://gcc.gnu.org/ml/gcc-patches/2011-10/msg01526.html
> Although without such a move we still have annoying issues which RTL
> folks can't solve, see e.q.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93264#c2
> > It originally came with "data dependence export from GIMPLE to RTL"
> > that never materialized so I'm not surprised ;)  It also relies
> > on doloop detection.
> My current attempt to drop doloop dependency is still WIP, hopefully
> I'll create branch in refs/users/ in a month or so.  But older (gcc-7
> and earlier) versions are available, see
> https://gcc.gnu.org/ml/gcc-patches/2017-02/msg01647.html
> Doloops are still supported for some kind of backward compatibility, but
> much more loops (which loop-iv can analyze) are considered in new SMS.
> >> Do you expect it will be more useful on Gimple?  Moving it there is a 
> >> good idea in any case ;-)
> >>
> >> I don't quite see the synergy between SMS and loop unrolling, but maybe
> >> I need to look harder.
> > As said elsewhere I don't believe in actual unrolling doing much good
> > but in removing data dependences in the CPU pipeline.  SMS rotates
> > the loop, peeling N iterations (and somehow I think for N > 1 that
> > should better mean unrolling the loop body).
> Yes, this is what theory tells us.
> > Of course doing "scheduling" on GIMPLE is "interesting" in its own
> > but OTOH our pipeline DFAs are imprecise enough that one could even
> > devise some basic GIMPLE <-> "RTL" mapping to make use of it.  But
> > then scheduling without IVs or register pressure in mind is somewhat
> > pointless as well.
> Unfortunately, even with -fmodulo-sched-allow-regmoves it doesn't
> interact much with register pressure.
> > That said - if I had enough time I'd still thing that investigating
> > "scheduling on GIMPLE" as replacement for sched1 is an interesting
> > thing to do.
> Sound good, but IMHO modulo scheduler is not the best choice to be the
> first step implementing such a concept.

True ;)   But since the context of this thread is unrolling ...
Not sure how you'd figure out the unroll factor to apply if you want
to do unrolling within a classical scheduling framework?  Maybe
unroll for as long as you can fill slots, until the last instruction
of the first iteration retires?

Richard.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-11 13:58               ` Richard Biener
@ 2020-02-11 18:00                 ` Segher Boessenkool
  2020-02-12  8:07                   ` Richard Biener
  0 siblings, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2020-02-11 18:00 UTC (permalink / raw)
  To: Richard Biener
  Cc: Roman Zhuykov, Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

On Tue, Feb 11, 2020 at 02:58:47PM +0100, Richard Biener wrote:
> On Tue, 11 Feb 2020, Roman Zhuykov wrote:
> > 11.02.2020 11:01, Richard Biener wrote:
> > Sound good, but IMHO modulo scheduler is not the best choice to be the
> > first step implementing such a concept.
> 
> True ;)   But since the context of this thread is unrolling ...
> Not sure how you'd figure the unroll factor to apply if you want
> to do unrolling within a classical scheduling framework?  Maybe
> unroll as much as you can fill slots until the last instruction
> of the first iteration retires?

That will be terrible on register-rich architectures: it *already* is
problematic how often some things are unrolled, blindly unrolling more
would make things worse.  We need to unroll more where it helps, but
less where it does not.  For that we need a good cost/benefit estimate.


Segher

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-11 12:46             ` Roman Zhuykov
  2020-02-11 13:58               ` Richard Biener
@ 2020-02-11 18:12               ` Segher Boessenkool
  2020-02-12  8:13                 ` Richard Biener
  1 sibling, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2020-02-11 18:12 UTC (permalink / raw)
  To: Roman Zhuykov
  Cc: Richard Biener, Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

Hi!

On Tue, Feb 11, 2020 at 03:46:05PM +0300, Roman Zhuykov wrote:
> Hmm, even when trying to move it just few passes earlier many years ago,
> got another opinion:
> https://gcc.gnu.org/ml/gcc-patches/2011-10/msg01526.html
> Although without such a move we still have annoying issues which RTL
> folks can't solve, see e.q.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93264#c2

Basic block partitioning has wildly disproportionate fallout in all
later passes, both in terms of what those *do* (or don't, if partitioning
is enabled), and of impact on the code (not to mention developer time).

Maybe the implementation can be improved, but probably we should do this
in a different way altogether.  The current situation is not good.


Segher

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-11 18:00                 ` Segher Boessenkool
@ 2020-02-12  8:07                   ` Richard Biener
  2020-02-12 21:53                     ` Segher Boessenkool
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Biener @ 2020-02-12  8:07 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Roman Zhuykov, Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

On Tue, 11 Feb 2020, Segher Boessenkool wrote:

> On Tue, Feb 11, 2020 at 02:58:47PM +0100, Richard Biener wrote:
> > On Tue, 11 Feb 2020, Roman Zhuykov wrote:
> > > 11.02.2020 11:01, Richard Biener wrote:
> > > Sound good, but IMHO modulo scheduler is not the best choice to be the
> > > first step implementing such a concept.
> > 
> > True ;)   But since the context of this thread is unrolling ...
> > Not sure how you'd figure the unroll factor to apply if you want
> > to do unrolling within a classical scheduling framework?  Maybe
> > unroll as much as you can fill slots until the last instruction
> > of the first iteration retires?
> 
> That will be terrible on register-rich architectures: it *already* is
> problematic how often some things are unrolled, blindly unrolling more
> would make things worse.  We need to unroll more where it helps, but
> less where it does not.  For that we need a good cost/benefit estimate.

True.  For x86 we tried but did not come up with a sensible estimate
(probably the x86 uarchs are way too complicated to understand).

Richard.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-11 18:12               ` Segher Boessenkool
@ 2020-02-12  8:13                 ` Richard Biener
  2020-02-12 10:02                   ` Segher Boessenkool
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Biener @ 2020-02-12  8:13 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Roman Zhuykov, Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

On Tue, 11 Feb 2020, Segher Boessenkool wrote:

> Hi!
> 
> On Tue, Feb 11, 2020 at 03:46:05PM +0300, Roman Zhuykov wrote:
> > Hmm, even when trying to move it just few passes earlier many years ago,
> > got another opinion:
> > https://gcc.gnu.org/ml/gcc-patches/2011-10/msg01526.html
> > Although without such a move we still have annoying issues which RTL
> > folks can't solve, see e.q.
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93264#c2
> 
> Basic block partitioning has wildly disproportionate fallout in all
> later passes, both in terms of what those *do* (or don't, if partitioning
> is enabled), and of impact on the code (not to mention developer time).
> 
> Maybe the implementation can be improved, but probably we should do this
> in a different way altogether.  The current situation is not good.

I think the expectation that you can go back to CFG layout mode
and then work with CFG layout tools after we've lowered to CFG RTL
is simply bogus.  Yeah, you can probably do analysis things but
I wouldn't be surprised if a CFG RTL -> CFG layout -> CFG RTL cycle 
can wreck things.  Undoubtedly doing CFG manipulations is not going
to work since CFG layout does not respect CFG RTL restrictions.

Partitioning simply uncovered latent bugs, there's nothing wrong
with it IMHO.

Richard.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-12  8:13                 ` Richard Biener
@ 2020-02-12 10:02                   ` Segher Boessenkool
  2020-02-12 10:53                     ` Richard Biener
  0 siblings, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2020-02-12 10:02 UTC (permalink / raw)
  To: Richard Biener
  Cc: Roman Zhuykov, Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

On Wed, Feb 12, 2020 at 09:12:58AM +0100, Richard Biener wrote:
> On Tue, 11 Feb 2020, Segher Boessenkool wrote:
> > Basic block partitioning has wildly disproportionate fallout in all
> > later passes, both in terms of what those *do* (or don't, if partitioning
> > is enabled), and of impact on the code (not to mention developer time).
> > 
> > Maybe the implementation can be improved, but probably we should do this
> > in a different way altogether.  The current situation is not good.
> 
> I think the expectation that you can go back to CFG layout mode
> and then work with CFG layout tools after we've lowered to CFG RTL
> is simply bogus.

Partitioning is also quite problematic if you do not use cfglayout
mode.  For example, in shrink-wrapping.  It prevents a lot there.

> Yeah, you can probably do analysis things but
> I wouldn't be surprised if a CFG RTL -> CFG layout -> CFG RTL cycle 
> can wreck things.  Undoubtedly doing CFG manipulations is not going
> to work since CFG layout does not respect CFG RTL restrictions.

Doing CFG manipulations on CFG RTL mode directly is almost impossible
to do correctly.

For example, bb-reorder.  Which is a much more important optimisation
than partitioning, btw.

> Partitioning simply uncovered latent bugs, there's nothing wrong
> with it IMHO.

I don't agree.  The whole way EDGE_CROSSING works hinders all other
optimisations a lot.


Segher

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-12 10:02                   ` Segher Boessenkool
@ 2020-02-12 10:53                     ` Richard Biener
  2020-02-12 22:05                       ` Segher Boessenkool
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Biener @ 2020-02-12 10:53 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Roman Zhuykov, Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

On Wed, 12 Feb 2020, Segher Boessenkool wrote:

> On Wed, Feb 12, 2020 at 09:12:58AM +0100, Richard Biener wrote:
> > On Tue, 11 Feb 2020, Segher Boessenkool wrote:
> > > Basic block partitioning has wildly disproportionate fallout in all
> > > later passes, both in terms of what those *do* (or don't, if partitioning
> > > is enabled), and of impact on the code (not to mention developer time).
> > > 
> > > Maybe the implementation can be improved, but probably we should do this
> > > in a different way altogether.  The current situation is not good.
> > 
> > I think the expectation that you can go back to CFG layout mode
> > and then work with CFG layout tools after we've lowered to CFG RTL
> > is simply bogus.
> 
> Partitioning is also quite problematic if you do not use cfglayout
> mode.  For example, in shrink-wrapping.  It prevents a lot there.
> 
> > Yeah, you can probably do analysis things but
> > I wouldn't be surprised if a CFG RTL -> CFG layout -> CFG RTL cycle 
> > can wreck things.  Undoubtedly doing CFG manipulations is not going
> > to work since CFG layout does not respect CFG RTL restrictions.
> 
> Doing CFG manipulations on CFG RTL mode directly is almost impossible
> to do correctly.
> 
> For example, bb-reorder.  Which is a much more important optimisation
> than partitioning, btw.

BB reorder switches back and forth as well ... :/

I think both are closely enough related that we probably should do
partitioning from within the same framework?  OTOH BB reorder
happens _much_ later.

> > Partitioning simply uncovered latent bugs, there's nothing wrong
> > with it IMHO.
> 
> I don't agree.  The whole way EDGE_CROSSING works hinders all other
> optimisations a lot.

I'm not sure if it's there for correctness reasons or just to
help checking that nothing "undoes" the partitioning decision.

Richard.

> 
> Segher

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-12  8:07                   ` Richard Biener
@ 2020-02-12 21:53                     ` Segher Boessenkool
  0 siblings, 0 replies; 45+ messages in thread
From: Segher Boessenkool @ 2020-02-12 21:53 UTC (permalink / raw)
  To: Richard Biener
  Cc: Roman Zhuykov, Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

On Wed, Feb 12, 2020 at 09:07:27AM +0100, Richard Biener wrote:
> On Tue, 11 Feb 2020, Segher Boessenkool wrote:
> 
> > On Tue, Feb 11, 2020 at 02:58:47PM +0100, Richard Biener wrote:
> > > On Tue, 11 Feb 2020, Roman Zhuykov wrote:
> > > > 11.02.2020 11:01, Richard Biener wrote:
> > > > Sound good, but IMHO modulo scheduler is not the best choice to be the
> > > > first step implementing such a concept.
> > > 
> > > True ;)   But since the context of this thread is unrolling ...
> > > Not sure how you'd figure the unroll factor to apply if you want
> > > to do unrolling within a classical scheduling framework?  Maybe
> > > unroll as much as you can fill slots until the last instruction
> > > of the first iteration retires?
> > 
> > That will be terrible on register-rich architectures: it *already* is
> > problematic how often some things are unrolled, blindly unrolling more
> > would make things worse.  We need to unroll more where it helps, but
> > less where it does not.  For that we need a good cost/benefit estimate.
> 
> True.  For x86 we tried but did not come up with a sensible estimate
> (probably the x86 uarchs are way too complicated to understand).

There are three main factors at work:

1) The cost for iterating the loop.  This is related to what ivopts
does, and also on most uarchs every loop iteration has some minimum cost
(on many uarchs it needs a fetch redirect, or you can only do one branch
per cycle, etc.; this is mainly important for "small" loops, where what
is "small" differs per uarch).  Unrolling brings this cost down.
2) Cost reduction by better scheduling.
3) The cost for using more registers, when unrolling.  This is worse if
you also use SMS or variable expansion, or simply because of the better
scheduling.  This can increase the cost a *lot*.

1) isn't all that hard to estimate.  2) and esp. 3) are though :-/
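
As a very rough sketch of how these could combine (made-up names, not
actual GCC code; just to show the shape of the estimate):

  /* Net benefit of unrolling a loop UF times: 1) scaled by the number of
     per-iteration overheads saved, 2) as an estimated scheduling gain for
     the unrolled body, 3) as the price of the extra live registers.  */
  static int
  unroll_net_benefit (int iter_overhead, int sched_gain,
                      int extra_live_regs, int spill_cost, int uf)
  {
    int benefit = (uf - 1) * iter_overhead;   /* 1) fewer branches etc.  */
    benefit += sched_gain;                    /* 2) better slot filling  */
    benefit -= extra_live_regs * spill_cost;  /* 3) register pressure    */
    return benefit;
  }

The hard part is of course getting sane numbers for the last two inputs.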


Segher

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-12 10:53                     ` Richard Biener
@ 2020-02-12 22:05                       ` Segher Boessenkool
  2020-02-13  7:48                         ` Richard Biener
  0 siblings, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2020-02-12 22:05 UTC (permalink / raw)
  To: Richard Biener
  Cc: Roman Zhuykov, Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

On Wed, Feb 12, 2020 at 11:53:22AM +0100, Richard Biener wrote:
> On Wed, 12 Feb 2020, Segher Boessenkool wrote:
> > On Wed, Feb 12, 2020 at 09:12:58AM +0100, Richard Biener wrote:
> > > On Tue, 11 Feb 2020, Segher Boessenkool wrote:
> > > > Basic block partitioning has wildly disproportionate fallout in all
> > > > later passes, both in terms of what those *do* (or don't, if partitioning
> > > > is enabled), and of impact on the code (not to mention developer time).
> > > > 
> > > > Maybe the implementation can be improved, but probably we should do this
> > > > in a different way altogether.  The current situation is not good.
> > > 
> > > I think the expectation that you can go back to CFG layout mode
> > > and then work with CFG layout tools after we've lowered to CFG RTL
> > > is simply bogus.
> > 
> > Partitioning is also quite problematic if you do not use cfglayout
> > mode.  For example, in shrink-wrapping.  It prevents a lot there.
> > 
> > > Yeah, you can probably do analysis things but
> > > I wouldn't be surprised if a CFG RTL -> CFG layout -> CFG RTL cycle 
> > > can wreck things.  Undoubtedly doing CFG manipulations is not going
> > > to work since CFG layout does not respect CFG RTL restrictions.
> > 
> > Doing CFG manipulations on CFG RTL mode directly is almost impossible
> > to do correctly.
> > 
> > For example, bb-reorder.  Which is a much more important optimisation
> > than partitioning, btw.
> 
> BB reorder switches back and forth as well ... :/

Yes.  It is extremely hard to change any jumps in cfgrtl mode.

The goal is to use cfglayout mode more, ideally *always*, not to use
it less!

> I think both are closely enough related that we probably should do
> partitioning from within the same framework?  OTOH BB reorder
> happens _much_ later.

This may be doable.  What we need to do first though is to find a better
thing to use than EDGE_CROSSING.

Maybe we can determine which blocks should be hot and cold early, but
actually make that happen only very late (maybe adjusting the decision
if we have to, to make things work)?  That way, intervening passes do not
have to care (much).

> > > Partitioning simply uncovered latent bugs, there's nothing wrong
> > > with it IMHO.
> > 
> > I don't agree.  The whole way EDGE_CROSSING works hinders all other
> > optimisations a lot.
> 
> I'm not sure if it's there for correctness reasons or just to
> help checking that nothing "undoes" the partitioning decision.

The latter.  (It may also make it easier for the former, of course, but
that can be solved anyway).  And that makes life very hard for all later
passes, while it is doubtful that it even gives the best decisions: for
example it prevents a lot of shrink-wrapping, which you dearly *want* to
do on cold code!


Segher

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-12 22:05                       ` Segher Boessenkool
@ 2020-02-13  7:48                         ` Richard Biener
  2020-02-13  9:02                           ` Segher Boessenkool
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Biener @ 2020-02-13  7:48 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Roman Zhuykov, Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

On Wed, 12 Feb 2020, Segher Boessenkool wrote:

> On Wed, Feb 12, 2020 at 11:53:22AM +0100, Richard Biener wrote:
> > On Wed, 12 Feb 2020, Segher Boessenkool wrote:
> > > On Wed, Feb 12, 2020 at 09:12:58AM +0100, Richard Biener wrote:
> > > > On Tue, 11 Feb 2020, Segher Boessenkool wrote:
> > > > > Basic block partitioning has wildly disproportionate fallout in all
> > > > > later passes, both in terms of what those *do* (or don't, if partitioning
> > > > > is enabled), and of impact on the code (not to mention developer time).
> > > > > 
> > > > > Maybe the implementation can be improved, but probably we should do this
> > > > > in a different way altogether.  The current situation is not good.
> > > > 
> > > > I think the expectation that you can go back to CFG layout mode
> > > > and then work with CFG layout tools after we've lowered to CFG RTL
> > > > is simply bogus.
> > > 
> > > Partitioning is also quite problematic if you do not use cfglayout
> > > mode.  For example, in shrink-wrapping.  It prevents a lot there.
> > > 
> > > > Yeah, you can probably do analysis things but
> > > > I wouldn't be surprised if a CFG RTL -> CFG layout -> CFG RTL cycle 
> > > > can wreck things.  Undoubtedly doing CFG manipulations is not going
> > > > to work since CFG layout does not respect CFG RTL restrictions.
> > > 
> > > Doing CFG manipulations on CFG RTL mode directly is almost impossible
> > > to do correctly.
> > > 
> > > For example, bb-reorder.  Which is a much more important optimisation
> > > than partitioning, btw.
> > 
> > BB reorder switches back and forth as well ... :/
> 
> Yes.  It is extremely hard to change any jumps in cfgrtl mode.
> 
> The goal is to use cfglayout mode more, ideally *always*, not to use
> it less!

Sure!  It must be that split requires CFG RTL (it doesn't say so, but
we don't have a PROP_rtllayout or a "cannot work with" set of
properties).  Otherwise I'm not sure why we go out of CFG layout
mode so early.  Passes not messing with the CFG at all should be
ignorant of what mode we are in.

> > I think both are closely enough related that we probably should do
> > partitioning from within the same framework?  OTOH BB reorder
> > happens _much_ later.
> 
> This may be doable.  What we need to do first though is to find a better
> thing to use than EDGE_CROSSING.
> 
> Maybe we can determine which blocks should be hot and cold early, but
> actually make that happen only very late (maybe adjusting the decision
> if we have to to make things work)?  That way, intervening passes do not
> have to care (much).

The question is whether we can even preserve things like BB counts
and edge frequencies after we've gone out of CFG layout mode even once.

> > > > Partitioning simply uncovered latent bugs, there's nothing wrong
> > > > with it IMHO.
> > > 
> > > I don't agree.  The whole way EDGE_CROSSING works hinders all other
> > > optimisations a lot.
> > 
> > I'm not sure if it's there for correctness reasons or just to
> > help checking that nothing "undoes" the partitioning decision.
> 
> The latter.  (It may also make it easier for the former, of course, but
> that can be solved anyway).  And that makes live very hard for all later
> passes, while it is doubtful that it even give the best decisions: for
> example it prevents a lot of shrink-wrapping, which you dearly *want* to
> do on cold code!

Sure.  So do that earlier ;)

Richard.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling
  2020-02-13  7:48                         ` Richard Biener
@ 2020-02-13  9:02                           ` Segher Boessenkool
  0 siblings, 0 replies; 45+ messages in thread
From: Segher Boessenkool @ 2020-02-13  9:02 UTC (permalink / raw)
  To: Richard Biener
  Cc: Roman Zhuykov, Kewen.Lin, GCC Patches, Bill Schmidt, bin.cheng

On Thu, Feb 13, 2020 at 08:48:33AM +0100, Richard Biener wrote:
> On Wed, 12 Feb 2020, Segher Boessenkool wrote:
> > On Wed, Feb 12, 2020 at 11:53:22AM +0100, Richard Biener wrote:
> > > BB reorder switches back and forth as well ... :/
> > 
> > Yes.  It is extremely hard to change any jumps in cfgrtl mode.
> > 
> > The goal is to use cfglayout mode more, ideally *always*, not to use
> > it less!
> 
> Sure!  It must be that split requires CFG RTL (it doesn't say so, but
> we don't have a PROP_rtllayout or a "cannot work with" set of
> properties).  Otherwise I'm not sure why we go out of CFG layout
> mode so early.  Passes not messing with the CFG at all should be
> ignorant of what mode we are in.

Well, split can split jump insns as well, so it currently cannot use
cfglayout mode.

We could probably make split not split jumps, only do that in split2
and later, which is after reload.  Getting IRA and LRA to work with
cfglayout mode will be fun.

Right after split2 is pro_and_epilogue.  Everything after that doesn't
want cfglayout mode I'd say.  So we don't even have to move
out_of_cfglayout all that far, but there are some big nasties it has
to move over.

> > > I think both are closely enough related that we probably should do
> > > partitioning from within the same framework?  OTOH BB reorder
> > > happens _much_ later.
> > 
> > This may be doable.  What we need to do first though is to find a better
> > thing to use than EDGE_CROSSING.
> > 
> > Maybe we can determine which blocks should be hot and cold early, but
> > actually make that happen only very late (maybe adjusting the decision
> > if we have to to make things work)?  That way, intervening passes do not
> > have to care (much).
> 
> The question is whether we can even preserve things like BB counts
> and edge frequencies once we've gone out of CFG layout mode for once.

Why not?  cfglayout mode simply does not include simple branches and
jump instructions, and fallthrough is not necessarily to the next bb in
the insn stream, but that is all (basically).  The cfg is just the same
in both rtl modes.  What destroys the bb and edge frequencies?
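
A rough picture (illustration only, many details elided):

  cfgrtl mode (insn stream order is significant):
    bb2: ... ; jump_insn to L5 ; barrier
    bb3: ...                     <- must directly follow bb2 in the insn
                                    stream if bb2 falls through to it
  cfglayout mode (order is only a hint):
    bb2: ...                     <- just an edge bb2->bb5, no jump_insn
    bb3: ...                     <- a fallthrough edge may target any block;
                                    jumps are materialized again when we go
                                    back to cfgrtl mode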

> > > > I don't agree.  The whole way EDGE_CROSSING works hinders all other
> > > > optimisations a lot.
> > > 
> > > I'm not sure if it's there for correctness reasons or just to
> > > help checking that nothing "undoes" the partitioning decision.
> > 
> > The latter.  (It may also make it easier for the former, of course, but
> > that can be solved anyway).  And that makes live very hard for all later
> > passes, while it is doubtful that it even give the best decisions: for
> > example it prevents a lot of shrink-wrapping, which you dearly *want* to
> > do on cold code!
> 
> Sure.  So do that earlier ;)

You cannot.  It has to stay after RA, obviously.  And you cannot move RA
earlier, to before combine.

We do shrink-wrapping (and all other *logue work) pretty much as early
as possible, already.


Segher

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p
  2020-01-20 13:20     ` Segher Boessenkool
@ 2020-02-25  9:46       ` Kewen.Lin
  2020-03-02 11:09         ` Richard Sandiford
  0 siblings, 1 reply; 45+ messages in thread
From: Kewen.Lin @ 2020-02-25  9:46 UTC (permalink / raw)
  To: Segher Boessenkool, GCC Patches, richard.sandiford
  Cc: Bill Schmidt, bin.cheng, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 1979 bytes --]

on 2020/1/20 9:14 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Jan 20, 2020 at 10:42:12AM +0000, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> gcc/ChangeLog
>>>
>>> 2020-01-16  Kewen Lin  <linkw@gcc.gnu.org>
>>>
>>> 	* config/rs6000/rs6000.c (TARGET_STRIDE_DFORM_VALID_P): New macro.
>>> 	(rs6000_stride_dform_valid_p): New function.
>>> 	* doc/tm.texi: Regenerate.
>>> 	* doc/tm.texi.in (TARGET_STRIDE_DFORM_VALID_P): New hook.
>>> 	* target.def (stride_dform_valid_p): New hook.
>>
>> It looks like we should able to derive this information from the normal
>> legitimate_address_p hook.
> 
> Yes, probably.
> 
>> Also, "D-form" vs. "X-form" is AFAIK a PowerPC-specific classification.
>> It would be good to use a more generic term in target-independent code.
> 
> Yeah.  X-form is [reg+reg] addressing; D-form is [reg+imm] addressing.
> We can do simple [reg] addressing in either form as well.  Whether D-form
> can be used for some access depends on many factors (ISA version, mode of
> the datum, alignment, and how big the offset is of course).  But the usual
> legitimate_address_p hook should do fine.  The ivopts code already has an
> addr_offset_valid_p function, maybe that could be adjusted for this?
> 
> 
> Segher
> 

Hi Segher and Richard S.,

Sorry for the late response.  Thanks for your comments on the
legitimate_address_p hook and the addr_offset_valid_p function.  I updated the
IVOPTs part to use addr_offset_valid_p; although
rs6000_legitimate_offset_address_p doesn't check strictly all the time (e.g.
when worst_case is false), it works well with SPEC2017.  Based on that, the
hook is simplified as in the attached patch.

BR,
Kewen
-----------

gcc/ChangeLog

2020-02-25  Kewen Lin  <linkw@gcc.gnu.org>

	* config/rs6000/rs6000.c (TARGET_CONSIDER_REG_OFFSET_FOR_UNROLL_P): New
	macro.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in (TARGET_CONSIDER_REG_OFFSET_FOR_UNROLL_P): New hook.
	* target.def (consider_reg_offset_for_unroll_p): New hook.

[-- Attachment #2: hook_v2.diff --]
[-- Type: text/plain, Size: 3252 bytes --]

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 4c1c0e9..0eb13df 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1652,6 +1652,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_PREDICT_DOLOOP_P
 #define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p
 
+#undef TARGET_CONSIDER_REG_OFFSET_FOR_UNROLL_P
+#define TARGET_CONSIDER_REG_OFFSET_FOR_UNROLL_P true
+
 #undef TARGET_HAVE_COUNT_REG_DECR_P
 #define TARGET_HAVE_COUNT_REG_DECR_P true
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 19985ad..fc21a3b 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11680,6 +11680,18 @@ function version at run-time for a given set of function versions.
 body must be generated.
 @end deftypefn
 
+@deftypevr {Target Hook} bool TARGET_CONSIDER_REG_OFFSET_FOR_UNROLL_P
+When RTL unrolling is performed on a loop, the duplicated loop iterations
+get their own IV step update expressions.  But if an IV is derived from an
+address object, it is profitable to fold the required offset updates into
+the memory access expressions themselves, provided the target supports a
+register offset addressing mode and the resulting offset is in the valid
+range.  Return true if the target supports register offset addressing and
+wants IVOPTs or other passes to take this into account to generate better
+code for unrolling.  This requires invoking the unroll factor estimation
+in the middle-end.  The default version of this hook returns false.
+@end deftypevr
+
 @deftypefn {Target Hook} bool TARGET_PREDICT_DOLOOP_P (class loop *@var{loop})
 Return true if we can predict it is possible to use a low-overhead loop
 for a particular loop.  The parameter @var{loop} is a pointer to the loop.
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 1a16150..45377d3 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7953,6 +7953,8 @@ to by @var{ce_info}.
 
 @hook TARGET_GENERATE_VERSION_DISPATCHER_BODY
 
+@hook TARGET_CONSIDER_REG_OFFSET_FOR_UNROLL_P
+
 @hook TARGET_PREDICT_DOLOOP_P
 
 @hook TARGET_HAVE_COUNT_REG_DECR_P
diff --git a/gcc/target.def b/gcc/target.def
index b5e82ff..a966d4f 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4299,7 +4299,20 @@ DEFHOOK
  emits a @code{speculation_barrier} instruction if that is defined.",
 rtx, (machine_mode mode, rtx result, rtx val, rtx failval),
  default_speculation_safe_value)
- 
+
+DEFHOOKPOD
+(consider_reg_offset_for_unroll_p,
+ "When RTL unrolling performs on a loop, the duplicated loop iterations\n\
+have appropriate IV step update expressions.  But if an IV is derived from\n\
+address object, it is profitable to fill its required offset updates into\n\
+appropriate memory access expressions if target memory accessing supports\n\
+the register offset mode and the resulted offset is in the valid range.\n\
+Return true if target supports register offset memory accessing mode and\n\
+wants IVOPTs or other passes to consider this information for better code\n\
+for unrolling.  It needs to invoke unroll factor estimation in middle-end.\n\
+The default version of this hook returns false.",
+ bool, false)
+
 DEFHOOK
 (predict_doloop_p,
  "Return true if we can predict it is possible to use a low-overhead loop\n\

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 3/4 V2 GCC11] IVOPTs Consider cost_step on different forms during unrolling
  2020-01-16 10:06 ` [PATCH 3/4 GCC11] IVOPTs Consider cost_step on different forms during unrolling Kewen.Lin
@ 2020-02-25  9:48   ` Kewen.Lin
  2020-05-13  5:42     ` [PATCH 3/4 V3 " Kewen.Lin
  0 siblings, 1 reply; 45+ messages in thread
From: Kewen.Lin @ 2020-02-25  9:48 UTC (permalink / raw)
  To: GCC Patches; +Cc: Segher Boessenkool, Bill Schmidt, bin.cheng, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 1109 bytes --]

Hi,

Following the proposed hook changes, I updated this patch.  The main changes:
  1) Check with addr_offset_valid_p instead.
  2) Check the 1st and the last use for the whole address group.
  3) Scale up group costs accordingly.

Bootstrapped/regtested on powerpc64le-linux-gnu (LE).

BR,
Kewen
-----------

gcc/ChangeLog

2020-02-25  Kewen Lin  <linkw@gcc.gnu.org>

	* tree-ssa-loop-ivopts.c (struct iv_group): New field reg_offset_p.
	(struct iv_cand): New field reg_offset_p.
	(struct ivopts_data): New field consider_reg_offset_for_unroll_p.
	(dump_groups): Dump group with reg_offset_p.
	(record_group): Initialize reg_offset_p.
	(mark_reg_offset_groups): New function.
	(find_interesting_uses): Call mark_reg_offset_groups.
	(add_candidate_1): Update reg_offset_p if derived from reg_offset_p group.
	(set_group_iv_cost): Scale up group cost with estimate_unroll_factor if
	consider_reg_offset_for_unroll_p.
	(determine_iv_cost): Increase step cost with estimate_unroll_factor if
	consider_reg_offset_for_unroll_p.
	(tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
	consider_reg_offset_for_unroll_p.

[-- Attachment #2: ivopts_v2.diff --]
[-- Type: text/plain, Size: 7017 bytes --]

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 1ce6d8b..c01fd76 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -429,6 +429,8 @@ struct iv_group
   struct iv_cand *selected;
   /* To indicate this is a doloop use group.  */
   bool doloop_p;
+  /* To indicate this group is reg_offset valid.  */
+  bool reg_offset_p;
   /* Uses in the group.  */
   vec<struct iv_use *> vuses;
 };
@@ -470,6 +472,7 @@ struct iv_cand
   struct iv *orig_iv;	/* The original iv if this cand is added from biv with
 			   smaller type.  */
   bool doloop_p;	/* Whether this is a doloop candidate.  */
+  bool reg_offset_p;    /* Derived from one reg_offset valid group.  */
 };
 
 /* Hashtable entry for common candidate derived from iv uses.  */
@@ -650,6 +653,10 @@ struct ivopts_data
 
   /* Whether the loop has doloop comparison use.  */
   bool doloop_use_p;
+
+  /* Whether we need to consider the register offset addressing mode for the
+     loop, given upcoming unrolling by the estimated unroll factor.  */
+  bool consider_reg_offset_for_unroll_p;
 };
 
 /* An assignment of iv candidates to uses.  */
@@ -837,6 +844,11 @@ dump_groups (FILE *file, struct ivopts_data *data)
 	  gcc_assert (group->type == USE_COMPARE);
 	  fprintf (file, "  Type:\tCOMPARE\n");
 	}
+      if (group->reg_offset_p)
+	{
+	  gcc_assert (address_p (group->type));
+	  fprintf (file, "  reg_offset_p: true\n");
+	}
       for (j = 0; j < group->vuses.length (); j++)
 	dump_use (file, group->vuses[j]);
     }
@@ -1579,6 +1591,7 @@ record_group (struct ivopts_data *data, enum use_type type)
   group->related_cands = BITMAP_ALLOC (NULL);
   group->vuses.create (1);
   group->doloop_p = false;
+  group->reg_offset_p = false;
 
   data->vgroups.safe_push (group);
   return group;
@@ -2728,6 +2741,60 @@ split_address_groups (struct ivopts_data *data)
     }
 }
 
+/* Go through all address type groups; check and mark the groups for which
+   the reg_offset addressing mode is valid.  */
+
+static void
+mark_reg_offset_groups (struct ivopts_data *data)
+{
+  class loop *loop = data->current_loop;
+  gcc_assert (data->current_loop->estimated_unroll > 1);
+  bool any_reg_offset_p = false;
+
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+    {
+      struct iv_group *group = data->vgroups[i];
+      if (address_p (group->type))
+	{
+	  struct iv_use *head_use = group->vuses[0];
+	  if (!tree_fits_poly_int64_p (head_use->iv->step))
+	    continue;
+
+	  bool found = true;
+	  poly_int64 step = tree_to_poly_int64 (head_use->iv->step);
+	  /* Max extra offset to fill for head of group.  */
+	  poly_int64 max_increase = (loop->estimated_unroll - 1) * step;
+	  /* Check whether this increment is still valid.  */
+	  if (!addr_offset_valid_p (head_use, max_increase))
+	    found = false;
+
+	  unsigned group_size = group->vuses.length ();
+	  /* Check the whole group further.  */
+	  if (group_size > 1)
+	    {
+	      /* Only need to check the last one in the group; if both the head
+		 and the last are valid, the others should be fine.  */
+	      struct iv_use *last_use = group->vuses[group_size - 1];
+	      poly_int64 max_delta
+		= last_use->addr_offset - head_use->addr_offset;
+	      poly_int64 max_offset = max_delta + max_increase;
+	      if (maybe_ne (max_delta, 0)
+		  && !addr_offset_valid_p (head_use, max_offset))
+		found = false;
+	    }
+
+	  if (found)
+	    {
+	      group->reg_offset_p = true;
+	      any_reg_offset_p = true;
+	    }
+	}
+    }
+
+  if (!any_reg_offset_p)
+    data->consider_reg_offset_for_unroll_p = false;
+}
+
 /* Finds uses of the induction variables that are interesting.  */
 
 static void
@@ -2759,6 +2826,9 @@ find_interesting_uses (struct ivopts_data *data)
 
   split_address_groups (data);
 
+  if (data->consider_reg_offset_for_unroll_p)
+    mark_reg_offset_groups (data);
+
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       fprintf (dump_file, "\n<IV Groups>:\n");
@@ -3144,6 +3214,7 @@ add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
       cand->important = important;
       cand->incremented_at = incremented_at;
       cand->doloop_p = doloop;
+      cand->reg_offset_p = false;
       data->vcands.safe_push (cand);
 
       if (!poly_int_tree_p (step))
@@ -3180,7 +3251,11 @@ add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
 
   /* Relate candidate to the group for which it is added.  */
   if (use)
-    bitmap_set_bit (data->vgroups[use->group_id]->related_cands, i);
+    {
+      bitmap_set_bit (data->vgroups[use->group_id]->related_cands, i);
+      if (data->vgroups[use->group_id]->reg_offset_p)
+	cand->reg_offset_p = true;
+    }
 
   return cand;
 }
@@ -3638,6 +3713,14 @@ set_group_iv_cost (struct ivopts_data *data,
       return;
     }
 
+  /* Since we priced the step cost of non reg_offset IV cands higher, we
+     should scale up the corresponding IV group costs.  Simply leave the
+     USE_COMPARE at the loop exit unscaled; FIXME if multiple exits are
+     supported or no loop exit comparisons matter.  */
+  if (data->consider_reg_offset_for_unroll_p
+      && group->vuses[0]->type != USE_COMPARE)
+    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
+
   if (data->consider_all_candidates)
     {
       group->cost_map[cand->id].cand = cand;
@@ -5874,6 +5957,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
 
+  /* Consider additional step updates during unrolling.  */
+  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
+    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
+
   /* Prefer the original ivs unless we may gain something by replacing it.
      The reason is to make debugging simpler; so this is not relevant for
      artificial ivs created by other optimization passes.  */
@@ -7960,6 +8047,7 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop,
   data->current_loop = loop;
   data->loop_loc = find_loop_location (loop).get_location_t ();
   data->speed = optimize_loop_for_speed_p (loop);
+  data->consider_reg_offset_for_unroll_p = false;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
@@ -7992,6 +8080,16 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop,
   if (!find_induction_variables (data))
     goto finish;
 
+  if (targetm.consider_reg_offset_for_unroll_p && exit)
+    {
+      tree_niter_desc *desc = niter_for_exit (data, exit);
+      estimate_unroll_factor (loop, desc);
+      data->consider_reg_offset_for_unroll_p = loop->estimated_unroll > 1;
+      if (dump_file && (dump_flags & TDF_DETAILS)
+	  && data->consider_reg_offset_for_unroll_p)
+	fprintf (dump_file, "estimated_unroll:%u\n", loop->estimated_unroll);
+    }
+
   /* Finds interesting uses (item 1).  */
   find_interesting_uses (data);
   if (data->vgroups.length () > MAX_CONSIDERED_GROUPS)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p
  2020-02-25  9:46       ` Kewen.Lin
@ 2020-03-02 11:09         ` Richard Sandiford
  2020-03-03 12:26           ` Kewen.Lin
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Sandiford @ 2020-03-02 11:09 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Segher Boessenkool, GCC Patches, Bill Schmidt, bin.cheng,
	Richard Guenther

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> on 2020/1/20 9:14 PM, Segher Boessenkool wrote:
>> Hi!
>> 
>> On Mon, Jan 20, 2020 at 10:42:12AM +0000, Richard Sandiford wrote:
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>> gcc/ChangeLog
>>>>
>>>> 2020-01-16  Kewen Lin  <linkw@gcc.gnu.org>
>>>>
>>>> 	* config/rs6000/rs6000.c (TARGET_STRIDE_DFORM_VALID_P): New macro.
>>>> 	(rs6000_stride_dform_valid_p): New function.
>>>> 	* doc/tm.texi: Regenerate.
>>>> 	* doc/tm.texi.in (TARGET_STRIDE_DFORM_VALID_P): New hook.
>>>> 	* target.def (stride_dform_valid_p): New hook.
>>>
>>> It looks like we should able to derive this information from the normal
>>> legitimate_address_p hook.
>> 
>> Yes, probably.
>> 
>>> Also, "D-form" vs. "X-form" is AFAIK a PowerPC-specific classification.
>>> It would be good to use a more generic term in target-independent code.
>> 
>> Yeah.  X-form is [reg+reg] addressing; D-form is [reg+imm] addressing.
>> We can do simple [reg] addressing in either form as well.  Whether D-form
>> can be used for some access depends on many factors (ISA version, mode of
>> the datum, alignment, and how big the offset is of course).  But the usual
>> legitimate_address_p hook should do fine.  The ivopts code already has an
>> addr_offset_valid_p function, maybe that could be adjusted for this?
>> 
>> 
>> Segher
>> 
>
> Hi Segher and Richard S.,
>
> Sorry for late response.  Thanks for your comments on legitimate_address_p hook
> and function addr_offset_valid_p.  I updated the IVOPTs part with
> addr_offset_valid_p, although rs6000_legitimate_offset_address_p doesn't check
> strictly all the time (like worst_case is false), it works well with SPEC2017.
> Based on it, the hook is simplified as attached patch.

Thanks for the update.  I think it would be better to add a --param
rather than a bool hook though.  Targets can then change the default
(if necessary) using SET_OPTION_IF_UNSET.  The user can override the
default if they want to.

It might also be better to start with an opt-out rather than an opt-in
(i.e. with the default param value being true rather than false).
With a default-off option, it's much harder to tell whether something
has been deliberately turned off or whether no-one's thought about it
either way.  We can always flip the default later if it turns out that
nothing other than rs6000 benefits.
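
Concretely, I mean something like this in the target's option-override
code (the param name is only a placeholder for whatever it ends up being
called):

  /* Placeholder param name; pick whichever default the target wants.  */
  SET_OPTION_IF_UNSET (opts, opts_set,
                       param_iv_consider_reg_offset_for_unroll, 1);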

Richard

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p
  2020-03-02 11:09         ` Richard Sandiford
@ 2020-03-03 12:26           ` Kewen.Lin
  2020-05-13  5:50             ` Kewen.Lin
  0 siblings, 1 reply; 45+ messages in thread
From: Kewen.Lin @ 2020-03-03 12:26 UTC (permalink / raw)
  To: Segher Boessenkool, GCC Patches, Bill Schmidt, bin.cheng,
	Richard Guenther, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 2116 bytes --]

>> Hi Segher and Richard S.,
>>
>> Sorry for late response.  Thanks for your comments on legitimate_address_p hook
>> and function addr_offset_valid_p.  I updated the IVOPTs part with
>> addr_offset_valid_p, although rs6000_legitimate_offset_address_p doesn't check
>> strictly all the time (like worst_case is false), it works well with SPEC2017.
>> Based on it, the hook is simplified as attached patch.
> 
> Thanks for the update.  I think it would be better to add a --param
> rather than a bool hook though.  Targets can then change the default
> (if necessary) using SET_OPTION_IF_UNSET.  The user can override the
> default if they want to.
> 
> It might also be better to start with an opt-out rather than an opt-in
> (i.e. with the default param value being true rather than false).
> With a default-off option, it's much harder to tell whether something
> has been deliberately turned off or whether no-one's thought about it
> either way.  We can always flip the default later if it turns out that
> nothing other than rs6000 benefits.
> 
> Richard
> 

Hi Richard,

Thanks for your comments!  It's a good idea to use a param, for the
flexibility.  And yes, it sounds good to have more targets try it and
make it better.  But I have a bit of concern about turning it on by
default.  Since it relies on unroll factor estimation, as part 1/4 shows,
it calls targetm.loop_unroll_adjust if the target supports it, which used
to work at the RTL level.  To avoid possible ICEs, I intend to turn it
off for those targets (s390 & i386) that define that hook, since without
a good understanding of those targets, it's hard for me to extend them
with GIMPLE level support.  Does that make sense?

The updated patch has been attached.

BR,
Kewen
---------

gcc/ChangeLog

2020-03-03  Kewen Lin  <linkw@gcc.gnu.org>

	* doc/invoke.texi (iv-consider-reg-offset-for-unroll): Document new option.
	* params.opt (iv-consider-reg-offset-for-unroll): New.
	* config/s390/s390.c (s390_option_override_internal): Disable parameter
	iv-consider-reg-offset-for-unroll by default.
	* config/i386/i386-options.c (ix86_option_override_internal): Likewise.

[-- Attachment #2: param.diff --]
[-- Type: text/plain, Size: 3232 bytes --]

diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index e0be493..41c99b3 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -2902,6 +2902,12 @@ ix86_option_override_internal (bool main_args_p,
   if (ix86_indirect_branch != indirect_branch_keep)
     SET_OPTION_IF_UNSET (opts, opts_set, flag_jump_tables, 0);
 
+  /* Disable this for now till loop_unroll_adjust supports gimple level checks,
+     to avoid possible ICE.  */
+  if (opts->x_optimize >= 1)
+    SET_OPTION_IF_UNSET (opts, opts_set,
+			 param_iv_consider_reg_offset_for_unroll, 0);
+
   return true;
 }
 
diff --git a/gcc/config/s390/s390.c b/gcc/config/s390/s390.c
index ebba670..ae4c2bd 100644
--- a/gcc/config/s390/s390.c
+++ b/gcc/config/s390/s390.c
@@ -15318,6 +15318,12 @@ s390_option_override_internal (struct gcc_options *opts,
      not the case when the code runs before the prolog. */
   if (opts->x_flag_fentry && !TARGET_64BIT)
     error ("%<-mfentry%> is supported only for 64-bit CPUs");
+
+  /* Disable this for now till loop_unroll_adjust supports gimple level checks,
+     to avoid possible ICE.  */
+  if (opts->x_optimize >= 1)
+    SET_OPTION_IF_UNSET (opts, opts_set,
+			 param_iv_consider_reg_offset_for_unroll, 0);
 }
 
 static void
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index fa98e2f..502031c 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12220,6 +12220,15 @@ If the number of candidates in the set is smaller than this value,
 always try to remove unnecessary ivs from the set
 when adding a new one.
 
+@item iv-consider-reg-offset-for-unroll
+When a loop is unrolled at the RTL level, each duplicated loop body gets its
+own induction variable step update.  If an induction variable is derived from
+an address object, it is profitable to fold the required offset updates into
+the memory access expressions instead, provided the target supports a register
+offset addressing mode and the resulting offsets stay in the valid range.  The
+induction variable optimizations use this information to generate better code
+for the unrolled loop.  This relies on the middle-end unroll factor estimation.
+
 @item avg-loop-niter
 Average number of iterations of a loop.
 
diff --git a/gcc/params.opt b/gcc/params.opt
index 8e4217d..31424cf 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -270,6 +270,10 @@ Bound on number of candidates below that all candidates are considered in iv opt
 Common Joined UInteger Var(param_iv_max_considered_uses) Init(250) Param Optimization
 Bound on number of iv uses in loop optimized in iv optimizations.
 
+-param=iv-consider-reg-offset-for-unroll=
+Common Joined UInteger Var(param_iv_consider_reg_offset_for_unroll) Init(1) Optimization IntegerRange(0, 1) Param
+Whether iv optimizations should mark address groups that stay valid for register offset addressing across unrolling, and cost their derived iv candidates using the estimated unroll factor.
+
 -param=jump-table-max-growth-ratio-for-size=
 Common Joined UInteger Var(param_jump_table_max_growth_ratio_for_size) Init(300) Param Optimization
 The maximum code size growth ratio when expanding into a jump table (in percent).  The parameter is used when optimizing for size.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 3/4 V3 GCC11] IVOPTs Consider cost_step on different forms during unrolling
  2020-02-25  9:48   ` [PATCH 3/4 V2 " Kewen.Lin
@ 2020-05-13  5:42     ` Kewen.Lin
  0 siblings, 0 replies; 45+ messages in thread
From: Kewen.Lin @ 2020-05-13  5:42 UTC (permalink / raw)
  To: GCC Patches; +Cc: Segher Boessenkool, Bill Schmidt, bin.cheng, Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 2106 bytes --]

Hi,

Updated to v3 according to 2/4's param change.

BR,
Kewen
-----------

gcc/ChangeLog

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* tree-ssa-loop-ivopts.c (struct iv_group): New field reg_offset_p.
	(struct iv_cand): New field reg_offset_p.
	(struct ivopts_data): New field consider_reg_offset_for_unroll_p.
	(dump_groups): Dump group with reg_offset_p.
	(record_group): Initialize reg_offset_p.
	(mark_reg_offset_groups): New function.
	(find_interesting_uses): Call mark_reg_offset_groups.
	(add_candidate_1): Update reg_offset_p if derived from reg_offset_p group.
	(set_group_iv_cost): Scale up group cost with estimate_unroll_factor if
	consider_reg_offset_for_unroll_p.
	(determine_iv_cost): Increase step cost with estimate_unroll_factor if
	consider_reg_offset_for_unroll_p.
	(tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
	consider_reg_offset_for_unroll_p.
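
For illustration only (this is not part of the patch and uses made-up
numbers), the checks and cost adjustments listed above boil down to
something like the following standalone sketch, assuming an estimated
unroll factor of 8 and a 16-byte iv step:

  #include <stdio.h>

  /* Stand-in for addr_offset_valid_p: assume a plain signed 16-bit
     displacement field, purely for illustration.  */
  static int
  offset_valid_p (long off)
  {
    return off >= -32768 && off < 32768;
  }

  int
  main (void)
  {
    long estimated_unroll = 8; /* assumed, as if from estimate_unroll_factor */
    long step = 16;            /* assumed byte step of the address iv */
    long head_off = 0;         /* addr_offset of the first use in the group */
    long last_off = 32;        /* addr_offset of the last use in the group */

    /* mark_reg_offset_groups: the head use must stay valid with the largest
       extra offset unrolling can add, and the last use must stay valid with
       that increase on top of its delta from the head.  */
    long max_increase = (estimated_unroll - 1) * step;      /* 112 */
    long max_offset = (last_off - head_off) + max_increase; /* 144 */
    int reg_offset_p = offset_valid_p (max_increase)
                       && offset_valid_p (max_offset);

    /* determine_iv_cost: a candidate that is not reg_offset_p pays for the
       step update of every extra unrolled copy.  */
    long cost_step = 4;        /* assumed add cost */
    long extra = reg_offset_p ? 0 : (estimated_unroll - 1) * cost_step;

    printf ("max_increase=%ld max_offset=%ld reg_offset_p=%d extra=%ld\n",
            max_increase, max_offset, reg_offset_p, extra);
    return 0;
  }

With these numbers a candidate that is not reg_offset_p gets 7 * cost_step
added in determine_iv_cost, while set_group_iv_cost scales the non-compare
group costs by the estimated unroll factor so both sides of the trade-off
stay comparable.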


on 2020/2/25 5:48 PM, Kewen.Lin wrote:
> Hi,
> 
> As the proposed hook changed, this is updated with the main changes:
>   1) Check with addr_offset_valid_p instead.
>   2) Check the 1st and the last use for the whole address group.
>   3) Scale up group costs accordingly.
> 
> Bootstrapped/regtested on powerpc64le-linux-gnu (LE).
> 
> BR,
> Kewen
> -----------
> 
> gcc/ChangeLog
> 
> 2020-02-25  Kewen Lin  <linkw@gcc.gnu.org>
> 
> 	* tree-ssa-loop-ivopts.c (struct iv_group): New field reg_offset_p.
> 	(struct iv_cand): New field reg_offset_p.
> 	(struct ivopts_data): New field consider_reg_offset_for_unroll_p.
> 	(dump_groups): Dump group with reg_offset_p.
> 	(record_group): Initialize reg_offset_p.
> 	(mark_reg_offset_groups): New function.
> 	(find_interesting_uses): Call mark_reg_offset_groups.
> 	(add_candidate_1): Update reg_offset_p if derived from reg_offset_p group.
> 	(set_group_iv_cost): Scale up group cost with estimate_unroll_factor if
> 	consider_reg_offset_for_unroll_p.
> 	(determine_iv_cost): Increase step cost with estimate_unroll_factor if
> 	consider_reg_offset_for_unroll_p.
> 	(tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
> 	consider_reg_offset_for_unroll_p.
> 

[-- Attachment #2: ivopts_v3.diff --]
[-- Type: text/plain, Size: 7029 bytes --]

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 1d2697ae1ba..1b7e4621f37 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -432,6 +432,8 @@ struct iv_group
   struct iv_cand *selected;
   /* To indicate this is a doloop use group.  */
   bool doloop_p;
+  /* To indicate this group is reg_offset valid.  */
+  bool reg_offset_p;
   /* Uses in the group.  */
   vec<struct iv_use *> vuses;
 };
@@ -473,6 +475,7 @@ struct iv_cand
   struct iv *orig_iv;	/* The original iv if this cand is added from biv with
 			   smaller type.  */
   bool doloop_p;	/* Whether this is a doloop candidate.  */
+  bool reg_offset_p;    /* Derived from one reg_offset valid group.  */
 };
 
 /* Hashtable entry for common candidate derived from iv uses.  */
@@ -653,6 +656,10 @@ struct ivopts_data
 
   /* Whether the loop has doloop comparison use.  */
   bool doloop_use_p;
+
+  /* Whether to consider register offset addressing mode for the loop, given
+     the upcoming unrolling by the estimated unroll factor.  */
+  bool consider_reg_offset_for_unroll_p;
 };
 
 /* An assignment of iv candidates to uses.  */
@@ -840,6 +847,11 @@ dump_groups (FILE *file, struct ivopts_data *data)
 	  gcc_assert (group->type == USE_COMPARE);
 	  fprintf (file, "  Type:\tCOMPARE\n");
 	}
+      if (group->reg_offset_p)
+	{
+	  gcc_assert (address_p (group->type));
+	  fprintf (file, "  reg_offset_p: true\n");
+	}
       for (j = 0; j < group->vuses.length (); j++)
 	dump_use (file, group->vuses[j]);
     }
@@ -1582,6 +1594,7 @@ record_group (struct ivopts_data *data, enum use_type type)
   group->related_cands = BITMAP_ALLOC (NULL);
   group->vuses.create (1);
   group->doloop_p = false;
+  group->reg_offset_p = false;
 
   data->vgroups.safe_push (group);
   return group;
@@ -2731,6 +2744,60 @@ split_address_groups (struct ivopts_data *data)
     }
 }
 
+/* Go through all address type groups, and mark those for which reg_offset
+   addressing mode stays valid across the estimated unrolling.  */
+
+static void
+mark_reg_offset_groups (struct ivopts_data *data)
+{
+  class loop *loop = data->current_loop;
+  gcc_assert (data->current_loop->estimated_unroll > 1);
+  bool any_reg_offset_p = false;
+
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+    {
+      struct iv_group *group = data->vgroups[i];
+      if (address_p (group->type))
+	{
+	  struct iv_use *head_use = group->vuses[0];
+	  if (!tree_fits_poly_int64_p (head_use->iv->step))
+	    continue;
+
+	  bool found = true;
+	  poly_int64 step = tree_to_poly_int64 (head_use->iv->step);
+	  /* Maximum extra offset needed for the head of the group.  */
+	  poly_int64 max_increase = (loop->estimated_unroll - 1) * step;
+	  /* Check whether this increment is still valid.  */
+	  if (!addr_offset_valid_p (head_use, max_increase))
+	    found = false;
+
+	  unsigned group_size = group->vuses.length ();
+	  /* Check the whole group further.  */
+	  if (group_size > 1)
+	    {
+	      /* Only need to check the last one in the group; if both the head
+		 and the last are valid, the others should be fine.  */
+	      struct iv_use *last_use = group->vuses[group_size - 1];
+	      poly_int64 max_delta
+		= last_use->addr_offset - head_use->addr_offset;
+	      poly_int64 max_offset = max_delta + max_increase;
+	      if (maybe_ne (max_delta, 0)
+		  && !addr_offset_valid_p (head_use, max_offset))
+		found = false;
+	    }
+
+	  if (found)
+	    {
+	      group->reg_offset_p = true;
+	      any_reg_offset_p = true;
+	    }
+	}
+    }
+
+  if (!any_reg_offset_p)
+    data->consider_reg_offset_for_unroll_p = false;
+}
+
 /* Finds uses of the induction variables that are interesting.  */
 
 static void
@@ -2762,6 +2829,9 @@ find_interesting_uses (struct ivopts_data *data)
 
   split_address_groups (data);
 
+  if (data->consider_reg_offset_for_unroll_p)
+    mark_reg_offset_groups (data);
+
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       fprintf (dump_file, "\n<IV Groups>:\n");
@@ -3147,6 +3217,7 @@ add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
       cand->important = important;
       cand->incremented_at = incremented_at;
       cand->doloop_p = doloop;
+      cand->reg_offset_p = false;
       data->vcands.safe_push (cand);
 
       if (!poly_int_tree_p (step))
@@ -3183,7 +3254,11 @@ add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
 
   /* Relate candidate to the group for which it is added.  */
   if (use)
-    bitmap_set_bit (data->vgroups[use->group_id]->related_cands, i);
+    {
+      bitmap_set_bit (data->vgroups[use->group_id]->related_cands, i);
+      if (data->vgroups[use->group_id]->reg_offset_p)
+	cand->reg_offset_p = true;
+    }
 
   return cand;
 }
@@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
       return;
     }
 
+  /* Since we charge extra step cost to non reg_offset IV cands, scale up the
+     IV group costs accordingly so the trade-off stays fair.  Simply assume a
+     USE_COMPARE group is the loop exit compare and leave it unscaled; FIXME if
+     multiple exits are supported or no loop exit comparison matters.  */
+  if (data->consider_reg_offset_for_unroll_p
+      && group->vuses[0]->type != USE_COMPARE)
+    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
+
   if (data->consider_all_candidates)
     {
       group->cost_map[cand->id].cand = cand;
@@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
 
+  /* Consider additional step updates during unrolling.  */
+  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
+    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
+
   /* Prefer the original ivs unless we may gain something by replacing it.
      The reason is to make debugging simpler; so this is not relevant for
      artificial ivs created by other optimization passes.  */
@@ -7976,6 +8063,7 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop,
   data->current_loop = loop;
   data->loop_loc = find_loop_location (loop).get_location_t ();
   data->speed = optimize_loop_for_speed_p (loop);
+  data->consider_reg_offset_for_unroll_p = false;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
@@ -8008,6 +8096,16 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop,
   if (!find_induction_variables (data))
     goto finish;
 
+  if (param_iv_consider_reg_offset_for_unroll != 0 && exit)
+    {
+      tree_niter_desc *desc = niter_for_exit (data, exit);
+      estimate_unroll_factor (loop, desc);
+      data->consider_reg_offset_for_unroll_p = loop->estimated_unroll > 1;
+      if (dump_file && (dump_flags & TDF_DETAILS)
+	  && data->consider_reg_offset_for_unroll_p)
+	fprintf (dump_file, "estimated_unroll:%u\n", loop->estimated_unroll);
+    }
+
   /* Finds interesting uses (item 1).  */
   find_interesting_uses (data);
   if (data->vgroups.length () > MAX_CONSIDERED_GROUPS)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p
  2020-03-03 12:26           ` Kewen.Lin
@ 2020-05-13  5:50             ` Kewen.Lin
  2020-05-28  2:17               ` Ping^1 [PATCH 2/4 V3] " Kewen.Lin
  0 siblings, 1 reply; 45+ messages in thread
From: Kewen.Lin @ 2020-05-13  5:50 UTC (permalink / raw)
  To: GCC Patches
  Cc: Segher Boessenkool, Bill Schmidt, bin.cheng, Richard Guenther,
	richard.sandiford

Hi,

I'd like to ping this patch as well as its siblings.  Thanks in advance.

1/4 v3 https://gcc.gnu.org/pipermail/gcc-patches/2020-February/540171.html
2/4 v3 https://gcc.gnu.org/pipermail/gcc-patches/2020-March/541387.html
3/4 v3 https://gcc.gnu.org/pipermail/gcc-patches/2020-May/545643.html

BR,
Kewen

on 2020/3/3 8:25 PM, Kewen.Lin wrote:
> Hi Richard,
> 
> Thanks for your comments!  It's a good idea to use a param, for the
> flexibility.  And yes, it sounds good to have more targets try it and
> make it better.  But I have a slight concern about turning it on by
> default.  Since it relies on unroll factor estimation, as part 1/4
> shows, it calls targetm.loop_unroll_adjust if the target supports it,
> and that hook used to work at the RTL level.  To avoid a possible ICE,
> I intend to turn it off for the targets (s390 & i386) that define that
> hook, since without a good understanding of those targets it's hard
> for me to extend them with gimple level support.  Does that make sense?
> 
> The updated patch has been attached.
> 
> BR,
> Kewen
> ---------
> 
> gcc/ChangeLog
> 
> 2020-03-03  Kewen Lin  <linkw@gcc.gnu.org>
> 
> 	* doc/invoke.texi (iv-consider-reg-offset-for-unroll): Document new option.
> 	* params.opt (iv-consider-reg-offset-for-unroll): New.
> 	* config/s390/s390.c (s390_option_override_internal): Disable parameter
> 	iv-consider-reg-offset-for-unroll by default.
> 	* config/i386/i386-options.c (ix86_option_override_internal): Likewise.
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Ping^1 [PATCH 2/4 V3] Add target hook stride_dform_valid_p
  2020-05-13  5:50             ` Kewen.Lin
@ 2020-05-28  2:17               ` Kewen.Lin
  2020-05-28 10:54                 ` Richard Sandiford
  0 siblings, 1 reply; 45+ messages in thread
From: Kewen.Lin @ 2020-05-28  2:17 UTC (permalink / raw)
  To: GCC Patches
  Cc: bin.cheng, Richard Guenther, Bill Schmidt, Segher Boessenkool,
	Richard Sandiford

Hi,

Gentle ping for the patches below:

1/4 v3 https://gcc.gnu.org/pipermail/gcc-patches/2020-February/540171.html
2/4 v3 https://gcc.gnu.org/pipermail/gcc-patches/2020-March/541387.html
3/4 v3 https://gcc.gnu.org/pipermail/gcc-patches/2020-May/545643.html

Or shall I ping them separately?

Thanks!
Kewen

on 2020/5/13 1:50 PM, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> I'd like to ping this patch as well as its siblings.  Thanks in advance.
> 
> 1/4 v3 https://gcc.gnu.org/pipermail/gcc-patches/2020-February/540171.html
> 2/4 v3 https://gcc.gnu.org/pipermail/gcc-patches/2020-March/541387.html
> 3/4 v3 https://gcc.gnu.org/pipermail/gcc-patches/2020-May/545643.html
> 
> BR,
> Kewen
> 
on 2020/3/3 8:25 PM, Kewen.Lin wrote:
>> Hi Richard,
>>
>> Thanks for your comments!  It's a good idea to use a param, for the
>> flexibility.  And yes, it sounds good to have more targets try it and
>> make it better.  But I have a slight concern about turning it on by
>> default.  Since it relies on unroll factor estimation, as part 1/4
>> shows, it calls targetm.loop_unroll_adjust if the target supports it,
>> and that hook used to work at the RTL level.  To avoid a possible ICE,
>> I intend to turn it off for the targets (s390 & i386) that define that
>> hook, since without a good understanding of those targets it's hard
>> for me to extend them with gimple level support.  Does that make sense?
>>
>> The updated patch has been attached.
>>
>> BR,
>> Kewen
>> ---------
>>
>> gcc/ChangeLog
>>
>> 2020-03-03  Kewen Lin  <linkw@gcc.gnu.org>
>>
>> 	* doc/invoke.texi (iv-consider-reg-offset-for-unroll): Document new option.
>> 	* params.opt (iv-consider-reg-offset-for-unroll): New.
>> 	* config/s390/s390.c (s390_option_override_internal): Disable parameter
>> 	iv-consider-reg-offset-for-unroll by default.
>> 	* config/i386/i386-options.c (ix86_option_override_internal): Likewise.
>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Ping^1 [PATCH 2/4 V3] Add target hook stride_dform_valid_p
  2020-05-28  2:17               ` Ping^1 [PATCH 2/4 V3] " Kewen.Lin
@ 2020-05-28 10:54                 ` Richard Sandiford
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Sandiford @ 2020-05-28 10:54 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi,
>
> Gentle ping for the patches below:
>
> 1/4 v3 https://gcc.gnu.org/pipermail/gcc-patches/2020-February/540171.html
> 2/4 v3 https://gcc.gnu.org/pipermail/gcc-patches/2020-March/541387.html
> 3/4 v3 https://gcc.gnu.org/pipermail/gcc-patches/2020-May/545643.html
>
> Or shall I ping them separately?

Can you repost the full series?

Thanks,
Richard

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2020-05-28 10:54 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-16  9:41 [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling Kewen.Lin
2020-01-16  9:43 ` [PATCH 1/4 GCC11] Add middle-end unroll factor estimation Kewen.Lin
2020-01-20 13:12   ` Segher Boessenkool
2020-02-10  6:20     ` [PATCH 1/4 v2 " Kewen.Lin
2020-02-10 23:34       ` Segher Boessenkool
2020-02-11  6:51         ` [PATCH 1/4 v3 " Kewen.Lin
2020-02-11  7:00           ` Segher Boessenkool
2020-02-11  2:15       ` [PATCH 1/4 v2 " Jiufu Guo
2020-02-11  3:04         ` Kewen.Lin
2020-01-16 10:02 ` [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p Kewen.Lin
2020-01-20 10:53   ` Richard Sandiford
2020-01-20 11:47     ` Richard Biener
2020-01-20 13:20     ` Segher Boessenkool
2020-02-25  9:46       ` Kewen.Lin
2020-03-02 11:09         ` Richard Sandiford
2020-03-03 12:26           ` Kewen.Lin
2020-05-13  5:50             ` Kewen.Lin
2020-05-28  2:17               ` Ping^1 [PATCH 2/4 V3] " Kewen.Lin
2020-05-28 10:54                 ` Richard Sandiford
2020-01-16 10:06 ` [PATCH 3/4 GCC11] IVOPTs Consider cost_step on different forms during unrolling Kewen.Lin
2020-02-25  9:48   ` [PATCH 3/4 V2 " Kewen.Lin
2020-05-13  5:42     ` [PATCH 3/4 V3 " Kewen.Lin
2020-01-16 10:12 ` [PATCH 4/4 GCC11] rs6000: P9 D-form test cases Kewen.Lin
2020-01-20 13:37   ` Segher Boessenkool
2020-02-10  6:25     ` [PATCH 4/4 v2 " Kewen.Lin
2020-02-10 23:51       ` Segher Boessenkool
2020-01-20 13:03 ` [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling Segher Boessenkool
2020-02-10  6:17   ` Kewen.Lin
2020-02-10 21:29     ` Segher Boessenkool
2020-02-11  2:56       ` Kewen.Lin
2020-02-11  7:34       ` Richard Biener
2020-02-11  7:49         ` Segher Boessenkool
2020-02-11  8:01           ` Richard Biener
2020-02-11 12:46             ` Roman Zhuykov
2020-02-11 13:58               ` Richard Biener
2020-02-11 18:00                 ` Segher Boessenkool
2020-02-12  8:07                   ` Richard Biener
2020-02-12 21:53                     ` Segher Boessenkool
2020-02-11 18:12               ` Segher Boessenkool
2020-02-12  8:13                 ` Richard Biener
2020-02-12 10:02                   ` Segher Boessenkool
2020-02-12 10:53                     ` Richard Biener
2020-02-12 22:05                       ` Segher Boessenkool
2020-02-13  7:48                         ` Richard Biener
2020-02-13  9:02                           ` Segher Boessenkool

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).