public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling
@ 2020-05-28 12:17 Kewen.Lin
  2020-05-28 12:19 ` [PATCH 1/4] unroll: Add middle-end unroll factor estimation Kewen.Lin
                   ` (3 more replies)
  0 siblings, 4 replies; 64+ messages in thread
From: Kewen.Lin @ 2020-05-28 12:17 UTC (permalink / raw)
  To: GCC Patches
  Cc: Segher Boessenkool, Bill Schmidt, bin.cheng, Richard Guenther,
	Richard Sandiford

Hi,

This is a repost; you can refer to the original series
via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html.

As we discussed in the thread
https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
(original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html),
I'm working to teach IVOPTs to consider D-form group accesses during unrolling.
The difference between D-form and other forms during unrolling is that we can
put the stride into the displacement field to avoid additional step increments,
e.g.:

With X-form (uf step increments):
  ...
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  ...

With D-form (one step increment for each base):
  ...
  LD A = baseA, OFF
  LD B = baseB, OFF
  ST C = baseC, OFF
  LD A = baseA, OFF+stride
  LD B = baseB, OFF+stride
  ST C = baseC, OFF+stride
  LD A = baseA, OFF+2*stride
  LD B = baseB, OFF+2*stride
  ST C = baseC, OFF+2*stride
  ...
  baseA += stride * uf
  baseB += stride * uf
  baseC += stride * uf

Imagine the loop gets unrolled 8 times: then there are 3 step updates with
D-form vs. 8 step updates with X-form.  Here we only need to check that the
stride meets the D-form displacement field requirement, since if OFF doesn't,
we can construct baseA' as baseA + OFF.
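
As a concrete sketch, the memory access pattern above would come from a
simple loop like the following (function and variable names are made up
just for illustration; after vectorization and unrolling, each copy of the
body issues the LD A / LD B / ST C triple shown above):

  /* Illustrative only: three streams walked with the same stride.  */
  void
  foo (double *baseA, double *baseB, double *baseC, long n)
  {
    for (long i = 0; i < n; i++)
      baseC[i] = baseA[i] + baseB[i];
  }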

This patch set consists of four parts:
     
  [PATCH 1/4] unroll: Add middle-end unroll factor estimation

     Add unroll factor estimation in the middle-end.  It mainly follows the
     current RTL unroll factor determination in function decide_unrolling and
     its sub-calls.  As Richi suggested, we could probably force the unroll
     factor with this and avoid duplicated unroll factor calculation, but I
     think that needs more benchmarking work and should be handled separately.

  [PATCH 2/4] param: Introduce one param to control ivopts reg-offset consideration

     Following Richard's and Segher's suggestion, I used addr_offset_valid_p
     for the addressing mode check rather than a target hook.  As Richard
     suggested, it introduces one parameter to control this IVOPTs
     consideration and the further tweaking [3/4] on top of the unroll factor
     estimation [1/4].
     
  [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling

     Teach IVOPTs to mark an IV cand as reg_offset_p when it is derived from
     an address IV type group where the whole group is valid to use reg_offset
     mode.  Then scale up the IV cand step cost by (uf - 1) for non-reg_offset_p
     IV cands, where uf is the estimated unroll factor from [1/4] (see the flow
     sketch right after this list).
     
  [PATCH 4/4] rs6000: P9 D-form test cases

     Add some test cases, mainly copied from Kelvin's patch.  This is approved
     by Segher if the whole series is fine.
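
As a rough flow sketch of how the pieces connect (condensed from the
tree_ssa_iv_optimize_loop hunk in [3/4], so treat it as illustrative rather
than the exact final code):

  /* In tree_ssa_iv_optimize_loop, guarded by the new param from [2/4].  */
  if (param_iv_consider_reg_offset_for_unroll != 0 && exit)
    {
      tree_niter_desc *desc = niter_for_exit (data, exit);
      estimate_unroll_factor (loop, desc);             /* from [1/4] */
      data->consider_reg_offset_for_unroll_p = loop->estimated_unroll > 1;
    }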


Many thanks to Richard and Segher for their reviews of the previous version.

Bootstrapped and regression tested on powerpc64le-linux-gnu.

Any comments are highly appreciated!  Thanks in advance!


BR,
Kewen

-------

 gcc/cfgloop.h                  |   3 ++
 gcc/config/i386/i386-options.c |   6 +++
 gcc/config/s390/s390.c         |   6 +++
 gcc/doc/invoke.texi            |   9 +++++
 gcc/params.opt                 |   4 ++
 gcc/tree-ssa-loop-ivopts.c     | 100 ++++++++++++++++++++++++++++++++++++++++++++++-
 gcc/tree-ssa-loop-manip.c      | 253 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/tree-ssa-loop-manip.h      |   3 +-
 gcc/tree-ssa-loop.c            |  33 ++++++++++++++++
 gcc/tree-ssa-loop.h            |   2 +
 10 files changed, 416 insertions(+), 3 deletions(-)



* [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2020-05-28 12:17 [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling Kewen.Lin
@ 2020-05-28 12:19 ` Kewen.Lin
  2020-08-31  5:49   ` PING " Kewen.Lin
  2021-01-21 21:45   ` Segher Boessenkool
  2020-05-28 12:23 ` [PATCH 2/4] param: Introduce one param to control ivopts reg-offset consideration Kewen.Lin
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 64+ messages in thread
From: Kewen.Lin @ 2020-05-28 12:19 UTC (permalink / raw)
  To: GCC Patches
  Cc: Richard Sandiford, Richard Guenther, Bill Schmidt, Segher Boessenkool

[-- Attachment #1: Type: text/plain, Size: 501 bytes --]


gcc/ChangeLog

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* cfgloop.h (struct loop): New field estimated_unroll.
	* tree-ssa-loop-manip.c (decide_unroll_const_iter): New function.
	(decide_unroll_runtime_iter): Likewise.
	(decide_unroll_stupid): Likewise.
	(estimate_unroll_factor): Likewise.
	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declaration.
	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
	* tree-ssa-loop.h (tree_average_num_loop_insns): New declaration.
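
For reference, a minimal usage sketch of the new interface (the caller here
is hypothetical; the real consumer is IVOPTs in patch 3/4):

  /* Passing NULL lets estimate_unroll_factor compute the niter
     information itself via single_dom_exit.  */
  estimate_unroll_factor (loop, NULL);
  if (loop->estimated_unroll > 1)
    {
      /* Scale cost-model decisions by loop->estimated_unroll here.  */
    }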

----

[-- Attachment #2: 0001_unroll_v3.patch --]
[-- Type: text/plain, Size: 11934 bytes --]

---
 gcc/cfgloop.h             |   3 +
 gcc/tree-ssa-loop-manip.c | 253 ++++++++++++++++++++++++++++++++++++++++++++++
 gcc/tree-ssa-loop-manip.h |   3 +-
 gcc/tree-ssa-loop.c       |  33 ++++++
 gcc/tree-ssa-loop.h       |   2 +
 5 files changed, 292 insertions(+), 2 deletions(-)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 11378ca..c5bcca7 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -232,6 +232,9 @@ public:
      Other values means unroll with the given unrolling factor.  */
   unsigned short unroll;
 
+  /* Like unroll field above, but it's estimated in middle-end.  */
+  unsigned short estimated_unroll;
+
   /* If this loop was inlined the main clique of the callee which does
      not need remapping when copying the loop body.  */
   unsigned short owned_clique;
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 120b35b..8a5a1a9 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "system.h"
 #include "coretypes.h"
 #include "backend.h"
+#include "target.h"
 #include "tree.h"
 #include "gimple.h"
 #include "cfghooks.h"
@@ -42,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "tree-scalar-evolution.h"
 #include "tree-inline.h"
+#include "wide-int.h"
 
 /* All bitmaps for rewriting into loop-closed SSA go on this obstack,
    so that we can free them all at once.  */
@@ -1592,3 +1594,254 @@ canonicalize_loop_ivs (class loop *loop, tree *nit, bool bump_in_latch)
 
   return var_before;
 }
+
+/* Try to determine estimated unroll factor for given LOOP with constant number
+   of iterations, mainly refer to decide_unroll_constant_iterations.
+    - NITER_DESC holds number of iteration description if it isn't NULL.
+    - NUNROLL holds a unroll factor value computed with instruction numbers.
+    - ITER holds estimated or likely max loop iterations.
+   Return true if it succeeds, also update estimated_unroll.  */
+
+static bool
+decide_unroll_const_iter (class loop *loop, const tree_niter_desc *niter_desc,
+		      unsigned nunroll, const widest_int *iter)
+{
+  /* Skip big loops.  */
+  if (nunroll <= 1)
+    return false;
+
+  gcc_assert (niter_desc && niter_desc->assumptions);
+
+  /* Check the number of iterations is constant; return false if not.  */
+  if ((niter_desc->may_be_zero && !integer_zerop (niter_desc->may_be_zero))
+      || !tree_fits_uhwi_p (niter_desc->niter))
+    return false;
+
+  unsigned HOST_WIDE_INT const_niter = tree_to_uhwi (niter_desc->niter);
+
+  /* If unroll factor is set explicitly, use it as estimated_unroll.  */
+  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
+    {
+      /* It should have been peeled instead.  */
+      if (const_niter == 0 || (unsigned) loop->unroll > const_niter - 1)
+	loop->estimated_unroll = 1;
+      else
+	loop->estimated_unroll = loop->unroll;
+      return true;
+    }
+
+  /* Check whether the loop rolls enough to consider.  */
+  if (const_niter < 2 * nunroll || wi::ltu_p (*iter, 2 * nunroll))
+    return false;
+
+  /* Success; now compute number of iterations to unroll.  */
+  unsigned best_unroll = 0, n_copies = 0;
+  unsigned best_copies = 2 * nunroll + 10;
+  unsigned i = 2 * nunroll + 2;
+
+  if (i > const_niter - 2)
+    i = const_niter - 2;
+
+  for (; i >= nunroll - 1; i--)
+    {
+      unsigned exit_mod = const_niter % (i + 1);
+
+      if (!empty_block_p (loop->latch))
+	n_copies = exit_mod + i + 1;
+      else if (exit_mod != i)
+	n_copies = exit_mod + i + 2;
+      else
+	n_copies = i + 1;
+
+      if (n_copies < best_copies)
+	{
+	  best_copies = n_copies;
+	  best_unroll = i;
+	}
+    }
+
+  loop->estimated_unroll = best_unroll + 1;
+  return true;
+}
+
+/* Try to determine estimated unroll factor for given LOOP with countable but
+   non-constant number of iterations, mainly refer to
+   decide_unroll_runtime_iterations.
+    - NITER_DESC holds number of iteration description if it isn't NULL.
+    - NUNROLL_IN holds a unroll factor value computed with instruction numbers.
+    - ITER holds estimated or likely max loop iterations.
+   Return true if it succeeds, also update estimated_unroll.  */
+
+static bool
+decide_unroll_runtime_iter (class loop *loop, const tree_niter_desc *niter_desc,
+			unsigned nunroll_in, const widest_int *iter)
+{
+  unsigned nunroll = nunroll_in;
+  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
+    nunroll = loop->unroll;
+
+  /* Skip big loops.  */
+  if (nunroll <= 1)
+    return false;
+
+  gcc_assert (niter_desc && niter_desc->assumptions);
+
+  /* Skip constant number of iterations.  */
+  if ((!niter_desc->may_be_zero || !integer_zerop (niter_desc->may_be_zero))
+      && tree_fits_uhwi_p (niter_desc->niter))
+    return false;
+
+  /* Check whether the loop rolls.  */
+  if (wi::ltu_p (*iter, 2 * nunroll))
+    return false;
+
+  /* Success; now force nunroll to be power of 2.  */
+  unsigned i;
+  for (i = 1; 2 * i <= nunroll; i *= 2)
+    continue;
+
+  loop->estimated_unroll = i;
+  return true;
+}
+
+/* Try to determine estimated unroll factor for given LOOP with uncountable
+   number of iterations, mainly refer to decide_unroll_stupid.
+    - NITER_DESC holds number of iteration description if it isn't NULL.
+    - NUNROLL_IN holds a unroll factor value computed with instruction numbers.
+    - ITER holds estimated or likely max loop iterations.
+   Return true if it succeeds, also update estimated_unroll.  */
+
+static bool
+decide_unroll_stupid (class loop *loop, const tree_niter_desc *niter_desc,
+		  unsigned nunroll_in, const widest_int *iter)
+{
+  if (!flag_unroll_all_loops && !loop->unroll)
+    return false;
+
+  unsigned nunroll = nunroll_in;
+  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
+    nunroll = loop->unroll;
+
+  /* Skip big loops.  */
+  if (nunroll <= 1)
+    return false;
+
+  gcc_assert (!niter_desc || !niter_desc->assumptions);
+
+  /* Skip loop with multiple branches for now.  */
+  if (num_loop_branches (loop) > 1)
+    return false;
+
+  /* Check whether the loop rolls.  */
+  if (wi::ltu_p (*iter, 2 * nunroll))
+    return false;
+
+  /* Success; now force nunroll to be power of 2.  */
+  unsigned i;
+  for (i = 1; 2 * i <= nunroll; i *= 2)
+    continue;
+
+  loop->estimated_unroll = i;
+  return true;
+}
+
+/* Try to estimate whether this given LOOP can be unrolled or not, and compute
+   its estimated unroll factor if it can.  To avoid duplicated computation, you
+   can pass number of iterations information by DESC.  The heuristics mainly
+   refer to decide_unrolling in loop-unroll.c.  */
+
+void
+estimate_unroll_factor (class loop *loop, tree_niter_desc *desc)
+{
+  /* Return the existing estimated unroll factor.  */
+  if (loop->estimated_unroll)
+    return;
+
+  /* Don't unroll explicitly.  */
+  if (loop->unroll == 1)
+    {
+      loop->estimated_unroll = loop->unroll;
+      return;
+    }
+
+  /* Like decide_unrolling, don't unroll if:
+     1) the loop is cold.
+     2) the loop can't be manipulated.
+     3) the loop isn't innermost.  */
+  if (optimize_loop_for_size_p (loop) || !can_duplicate_loop_p (loop)
+      || loop->inner != NULL)
+    {
+      loop->estimated_unroll = 1;
+      return;
+    }
+
+  /* Don't unroll without explicit information.  */
+  if (!loop->unroll && !flag_unroll_loops && !flag_unroll_all_loops)
+    {
+      loop->estimated_unroll = 1;
+      return;
+    }
+
+  /* Check for instruction number and average instruction number.  */
+  loop->ninsns = tree_num_loop_insns (loop, &eni_size_weights);
+  loop->av_ninsns = tree_average_num_loop_insns (loop, &eni_size_weights);
+  unsigned nunroll = param_max_unrolled_insns / loop->ninsns;
+  unsigned nunroll_by_av = param_max_average_unrolled_insns / loop->av_ninsns;
+
+  if (nunroll > nunroll_by_av)
+    nunroll = nunroll_by_av;
+  if (nunroll > (unsigned) param_max_unroll_times)
+    nunroll = param_max_unroll_times;
+
+  if (targetm.loop_unroll_adjust)
+    nunroll = targetm.loop_unroll_adjust (nunroll, loop);
+
+  tree_niter_desc *niter_desc = NULL;
+  bool desc_need_delete = false;
+
+  /* Compute the number of iterations if needed.  */
+  if (!desc)
+    {
+      /* For now, use single_dom_exit for simplicity. TODO: Support multiple
+	 exits like find_simple_exit if we find some profitable cases.  */
+      niter_desc = XNEW (class tree_niter_desc);
+      gcc_assert (niter_desc);
+      edge exit = single_dom_exit (loop);
+      if (!exit || !number_of_iterations_exit (loop, exit, niter_desc, true))
+	{
+	  XDELETE (niter_desc);
+	  niter_desc = NULL;
+	}
+      else
+	desc_need_delete = true;
+    }
+  else
+    niter_desc = desc;
+
+  /* For checking the loop rolls enough to consider, also consult loop bounds
+     and profile.  */
+  widest_int iterations;
+  if (!get_estimated_loop_iterations (loop, &iterations)
+      && !get_likely_max_loop_iterations (loop, &iterations))
+    iterations = 0;
+
+  if (niter_desc && niter_desc->assumptions)
+    {
+      /* For countable loops.  */
+      if (!decide_unroll_const_iter (loop, niter_desc, nunroll, &iterations)
+	  && !decide_unroll_runtime_iter (loop, niter_desc, nunroll, &iterations))
+	loop->estimated_unroll = 1;
+    }
+  else
+    {
+      if (!decide_unroll_stupid (loop, niter_desc, nunroll, &iterations))
+	loop->estimated_unroll = 1;
+    }
+
+  if (desc_need_delete)
+    {
+      XDELETE (niter_desc);
+      niter_desc = NULL;
+    }
+}
+
diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
index e789e4f..773a2b3 100644
--- a/gcc/tree-ssa-loop-manip.h
+++ b/gcc/tree-ssa-loop-manip.h
@@ -55,7 +55,6 @@ extern void tree_transform_and_unroll_loop (class loop *, unsigned,
 extern void tree_unroll_loop (class loop *, unsigned,
 			      edge, class tree_niter_desc *);
 extern tree canonicalize_loop_ivs (class loop *, tree *, bool);
-
-
+extern void estimate_unroll_factor (class loop *, tree_niter_desc *);
 
 #endif /* GCC_TREE_SSA_LOOP_MANIP_H */
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index 5e8365d..25320fb 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "diagnostic-core.h"
 #include "stringpool.h"
 #include "attribs.h"
+#include "sreal.h"
 
 
 /* A pass making sure loops are fixed up.  */
@@ -790,5 +791,37 @@ tree_num_loop_insns (class loop *loop, eni_weights *weights)
   return size;
 }
 
+/* Computes an estimated number of insns on average per iteration in LOOP,
+   weighted by WEIGHTS.  Refer to function average_num_loop_insns.  */
 
+unsigned
+tree_average_num_loop_insns (class loop *loop, eni_weights *weights)
+{
+  basic_block *body = get_loop_body (loop);
+  gimple_stmt_iterator gsi;
+  unsigned bb_size, i;
+  sreal nsize = 0;
+
+  for (i = 0; i < loop->num_nodes; i++)
+    {
+      bb_size = 0;
+      for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi))
+	bb_size += estimate_num_insns (gsi_stmt (gsi), weights);
+      nsize += (sreal) bb_size
+	       * body[i]->count.to_sreal_scale (loop->header->count);
+      /* Avoid overflows.   */
+      if (nsize > 1000000)
+	{
+	  free (body);
+	  return 1000000;
+	}
+    }
+  free (body);
+
+  unsigned ret = nsize.to_int ();
+  if (!ret)
+    ret = 1; /* To avoid division by zero.  */
+
+  return ret;
+}
 
diff --git a/gcc/tree-ssa-loop.h b/gcc/tree-ssa-loop.h
index 9e35125..af36177 100644
--- a/gcc/tree-ssa-loop.h
+++ b/gcc/tree-ssa-loop.h
@@ -67,6 +67,8 @@ public:
 extern bool for_each_index (tree *, bool (*) (tree, tree *, void *), void *);
 extern char *get_lsm_tmp_name (tree ref, unsigned n, const char *suffix = NULL);
 extern unsigned tree_num_loop_insns (class loop *, struct eni_weights *);
+extern unsigned tree_average_num_loop_insns (class loop *,
+					     struct eni_weights *);
 
 /* Returns the loop of the statement STMT.  */
 
-- 
2.7.4



* [PATCH 2/4] param: Introduce one param to control ivopts reg-offset consideration
  2020-05-28 12:17 [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling Kewen.Lin
  2020-05-28 12:19 ` [PATCH 1/4] unroll: Add middle-end unroll factor estimation Kewen.Lin
@ 2020-05-28 12:23 ` Kewen.Lin
  2020-05-28 12:24 ` [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling Kewen.Lin
  2020-06-02 11:38 ` [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling Richard Biener
  3 siblings, 0 replies; 64+ messages in thread
From: Kewen.Lin @ 2020-05-28 12:23 UTC (permalink / raw)
  To: GCC Patches
  Cc: Richard Sandiford, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, ubizjak, krebbel

[-- Attachment #1: Type: text/plain, Size: 392 bytes --]


gcc/ChangeLog

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* doc/invoke.texi (iv-consider-reg-offset-for-unroll): Document new option.
	* params.opt (iv-consider-reg-offset-for-unroll): New.
	* config/s390/s390.c (s390_option_override_internal): Disable parameter
	iv-consider-reg-offset-for-unroll by default.
	* config/i386/i386-options.c (ix86_option_override_internal): Likewise.

----

[-- Attachment #2: 0002_param_v3.diff --]
[-- Type: text/plain, Size: 3232 bytes --]

diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index e0be493..41c99b3 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -2902,6 +2902,12 @@ ix86_option_override_internal (bool main_args_p,
   if (ix86_indirect_branch != indirect_branch_keep)
     SET_OPTION_IF_UNSET (opts, opts_set, flag_jump_tables, 0);
 
+  /* Disable this for now till loop_unroll_adjust supports gimple level checks,
+     to avoid possible ICE.  */
+  if (opts->x_optimize >= 1)
+    SET_OPTION_IF_UNSET (opts, opts_set,
+			 param_iv_consider_reg_offset_for_unroll, 0);
+
   return true;
 }
 
diff --git a/gcc/config/s390/s390.c b/gcc/config/s390/s390.c
index ebba670..ae4c2bd 100644
--- a/gcc/config/s390/s390.c
+++ b/gcc/config/s390/s390.c
@@ -15318,6 +15318,12 @@ s390_option_override_internal (struct gcc_options *opts,
      not the case when the code runs before the prolog. */
   if (opts->x_flag_fentry && !TARGET_64BIT)
     error ("%<-mfentry%> is supported only for 64-bit CPUs");
+
+  /* Disable this for now till loop_unroll_adjust supports gimple level checks,
+     to avoid possible ICE.  */
+  if (opts->x_optimize >= 1)
+    SET_OPTION_IF_UNSET (opts, opts_set,
+			 param_iv_consider_reg_offset_for_unroll, 0);
 }
 
 static void
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index fa98e2f..502031c 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12220,6 +12220,15 @@ If the number of candidates in the set is smaller than this value,
 always try to remove unnecessary ivs from the set
 when adding a new one.
 
+@item iv-consider-reg-offset-for-unroll
+When a loop is unrolled at the RTL level, the duplicated loop iterations
+introduce corresponding induction variable step update expressions.  But if an
+induction variable is derived from an address object, it is profitable to fold
+the required offset updates into the memory access expressions, provided the
+target supports register offset addressing and the resulting offset is in the
+valid range.  The induction variable optimizations take this into account for
+better unrolled code.  It requires the middle-end unroll factor estimation.
+
 @item avg-loop-niter
 Average number of iterations of a loop.
 
diff --git a/gcc/params.opt b/gcc/params.opt
index 8e4217d..31424cf 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -270,6 +270,10 @@ Bound on number of candidates below that all candidates are considered in iv opt
 Common Joined UInteger Var(param_iv_max_considered_uses) Init(250) Param Optimization
 Bound on number of iv uses in loop optimized in iv optimizations.
 
+-param=iv-consider-reg-offset-for-unroll=
+Common Joined UInteger Var(param_iv_consider_reg_offset_for_unroll) Init(1) Optimization IntegerRange(0, 1) Param
Whether iv optimizations should mark register offset valid groups and consider their derived iv candidates more profitable, taking the estimated unroll factor into account.
+
 -param=jump-table-max-growth-ratio-for-size=
 Common Joined UInteger Var(param_jump_table_max_growth_ratio_for_size) Init(300) Param Optimization
 The maximum code size growth ratio when expanding into a jump table (in percent).  The parameter is used when optimizing for size.
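
A usage sketch (option spelling as in the params.opt hunk above): since the
estimation from [1/4] only kicks in when unrolling is enabled, the behaviour
can be exercised or suppressed explicitly, e.g.

  gcc -O2 -funroll-loops --param=iv-consider-reg-offset-for-unroll=0 test.c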


* [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling
  2020-05-28 12:17 [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling Kewen.Lin
  2020-05-28 12:19 ` [PATCH 1/4] unroll: Add middle-end unroll factor estimation Kewen.Lin
  2020-05-28 12:23 ` [PATCH 2/4] param: Introduce one param to control ivopts reg-offset consideration Kewen.Lin
@ 2020-05-28 12:24 ` Kewen.Lin
  2020-06-01 17:59   ` Richard Sandiford
  2020-08-08  8:01   ` Bin.Cheng
  2020-06-02 11:38 ` [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling Richard Biener
  3 siblings, 2 replies; 64+ messages in thread
From: Kewen.Lin @ 2020-05-28 12:24 UTC (permalink / raw)
  To: GCC Patches
  Cc: bin.cheng, Richard Guenther, Bill Schmidt, Segher Boessenkool,
	Richard Sandiford, Bin.Cheng

[-- Attachment #1: Type: text/plain, Size: 821 bytes --]


gcc/ChangeLog

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* tree-ssa-loop-ivopts.c (struct iv_group): New field reg_offset_p.
	(struct iv_cand): New field reg_offset_p.
	(struct ivopts_data): New field consider_reg_offset_for_unroll_p.
	(dump_groups): Dump group with reg_offset_p.
	(record_group): Initialize reg_offset_p.
	(mark_reg_offset_groups): New function.
	(find_interesting_uses): Call mark_reg_offset_groups.
	(add_candidate_1): Update reg_offset_p if derived from reg_offset_p group.
	(set_group_iv_cost): Scale up group cost with estimate_unroll_factor if
	consider_reg_offset_for_unroll_p.
	(determine_iv_cost): Increase step cost with estimate_unroll_factor if
	consider_reg_offset_for_unroll_p.
	(tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
	consider_reg_offset_for_unroll_p.

----

[-- Attachment #2: ivopts_v3.diff --]
[-- Type: text/plain, Size: 7029 bytes --]

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 1d2697ae1ba..1b7e4621f37 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -432,6 +432,8 @@ struct iv_group
   struct iv_cand *selected;
   /* To indicate this is a doloop use group.  */
   bool doloop_p;
+  /* To indicate this group is reg_offset valid.  */
+  bool reg_offset_p;
   /* Uses in the group.  */
   vec<struct iv_use *> vuses;
 };
@@ -473,6 +475,7 @@ struct iv_cand
   struct iv *orig_iv;	/* The original iv if this cand is added from biv with
 			   smaller type.  */
   bool doloop_p;	/* Whether this is a doloop candidate.  */
+  bool reg_offset_p;    /* Derived from one reg_offset valid group.  */
 };
 
 /* Hashtable entry for common candidate derived from iv uses.  */
@@ -653,6 +656,10 @@ struct ivopts_data
 
   /* Whether the loop has doloop comparison use.  */
   bool doloop_use_p;
+
+  /* Whether need to consider register offset addressing mode for the loop with
+     upcoming unrolling by estimated unroll factor.  */
+  bool consider_reg_offset_for_unroll_p;
 };
 
 /* An assignment of iv candidates to uses.  */
@@ -840,6 +847,11 @@ dump_groups (FILE *file, struct ivopts_data *data)
 	  gcc_assert (group->type == USE_COMPARE);
 	  fprintf (file, "  Type:\tCOMPARE\n");
 	}
+      if (group->reg_offset_p)
+	{
+	  gcc_assert (address_p (group->type));
+	  fprintf (file, "  reg_offset_p: true\n");
+	}
       for (j = 0; j < group->vuses.length (); j++)
 	dump_use (file, group->vuses[j]);
     }
@@ -1582,6 +1594,7 @@ record_group (struct ivopts_data *data, enum use_type type)
   group->related_cands = BITMAP_ALLOC (NULL);
   group->vuses.create (1);
   group->doloop_p = false;
+  group->reg_offset_p = false;
 
   data->vgroups.safe_push (group);
   return group;
@@ -2731,6 +2744,60 @@ split_address_groups (struct ivopts_data *data)
     }
 }
 
+/* Go through all address type groups, check and mark reg_offset addressing mode
+   valid groups.  */
+
+static void
+mark_reg_offset_groups (struct ivopts_data *data)
+{
+  class loop *loop = data->current_loop;
+  gcc_assert (data->current_loop->estimated_unroll > 1);
+  bool any_reg_offset_p = false;
+
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+    {
+      struct iv_group *group = data->vgroups[i];
+      if (address_p (group->type))
+	{
+	  struct iv_use *head_use = group->vuses[0];
+	  if (!tree_fits_poly_int64_p (head_use->iv->step))
+	    continue;
+
+	  bool found = true;
+	  poly_int64 step = tree_to_poly_int64 (head_use->iv->step);
+	  /* Max extra offset to fill for head of group.  */
+	  poly_int64 max_increase = (loop->estimated_unroll - 1) * step;
+	  /* Check whether this increment is still valid.  */
+	  if (!addr_offset_valid_p (head_use, max_increase))
+	    found = false;
+
+	  unsigned group_size = group->vuses.length ();
+	  /* Check the whole group further.  */
+	  if (group_size > 1)
+	    {
+	      /* Only need to check the last one in the group; if both the head
+		 and the last are valid, the others should be fine.  */
+	      struct iv_use *last_use = group->vuses[group_size - 1];
+	      poly_int64 max_delta
+		= last_use->addr_offset - head_use->addr_offset;
+	      poly_int64 max_offset = max_delta + max_increase;
+	      if (maybe_ne (max_delta, 0)
+		  && !addr_offset_valid_p (head_use, max_offset))
+		found = false;
+	    }
+
+	  if (found)
+	    {
+	      group->reg_offset_p = true;
+	      any_reg_offset_p = true;
+	    }
+	}
+    }
+
+  if (!any_reg_offset_p)
+    data->consider_reg_offset_for_unroll_p = false;
+}
+
 /* Finds uses of the induction variables that are interesting.  */
 
 static void
@@ -2762,6 +2829,9 @@ find_interesting_uses (struct ivopts_data *data)
 
   split_address_groups (data);
 
+  if (data->consider_reg_offset_for_unroll_p)
+    mark_reg_offset_groups (data);
+
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       fprintf (dump_file, "\n<IV Groups>:\n");
@@ -3147,6 +3217,7 @@ add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
       cand->important = important;
       cand->incremented_at = incremented_at;
       cand->doloop_p = doloop;
+      cand->reg_offset_p = false;
       data->vcands.safe_push (cand);
 
       if (!poly_int_tree_p (step))
@@ -3183,7 +3254,11 @@ add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
 
   /* Relate candidate to the group for which it is added.  */
   if (use)
-    bitmap_set_bit (data->vgroups[use->group_id]->related_cands, i);
+    {
+      bitmap_set_bit (data->vgroups[use->group_id]->related_cands, i);
+      if (data->vgroups[use->group_id]->reg_offset_p)
+	cand->reg_offset_p = true;
+    }
 
   return cand;
 }
@@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
       return;
     }
 
+  /* Since we price the step cost of non reg_offset IV cands higher, we should
+     scale up the appropriate IV group costs accordingly.  Simply consider
+     USE_COMPARE at the loop exit; FIXME if multiple exits are supported or no
+     loop exit comparisons matter.  */
+  if (data->consider_reg_offset_for_unroll_p
+      && group->vuses[0]->type != USE_COMPARE)
+    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
+
   if (data->consider_all_candidates)
     {
       group->cost_map[cand->id].cand = cand;
@@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
 
+  /* Consider additional step updates during unrolling.  */
+  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
+    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
+
   /* Prefer the original ivs unless we may gain something by replacing it.
      The reason is to make debugging simpler; so this is not relevant for
      artificial ivs created by other optimization passes.  */
@@ -7976,6 +8063,7 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop,
   data->current_loop = loop;
   data->loop_loc = find_loop_location (loop).get_location_t ();
   data->speed = optimize_loop_for_speed_p (loop);
+  data->consider_reg_offset_for_unroll_p = false;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
@@ -8008,6 +8096,16 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop,
   if (!find_induction_variables (data))
     goto finish;
 
+  if (param_iv_consider_reg_offset_for_unroll != 0 && exit)
+    {
+      tree_niter_desc *desc = niter_for_exit (data, exit);
+      estimate_unroll_factor (loop, desc);
+      data->consider_reg_offset_for_unroll_p = loop->estimated_unroll > 1;
+      if (dump_file && (dump_flags & TDF_DETAILS)
+	  && data->consider_reg_offset_for_unroll_p)
+	fprintf (dump_file, "estimated_unroll:%u\n", loop->estimated_unroll);
+    }
+
   /* Finds interesting uses (item 1).  */
   find_interesting_uses (data);
   if (data->vgroups.length () > MAX_CONSIDERED_GROUPS)


* Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling
  2020-05-28 12:24 ` [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling Kewen.Lin
@ 2020-06-01 17:59   ` Richard Sandiford
  2020-06-02  3:39     ` Kewen.Lin
  2020-08-08  8:01   ` Bin.Cheng
  1 sibling, 1 reply; 64+ messages in thread
From: Richard Sandiford @ 2020-06-01 17:59 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, Bin.Cheng

Could you go into more detail about this choice of cost calculation?
It looks like we first calculate per-group flags, which are true only if
the unrolled offsets are valid for all uses in the group.  Then we create
per-candidate flags when associating candidates with groups.

Instead, couldn't we take this into account in get_address_cost,
which calculates the cost of an address use for a given candidate?
E.g. after the main if-else at the start of the function,
perhaps it would make sense to add the worst-case offset to
the address in “parts”, check whether that too is a valid address,
and if not, increase var_cost by the cost of one add instruction.

I guess there are two main sources of inexactness if we do that:

(1) It might underestimate the cost because it assumes that vuse[0]
    stands for all vuses in the group.

(2) It might overestimate the cost because it treats all unrolled
    iterations as having the cost of the final unrolled iteration.

(1) could perhaps be avoided by adding a flag to the iv_use to say
whether it wants this treatment.  I think the flag approach suffers
from (2) too, and I'd be surprised if it makes a difference in practice.

Thanks,
Richard


* Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling
  2020-06-01 17:59   ` Richard Sandiford
@ 2020-06-02  3:39     ` Kewen.Lin
  2020-06-02  7:14       ` Richard Sandiford
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-06-02  3:39 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: bin.cheng, Richard Guenther, Bill Schmidt, Segher Boessenkool, Bin.Cheng

Hi Richard,

Thanks for the comments!

on 2020/6/2 1:59 AM, Richard Sandiford wrote:
> Could you go into more detail about this choice of cost calculation?
> It looks like we first calculate per-group flags, which are true only if
> the unrolled offsets are valid for all uses in the group.  Then we create
> per-candidate flags when associating candidates with groups.
> 

Sure.  It checks every address type IV group to determine whether the
group is valid to use reg offset addressing mode.  Here we only need to
check the first one and the last one, since the intermediates should
have been handled by split_address_groups.  With unrolling, the
displacement of the address can be offset by up to (UF-1)*step, so we
check whether the address is still valid with this max offset.  If the
check finds it's valid to use reg offset mode for the whole group, we
flag the group.  Later, when we create an IV candidate for a flagged
address group, we flag the candidate as well.  This flag is mainly for
iv cand costing: we don't need to scale up the iv cand's step cost for
this kind of candidate.
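
As a worked example with made-up numbers: for a group whose uses step by
16 bytes and estimated_unroll = 8, the check in mark_reg_offset_groups
amounts to

  max_increase = (8 - 1) * 16 = 112
  addr_offset_valid_p (head_use, 112)          /* head of the group */
  addr_offset_valid_p (head_use, delta + 112)  /* covers the last use */

where delta is the offset of the last use relative to the head (the second
check is skipped when the group has a single use or delta is zero).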

Imagine this loop is being unrolled: all the statements will be
duplicated UF times.  For the cost modeling against an iv group, we
scale up the cost by UF (here I simply excluded the compare type since
in most cases it is for the loop-ending check).  For the cost modeling
against an iv candidate, the focus is on step costs: for an iv candidate
we flagged before, it's taken as a single step cost; for the others, the
step cost is scaled up since the unrolling makes the step calculation
happen UF times.

This cost modeling tries to simulate the cost change after the
unrolling, scaling up the costs accordingly.  There are things to be
improved, like distinguishing the loop-ending compare, or whether we
need to tweak the other costs somehow since the scaling up probably
unbalances the existing cost framework, but during benchmarking I didn't
find these matter, so I keep it as simple as possible for now.
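
To make the scaling concrete (numbers made up; uf = 8, add cost = 4 as on
ppc64le):

  non reg_offset_p iv cand:  step cost 4 -> 4 + (8 - 1) * 4 = 32
  reg_offset_p iv cand:      step cost stays 4
  address type iv group:     group cost scaled by 8
  compare type iv group:     left as is (loop ending check)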


> Instead, couldn't we take this into account in get_address_cost,
> which calculates the cost of an address use for a given candidate?
> E.g. after the main if-else at the start of the function,
> perhaps it would make sense to add the worst-case offset to
> the address in “parts”, check whether that too is a valid address,
> and if not, increase var_cost by the cost of one add instruction.
> 

IIUC, what you suggest is to tweak the iv group cost: if we find that an
address group is valid for reg offset mode, we price the pairs between
this group and the other non address-based iv cands higher.  The
question is how we decide this add-on cost.  For the test case I was
working on initially, adding one cost (of an add) doesn't work, the
normal iv still wins.  We can price it higher, like two, but what's the
justification for this value, heuristics?

> I guess there are two main sources of inexactness if we do that:
> 
> (1) It might underestimate the cost because it assumes that vuse[0]
>     stands for all vuses in the group.
> 

Do you mean we don't need a check function like mark_reg_offset_groups?
Without it, vuse[0] might not be enough, since we can't ensure the
others are fine with the additional displacement from unrolling.  If we
still have it, I think it's fine to just use vuse[0].

> (2) It might overestimates the cost because it treats all unrolled
>     iterations as having the cost of the final unrolled iteration.
>
> (1) could perhaps be avoided by adding a flag to the iv_use to say
> whether it wants this treatment.  I think the flag approach suffers
> from (2) too, and I'd be surprised if it makes a difference in practice.
> 

Sorry, I don't have the whole picture of how to deal with uf in your
proposal.  But the flag approach considers uf in the iv group cost
calculation as well as the iv cand step cost calculation.

BR,
Kewen

> Thanks,
> Richard
> 



* Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling
  2020-06-02  3:39     ` Kewen.Lin
@ 2020-06-02  7:14       ` Richard Sandiford
  2020-06-03  3:18         ` Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Richard Sandiford @ 2020-06-02  7:14 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, Bin.Cheng

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> Thanks for the comments!
>
> on 2020/6/2 上午1:59, Richard Sandiford wrote:
>> Could you go into more detail about this choice of cost calculation?
>> It looks like we first calculate per-group flags, which are true only if
>> the unrolled offsets are valid for all uses in the group.  Then we create
>> per-candidate flags when associating candidates with groups.
>> 
>
> Sure.  It checks every address type IV group to determine whether this
> group is valid to use reg offset addressing mode.  Here we only need to
> check the first one and the last one, since the intermediates should 
> have been handled by split_address_groups.  With unrolling the
> displacement of the address can be offset-ed by (UF-1)*step, check the
> address with this max offset whether still valid.  If the check finds
> it's valid to use reg offset mode for the whole group, we flag this
> group.  Later, when we create IV candidate for address group flagged,
> we flag the candidate further.  This flag is mainly for iv cand
> costing, we don't need to scale up iv cand's step cost for this kind
> of candidate.

But AIUI, this is calculating whether the uses in their original form
support all unrolled offsets.  For ivopts, I think the question is really
whether the uses support all unrolled offsets when based on a given IV
candidate (which might not be the original IV).

E.g. there might be another IV candidate at a constant offset
from the original one, and the offsets might all be in range
for that offset too.

> Imagining this loop is being unrolled, all the statements will be
> duplicated by UF.  For the cost modeling against iv group, it's
> scaling up the cost by UF (here I simply excluded the compare_type
> since in most cases it for loop ending check).  For the cost modeling
> against iv candidate, it's to focus on step costs, for an iv candidate
> we flagged before, it's taken as one time step cost, for the others,
> it's scaling up the step cost since the unrolling make step 
> calculation become UF times.
>
> This cost modeling is trying to simulate cost change after the
> unrolling, scaling up the costs accordingly.  There are somethings
> to be improved like distinguish the loop ending compare or else,
> whether need to tweak the other costs somehow since the scaling up
> probably cause existing cost framework imbalance, but during
> benchmarking I didn't find these matter, so take it as simple as 
> possible for now.
>
>
>> Instead, couldn't we take this into account in get_address_cost,
>> which calculates the cost of an address use for a given candidate?
>> E.g. after the main if-else at the start of the function,
>> perhaps it would make sense to add the worst-case offset to
>> the address in “parts”, check whether that too is a valid address,
>> and if not, increase var_cost by the cost of one add instruction.
>> 
>
> IIUC, what you suggest is to tweak the iv group cost, if we find
> one address group is valid for reg offset mode, we price more on
> the pairs between this group and other non address-based iv cands.
> The question is how do we decide this add-on cost.  For the test
> case I was working on initially, adding one cost (of add) doesn't
> work, the normal iv still wined.  We can price it more like two
> but what's the justification on this value, by heuristics?

Yeah, I was thinking of adding one instance of add_cost.  If that
doesn't work, it'd be interesting to know why in more detail.

>> I guess there are two main sources of inexactness if we do that:
>> 
>> (1) It might underestimate the cost because it assumes that vuse[0]
>>     stands for all vuses in the group.
>> 
>
> Do you mean we don't need one check function like mark_reg_offset_groups?
> If without it, vuse[0] might be not enough since we can't ensure the
> others are fine with additional displacement from unrolling.  If we still
> have it, I think it's fine to just use vuse[0].
>
>> (2) It might overestimates the cost because it treats all unrolled
>>     iterations as having the cost of the final unrolled iteration.
>>
>> (1) could perhaps be avoided by adding a flag to the iv_use to say
>> whether it wants this treatment.  I think the flag approach suffers
>> from (2) too, and I'd be surprised if it makes a difference in practice.
>> 
>
> Sorry, I didn't have the whole picture how to deal with uf for your proposal.
> But the flag approach considers uf in iv group cost calculation as well as
> iv cand step cost calculation.
>
> BR,
> Kewen
>
>> Thanks,
>> Richard
>> 


* Re: [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling
  2020-05-28 12:17 [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling Kewen.Lin
                   ` (2 preceding siblings ...)
  2020-05-28 12:24 ` [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling Kewen.Lin
@ 2020-06-02 11:38 ` Richard Biener
  2020-06-03  3:46   ` Kewen.Lin
  3 siblings, 1 reply; 64+ messages in thread
From: Richard Biener @ 2020-06-02 11:38 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Segher Boessenkool, Bill Schmidt, bin.cheng,
	Richard Sandiford

On Thu, 28 May 2020, Kewen.Lin wrote:

> Hi,
> 
> This is one repost and you can refer to the original series 
> via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html.
> 
> As we discussed in the thread
> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
> I'm working to teach IVOPTs to consider D-form group access during unrolling.
> The difference on D-form and other forms during unrolling is we can put the
> stride into displacement field to avoid additional step increment. eg:
> 
> With X-form (uf step increment):
>   ...
>   LD A = baseA, X
>   LD B = baseB, X
>   ST C = baseC, X
>   X = X + stride
>   LD A = baseA, X
>   LD B = baseB, X
>   ST C = baseC, X
>   X = X + stride
>   LD A = baseA, X
>   LD B = baseB, X
>   ST C = baseC, X
>   X = X + stride
>   ...
> 
> With D-form (one step increment for each base):
>   ...
>   LD A = baseA, OFF
>   LD B = baseB, OFF
>   ST C = baseC, OFF
>   LD A = baseA, OFF+stride
>   LD B = baseB, OFF+stride
>   ST C = baseC, OFF+stride
>   LD A = baseA, OFF+2*stride
>   LD B = baseB, OFF+2*stride
>   ST C = baseC, OFF+2*stride
>   ...
>   baseA += stride * uf
>   baseB += stride * uf
>   baseC += stride * uf
> 
> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
> D-form vs. 8 step updates with X-form. Here we only need to check stride
> meet D-form field requirement, since if OFF doesn't meet, we can construct
> baseA' with baseA + OFF.

I'd just mention there are other targets that have the choice between
the above forms.  Since IVOPTs itself does not perform the unrolling
the IL it produces is the same, correct?

Richard.

> This patch set consists four parts:
>      
>   [PATCH 1/4] unroll: Add middle-end unroll factor estimation
> 
>      Add unroll factor estimation in middle-end. It mainly refers to current
>      RTL unroll factor determination in function decide_unrolling and its
>      sub calls.  As Richi suggested, we probably can force unroll factor
>      with this and avoid duplicate unroll factor calculation, but I think it
>      need more benchmarking work and should be handled separately.
> 
>   [PATCH 2/4] param: Introduce one param to control unroll factor 
> 
>      As Richard and Segher's suggestion, I used addr_offset_valid_p for the
>      addressing mode, rather than one target hook.  As Richard's suggestion,     
>      it introduces one parameter to control this IVOPTs consideration and
>      further tweaking [3/4] on top of unroll factor estimation [1/4].
>      
>   [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling
> 
>      Teach IVOPTs to mark the IV cand as reg_offset_p which is derived from
>      one address IV type group where the whole group is valid to use reg_offset
>      mode.  Then scaling up the IV cand step cost by (uf - 1) for no
>      reg_offset_p IV cands, here the uf is one estimated unroll factor [1/4].
>      
>   [PATCH 4/4] rs6000: P9 D-form test cases
> 
>      Add some test cases, mainly copied from Kelvin's patch.  This is approved
>      by Segher if the whole series is fine.
> 
> 
> Many thanks to Richard and Segher on previous version reviews.
> 
> Bootstrapped and regress tested on powerpc64le-linux-gnu.
> 
> Any comments are highly appreciated!  Thanks in advance!
> 
> 
> BR,
> Kewen
> 
> -------
> 
>  gcc/cfgloop.h                  |   3 ++
>  gcc/config/i386/i386-options.c |   6 +++
>  gcc/config/s390/s390.c         |   6 +++
>  gcc/doc/invoke.texi            |   9 +++++
>  gcc/params.opt                 |   4 ++
>  gcc/tree-ssa-loop-ivopts.c     | 100 ++++++++++++++++++++++++++++++++++++++++++++++-
>  gcc/tree-ssa-loop-manip.c      | 253 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  gcc/tree-ssa-loop-manip.h      |   3 +-
>  gcc/tree-ssa-loop.c            |  33 ++++++++++++++++
>  gcc/tree-ssa-loop.h            |   2 +
>  10 files changed, 416 insertions(+), 3 deletions(-)
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)


* Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling
  2020-06-02  7:14       ` Richard Sandiford
@ 2020-06-03  3:18         ` Kewen.Lin
  0 siblings, 0 replies; 64+ messages in thread
From: Kewen.Lin @ 2020-06-03  3:18 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: bin.cheng, Richard Guenther, Bill Schmidt, Segher Boessenkool, Bin.Cheng

Hi Richard,

on 2020/6/2 3:14 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> Hi Richard,
>>
>> Thanks for the comments!
>>
>> on 2020/6/2 上午1:59, Richard Sandiford wrote:
>>> Could you go into more detail about this choice of cost calculation?
>>> It looks like we first calculate per-group flags, which are true only if
>>> the unrolled offsets are valid for all uses in the group.  Then we create
>>> per-candidate flags when associating candidates with groups.
>>>
>>
>> Sure.  It checks every address type IV group to determine whether this
>> group is valid to use reg offset addressing mode.  Here we only need to
>> check the first one and the last one, since the intermediates should 
>> have been handled by split_address_groups.  With unrolling the
>> displacement of the address can be offset-ed by (UF-1)*step, check the
>> address with this max offset whether still valid.  If the check finds
>> it's valid to use reg offset mode for the whole group, we flag this
>> group.  Later, when we create IV candidate for address group flagged,
>> we flag the candidate further.  This flag is mainly for iv cand
>> costing, we don't need to scale up iv cand's step cost for this kind
>> of candidate.
> 
> But AIUI, this is calculating whether the uses in their original form
> support all unrolled offsets.  For ivopts, I think the question is really
> whether the uses support all unrolled offsets when based on a given IV
> candidate (which might not be the original IV).
> 

Good point!  Indeed, the patch only flags the IV cands derived from an
address group flagged with reg_offset_p, so we may miss some other
candidates with the same base object but a different offset which can
satisfy addr_offset_valid_p.

How about updating the current approach as follows: instead of flagging
the derived iv cand, when we determine the cost for an iv cand, we check
whether this iv cand has the same base object as that of any
reg_offset_p group's vuse[0] (both stripped of their offsets), then
further check that the offset satisfies addr_offset_valid_p; if all the
checks pass, update the step cost without the uf consideration.
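
A rough sketch of that check (the local names are hypothetical; only
operand_equal_p and addr_offset_valid_p are existing interfaces, and the
exact offset bookkeeping would need care):

  /* cand_base/cand_off and head_base/head_off are the candidate's and the
     group head's bases with their constant offsets stripped.  */
  if (operand_equal_p (cand_base, head_base, 0)
      && addr_offset_valid_p (head_use, head_off - cand_off + max_increase))
    /* Price this cand's step cost once, without scaling by uf.  */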

I would expect this kind of address based iv cand to be used mainly for
address type groups and compare type groups; for the address type
groups, it can only be applied to those with the same base object, and
in most cases they are put in the same address group.  If it were used
much for generic groups, the step cost tweaking might not be fixed at
the beginning but vary according to the iv set members.  That looks like
too much for the existing framework.

> E.g. there might be another IV candidate at a constant offset
> from the original one, and the offsets might all be in range
> for that offset too.
> 
>> Imagining this loop is being unrolled, all the statements will be
>> duplicated by UF.  For the cost modeling against iv group, it's
>> scaling up the cost by UF (here I simply excluded the compare_type
>> since in most cases it for loop ending check).  For the cost modeling
>> against iv candidate, it's to focus on step costs, for an iv candidate
>> we flagged before, it's taken as one time step cost, for the others,
>> it's scaling up the step cost since the unrolling make step 
>> calculation become UF times.
>>
>> This cost modeling is trying to simulate cost change after the
>> unrolling, scaling up the costs accordingly.  There are somethings
>> to be improved like distinguish the loop ending compare or else,
>> whether need to tweak the other costs somehow since the scaling up
>> probably cause existing cost framework imbalance, but during
>> benchmarking I didn't find these matter, so take it as simple as 
>> possible for now.
>>
>>
>>> Instead, couldn't we take this into account in get_address_cost,
>>> which calculates the cost of an address use for a given candidate?
>>> E.g. after the main if-else at the start of the function,
>>> perhaps it would make sense to add the worst-case offset to
>>> the address in “parts”, check whether that too is a valid address,
>>> and if not, increase var_cost by the cost of one add instruction.
>>>
>>
>> IIUC, what you suggest is to tweak the iv group cost, if we find
>> one address group is valid for reg offset mode, we price more on
>> the pairs between this group and other non address-based iv cands.
>> The question is how do we decide this add-on cost.  For the test
>> case I was working on initially, adding one cost (of add) doesn't
>> work, the normal iv still wined.  We can price it more like two
>> but what's the justification on this value, by heuristics?
> 
> Yeah, I was thinking of adding one instance of add_cost.  If that
> doesn't work, it'd be interesting to know why in more detail.
> 

The case is like:

            for (i = 0; i < SIZE; i++)
              y[i] = a * x[i] + z[i];

It has three array accesses in the loop body; after vectorization,
it looks like

  vect__1.7_15 = MEM <vector(2) double> [(double *)vectp_x.5_20];
  vect__3.8_13 = vect__1.7_15 * vect_cst__14;
  vect__4.11_19 = MEM <vector(2) double> [(double *)vectp_z.9_7];
  vect__5.12_23 = vect__3.8_13 + vect__4.11_19;
  MEM <vector(2) double> [(double *)vectp_y.13_24] = vect__5.12_23;

We expect to use reg_offset for those vector loads/stores when the
unrolling factor is big, but without unrolling, or with a smaller
unrolling factor like 2, reg_index should outperform reg_offset.

IIUC, the proposed costing change would look like the following (the
left is before the change, the right is after).  The zero costs are for
those iv cands based on address objects.  Groups 0, 1 and 2 are address
groups; group 3 is the compare group (omitted).  One add insn cost is
counted as 4 on ppc64le.


  <Group-candidate Costs>:                \  <Group-candidate Costs>:
  Group 0:                                \  Group 0:
    cand  cost    compl.  inv.expr.       \    cand  cost    compl.  inv.expr.       inv.vars
    1     5       1       1;      NIL;    \    1     9       1       1;      NIL;
    4     1       1       1;      NIL;    \    4     5       1       1;      NIL;
    7     0       1       NIL;    NIL;    \    7     0       1       NIL;    NIL;
    8     0       0       NIL;    NIL;    \    8     0       0       NIL;    NIL;
    9     0       0       NIL;    NIL;    \    9     0       0       NIL;    NIL;
    13    5       1       1;      NIL;    \    13    9       1       1;      NIL;
    14    5       1       3;      NIL;    \    14    9       1       3;      NIL;
                                          \
  Group 1:                                \  Group 1:
    cand  cost    compl.  inv.expr.       \    cand  cost    compl.  inv.expr.       inv.vars
    1     5       1       4;      NIL;    \    1     9       1       4;      NIL;
    3     0       1       NIL;    NIL;    \    3     0       1       NIL;    NIL;
    4     1       1       4;      NIL;    \    4     5       1       4;      NIL;
    5     0       0       NIL;    NIL;    \    5     0       0       NIL;    NIL;
    6     0       0       NIL;    NIL;    \    6     0       0       NIL;    NIL;
    13    5       1       4;      NIL;    \    13    9       1       4;      NIL;
    14    5       1       6;      NIL;    \    14    9       1       6;      NIL;
                                          \
  Group 2:                                \  Group 2:
    cand  cost    compl.  inv.expr.       \    cand  cost    compl.  inv.expr.       inv.vars
    1     5       1       7;      NIL;    \    1     9       1       7;      NIL;
    4     1       1       7;      NIL;    \    4     5       1       7;      NIL;
    10    0       0       NIL;    NIL;    \    10    4       0       NIL;    NIL;
    11    0       0       NIL;    NIL;    \    11    4       0       NIL;    NIL;
    12    0       1       NIL;    NIL;    \    12    4       1       NIL;    NIL;
    13    5       1       7;      NIL;    \    13    9       1       7;      NIL;
    14    5       1       9;      NIL;    \    14    9       1       9;      NIL;

  Initial set of candidates:              \  Initial set of candidates:
    cost: 13 (complexity 3)               \    cost: 25 (complexity 0)
    reg_cost: 5                           \    reg_cost: 6
    cand_cost: 5                          \    cand_cost: 15
    cand_group_cost: 3 (complexity 3)     \    cand_group_cost: 4 (complexity 0)
    candidates: 4                         \    candidates: 5, 8, 10
     group:0 --> iv_cand:4, cost=(1,1)    \     group:0 --> iv_cand:8, cost=(0,0)
     group:1 --> iv_cand:4, cost=(1,1)    \     group:1 --> iv_cand:5, cost=(0,0)
     group:2 --> iv_cand:4, cost=(1,1)    \     group:2 --> iv_cand:10, cost=(4,0)
     group:3 --> iv_cand:4, cost=(0,0)    \     group:3 --> iv_cand:5, cost=(0,0)
    invariant variables:                  \    invariant variables:
    invariant expressions: 1, 4, 7        \    invariant expressions:
                                          \
  Original cost 21 (complexity 0)         \  Original cost 25 (complexity 0)
                                          \
  Final cost 13 (complexity 3)            \  Final cost 25 (complexity 0)

When the unrolling factor is 8, we expect to use reg offset modes for
groups 0, 1 and 2, and this proposal works well.  But when the unrolling
factor is 2, we expect to still use reg index modes for groups 0, 1 and
2 (with iv_cand 4 as before), while this proposal seems stuck with
iv_cand 8, 5, 10.  It looks like we would over-penalize the other non
reg offset candidates.  Sorry that I have no idea how to leverage uf in
this proposal; it shifts the focus to the iv_group cost, which doesn't
seem easy to scale up or down with good justification.

BR,
Kewen

>>> I guess there are two main sources of inexactness if we do that:
>>>
>>> (1) It might underestimate the cost because it assumes that vuse[0]
>>>     stands for all vuses in the group.
>>>
>>
>> Do you mean we don't need one check function like mark_reg_offset_groups?
>> If without it, vuse[0] might be not enough since we can't ensure the
>> others are fine with additional displacement from unrolling.  If we still
>> have it, I think it's fine to just use vuse[0].
>>
>>> (2) It might overestimates the cost because it treats all unrolled
>>>     iterations as having the cost of the final unrolled iteration.
>>>
>>> (1) could perhaps be avoided by adding a flag to the iv_use to say
>>> whether it wants this treatment.  I think the flag approach suffers
>>> from (2) too, and I'd be surprised if it makes a difference in practice.
>>>
>>
>> Sorry, I didn't have the whole picture how to deal with uf for your proposal.
>> But the flag approach considers uf in iv group cost calculation as well as
>> iv cand step cost calculation.
>>
>> BR,
>> Kewen
>>
>>> Thanks,
>>> Richard
>>>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling
  2020-06-02 11:38 ` [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling Richard Biener
@ 2020-06-03  3:46   ` Kewen.Lin
  2020-06-03  7:07     ` Richard Biener
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-06-03  3:46 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Segher Boessenkool, Bill Schmidt, bin.cheng,
	Richard Sandiford

Hi Richi,

on 2020/6/2 7:38 PM, Richard Biener wrote:
> On Thu, 28 May 2020, Kewen.Lin wrote:
> 
>> Hi,
>>
>> This is one repost and you can refer to the original series 
>> via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html.
>>
>> As we discussed in the thread
>> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
>> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
>> I'm working to teach IVOPTs to consider D-form group access during unrolling.
>> The difference on D-form and other forms during unrolling is we can put the
>> stride into displacement field to avoid additional step increment. eg:
>>
>> With X-form (uf step increment):
>>   ...
>>   LD A = baseA, X
>>   LD B = baseB, X
>>   ST C = baseC, X
>>   X = X + stride
>>   LD A = baseA, X
>>   LD B = baseB, X
>>   ST C = baseC, X
>>   X = X + stride
>>   LD A = baseA, X
>>   LD B = baseB, X
>>   ST C = baseC, X
>>   X = X + stride
>>   ...
>>
>> With D-form (one step increment for each base):
>>   ...
>>   LD A = baseA, OFF
>>   LD B = baseB, OFF
>>   ST C = baseC, OFF
>>   LD A = baseA, OFF+stride
>>   LD B = baseB, OFF+stride
>>   ST C = baseC, OFF+stride
>>   LD A = baseA, OFF+2*stride
>>   LD B = baseB, OFF+2*stride
>>   ST C = baseC, OFF+2*stride
>>   ...
>>   baseA += stride * uf
>>   baseB += stride * uf
>>   baseC += stride * uf
>>
>> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
>> D-form vs. 8 step updates with X-form. Here we only need to check stride
>> meet D-form field requirement, since if OFF doesn't meet, we can construct
>> baseA' with baseA + OFF.
> 
> I'd just mention there are other targets that have the choice between
> the above forms.  Since IVOPTs itself does not perform the unrolling
> the IL it produces is the same, correct?
> 
Yes.  Before this patch, IVOPTs doesn't consider the impact of unrolling;
it only models what it sees.  We can assume it acts as if
later RTL unrolling won't happen.

With this patch, since the IV choice can change, the IL can change
accordingly.  The typical difference with this patch is:

  vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B];
vs.
  vect__1.7_15 = MEM[base: _29, offset: 0B];
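
For reference, a minimal C loop of the kind that shows this difference might
look like the following (a sketch only; the array names, types and sizes are
illustrative, not taken from the actual test case):

  /* Illustrative sketch only.  After vectorization, the access to x[] can be
     addressed either as symbol + index (the first MEM form above) or as
     base + constant offset (the second form); the patch changes which IV
     IVOPTs prefers for it.  */
  extern int x[1024], y[1024];

  void
  foo (void)
  {
    for (int i = 0; i < 1024; i++)
      y[i] = x[i] + 1;
  }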

BR,
Kewen

> Richard.
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling
  2020-06-03  3:46   ` Kewen.Lin
@ 2020-06-03  7:07     ` Richard Biener
  2020-06-03  7:58       ` Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Richard Biener @ 2020-06-03  7:07 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Segher Boessenkool, Bill Schmidt, bin.cheng,
	Richard Sandiford

On Wed, 3 Jun 2020, Kewen.Lin wrote:

> Hi Richi,
> 
> on 2020/6/2 下午7:38, Richard Biener wrote:
> > On Thu, 28 May 2020, Kewen.Lin wrote:
> > 
> >> Hi,
> >>
> >> This is one repost and you can refer to the original series 
> >> via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html.
> >>
> >> As we discussed in the thread
> >> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
> >> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
> >> I'm working to teach IVOPTs to consider D-form group access during unrolling.
> >> The difference on D-form and other forms during unrolling is we can put the
> >> stride into displacement field to avoid additional step increment. eg:
> >>
> >> With X-form (uf step increment):
> >>   ...
> >>   LD A = baseA, X
> >>   LD B = baseB, X
> >>   ST C = baseC, X
> >>   X = X + stride
> >>   LD A = baseA, X
> >>   LD B = baseB, X
> >>   ST C = baseC, X
> >>   X = X + stride
> >>   LD A = baseA, X
> >>   LD B = baseB, X
> >>   ST C = baseC, X
> >>   X = X + stride
> >>   ...
> >>
> >> With D-form (one step increment for each base):
> >>   ...
> >>   LD A = baseA, OFF
> >>   LD B = baseB, OFF
> >>   ST C = baseC, OFF
> >>   LD A = baseA, OFF+stride
> >>   LD B = baseB, OFF+stride
> >>   ST C = baseC, OFF+stride
> >>   LD A = baseA, OFF+2*stride
> >>   LD B = baseB, OFF+2*stride
> >>   ST C = baseC, OFF+2*stride
> >>   ...
> >>   baseA += stride * uf
> >>   baseB += stride * uf
> >>   baseC += stride * uf
> >>
> >> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
> >> D-form vs. 8 step updates with X-form. Here we only need to check stride
> >> meet D-form field requirement, since if OFF doesn't meet, we can construct
> >> baseA' with baseA + OFF.
> > 
> > I'd just mention there are other targets that have the choice between
> > the above forms.  Since IVOPTs itself does not perform the unrolling
> > the IL it produces is the same, correct?
> > 
> Yes.  Before this patch, IVOPTs doesn't consider the unrolling impacts,
> it only models things based on what it sees.  We can assume it thinks
> later RTL unrolling won't perform.
> 
> With this patch, since the IV choice probably changes, the IL can probably
> change.  The typical difference with this patch is:
> 
>   vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B];
> vs.
>   vect__1.7_15 = MEM[base: _29, offset: 0B];

So we're asking IVOPTS "if we were unrolling this loop, would you make
a different IV choice?", thus I wonder why we need so much complexity
here?  That is, if we can classify the loop as being possibly unrolled,
we could evaluate IVOPTs' IV choice (and overall cost) on the original
loop and, in a second run, on the original loop with fake IV uses
added with an extra offset.  If the overall IV cost is similar, we'll
take the unroll-friendly choice; if the costs are way different
(I wouldn't expect this to be the case ever?), I'd side with the
IV choice when not unrolling (and mark the loop as to be not unrolled).

Thus I'd err on the side of not unrolling but leave the ultimate choice
of whether to unroll to RTL unless IV cost makes that prohibitive.

Even without X- or D- form addressing modes the IV choice may differ
and I think we don't need extra knobs for the unroller but instead
can decide to set the existing n_unroll to zero (force not unroll)
when costs say it would be bad?
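
Roughly, in pseudocode (a sketch of the idea only; every name below is
invented for illustration, none of them are existing GCC interfaces):

  /* Sketch, not an implementation.  */
  if (loop_may_be_unrolled_p (loop))
    {
      /* Cost the original loop, then the same loop with fake IV uses that
         carry an extra constant offset, emulating the unrolled copies.  */
      iv_decision normal = run_ivopts_cost_model (loop, /*fake_offset_uses=*/false);
      iv_decision unroll = run_ivopts_cost_model (loop, /*fake_offset_uses=*/true);

      if (similar_cost_p (normal.cost, unroll.cost))
        commit_iv_set (unroll);        /* unroll-friendly choice  */
      else
        {
          commit_iv_set (normal);      /* side with the non-unrolled choice  */
          loop->force_no_unroll = 1;   /* e.g. later set n_unroll to zero  */
        }
    }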

Richard.

> BR,
> Kewen
> 
> > Richard.
> > 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling
  2020-06-03  7:07     ` Richard Biener
@ 2020-06-03  7:58       ` Kewen.Lin
  2020-06-03  9:27         ` Richard Biener
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-06-03  7:58 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Segher Boessenkool, Bill Schmidt, bin.cheng,
	Richard Sandiford

on 2020/6/3 3:07 PM, Richard Biener wrote:
> On Wed, 3 Jun 2020, Kewen.Lin wrote:
> 
>> Hi Richi,
>>
>> on 2020/6/2 下午7:38, Richard Biener wrote:
>>> On Thu, 28 May 2020, Kewen.Lin wrote:
>>>
>>>> Hi,
>>>>
>>>> This is one repost and you can refer to the original series 
>>>> via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html.
>>>>
>>>> As we discussed in the thread
>>>> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
>>>> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
>>>> I'm working to teach IVOPTs to consider D-form group access during unrolling.
>>>> The difference on D-form and other forms during unrolling is we can put the
>>>> stride into displacement field to avoid additional step increment. eg:
>>>>
>>>> With X-form (uf step increment):
>>>>   ...
>>>>   LD A = baseA, X
>>>>   LD B = baseB, X
>>>>   ST C = baseC, X
>>>>   X = X + stride
>>>>   LD A = baseA, X
>>>>   LD B = baseB, X
>>>>   ST C = baseC, X
>>>>   X = X + stride
>>>>   LD A = baseA, X
>>>>   LD B = baseB, X
>>>>   ST C = baseC, X
>>>>   X = X + stride
>>>>   ...
>>>>
>>>> With D-form (one step increment for each base):
>>>>   ...
>>>>   LD A = baseA, OFF
>>>>   LD B = baseB, OFF
>>>>   ST C = baseC, OFF
>>>>   LD A = baseA, OFF+stride
>>>>   LD B = baseB, OFF+stride
>>>>   ST C = baseC, OFF+stride
>>>>   LD A = baseA, OFF+2*stride
>>>>   LD B = baseB, OFF+2*stride
>>>>   ST C = baseC, OFF+2*stride
>>>>   ...
>>>>   baseA += stride * uf
>>>>   baseB += stride * uf
>>>>   baseC += stride * uf
>>>>
>>>> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
>>>> D-form vs. 8 step updates with X-form. Here we only need to check stride
>>>> meet D-form field requirement, since if OFF doesn't meet, we can construct
>>>> baseA' with baseA + OFF.
>>>
>>> I'd just mention there are other targets that have the choice between
>>> the above forms.  Since IVOPTs itself does not perform the unrolling
>>> the IL it produces is the same, correct?
>>>
>> Yes.  Before this patch, IVOPTs doesn't consider the unrolling impacts,
>> it only models things based on what it sees.  We can assume it thinks
>> later RTL unrolling won't perform.
>>
>> With this patch, since the IV choice probably changes, the IL can probably
>> change.  The typical difference with this patch is:
>>
>>   vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B];
>> vs.
>>   vect__1.7_15 = MEM[base: _29, offset: 0B];
> 
> So we're asking IVOPTS "if we were unrolling this loop would you make
> a different IV choice?" thus I wonder why we need so much complexity
> here?  

I would describe it more like "we are going to unroll this loop with
unroll factor uf in RTL, would you consider this variable when modeling?"

In most cases, one single iteration is representative of the unrolled
body, so it doesn't matter whether we consider unrolling or not.  But for the
case here that's not true: the expected reg_offset iv cand can reduce the iv cand
step cost, and that leads to the difference.

> That is, if we can classify the loop as being possibly unrolled
> we could evaluate IVOPTs IV choice (and overall cost) on the original
> loop and in a second run on the original loop with fake IV uses
> added with extra offset.  If the overall IV cost is similar we'll
> take the unroll friendly choice if the costs are way different
> (I wouldn't expect this to be the case ever?) I'd side with the
> IV choice when not unrolling (and mark the loop as to be not unrolled).
> 

Could you elaborate a bit?  I guess it won't estimate the unroll
factor here, just guess whether it's to be unrolled or not?  The second run
with fake IV uses added with an extra offset sounds like scaling up the
iv group cost by uf.

> Thus I'd err on the side of not unrolling but leave the ultimate choice
> of whether to unroll to RTL unless IV cost makes that prohibitive.
> 
> Even without X- or D- form addressing modes the IV choice may differ
> and I think we don't need extra knobs for the unroller but instead
> can decide to set the existing n_unroll to zero (force not unroll)
> when costs say it would be bad?

Yes, even without x- or d-form addressing, the difference probably comes
from the compare type IV use for the loop ending, and maybe from more cases
that I am not aware of.  But I don't see people caring about it; the impact
is probably small.

IIUC, what you stated here sounds like using ivopts information for the
unrolling factor decision.  I think this is a separate direction; do we have
the kind of case where ivopts costs can foresee the unrolling?

For now, the unroll factor estimation can be used by other optimization passes
if they wonder about the future unrolling factor decision.  As discussed, it
sounds like a good idea to override n_unroll, with some benchmarking.

BR,
Kewen

> 
> Richard.
> 
>> BR,
>> Kewen
>>
>>> Richard.
>>>
>>
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling
  2020-06-03  7:58       ` Kewen.Lin
@ 2020-06-03  9:27         ` Richard Biener
  2020-06-03 10:47           ` Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Richard Biener @ 2020-06-03  9:27 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Segher Boessenkool, Bill Schmidt, bin.cheng,
	Richard Sandiford

On Wed, 3 Jun 2020, Kewen.Lin wrote:

> on 2020/6/3 下午3:07, Richard Biener wrote:
> > On Wed, 3 Jun 2020, Kewen.Lin wrote:
> > 
> >> Hi Richi,
> >>
> >> on 2020/6/2 下午7:38, Richard Biener wrote:
> >>> On Thu, 28 May 2020, Kewen.Lin wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> This is one repost and you can refer to the original series 
> >>>> via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html.
> >>>>
> >>>> As we discussed in the thread
> >>>> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
> >>>> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
> >>>> I'm working to teach IVOPTs to consider D-form group access during unrolling.
> >>>> The difference on D-form and other forms during unrolling is we can put the
> >>>> stride into displacement field to avoid additional step increment. eg:
> >>>>
> >>>> With X-form (uf step increment):
> >>>>   ...
> >>>>   LD A = baseA, X
> >>>>   LD B = baseB, X
> >>>>   ST C = baseC, X
> >>>>   X = X + stride
> >>>>   LD A = baseA, X
> >>>>   LD B = baseB, X
> >>>>   ST C = baseC, X
> >>>>   X = X + stride
> >>>>   LD A = baseA, X
> >>>>   LD B = baseB, X
> >>>>   ST C = baseC, X
> >>>>   X = X + stride
> >>>>   ...
> >>>>
> >>>> With D-form (one step increment for each base):
> >>>>   ...
> >>>>   LD A = baseA, OFF
> >>>>   LD B = baseB, OFF
> >>>>   ST C = baseC, OFF
> >>>>   LD A = baseA, OFF+stride
> >>>>   LD B = baseB, OFF+stride
> >>>>   ST C = baseC, OFF+stride
> >>>>   LD A = baseA, OFF+2*stride
> >>>>   LD B = baseB, OFF+2*stride
> >>>>   ST C = baseC, OFF+2*stride
> >>>>   ...
> >>>>   baseA += stride * uf
> >>>>   baseB += stride * uf
> >>>>   baseC += stride * uf
> >>>>
> >>>> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
> >>>> D-form vs. 8 step updates with X-form. Here we only need to check stride
> >>>> meet D-form field requirement, since if OFF doesn't meet, we can construct
> >>>> baseA' with baseA + OFF.
> >>>
> >>> I'd just mention there are other targets that have the choice between
> >>> the above forms.  Since IVOPTs itself does not perform the unrolling
> >>> the IL it produces is the same, correct?
> >>>
> >> Yes.  Before this patch, IVOPTs doesn't consider the unrolling impacts,
> >> it only models things based on what it sees.  We can assume it thinks
> >> later RTL unrolling won't perform.
> >>
> >> With this patch, since the IV choice probably changes, the IL can probably
> >> change.  The typical difference with this patch is:
> >>
> >>   vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B];
> >> vs.
> >>   vect__1.7_15 = MEM[base: _29, offset: 0B];
> > 
> > So we're asking IVOPTS "if we were unrolling this loop would you make
> > a different IV choice?" thus I wonder why we need so much complexity
> > here?  
> 
> I would describe it more like "we are going to unroll this loop with
> unroll factor uf in RTL, would you consider this variable when modeling?"
> 
> In most cases, one single iteration is representative for the unrolled
> body, so it doesn't matter considering unrolling or not.  But for the
> case here, it's not true, expected reg_offset iv cand can make iv cand
> step cost reduced, it leads the difference.
> 
> > That is, if we can classify the loop as being possibly unrolled
> > we could evaluate IVOPTs IV choice (and overall cost) on the original
> > loop and in a second run on the original loop with fake IV uses
> > added with extra offset.  If the overall IV cost is similar we'll
> > take the unroll friendly choice if the costs are way different
> > (I wouldn't expect this to be the case ever?) I'd side with the
> > IV choice when not unrolling (and mark the loop as to be not unrolled).
> > 
> 
> Could you elaborate it a bit?  I guess it won't estimate the unroll
> factor here, just guess it's to be unrolled or not?  The second run
> with fake IV uses added with extra offset sounds like scaling up the 
> iv group cost by uf.

From your example above, the D-form (MEM[symbol: x, index: ivtmp.19_22, 
offset: 0B]) is preferable since in the unrolled variant we have
the same address but with a different constant offset for the unroll
copies, while the second form would have to update the 'base' IV.

Thus I think the difference in IV cost and decision should already
show up if we, for each USE, add a USE with an added constant offset.
This might be what your patch does with that extra flag on the USEs;
I was suggesting to model the USEs more explicitly, simulating a
2-way unroll.  I think in the end I'll defer to Bin here, who knows
the code best.
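
Something like the following, as a sketch only (the helper and some field
names are made up or simplified, and the real iv_use/iv_group structures
carry more state than is copied here):

  /* For each address use with address BASE + OFF and IV step STEP, also
     record a twin use at BASE + OFF + STEP, so the cost model effectively
     sees a 2-way unrolled body when comparing candidates.  */
  for (unsigned i = 0; i < data->vgroups.length (); i++)
    {
      struct iv_group *group = data->vgroups[i];
      if (group->type != USE_REF_ADDRESS)  /* assumption: only address uses  */
        continue;
      struct iv_use *use = group->vuses[0];
      struct iv_use *twin = duplicate_use (use);   /* made-up helper  */
      twin->addr_offset = use->addr_offset + int_cst_value (use->iv->step);
      group->vuses.safe_push (twin);
    }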

> > Thus I'd err on the side of not unrolling but leave the ultimate choice
> > of whether to unroll to RTL unless IV cost makes that prohibitive.
> > 
> > Even without X- or D- form addressing modes the IV choice may differ
> > and I think we don't need extra knobs for the unroller but instead
> > can decide to set the existing n_unroll to zero (force not unroll)
> > when costs say it would be bad?
> 
> Yes, even without x- or d- form addressing, the difference probably comes 
> from compare type IV use for loop ending, maybe more cases which I am not
> aware of.  But I don't see people care about it, probably the impact is
> small.
> 
> IIUC what you stated here looks like to use ivopts information for unrolling
> factor decision, I think this is a separate direction, do we have this
> kind of case where ivopts costs can foresee the unrolling?
> 
> Now the unroll factor estimation can be used for other optimization passes
> if they are wondering future unrolling factor decision, as discussed it
> sounds a good idea to override the n_unroll with some benchmarking.

I didn't suggest using IVOPTs to determine the unroll factor.  In
fact your patch looks like it does this?  Instead I wanted to make
IVOPTs choose a set of IVs that is best for a blend of both worlds - use
D-form when it doesn't hurt the not-unrolled code [much], and X-form
when the D-form is way worse (for whatever reason) and signal that
to the unroller (but we could choose to not do that).

The real issue is of course that we're applying the IV decision to a loop
that is not final.

> BR,
> Kewen
> 
> > 
> > Richard.
> > 
> >> BR,
> >> Kewen
> >>
> >>> Richard.
> >>>
> >>
> > 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling
  2020-06-03  9:27         ` Richard Biener
@ 2020-06-03 10:47           ` Kewen.Lin
  2020-06-03 11:08             ` Richard Sandiford
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-06-03 10:47 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Segher Boessenkool, Bill Schmidt, bin.cheng,
	Richard Sandiford

on 2020/6/3 5:27 PM, Richard Biener wrote:
> On Wed, 3 Jun 2020, Kewen.Lin wrote:
> 
>> on 2020/6/3 下午3:07, Richard Biener wrote:
>>> On Wed, 3 Jun 2020, Kewen.Lin wrote:
>>>
>>>> Hi Richi,
>>>>

snip ...

>>>>>
>>>>> I'd just mention there are other targets that have the choice between
>>>>> the above forms.  Since IVOPTs itself does not perform the unrolling
>>>>> the IL it produces is the same, correct?
>>>>>
>>>> Yes.  Before this patch, IVOPTs doesn't consider the unrolling impacts,
>>>> it only models things based on what it sees.  We can assume it thinks
>>>> later RTL unrolling won't perform.
>>>>
>>>> With this patch, since the IV choice probably changes, the IL can probably
>>>> change.  The typical difference with this patch is:
>>>>
>>>>   vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B];
>>>> vs.
>>>>   vect__1.7_15 = MEM[base: _29, offset: 0B];
>>>
>>> So we're asking IVOPTS "if we were unrolling this loop would you make
>>> a different IV choice?" thus I wonder why we need so much complexity
>>> here?  
>>
>> I would describe it more like "we are going to unroll this loop with
>> unroll factor uf in RTL, would you consider this variable when modeling?"
>>
>> In most cases, one single iteration is representative for the unrolled
>> body, so it doesn't matter considering unrolling or not.  But for the
>> case here, it's not true, expected reg_offset iv cand can make iv cand
>> step cost reduced, it leads the difference.
>>
>>> That is, if we can classify the loop as being possibly unrolled
>>> we could evaluate IVOPTs IV choice (and overall cost) on the original
>>> loop and in a second run on the original loop with fake IV uses
>>> added with extra offset.  If the overall IV cost is similar we'll
>>> take the unroll friendly choice if the costs are way different
>>> (I wouldn't expect this to be the case ever?) I'd side with the
>>> IV choice when not unrolling (and mark the loop as to be not unrolled).
>>>
>>
>> Could you elaborate it a bit?  I guess it won't estimate the unroll
>> factor here, just guess it's to be unrolled or not?  The second run
>> with fake IV uses added with extra offset sounds like scaling up the 
>> iv group cost by uf.
> 
> From your example above the D-form (MEM[symbol: x, index: ivtmp.19_22, 
> offset: 0B]) is preferable since in the unrolled variant we have
> the same addres but with a different constant offset for the unroll
> copies while the second form would have to update the 'base' IV.
> 
> Thus I think the difference in IV cost and decision should already
> show up if we, for each USE add a USE with an added constant offset.
> This might be what your patch does with that extra flag on the USEs,
> I was suggesting to model the USEs more explicitely, simulating a
> 2-way unroll.  I think in the end I'll defer to Bin here who knows
> the code best.
> 

Thanks for your further explanation!  Under your proposal we introduce more
iv use groups with the step added.  Take the example here:
https://gcc.gnu.org/pipermail/gcc-patches/2020-June/547128.html
Imagine that initially cand iv 4, leading to x-form, wins; it's the
original iv and has iv-group cost 1 against the address group.
Although we introduce one more group (2-way unrolling), that iv still
wins, since pulling the address iv in takes 5 (15 for the three).  Probably
we can introduce more groups according to uf here.

OK.  Looking forward to Bin's comments.

>>> Thus I'd err on the side of not unrolling but leave the ultimate choice
>>> of whether to unroll to RTL unless IV cost makes that prohibitive.
>>>
>>> Even without X- or D- form addressing modes the IV choice may differ
>>> and I think we don't need extra knobs for the unroller but instead
>>> can decide to set the existing n_unroll to zero (force not unroll)
>>> when costs say it would be bad?
>>
>> Yes, even without x- or d- form addressing, the difference probably comes 
>> from compare type IV use for loop ending, maybe more cases which I am not
>> aware of.  But I don't see people care about it, probably the impact is
>> small.
>>
>> IIUC what you stated here looks like to use ivopts information for unrolling
>> factor decision, I think this is a separate direction, do we have this
>> kind of case where ivopts costs can foresee the unrolling?
>>
>> Now the unroll factor estimation can be used for other optimization passes
>> if they are wondering future unrolling factor decision, as discussed it
>> sounds a good idea to override the n_unroll with some benchmarking.
> 
> I didnt' suggest to use IVOPTs to determine the unroll factor.  In
> fact your patch looks like it does this?  Instead I wanted to make
> IVOPTs choose a set of IVs that is best for a blend of both worlds - use
> D-form when it doesn't hurt the not unrolled code [much], and X-form
> when the D-form is way worse (for whatever reason) and signal that
> to the unroller (but we could chose to not do that).
> 

Sorry for my poor comprehension!  Nice, we are going in the same direction.  :)

> The real issue is of course we're applying IV decision to a not final
> loop.
> 

Exactly.

BR,
Kewen

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling
  2020-06-03 10:47           ` Kewen.Lin
@ 2020-06-03 11:08             ` Richard Sandiford
  0 siblings, 0 replies; 64+ messages in thread
From: Richard Sandiford @ 2020-06-03 11:08 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Richard Biener, GCC Patches, Segher Boessenkool, Bill Schmidt, bin.cheng

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> on 2020/6/3 下午5:27, Richard Biener wrote:
>> On Wed, 3 Jun 2020, Kewen.Lin wrote:
>> 
>>> on 2020/6/3 下午3:07, Richard Biener wrote:
>>>> On Wed, 3 Jun 2020, Kewen.Lin wrote:
>>>>
>>>>> Hi Richi,
>>>>>
>
> snip ...
>
>>>>>>
>>>>>> I'd just mention there are other targets that have the choice between
>>>>>> the above forms.  Since IVOPTs itself does not perform the unrolling
>>>>>> the IL it produces is the same, correct?
>>>>>>
>>>>> Yes.  Before this patch, IVOPTs doesn't consider the unrolling impacts,
>>>>> it only models things based on what it sees.  We can assume it thinks
>>>>> later RTL unrolling won't perform.
>>>>>
>>>>> With this patch, since the IV choice probably changes, the IL can probably
>>>>> change.  The typical difference with this patch is:
>>>>>
>>>>>   vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B];
>>>>> vs.
>>>>>   vect__1.7_15 = MEM[base: _29, offset: 0B];
>>>>
>>>> So we're asking IVOPTS "if we were unrolling this loop would you make
>>>> a different IV choice?" thus I wonder why we need so much complexity
>>>> here?  
>>>
>>> I would describe it more like "we are going to unroll this loop with
>>> unroll factor uf in RTL, would you consider this variable when modeling?"
>>>
>>> In most cases, one single iteration is representative for the unrolled
>>> body, so it doesn't matter considering unrolling or not.  But for the
>>> case here, it's not true, expected reg_offset iv cand can make iv cand
>>> step cost reduced, it leads the difference.
>>>
>>>> That is, if we can classify the loop as being possibly unrolled
>>>> we could evaluate IVOPTs IV choice (and overall cost) on the original
>>>> loop and in a second run on the original loop with fake IV uses
>>>> added with extra offset.  If the overall IV cost is similar we'll
>>>> take the unroll friendly choice if the costs are way different
>>>> (I wouldn't expect this to be the case ever?) I'd side with the
>>>> IV choice when not unrolling (and mark the loop as to be not unrolled).
>>>>
>>>
>>> Could you elaborate it a bit?  I guess it won't estimate the unroll
>>> factor here, just guess it's to be unrolled or not?  The second run
>>> with fake IV uses added with extra offset sounds like scaling up the 
>>> iv group cost by uf.
>> 
>> From your example above the D-form (MEM[symbol: x, index: ivtmp.19_22, 
>> offset: 0B]) is preferable since in the unrolled variant we have
>> the same addres but with a different constant offset for the unroll
>> copies while the second form would have to update the 'base' IV.
>> 
>> Thus I think the difference in IV cost and decision should already
>> show up if we, for each USE add a USE with an added constant offset.
>> This might be what your patch does with that extra flag on the USEs,
>> I was suggesting to model the USEs more explicitely, simulating a
>> 2-way unroll.  I think in the end I'll defer to Bin here who knows
>> the code best.
>> 
>
> Thanks for your further explanation!  As your proposal we introduce more
> iv use groups with step added.  Take the example here
> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/547128.html
> Imagining initially the cand iv 4 leading to x-form wins, it's the
> original iv, has the iv-group cost 1 against the address group.
> Although we introduce one more group (2-way unrolling), the iv still
> wins since pulling the address iv in takes 5 (15 for three).  Probably
> we can introduce more groups according to uf here.

Yeah, to summarise that thread: the idea there was that we would
continue to cost each use once, but base the cost on the kind of address
seen in the unrolled iterations.  I guess this tends to over-estimate the
cost of index IVs to some extent, but I too was aiming for something
simple that doesn't depend on a specific unroll factor.

Kewen's point there was that that approach works for high unroll factors,
but not for small unroll factors like 2.  For:

  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride

using X as an IV is still preferred.  It's only once the unroll
factor exceeds the number of pointer IVs that using pointer IVs
becomes better.
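
To put rough numbers on that, using the example quoted just above (three
base arrays, so three pointer IVs as the alternative to the single shared
index IV X), the per-unrolled-body update counts are:

  uf = 2:  2 updates of X  vs  3 pointer-IV updates  ->  X-form still wins
  uf = 4:  4 updates of X  vs  3 pointer-IV updates  ->  pointer IVs start to win
  uf = 8:  8 updates of X  vs  3 pointer-IV updates  ->  pointer IVs win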

So like Kewen says, using 2 USEs (the original one and an unrolled one)
would have the opposite problem: it would still prefer index IVs and not
consider the benefit of pointer IVs at higher unroll factors.

But I agree that trying to guess what a much later pass will do doesn't
feel very clean either...

Thanks,
Richard

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling
  2020-05-28 12:24 ` [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling Kewen.Lin
  2020-06-01 17:59   ` Richard Sandiford
@ 2020-08-08  8:01   ` Bin.Cheng
  2020-08-10  4:27     ` Kewen.Lin
  1 sibling, 1 reply; 64+ messages in thread
From: Bin.Cheng @ 2020-08-08  8:01 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford

Hi Kewen,
Sorry for the late reply.
The patch's most important change is below cost computation:

> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
>     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
>
> +  /* Consider additional step updates during unrolling.  */
> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
> +    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
This is a bit strange; to me the add instructions are additional
computation caused by unrolling+addressing_mode, rather than a native
part of the candidate itself.  Specifically, an additional cost is needed
if candidates (without reg_offset_p) are chosen for the address type
group/uses.
> +
>   /* Prefer the original ivs unless we may gain something by replacing it.
>      The reason is to make debugging simpler; so this is not relevant for
>      artificial ivs created by other optimization passes.  */
>

> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
>       return;
>     }
>
> +  /* Since we priced more on non reg_offset IV cand step cost, we should scale
> +     up the appropriate IV group costs.  Simply consider USE_COMPARE at the
> +     loop exit, FIXME if multiple exits supported or no loop exit comparisons
> +     matter.  */
> +  if (data->consider_reg_offset_for_unroll_p
> +      && group->vuses[0]->type != USE_COMPARE)
> +    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
I don't quite follow here; "pricing more on non reg_offset IV cand"
doesn't make much sense to me.  Also, why are generic type uses not
skipped?  We want to model the cost required for address computation;
however, for generic type uses there is no way to save the computation
in an "address expression".  Once unrolled, the computation is always
there?

And what's the impact on targets supporting [base + index + offset]
addressing mode?

Given that the patch is not likely to do harm, because rtl loop unrolling is
(or was?) disabled by default, I am OK once the above comments are
addressed.

I wonder if it's possible to get 10% of (all which should be unrolled)
loops unrolled (conservatively) on gimple and enable it by default at
O3, rather than teaching ivopts to model a future pass which is not
likely to be used outside of benchmarks?

Thanks,
bin

> +
>   if (data->consider_all_candidates)
>     {
>       group->cost_map[cand->id].cand = cand;

On Thu, May 28, 2020 at 8:24 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
>
> gcc/ChangeLog
>
> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>
>         * tree-ssa-loop-ivopts.c (struct iv_group): New field reg_offset_p.
>         (struct iv_cand): New field reg_offset_p.
>         (struct ivopts_data): New field consider_reg_offset_for_unroll_p.
>         (dump_groups): Dump group with reg_offset_p.
>         (record_group): Initialize reg_offset_p.
>         (mark_reg_offset_groups): New function.
>         (find_interesting_uses): Call mark_reg_offset_groups.
>         (add_candidate_1): Update reg_offset_p if derived from reg_offset_p group.
>         (set_group_iv_cost): Scale up group cost with estimate_unroll_factor if
>         consider_reg_offset_for_unroll_p.
>         (determine_iv_cost): Increase step cost with estimate_unroll_factor if
>         consider_reg_offset_for_unroll_p.
>         (tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
>         consider_reg_offset_for_unroll_p.
>
> ----

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling
  2020-08-08  8:01   ` Bin.Cheng
@ 2020-08-10  4:27     ` Kewen.Lin
  2020-08-10 12:38       ` Bin.Cheng
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-08-10  4:27 UTC (permalink / raw)
  To: Bin.Cheng
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford

Hi Bin,

Thanks for the review!!

on 2020/8/8 4:01 PM, Bin.Cheng wrote:
> Hi Kewen,
> Sorry for the late reply.
> The patch's most important change is below cost computation:
> 
>> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
>>     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
>>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
>>
>> +  /* Consider additional step updates during unrolling.  */
>> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
>> +    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
> This is a bit strange, to me the add instructions are additional
> computation caused by unrolling+addressing_mode, rather than a native
> part in candidate itself.  Specifically, an additional cost is needed
> if candidates (without reg_offset_p) are chosen for the address type
> group/uses.

Good point.  Ideally it should be one additional cost for each cand set:
when we select one cand for one group, we need to check whether this pair needs
the extra (estimated_unroll - 1) step costs, and we probably need to care about
this during remove/replace etc.  IIUC, the current IVOPTs cost framework
doesn't support this, and it could increase the selection complexity and
time.  I hesitated to do that and initially put it into the cand step cost instead.

I was thinking that candidates with reg_offset_p should only be used for
reg_offset_p groups in most cases (very limited), while the others
are simply scaled up as before.  But indeed a per-set cost could cover some similar
cases, e.g. when one cand is only used for the compare type group for
loop closing, it doesn't need more step costs for unrolling.

Do you prefer me to improve the current cost framework?

>> +
>>   /* Prefer the original ivs unless we may gain something by replacing it.
>>      The reason is to make debugging simpler; so this is not relevant for
>>      artificial ivs created by other optimization passes.  */
>>
> 
>> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
>>       return;
>>     }
>>
>> +  /* Since we priced more on non reg_offset IV cand step cost, we should scale
>> +     up the appropriate IV group costs.  Simply consider USE_COMPARE at the
>> +     loop exit, FIXME if multiple exits supported or no loop exit comparisons
>> +     matter.  */
>> +  if (data->consider_reg_offset_for_unroll_p
>> +      && group->vuses[0]->type != USE_COMPARE)
>> +    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
> Not quite follow here, giving "pricing more on on-reg_offset IV cand"
> doesn't make much sense to me.  Also why generic type uses are not
> skipped?  We want to model the cost required for address computation,
> however, for generic type uses there is no way to save the computation
> in "address expression".  Once unrolled, the computation is always
> there?
> 

The main intention is to scale up the group/cand cost for unrolling since
we have scaled up the step costs.  The assumption is that the original
costing (without this patch) can be viewed as covering either all unrolled
iterations or just one single iteration.  Since IVOPTs doesn't support
fractional costing, I interpreted it as a single iteration; to emulate
the unrolling scenario on top of the previous step cost scaling, we need to
scale up the cost of all the computation.

In most cases, the compare type use is for loop closing, and there is still
only one such computation even if unrolling happens, so I excluded it here.
As the "FIXME" says, if we find some cases are off, we can further restrict it to
those USE_COMPARE uses which are exactly for loop closing.

> And what's the impact on targets supporting [base + index + offset]
> addressing mode?

Good question, I didn't notice it since power doesn't support it.
I noticed that the comments of function addr_offset_valid_p only mention
[base + offset]; I guess it excludes [base + index + offset]?
But I guess an address-based IV can work for this mode?

> 
> Given the patch is not likely to harm because rtl loop unrolling is
> (or was?) by default disabled, so I am OK once the above comments are
> addressed.
> 

Yes, it needs explicit unrolling options, except for some targets that
want to enable it for some cases with specific loop_unroll_adjust checks.

> I wonder if it's possible to get 10% of (all which should be unrolled)
> loops unrolled (conservatively) on gimple and enable it by default at
> O3, rather than teaching ivopts to model a future pass which not
> likely be used outside of benchmarks?
> 

Yeah, it would be nice if the unrolling happened before IVOPTs, so there
were no future unrollings left to account for.  PR [1] seems to have some
discussion on gimple unrolling.

Richi suggested need-driven gimple unrolling in the previous discussion [2]
on the RFC for this.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
[2] https://gcc.gnu.org/pipermail/gcc-patches/2020-January/537645.html

BR,
Kewen

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling
  2020-08-10  4:27     ` Kewen.Lin
@ 2020-08-10 12:38       ` Bin.Cheng
  2020-08-10 14:41         ` Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Bin.Cheng @ 2020-08-10 12:38 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford

On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Bin,
>
> Thanks for the review!!
>
> on 2020/8/8 下午4:01, Bin.Cheng wrote:
> > Hi Kewen,
> > Sorry for the late reply.
> > The patch's most important change is below cost computation:
> >
> >> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
> >>     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
> >>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
> >>
> >> +  /* Consider additional step updates during unrolling.  */
> >> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
> >> +    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
> > This is a bit strange, to me the add instructions are additional
> > computation caused by unrolling+addressing_mode, rather than a native
> > part in candidate itself.  Specifically, an additional cost is needed
> > if candidates (without reg_offset_p) are chosen for the address type
> > group/uses.
>
> Good point, ideally it should be one additional cost for each cand set,
> when we select one cand for one group, we need to check this pair need
> more (estimated_unroll - 1) step costs, we probably need to care about
> this during remove/replace etc.  IIUC the current IVOPTs cost framework
> doesn't support this and it could increase the selection complexity and
> time.  I hesitated to do it and put it to cand step cost initially instead.
>
> I was thinking those candidates with reg_offset_p should be only used for
> those reg_offset_p groups in most cases (very limited) meanwhile the others
> are simply scaled up like before.  But indeed this can cover some similar
> cases like one cand is only used for the compare type group which is for
> loop closing, then it doesn't need more step costs for unrolling.
>
> Do you prefer me to improve the current cost framework?
No, I don't think it's relevant to the candidate selecting algorithm.
I was thinking about adjusting cost somehow in
determine_group_iv_cost_address.  Given we don't expose the selected
addressing mode in this function, you may need to do it in
get_address_cost instead; either way.
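
As a rough sketch of what I mean (heavily simplified: the real functions
take more parameters, and the reg_offset predicate below is made up):

  /* In determine_group_iv_cost_address (sketch only): charge the extra
     (estimated_unroll - 1) step additions when this group/cand pair does
     not end up with a reg+offset addressing mode.  */
  cost = get_address_cost (data, group, cand);  /* simplified call  */
  if (data->consider_reg_offset_for_unroll_p
      && !pair_reg_offset_p (group, cand))      /* made-up predicate  */
    cost += (data->current_loop->estimated_unroll - 1)
            * add_cost (data->speed, TYPE_MODE (sizetype));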

>
> >> +
> >>   /* Prefer the original ivs unless we may gain something by replacing it.
> >>      The reason is to make debugging simpler; so this is not relevant for
> >>      artificial ivs created by other optimization passes.  */
> >>
> >
> >> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
> >>       return;
> >>     }
> >>
> >> +  /* Since we priced more on non reg_offset IV cand step cost, we should scale
> >> +     up the appropriate IV group costs.  Simply consider USE_COMPARE at the
> >> +     loop exit, FIXME if multiple exits supported or no loop exit comparisons
> >> +     matter.  */
> >> +  if (data->consider_reg_offset_for_unroll_p
> >> +      && group->vuses[0]->type != USE_COMPARE)
> >> +    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
> > Not quite follow here, giving "pricing more on on-reg_offset IV cand"
> > doesn't make much sense to me.  Also why generic type uses are not
> > skipped?  We want to model the cost required for address computation,
> > however, for generic type uses there is no way to save the computation
> > in "address expression".  Once unrolled, the computation is always
> > there?
> >
>
> The main intention is to scale up the group/cand cost for unrolling since
> we have scaled up the step costs.  The assumption is that the original
If we adjust cost appropriately in function *group_iv_cost_address,
this would become unnecessary, right?  And naturally.
> costing (without this patch) can be viewed as either for all unrolled
> iterations or just one single iteration.  Since IVOPTs doesn't support
> fractional costing, I interpreted it as single iterations, to emulate
> unrolling scenario based on the previous step cost scaling, we need to
> scale up the cost for all computation.
>
> In most cases, the compare type use is for loop closing, there is still
> only one computation even unrolling happens, so I excluded it here.
> As "FIXME", if we find some cases are off, we can further restrict it to
> those USE_COMPARE uses which is exactly for loop closing.
>
> > And what's the impact on targets supporting [base + index + offset]
> > addressing mode?
>
> Good question, I didn't notice it since power doesn't support it.
> I noticed the comments of function addr_offset_valid_p only mentioning
> [base + offset], I guess it excludes [base + index + offset]?
> But I guess the address-based IV can work for this mode?
No, addr_offset_valid_p is only used to split address use groups.  See
get_address_cost and struct mem_address.
I forgot to ask: what about targets that only support the [base + offset]
addressing mode, like RISC-V?  I would expect they're not affected at all.

>
> >
> > Given the patch is not likely to harm because rtl loop unrolling is
> > (or was?) by default disabled, so I am OK once the above comments are
> > addressed.
> >
>
> Yes, it needs explicit unrolling options, excepting for some targets
> wants to enable it for some cases with specific loop_unroll_adjust checks.
>
> > I wonder if it's possible to get 10% of (all which should be unrolled)
> > loops unrolled (conservatively) on gimple and enable it by default at
> > O3, rather than teaching ivopts to model a future pass which not
> > likely be used outside of benchmarks?
> >
>
> Yeah, it would be nice if the unrolling happen before IVOPTs and won't
> have any future unrollings to get it off.  PR[1] seems to have some
> discussion on gimple unrolling.
Thanks for directing me to the discussion.  I am on Wilco's side on
this problem.  IMHO, it might be useful to get small loops unrolled
(at O3 by default) simply to save the induction variable stepping and
the exit condition check, which can be modeled on gimple.  Especially for
RISC-V, which doesn't support an index addressing mode, there
will be as many induction variables as distinct arrays.  Also,
interleaving after unrolling is not that important; there's a high
chance that small loops eligible for interleaving are already handled by
the vectorizer.
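
For instance, in a simple loop like the following (illustrative only), a
[base + offset]-only target ends up with one pointer IV per distinct array,
while a [base + index] target can share a single index IV:

  void
  add3 (long n, int *restrict c, const int *restrict a, const int *restrict b)
  {
    /* Three distinct arrays: on a base+offset-only target this typically
       becomes three pointer IVs (a, b, c each stepped per iteration);
       with base+index addressing a single index IV i can serve all three.  */
    for (long i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }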

Thanks,
bin
>
> Richi suggested driven-by-need gimple unrolling in the previous discussion[2]
> on the RFC of this.
>
> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> [2] https://gcc.gnu.org/pipermail/gcc-patches/2020-January/537645.html
>
> BR,
> Kewen

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling
  2020-08-10 12:38       ` Bin.Cheng
@ 2020-08-10 14:41         ` Kewen.Lin
  2020-08-16  3:59           ` Bin.Cheng
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-08-10 14:41 UTC (permalink / raw)
  To: Bin.Cheng
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford, Jiufu Guo

Hi Bin,

on 2020/8/10 8:38 PM, Bin.Cheng wrote:
> On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Bin,
>>
>> Thanks for the review!!
>>
>> on 2020/8/8 下午4:01, Bin.Cheng wrote:
>>> Hi Kewen,
>>> Sorry for the late reply.
>>> The patch's most important change is below cost computation:
>>>
>>>> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
>>>>     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
>>>>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
>>>>
>>>> +  /* Consider additional step updates during unrolling.  */
>>>> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
>>>> +    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
>>> This is a bit strange, to me the add instructions are additional
>>> computation caused by unrolling+addressing_mode, rather than a native
>>> part in candidate itself.  Specifically, an additional cost is needed
>>> if candidates (without reg_offset_p) are chosen for the address type
>>> group/uses.
>>
>> Good point, ideally it should be one additional cost for each cand set,
>> when we select one cand for one group, we need to check this pair need
>> more (estimated_unroll - 1) step costs, we probably need to care about
>> this during remove/replace etc.  IIUC the current IVOPTs cost framework
>> doesn't support this and it could increase the selection complexity and
>> time.  I hesitated to do it and put it to cand step cost initially instead.
>>
>> I was thinking those candidates with reg_offset_p should be only used for
>> those reg_offset_p groups in most cases (very limited) meanwhile the others
>> are simply scaled up like before.  But indeed this can cover some similar
>> cases like one cand is only used for the compare type group which is for
>> loop closing, then it doesn't need more step costs for unrolling.
>>
>> Do you prefer me to improve the current cost framework?
> No, I don't think it's relevant to the candidate selecting algorithm.
> I was thinking about adjusting cost somehow in
> determine_group_iv_cost_address. Given we don't expose selected
> addressing mode in this function, you may need to do it in
> get_address_cost, either way.
> 

Thanks for your suggestion!

Sorry, I may be missing something, but I still think the additional cost is
per candidate.  The justification is that we fail to model the iv
candidate step well in the context of unrolling, and the step cost is part
of the candidate cost, which is per candidate.

Initializing it in determine_iv_cost isn't perfect, as you pointed out;
ideally we should check whether any use of the candidate requires an iv update
after each replicated iteration, take the extra step costs into account
if at least one does, and meanwhile scale up all the computation cost to
reflect the nature of the unrolling cost.

Besides, the desirable reg_offset pair already takes zero cost for the
cand/group cost, and IIRC negative cost isn't preferred in IVOPTs.  Are you
suggesting increasing the cost for non reg_offset pairs?  If so, and if it's
per pair, the extra cost could end up being counted several times
unexpectedly.

>>
>>>> +
>>>>   /* Prefer the original ivs unless we may gain something by replacing it.
>>>>      The reason is to make debugging simpler; so this is not relevant for
>>>>      artificial ivs created by other optimization passes.  */
>>>>
>>>
>>>> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
>>>>       return;
>>>>     }
>>>>
>>>> +  /* Since we priced more on non reg_offset IV cand step cost, we should scale
>>>> +     up the appropriate IV group costs.  Simply consider USE_COMPARE at the
>>>> +     loop exit, FIXME if multiple exits supported or no loop exit comparisons
>>>> +     matter.  */
>>>> +  if (data->consider_reg_offset_for_unroll_p
>>>> +      && group->vuses[0]->type != USE_COMPARE)
>>>> +    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
>>> Not quite follow here, giving "pricing more on on-reg_offset IV cand"
>>> doesn't make much sense to me.  Also why generic type uses are not
>>> skipped?  We want to model the cost required for address computation,
>>> however, for generic type uses there is no way to save the computation
>>> in "address expression".  Once unrolled, the computation is always
>>> there?
>>>
>>
>> The main intention is to scale up the group/cand cost for unrolling since
>> we have scaled up the step costs.  The assumption is that the original
> If we adjust cost appropriately in function *group_iv_cost_address,
> this would become unnecessary, right?  And naturally.
>> costing (without this patch) can be viewed as either for all unrolled
>> iterations or just one single iteration.  Since IVOPTs doesn't support
>> fractional costing, I interpreted it as single iterations, to emulate
>> unrolling scenario based on the previous step cost scaling, we need to
>> scale up the cost for all computation.
>>
>> In most cases, the compare type use is for loop closing, there is still
>> only one computation even unrolling happens, so I excluded it here.
>> As "FIXME", if we find some cases are off, we can further restrict it to
>> those USE_COMPARE uses which is exactly for loop closing.
>>
>>> And what's the impact on targets supporting [base + index + offset]
>>> addressing mode?
>>
>> Good question, I didn't notice it since power doesn't support it.
>> I noticed the comments of function addr_offset_valid_p only mentioning
>> [base + offset], I guess it excludes [base + index + offset]?
>> But I guess the address-based IV can work for this mode?
> No, addr_offset_valid_p is only used to split address use groups.  See
> get_address_cost and struct mem_address.
> I forgot to ask, what about target only supports [base + offset]
> addressing mode like RISC-V?  I would expect it's not affected at all.
> 
addr_offset_valid_p is also used in this patch, as Richard S. and Segher
suggested, to check that the offset after unrolling (like offset+(uf-1)*step)
is still valid for the target.  If an address-based IV cannot work with
[base + index + offset], the patch won't affect [base + index + offset].
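
A minimal sketch of that check (simplified: the actual patch's
mark_reg_offset_groups does more bookkeeping, and it may check fewer uses
than the loop below):

  /* Sketch only: an address group is usable with reg+offset under unrolling
     only if each use would still have a valid [base + offset] address with
     the largest extra displacement, i.e. offset + (uf - 1) * step.  */
  bool ok = true;
  for (unsigned i = 0; ok && i < group->vuses.length (); i++)
    {
      struct iv_use *use = group->vuses[i];
      poly_int64 max_off = use->addr_offset + (uf - 1) * step;
      ok = addr_offset_valid_p (use, max_off);
    }
  group->reg_offset_p = ok;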

It can help all targets which support [base + offset], so I'm afraid
it can affect RISC-V too, but I would expect the effect to be positive.  Or do
you happen to see some potential issues, or have some concerns?

And as Richard S. suggested before, there is one parameter to control this;
a target can simply disable it if it dislikes it.

>>
>>>
>>> Given the patch is not likely to harm because rtl loop unrolling is
>>> (or was?) by default disabled, so I am OK once the above comments are
>>> addressed.
>>>
>>
>> Yes, it needs explicit unrolling options, excepting for some targets
>> wants to enable it for some cases with specific loop_unroll_adjust checks.
>>
>>> I wonder if it's possible to get 10% of (all which should be unrolled)
>>> loops unrolled (conservatively) on gimple and enable it by default at
>>> O3, rather than teaching ivopts to model a future pass which not
>>> likely be used outside of benchmarks?
>>>
>>
>> Yeah, it would be nice if the unrolling happen before IVOPTs and won't
>> have any future unrollings to get it off.  PR[1] seems to have some
>> discussion on gimple unrolling.
> Thanks for directing me to the discussion.  I am on Wilco's side on
> this problem, IMHO, It might be useful getting small loops unrolled
> (at O3 by default) by simply saving induction variable stepping and
> exit condition check, which can be modeled on gimple.  Especially for
> RISC-V, it doesn't support index addressing mode, which means there
> will be as many induction variables as distinct arrays.  Also
> interleaving after unrolling is not that important, it's at high
> chance that small loops eligible for interleaving are handled by
> vectorizer already.
> 

Good idea.  CCing Jeff, since I think Jeff (Jiufu) has been working to see
whether we can bring in a gimple unrolling pass.  Regardless of whether we
have it or not, as long as RTL unrolling is there, I guess this patch
set is still beneficial?  If so, I would treat it separately.

BR,
Kewen

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling
  2020-08-10 14:41         ` Kewen.Lin
@ 2020-08-16  3:59           ` Bin.Cheng
  2020-08-18  9:02             ` [PATCH 3/4 v2] " Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Bin.Cheng @ 2020-08-16  3:59 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford, Jiufu Guo

On Mon, Aug 10, 2020 at 10:41 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Bin,
>
> on 2020/8/10 下午8:38, Bin.Cheng wrote:
> > On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi Bin,
> >>
> >> Thanks for the review!!
> >>
> >> on 2020/8/8 下午4:01, Bin.Cheng wrote:
> >>> Hi Kewen,
> >>> Sorry for the late reply.
> >>> The patch's most important change is below cost computation:
> >>>
> >>>> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
> >>>>     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
> >>>>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
> >>>>
> >>>> +  /* Consider additional step updates during unrolling.  */
> >>>> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
> >>>> +    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
> >>> This is a bit strange, to me the add instructions are additional
> >>> computation caused by unrolling+addressing_mode, rather than a native
> >>> part in candidate itself.  Specifically, an additional cost is needed
> >>> if candidates (without reg_offset_p) are chosen for the address type
> >>> group/uses.
> >>
> >> Good point, ideally it should be one additional cost for each cand set,
> >> when we select one cand for one group, we need to check this pair need
> >> more (estimated_unroll - 1) step costs, we probably need to care about
> >> this during remove/replace etc.  IIUC the current IVOPTs cost framework
> >> doesn't support this and it could increase the selection complexity and
> >> time.  I hesitated to do it and put it to cand step cost initially instead.
> >>
> >> I was thinking those candidates with reg_offset_p should be only used for
> >> those reg_offset_p groups in most cases (very limited) meanwhile the others
> >> are simply scaled up like before.  But indeed this can cover some similar
> >> cases like one cand is only used for the compare type group which is for
> >> loop closing, then it doesn't need more step costs for unrolling.
> >>
> >> Do you prefer me to improve the current cost framework?
> > No, I don't think it's relevant to the candidate selecting algorithm.
> > I was thinking about adjusting cost somehow in
> > determine_group_iv_cost_address. Given we don't expose selected
> > addressing mode in this function, you may need to do it in
> > get_address_cost, either way.
> >
>
> Thanks for your suggestion!
>
> Sorry, I may miss something, but I still think the additional cost is
> per candidate.  The justification is that we miss to model the iv
> candidate step well in the context of unrolling, the step cost is part
> of candidate cost, which is per candidate.
>
> To initialize it in determine_iv_cost isn't perfect as you pointed out,
> ideally we should check any uses of the candidate requires iv update
> after each replicated iteration, and take extra step costs into account
> if at least one needs, meanwhile scaling up all the computation cost to
> reflect unrolling cost nature.
I see, it's similar to the auto-increment case where the cost should be
recorded only once.  So this is okay, given that 1) precisely predicting
RTL unrolling is likely impossible here; and 2) the patch has very limited
impact.

Thanks,
bin
>
> Besides, the reg_offset desirable pair already takes zero cost for
> cand/group cost, IIRC negative cost isn't preferred in IVOPTs, are you
> suggesting increasing the cost for non reg_offset pairs?  If so and per
> pair, the extra cost looks possible to be computed several times
> unexpectedly.
>
> >>
> >>>> +
> >>>>   /* Prefer the original ivs unless we may gain something by replacing it.
> >>>>      The reason is to make debugging simpler; so this is not relevant for
> >>>>      artificial ivs created by other optimization passes.  */
> >>>>
> >>>
> >>>> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
> >>>>       return;
> >>>>     }
> >>>>
> >>>> +  /* Since we priced more on non reg_offset IV cand step cost, we should scale
> >>>> +     up the appropriate IV group costs.  Simply consider USE_COMPARE at the
> >>>> +     loop exit, FIXME if multiple exits supported or no loop exit comparisons
> >>>> +     matter.  */
> >>>> +  if (data->consider_reg_offset_for_unroll_p
> >>>> +      && group->vuses[0]->type != USE_COMPARE)
> >>>> +    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
> >>> Not quite follow here, giving "pricing more on on-reg_offset IV cand"
> >>> doesn't make much sense to me.  Also why generic type uses are not
> >>> skipped?  We want to model the cost required for address computation,
> >>> however, for generic type uses there is no way to save the computation
> >>> in "address expression".  Once unrolled, the computation is always
> >>> there?
> >>>
> >>
> >> The main intention is to scale up the group/cand cost for unrolling since
> >> we have scaled up the step costs.  The assumption is that the original
> > If we adjust cost appropriately in function *group_iv_cost_address,
> > this would become unnecessary, right?  And naturally.
> >> costing (without this patch) can be viewed as either for all unrolled
> >> iterations or just one single iteration.  Since IVOPTs doesn't support
> >> fractional costing, I interpreted it as single iterations, to emulate
> >> unrolling scenario based on the previous step cost scaling, we need to
> >> scale up the cost for all computation.
> >>
> >> In most cases, the compare type use is for loop closing, there is still
> >> only one computation even unrolling happens, so I excluded it here.
> >> As "FIXME", if we find some cases are off, we can further restrict it to
> >> those USE_COMPARE uses which is exactly for loop closing.
> >>
> >>> And what's the impact on targets supporting [base + index + offset]
> >>> addressing mode?
> >>
> >> Good question, I didn't notice it since power doesn't support it.
> >> I noticed the comments of function addr_offset_valid_p only mentioning
> >> [base + offset], I guess it excludes [base + index + offset]?
> >> But I guess the address-based IV can work for this mode?
> > No, addr_offset_valid_p is only used to split address use groups.  See
> > get_address_cost and struct mem_address.
> > I forgot to ask, what about target only supports [base + offset]
> > addressing mode like RISC-V?  I would expect it's not affected at all.
> >
> addr_offset_valid_p is also used in this patch as Richard S. and Segher
> suggested to check the offset after unrolling (like: offset+(uf-1)*step)
> is still valid for the target.  If address-based IV can not work for
> [base + index + offset], it won't affect [base + index + offset].
>
> It can help all targets which supports [base + offset], so I'm afraid
> it can affect RISC-V too, but I would expect it's positive.  Or do you
> happen to see some potential issues? or have some concerns?
>
> And as Richard S. suggested before, it has one parameter to control,
> target can simply disable this if it dislikes.
>
> >>
> >>>
> >>> Given the patch is not likely to harm because rtl loop unrolling is
> >>> (or was?) by default disabled, so I am OK once the above comments are
> >>> addressed.
> >>>
> >>
> >> Yes, it needs explicit unrolling options, excepting for some targets
> >> wants to enable it for some cases with specific loop_unroll_adjust checks.
> >>
> >>> I wonder if it's possible to get 10% of (all which should be unrolled)
> >>> loops unrolled (conservatively) on gimple and enable it by default at
> >>> O3, rather than teaching ivopts to model a future pass which not
> >>> likely be used outside of benchmarks?
> >>>
> >>
> >> Yeah, it would be nice if the unrolling happen before IVOPTs and won't
> >> have any future unrollings to get it off.  PR[1] seems to have some
> >> discussion on gimple unrolling.
> > Thanks for directing me to the discussion.  I am on Wilco's side on
> > this problem, IMHO, It might be useful getting small loops unrolled
> > (at O3 by default) by simply saving induction variable stepping and
> > exit condition check, which can be modeled on gimple.  Especially for
> > RISC-V, it doesn't support index addressing mode, which means there
> > will be as many induction variables as distinct arrays.  Also
> > interleaving after unrolling is not that important, it's at high
> > chance that small loops eligible for interleaving are handled by
> > vectorizer already.
> >
>
> Good idea, CC Jeff since I think Jeff (Jiufu) has been working to see
> whether we can bring in one gimple unrolling pass.  Regardless we
> have/don't have it, if the RTL unrolling is there, I guess this patch
> set is still beneficial?  If so, I would take it separately.
>
> BR,
> Kewen


* [PATCH 3/4 v2] ivopts: Consider cost_step on different forms during unrolling
  2020-08-16  3:59           ` Bin.Cheng
@ 2020-08-18  9:02             ` Kewen.Lin
  2020-08-22  5:11               ` Bin.Cheng
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-08-18  9:02 UTC (permalink / raw)
  To: Bin.Cheng
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford, Jiufu Guo

[-- Attachment #1: Type: text/plain, Size: 1867 bytes --]

Hi Bin,

> I see, it's similar to the auto-increment case where cost should be
> recorded only once.  So this is okay given 1) fine predicting
> rtl-unroll is likely impossible here; 2) the patch has very limited
> impact.
> 
Really appreciate your time and patience!

I extended the previous version to address Richard S.'s comments on
candidates with the same base/step but different offsets here:
https://gcc.gnu.org/pipermail/gcc-patches/2020-June/547014.html.

The previous version only allowed the candidate derived from the group
of interest; this updated patch extends it to candidates that have the
same base/step and the same or different offsets, as long as the offsets
stay in the acceptable range once unrolling is taken into account.

For one particular case like: 

            for (i = 0; i < SIZE; i++)
              y[i] = a * x[i] + z[i];

we will mark reg_offset_p for IV candidates on x as below:
   - (unsigned long) (x_18(D) + 8)    // only mark this before.
   - x_18(D) + 8
   - (unsigned long) (x_18(D) + 24)
   - (unsigned long) ((vector(2) double *) (x_18(D) + 8) + 18446744073709551600)
   ...
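
(For what it's worth, the last constant above is 2^64 - 16, i.e. -16 when
read as a signed 64-bit offset; a quick stand-alone check of the arithmetic,
nothing more:)

  #include <stdint.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* (uint64_t) -16 wraps to 2^64 - 16.  */
    printf ("%d\n", (uint64_t) -16 == 18446744073709551600ULL); /* prints 1 */
    return 0;
  }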

Would you mind reviewing it again?  Thanks in advance!

Bootstrapped/regtested on powerpc64le-linux-gnu P8 and P9.

The SPEC2017 P9 performance run shows no notable degradations or improvements.

BR,
Kewen
-----
gcc/ChangeLog:

	* tree-ssa-loop-ivopts.c (struct iv_cand): New field reg_offset_p.
	(struct ivopts_data): New field consider_reg_offset_for_unroll_p.
	(mark_reg_offset_candidates): New function.
	(add_candidate_1): Set reg_offset_p to false for new candidate.
	(set_group_iv_cost): Scale up group cost with estimate_unroll_factor if
	consider_reg_offset_for_unroll_p.
	(determine_iv_cost): Increase step cost with estimate_unroll_factor if
	consider_reg_offset_for_unroll_p.
	(tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
	consider_reg_offset_for_unroll_p.


[-- Attachment #2: ivopts_0818.diff --]
[-- Type: text/plain, Size: 7406 bytes --]

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 1d2697ae1ba..5a19b53c8d5 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -473,6 +473,9 @@ struct iv_cand
   struct iv *orig_iv;	/* The original iv if this cand is added from biv with
 			   smaller type.  */
   bool doloop_p;	/* Whether this is a doloop candidate.  */
+  bool reg_offset_p;	/* Whether this is available for an address type group
+			   where its all uses are valid to adopt reg_offset
+			   addressing mode even considering unrolling.  */
 };
 
 /* Hashtable entry for common candidate derived from iv uses.  */
@@ -653,6 +656,10 @@ struct ivopts_data
 
   /* Whether the loop has doloop comparison use.  */
   bool doloop_use_p;
+
+  /* Whether need to consider register offset addressing mode for the loop with
+     upcoming unrolling by estimated unroll factor.  */
+  bool consider_reg_offset_for_unroll_p;
 };
 
 /* An assignment of iv candidates to uses.  */
@@ -2731,6 +2738,112 @@ split_address_groups (struct ivopts_data *data)
     }
 }
 
+/* For each address type group, it finds the address-based IV candidates with
+   the same base and step, for those that are available to be used for the
+   whole group with reg_offset addressing mode by considering the address offset
+   difference and increased offset with unrolling factor estimation, mark them
+   as reg_offset_p.  */
+
+static void
+mark_reg_offset_candidates (struct ivopts_data *data)
+{
+  class loop *loop = data->current_loop;
+  gcc_assert (data->current_loop->estimated_unroll > 1);
+  bool any_reg_offset_p = false;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "<Reg_offset_p Candidates>:\n");
+
+  auto valid_reg_offset_p
+    = [] (struct iv_use *use, poly_uint64 off, poly_uint64 max_inc) {
+	if (!addr_offset_valid_p (use, off))
+	  return false;
+	if (!addr_offset_valid_p (use, off + max_inc))
+	  return false;
+	return true;
+      };
+
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+    {
+      struct iv_group *group = data->vgroups[i];
+
+      if (address_p (group->type))
+	{
+	  struct iv_use *head_use = group->vuses[0];
+	  if (!tree_fits_poly_int64_p (head_use->iv->step))
+	    continue;
+
+	  poly_int64 step = tree_to_poly_int64 (head_use->iv->step);
+	  /* Max extra offset to be added due to unrolling.  */
+	  poly_int64 max_increase = (loop->estimated_unroll - 1) * step;
+
+	  tree use_base = head_use->addr_base;
+	  STRIP_NOPS (use_base);
+
+	  struct iv_use *last_use = NULL;
+	  unsigned group_size = group->vuses.length ();
+	  gcc_assert (group_size >= 1);
+	  if (maybe_ne (head_use->addr_offset,
+			group->vuses[group_size - 1]->addr_offset))
+	    last_use = group->vuses[group_size - 1];
+
+	  unsigned j;
+	  bitmap_iterator bi;
+	  EXECUTE_IF_SET_IN_BITMAP (group->related_cands, 0, j, bi)
+	  {
+	    struct iv_cand *cand = data->vcands[j];
+
+	    if (!cand->iv->base_object)
+	      continue;
+
+	    if (cand->reg_offset_p)
+	      continue;
+
+	    if (!operand_equal_p (head_use->iv->base_object,
+				  cand->iv->base_object, 0))
+	      continue;
+
+	    if (!operand_equal_p (head_use->iv->step, cand->iv->step, 0))
+	      continue;
+
+	    poly_uint64 cand_offset = 0;
+	    tree cand_base = strip_offset (cand->iv->base, &cand_offset);
+	    STRIP_NOPS (cand_base);
+	    if (!operand_equal_p (use_base, cand_base, 0))
+	      continue;
+
+	    /* Only need to check the first one and the last one in the group
+	       since it's sorted.  If both are valid, the other intermediate
+	       ones should be in the acceptable range.  */
+	    poly_uint64 head_off = head_use->addr_offset - cand_offset;
+	    if (!valid_reg_offset_p (head_use, head_off, max_increase))
+	      continue;
+
+	    if (last_use)
+	      {
+		poly_int64 last_off = last_use->addr_offset - cand_offset;
+		if (!valid_reg_offset_p (head_use, last_off, max_increase))
+		  continue;
+	      }
+
+	    cand->reg_offset_p = true;
+
+	    if (dump_file && (dump_flags & TDF_DETAILS))
+	      fprintf (dump_file, "  cand %u valid for group %u\n", j, i);
+
+	    if (!any_reg_offset_p)
+	      any_reg_offset_p = true;
+	  }
+	}
+    }
+
+  if (!any_reg_offset_p)
+    data->consider_reg_offset_for_unroll_p = false;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "\n");
+}
+
 /* Finds uses of the induction variables that are interesting.  */
 
 static void
@@ -3147,6 +3260,7 @@ add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
       cand->important = important;
       cand->incremented_at = incremented_at;
       cand->doloop_p = doloop;
+      cand->reg_offset_p = false;
       data->vcands.safe_push (cand);
 
       if (!poly_int_tree_p (step))
@@ -3654,6 +3768,14 @@ set_group_iv_cost (struct ivopts_data *data,
       return;
     }
 
+  /* Since we priced more on non reg_offset IV cand step cost, we should scale
+     up the appropriate IV group costs.  Simply consider USE_COMPARE at the
+     loop exit, FIXME if multiple exits supported or no loop exit comparisons
+     matter.  */
+  if (data->consider_reg_offset_for_unroll_p
+      && group->vuses[0]->type != USE_COMPARE)
+    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
+
   if (data->consider_all_candidates)
     {
       group->cost_map[cand->id].cand = cand;
@@ -5718,6 +5840,9 @@ find_iv_candidates (struct ivopts_data *data)
   if (!data->consider_all_candidates)
     relate_compare_use_with_all_cands (data);
 
+  if (data->consider_reg_offset_for_unroll_p)
+    mark_reg_offset_candidates (data);
+
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       unsigned i;
@@ -5890,6 +6015,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
 
+  /* Consider additional step updates during unrolling.  */
+  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
+    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
+
   /* Prefer the original ivs unless we may gain something by replacing it.
      The reason is to make debugging simpler; so this is not relevant for
      artificial ivs created by other optimization passes.  */
@@ -7976,6 +8105,7 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop,
   data->current_loop = loop;
   data->loop_loc = find_loop_location (loop).get_location_t ();
   data->speed = optimize_loop_for_speed_p (loop);
+  data->consider_reg_offset_for_unroll_p = false;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
@@ -8008,6 +8138,16 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop,
   if (!find_induction_variables (data))
     goto finish;
 
+  if (param_iv_consider_reg_offset_for_unroll != 0 && exit)
+    {
+      tree_niter_desc *desc = niter_for_exit (data, exit);
+      estimate_unroll_factor (loop, desc);
+      data->consider_reg_offset_for_unroll_p = loop->estimated_unroll > 1;
+      if (dump_file && (dump_flags & TDF_DETAILS)
+	  && data->consider_reg_offset_for_unroll_p)
+	fprintf (dump_file, "\nEstimated_unroll:%u\n", loop->estimated_unroll);
+    }
+
   /* Finds interesting uses (item 1).  */
   find_interesting_uses (data);
   if (data->vgroups.length () > MAX_CONSIDERED_GROUPS)


* Re: [PATCH 3/4 v2] ivopts: Consider cost_step on different forms during unrolling
  2020-08-18  9:02             ` [PATCH 3/4 v2] " Kewen.Lin
@ 2020-08-22  5:11               ` Bin.Cheng
  2020-08-25 12:46                 ` [PATCH 3/4 v3] " Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Bin.Cheng @ 2020-08-22  5:11 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford, Jiufu Guo

On Tue, Aug 18, 2020 at 5:03 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Bin,
>
> > I see, it's similar to the auto-increment case where cost should be
> > recorded only once.  So this is okay given 1) fine predicting
> > rtl-unroll is likely impossible here; 2) the patch has very limited
> > impact.
> >
> Really appreciate your time and patience!
>
> I extended the previous version to address Richard S.'s comments on
> candidates with the same base/step but different offsets here:
> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/547014.html.
>
> The previous version only allows the candidate derived from the group
> of interest, this updated patch extends it to those ones which have the
> same bases/steps and same/different offsets but in the acceptable range
> by considering unrolling.
>
> For one particular case like:
>
>             for (i = 0; i < SIZE; i++)
>               y[i] = a * x[i] + z[i];
>
> we will mark reg_offset_p for IV candidates on x as below:
>    - (unsigned long) (x_18(D) + 8)    // only mark this before.
>    - x_18(D) + 8
>    - (unsigned long) (x_18(D) + 24)
>    - (unsigned long) ((vector(2) double *) (x_18(D) + 8) + 18446744073709551600)
>    ...
>
> Do you mind to have a review again?  Thanks in advance!
I trust you with the change.
>
> Bootstrapped/regtested on powerpc64le-linux-gnu P8 and P9.
>
> SPEC2017 P9 performance run has no remarkable degradations/improvements.
Is this run with unroll-loops?
Could you exercise the code with unroll-loops enabled when
bootstrapping/regtesting, please?  It doesn't matter if cases fail with
unroll-loops; just make sure there is no fallout.  Otherwise it's
fine with me.

Thanks,
bin
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
>         * tree-ssa-loop-ivopts.c (struct iv_cand): New field reg_offset_p.
>         (struct ivopts_data): New field consider_reg_offset_for_unroll_p.
>         (mark_reg_offset_candidates): New function.
>         (add_candidate_1): Set reg_offset_p to false for new candidate.
>         (set_group_iv_cost): Scale up group cost with estimate_unroll_factor if
>         consider_reg_offset_for_unroll_p.
>         (determine_iv_cost): Increase step cost with estimate_unroll_factor if
>         consider_reg_offset_for_unroll_p.
>         (tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
>         consider_reg_offset_for_unroll_p.
>


* [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-08-22  5:11               ` Bin.Cheng
@ 2020-08-25 12:46                 ` Kewen.Lin
  2020-08-31 19:41                   ` Segher Boessenkool
  2020-09-01 11:19                   ` Bin.Cheng
  0 siblings, 2 replies; 64+ messages in thread
From: Kewen.Lin @ 2020-08-25 12:46 UTC (permalink / raw)
  To: Bin.Cheng
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford, Jiufu Guo

[-- Attachment #1: Type: text/plain, Size: 4448 bytes --]

Hi Bin,

>>
>> For one particular case like:
>>
>>             for (i = 0; i < SIZE; i++)
>>               y[i] = a * x[i] + z[i];
>>
>> we will mark reg_offset_p for IV candidates on x as below:
>>    - (unsigned long) (x_18(D) + 8)    // only mark this before.
>>    - x_18(D) + 8
>>    - (unsigned long) (x_18(D) + 24)
>>    - (unsigned long) ((vector(2) double *) (x_18(D) + 8) + 18446744073709551600)
>>    ...
>>
>> Do you mind to have a review again?  Thanks in advance!
> I trust you with the change.

Thanks again!  Sorry for the late reply; it took some time to investigate
the exposed issues.

>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu P8 and P9.
>>
>> SPEC2017 P9 performance run has no remarkable degradations/improvements.
> Is this run with unroll-loops?

Yes, the options I used were:

   -Ofast -mcpu=power9 -fpeel-loops -mrecip -funroll-loops

I also re-tested the newly attached patch; nothing changes in the SPEC2017 data.

> Could you exercise the code with unroll-loops enabled when
> bootstrap/regtest please?  It doesn't matter if cases fail with
> unroll-loops, just making sure there is no fallout.  Otherwise it's
> fine with me.

Great idea!  With -funroll-loops explicitly specified, it bootstrapped,
but the regression testing did show one failure (the only one):

  PASS->FAIL: gcc.dg/sms-4.c scan-rtl-dump-times sms "SMS succeeded" 1

It exposes two issues:

1) Currently the address_cost hook on rs6000 always returns zero, but at
least from Power7 on, pre_inc/pre_dec kind instructions are cracked, which
means we have to take the address update into account (as a normal scalar
operation).  Since IVOPTs reduces the cost_step for ainc candidates, it
makes us prefer ainc candidates.  In this case, the cand/group cost is -4
(minus cost_step); with the scaling up, it ends up even further off.  With
one simple hack for pre_inc/pre_dec in the rs6000 address_cost, the case
passed.  It should be handled as a separate issue.

2) This case makes me think we should exclude ainc candidates in function
mark_reg_offset_candidates.  The justification is that an ainc candidate
handles the step update itself, and when we calculate the cost for it
against its ainc_use, the cost_step has already been reduced.  When
unrolling happens, the ainc computations are replicated, so it doesn't
save step updates the way normal reg_offset_p candidates do.
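
As a hand-written C-level illustration of that difference (not compiler
output; it assumes an unroll factor of 4 and that n is a multiple of 4):

  /* reg+offset flavour: the stride is folded into the displacements,
     leaving a single explicit step update per unrolled body.  */
  double
  sum_reg_offset (const double *p, long n)
  {
    double s = 0.0;
    for (const double *q = p; q != p + n; q += 4)
      s += q[0] + q[1] + q[2] + q[3];
    return s;
  }

  /* Auto-increment flavour: the update is embedded in each access, so
     every replicated access still carries its own update.  */
  double
  sum_auto_inc (const double *p, long n)
  {
    double s = 0.0;
    const double *q = p, *end = p + n;
    while (q != end)
      {
        s += *q++;
        s += *q++;
        s += *q++;
        s += *q++;
      }
    return s;
  }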

I've updated the patch to punt ainc_use candidates as below:

> +         /* Skip AINC candidate since it contains address update itself,
> +            the replicated AINC computations when unrolling still have
> +            updates, unlike reg_offset_p candidates can save step updates
> +            when unrolling.  */
> +         if (cand->ainc_use)
> +           continue;
> +

This newly attached patch is bootstrapped and regression-tested without
explicit unrolling, while with explicit unrolling the only failure in
bootstrapping and regression testing has been identified as the
rs6000-specific address_cost issue.

By the way, with the above simple address_cost hack, I also did one
bootstrap and regression test with explicit unrolling; the above
sms-4.c did pass as I expected, but there were two other failures instead:

  PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors)
  PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts "PHI" 2

On further investigation, the 2nd one is expected due to the address_cost
hook hack, while the 1st one exposed a -fcompare-debug issue in sms.  The RTL
sequence starts to diverge from sms onwards; just some NOTE_INSN_DELETED
positions change.  I believe it's just exposed by this combination
unluckily/luckily ;-)  I will send a separate patch for it once I get time to
look into it, but it should be unrelated to this patch series for sure.

In short, bootstrapping and regression testing with unroll-loops enabled show
that this patch looks fine.

BR,
Kewen
-----
gcc/ChangeLog:

	* tree-ssa-loop-ivopts.c (struct iv_cand): New field reg_offset_p.
	(struct ivopts_data): New field consider_reg_offset_for_unroll_p.
	(mark_reg_offset_candidates): New function.
	(add_candidate_1): Set reg_offset_p to false for new candidate.
	(set_group_iv_cost): Scale up group cost with estimate_unroll_factor if
	consider_reg_offset_for_unroll_p.
	(determine_iv_cost): Increase step cost with estimate_unroll_factor if
	consider_reg_offset_for_unroll_p.
	(tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
	consider_reg_offset_for_unroll_p.

[-- Attachment #2: ivopts_0825.diff --]
[-- Type: text/plain, Size: 7690 bytes --]

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 1d2697ae1ba..4b58b620ddd 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -473,6 +473,9 @@ struct iv_cand
   struct iv *orig_iv;	/* The original iv if this cand is added from biv with
 			   smaller type.  */
   bool doloop_p;	/* Whether this is a doloop candidate.  */
+  bool reg_offset_p;	/* Whether this is available for an address type group
+			   where its all uses are valid to adopt reg_offset
+			   addressing mode even considering unrolling.  */
 };
 
 /* Hashtable entry for common candidate derived from iv uses.  */
@@ -653,6 +656,10 @@ struct ivopts_data
 
   /* Whether the loop has doloop comparison use.  */
   bool doloop_use_p;
+
+  /* Whether need to consider register offset addressing mode for the loop with
+     upcoming unrolling by estimated unroll factor.  */
+  bool consider_reg_offset_for_unroll_p;
 };
 
 /* An assignment of iv candidates to uses.  */
@@ -2731,6 +2738,119 @@ split_address_groups (struct ivopts_data *data)
     }
 }
 
+/* For each address type group, it finds the address-based IV candidates with
+   the same base and step, for those that are available to be used for the
+   whole group with reg_offset addressing mode by considering the address offset
+   difference and increased offset with unrolling factor estimation, mark them
+   as reg_offset_p.  */
+
+static void
+mark_reg_offset_candidates (struct ivopts_data *data)
+{
+  class loop *loop = data->current_loop;
+  gcc_assert (data->current_loop->estimated_unroll > 1);
+  bool any_reg_offset_p = false;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "<Reg_offset_p Candidates>:\n");
+
+  auto valid_reg_offset_p
+    = [] (struct iv_use *use, poly_uint64 off, poly_uint64 max_inc) {
+	if (!addr_offset_valid_p (use, off))
+	  return false;
+	if (!addr_offset_valid_p (use, off + max_inc))
+	  return false;
+	return true;
+      };
+
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+    {
+      struct iv_group *group = data->vgroups[i];
+
+      if (address_p (group->type))
+	{
+	  struct iv_use *head_use = group->vuses[0];
+	  if (!tree_fits_poly_int64_p (head_use->iv->step))
+	    continue;
+
+	  poly_int64 step = tree_to_poly_int64 (head_use->iv->step);
+	  /* Max extra offset to be added due to unrolling.  */
+	  poly_int64 max_increase = (loop->estimated_unroll - 1) * step;
+
+	  tree use_base = head_use->addr_base;
+	  STRIP_NOPS (use_base);
+
+	  struct iv_use *last_use = NULL;
+	  unsigned group_size = group->vuses.length ();
+	  gcc_assert (group_size >= 1);
+	  if (maybe_ne (head_use->addr_offset,
+			group->vuses[group_size - 1]->addr_offset))
+	    last_use = group->vuses[group_size - 1];
+
+	  unsigned j;
+	  bitmap_iterator bi;
+	  EXECUTE_IF_SET_IN_BITMAP (group->related_cands, 0, j, bi)
+	  {
+	    struct iv_cand *cand = data->vcands[j];
+
+	    if (!cand->iv->base_object)
+	      continue;
+
+	    if (cand->reg_offset_p)
+	      continue;
+
+	    /* Skip AINC candidate since it contains address update itself,
+	       the replicated AINC computations when unrolling still have
+	       updates, unlike reg_offset_p candidates can save step updates
+	       when unrolling.  */
+	    if (cand->ainc_use)
+	      continue;
+
+	    if (!operand_equal_p (head_use->iv->base_object,
+				  cand->iv->base_object, 0))
+	      continue;
+
+	    if (!operand_equal_p (head_use->iv->step, cand->iv->step, 0))
+	      continue;
+
+	    poly_uint64 cand_offset = 0;
+	    tree cand_base = strip_offset (cand->iv->base, &cand_offset);
+	    STRIP_NOPS (cand_base);
+	    if (!operand_equal_p (use_base, cand_base, 0))
+	      continue;
+
+	    /* Only need to check the first one and the last one in the group
+	       since it's sorted.  If both are valid, the other intermediate
+	       ones should be in the acceptable range.  */
+	    poly_uint64 head_off = head_use->addr_offset - cand_offset;
+	    if (!valid_reg_offset_p (head_use, head_off, max_increase))
+	      continue;
+
+	    if (last_use)
+	      {
+		poly_int64 last_off = last_use->addr_offset - cand_offset;
+		if (!valid_reg_offset_p (head_use, last_off, max_increase))
+		  continue;
+	      }
+
+	    cand->reg_offset_p = true;
+
+	    if (dump_file && (dump_flags & TDF_DETAILS))
+	      fprintf (dump_file, "  cand %u valid for group %u\n", j, i);
+
+	    if (!any_reg_offset_p)
+	      any_reg_offset_p = true;
+	  }
+	}
+    }
+
+  if (!any_reg_offset_p)
+    data->consider_reg_offset_for_unroll_p = false;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "\n");
+}
+
 /* Finds uses of the induction variables that are interesting.  */
 
 static void
@@ -3147,6 +3267,7 @@ add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
       cand->important = important;
       cand->incremented_at = incremented_at;
       cand->doloop_p = doloop;
+      cand->reg_offset_p = false;
       data->vcands.safe_push (cand);
 
       if (!poly_int_tree_p (step))
@@ -3654,6 +3775,14 @@ set_group_iv_cost (struct ivopts_data *data,
       return;
     }
 
+  /* Since we priced more on non reg_offset IV cand step cost, we should scale
+     up the appropriate IV group costs.  Simply consider USE_COMPARE at the
+     loop exit, FIXME if multiple exits supported or no loop exit comparisons
+     matter.  */
+  if (data->consider_reg_offset_for_unroll_p
+      && group->vuses[0]->type != USE_COMPARE)
+    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
+
   if (data->consider_all_candidates)
     {
       group->cost_map[cand->id].cand = cand;
@@ -5718,6 +5847,9 @@ find_iv_candidates (struct ivopts_data *data)
   if (!data->consider_all_candidates)
     relate_compare_use_with_all_cands (data);
 
+  if (data->consider_reg_offset_for_unroll_p)
+    mark_reg_offset_candidates (data);
+
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       unsigned i;
@@ -5890,6 +6022,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
 
+  /* Consider additional step updates during unrolling.  */
+  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
+    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
+
   /* Prefer the original ivs unless we may gain something by replacing it.
      The reason is to make debugging simpler; so this is not relevant for
      artificial ivs created by other optimization passes.  */
@@ -7976,6 +8112,7 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop,
   data->current_loop = loop;
   data->loop_loc = find_loop_location (loop).get_location_t ();
   data->speed = optimize_loop_for_speed_p (loop);
+  data->consider_reg_offset_for_unroll_p = false;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
@@ -8008,6 +8145,16 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop,
   if (!find_induction_variables (data))
     goto finish;
 
+  if (param_iv_consider_reg_offset_for_unroll != 0 && exit)
+    {
+      tree_niter_desc *desc = niter_for_exit (data, exit);
+      estimate_unroll_factor (loop, desc);
+      data->consider_reg_offset_for_unroll_p = loop->estimated_unroll > 1;
+      if (dump_file && (dump_flags & TDF_DETAILS)
+	  && data->consider_reg_offset_for_unroll_p)
+	fprintf (dump_file, "\nEstimated_unroll:%u\n", loop->estimated_unroll);
+    }
+
   /* Finds interesting uses (item 1).  */
   find_interesting_uses (data);
   if (data->vgroups.length () > MAX_CONSIDERED_GROUPS)


* PING [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2020-05-28 12:19 ` [PATCH 1/4] unroll: Add middle-end unroll factor estimation Kewen.Lin
@ 2020-08-31  5:49   ` Kewen.Lin
  2020-09-15  7:44     ` PING^2 " Kewen.Lin
  2021-01-21 21:45   ` Segher Boessenkool
  1 sibling, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-08-31  5:49 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, Richard Biener, Segher Boessenkool

Hi,

I'd like to gently ping this since the IVOPTs part is ready to land.

https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html

BR,
Kewen

on 2020/5/28 8:19 PM, Kewen.Lin via Gcc-patches wrote:
> 
> gcc/ChangeLog
> 
> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
> 
> 	* cfgloop.h (struct loop): New field estimated_unroll.
> 	* tree-ssa-loop-manip.c (decide_unroll_const_iter): New function.
> 	(decide_unroll_runtime_iter): Likewise.
> 	(decide_unroll_stupid): Likewise.
> 	(estimate_unroll_factor): Likewise.
> 	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declaration.
> 	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
> 	* tree-ssa-loop.h (tree_average_num_loop_insns): New declaration.
> 





* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-08-25 12:46                 ` [PATCH 3/4 v3] " Kewen.Lin
@ 2020-08-31 19:41                   ` Segher Boessenkool
  2020-09-02  3:16                     ` Kewen.Lin
  2020-09-01 11:19                   ` Bin.Cheng
  1 sibling, 1 reply; 64+ messages in thread
From: Segher Boessenkool @ 2020-08-31 19:41 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Bin.Cheng, GCC Patches, bin.cheng, Richard Guenther,
	Bill Schmidt, Richard Sandiford, Jiufu Guo

Hi!

Just a note:

On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
> 1) Currently address_cost hook on rs6000 always return zero, but at least
> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> have to take the address update into account (scalar normal operation).

From Power4 on already (not sure about Power6, but does anyone care?)


Segher


* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-08-25 12:46                 ` [PATCH 3/4 v3] " Kewen.Lin
  2020-08-31 19:41                   ` Segher Boessenkool
@ 2020-09-01 11:19                   ` Bin.Cheng
  2020-09-02  3:50                     ` Kewen.Lin
  2020-09-06  2:47                     ` Hans-Peter Nilsson
  1 sibling, 2 replies; 64+ messages in thread
From: Bin.Cheng @ 2020-09-01 11:19 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford, Jiufu Guo

On Tue, Aug 25, 2020 at 8:47 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Bin,
>
> >>
> >> For one particular case like:
> >>
> >>             for (i = 0; i < SIZE; i++)
> >>               y[i] = a * x[i] + z[i];
> >>
> >> we will mark reg_offset_p for IV candidates on x as below:
> >>    - (unsigned long) (x_18(D) + 8)    // only mark this before.
> >>    - x_18(D) + 8
> >>    - (unsigned long) (x_18(D) + 24)
> >>    - (unsigned long) ((vector(2) double *) (x_18(D) + 8) + 18446744073709551600)
> >>    ...
> >>
> >> Do you mind to have a review again?  Thanks in advance!
> > I trust you with the change.
>
> Thanks again!  Sorry for the late since it took some time to investigate
> the exposed issues.
>
> >>
> >> Bootstrapped/regtested on powerpc64le-linux-gnu P8 and P9.
> >>
> >> SPEC2017 P9 performance run has no remarkable degradations/improvements.
> > Is this run with unroll-loops?
>
> Yes, the options what I used were:
>
>    -Ofast -mcpu=power9 -fpeel-loops -mrecip -funroll-loops
>
> I also re-tested the newly attached patch, nothing changes for SPEC2017 data.
>
> > Could you exercise the code with unroll-loops enabled when
> > bootstrap/regtest please?  It doesn't matter if cases fail with
> > unroll-loops, just making sure there is no fallout.  Otherwise it's
> > fine with me.
>
> Great idea!  With explicitly specified -funroll-loops, it's bootstrapped
> but the regression testing did show one failure (the only one):
>
>   PASS->FAIL: gcc.dg/sms-4.c scan-rtl-dump-times sms "SMS succeeded" 1
>
> It exposes two issues:
>
> 1) Currently address_cost hook on rs6000 always return zero, but at least
> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> have to take the address update into account (scalar normal operation).
> Since IVOPTs reduces the cost_step for ainc candidates, it makes us prefer
> ainc candidates.  In this case, the cand/group cost is -4 (minus cost_step),
> with scaling up, the off becomes much.  With one simple hack on for pre_inc/
> pre_dec in rs6000 address_cost, the case passed.  It should be handled in
> one separated issue.
>
> 2) This case makes me think we should exclude ainc candidates in function
> mark_reg_offset_candidates.  The justification is that: ainc candidate
> handles step update itself and when we calculate the cost for it against
> its ainc_use, the cost_step has been reduced. When unrolling happens,
> the ainc computations are replicated and it doesn't save step updates
> like normal reg_offset_p candidates.
Though an auto-inc candidate embeds the stepping operation into the memory
access, we might want to avoid it in the unrolling case if there are many
sequences of memory accesses and the unroll factor is big.  The
rationale is that the embedded stepping is a u-arch operation and does have
its cost.

>
> I've updated the patch to punt ainc_use candidates as below:
>
> > +         /* Skip AINC candidate since it contains address update itself,
> > +            the replicated AINC computations when unrolling still have
> > +            updates, unlike reg_offset_p candidates can save step updates
> > +            when unrolling.  */
> > +         if (cand->ainc_use)
> > +           continue;
> > +
>
> For this new attached patch, it's bootstrapped and regress-tested without
> explicit unrolling, while the only one failure has been identified as
> rs6000 specific address_cost issue in bootstrapping and regression testing
> with explicit unrolling.
>
> By the way, with above simple hack of address_cost, I also did one
> bootstrapping and regression testing with explicit unrolling, the above
> sms-4.c did pass as I expected but had two failures instead:
>
>   PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors)
>   PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts "PHI" 2
>
> By further investigation, the 2nd one is expected due to the adddress_cost hook
> hack, while the 1st one one exposed -fcompare-debug issue in sms.  The RTL
> sequence starts to off from sms, just some NOTE_INSN_DELETED positions change.
> I believe it's just exposed by this combination unluckily/luckily ;-) I will
> send a patch separately for it once I got time to look into it, but it should
> be unrelated to this patch series for sure.
This is the kind of situation I intended to avoid before.  IMHO, this
isn't a neat change (it can't be, given we are predicting a future
transformation in the compilation pipeline), and an accumulation of such
changes could make IVOPTs break in one way or another.  So as long as you
make sure it doesn't have a functional impact in the no-rtl_unroll case, I
am fine.
>
> In a word, bootstrapping/regress-testing with unroll-loops enabled shows this
> patch looks fine.
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
>         * tree-ssa-loop-ivopts.c (struct iv_cand): New field reg_offset_p.
>         (struct ivopts_data): New field consider_reg_offset_for_unroll_p.
>         (mark_reg_offset_candidates): New function.
>         (add_candidate_1): Set reg_offset_p to false for new candidate.
>         (set_group_iv_cost): Scale up group cost with estimate_unroll_factor if
>         consider_reg_offset_for_unroll_p.
>         (determine_iv_cost): Increase step cost with estimate_unroll_factor if
>         consider_reg_offset_for_unroll_p.
>         (tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
>         consider_reg_offset_for_unroll_p.


* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-08-31 19:41                   ` Segher Boessenkool
@ 2020-09-02  3:16                     ` Kewen.Lin
  2020-09-02 10:25                       ` Segher Boessenkool
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-09-02  3:16 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Bin.Cheng, GCC Patches, bin.cheng, Richard Guenther,
	Bill Schmidt, Richard Sandiford, Jiufu Guo

Hi Segher,

on 2020/9/1 3:41 AM, Segher Boessenkool wrote:
> Hi!
> 
> Just a note:
> 
> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
>> 1) Currently address_cost hook on rs6000 always return zero, but at least
>> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
>> have to take the address update into account (scalar normal operation).
> 
> From Power4 on already (not sure about Power6, but does anyone care?)
> 

Thanks for the information; it looks like this issue has existed for a long time.

BR,
Kewen


* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-01 11:19                   ` Bin.Cheng
@ 2020-09-02  3:50                     ` Kewen.Lin
  2020-09-02  3:55                       ` Bin.Cheng
  2020-09-06  2:47                     ` Hans-Peter Nilsson
  1 sibling, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-09-02  3:50 UTC (permalink / raw)
  To: Bin.Cheng
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford, Jiufu Guo

Hi Bin,

>> 2) This case makes me think we should exclude ainc candidates in function
>> mark_reg_offset_candidates.  The justification is that: ainc candidate
>> handles step update itself and when we calculate the cost for it against
>> its ainc_use, the cost_step has been reduced. When unrolling happens,
>> the ainc computations are replicated and it doesn't save step updates
>> like normal reg_offset_p candidates.
> Though auto-inc candidate embeds stepping operation into memory
> access, we might want to avoid it in case of unroll if there are many
> sequences of memory accesses, and if the unroll factor is big.  The
> rationale is embedded stepping is a u-arch operation and does have its
> cost.
> 

Thanks for the comments!  Agreed!  Excluding them from the reg_offset_p
candidates here is consistent with this expectation; it makes us
consider the unroll factor effect when checking the corresponding
step cost and the embedded stepping cost (in the group/candidate cost,
the step cost is subtracted and the cost from the address_cost hook is used).

>>
>> I've updated the patch to punt ainc_use candidates as below:
>>
>>> +         /* Skip AINC candidate since it contains address update itself,
>>> +            the replicated AINC computations when unrolling still have
>>> +            updates, unlike reg_offset_p candidates can save step updates
>>> +            when unrolling.  */
>>> +         if (cand->ainc_use)
>>> +           continue;
>>> +
>>
>> For this new attached patch, it's bootstrapped and regress-tested without
>> explicit unrolling, while the only one failure has been identified as
>> rs6000 specific address_cost issue in bootstrapping and regression testing
>> with explicit unrolling.
>>
>> By the way, with above simple hack of address_cost, I also did one
>> bootstrapping and regression testing with explicit unrolling, the above
>> sms-4.c did pass as I expected but had two failures instead:
>>
>>   PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors)
>>   PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts "PHI" 2
>>
>> By further investigation, the 2nd one is expected due to the adddress_cost hook
>> hack, while the 1st one one exposed -fcompare-debug issue in sms.  The RTL
>> sequence starts to off from sms, just some NOTE_INSN_DELETED positions change.
>> I believe it's just exposed by this combination unluckily/luckily ;-) I will
>> send a patch separately for it once I got time to look into it, but it should
>> be unrelated to this patch series for sure.
> This is the kind of situation I intended to avoid before.  IMHO, this
> isn't a neat change (it can't be given we are predicting the future
> transformation in compilation pipeline), accumulation of such changes
> could make IVOPTs break in one way or another.  So as long as you make
> sure it doesn't have functional impact in case of no-rtl_unroll, I am
> fine.

Yeah, I admit it's not neat, but the proposals in the previous discussions
that don't predict the unroll factor cannot work well for all scenarios with
different unroll factors; they could over-blame some kinds of candidates.
For the no-rtl_unroll case, the unroll factor estimation should set
loop->estimated_unroll to zero, so none of these changes take effect.  The
estimation function follows the same logic as the RTL unroll factor
calculation; I did test with explicit unrolling disabled before, and it
worked as expected.

BR,
Kewen


* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-02  3:50                     ` Kewen.Lin
@ 2020-09-02  3:55                       ` Bin.Cheng
  2020-09-02  4:51                         ` Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Bin.Cheng @ 2020-09-02  3:55 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford, Jiufu Guo

On Wed, Sep 2, 2020 at 11:50 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Bin,
>
> >> 2) This case makes me think we should exclude ainc candidates in function
> >> mark_reg_offset_candidates.  The justification is that: ainc candidate
> >> handles step update itself and when we calculate the cost for it against
> >> its ainc_use, the cost_step has been reduced. When unrolling happens,
> >> the ainc computations are replicated and it doesn't save step updates
> >> like normal reg_offset_p candidates.
> > Though auto-inc candidate embeds stepping operation into memory
> > access, we might want to avoid it in case of unroll if there are many
> > sequences of memory accesses, and if the unroll factor is big.  The
> > rationale is embedded stepping is a u-arch operation and does have its
> > cost.
> >
>
> Thanks for the comments!  Agree!  Excluding them from reg_offset_p
> candidates here is consistent with this expectation, it makes us
> consider the unroll factor effect when checking the corresponding
> step cost and the embedded stepping cost (in group/candidate cost,
> minus step cost and use the cost from the address_cost hook).
>
> >>
> >> I've updated the patch to punt ainc_use candidates as below:
> >>
> >>> +         /* Skip AINC candidate since it contains address update itself,
> >>> +            the replicated AINC computations when unrolling still have
> >>> +            updates, unlike reg_offset_p candidates can save step updates
> >>> +            when unrolling.  */
> >>> +         if (cand->ainc_use)
> >>> +           continue;
> >>> +
> >>
> >> For this new attached patch, it's bootstrapped and regress-tested without
> >> explicit unrolling, while the only one failure has been identified as
> >> rs6000 specific address_cost issue in bootstrapping and regression testing
> >> with explicit unrolling.
> >>
> >> By the way, with above simple hack of address_cost, I also did one
> >> bootstrapping and regression testing with explicit unrolling, the above
> >> sms-4.c did pass as I expected but had two failures instead:
> >>
> >>   PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors)
> >>   PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts "PHI" 2
> >>
> >> By further investigation, the 2nd one is expected due to the adddress_cost hook
> >> hack, while the 1st one one exposed -fcompare-debug issue in sms.  The RTL
> >> sequence starts to off from sms, just some NOTE_INSN_DELETED positions change.
> >> I believe it's just exposed by this combination unluckily/luckily ;-) I will
> >> send a patch separately for it once I got time to look into it, but it should
> >> be unrelated to this patch series for sure.
> > This is the kind of situation I intended to avoid before.  IMHO, this
> > isn't a neat change (it can't be given we are predicting the future
> > transformation in compilation pipeline), accumulation of such changes
> > could make IVOPTs break in one way or another.  So as long as you make
> > sure it doesn't have functional impact in case of no-rtl_unroll, I am
> > fine.
>
> Yeah, I admit it's not neat, but the proposals in the previous discussions
> without predicting unroll factor can not work well for all scenarios with
> different unroll factors, they could over-blame some kind of candidates.
> For the case of no-rtl_unroll, unroll factor estimation should set
> loop->estimated_unroll to zero, all these changes won't take effect. The
> estimation function follows the same logics as that of RTL unroll factor
> calculation, I did test with explicit unrolling disablement before, it
> worked expectedly.
Thanks for working on this, and sorry for nitpicking.

Thanks,
bin


* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-02  3:55                       ` Bin.Cheng
@ 2020-09-02  4:51                         ` Kewen.Lin
  0 siblings, 0 replies; 64+ messages in thread
From: Kewen.Lin @ 2020-09-02  4:51 UTC (permalink / raw)
  To: Bin.Cheng
  Cc: GCC Patches, bin.cheng, Richard Guenther, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford, Jiufu Guo

Hi Bin,

>>>> I've updated the patch to punt ainc_use candidates as below:
>>>>
>>>>> +         /* Skip AINC candidate since it contains address update itself,
>>>>> +            the replicated AINC computations when unrolling still have
>>>>> +            updates, unlike reg_offset_p candidates can save step updates
>>>>> +            when unrolling.  */
>>>>> +         if (cand->ainc_use)
>>>>> +           continue;
>>>>> +
>>>>
>>>> For this new attached patch, it's bootstrapped and regress-tested without
>>>> explicit unrolling, while the only one failure has been identified as
>>>> rs6000 specific address_cost issue in bootstrapping and regression testing
>>>> with explicit unrolling.
>>>>
>>>> By the way, with above simple hack of address_cost, I also did one
>>>> bootstrapping and regression testing with explicit unrolling, the above
>>>> sms-4.c did pass as I expected but had two failures instead:
>>>>
>>>>   PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors)
>>>>   PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts "PHI" 2
>>>>
>>>> By further investigation, the 2nd one is expected due to the adddress_cost hook
>>>> hack, while the 1st one one exposed -fcompare-debug issue in sms.  The RTL
>>>> sequence starts to off from sms, just some NOTE_INSN_DELETED positions change.
>>>> I believe it's just exposed by this combination unluckily/luckily ;-) I will
>>>> send a patch separately for it once I got time to look into it, but it should
>>>> be unrelated to this patch series for sure.
>>> This is the kind of situation I intended to avoid before.  IMHO, this
>>> isn't a neat change (it can't be given we are predicting the future
>>> transformation in compilation pipeline), accumulation of such changes
>>> could make IVOPTs break in one way or another.  So as long as you make
>>> sure it doesn't have functional impact in case of no-rtl_unroll, I am
>>> fine.
>>
>> Yeah, I admit it's not neat, but the proposals in the previous discussions
>> without predicting unroll factor can not work well for all scenarios with
>> different unroll factors, they could over-blame some kind of candidates.
>> For the case of no-rtl_unroll, unroll factor estimation should set
>> loop->estimated_unroll to zero, all these changes won't take effect. The
                         ~~~~~~~~
Oops, one correction: it should be set to *one* rather than zero.  My memory... :(

>> estimation function follows the same logics as that of RTL unroll factor
>> calculation, I did test with explicit unrolling disablement before, it
>> worked expectedly.
> Thanks for working on this, also sorry for being nitpicking.

Oh, not at all!  Your constructive advice and comments made me consider and
test more scenarios that I hadn't realized before; it's really helpful
for making this patch better, I really appreciate that!!!

BR,
Kewen


* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-02  3:16                     ` Kewen.Lin
@ 2020-09-02 10:25                       ` Segher Boessenkool
  2020-09-03  2:24                         ` Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Segher Boessenkool @ 2020-09-02 10:25 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Bin.Cheng, GCC Patches, bin.cheng, Richard Guenther,
	Bill Schmidt, Richard Sandiford, Jiufu Guo

Hi!

On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote:
> on 2020/9/1 3:41 AM, Segher Boessenkool wrote:
> > On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
> >> 1) Currently address_cost hook on rs6000 always return zero, but at least
> >> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> >> have to take the address update into account (scalar normal operation).
> > 
> > From Power4 on already (not sure about Power6, but does anyone care?)
> 
> Thanks for the information, it looks this issue exists for a long time.

Well, *is* it an issue?  The addressing doesn't get more expensive...
For example, an
  ldu 3,16(4)
is cracked to an
  ld 3,16(4)
and an
  addi 4,4,16
(the addi is not on the critical path of the load).  So it seems to me
this shouldn't increase the addressing cost at all?  (The instruction of
course is really two insns in one.)


Segher


* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-02 10:25                       ` Segher Boessenkool
@ 2020-09-03  2:24                         ` Kewen.Lin
  2020-09-03 22:37                           ` Segher Boessenkool
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-09-03  2:24 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Bin.Cheng, GCC Patches, bin.cheng, Richard Guenther,
	Bill Schmidt, Richard Sandiford, Jiufu Guo

Hi Segher,

on 2020/9/2 下午6:25, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote:
>> on 2020/9/1 上午3:41, Segher Boessenkool wrote:
>>> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
>>>> 1) Currently address_cost hook on rs6000 always return zero, but at least
>>>> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
>>>> have to take the address update into account (scalar normal operation).
>>>
>>> From Power4 on already (not sure about Power6, but does anyone care?)
>>
>> Thanks for the information, it looks this issue exists for a long time.
> 
> Well, *is* it an issue?  The addressing doesn't get more expensive...
> For example, an
>   ldu 3,16(4)
> is cracked to an
>   ld 3,16(4)
> and an
>   addi 4,4,16
> (the addi is not on the critical path of the load).  So it seems to me
> this shouldn't increase the addressing cost at all?  (The instruction of
> course is really two insns in one.)
> 

Good question!  I agree that they can execute in parallel, but it depends
on how we interpret the addressing cost.  If it stands for the required
execution resources, I think the current value is off, since compared with
ld, the ldu has two iops and an extra ALU requirement.  I'm not sure about
its usage elsewhere, but in the context of IVOPTs on Power, for one normal
candidate the step cost is 4 and the cost for group (1) is zero, so the
total cost is 4 for this combination, for a scenario like:
    ldx rx, iv     // (1)
    ...
    iv = iv + step // (2)

While for an ainc_use candidate (like ldu), its step cost is also 4, but the
cost for group (1) is -4 (minus the step cost), so the total cost is 0.  That
effectively says the step update is free.
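(For comparison, the ainc_use shape being costed there looks roughly like
this in the same pseudo notation:

    ldu rx, iv     // (1') load plus the iv += step embedded in one cracked insn

so the separate step update (2) disappears as a statement, which is where
the minus step cost comes from.)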

We can also see that (1) and (2) can execute in parallel (within the same
iteration).  If we consider the next iteration, there is a dependency, but
the same holds for ldu.  So basically they are similar, and I think it's
unfair to have this difference in the cost modeling.  The cracked addi
should have its cost here.  Does that make sense?

Apart from that, one P9-specific point is that the update-form load isn't
preferred: the instruction cannot retire until both parts complete, so it
can hold up subsequent instructions from retiring.  If the addi stalls
(starvation), the instruction cannot retire and things can get stuck.  That
also seems like something we could model here?

BR,
Kewen

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-03  2:24                         ` Kewen.Lin
@ 2020-09-03 22:37                           ` Segher Boessenkool
  2020-09-04  8:27                             ` Bin.Cheng
                                               ` (2 more replies)
  0 siblings, 3 replies; 64+ messages in thread
From: Segher Boessenkool @ 2020-09-03 22:37 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Bin.Cheng, GCC Patches, bin.cheng, Richard Guenther,
	Bill Schmidt, Richard Sandiford, Jiufu Guo

On Thu, Sep 03, 2020 at 10:24:21AM +0800, Kewen.Lin wrote:
> on 2020/9/2 下午6:25, Segher Boessenkool wrote:
> > On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote:
> >> on 2020/9/1 上午3:41, Segher Boessenkool wrote:
> >>> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
> >>>> 1) Currently address_cost hook on rs6000 always return zero, but at least
> >>>> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> >>>> have to take the address update into account (scalar normal operation).
> >>>
> >>> From Power4 on already (not sure about Power6, but does anyone care?)
> >>
> >> Thanks for the information, it looks this issue exists for a long time.
> > 
> > Well, *is* it an issue?  The addressing doesn't get more expensive...
> > For example, an
> >   ldu 3,16(4)
> > is cracked to an
> >   ld 3,16(4)
> > and an
> >   addi 4,4,16
> > (the addi is not on the critical path of the load).  So it seems to me
> > this shouldn't increase the addressing cost at all?  (The instruction of
> > course is really two insns in one.)
> 
> Good question!  I agree that they can execute in parallel, but it depends
> on how we interprete the addressing cost, if it's for required execution
> resource, I think it's off, since comparing with ld, the ldu has two iops
> and extra ALU requirement.

OTOH, if you do not use an ldu you need to use a real addi insn, which
gives you all the same cost (plus it takes more code space and decode etc.
resources).

> I'm not sure its usage elsewhere, but in the
> context of IVOPTs on Power, for one normal candidate, its step cost is 4,
> the cost for group (1) is zero, total cost is 4 for this combination.
> for the scenario like:
>     ldx rx, iv     // (1)
>     ...
>     iv = iv + step // (2)
> 
> While for ainc_use candidate (like ldu), its step cost is 4, but the cost
> for group (1) is (-4 // minus step cost), total cost is 0.  It looks to
> say the step update is free.

That seems wrong, but address_cost is used in more places, so that is not
the place to fix this?

> We can also see (1) and (2) can also execute in parallel (same iteration).
> If we consider the next iteration, it will have the dependency, but it's
> the same for ldu.  So basically they are similar, I think it's unfair to
> have this difference in the cost modeling.  The cracked addi should have
> its cost here.  Does it make sense?

It should have cost, certainly, but not address_cost I think.  The total
cost of an ldu should be a tiny bit less than that of ld + that of addi;
the address_cost of ldu should be the same as that of ld.

> Apart from that, one P9 specific point is that the update form load isn't
> preferred,  the reason is that the instruction can not retire until both
> parts complete, it can hold up subsequent instructions from retiring.
> If the addi stalls (starvation), the instruction can not retire and can
> cause things stuck.  It seems also something we can model here?

This is (almost) no problem on p9, since we no longer have issue groups.
It can hold up older insns from retiring, sure, but they *will* have
finished, and p9 can retire 64 insns per cycle.  The "completion wall"
is gone.  The only problem is if things stick around so long that
resources run out...  but you're talking 100s of insns there.


Segher

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-03 22:37                           ` Segher Boessenkool
@ 2020-09-04  8:27                             ` Bin.Cheng
  2020-09-04 13:53                               ` Segher Boessenkool
  2020-09-04  8:47                             ` Kewen.Lin
  2020-09-17 23:12                             ` Jeff Law
  2 siblings, 1 reply; 64+ messages in thread
From: Bin.Cheng @ 2020-09-04  8:27 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Kewen.Lin, GCC Patches, bin.cheng, Richard Guenther,
	Bill Schmidt, Richard Sandiford, Jiufu Guo

On Fri, Sep 4, 2020 at 6:37 AM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Thu, Sep 03, 2020 at 10:24:21AM +0800, Kewen.Lin wrote:
> > on 2020/9/2 下午6:25, Segher Boessenkool wrote:
> > > On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote:
> > >> on 2020/9/1 上午3:41, Segher Boessenkool wrote:
> > >>> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
> > >>>> 1) Currently address_cost hook on rs6000 always return zero, but at least
> > >>>> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> > >>>> have to take the address update into account (scalar normal operation).
> > >>>
> > >>> From Power4 on already (not sure about Power6, but does anyone care?)
> > >>
> > >> Thanks for the information, it looks this issue exists for a long time.
> > >
> > > Well, *is* it an issue?  The addressing doesn't get more expensive...
> > > For example, an
> > >   ldu 3,16(4)
> > > is cracked to an
> > >   ld 3,16(4)
> > > and an
> > >   addi 4,4,16
> > > (the addi is not on the critical path of the load).  So it seems to me
> > > this shouldn't increase the addressing cost at all?  (The instruction of
> > > course is really two insns in one.)
> >
> > Good question!  I agree that they can execute in parallel, but it depends
> > on how we interprete the addressing cost, if it's for required execution
> > resource, I think it's off, since comparing with ld, the ldu has two iops
> > and extra ALU requirement.
>
> OTOH, if you do not use an ldu you need to use a real addi insn, which
> gives you all the same cost (plus it takes more code space and decode etc.
> resources).
>
> > I'm not sure its usage elsewhere, but in the
> > context of IVOPTs on Power, for one normal candidate, its step cost is 4,
> > the cost for group (1) is zero, total cost is 4 for this combination.
> > for the scenario like:
> >     ldx rx, iv     // (1)
> >     ...
> >     iv = iv + step // (2)
> >
> > While for ainc_use candidate (like ldu), its step cost is 4, but the cost
> > for group (1) is (-4 // minus step cost), total cost is 0.  It looks to
> > say the step update is free.
>
> That seems wrong, but the address_cost is used in more places, that is
> not where to fix this?
>
> > We can also see (1) and (2) can also execute in parallel (same iteration).
> > If we consider the next iteration, it will have the dependency, but it's
> > the same for ldu.  So basically they are similar, I think it's unfair to
> > have this difference in the cost modeling.  The cracked addi should have
> > its cost here.  Does it make sense?
>
> It should have cost, certainly, but not address_cost I think.  The total
> cost of an ldu should be a tiny bit less than that of ld + that of addi;
> the address_cost of ldu should be the same as that of ld.
Hi Segher,
In simple cases, yes, and that is also the (rough) idea behind modeling
auto-inc addressing modes in ivopts; however, things are different if the
loop gets complicated.  Consider the case of choosing 10 auto-inc
addressing modes/candidates vs. [base_x + iv_index]: the latter only needs
one add instruction, while the former needs 10 embedded auto-inc operations.
Another issue is register pressure: choosing auto-inc candidates could
result in more IVs, while choosing iv_index results in one IV (and more
base pointers); however, spilling a base pointer (which is loop invariant)
is usually cheaper than spilling an IV.
Another issue is that auto-inc candidates probably lead to more bloated
setup code in the preheader BB, due to problems in expression
canonicalization, CSE, etc.
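As a rough illustration, a loop of the kind being discussed could look like
this (hypothetical code, names made up for the example):

  /* Several independent memory streams.  With auto-inc candidates each
     pointer carries its own embedded update, while [base + i] lets all
     streams share a single index IV and a single add per iteration.  */
  void
  scale (double *a0, double *a1, double *a2, const double *b0,
         const double *b1, const double *b2, double k, long n)
  {
    for (long i = 0; i < n; i++)
      {
        a0[i] = b0[i] * k;
        a1[i] = b1[i] * k;
        a2[i] = b2[i] * k;   /* ... imagine ten such streams ... */
      }
  }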

So it's not that easy to answer the question for complicated cases.
As for simple cases, the current model works fine with auto-inc
(somehow) preferred.

Thanks,
bin
>
> > Apart from that, one P9 specific point is that the update form load isn't
> > preferred,  the reason is that the instruction can not retire until both
> > parts complete, it can hold up subsequent instructions from retiring.
> > If the addi stalls (starvation), the instruction can not retire and can
> > cause things stuck.  It seems also something we can model here?
>
> This is (almost) no problem on p9, since we no longer have issue groups.
> It can hold up older insns from retiring, sure, but they *will* have
> finished, and p9 can retire 64 insns per cycle.  The "completion wall"
> is gone.  The only problem is if things stick around so long that
> resources run out...  but you're talking 100s of insns there.
>
>
> Segher

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-03 22:37                           ` Segher Boessenkool
  2020-09-04  8:27                             ` Bin.Cheng
@ 2020-09-04  8:47                             ` Kewen.Lin
  2020-09-04 14:16                               ` Segher Boessenkool
  2020-09-17 23:12                             ` Jeff Law
  2 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-09-04  8:47 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Bin.Cheng, GCC Patches, bin.cheng, Richard Guenther,
	Bill Schmidt, Richard Sandiford, Jiufu Guo

Hi Segher,

>> Good question!  I agree that they can execute in parallel, but it depends
>> on how we interprete the addressing cost, if it's for required execution
>> resource, I think it's off, since comparing with ld, the ldu has two iops
>> and extra ALU requirement.
> 
> OTOH, if you do not use an ldu you need to use a real addi insn, which
> gives you all the same cost (plus it takes more code space and decode etc.
> resources).

Agreed.

> 
>> I'm not sure its usage elsewhere, but in the
>> context of IVOPTs on Power, for one normal candidate, its step cost is 4,
>> the cost for group (1) is zero, total cost is 4 for this combination.
>> for the scenario like:
>>     ldx rx, iv     // (1)
>>     ...
>>     iv = iv + step // (2)
>>
>> While for ainc_use candidate (like ldu), its step cost is 4, but the cost
>> for group (1) is (-4 // minus step cost), total cost is 0.  It looks to
>> say the step update is free.
> 
> That seems wrong, but the address_cost is used in more places, that is
> not where to fix this?

Good point, I had this question in mind too.  It's used in several places,
one of them even with a magic number; I planned to check all of its usages
once I started to investigate it.  But as your comment below suggests, this
hook doesn't look like the appropriate place.

> 
>> We can also see (1) and (2) can also execute in parallel (same iteration).
>> If we consider the next iteration, it will have the dependency, but it's
>> the same for ldu.  So basically they are similar, I think it's unfair to
>> have this difference in the cost modeling.  The cracked addi should have
>> its cost here.  Does it make sense?
> 
> It should have cost, certainly, but not address_cost I think.  The total
> cost of an ldu should be a tiny bit less than that of ld + that of addi;
> the address_cost of ldu should be the same as that of ld.

OK, I'll check whether there is some other way to handle this in the
context of IVOPTs.  Good to see that we agree the current modeling is a
bit off on Power.  :)

> 
>> Apart from that, one P9 specific point is that the update form load isn't
>> preferred,  the reason is that the instruction can not retire until both
>> parts complete, it can hold up subsequent instructions from retiring.
>> If the addi stalls (starvation), the instruction can not retire and can
>> cause things stuck.  It seems also something we can model here?
> 
> This is (almost) no problem on p9, since we no longer have issue groups.
> It can hold up older insns from retiring, sure, but they *will* have
> finished, and p9 can retire 64 insns per cycle.  The "completion wall"
> is gone.  The only problem is if things stick around so long that
> resources run out...  but you're talking 100s of insns there.
> 

Theoretically it's fine, but addi starvation was observed in FP/SIMD-intensive
loop code, and it did cause some performance degradation.  :(

BR,
Kewen

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-04  8:27                             ` Bin.Cheng
@ 2020-09-04 13:53                               ` Segher Boessenkool
  0 siblings, 0 replies; 64+ messages in thread
From: Segher Boessenkool @ 2020-09-04 13:53 UTC (permalink / raw)
  To: Bin.Cheng
  Cc: Kewen.Lin, GCC Patches, bin.cheng, Richard Guenther,
	Bill Schmidt, Richard Sandiford, Jiufu Guo

Hi Bin,

On Fri, Sep 04, 2020 at 04:27:32PM +0800, Bin.Cheng wrote:
> On Fri, Sep 4, 2020 at 6:37 AM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> > It should have cost, certainly, but not address_cost I think.  The total
> > cost of an ldu should be a tiny bit less than that of ld + that of addi;
> > the address_cost of ldu should be the same as that of ld.
> Hi Segher,
> In simple cases, yes, and it is also the (rough) idea of modeling
> auto-inc addressing mode in ivopts, however, things are different if
> loop gets complicated.

The address_cost function is used for many other things, not just
ivopts, so this shouldn't be done there.  That is all :-)

> Considering the case choosing 10 auto-inc
> addressing_mode/candidate vs. [base_x + iv_index].  The latter only
> needs one add instruction, while the former needs 10 embedded auto-inc
> operations.

Yeah.

> Another issue is register pressure, choosing auto-inc candidates could
> result in more IV, while choosing IV_index results in one IV (and more
> Base pointers), however, spilling base pointer (which is loop
> invariant) is usually cheaper than IV.
> Another issue is auto-inc candidates probably lead to more bloated
> setup code in the preheader BB, due to problems in expression
> canonicalization, CSE, etc..
> 
> So it's not that easy to answer the question for complicated cases.
> As for simple cases, the current model works fine with auto-inc
> (somehow) preferred.

Right, I wasn't saying that at all, sorry if I confused things.

Thanks,


Segher

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-04  8:47                             ` Kewen.Lin
@ 2020-09-04 14:16                               ` Segher Boessenkool
  2020-09-04 15:47                                 ` Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Segher Boessenkool @ 2020-09-04 14:16 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Bin.Cheng, GCC Patches, bin.cheng, Richard Guenther,
	Bill Schmidt, Richard Sandiford, Jiufu Guo

Hi!

On Fri, Sep 04, 2020 at 04:47:37PM +0800, Kewen.Lin wrote:
> >> Apart from that, one P9 specific point is that the update form load isn't
> >> preferred,  the reason is that the instruction can not retire until both
> >> parts complete, it can hold up subsequent instructions from retiring.
> >> If the addi stalls (starvation), the instruction can not retire and can
> >> cause things stuck.  It seems also something we can model here?
> > 
> > This is (almost) no problem on p9, since we no longer have issue groups.
> > It can hold up older insns from retiring, sure, but they *will* have
> > finished, and p9 can retire 64 insns per cycle.  The "completion wall"
> > is gone.  The only problem is if things stick around so long that
> > resources run out...  but you're talking 100s of insns there.
> 
> Theoretically it's fine, but the addi starvation was observed in the FP/SIMD
> instructions intensive loop code, which did cause some worse performance.  :(

"addi starvation" has nothing to do with addi (it also happens for other
insns), and nothing with update form memory insns either.  What happens
is simply that no shorter latency insns are issued by the core so long
as longer latency insns (like most float insns) are available.  So in
really nice floating point loops we execute the few integer add insns
much too late, much later than they were in the machine code, which then
makes the memory insns late as well, etc.
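A rough sketch of such a loop body (hypothetical code, not taken from the
testcase):

  lfd    1,0(9)
  lfd    2,0(10)
  fmadd  31,1,2,31     # long-latency FP work the issue logic keeps picking
  ...                  # many more FP insns
  addi   9,9,8         # the few short-latency adds end up issuing very late,
  addi   10,10,8       # which then delays the next iteration's loads
  bdnz   .L2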


Segher

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-04 14:16                               ` Segher Boessenkool
@ 2020-09-04 15:47                                 ` Kewen.Lin
  0 siblings, 0 replies; 64+ messages in thread
From: Kewen.Lin @ 2020-09-04 15:47 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Bin.Cheng, GCC Patches, bin.cheng, Richard Guenther,
	Bill Schmidt, Richard Sandiford, Jiufu Guo

Hi Segher,

on 2020/9/4 下午10:16, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Sep 04, 2020 at 04:47:37PM +0800, Kewen.Lin wrote:
>>>> Apart from that, one P9 specific point is that the update form load isn't
>>>> preferred,  the reason is that the instruction can not retire until both
>>>> parts complete, it can hold up subsequent instructions from retiring.
>>>> If the addi stalls (starvation), the instruction can not retire and can
>>>> cause things stuck.  It seems also something we can model here?
>>>
>>> This is (almost) no problem on p9, since we no longer have issue groups.
>>> It can hold up older insns from retiring, sure, but they *will* have
>>> finished, and p9 can retire 64 insns per cycle.  The "completion wall"
>>> is gone.  The only problem is if things stick around so long that
>>> resources run out...  but you're talking 100s of insns there.
>>
>> Theoretically it's fine, but the addi starvation was observed in the FP/SIMD
>> instructions intensive loop code, which did cause some worse performance.  :(
> 
> "addi starvation" has nothing to do with addi (it also happens for other
> insns), and nothing with update form memory insns either.  What happens
> is simply that no shorter latency insns are issued by the core so long
> as longer latency insns (like most float insns) are available.  So in
> really nice floating point loops we execute the few integer add insns
> much too late, much later than they were in the machine code, which then
> makes the memory insns late as well, etc.
> 

Yeah, the starvation issue isn't addi-specific, but in FP/SIMD-intensive
loops, addi/add makes up most (or all) of the shorter-latency insns most of
the time, so I'd argue it's related. :)  Since they are mainly IV updates,
the memory insns depend on them and the FP/SIMD insns depend on the memory
insns, so it can easily cause a chain reaction of stalls; I guess that's
why some people call it "addi starvation".

As in the example Bin gave in another email, more auto-inc candidates mean
more IV updates (cracked addis); if one or a few common index IVs can be
shared among the memory insns (fewer addis), we can reduce the number of
shorter-latency insns.  As far as I know, some compilers do implement a
preference against auto-inc candidates, which can mitigate the starvation
issue in those FP/SIMD-intensive loops to some extent.
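In the pseudo notation from earlier in the thread, the contrast is roughly
(register and IV names are just placeholders):

    // per-stream auto-inc candidates: one cracked update per access
    ldu ra, iva        // iva += step inside the insn
    ldu rb, ivb        // ivb += step inside the insn
    ldu rc, ivc        // ivc += step inside the insn

    // one shared index IV: the updates collapse into a single add
    ldx ra, base1, iv
    ldx rb, base2, iv
    ldx rc, base3, iv
    iv = iv + step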

BR,
Kewen

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-01 11:19                   ` Bin.Cheng
  2020-09-02  3:50                     ` Kewen.Lin
@ 2020-09-06  2:47                     ` Hans-Peter Nilsson
  2020-09-15  7:41                       ` Kewen.Lin
  1 sibling, 1 reply; 64+ messages in thread
From: Hans-Peter Nilsson @ 2020-09-06  2:47 UTC (permalink / raw)
  To: Bin.Cheng
  Cc: Kewen.Lin, bin.cheng, Segher Boessenkool, GCC Patches,
	Bill Schmidt, Richard Guenther

On Tue, 1 Sep 2020, Bin.Cheng via Gcc-patches wrote:
> > Great idea!  With explicitly specified -funroll-loops, it's bootstrapped
> > but the regression testing did show one failure (the only one):
> >
> >   PASS->FAIL: gcc.dg/sms-4.c scan-rtl-dump-times sms "SMS succeeded" 1
> >
> > It exposes two issues:
> >
> > 1) Currently address_cost hook on rs6000 always return zero, but at least
> > from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
> > have to take the address update into account (scalar normal operation).
> > Since IVOPTs reduces the cost_step for ainc candidates, it makes us prefer
> > ainc candidates.  In this case, the cand/group cost is -4 (minus cost_step),
> > with scaling up, the off becomes much.  With one simple hack on for pre_inc/
> > pre_dec in rs6000 address_cost, the case passed.  It should be handled in
> > one separated issue.
> >
> > 2) This case makes me think we should exclude ainc candidates in function
> > mark_reg_offset_candidates.  The justification is that: ainc candidate
> > handles step update itself and when we calculate the cost for it against
> > its ainc_use, the cost_step has been reduced. When unrolling happens,
> > the ainc computations are replicated and it doesn't save step updates
> > like normal reg_offset_p candidates.
> Though auto-inc candidate embeds stepping operation into memory
> access, we might want to avoid it in case of unroll if there are many
> sequences of memory accesses, and if the unroll factor is big.  The
> rationale is embedded stepping is a u-arch operation and does have its
> cost.

Forgive me for barging in here (though the context is powerpc, the
dialogue and the patch seem to be about generic ivopts), but I hope
that's not a general remark about auto-inc (always) having a cost?

For some architectures, auto-inc *is* free, as free as
register-indirect, so the more auto-inc use, the better.  All
this should be reflected by the address-cost, IMHO, and not
hardcoded into ivopts.
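For instance, a target where the embedded update really does cost something
could (hypothetically) express that in its hook along these lines, while a
target where auto-inc is free would simply not add anything; this is only a
sketch, not the actual rs6000 (or any other) implementation:

  static int
  sometarget_address_cost (rtx addr, machine_mode, addr_space_t, bool)
  {
    if (GET_CODE (addr) == PRE_INC
        || GET_CODE (addr) == PRE_DEC
        || GET_CODE (addr) == PRE_MODIFY)
      return COSTS_N_INSNS (1);  /* Charge the implicit pointer update.  */
    return 0;
  }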

brgds, H-P

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-06  2:47                     ` Hans-Peter Nilsson
@ 2020-09-15  7:41                       ` Kewen.Lin
  0 siblings, 0 replies; 64+ messages in thread
From: Kewen.Lin @ 2020-09-15  7:41 UTC (permalink / raw)
  To: Hans-Peter Nilsson
  Cc: Bin.Cheng, bin.cheng, Segher Boessenkool, GCC Patches,
	Bill Schmidt, Richard Guenther

Hi Hans,

on 2020/9/6 上午10:47, Hans-Peter Nilsson wrote:
> On Tue, 1 Sep 2020, Bin.Cheng via Gcc-patches wrote:
>>> Great idea!  With explicitly specified -funroll-loops, it's bootstrapped
>>> but the regression testing did show one failure (the only one):
>>>
>>>   PASS->FAIL: gcc.dg/sms-4.c scan-rtl-dump-times sms "SMS succeeded" 1
>>>
>>> It exposes two issues:
>>>
>>> 1) Currently address_cost hook on rs6000 always return zero, but at least
>>> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
>>> have to take the address update into account (scalar normal operation).
>>> Since IVOPTs reduces the cost_step for ainc candidates, it makes us prefer
>>> ainc candidates.  In this case, the cand/group cost is -4 (minus cost_step),
>>> with scaling up, the off becomes much.  With one simple hack on for pre_inc/
>>> pre_dec in rs6000 address_cost, the case passed.  It should be handled in
>>> one separated issue.
>>>
>>> 2) This case makes me think we should exclude ainc candidates in function
>>> mark_reg_offset_candidates.  The justification is that: ainc candidate
>>> handles step update itself and when we calculate the cost for it against
>>> its ainc_use, the cost_step has been reduced. When unrolling happens,
>>> the ainc computations are replicated and it doesn't save step updates
>>> like normal reg_offset_p candidates.
>> Though auto-inc candidate embeds stepping operation into memory
>> access, we might want to avoid it in case of unroll if there are many
>> sequences of memory accesses, and if the unroll factor is big.  The
>> rationale is embedded stepping is a u-arch operation and does have its
>> cost.
> 
> Forgive me for barging in here (though the context is powerpc,
> the dialogue and the patch seems to be generic ivopts), but
> that's not a general remark I hope, about auto-inc (always)
> having a cost?
> 
> For some architectures, auto-inc *is* free, as free as
> register-indirect, so the more auto-inc use, the better.  All
> this should be reflected by the address-cost, IMHO, and not
> hardcoded into ivopts.
> 

Yeah, ivopts doesn't hardcode the cost for auto-inc (always); instead it
lets targets set the cost themselves through the address_cost hook.  As in
the function get_address_cost_ainc, it checks whether auto-inc operations
are supported and then derives the cost from the address_cost hook.

One example on Power is listed below:

Group 0:
  cand  cost    compl.  inv.expr.       inv.vars
  1     4       1       NIL;    1
  3     0       0       NIL;    NIL;
  4     0       1       NIL;    1
  5     0       1       NIL;    NIL;
  13    0       1       NIL;    NIL;
  18    -4      0       NIL;    NIL;

Cand 18 is an auto-inc candidate, whose group 0/cand cost is -4 (minus
step_cost).  The iv_cost of cand 18 is 5 (step_cost + non-original-iv
cost), so when it's selected the step_cost parts cancel out and the
remaining cost (1) is for the non-original IV.  That shows no hardcoded
cost is attached to this auto-inc candidate.
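Spelling out the selection arithmetic with the numbers above:

    cost(cand 18 for group 0) + iv_cost(cand 18)
      = (0 - step_cost) + (step_cost + non-original-iv cost)
      = -4 + 5
      = 1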

I guess some misunderstanding arose from the discussion above.  Sorry if
some of my previous comments misled you.

BR,
Kewen

^ permalink raw reply	[flat|nested] 64+ messages in thread

* PING^2 [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2020-08-31  5:49   ` PING " Kewen.Lin
@ 2020-09-15  7:44     ` Kewen.Lin
  2020-10-13  7:06       ` PING^3 " Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-09-15  7:44 UTC (permalink / raw)
  To: GCC Patches
  Cc: Bill Schmidt, Segher Boessenkool, Richard Biener, Richard Sandiford

Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html

BR,
Kewen

on 2020/8/31 下午1:49, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> I'd like to gentle ping this since IVOPTs part is already to land.
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
> 
> BR,
> Kewen
> 
> on 2020/5/28 下午8:19, Kewen.Lin via Gcc-patches wrote:
>>
>> gcc/ChangeLog
>>
>> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>>
>> 	* cfgloop.h (struct loop): New field estimated_unroll.
>> 	* tree-ssa-loop-manip.c (decide_unroll_const_iter): New function.
>> 	(decide_unroll_runtime_iter): Likewise.
>> 	(decide_unroll_stupid): Likewise.
>> 	(estimate_unroll_factor): Likewise.
>> 	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declaration.
>> 	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
>> 	* tree-ssa-loop.h (tree_average_num_loop_insns): New declaration.
>>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-03 22:37                           ` Segher Boessenkool
  2020-09-04  8:27                             ` Bin.Cheng
  2020-09-04  8:47                             ` Kewen.Lin
@ 2020-09-17 23:12                             ` Jeff Law
  2020-09-17 23:46                               ` Segher Boessenkool
  2 siblings, 1 reply; 64+ messages in thread
From: Jeff Law @ 2020-09-17 23:12 UTC (permalink / raw)
  To: Segher Boessenkool, Kewen.Lin
  Cc: bin.cheng, GCC Patches, Bill Schmidt, Richard Guenther



On 9/3/20 4:37 PM, Segher Boessenkool wrote:
>> Apart from that, one P9 specific point is that the update form load isn't
>> preferred,  the reason is that the instruction can not retire until both
>> parts complete, it can hold up subsequent instructions from retiring.
>> If the addi stalls (starvation), the instruction can not retire and can
>> cause things stuck.  It seems also something we can model here?
> This is (almost) no problem on p9, since we no longer have issue groups.
> It can hold up older insns from retiring, sure, but they *will* have
> finished, and p9 can retire 64 insns per cycle.  The "completion wall"
> is gone.  The only problem is if things stick around so long that
> resources run out...  but you're talking 100s of insns there.

So the PA8xxx had the same issue with its dual output insns -- the big
difference is the PA8xxx systems were considered retirement bandwidth
limited (2 memory and 2 non-memory per cycle, with just a 56 entry
reorder buffer, split between memory and non-memory ops).  Holding a
slot in the reorder buffer was relatively costly.


If you can retire 64 ops per cycle, you've probably got an enormous
reorder buffer too, so I wouldn't worry much about holding up insns from
retiring on your target.


jeff



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
  2020-09-17 23:12                             ` Jeff Law
@ 2020-09-17 23:46                               ` Segher Boessenkool
  0 siblings, 0 replies; 64+ messages in thread
From: Segher Boessenkool @ 2020-09-17 23:46 UTC (permalink / raw)
  To: Jeff Law
  Cc: Kewen.Lin, bin.cheng, GCC Patches, Bill Schmidt, Richard Guenther

Hi Jeff,

On Thu, Sep 17, 2020 at 05:12:17PM -0600, Jeff Law wrote:
> On 9/3/20 4:37 PM, Segher Boessenkool wrote:
> >> Apart from that, one P9 specific point is that the update form load isn't
> >> preferred,  the reason is that the instruction can not retire until both
> >> parts complete, it can hold up subsequent instructions from retiring.
> >> If the addi stalls (starvation), the instruction can not retire and can
> >> cause things stuck.  It seems also something we can model here?
> > This is (almost) no problem on p9, since we no longer have issue groups.
> > It can hold up older insns from retiring, sure, but they *will* have
> > finished, and p9 can retire 64 insns per cycle.  The "completion wall"
> > is gone.  The only problem is if things stick around so long that
> > resources run out...  but you're talking 100s of insns there.
> 
> So the PA8xxx had the same issue with its dual output insns -- the big
> difference is the PA8xxx systems were considered retirement bandwidth
> limited (2 memory and 2 non-memory per cycle, with just a 56 entry
> reorder buffer, split between memory and non-memory ops).  Holding a
> slot in the reorder buffer was relatively costly.
> 
> 
> If you can retire 64 ops per cycle and you've probably got an enormous
> reorder buffer, so I wouldn't worry much about holding up insns from
> retiring on your target.

Power9 doesn't have a reorder buffer or anything similar at all -- it
uses history buffers, so committing insns (pretty much what you call
retiring) is essentially for free (restoring old register values after
flushes now costs more though).


Segher

^ permalink raw reply	[flat|nested] 64+ messages in thread

* PING^3 [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2020-09-15  7:44     ` PING^2 " Kewen.Lin
@ 2020-10-13  7:06       ` Kewen.Lin
  2020-11-02  9:13         ` PING^4 " Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-10-13  7:06 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, Segher Boessenkool

Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html

BR,
Kewen

on 2020/9/15 下午3:44, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> Gentle ping this:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
> 
> BR,
> Kewen
> 
> on 2020/8/31 下午1:49, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> I'd like to gentle ping this since IVOPTs part is already to land.
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>
>> BR,
>> Kewen
>>
>> on 2020/5/28 下午8:19, Kewen.Lin via Gcc-patches wrote:
>>>
>>> gcc/ChangeLog
>>>
>>> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>>>
>>> 	* cfgloop.h (struct loop): New field estimated_unroll.
>>> 	* tree-ssa-loop-manip.c (decide_unroll_const_iter): New function.
>>> 	(decide_unroll_runtime_iter): Likewise.
>>> 	(decide_unroll_stupid): Likewise.
>>> 	(estimate_unroll_factor): Likewise.
>>> 	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declaration.
>>> 	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
>>> 	* tree-ssa-loop.h (tree_average_num_loop_insns): New declaration.
>>>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* PING^4 [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2020-10-13  7:06       ` PING^3 " Kewen.Lin
@ 2020-11-02  9:13         ` Kewen.Lin
  2020-11-19  5:50           ` PING^5 " Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-11-02  9:13 UTC (permalink / raw)
  To: GCC Patches
  Cc: Bill Schmidt, Segher Boessenkool, Richard Sandiford, Richard Biener

Hi,

Gentle ping^4 this:

https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html

BR,
Kewen

on 2020/10/13 下午3:06, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> Gentle ping this:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
> 
> BR,
> Kewen
> 
> on 2020/9/15 下午3:44, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> Gentle ping this:
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>
>> BR,
>> Kewen
>>
>> on 2020/8/31 下午1:49, Kewen.Lin via Gcc-patches wrote:
>>> Hi,
>>>
>>> I'd like to gentle ping this since IVOPTs part is already to land.
>>>
>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>>
>>> BR,
>>> Kewen
>>>
>>> on 2020/5/28 下午8:19, Kewen.Lin via Gcc-patches wrote:
>>>>
>>>> gcc/ChangeLog
>>>>
>>>> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>>>>
>>>> 	* cfgloop.h (struct loop): New field estimated_unroll.
>>>> 	* tree-ssa-loop-manip.c (decide_unroll_const_iter): New function.
>>>> 	(decide_unroll_runtime_iter): Likewise.
>>>> 	(decide_unroll_stupid): Likewise.
>>>> 	(estimate_unroll_factor): Likewise.
>>>> 	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declaration.
>>>> 	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
>>>> 	* tree-ssa-loop.h (tree_average_num_loop_insns): New declaration.
>>>>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* PING^5 [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2020-11-02  9:13         ` PING^4 " Kewen.Lin
@ 2020-11-19  5:50           ` Kewen.Lin
  2020-12-17  2:58             ` PING^6 " Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-11-19  5:50 UTC (permalink / raw)
  To: GCC Patches
  Cc: Richard Sandiford, Bill Schmidt, Segher Boessenkool, Richard Biener

Hi,

Gentle ping^5 for:

https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html

BR,
Kewen

on 2020/11/2 下午5:13, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> Gentle ping^4 this:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
> 
> BR,
> Kewen
> 
> on 2020/10/13 下午3:06, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> Gentle ping this:
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>
>> BR,
>> Kewen
>>
>> on 2020/9/15 下午3:44, Kewen.Lin via Gcc-patches wrote:
>>> Hi,
>>>
>>> Gentle ping this:
>>>
>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>>
>>> BR,
>>> Kewen
>>>
>>> on 2020/8/31 下午1:49, Kewen.Lin via Gcc-patches wrote:
>>>> Hi,
>>>>
>>>> I'd like to gentle ping this since IVOPTs part is already to land.
>>>>
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>>>
>>>> BR,
>>>> Kewen
>>>>
>>>> on 2020/5/28 下午8:19, Kewen.Lin via Gcc-patches wrote:
>>>>>
>>>>> gcc/ChangeLog
>>>>>
>>>>> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>>>>>
>>>>> 	* cfgloop.h (struct loop): New field estimated_unroll.
>>>>> 	* tree-ssa-loop-manip.c (decide_unroll_const_iter): New function.
>>>>> 	(decide_unroll_runtime_iter): Likewise.
>>>>> 	(decide_unroll_stupid): Likewise.
>>>>> 	(estimate_unroll_factor): Likewise.
>>>>> 	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declaration.
>>>>> 	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
>>>>> 	* tree-ssa-loop.h (tree_average_num_loop_insns): New declaration.
>>>>>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* PING^6 [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2020-11-19  5:50           ` PING^5 " Kewen.Lin
@ 2020-12-17  2:58             ` Kewen.Lin
  2021-01-14  2:36               ` PING^7 " Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2020-12-17  2:58 UTC (permalink / raw)
  To: GCC Patches
  Cc: Richard Sandiford, Bill Schmidt, Segher Boessenkool, Richard Biener

Hi,

Gentle ping^6 for:

https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html

BR,
Kewen

on 2020/11/19 下午1:50, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> Gentle ping^5 for:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
> 
> BR,
> Kewen
> 
> on 2020/11/2 下午5:13, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> Gentle ping^4 this:
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>
>> BR,
>> Kewen
>>
>> on 2020/10/13 下午3:06, Kewen.Lin via Gcc-patches wrote:
>>> Hi,
>>>
>>> Gentle ping this:
>>>
>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>>
>>> BR,
>>> Kewen
>>>
>>> on 2020/9/15 下午3:44, Kewen.Lin via Gcc-patches wrote:
>>>> Hi,
>>>>
>>>> Gentle ping this:
>>>>
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>>>
>>>> BR,
>>>> Kewen
>>>>
>>>> on 2020/8/31 下午1:49, Kewen.Lin via Gcc-patches wrote:
>>>>> Hi,
>>>>>
>>>>> I'd like to gentle ping this since IVOPTs part is already to land.
>>>>>
>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>>>>
>>>>> BR,
>>>>> Kewen
>>>>>
>>>>> on 2020/5/28 下午8:19, Kewen.Lin via Gcc-patches wrote:
>>>>>>
>>>>>> gcc/ChangeLog
>>>>>>
>>>>>> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>>>>>>
>>>>>> 	* cfgloop.h (struct loop): New field estimated_unroll.
>>>>>> 	* tree-ssa-loop-manip.c (decide_unroll_const_iter): New function.
>>>>>> 	(decide_unroll_runtime_iter): Likewise.
>>>>>> 	(decide_unroll_stupid): Likewise.
>>>>>> 	(estimate_unroll_factor): Likewise.
>>>>>> 	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declaration.
>>>>>> 	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
>>>>>> 	* tree-ssa-loop.h (tree_average_num_loop_insns): New declaration.
>>>>>>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* PING^7 [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2020-12-17  2:58             ` PING^6 " Kewen.Lin
@ 2021-01-14  2:36               ` Kewen.Lin
  0 siblings, 0 replies; 64+ messages in thread
From: Kewen.Lin @ 2021-01-14  2:36 UTC (permalink / raw)
  To: GCC Patches
  Cc: Richard Sandiford, Bill Schmidt, Segher Boessenkool, Richard Biener

Hi,

Gentle ping^7 for:

https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html

BR,
Kewen

on 2020/12/17 上午10:58, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> Gentle ping^6 for:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
> 
> BR,
> Kewen
> 
> on 2020/11/19 下午1:50, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> Gentle ping^5 for:
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>
>> BR,
>> Kewen
>>
>> on 2020/11/2 下午5:13, Kewen.Lin via Gcc-patches wrote:
>>> Hi,
>>>
>>> Gentle ping^4 this:
>>>
>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>>
>>> BR,
>>> Kewen
>>>
>>> on 2020/10/13 下午3:06, Kewen.Lin via Gcc-patches wrote:
>>>> Hi,
>>>>
>>>> Gentle ping this:
>>>>
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>>>
>>>> BR,
>>>> Kewen
>>>>
>>>> on 2020/9/15 下午3:44, Kewen.Lin via Gcc-patches wrote:
>>>>> Hi,
>>>>>
>>>>> Gentle ping this:
>>>>>
>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>>>>
>>>>> BR,
>>>>> Kewen
>>>>>
>>>>> on 2020/8/31 下午1:49, Kewen.Lin via Gcc-patches wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'd like to gentle ping this since IVOPTs part is already to land.
>>>>>>
>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546698.html
>>>>>>
>>>>>> BR,
>>>>>> Kewen
>>>>>>
>>>>>> on 2020/5/28 下午8:19, Kewen.Lin via Gcc-patches wrote:
>>>>>>>
>>>>>>> gcc/ChangeLog
>>>>>>>
>>>>>>> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>>>>>>>
>>>>>>> 	* cfgloop.h (struct loop): New field estimated_unroll.
>>>>>>> 	* tree-ssa-loop-manip.c (decide_unroll_const_iter): New function.
>>>>>>> 	(decide_unroll_runtime_iter): Likewise.
>>>>>>> 	(decide_unroll_stupid): Likewise.
>>>>>>> 	(estimate_unroll_factor): Likewise.
>>>>>>> 	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declaration.
>>>>>>> 	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
>>>>>>> 	* tree-ssa-loop.h (tree_average_num_loop_insns): New declaration.
>>>>>>>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2020-05-28 12:19 ` [PATCH 1/4] unroll: Add middle-end unroll factor estimation Kewen.Lin
  2020-08-31  5:49   ` PING " Kewen.Lin
@ 2021-01-21 21:45   ` Segher Boessenkool
  2021-01-22 12:50     ` Richard Sandiford
  2021-01-22 13:47     ` Richard Biener
  1 sibling, 2 replies; 64+ messages in thread
From: Segher Boessenkool @ 2021-01-21 21:45 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Richard Sandiford, Richard Guenther, Bill Schmidt

Hi!

What is holding up this patch still?  Ke Wen has pinged it every month
since May, and there has still not been a review.


Segher


On Thu, May 28, 2020 at 08:19:59PM +0800, Kewen.Lin wrote:
> 
> gcc/ChangeLog
> 
> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
> 
> 	* cfgloop.h (struct loop): New field estimated_unroll.
> 	* tree-ssa-loop-manip.c (decide_unroll_const_iter): New function.
> 	(decide_unroll_runtime_iter): Likewise.
> 	(decide_unroll_stupid): Likewise.
> 	(estimate_unroll_factor): Likewise.
> 	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declaration.
> 	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
> 	* tree-ssa-loop.h (tree_average_num_loop_insns): New declaration.
> 
> ----

> ---
>  gcc/cfgloop.h             |   3 +
>  gcc/tree-ssa-loop-manip.c | 253 ++++++++++++++++++++++++++++++++++++++++++++++
>  gcc/tree-ssa-loop-manip.h |   3 +-
>  gcc/tree-ssa-loop.c       |  33 ++++++
>  gcc/tree-ssa-loop.h       |   2 +
>  5 files changed, 292 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
> index 11378ca..c5bcca7 100644
> --- a/gcc/cfgloop.h
> +++ b/gcc/cfgloop.h
> @@ -232,6 +232,9 @@ public:
>       Other values means unroll with the given unrolling factor.  */
>    unsigned short unroll;
>  
> +  /* Like unroll field above, but it's estimated in middle-end.  */
> +  unsigned short estimated_unroll;
> +
>    /* If this loop was inlined the main clique of the callee which does
>       not need remapping when copying the loop body.  */
>    unsigned short owned_clique;
> diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> index 120b35b..8a5a1a9 100644
> --- a/gcc/tree-ssa-loop-manip.c
> +++ b/gcc/tree-ssa-loop-manip.c
> @@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "system.h"
>  #include "coretypes.h"
>  #include "backend.h"
> +#include "target.h"
>  #include "tree.h"
>  #include "gimple.h"
>  #include "cfghooks.h"
> @@ -42,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "cfgloop.h"
>  #include "tree-scalar-evolution.h"
>  #include "tree-inline.h"
> +#include "wide-int.h"
>  
>  /* All bitmaps for rewriting into loop-closed SSA go on this obstack,
>     so that we can free them all at once.  */
> @@ -1592,3 +1594,254 @@ canonicalize_loop_ivs (class loop *loop, tree *nit, bool bump_in_latch)
>  
>    return var_before;
>  }
> +
> +/* Try to determine estimated unroll factor for given LOOP with constant number
> +   of iterations, mainly refer to decide_unroll_constant_iterations.
> +    - NITER_DESC holds number of iteration description if it isn't NULL.
> +    - NUNROLL holds a unroll factor value computed with instruction numbers.
> +    - ITER holds estimated or likely max loop iterations.
> +   Return true if it succeeds, also update estimated_unroll.  */
> +
> +static bool
> +decide_unroll_const_iter (class loop *loop, const tree_niter_desc *niter_desc,
> +		      unsigned nunroll, const widest_int *iter)
> +{
> +  /* Skip big loops.  */
> +  if (nunroll <= 1)
> +    return false;
> +
> +  gcc_assert (niter_desc && niter_desc->assumptions);
> +
> +  /* Check number of iterations is constant, return false if no.  */
> +  if ((niter_desc->may_be_zero && !integer_zerop (niter_desc->may_be_zero))
> +      || !tree_fits_uhwi_p (niter_desc->niter))
> +    return false;
> +
> +  unsigned HOST_WIDE_INT const_niter = tree_to_uhwi (niter_desc->niter);
> +
> +  /* If unroll factor is set explicitly, use it as estimated_unroll.  */
> +  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
> +    {
> +      /* It should have been peeled instead.  */
> +      if (const_niter == 0 || (unsigned) loop->unroll > const_niter - 1)
> +	loop->estimated_unroll = 1;
> +      else
> +	loop->estimated_unroll = loop->unroll;
> +      return true;
> +    }
> +
> +  /* Check whether the loop rolls enough to consider.  */
> +  if (const_niter < 2 * nunroll || wi::ltu_p (*iter, 2 * nunroll))
> +    return false;
> +
> +  /* Success; now compute number of iterations to unroll.  */
> +  unsigned best_unroll = 0, n_copies = 0;
> +  unsigned best_copies = 2 * nunroll + 10;
> +  unsigned i = 2 * nunroll + 2;
> +
> +  if (i > const_niter - 2)
> +    i = const_niter - 2;
> +
> +  for (; i >= nunroll - 1; i--)
> +    {
> +      unsigned exit_mod = const_niter % (i + 1);
> +
> +      if (!empty_block_p (loop->latch))
> +	n_copies = exit_mod + i + 1;
> +      else if (exit_mod != i)
> +	n_copies = exit_mod + i + 2;
> +      else
> +	n_copies = i + 1;
> +
> +      if (n_copies < best_copies)
> +	{
> +	  best_copies = n_copies;
> +	  best_unroll = i;
> +	}
> +    }
> +
> +  loop->estimated_unroll = best_unroll + 1;
> +  return true;
> +}
> +
> +/* Try to determine estimated unroll factor for given LOOP with countable but
> +   non-constant number of iterations, mainly refer to
> +   decide_unroll_runtime_iterations.
> +    - NITER_DESC holds number of iteration description if it isn't NULL.
> +    - NUNROLL_IN holds a unroll factor value computed with instruction numbers.
> +    - ITER holds estimated or likely max loop iterations.
> +   Return true if it succeeds, also update estimated_unroll.  */
> +
> +static bool
> +decide_unroll_runtime_iter (class loop *loop, const tree_niter_desc *niter_desc,
> +			unsigned nunroll_in, const widest_int *iter)
> +{
> +  unsigned nunroll = nunroll_in;
> +  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
> +    nunroll = loop->unroll;
> +
> +  /* Skip big loops.  */
> +  if (nunroll <= 1)
> +    return false;
> +
> +  gcc_assert (niter_desc && niter_desc->assumptions);
> +
> +  /* Skip constant number of iterations.  */
> +  if ((!niter_desc->may_be_zero || !integer_zerop (niter_desc->may_be_zero))
> +      && tree_fits_uhwi_p (niter_desc->niter))
> +    return false;
> +
> +  /* Check whether the loop rolls.  */
> +  if (wi::ltu_p (*iter, 2 * nunroll))
> +    return false;
> +
> +  /* Success; now force nunroll to be power of 2.  */
> +  unsigned i;
> +  for (i = 1; 2 * i <= nunroll; i *= 2)
> +    continue;
> +
> +  loop->estimated_unroll = i;
> +  return true;
> +}
> +
> +/* Try to determine estimated unroll factor for given LOOP with uncountable
> +   number of iterations, mainly refer to decide_unroll_stupid.
> +    - NITER_DESC holds number of iteration description if it isn't NULL.
> +    - NUNROLL_IN holds a unroll factor value computed with instruction numbers.
> +    - ITER holds estimated or likely max loop iterations.
> +   Return true if it succeeds, also update estimated_unroll.  */
> +
> +static bool
> +decide_unroll_stupid (class loop *loop, const tree_niter_desc *niter_desc,
> +		  unsigned nunroll_in, const widest_int *iter)
> +{
> +  if (!flag_unroll_all_loops && !loop->unroll)
> +    return false;
> +
> +  unsigned nunroll = nunroll_in;
> +  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
> +    nunroll = loop->unroll;
> +
> +  /* Skip big loops.  */
> +  if (nunroll <= 1)
> +    return false;
> +
> +  gcc_assert (!niter_desc || !niter_desc->assumptions);
> +
> +  /* Skip loop with multiple branches for now.  */
> +  if (num_loop_branches (loop) > 1)
> +    return false;
> +
> +  /* Check whether the loop rolls.  */
> +  if (wi::ltu_p (*iter, 2 * nunroll))
> +    return false;
> +
> +  /* Success; now force nunroll to be power of 2.  */
> +  unsigned i;
> +  for (i = 1; 2 * i <= nunroll; i *= 2)
> +    continue;
> +
> +  loop->estimated_unroll = i;
> +  return true;
> +}
> +
> +/* Try to estimate whether this given LOOP can be unrolled or not, and compute
> +   its estimated unroll factor if it can.  To avoid duplicated computation, you
> +   can pass number of iterations information by DESC.  The heuristics mainly
> +   refer to decide_unrolling in loop-unroll.c.  */
> +
> +void
> +estimate_unroll_factor (class loop *loop, tree_niter_desc *desc)
> +{
> +  /* Return the existing estimated unroll factor.  */
> +  if (loop->estimated_unroll)
> +    return;
> +
> +  /* Don't unroll explicitly.  */
> +  if (loop->unroll == 1)
> +    {
> +      loop->estimated_unroll = loop->unroll;
> +      return;
> +    }
> +
> +  /* Like decide_unrolling, don't unroll if:
> +     1) the loop is cold.
> +     2) the loop can't be manipulated.
> +     3) the loop isn't innermost.  */
> +  if (optimize_loop_for_size_p (loop) || !can_duplicate_loop_p (loop)
> +      || loop->inner != NULL)
> +    {
> +      loop->estimated_unroll = 1;
> +      return;
> +    }
> +
> +  /* Don't unroll without explicit information.  */
> +  if (!loop->unroll && !flag_unroll_loops && !flag_unroll_all_loops)
> +    {
> +      loop->estimated_unroll = 1;
> +      return;
> +    }
> +
> +  /* Check for instruction number and average instruction number.  */
> +  loop->ninsns = tree_num_loop_insns (loop, &eni_size_weights);
> +  loop->av_ninsns = tree_average_num_loop_insns (loop, &eni_size_weights);
> +  unsigned nunroll = param_max_unrolled_insns / loop->ninsns;
> +  unsigned nunroll_by_av = param_max_average_unrolled_insns / loop->av_ninsns;
> +
> +  if (nunroll > nunroll_by_av)
> +    nunroll = nunroll_by_av;
> +  if (nunroll > (unsigned) param_max_unroll_times)
> +    nunroll = param_max_unroll_times;
> +
> +  if (targetm.loop_unroll_adjust)
> +    nunroll = targetm.loop_unroll_adjust (nunroll, loop);
> +
> +  tree_niter_desc *niter_desc = NULL;
> +  bool desc_need_delete = false;
> +
> +  /* Compute number of iterations if need.  */
> +  if (!desc)
> +    {
> +      /* For now, use single_dom_exit for simplicity. TODO: Support multiple
> +	 exits like find_simple_exit if we finds some profitable cases.  */
> +      niter_desc = XNEW (class tree_niter_desc);
> +      gcc_assert (niter_desc);
> +      edge exit = single_dom_exit (loop);
> +      if (!exit || !number_of_iterations_exit (loop, exit, niter_desc, true))
> +	{
> +	  XDELETE (niter_desc);
> +	  niter_desc = NULL;
> +	}
> +      else
> +	desc_need_delete = true;
> +    }
> +  else
> +    niter_desc = desc;
> +
> +  /* For checking the loop rolls enough to consider, also consult loop bounds
> +     and profile.  */
> +  widest_int iterations;
> +  if (!get_estimated_loop_iterations (loop, &iterations)
> +      && !get_likely_max_loop_iterations (loop, &iterations))
> +    iterations = 0;
> +
> +  if (niter_desc && niter_desc->assumptions)
> +    {
> +      /* For countable loops.  */
> +      if (!decide_unroll_const_iter (loop, niter_desc, nunroll, &iterations)
> +	  && !decide_unroll_runtime_iter (loop, niter_desc, nunroll, &iterations))
> +	loop->estimated_unroll = 1;
> +    }
> +  else
> +    {
> +      if (!decide_unroll_stupid (loop, niter_desc, nunroll, &iterations))
> +	loop->estimated_unroll = 1;
> +    }
> +
> +  if (desc_need_delete)
> +    {
> +      XDELETE (niter_desc);
> +      niter_desc = NULL;
> +    }
> +}
> +
> diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
> index e789e4f..773a2b3 100644
> --- a/gcc/tree-ssa-loop-manip.h
> +++ b/gcc/tree-ssa-loop-manip.h
> @@ -55,7 +55,6 @@ extern void tree_transform_and_unroll_loop (class loop *, unsigned,
>  extern void tree_unroll_loop (class loop *, unsigned,
>  			      edge, class tree_niter_desc *);
>  extern tree canonicalize_loop_ivs (class loop *, tree *, bool);
> -
> -
> +extern void estimate_unroll_factor (class loop *, tree_niter_desc *);
>  
>  #endif /* GCC_TREE_SSA_LOOP_MANIP_H */
> diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
> index 5e8365d..25320fb 100644
> --- a/gcc/tree-ssa-loop.c
> +++ b/gcc/tree-ssa-loop.c
> @@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "diagnostic-core.h"
>  #include "stringpool.h"
>  #include "attribs.h"
> +#include "sreal.h"
>  
>  
>  /* A pass making sure loops are fixed up.  */
> @@ -790,5 +791,37 @@ tree_num_loop_insns (class loop *loop, eni_weights *weights)
>    return size;
>  }
>  
> +/* Computes an estimated number of insns on average per iteration in LOOP,
> +   weighted by WEIGHTS.  Refer to function average_num_loop_insns.  */
>  
> +unsigned
> +tree_average_num_loop_insns (class loop *loop, eni_weights *weights)
> +{
> +  basic_block *body = get_loop_body (loop);
> +  gimple_stmt_iterator gsi;
> +  unsigned bb_size, i;
> +  sreal nsize = 0;
> +
> +  for (i = 0; i < loop->num_nodes; i++)
> +    {
> +      bb_size = 0;
> +      for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi))
> +	bb_size += estimate_num_insns (gsi_stmt (gsi), weights);
> +      nsize += (sreal) bb_size
> +	       * body[i]->count.to_sreal_scale (loop->header->count);
> +      /* Avoid overflows.   */
> +      if (nsize > 1000000)
> +	{
> +	  free (body);
> +	  return 1000000;
> +	}
> +    }
> +  free (body);
> +
> +  unsigned ret = nsize.to_int ();
> +  if (!ret)
> +    ret = 1; /* To avoid division by zero.  */
> +
> +  return ret;
> +}
>  
> diff --git a/gcc/tree-ssa-loop.h b/gcc/tree-ssa-loop.h
> index 9e35125..af36177 100644
> --- a/gcc/tree-ssa-loop.h
> +++ b/gcc/tree-ssa-loop.h
> @@ -67,6 +67,8 @@ public:
>  extern bool for_each_index (tree *, bool (*) (tree, tree *, void *), void *);
>  extern char *get_lsm_tmp_name (tree ref, unsigned n, const char *suffix = NULL);
>  extern unsigned tree_num_loop_insns (class loop *, struct eni_weights *);
> +extern unsigned tree_average_num_loop_insns (class loop *,
> +					     struct eni_weights *);
>  
>  /* Returns the loop of the statement STMT.  */
>  
> -- 
> 2.7.4
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread
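
For readers skimming the quoted patch above, tree_average_num_loop_insns
computes a profile-weighted per-iteration size.  Below is a standalone C
model of that computation (not GCC code: double stands in for sreal, and
the block sizes and frequencies are made-up inputs):

  /* Standalone model of the profile-weighted average in
     tree_average_num_loop_insns: each block's insn estimate is scaled by
     how often the block runs relative to the loop header.  */

  #include <stdio.h>

  static unsigned
  average_loop_insns (const unsigned *bb_size, const double *bb_freq,
                      unsigned nblocks)
  {
    double nsize = 0.0;

    for (unsigned i = 0; i < nblocks; i++)
      {
        /* bb_freq[i] models bb->count.to_sreal_scale (header->count).  */
        nsize += (double) bb_size[i] * bb_freq[i];
        /* Avoid overflows, as in the patch.  */
        if (nsize > 1000000.0)
          return 1000000;
      }

    unsigned ret = (unsigned) nsize;
    return ret ? ret : 1;  /* Avoid a later division by zero.  */
  }

  int
  main (void)
  {
    /* Header block: 6 insns every iteration; a guarded block: 8 insns,
       taken half the time.  */
    unsigned sizes[] = { 6, 8 };
    double freqs[] = { 1.0, 0.5 };
    printf ("%u\n", average_loop_insns (sizes, freqs, 2));  /* prints 10 */
    return 0;
  }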

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-21 21:45   ` Segher Boessenkool
@ 2021-01-22 12:50     ` Richard Sandiford
  2021-01-22 13:47     ` Richard Biener
  1 sibling, 0 replies; 64+ messages in thread
From: Richard Sandiford @ 2021-01-22 12:50 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Kewen.Lin, GCC Patches, Richard Guenther, Bill Schmidt

Segher Boessenkool <segher@kernel.crashing.org> writes:
> Hi!
>
> What is holding up this patch still?  Ke Wen has pinged it every month
> since May, and there has still not been a review.

FAOD (since I'm on cc:), I don't feel qualified to review this.
Tree-level loop stuff isn't really my area.

Thanks,
Richard

>
>
> Segher
>
>
> On Thu, May 28, 2020 at 08:19:59PM +0800, Kewen.Lin wrote:
>> 
>> gcc/ChangeLog
>> 
>> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>> 
>> 	* cfgloop.h (struct loop): New field estimated_unroll.
>> 	* tree-ssa-loop-manip.c (decide_unroll_const_iter): New function.
>> 	(decide_unroll_runtime_iter): Likewise.
>> 	(decide_unroll_stupid): Likewise.
>> 	(estimate_unroll_factor): Likewise.
>> 	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declaration.
>> 	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
>> 	* tree-ssa-loop.h (tree_average_num_loop_insns): New declaration.
>> 
>> ----
>
>> ---
>>  gcc/cfgloop.h             |   3 +
>>  gcc/tree-ssa-loop-manip.c | 253 ++++++++++++++++++++++++++++++++++++++++++++++
>>  gcc/tree-ssa-loop-manip.h |   3 +-
>>  gcc/tree-ssa-loop.c       |  33 ++++++
>>  gcc/tree-ssa-loop.h       |   2 +
>>  5 files changed, 292 insertions(+), 2 deletions(-)
>> 
>> diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
>> index 11378ca..c5bcca7 100644
>> --- a/gcc/cfgloop.h
>> +++ b/gcc/cfgloop.h
>> @@ -232,6 +232,9 @@ public:
>>       Other values means unroll with the given unrolling factor.  */
>>    unsigned short unroll;
>>  
>> +  /* Like unroll field above, but it's estimated in middle-end.  */
>> +  unsigned short estimated_unroll;
>> +
>>    /* If this loop was inlined the main clique of the callee which does
>>       not need remapping when copying the loop body.  */
>>    unsigned short owned_clique;
>> diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
>> index 120b35b..8a5a1a9 100644
>> --- a/gcc/tree-ssa-loop-manip.c
>> +++ b/gcc/tree-ssa-loop-manip.c
>> @@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "system.h"
>>  #include "coretypes.h"
>>  #include "backend.h"
>> +#include "target.h"
>>  #include "tree.h"
>>  #include "gimple.h"
>>  #include "cfghooks.h"
>> @@ -42,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "cfgloop.h"
>>  #include "tree-scalar-evolution.h"
>>  #include "tree-inline.h"
>> +#include "wide-int.h"
>>  
>>  /* All bitmaps for rewriting into loop-closed SSA go on this obstack,
>>     so that we can free them all at once.  */
>> @@ -1592,3 +1594,254 @@ canonicalize_loop_ivs (class loop *loop, tree *nit, bool bump_in_latch)
>>  
>>    return var_before;
>>  }
>> +
>> +/* Try to determine estimated unroll factor for given LOOP with constant number
>> +   of iterations, mainly referring to decide_unroll_constant_iterations.
>> +    - NITER_DESC holds the number-of-iterations description if it isn't NULL.
>> +    - NUNROLL holds an unroll factor value computed with instruction numbers.
>> +    - ITER holds estimated or likely max loop iterations.
>> +   Return true if it succeeds, also update estimated_unroll.  */
>> +
>> +static bool
>> +decide_unroll_const_iter (class loop *loop, const tree_niter_desc *niter_desc,
>> +		      unsigned nunroll, const widest_int *iter)
>> +{
>> +  /* Skip big loops.  */
>> +  if (nunroll <= 1)
>> +    return false;
>> +
>> +  gcc_assert (niter_desc && niter_desc->assumptions);
>> +
>> +  /* Check the number of iterations is constant, return false if not.  */
>> +  if ((niter_desc->may_be_zero && !integer_zerop (niter_desc->may_be_zero))
>> +      || !tree_fits_uhwi_p (niter_desc->niter))
>> +    return false;
>> +
>> +  unsigned HOST_WIDE_INT const_niter = tree_to_uhwi (niter_desc->niter);
>> +
>> +  /* If unroll factor is set explicitly, use it as estimated_unroll.  */
>> +  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
>> +    {
>> +      /* It should have been peeled instead.  */
>> +      if (const_niter == 0 || (unsigned) loop->unroll > const_niter - 1)
>> +	loop->estimated_unroll = 1;
>> +      else
>> +	loop->estimated_unroll = loop->unroll;
>> +      return true;
>> +    }
>> +
>> +  /* Check whether the loop rolls enough to consider.  */
>> +  if (const_niter < 2 * nunroll || wi::ltu_p (*iter, 2 * nunroll))
>> +    return false;
>> +
>> +  /* Success; now compute number of iterations to unroll.  */
>> +  unsigned best_unroll = 0, n_copies = 0;
>> +  unsigned best_copies = 2 * nunroll + 10;
>> +  unsigned i = 2 * nunroll + 2;
>> +
>> +  if (i > const_niter - 2)
>> +    i = const_niter - 2;
>> +
>> +  for (; i >= nunroll - 1; i--)
>> +    {
>> +      unsigned exit_mod = const_niter % (i + 1);
>> +
>> +      if (!empty_block_p (loop->latch))
>> +	n_copies = exit_mod + i + 1;
>> +      else if (exit_mod != i)
>> +	n_copies = exit_mod + i + 2;
>> +      else
>> +	n_copies = i + 1;
>> +
>> +      if (n_copies < best_copies)
>> +	{
>> +	  best_copies = n_copies;
>> +	  best_unroll = i;
>> +	}
>> +    }
>> +
>> +  loop->estimated_unroll = best_unroll + 1;
>> +  return true;
>> +}
>> +
>> +/* Try to determine estimated unroll factor for given LOOP with countable but
>> +   non-constant number of iterations, mainly referring to
>> +   decide_unroll_runtime_iterations.
>> +    - NITER_DESC holds the number-of-iterations description if it isn't NULL.
>> +    - NUNROLL_IN holds an unroll factor value computed with instruction numbers.
>> +    - ITER holds estimated or likely max loop iterations.
>> +   Return true if it succeeds, also update estimated_unroll.  */
>> +
>> +static bool
>> +decide_unroll_runtime_iter (class loop *loop, const tree_niter_desc *niter_desc,
>> +			unsigned nunroll_in, const widest_int *iter)
>> +{
>> +  unsigned nunroll = nunroll_in;
>> +  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
>> +    nunroll = loop->unroll;
>> +
>> +  /* Skip big loops.  */
>> +  if (nunroll <= 1)
>> +    return false;
>> +
>> +  gcc_assert (niter_desc && niter_desc->assumptions);
>> +
>> +  /* Skip constant number of iterations.  */
>> +  if ((!niter_desc->may_be_zero || !integer_zerop (niter_desc->may_be_zero))
>> +      && tree_fits_uhwi_p (niter_desc->niter))
>> +    return false;
>> +
>> +  /* Check whether the loop rolls.  */
>> +  if (wi::ltu_p (*iter, 2 * nunroll))
>> +    return false;
>> +
>> +  /* Success; now force nunroll to be power of 2.  */
>> +  unsigned i;
>> +  for (i = 1; 2 * i <= nunroll; i *= 2)
>> +    continue;
>> +
>> +  loop->estimated_unroll = i;
>> +  return true;
>> +}
>> +
>> +/* Try to determine estimated unroll factor for given LOOP with uncountable
>> +   number of iterations, mainly referring to decide_unroll_stupid.
>> +    - NITER_DESC holds the number-of-iterations description if it isn't NULL.
>> +    - NUNROLL_IN holds an unroll factor value computed with instruction numbers.
>> +    - ITER holds estimated or likely max loop iterations.
>> +   Return true if it succeeds, also update estimated_unroll.  */
>> +
>> +static bool
>> +decide_unroll_stupid (class loop *loop, const tree_niter_desc *niter_desc,
>> +		  unsigned nunroll_in, const widest_int *iter)
>> +{
>> +  if (!flag_unroll_all_loops && !loop->unroll)
>> +    return false;
>> +
>> +  unsigned nunroll = nunroll_in;
>> +  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
>> +    nunroll = loop->unroll;
>> +
>> +  /* Skip big loops.  */
>> +  if (nunroll <= 1)
>> +    return false;
>> +
>> +  gcc_assert (!niter_desc || !niter_desc->assumptions);
>> +
>> +  /* Skip loop with multiple branches for now.  */
>> +  if (num_loop_branches (loop) > 1)
>> +    return false;
>> +
>> +  /* Check whether the loop rolls.  */
>> +  if (wi::ltu_p (*iter, 2 * nunroll))
>> +    return false;
>> +
>> +  /* Success; now force nunroll to be power of 2.  */
>> +  unsigned i;
>> +  for (i = 1; 2 * i <= nunroll; i *= 2)
>> +    continue;
>> +
>> +  loop->estimated_unroll = i;
>> +  return true;
>> +}
>> +
>> +/* Try to estimate whether this given LOOP can be unrolled or not, and compute
>> +   its estimated unroll factor if it can.  To avoid duplicated computation, you
>> +   can pass number of iterations information by DESC.  The heuristics mainly
>> +   refer to decide_unrolling in loop-unroll.c.  */
>> +
>> +void
>> +estimate_unroll_factor (class loop *loop, tree_niter_desc *desc)
>> +{
>> +  /* Return the existing estimated unroll factor.  */
>> +  if (loop->estimated_unroll)
>> +    return;
>> +
>> +  /* Don't unroll explicitly.  */
>> +  if (loop->unroll == 1)
>> +    {
>> +      loop->estimated_unroll = loop->unroll;
>> +      return;
>> +    }
>> +
>> +  /* Like decide_unrolling, don't unroll if:
>> +     1) the loop is cold.
>> +     2) the loop can't be manipulated.
>> +     3) the loop isn't innermost.  */
>> +  if (optimize_loop_for_size_p (loop) || !can_duplicate_loop_p (loop)
>> +      || loop->inner != NULL)
>> +    {
>> +      loop->estimated_unroll = 1;
>> +      return;
>> +    }
>> +
>> +  /* Don't unroll without explicit information.  */
>> +  if (!loop->unroll && !flag_unroll_loops && !flag_unroll_all_loops)
>> +    {
>> +      loop->estimated_unroll = 1;
>> +      return;
>> +    }
>> +
>> +  /* Check for instruction number and average instruction number.  */
>> +  loop->ninsns = tree_num_loop_insns (loop, &eni_size_weights);
>> +  loop->av_ninsns = tree_average_num_loop_insns (loop, &eni_size_weights);
>> +  unsigned nunroll = param_max_unrolled_insns / loop->ninsns;
>> +  unsigned nunroll_by_av = param_max_average_unrolled_insns / loop->av_ninsns;
>> +
>> +  if (nunroll > nunroll_by_av)
>> +    nunroll = nunroll_by_av;
>> +  if (nunroll > (unsigned) param_max_unroll_times)
>> +    nunroll = param_max_unroll_times;
>> +
>> +  if (targetm.loop_unroll_adjust)
>> +    nunroll = targetm.loop_unroll_adjust (nunroll, loop);
>> +
>> +  tree_niter_desc *niter_desc = NULL;
>> +  bool desc_need_delete = false;
>> +
>> +  /* Compute the number of iterations if needed.  */
>> +  if (!desc)
>> +    {
>> +      /* For now, use single_dom_exit for simplicity. TODO: Support multiple
>> +	 exits like find_simple_exit if we find some profitable cases.  */
>> +      niter_desc = XNEW (class tree_niter_desc);
>> +      gcc_assert (niter_desc);
>> +      edge exit = single_dom_exit (loop);
>> +      if (!exit || !number_of_iterations_exit (loop, exit, niter_desc, true))
>> +	{
>> +	  XDELETE (niter_desc);
>> +	  niter_desc = NULL;
>> +	}
>> +      else
>> +	desc_need_delete = true;
>> +    }
>> +  else
>> +    niter_desc = desc;
>> +
>> +  /* For checking whether the loop rolls enough to be considered, also
>> +     consult the loop bounds and profile.  */
>> +  widest_int iterations;
>> +  if (!get_estimated_loop_iterations (loop, &iterations)
>> +      && !get_likely_max_loop_iterations (loop, &iterations))
>> +    iterations = 0;
>> +
>> +  if (niter_desc && niter_desc->assumptions)
>> +    {
>> +      /* For countable loops.  */
>> +      if (!decide_unroll_const_iter (loop, niter_desc, nunroll, &iterations)
>> +	  && !decide_unroll_runtime_iter (loop, niter_desc, nunroll, &iterations))
>> +	loop->estimated_unroll = 1;
>> +    }
>> +  else
>> +    {
>> +      if (!decide_unroll_stupid (loop, niter_desc, nunroll, &iterations))
>> +	loop->estimated_unroll = 1;
>> +    }
>> +
>> +  if (desc_need_delete)
>> +    {
>> +      XDELETE (niter_desc);
>> +      niter_desc = NULL;
>> +    }
>> +}
>> +
>> diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
>> index e789e4f..773a2b3 100644
>> --- a/gcc/tree-ssa-loop-manip.h
>> +++ b/gcc/tree-ssa-loop-manip.h
>> @@ -55,7 +55,6 @@ extern void tree_transform_and_unroll_loop (class loop *, unsigned,
>>  extern void tree_unroll_loop (class loop *, unsigned,
>>  			      edge, class tree_niter_desc *);
>>  extern tree canonicalize_loop_ivs (class loop *, tree *, bool);
>> -
>> -
>> +extern void estimate_unroll_factor (class loop *, tree_niter_desc *);
>>  
>>  #endif /* GCC_TREE_SSA_LOOP_MANIP_H */
>> diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
>> index 5e8365d..25320fb 100644
>> --- a/gcc/tree-ssa-loop.c
>> +++ b/gcc/tree-ssa-loop.c
>> @@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "diagnostic-core.h"
>>  #include "stringpool.h"
>>  #include "attribs.h"
>> +#include "sreal.h"
>>  
>>  
>>  /* A pass making sure loops are fixed up.  */
>> @@ -790,5 +791,37 @@ tree_num_loop_insns (class loop *loop, eni_weights *weights)
>>    return size;
>>  }
>>  
>> +/* Computes an estimated number of insns on average per iteration in LOOP,
>> +   weighted by WEIGHTS.  Refer to function average_num_loop_insns.  */
>>  
>> +unsigned
>> +tree_average_num_loop_insns (class loop *loop, eni_weights *weights)
>> +{
>> +  basic_block *body = get_loop_body (loop);
>> +  gimple_stmt_iterator gsi;
>> +  unsigned bb_size, i;
>> +  sreal nsize = 0;
>> +
>> +  for (i = 0; i < loop->num_nodes; i++)
>> +    {
>> +      bb_size = 0;
>> +      for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi))
>> +	bb_size += estimate_num_insns (gsi_stmt (gsi), weights);
>> +      nsize += (sreal) bb_size
>> +	       * body[i]->count.to_sreal_scale (loop->header->count);
>> +      /* Avoid overflows.   */
>> +      if (nsize > 1000000)
>> +	{
>> +	  free (body);
>> +	  return 1000000;
>> +	}
>> +    }
>> +  free (body);
>> +
>> +  unsigned ret = nsize.to_int ();
>> +  if (!ret)
>> +    ret = 1; /* To avoid division by zero.  */
>> +
>> +  return ret;
>> +}
>>  
>> diff --git a/gcc/tree-ssa-loop.h b/gcc/tree-ssa-loop.h
>> index 9e35125..af36177 100644
>> --- a/gcc/tree-ssa-loop.h
>> +++ b/gcc/tree-ssa-loop.h
>> @@ -67,6 +67,8 @@ public:
>>  extern bool for_each_index (tree *, bool (*) (tree, tree *, void *), void *);
>>  extern char *get_lsm_tmp_name (tree ref, unsigned n, const char *suffix = NULL);
>>  extern unsigned tree_num_loop_insns (class loop *, struct eni_weights *);
>> +extern unsigned tree_average_num_loop_insns (class loop *,
>> +					     struct eni_weights *);
>>  
>>  /* Returns the loop of the statement STMT.  */
>>  
>> -- 
>> 2.7.4
>> 

^ permalink raw reply	[flat|nested] 64+ messages in thread
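
To make the size heuristic in the quoted estimate_unroll_factor concrete:
the candidate factor is limited by total unrolled size, by average unrolled
size, and by the maximum unroll times.  A standalone C model of that
capping (not GCC code; the parameter values and loop sizes in main are
made-up inputs, not GCC's defaults):

  /* Standalone model of how estimate_unroll_factor caps the candidate
     unroll factor before dispatching to the decide_unroll_* helpers.  */

  #include <stdio.h>

  static unsigned
  cap_nunroll (unsigned ninsns, unsigned av_ninsns,
               unsigned max_unrolled_insns,
               unsigned max_average_unrolled_insns,
               unsigned max_unroll_times)
  {
    /* Limit by total unrolled size and by average unrolled size.  */
    unsigned nunroll = max_unrolled_insns / ninsns;
    unsigned nunroll_by_av = max_average_unrolled_insns / av_ninsns;
    if (nunroll > nunroll_by_av)
      nunroll = nunroll_by_av;
    /* Limit by the maximum number of unroll times.  */
    if (nunroll > max_unroll_times)
      nunroll = max_unroll_times;
    return nunroll;
  }

  int
  main (void)
  {
    /* Small loop: 12 insns, 10 per iteration on average.  */
    printf ("%u\n", cap_nunroll (12, 10, 200, 80, 8));  /* prints 8 */
    /* Bigger loop: 60 insns, 50 per iteration on average.  */
    printf ("%u\n", cap_nunroll (60, 50, 200, 80, 8));  /* prints 1 */
    return 0;
  }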

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-21 21:45   ` Segher Boessenkool
  2021-01-22 12:50     ` Richard Sandiford
@ 2021-01-22 13:47     ` Richard Biener
  2021-01-22 21:37       ` Segher Boessenkool
  1 sibling, 1 reply; 64+ messages in thread
From: Richard Biener @ 2021-01-22 13:47 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Kewen.Lin, GCC Patches, Richard Sandiford, Bill Schmidt

On Thu, 21 Jan 2021, Segher Boessenkool wrote:

> Hi!
> 
> What is holding up this patch still?  Ke Wen has pinged it every month
> since May, and there has still not been a review.

I don't like it, it feels wrong but I don't have a good suggestion
that had positive feedback.  Since a reviewer / approver is indirectly
responsible for at least the design I do not want to ack this patch.
Bin made forward progress on the other parts of the series but clearly
there's somebody missing with the appropriate privileges who feels
positive about the patch and its general direction.

Sorry to be of no help here.

Richard.

> 
> Segher
> 
> 
> On Thu, May 28, 2020 at 08:19:59PM +0800, Kewen.Lin wrote:
> > 
> > gcc/ChangeLog
> > 
> > 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
> > 
> > 	* cfgloop.h (struct loop): New field estimated_unroll.
> > 	* tree-ssa-loop-manip.c (decide_unroll_const_iter): New function.
> > 	(decide_unroll_runtime_iter): Likewise.
> > 	(decide_unroll_stupid): Likewise.
> > 	(estimate_unroll_factor): Likewise.
> > 	* tree-ssa-loop-manip.h (estimate_unroll_factor): New declaration.
> > 	* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
> > 	* tree-ssa-loop.h (tree_average_num_loop_insns): New declaration.
> > 
> > ----
> 
> > ---
> >  gcc/cfgloop.h             |   3 +
> >  gcc/tree-ssa-loop-manip.c | 253 ++++++++++++++++++++++++++++++++++++++++++++++
> >  gcc/tree-ssa-loop-manip.h |   3 +-
> >  gcc/tree-ssa-loop.c       |  33 ++++++
> >  gcc/tree-ssa-loop.h       |   2 +
> >  5 files changed, 292 insertions(+), 2 deletions(-)
> > 
> > diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
> > index 11378ca..c5bcca7 100644
> > --- a/gcc/cfgloop.h
> > +++ b/gcc/cfgloop.h
> > @@ -232,6 +232,9 @@ public:
> >       Other values means unroll with the given unrolling factor.  */
> >    unsigned short unroll;
> >  
> > +  /* Like unroll field above, but it's estimated in middle-end.  */
> > +  unsigned short estimated_unroll;
> > +
> >    /* If this loop was inlined the main clique of the callee which does
> >       not need remapping when copying the loop body.  */
> >    unsigned short owned_clique;
> > diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> > index 120b35b..8a5a1a9 100644
> > --- a/gcc/tree-ssa-loop-manip.c
> > +++ b/gcc/tree-ssa-loop-manip.c
> > @@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "system.h"
> >  #include "coretypes.h"
> >  #include "backend.h"
> > +#include "target.h"
> >  #include "tree.h"
> >  #include "gimple.h"
> >  #include "cfghooks.h"
> > @@ -42,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "cfgloop.h"
> >  #include "tree-scalar-evolution.h"
> >  #include "tree-inline.h"
> > +#include "wide-int.h"
> >  
> >  /* All bitmaps for rewriting into loop-closed SSA go on this obstack,
> >     so that we can free them all at once.  */
> > @@ -1592,3 +1594,254 @@ canonicalize_loop_ivs (class loop *loop, tree *nit, bool bump_in_latch)
> >  
> >    return var_before;
> >  }
> > +
> > +/* Try to determine estimated unroll factor for given LOOP with constant number
> > +   of iterations, mainly referring to decide_unroll_constant_iterations.
> > +    - NITER_DESC holds the number-of-iterations description if it isn't NULL.
> > +    - NUNROLL holds an unroll factor value computed with instruction numbers.
> > +    - ITER holds estimated or likely max loop iterations.
> > +   Return true if it succeeds, also update estimated_unroll.  */
> > +
> > +static bool
> > +decide_unroll_const_iter (class loop *loop, const tree_niter_desc *niter_desc,
> > +		      unsigned nunroll, const widest_int *iter)
> > +{
> > +  /* Skip big loops.  */
> > +  if (nunroll <= 1)
> > +    return false;
> > +
> > +  gcc_assert (niter_desc && niter_desc->assumptions);
> > +
> > +  /* Check the number of iterations is constant, return false if not.  */
> > +  if ((niter_desc->may_be_zero && !integer_zerop (niter_desc->may_be_zero))
> > +      || !tree_fits_uhwi_p (niter_desc->niter))
> > +    return false;
> > +
> > +  unsigned HOST_WIDE_INT const_niter = tree_to_uhwi (niter_desc->niter);
> > +
> > +  /* If unroll factor is set explicitly, use it as estimated_unroll.  */
> > +  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
> > +    {
> > +      /* It should have been peeled instead.  */
> > +      if (const_niter == 0 || (unsigned) loop->unroll > const_niter - 1)
> > +	loop->estimated_unroll = 1;
> > +      else
> > +	loop->estimated_unroll = loop->unroll;
> > +      return true;
> > +    }
> > +
> > +  /* Check whether the loop rolls enough to consider.  */
> > +  if (const_niter < 2 * nunroll || wi::ltu_p (*iter, 2 * nunroll))
> > +    return false;
> > +
> > +  /* Success; now compute number of iterations to unroll.  */
> > +  unsigned best_unroll = 0, n_copies = 0;
> > +  unsigned best_copies = 2 * nunroll + 10;
> > +  unsigned i = 2 * nunroll + 2;
> > +
> > +  if (i > const_niter - 2)
> > +    i = const_niter - 2;
> > +
> > +  for (; i >= nunroll - 1; i--)
> > +    {
> > +      unsigned exit_mod = const_niter % (i + 1);
> > +
> > +      if (!empty_block_p (loop->latch))
> > +	n_copies = exit_mod + i + 1;
> > +      else if (exit_mod != i)
> > +	n_copies = exit_mod + i + 2;
> > +      else
> > +	n_copies = i + 1;
> > +
> > +      if (n_copies < best_copies)
> > +	{
> > +	  best_copies = n_copies;
> > +	  best_unroll = i;
> > +	}
> > +    }
> > +
> > +  loop->estimated_unroll = best_unroll + 1;
> > +  return true;
> > +}
> > +
> > +/* Try to determine estimated unroll factor for given LOOP with countable but
> > +   non-constant number of iterations, mainly referring to
> > +   decide_unroll_runtime_iterations.
> > +    - NITER_DESC holds the number-of-iterations description if it isn't NULL.
> > +    - NUNROLL_IN holds an unroll factor value computed with instruction numbers.
> > +    - ITER holds estimated or likely max loop iterations.
> > +   Return true if it succeeds, also update estimated_unroll.  */
> > +
> > +static bool
> > +decide_unroll_runtime_iter (class loop *loop, const tree_niter_desc *niter_desc,
> > +			unsigned nunroll_in, const widest_int *iter)
> > +{
> > +  unsigned nunroll = nunroll_in;
> > +  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
> > +    nunroll = loop->unroll;
> > +
> > +  /* Skip big loops.  */
> > +  if (nunroll <= 1)
> > +    return false;
> > +
> > +  gcc_assert (niter_desc && niter_desc->assumptions);
> > +
> > +  /* Skip constant number of iterations.  */
> > +  if ((!niter_desc->may_be_zero || !integer_zerop (niter_desc->may_be_zero))
> > +      && tree_fits_uhwi_p (niter_desc->niter))
> > +    return false;
> > +
> > +  /* Check whether the loop rolls.  */
> > +  if (wi::ltu_p (*iter, 2 * nunroll))
> > +    return false;
> > +
> > +  /* Success; now force nunroll to be power of 2.  */
> > +  unsigned i;
> > +  for (i = 1; 2 * i <= nunroll; i *= 2)
> > +    continue;
> > +
> > +  loop->estimated_unroll = i;
> > +  return true;
> > +}
> > +
> > +/* Try to determine estimated unroll factor for given LOOP with uncountable
> > +   number of iterations, mainly referring to decide_unroll_stupid.
> > +    - NITER_DESC holds the number-of-iterations description if it isn't NULL.
> > +    - NUNROLL_IN holds an unroll factor value computed with instruction numbers.
> > +    - ITER holds estimated or likely max loop iterations.
> > +   Return true if it succeeds, also update estimated_unroll.  */
> > +
> > +static bool
> > +decide_unroll_stupid (class loop *loop, const tree_niter_desc *niter_desc,
> > +		  unsigned nunroll_in, const widest_int *iter)
> > +{
> > +  if (!flag_unroll_all_loops && !loop->unroll)
> > +    return false;
> > +
> > +  unsigned nunroll = nunroll_in;
> > +  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
> > +    nunroll = loop->unroll;
> > +
> > +  /* Skip big loops.  */
> > +  if (nunroll <= 1)
> > +    return false;
> > +
> > +  gcc_assert (!niter_desc || !niter_desc->assumptions);
> > +
> > +  /* Skip loop with multiple branches for now.  */
> > +  if (num_loop_branches (loop) > 1)
> > +    return false;
> > +
> > +  /* Check whether the loop rolls.  */
> > +  if (wi::ltu_p (*iter, 2 * nunroll))
> > +    return false;
> > +
> > +  /* Success; now force nunroll to be power of 2.  */
> > +  unsigned i;
> > +  for (i = 1; 2 * i <= nunroll; i *= 2)
> > +    continue;
> > +
> > +  loop->estimated_unroll = i;
> > +  return true;
> > +}
> > +
> > +/* Try to estimate whether this given LOOP can be unrolled or not, and compute
> > +   its estimated unroll factor if it can.  To avoid duplicated computation, you
> > +   can pass number of iterations information by DESC.  The heuristics mainly
> > +   refer to decide_unrolling in loop-unroll.c.  */
> > +
> > +void
> > +estimate_unroll_factor (class loop *loop, tree_niter_desc *desc)
> > +{
> > +  /* Return the existing estimated unroll factor.  */
> > +  if (loop->estimated_unroll)
> > +    return;
> > +
> > +  /* Don't unroll explicitly.  */
> > +  if (loop->unroll == 1)
> > +    {
> > +      loop->estimated_unroll = loop->unroll;
> > +      return;
> > +    }
> > +
> > +  /* Like decide_unrolling, don't unroll if:
> > +     1) the loop is cold.
> > +     2) the loop can't be manipulated.
> > +     3) the loop isn't innermost.  */
> > +  if (optimize_loop_for_size_p (loop) || !can_duplicate_loop_p (loop)
> > +      || loop->inner != NULL)
> > +    {
> > +      loop->estimated_unroll = 1;
> > +      return;
> > +    }
> > +
> > +  /* Don't unroll without explicit information.  */
> > +  if (!loop->unroll && !flag_unroll_loops && !flag_unroll_all_loops)
> > +    {
> > +      loop->estimated_unroll = 1;
> > +      return;
> > +    }
> > +
> > +  /* Check for instruction number and average instruction number.  */
> > +  loop->ninsns = tree_num_loop_insns (loop, &eni_size_weights);
> > +  loop->av_ninsns = tree_average_num_loop_insns (loop, &eni_size_weights);
> > +  unsigned nunroll = param_max_unrolled_insns / loop->ninsns;
> > +  unsigned nunroll_by_av = param_max_average_unrolled_insns / loop->av_ninsns;
> > +
> > +  if (nunroll > nunroll_by_av)
> > +    nunroll = nunroll_by_av;
> > +  if (nunroll > (unsigned) param_max_unroll_times)
> > +    nunroll = param_max_unroll_times;
> > +
> > +  if (targetm.loop_unroll_adjust)
> > +    nunroll = targetm.loop_unroll_adjust (nunroll, loop);
> > +
> > +  tree_niter_desc *niter_desc = NULL;
> > +  bool desc_need_delete = false;
> > +
> > +  /* Compute the number of iterations if needed.  */
> > +  if (!desc)
> > +    {
> > +      /* For now, use single_dom_exit for simplicity. TODO: Support multiple
> > +	 exits like find_simple_exit if we find some profitable cases.  */
> > +      niter_desc = XNEW (class tree_niter_desc);
> > +      gcc_assert (niter_desc);
> > +      edge exit = single_dom_exit (loop);
> > +      if (!exit || !number_of_iterations_exit (loop, exit, niter_desc, true))
> > +	{
> > +	  XDELETE (niter_desc);
> > +	  niter_desc = NULL;
> > +	}
> > +      else
> > +	desc_need_delete = true;
> > +    }
> > +  else
> > +    niter_desc = desc;
> > +
> > +  /* For checking whether the loop rolls enough to be considered, also
> > +     consult the loop bounds and profile.  */
> > +  widest_int iterations;
> > +  if (!get_estimated_loop_iterations (loop, &iterations)
> > +      && !get_likely_max_loop_iterations (loop, &iterations))
> > +    iterations = 0;
> > +
> > +  if (niter_desc && niter_desc->assumptions)
> > +    {
> > +      /* For countable loops.  */
> > +      if (!decide_unroll_const_iter (loop, niter_desc, nunroll, &iterations)
> > +	  && !decide_unroll_runtime_iter (loop, niter_desc, nunroll, &iterations))
> > +	loop->estimated_unroll = 1;
> > +    }
> > +  else
> > +    {
> > +      if (!decide_unroll_stupid (loop, niter_desc, nunroll, &iterations))
> > +	loop->estimated_unroll = 1;
> > +    }
> > +
> > +  if (desc_need_delete)
> > +    {
> > +      XDELETE (niter_desc);
> > +      niter_desc = NULL;
> > +    }
> > +}
> > +
> > diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
> > index e789e4f..773a2b3 100644
> > --- a/gcc/tree-ssa-loop-manip.h
> > +++ b/gcc/tree-ssa-loop-manip.h
> > @@ -55,7 +55,6 @@ extern void tree_transform_and_unroll_loop (class loop *, unsigned,
> >  extern void tree_unroll_loop (class loop *, unsigned,
> >  			      edge, class tree_niter_desc *);
> >  extern tree canonicalize_loop_ivs (class loop *, tree *, bool);
> > -
> > -
> > +extern void estimate_unroll_factor (class loop *, tree_niter_desc *);
> >  
> >  #endif /* GCC_TREE_SSA_LOOP_MANIP_H */
> > diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
> > index 5e8365d..25320fb 100644
> > --- a/gcc/tree-ssa-loop.c
> > +++ b/gcc/tree-ssa-loop.c
> > @@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "diagnostic-core.h"
> >  #include "stringpool.h"
> >  #include "attribs.h"
> > +#include "sreal.h"
> >  
> >  
> >  /* A pass making sure loops are fixed up.  */
> > @@ -790,5 +791,37 @@ tree_num_loop_insns (class loop *loop, eni_weights *weights)
> >    return size;
> >  }
> >  
> > +/* Computes an estimated number of insns on average per iteration in LOOP,
> > +   weighted by WEIGHTS.  Refer to function average_num_loop_insns.  */
> >  
> > +unsigned
> > +tree_average_num_loop_insns (class loop *loop, eni_weights *weights)
> > +{
> > +  basic_block *body = get_loop_body (loop);
> > +  gimple_stmt_iterator gsi;
> > +  unsigned bb_size, i;
> > +  sreal nsize = 0;
> > +
> > +  for (i = 0; i < loop->num_nodes; i++)
> > +    {
> > +      bb_size = 0;
> > +      for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi))
> > +	bb_size += estimate_num_insns (gsi_stmt (gsi), weights);
> > +      nsize += (sreal) bb_size
> > +	       * body[i]->count.to_sreal_scale (loop->header->count);
> > +      /* Avoid overflows.   */
> > +      if (nsize > 1000000)
> > +	{
> > +	  free (body);
> > +	  return 1000000;
> > +	}
> > +    }
> > +  free (body);
> > +
> > +  unsigned ret = nsize.to_int ();
> > +  if (!ret)
> > +    ret = 1; /* To avoid division by zero.  */
> > +
> > +  return ret;
> > +}
> >  
> > diff --git a/gcc/tree-ssa-loop.h b/gcc/tree-ssa-loop.h
> > index 9e35125..af36177 100644
> > --- a/gcc/tree-ssa-loop.h
> > +++ b/gcc/tree-ssa-loop.h
> > @@ -67,6 +67,8 @@ public:
> >  extern bool for_each_index (tree *, bool (*) (tree, tree *, void *), void *);
> >  extern char *get_lsm_tmp_name (tree ref, unsigned n, const char *suffix = NULL);
> >  extern unsigned tree_num_loop_insns (class loop *, struct eni_weights *);
> > +extern unsigned tree_average_num_loop_insns (class loop *,
> > +					     struct eni_weights *);
> >  
> >  /* Returns the loop of the statement STMT.  */
> >  
> > -- 
> > 2.7.4
> > 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 64+ messages in thread
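
One detail of the quoted decide_unroll_runtime_iter and decide_unroll_stupid
helpers worth calling out is that the capped factor is rounded down to a
power of two.  A standalone C model of just that rounding (not GCC code;
the sample values are made up):

  /* Standalone model of the power-of-two rounding used once the unroll
     factor has been capped.  */

  #include <stdio.h>

  static unsigned
  round_down_pow2 (unsigned nunroll)
  {
    unsigned i;
    for (i = 1; 2 * i <= nunroll; i *= 2)
      continue;
    return i;
  }

  int
  main (void)
  {
    /* 6 rounds down to 4, 8 stays 8, 1 stays 1.  */
    printf ("%u %u %u\n", round_down_pow2 (6), round_down_pow2 (8),
            round_down_pow2 (1));
    return 0;
  }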

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-22 13:47     ` Richard Biener
@ 2021-01-22 21:37       ` Segher Boessenkool
  2021-01-25  7:56         ` Richard Biener
  0 siblings, 1 reply; 64+ messages in thread
From: Segher Boessenkool @ 2021-01-22 21:37 UTC (permalink / raw)
  To: Richard Biener; +Cc: Kewen.Lin, GCC Patches, Richard Sandiford, Bill Schmidt

On Fri, Jan 22, 2021 at 02:47:06PM +0100, Richard Biener wrote:
> On Thu, 21 Jan 2021, Segher Boessenkool wrote:
> > What is holding up this patch still?  Ke Wen has pinged it every month
> > since May, and there has still not been a review.

Richard Sandiford wrote:
> FAOD (since I'm on cc:), I don't feel qualified to review this.
> Tree-level loop stuff isn't really my area.

And Richard Biener wrote:
> I don't like it, it feels wrong but I don't have a good suggestion
> that had positive feedback.  Since a reviewer / approver is indirectly
> responsible for at least the design I do not want to ack this patch.
> Bin made forward progress on the other parts of the series but clearly
> there's somebody missing with the appropriate privileges who feels
> positive about the patch and its general direction.
> 
> Sorry to be of no help here.

How unfortunate :-(

So, first off, this will then have to work for next stage 1 to make any
progress.  Rats.

But what could have been done differently that would have helped?  Of
course Ke Wen could have written a better patch (aka one that is more
acceptable); either of you could have made your current replies earlier,
so that it is clear help needs to be sought elsewhere; and I could have
pushed people earlier, too.  No one really did anything wrong, I'm not
seeking who to blame, I'm just trying to find out how to prevent
deadlocks like this in the future (where one party waits for replies
that will never come).

Is it just that we have a big gaping hole in reviewers with experience
in such loop optimisations?


Segher

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-22 21:37       ` Segher Boessenkool
@ 2021-01-25  7:56         ` Richard Biener
  2021-01-25 17:59           ` Richard Sandiford
  2021-01-26  8:36           ` Kewen.Lin
  0 siblings, 2 replies; 64+ messages in thread
From: Richard Biener @ 2021-01-25  7:56 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Kewen.Lin, GCC Patches, Richard Sandiford, Bill Schmidt

On Fri, 22 Jan 2021, Segher Boessenkool wrote:

> On Fri, Jan 22, 2021 at 02:47:06PM +0100, Richard Biener wrote:
> > On Thu, 21 Jan 2021, Segher Boessenkool wrote:
> > > What is holding up this patch still?  Ke Wen has pinged it every month
> > > since May, and there has still not been a review.
> 
> Richard Sandiford wrote:
> > FAOD (since I'm on cc:), I don't feel qualified to review this.
> > Tree-level loop stuff isn't really my area.
> 
> And Richard Biener wrote:
> > I don't like it, it feels wrong but I don't have a good suggestion
> > that had positive feedback.  Since a reviewer / approver is indirectly
> > responsible for at least the design I do not want to ack this patch.
> > Bin made forward progress on the other parts of the series but clearly
> > there's somebody missing with the appropriate privileges who feels
> > positive about the patch and its general direction.
> > 
> > Sorry to be of no help here.
> 
> How unfortunate :-(
> 
> So, first off, this will then have to work for next stage 1 to make any
> progress.  Rats.
> 
> But what could have been done differently that would have helped?  Of
> course Ke Wen could have written a better patch (aka one that is more
> acceptable); either of you could have made your current replies earlier,
> so that it is clear help needs to be sought elsewhere; and I could have
> pushed people earlier, too.  No one really did anything wrong, I'm not
> seeking who to blame, I'm just trying to find out how to prevent
> deadlocks like this in the future (where one party waits for replies
> that will never come).
> 
> Is it just that we have a big gaping hole in reviewers with experience
> in such loop optimisations?

May be.  But what I think is the biggest problem is that we do not
have a good way to achieve what the patch tries (if you review the
communications you'll see many ideas tossed around) first and foremost
because IV selection is happening early on GIMPLE and unrolling
happens late on RTL.  Both need a quite accurate estimate of costs
but unrolling has an even harder time than IV selection where we've
got along with throwing dummy RTL at costing functions.

IMHO the patch is the wrong "start" to try fixing the issue and my
fear is that wiring this kind of "features" into the current
(fundamentally broken) state will make it much harder to rework
that state without introducing regressions on said features (I'm
there with trying to turn the vectorizer upside down - for three
years now, struggling to not regress any of the "features" we've
accumulated for various targets where most of them feel a
"bolted-on" rather than well-designed ;/).

I think IV selection and unrolling (and scheduling FWIW) need to move
closer together.  I do not have a good idea how that can work out
though but I very much believe that this "most wanted" GIMPLE unroller
will not be a good way of progressing here.  Maybe taking the bullet
and moving IV selection back to RTL is the answer.

For a "short term" solution I still think that trying to perform
unrolling and IV selection (for the D-form case you're targeting)
at the same time is a better design, even if it means complicating
the IV selection pass (and yeah, it'll still be at GIMPLE and w/o
any good idea about scheduling).  There are currently 20+ GIMPLE
optimization passes and 10+ RTL optimization passes between
IV selection and unrolling, the idea that you can have transform
decision and transform apply this far apart looks scary.

Richard.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-25  7:56         ` Richard Biener
@ 2021-01-25 17:59           ` Richard Sandiford
  2021-01-25 20:37             ` Segher Boessenkool
                               ` (2 more replies)
  2021-01-26  8:36           ` Kewen.Lin
  1 sibling, 3 replies; 64+ messages in thread
From: Richard Sandiford @ 2021-01-25 17:59 UTC (permalink / raw)
  To: Richard Biener; +Cc: Segher Boessenkool, Bill Schmidt, GCC Patches

Richard Biener <rguenther@suse.de> writes:
> On Fri, 22 Jan 2021, Segher Boessenkool wrote:
>
>> On Fri, Jan 22, 2021 at 02:47:06PM +0100, Richard Biener wrote:
>> > On Thu, 21 Jan 2021, Segher Boessenkool wrote:
>> > > What is holding up this patch still?  Ke Wen has pinged it every month
>> > > since May, and there has still not been a review.
>> 
>> Richard Sandiford wrote:
>> > FAOD (since I'm on cc:), I don't feel qualified to review this.
>> > Tree-level loop stuff isn't really my area.
>> 
>> And Richard Biener wrote:
>> > I don't like it, it feels wrong but I don't have a good suggestion
>> > that had positive feedback.  Since a reviewer / approver is indirectly
>> > responsible for at least the design I do not want to ack this patch.
>> > Bin made forward progress on the other parts of the series but clearly
>> > there's somebody missing with the appropriate privileges who feels
>> > positive about the patch and its general direction.
>> > 
>> > Sorry to be of no help here.
>> 
>> How unfortunate :-(
>> 
>> So, first off, this will then have to work for next stage 1 to make any
>> progress.  Rats.
>> 
>> But what could have been done differently that would have helped?  Of
>> course Ke Wen could have written a better patch (aka one that is more
>> acceptable); either of you could have made your current replies earlier,
>> so that it is clear help needs to be sought elsewhere; and I could have
>> pushed people earlier, too.  No one really did anything wrong, I'm not
>> seeking who to blame, I'm just trying to find out how to prevent
>> deadlocks like this in the future (where one party waits for replies
>> that will never come).
>> 
>> Is it just that we have a big gaping hole in reviewers with experience
>> in such loop optimisations?
>
> May be.  But what I think is the biggest problem is that we do not
> have a good way to achieve what the patch tries (if you review the
> communications you'll see many ideas tossed around) first and foremost
> because IV selection is happening early on GIMPLE and unrolling
> happens late on RTL.  Both need a quite accurate estimate of costs
> but unrolling has an even harder time than IV selection where we've
> got along with throwing dummy RTL at costing functions.
>
> IMHO the patch is the wrong "start" to try fixing the issue and my
> fear is that wiring this kind of "features" into the current
> (fundamentally broken) state will make it much harder to rework
> that state without introducing regressions on said features (I'm
> there with trying to turn the vectorizer upside down - for three
> years now, struggling to not regress any of the "features" we've
> accumulated for various targets where most of them feel a
> "bolted-on" rather than well-designed ;/).

Thinking of any features in particular here?

Most of the ones I can think of seem to be doing things in the way
that the current infrastructure expects.  But of course, the current
infrastructure isn't perfect, so the end result isn't either.

Still, I agree with the above apart from maybe that last bit. ;-)

> I think IV selection and unrolling (and scheduling FWIW) need to move
> closer together.  I do not have a good idea how that can work out
> though but I very much believe that this "most wanted" GIMPLE unroller
> will not be a good way of progressing here.

What do you feel about unrolling in the vectoriser (by doubling the VF, etc.)
in cases where something about the target indicates that that would be
useful?  I think that's a good place to do it (for the cases that it
handles) because it's hard to unroll later and then interleave.
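
(For concreteness, here is a hypothetical sketch of what doubling the VF
amounts to, written with GCC's generic vector extension; the function
names, the aligned int arrays and the multiple-of-VF trip counts are
assumptions for the example, not anything from the patch series.)

  /* a and b are assumed 16-byte aligned, n a multiple of the step.  */

  typedef int v4si __attribute__ ((vector_size (16)));

  void
  add_vf4 (int *restrict a, int *restrict b, int n)
  {
    /* VF = 4: one vector statement per vector iteration.  */
    for (int i = 0; i < n; i += 4)
      *(v4si *) (a + i) += *(v4si *) (b + i);
  }

  void
  add_vf8 (int *restrict a, int *restrict b, int n)
  {
    /* "VF = 8": the VF-4 loop unrolled by two, with the two copies
       interleaved while the vector loop is generated.  */
    for (int i = 0; i < n; i += 8)
      {
        *(v4si *) (a + i) += *(v4si *) (b + i);
        *(v4si *) (a + i + 4) += *(v4si *) (b + i + 4);
      }
  }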

> Maybe taking the bullet and moving IV selection back to RTL is the
> answer.

I think that would be a bad move.  The trend recently seems to have been
to lower stuff to individual machine operations earlier in the rtl pass
pipeline (often immediately during expand) rather than split them later.
The reasoning behind that is that (1) gimple has already heavily optimised
the unlowered form and (2) lowering earlier gives the more powerful rtl
optimisers a chance to do something with the individual machine operations.
It's going to be hard for an RTL ivopts pass to piece everything back
together.

> For a "short term" solution I still think that trying to perform
> unrolling and IV selection (for the D-form case you're targeting)
> at the same time is a better design, even if it means complicating
> the IV selection pass (and yeah, it'll still be at GIMPLE and w/o
> any good idea about scheduling).  There are currently 20+ GIMPLE
> optimization passes and 10+ RTL optimization passes between
> IV selection and unrolling, the idea that you can have transform
> decision and transform apply this far apart looks scary.

FWIW, another option might be to go back to something like:

  https://gcc.gnu.org/pipermail/gcc-patches/2019-October/532676.html

I agree that it was worth putting that series on hold and trying a more
target-independent approach, but I think in the end it didn't work out,
for the reasons Richard says.  At least the target-specific pass would
be making a strict improvement to the IL that it sees, rather than
having to predict what future passes might do or might want.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-25 17:59           ` Richard Sandiford
@ 2021-01-25 20:37             ` Segher Boessenkool
  2021-01-26  8:53               ` Kewen.Lin
  2021-01-26  8:43             ` Kewen.Lin
  2021-01-26 10:47             ` Richard Biener
  2 siblings, 1 reply; 64+ messages in thread
From: Segher Boessenkool @ 2021-01-25 20:37 UTC (permalink / raw)
  To: Richard Biener, Bill Schmidt, GCC Patches, richard.sandiford

Hi!

On Mon, Jan 25, 2021 at 05:59:23PM +0000, Richard Sandiford wrote:
> Richard Biener <rguenther@suse.de> writes:
> > On Fri, 22 Jan 2021, Segher Boessenkool wrote:
> >> But what could have been done differently that would have helped?  Of
> >> course Ke Wen could have written a better patch (aka one that is more
> >> acceptable); either of you could have made your current replies earlier,
> >> so that it is clear help needs to be sought elsewhere; and I could have
> >> pushed people earlier, too.  No one really did anything wrong, I'm not
> >> seeking who to blame, I'm just trying to find out how to prevent
> >> deadlocks like this in the future (where one party waits for replies
> >> that will never come).
> >> 
> >> Is it just that we have a big gaping hole in reviewers with experience
> >> in such loop optimisations?
> >
> > May be.  But what I think is the biggest problem is that we do not
> > have a good way to achieve what the patch tries (if you review the
> > communications you'll see many ideas tossed around) first and foremost
> > because IV selection is happening early on GIMPLE and unrolling
> > happens late on RTL.  Both need a quite accurate estimate of costs
> > but unrolling has an even harder time than IV selection where we've
> > got along with throwing dummy RTL at costing functions.

GIMPLE already needs at least an *estimate* of how much any loop will
be unrolled (for similar reasons as the IV selection).  The actual
mechanics can happen later (in RTL), and we could even use a different
unroll factor (in some cases) than what we first estimated; but for the
GIMPLE optimisations it can be important to know what the target code
will eventually look like.

> > IMHO the patch is the wrong "start" to try fixing the issue and my
> > fear is that wiring this kind of "features" into the current
> > (fundamentally broken) state will make it much harder to rework
> > that state without introducing regressions on said features (I'm
> > there with trying to turn the vectorizer upside down - for three
> > years now, struggling to not regress any of the "features" we've
> > accumulated for various targets where most of them feel a
> > "bolted-on" rather than well-designed ;/).
> 
> Thinking of any features in particular here?
> 
> Most of the ones I can think of seem to be doing things in the way
> that the current infrastructure expects.  But of course, the current
> infrastructure isn't perfect, so the end result isn't either.
> 
> Still, I agree with the above apart from maybe that last bit. ;-)
> 
> > I think IV selection and unrolling (and scheduling FWIW) need to move
> > closer together.  I do not have a good idea how that can work out
> > though but I very much believe that this "most wanted" GIMPLE unroller
> > will not be a good way of progressing here.
> 
> What do you feel about unrolling in the vectoriser (by doubling the VF, etc.)
> in cases where something about the target indicates that that would be
> useful?  I think that's a good place to do it (for the cases that it
> handles) because it's hard to unroll later and then interleave.

Yeah, in such cases doing the actual unrolling can be quite beneficial.
My gut feeling is you'd do it at the same time as creating the actual
vector code here.

> > Maybe taking the bullet and moving IV selection back to RTL is the
> > answer.
> 
> I think that would be a bad move.

I agree.

> The trend recently seems to have been
> to lower stuff to individual machine operations earlier in the rtl pass
> pipeline (often immediately during expand) rather than split them later.

I've pushed rs6000 towards this for many years :-)

> The reasoning behind that is that (1) gimple has already heavily optimised
> the unlowered form and (2) lowering earlier gives the more powerful rtl
> optimisers a chance to do something with the individual machine operations.

And (3) you need to do much less work at expand then.  More than half of
what expand currently does is a) unnecessary, (not much) later passes
will do the same anyway; or b) harmful premature optimisations.

It indeed helps a lot that GIMPLE has already done pretty much all
optimisations that are not machine-specific :-)

> It's going to be hard for an RTL ivopts pass to piece everything back
> together.

"Loops" is a pretty hard abstraction for RTL already, simply because
RTL is very concrete, is very close to the hardware.  But on the other
hand it can be challenging to have a good cost estimate earlier on.

> > For a "short term" solution I still think that trying to perform
> > unrolling and IV selection (for the D-form case you're targeting)
> > at the same time is a better design, even if it means complicating
> > the IV selection pass (and yeah, it'll still be at GIMPLE and w/o
> > any good idea about scheduling).  There are currently 20+ GIMPLE
> > optimization passes and 10+ RTL optimization passes between
> > IV selection and unrolling, the idea that you can have transform
> > decision and transform apply this far apart looks scary.
> 
> FWIW, another option might be to go back to something like:
> 
>   https://gcc.gnu.org/pipermail/gcc-patches/2019-October/532676.html
> 
> I agree that it was worth putting that series on hold and trying a more
> target-independent approach, but I think in the end it didn't work out,
> for the reasons Richard says.  At least the target-specific pass would
> be making a strict improvement to the IL that it sees, rather than
> having to predict what future passes might do or might want.

In my experience an approach like that patch, i.e. fixing things up
after the "normal" stuff, is not a good way to get good results.  On the
other hand it *is* a good way to "fine tune" your results.  It is a bit
like RTL peepholes (and applies to many more things than loop opts).


Segher

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-25  7:56         ` Richard Biener
  2021-01-25 17:59           ` Richard Sandiford
@ 2021-01-26  8:36           ` Kewen.Lin
  2021-01-26 10:53             ` Richard Biener
  1 sibling, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2021-01-26  8:36 UTC (permalink / raw)
  To: Richard Biener, Segher Boessenkool, Richard Sandiford
  Cc: GCC Patches, Bill Schmidt

Hi Segher/Richard B./Richard S.,

Many thanks for all your help and comments on this!

on 2021/1/25 3:56 PM, Richard Biener wrote:
> On Fri, 22 Jan 2021, Segher Boessenkool wrote:
> 
>> On Fri, Jan 22, 2021 at 02:47:06PM +0100, Richard Biener wrote:
>>> On Thu, 21 Jan 2021, Segher Boessenkool wrote:
>>>> What is holding up this patch still?  Ke Wen has pinged it every month
>>>> since May, and there has still not been a review.
>>
>> Richard Sandiford wrote:
>>> FAOD (since I'm on cc:), I don't feel qualified to review this.
>>> Tree-level loop stuff isn't really my area.
>>
>> And Richard Biener wrote:
>>> I don't like it, it feels wrong but I don't have a good suggestion
>>> that had positive feedback.  Since a reviewer / approver is indirectly
>>> responsible for at least the design I do not want to ack this patch.
>>> Bin made forward progress on the other parts of the series but clearly
>>> there's somebody missing with the appropriate privileges who feels
>>> positive about the patch and its general direction.
>>>
>>> Sorry to be of no help here.
>>
>> How unfortunate :-(
>>
>> So, first off, this will then have to work for next stage 1 to make any
>> progress.  Rats.
>>
>> But what could have been done differently that would have helped?  Of
>> course Ke Wen could have written a better patch (aka one that is more
>> acceptable); either of you could have made your current replies earlier,
>> so that it is clear help needs to be sought elsewhere; and I could have
>> pushed people earlier, too.  No one really did anything wrong, I'm not
>> seeking who to blame, I'm just trying to find out how to prevent
>> deadlocks like this in the future (where one party waits for replies
>> that will never come).
>>
>> Is it just that we have a big gaping hole in reviewers with experience
>> in such loop optimisations?
> 
> May be.  But what I think is the biggest problem is that we do not
> have a good way to achieve what the patch tries (if you review the
> communications you'll see many ideas tossed around) first and foremost
> because IV selection is happening early on GIMPLE and unrolling
> happens late on RTL.  Both need a quite accurate estimate of costs
> but unrolling has an even harder time than IV selection where we've
> got along with throwing dummy RTL at costing functions.
> 

Yeah, exactly.

> IMHO the patch is the wrong "start" to try fixing the issue and my
> fear is that wiring this kind of "features" into the current
> (fundamentally broken) state will make it much harder to rework
> that state without introducing regressions on said features (I'm
> there with trying to turn the vectorizer upside down - for three
> years now, struggling to not regress any of the "features" we've
> accumulated for various targets where most of them feel a
> "bolted-on" rather than well-designed ;/).
> 

OK, understandable.

> I think IV selection and unrolling (and scheduling FWIW) need to move
> closer together.  I do not have a good idea how that can work out
> though but I very much believe that this "most wanted" GIMPLE unroller
> will not be a good way of progressing here.  Maybe taking the bullet
> and moving IV selection back to RTL is the answer.
> 

I haven't looked into loop-iv.c, but IVOPTS on GIMPLE can leverage
SCEV analysis for IV detection; if we moved it to RTL, wouldn't it be
much more expensive to detect the full IV set there?

> For a "short term" solution I still think that trying to perform
> unrolling and IV selection (for the D-form case you're targeting)
> at the same time is a better design, even if it means complicating
> the IV selection pass (and yeah, it'll still be at GIMPLE and w/o
> any good idea about scheduling).  There are currently 20+ GIMPLE
> optimization passes and 10+ RTL optimization passes between
> IV selection and unrolling, the idea that you can have transform
> decision and transform apply this far apart looks scary.
> 

I have some questions in mind for this part.  "Perform unrolling
and IV selection at the same time" can be interpreted in two
different ways, as far as implementation goes:

1) Run one gimple unrolling pass just before IVOPTS, probably using
   the same gate as IVOPTS.  The unrolling factor is computed by
   the same method as that of RTL unrolling.  But this sounds very
   much like the "most wanted gimple unrolling" which is what we want
   to avoid.

   The positive aspect here is that what IVOPTS faces is already an
   unrolled loop, so it can see the whole picture and get the optimal
   IV set.  The downside/question is how we position these gimple
   unrolling and RTL unrolling passes, and whether we still need RTL
   unrolling.  If not, it's doubtful that one gimple unrolling pass
   can fully replace the RTL unrolling, since it probably lacks some
   actual target information/instructions.  If yes, it's still possible
   to have inconsistent unrolling factors between what IVOPTS optimizes
   on and what the late RTL unrolling pass ends up with.

2) Make IVOPTS determine the unrolling factor by considering the
   reg-offset addressing (D-form), unroll the loop, and do the rest.

   I don't think you referred to this, though.  Since, compared to
   reg-reg addressing, reg-offset addressing can only save some IV
   bumps, it's too weak to be a deciding factor for the unrolling
   factor.  Unlike the vectorizer or modulo scheduling, there are
   likely more important factors for the unrolling factor computation
   than this one.

   For IVOPTS, it's more that it doesn't care what the unrolling
   factor should be, but it needs to know what the unrolling factor
   would be, and then do optimal IV selection based on that.  So it's
   not a good place to decide the unrolling factor.


BR,
Kewen

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-25 17:59           ` Richard Sandiford
  2021-01-25 20:37             ` Segher Boessenkool
@ 2021-01-26  8:43             ` Kewen.Lin
  2021-01-26 10:47             ` Richard Biener
  2 siblings, 0 replies; 64+ messages in thread
From: Kewen.Lin @ 2021-01-26  8:43 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Segher Boessenkool, Richard Biener, Bill Schmidt

on 2021/1/26 1:59 AM, Richard Sandiford via Gcc-patches wrote:
> Richard Biener <rguenther@suse.de> writes:
>> On Fri, 22 Jan 2021, Segher Boessenkool wrote:
>>
>>> On Fri, Jan 22, 2021 at 02:47:06PM +0100, Richard Biener wrote:
>>>> On Thu, 21 Jan 2021, Segher Boessenkool wrote:
>>>>> What is holding up this patch still?  Ke Wen has pinged it every month
>>>>> since May, and there has still not been a review.
>>>
>>> Richard Sandiford wrote:
>>>> FAOD (since I'm on cc:), I don't feel qualified to review this.
>>>> Tree-level loop stuff isn't really my area.
>>>
>>> And Richard Biener wrote:
>>>> I don't like it, it feels wrong but I don't have a good suggestion
>>>> that had positive feedback.  Since a reviewer / approver is indirectly
>>>> responsible for at least the design I do not want to ack this patch.
>>>> Bin made forward progress on the other parts of the series but clearly
>>>> there's somebody missing with the appropriate privileges who feels
>>>> positive about the patch and its general direction.
>>>>
>>>> Sorry to be of no help here.
>>>
>>> How unfortunate :-(
>>>
>>> So, first off, this will then have to work for next stage 1 to make any
>>> progress.  Rats.
>>>
>>> But what could have been done differently that would have helped?  Of
>>> course Ke Wen could have written a better patch (aka one that is more
>>> acceptable); either of you could have made your current replies earlier,
>>> so that it is clear help needs to be sought elsewhere; and I could have
>>> pushed people earlier, too.  No one really did anything wrong, I'm not
>>> seeking who to blame, I'm just trying to find out how to prevent
>>> deadlocks like this in the future (where one party waits for replies
>>> that will never come).
>>>
>>> Is it just that we have a big gaping hole in reviewers with experience
>>> in such loop optimisations?
>>
>> May be.  But what I think is the biggest problem is that we do not
>> have a good way to achieve what the patch tries (if you review the
>> communications you'll see many ideas tossed around) first and foremost
>> because IV selection is happening early on GIMPLE and unrolling
>> happens late on RTL.  Both need a quite accurate estimate of costs
>> but unrolling has an ever harder time than IV selection where we've
>> got along with throwing dummy RTL at costing functions.
>>
>> IMHO the patch is the wrong "start" to try fixing the issue and my
>> fear is that wiring this kind of "features" into the current
>> (fundamentally broken) state will make it much harder to rework
>> that state without introducing regressions on said features (I'm
>> there with trying to turn the vectorizer upside down - for three
>> years now, struggling to not regress any of the "features" we've
>> accumulated for various targets where most of them feel a
>> "bolted-on" rather than well-designed ;/).
> 
> Thinking of any features in particular here?
> 
> Most of the ones I can think of seem to be doing things in the way
> that the current infrastructure expects.  But of course, the current
> infrastructure isn't perfect, so the end result isn't either.
> 
> Still, I agree with the above apart from maybe that last bit. ;-)
> 
>> I think IV selection and unrolling (and scheduling FWIW) need to move
>> closer together.  I do not have a good idea how that can work out
>> though but I very much believe that this "most wanted" GIMPLE unroller
>> will not be a good way of progressing here.
> 
> What do you feel about unrolling in the vectoriser (by doubling the VF, etc.)
> in cases where something about the target indicates that that would be
> useful?  I think that's a good place to do it (for the cases that it
> handles) because it's hard to unroll later and then interleave.
> 
>> Maybe taking the bullet and moving IV selection back to RTL is the
>> answer.
> 
> I think that would be a bad move.  The trend recently seems to have been
> to lower stuff to individual machine operations earlier in the rtl pass
> pipeline (often immediately during expand) rather than split them later.
> The reasoning behind that is that (1) gimple has already heavily optimised
> the unlowered form and (2) lowering earlier gives the more powerful rtl
> optimisers a chance to do something with the individual machine operations.
> It's going to be hard for an RTL ivopts pass to piece everything back
> together.
> 
>> For a "short term" solution I still think that trying to perform
>> unrolling and IV selection (for the D-form case you're targeting)
>> at the same time is a better design, even if it means complicating
>> the IV selection pass (and yeah, it'll still be at GIMPLE and w/o
>> any good idea about scheduling).  There are currently 20+ GIMPLE
>> optimization passes and 10+ RTL optimization passes between
>> IV selection and unrolling, the idea that you can have transform
>> decision and transform apply this far apart looks scary.
> 
> FWIW, another option might be to go back to something like:
> 
>   https://gcc.gnu.org/pipermail/gcc-patches/2019-October/532676.html
> 
> I agree that it was worth putting that series on hold and trying a more
> target-independent approach, but I think in the end it didn't work out,
> for the reasons Richard says.  At least the target-specific pass would
> be making a strict improvement to the IL that it sees, rather than
> having to predict what future passes might do or might want.
> 

Yeah, I also had this thought in mind: if we cannot find a good
target-independent approach, it seems reasonable to revisit this series
and go with the target-specific approach.

BR,
Kewen

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-25 20:37             ` Segher Boessenkool
@ 2021-01-26  8:53               ` Kewen.Lin
  2021-01-26 17:31                 ` Segher Boessenkool
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2021-01-26  8:53 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Bill Schmidt, GCC Patches, richard.sandiford, Richard Biener

on 2021/1/26 4:37 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Jan 25, 2021 at 05:59:23PM +0000, Richard Sandiford wrote:
>> Richard Biener <rguenther@suse.de> writes:
>>> On Fri, 22 Jan 2021, Segher Boessenkool wrote:
>>>> But what could have been done differently that would have helped?  Of
>>>> course Ke Wen could have written a better patch (aka one that is more
>>>> acceptable); either of you could have made your current replies earlier,
>>>> so that it is clear help needs to be sought elsewhere; and I could have
>>>> pushed people earlier, too.  No one really did anything wrong, I'm not
>>>> seeking who to blame, I'm just trying to find out how to prevent
>>>> deadlocks like this in the future (where one party waits for replies
>>>> that will never come).
>>>>
>>>> Is it just that we have a big gaping hole in reviewers with experience
>>>> in such loop optimisations?
>>>
>>> May be.  But what I think is the biggest problem is that we do not
>>> have a good way to achieve what the patch tries (if you review the
>>> communications you'll see many ideas tossed around) first and foremost
>>> because IV selection is happening early on GIMPLE and unrolling
>>> happens late on RTL.  Both need a quite accurate estimate of costs
>>> but unrolling has an ever harder time than IV selection where we've
>>> got along with throwing dummy RTL at costing functions.
> 
> GIMPLE already needs at least an *estimate* of how much any loop will
> be unrolled (for similar reasons as the IV selection).  The actual
> mechanics can happen later (in RTL), and we could even use a different
> unroll factor (in some cases) than what we first estimated; but for the
> GIMPLE optimisations it can be important to know what the target code
> will eventually look like.
> 

Yeah, it was discussed/mentioned that the estimated result can be used
by other passes too.  But I'm not sure whether we already know of other
passes that suffer from this kind of problem.


BR,
Kewen

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-25 17:59           ` Richard Sandiford
  2021-01-25 20:37             ` Segher Boessenkool
  2021-01-26  8:43             ` Kewen.Lin
@ 2021-01-26 10:47             ` Richard Biener
  2021-01-26 17:54               ` Segher Boessenkool
  2 siblings, 1 reply; 64+ messages in thread
From: Richard Biener @ 2021-01-26 10:47 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Segher Boessenkool, Bill Schmidt, GCC Patches

On Mon, 25 Jan 2021, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Fri, 22 Jan 2021, Segher Boessenkool wrote:
> >
> >> On Fri, Jan 22, 2021 at 02:47:06PM +0100, Richard Biener wrote:
> >> > On Thu, 21 Jan 2021, Segher Boessenkool wrote:
> >> > > What is holding up this patch still?  Ke Wen has pinged it every month
> >> > > since May, and there has still not been a review.
> >> 
> >> Richard Sandiford wrote:
> >> > FAOD (since I'm on cc:), I don't feel qualified to review this.
> >> > Tree-level loop stuff isn't really my area.
> >> 
> >> And Richard Biener wrote:
> >> > I don't like it, it feels wrong but I don't have a good suggestion
> >> > that had positive feedback.  Since a reviewer / approver is indirectly
> >> > responsible for at least the design I do not want to ack this patch.
> >> > Bin made forward progress on the other parts of the series but clearly
> >> > there's somebody missing with the appropriate privileges who feels
> >> > positive about the patch and its general direction.
> >> > 
> >> > Sorry to be of no help here.
> >> 
> >> How unfortunate :-(
> >> 
> >> So, first off, this will then have to work for next stage 1 to make any
> >> progress.  Rats.
> >> 
> >> But what could have been done differently that would have helped?  Of
> >> course Ke Wen could have written a better patch (aka one that is more
> >> acceptable); either of you could have made your current replies earlier,
> >> so that it is clear help needs to be sought elsewhere; and I could have
> >> pushed people earlier, too.  No one really did anything wrong, I'm not
> >> seeking who to blame, I'm just trying to find out how to prevent
> >> deadlocks like this in the future (where one party waits for replies
> >> that will never come).
> >> 
> >> Is it just that we have a big gaping hole in reviewers with experience
> >> in such loop optimisations?
> >
> > May be.  But what I think is the biggest problem is that we do not
> > have a good way to achieve what the patch tries (if you review the
> > communications you'll see many ideas tossed around) first and foremost
> > because IV selection is happening early on GIMPLE and unrolling
> > happens late on RTL.  Both need a quite accurate estimate of costs
> > but unrolling has an ever harder time than IV selection where we've
> > got along with throwing dummy RTL at costing functions.
> >
> > IMHO the patch is the wrong "start" to try fixing the issue and my
> > fear is that wiring this kind of "features" into the current
> > (fundamentally broken) state will make it much harder to rework
> > that state without introducing regressions on said features (I'm
> > there with trying to turn the vectorizer upside down - for three
> > years now, struggling to not regress any of the "features" we've
> > accumulated for various targets where most of them feel a
> > "bolted-on" rather than well-designed ;/).
> 
> Thinking of any features in particular here?

Mostly all of the special-cases in load/store vectorization, but then...

> Most of the ones I can think of seem to be doing things in the way
> that the current infrastructure expects.  But of course, the current
> infrastructure isn't perfect, so the end result isn't either.

... indeed most of the issues are because of design decisions made very
early in the vectorizer's lifetime.

> Still, I agree with the above apart from maybe that last bit. ;-)
> 
> > I think IV selection and unrolling (and scheduling FWIW) need to move
> > closer together.  I do not have a good idea how that can work out
> > though but I very much believe that this "most wanted" GIMPLE unroller
> > will not be a good way of progressing here.
> 
> What do you feel about unrolling in the vectoriser (by doubling the VF, etc.)
> in cases where something about the target indicates that that would be
> useful?  I think that's a good place to do it (for the cases that it
> handles) because it's hard to unroll later and then interleave.

I think that's fine (similar to unrolling during IV selection).
Unrolling if there's a reason (other than unrolling) is good.  Of course
some costing has to be applied which might be difficult on GIMPLE
(register pressure, CPU resource allocation aka scheduling).

> > Maybe taking the bullet and moving IV selection back to RTL is the
> > answer.
> 
> I think that would be a bad move.  The trend recently seems to have been
> to lower stuff to individual machine operations earlier in the rtl pass
> pipeline (often immediately during expand) rather than split them later.
> The reasoning behind that is that (1) gimple has already heavily optimised
> the unlowered form and (2) lowering earlier gives the more powerful rtl
> optimisers a chance to do something with the individual machine operations.
> It's going to be hard for an RTL ivopts pass to piece everything back
> together.

Hmm, OK.  But of course it is the individual machine instructions that
determine the cost ...

Anyway, I think the GIMPLE -> RTL transition currently is a too
big step and we should eventually try to lower GIMPLE and IV
selection should happen on a lower form where we for example
can do the 1st level scheduling on or at least have a better idea
on resource allocation and latencies in dependence cycles since
that's what really matters for unrolling.

> > For a "short term" solution I still think that trying to perform
> > unrolling and IV selection (for the D-form case you're targeting)
> > at the same time is a better design, even if it means complicating
> > the IV selection pass (and yeah, it'll still be at GIMPLE and w/o
> > any good idea about scheduling).  There are currently 20+ GIMPLE
> > optimization passes and 10+ RTL optimization passes between
> > IV selection and unrolling, the idea that you can have transform
> > decision and transform apply this far apart looks scary.
> 
> FWIW, another option might be to go back to something like:
> 
>   https://gcc.gnu.org/pipermail/gcc-patches/2019-October/532676.html
> 
> I agree that it was worth putting that series on hold and trying a more
> target-independent approach, but I think in the end it didn't work out,
> for the reasons Richard says.  At least the target-specific pass would
> be making a strict improvement to the IL that it sees, rather than
> having to predict what future passes might do or might want.
> 
> Thanks,
> Richard
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-26  8:36           ` Kewen.Lin
@ 2021-01-26 10:53             ` Richard Biener
  2021-01-27  9:43               ` Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Richard Biener @ 2021-01-26 10:53 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Segher Boessenkool, Richard Sandiford, GCC Patches, Bill Schmidt

On Tue, 26 Jan 2021, Kewen.Lin wrote:

> Hi Segher/Richard B./Richard S.,
> 
> Many thanks for your all helps and comments on this!
> 
> on 2021/1/25 3:56 PM, Richard Biener wrote:
> > On Fri, 22 Jan 2021, Segher Boessenkool wrote:
> > 
> >> On Fri, Jan 22, 2021 at 02:47:06PM +0100, Richard Biener wrote:
> >>> On Thu, 21 Jan 2021, Segher Boessenkool wrote:
> >>>> What is holding up this patch still?  Ke Wen has pinged it every month
> >>>> since May, and there has still not been a review.
> >>
> >> Richard Sandiford wrote:
> >>> FAOD (since I'm on cc:), I don't feel qualified to review this.
> >>> Tree-level loop stuff isn't really my area.
> >>
> >> And Richard Biener wrote:
> >>> I don't like it, it feels wrong but I don't have a good suggestion
> >>> that had positive feedback.  Since a reviewer / approver is indirectly
> >>> responsible for at least the design I do not want to ack this patch.
> >>> Bin made forward progress on the other parts of the series but clearly
> >>> there's somebody missing with the appropriate privileges who feels
> >>> positive about the patch and its general direction.
> >>>
> >>> Sorry to be of no help here.
> >>
> >> How unfortunate :-(
> >>
> >> So, first off, this will then have to work for next stage 1 to make any
> >> progress.  Rats.
> >>
> >> But what could have been done differently that would have helped?  Of
> >> course Ke Wen could have written a better patch (aka one that is more
> >> acceptable); either of you could have made your current replies earlier,
> >> so that it is clear help needs to be sought elsewhere; and I could have
> >> pushed people earlier, too.  No one really did anything wrong, I'm not
> >> seeking who to blame, I'm just trying to find out how to prevent
> >> deadlocks like this in the future (where one party waits for replies
> >> that will never come).
> >>
> >> Is it just that we have a big gaping hole in reviewers with experience
> >> in such loop optimisations?
> > 
> > May be.  But what I think is the biggest problem is that we do not
> > have a good way to achieve what the patch tries (if you review the
> > communications you'll see many ideas tossed around) first and foremost
> > because IV selection is happening early on GIMPLE and unrolling
> > happens late on RTL.  Both need a quite accurate estimate of costs
> > but unrolling has an ever harder time than IV selection where we've
> > got along with throwing dummy RTL at costing functions.
> > 
> 
> Yeah, exactly.
> 
> > IMHO the patch is the wrong "start" to try fixing the issue and my
> > fear is that wiring this kind of "features" into the current
> > (fundamentally broken) state will make it much harder to rework
> > that state without introducing regressions on said features (I'm
> > there with trying to turn the vectorizer upside down - for three
> > years now, struggling to not regress any of the "features" we've
> > accumulated for various targets where most of them feel a
> > "bolted-on" rather than well-designed ;/).
> > 
> 
> OK, understandable.
> 
> > I think IV selection and unrolling (and scheduling FWIW) need to move
> > closer together.  I do not have a good idea how that can work out
> > though but I very much believe that this "most wanted" GIMPLE unroller
> > will not be a good way of progressing here.  Maybe taking the bullet
> > and moving IV selection back to RTL is the answer.
> > 
> 
> I haven't looked into loop-iv.c, but IVOPTS in gimple can leverage
> SCEV analysis for iv detection, if moving it to RTL, it could be
> very heavier to detect the full set there?
> 
> > For a "short term" solution I still think that trying to perform
> > unrolling and IV selection (for the D-form case you're targeting)
> > at the same time is a better design, even if it means complicating
> > the IV selection pass (and yeah, it'll still be at GIMPLE and w/o
> > any good idea about scheduling).  There are currently 20+ GIMPLE
> > optimization passes and 10+ RTL optimization passes between
> > IV selection and unrolling, the idea that you can have transform
> > decision and transform apply this far apart looks scary.
> > 
> 
> I have some questions in mind for this part, for "perform unrolling
> and IV selection at the same time", it can be interpreted to two
> different implementation ways to me:
> 
> 1) Run one gimple unrolling pass just before IVOPTS, probably using
>    the same gate for IVOPTS.  The unrolling factor is computed by
>    the same method as that of RTL unrolling.  But this sounds very
>    like "most wanted gimple unrolling" which is what we want to avoid.
> 
>    The positive aspect here is what IVOPTS faces is already one unrolled
>    loop, it can see the whole picture and get the optimal IV set.  The
>    downside/question is how we position these gimple unrolling and RTL
>    unrolling passes, whether we still need RTL unrolling.  If no, it's
>    doubtable that one gimple unrolling can fully replace the RTL
>    unrolling probably lacking some actual target information/instructions.
>    If yes, it's still possible to have inconsistent unrolling factors
>    between what IVOPTS optimizes on and what late RTL unrolling pass
>    ends with.
> 
> 2) Make IVOPTS determine the unrolling factor by considering the
>    reg-offset addressing (D-form), unroll the loop and do the remainings.

Yes, that's what I meant.

>    I don't think you referred to this though.  Since comparing to
>    reg-reg addressing, reg-offset addressing can only save some bumps,
>    it's too weak to be one deciding factor of unrolling factor.  Unlike
>    vectorizer or modulo scheduling, it's more likely there are more
>    important factors for unrolling factor computation than this one.

But you _are_ using that bump cost to compute the unroll estimate
(and IIRC you are doing the D-form IV selection then).

>    For IVOPTS, it's more like that it doesn't care what the unrolling
>    factor should be, but it needs to know what the unrolling factor
>    would be, then do optimal IV selection based on that.  So it's not
>    good to get it to decide the unrolling factor.

Well, currently with the stupid RTL unroller you will know it will
be unrolled 8 times unless it is cold or doesn't iterate enough.
There's not much costing going on in the RTL unroller (unless you
go wild with target hooks).

Richard.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-26  8:53               ` Kewen.Lin
@ 2021-01-26 17:31                 ` Segher Boessenkool
  0 siblings, 0 replies; 64+ messages in thread
From: Segher Boessenkool @ 2021-01-26 17:31 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: Bill Schmidt, GCC Patches, richard.sandiford, Richard Biener

On Tue, Jan 26, 2021 at 04:53:25PM +0800, Kewen.Lin wrote:
> on 2021/1/26 4:37 AM, Segher Boessenkool wrote:
> > On Mon, Jan 25, 2021 at 05:59:23PM +0000, Richard Sandiford wrote:
> >> Richard Biener <rguenther@suse.de> writes:
> >>> On Fri, 22 Jan 2021, Segher Boessenkool wrote:
> >>>> But what could have been done differently that would have helped?  Of
> >>>> course Ke Wen could have written a better patch (aka one that is more
> >>>> acceptable); either of you could have made your current replies earlier,
> >>>> so that it is clear help needs to be sought elsewhere; and I could have
> >>>> pushed people earlier, too.  No one really did anything wrong, I'm not
> >>>> seeking who to blame, I'm just trying to find out how to prevent
> >>>> deadlocks like this in the future (where one party waits for replies
> >>>> that will never come).
> >>>>
> >>>> Is it just that we have a big gaping hole in reviewers with experience
> >>>> in such loop optimisations?
> >>>
> >>> May be.  But what I think is the biggest problem is that we do not
> >>> have a good way to achieve what the patch tries (if you review the
> >>> communications you'll see many ideas tossed around) first and foremost
> >>> because IV selection is happening early on GIMPLE and unrolling
> >>> happens late on RTL.  Both need a quite accurate estimate of costs
> >>> but unrolling has an ever harder time than IV selection where we've
> >>> got along with throwing dummy RTL at costing functions.
> > 
> > GIMPLE already needs at least an *estimate* of how much any loop will
> > be unrolled (for similar reasons as the IV selection).  The actual
> > mechanics can happen later (in RTL), and we could even use a different
> > unroll factor (in some cases) than what we first estimated; but for the
> > GIMPLE optimisations it can be important to know what the target code
> > will eventually look like.
> 
> Yeah, this point was discussed/mentioned that the estimated result
> can be used for other passes too.  But I'm not sure whether we have
> already known some other passes who suffer this kind of similar problem.

As concrete examples, look no further than everything vectorisation; but
much more general, everything that chooses between multiple options, so
almost *everything*, needs a good estimate of how expensive those
options are.

In all cases we are working on a highly idealised code representation
(GIMPLE), but the actual costs are the costs of the machine code we
eventually generate.  In simple cases this isn't a big problem: A < 2A
whenever A > 0, and A+2B < 2A+3B, and we can even reason without too
much trouble that A+2B < 2A+B whenever B < A (everything > 0), but as
soon as we have a slightly more complex situation (more variables, or
not linear, etc.) things are much harder, and we really have to consider
what code we eventually generate: it no longer can be abstracted away.

IVOPTS does this.  Various vectorisation things have to do this (I don't
know how much is done currently, frantic waving of arms, but it is
necessary to get good results).  Anything that makes decisions that have
a bigger effect on the generated machine code, but makes those decisions
early, has to do it.

One way is to look at the cost of representative RTL.  Another way is to
make a problem-specific model to estimate the costs.  Both ways have
their strengths and weaknesses.  The first way is much more general.


Segher

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-26 10:47             ` Richard Biener
@ 2021-01-26 17:54               ` Segher Boessenkool
  0 siblings, 0 replies; 64+ messages in thread
From: Segher Boessenkool @ 2021-01-26 17:54 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Sandiford, Bill Schmidt, GCC Patches

Hi!

On Tue, Jan 26, 2021 at 11:47:53AM +0100, Richard Biener wrote:
> Anyway, I think the GIMPLE -> RTL transition currently is a too
> big step

Much agreed.  I also think that the expand pass itself needs a lot of
work to bring it into this century: it does much too much work, in a
circuitous way that makes debugging it hard, etc.

> and we should eventually try to lower GIMPLE

That should help the above problem, too: break expand into stages, get
rid of all the premature optimisations (and implement those elsewhere
where needed!), and get a much better debuggable end result (that is
also a lot less code).

> and IV
> selection should happen on a lower form where we for example
> can do the 1st level scheduling on

I don't think that helps, but maybe :-)

> or at least have a better idea
> on resource allocation

This is the holy grail.  Wait, I should decorate that:


    O_o !!! >>>  This is the holy grail.  <<< !!! O_o


> and latencies in dependence cycles since
> that's what really matters for unrolling.

I don't think latencies matter much for such decisions.  If the compiler
depends too much on actual machine latencies, makes "too sharp"
decisions, the code will run lousy on a slightly different (say, newer)
machine.  But certainly there needs to be *some* idea of how parallel
some code can run, yes.


Segher

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-26 10:53             ` Richard Biener
@ 2021-01-27  9:43               ` Kewen.Lin
  2021-03-01  2:45                 ` Kewen.Lin
  0 siblings, 1 reply; 64+ messages in thread
From: Kewen.Lin @ 2021-01-27  9:43 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, Richard Sandiford, GCC Patches, Bill Schmidt

on 2021/1/26 6:53 PM, Richard Biener wrote:
> On Tue, 26 Jan 2021, Kewen.Lin wrote:
> 
>> Hi Segher/Richard B./Richard S.,
>>
>> Many thanks for your all helps and comments on this!
>>
>> on 2021/1/25 3:56 PM, Richard Biener wrote:
>>> On Fri, 22 Jan 2021, Segher Boessenkool wrote:
>>>
>>>> On Fri, Jan 22, 2021 at 02:47:06PM +0100, Richard Biener wrote:
>>>>> On Thu, 21 Jan 2021, Segher Boessenkool wrote:
>>>>>> What is holding up this patch still?  Ke Wen has pinged it every month
>>>>>> since May, and there has still not been a review.
>>>>
>>>> Richard Sandiford wrote:
>>>>> FAOD (since I'm on cc:), I don't feel qualified to review this.
>>>>> Tree-level loop stuff isn't really my area.
>>>>
>>>> And Richard Biener wrote:
>>>>> I don't like it, it feels wrong but I don't have a good suggestion
>>>>> that had positive feedback.  Since a reviewer / approver is indirectly
>>>>> responsible for at least the design I do not want to ack this patch.
>>>>> Bin made forward progress on the other parts of the series but clearly
>>>>> there's somebody missing with the appropriate privileges who feels
>>>>> positive about the patch and its general direction.
>>>>>
>>>>> Sorry to be of no help here.
>>>>
>>>> How unfortunate :-(
>>>>
>>>> So, first off, this will then have to work for next stage 1 to make any
>>>> progress.  Rats.
>>>>
>>>> But what could have been done differently that would have helped?  Of
>>>> course Ke Wen could have written a better patch (aka one that is more
>>>> acceptable); either of you could have made your current replies earlier,
>>>> so that it is clear help needs to be sought elsewhere; and I could have
>>>> pushed people earlier, too.  No one really did anything wrong, I'm not
>>>> seeking who to blame, I'm just trying to find out how to prevent
>>>> deadlocks like this in the future (where one party waits for replies
>>>> that will never come).
>>>>
>>>> Is it just that we have a big gaping hole in reviewers with experience
>>>> in such loop optimisations?
>>>
>>> May be.  But what I think is the biggest problem is that we do not
>>> have a good way to achieve what the patch tries (if you review the
>>> communications you'll see many ideas tossed around) first and foremost
>>> because IV selection is happening early on GIMPLE and unrolling
>>> happens late on RTL.  Both need a quite accurate estimate of costs
>>> but unrolling has an ever harder time than IV selection where we've
>>> got along with throwing dummy RTL at costing functions.
>>>
>>
>> Yeah, exactly.
>>
>>> IMHO the patch is the wrong "start" to try fixing the issue and my
>>> fear is that wiring this kind of "features" into the current
>>> (fundamentally broken) state will make it much harder to rework
>>> that state without introducing regressions on said features (I'm
>>> there with trying to turn the vectorizer upside down - for three
>>> years now, struggling to not regress any of the "features" we've
>>> accumulated for various targets where most of them feel a
>>> "bolted-on" rather than well-designed ;/).
>>>
>>
>> OK, understandable.
>>
>>> I think IV selection and unrolling (and scheduling FWIW) need to move
>>> closer together.  I do not have a good idea how that can work out
>>> though but I very much believe that this "most wanted" GIMPLE unroller
>>> will not be a good way of progressing here.  Maybe taking the bullet
>>> and moving IV selection back to RTL is the answer.
>>>
>>
>> I haven't looked into loop-iv.c, but IVOPTS in gimple can leverage
>> SCEV analysis for iv detection, if moving it to RTL, it could be
>> very heavier to detect the full set there?
>>
>>> For a "short term" solution I still think that trying to perform
>>> unrolling and IV selection (for the D-form case you're targeting)
>>> at the same time is a better design, even if it means complicating
>>> the IV selection pass (and yeah, it'll still be at GIMPLE and w/o
>>> any good idea about scheduling).  There are currently 20+ GIMPLE
>>> optimization passes and 10+ RTL optimization passes between
>>> IV selection and unrolling, the idea that you can have transform
>>> decision and transform apply this far apart looks scary.
>>>
>>
>> I have some questions in mind for this part, for "perform unrolling
>> and IV selection at the same time", it can be interpreted to two
>> different implementation ways to me:
>>
>> 1) Run one gimple unrolling pass just before IVOPTS, probably using
>>    the same gate for IVOPTS.  The unrolling factor is computed by
>>    the same method as that of RTL unrolling.  But this sounds very
>>    like "most wanted gimple unrolling" which is what we want to avoid.
>>
>>    The positive aspect here is what IVOPTS faces is already one unrolled
>>    loop, it can see the whole picture and get the optimal IV set.  The
>>    downside/question is how we position these gimple unrolling and RTL
>>    unrolling passes, whether we still need RTL unrolling.  If no, it's
>>    doubtable that one gimple unrolling can fully replace the RTL
>>    unrolling probably lacking some actual target information/instructions.
>>    If yes, it's still possible to have inconsistent unrolling factors
>>    between what IVOPTS optimizes on and what late RTL unrolling pass
>>    ends with.
>>
>> 2) Make IVOPTS determine the unrolling factor by considering the
>>    reg-offset addressing (D-form), unroll the loop and do the remainings.
> 
> Yes, that's what I meant.
> 
>>    I don't think you referred to this though.  Since comparing to
>>    reg-reg addressing, reg-offset addressing can only save some bumps,
>>    it's too weak to be one deciding factor of unrolling factor.  Unlike
>>    vectorizer or modulo scheduling, it's more likely there are more
>>    important factors for unrolling factor computation than this one.
> 
> But you _are_ using that bump cost to compute the unroll estimate
> (and IIRC you are doing the D-form IV selection then).

The patch series works in roughly the opposite direction: it uses the
estimated unrolling factor (UF for short below) to adjust the bump
costs.  The process looks like:
  - estimate UF by following a UF determination similar to the one in
    the RTL unroller.
  - identify the address iv use groups for which reg-offset (D-form)
    addressing is available, and mark some reg-offset iv cands.
  - scale up all pair costs if needed; for reg-offset iv cands the bump
    cost (step cost) is counted just once, for the others it is counted
    UF-1 times (see the short sketch right after this list).
  - run the IV selection algorithm as usual.
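
As a minimal sketch of that scaling rule (purely illustrative, not the
actual code from the patch; the function and parameter names here are
hypothetical):

  /* Bump (step) cost charged to an IV candidate in the unrolled body:
     a reg-offset (D-form) capable candidate pays a single bump, any
     other candidate pays UF-1 bumps.  */
  int
  scaled_step_cost (int step_cost, bool reg_offset_p, int estimated_uf)
  {
    return reg_offset_p ? step_cost : step_cost * (estimated_uf - 1);
  }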

From the perspective of bump cost, we can compute one UF (let's call it
ivopts_UF here), which means:
  - when the actual UF > ivopts_UF, the reg-offset iv selection wins
    with fewer bump costs.
  - when the actual UF <= ivopts_UF, the reg-reg (indexed addressing)
    selection wins.

It looks like we can derive it with the below:

  G for group cost, B for basic iv cost, S for step iv cost, N for iv count.

    G1 = (gcost<oiv,grp1> + gcost<oiv,grp2> ... ) * ivopts_UF;
    B1 = bcost<oiv>;
    S1 = scost<oiv> * (ivopts_UF-1);
    N1 = 1;
    
    G2 = (gcost<iv1, grp1> + gcost<iv2, grp2> ...) * ivopts_UF;
    B2 = bcost(iv1) + bcost(iv2) + ...;
    S2 = scost(iv1) + scost(iv2) + ...;
    N2 = count of (iv1, iv2, ...)

  Let G1 + B1 + S1 + N1 = G2 + B2 + S2 + N2, solve for ivopts_UF, then
  round the result up to the next power of two.

This should also have an upper bound derived from the range of the
target-supported offset (call it offset_UF_bound).
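
To make that concrete, a rough sketch of the computation (again purely
illustrative, not code from the patch series; all names are
hypothetical and the costs are simplified to plain integers):

  /* Return the break-even ivopts_UF described above, rounded up to a
     power of two and clamped by offset_UF_bound; 0 means the
     reg-offset selection never becomes cheaper than the single
     indexed IV.  */
  int
  estimate_ivopts_uf (int gsum_oiv, int bcost_oiv, int scost_oiv,
                      int gsum_cands, int bcost_cands, int scost_cands,
                      int n_cands, int offset_uf_bound)
  {
    /* gsum_oiv stands for gcost<oiv,grp1> + gcost<oiv,grp2> + ...;
       gsum_cands, bcost_cands and scost_cands are the corresponding
       sums over iv1, iv2, ...  With G1 = gsum_oiv * UF,
       S1 = scost_oiv * (UF - 1), N1 = 1, G2 = gsum_cands * UF,
       S2 = scost_cands and N2 = n_cands, the balance
       G1+B1+S1+N1 = G2+B2+S2+N2 becomes UF * denom = num.  */
    int denom = gsum_oiv + scost_oiv - gsum_cands;
    int num = bcost_cands + scost_cands + n_cands
              - bcost_oiv + scost_oiv - 1;
    if (denom <= 0)
      /* No positive crossover: either the reg-offset selection never
         wins (num > 0) or it already wins at UF = 1 (num <= 0).  */
      return num <= 0 ? 1 : 0;

    int uf = 1;
    while (uf * denom < num && uf < offset_uf_bound)
      uf *= 2;
    return uf <= offset_uf_bound ? uf : offset_uf_bound;
  }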

I have some concerns about computing ivopts_UF this way, since we only
compute it from the address type iv uses and don't consider the other
iv uses.  It's possible that some iv uses prefer iv cands which lead to
indexed addressing, and we could miss that.  If we want to consider all
iv uses globally, it amounts to running the IV selection and deriving
ivopts_UF from the result.  To me that looks very hard in the current
framework, although we could probably take a range from a lower bound
to an upper bound and iterate the iv selection over it until we find
one candidate UF or exhaust them all (even with binary searching); that
still doesn't look good because of the possible compile-time cost.  // (A)

Apart from this, let's assume we have determined one ivopts_UF (having
taken offset_UF_bound into account): do we need to take care of some
other possible upper bounds, especially the ones the RTL unroller
applies?

This question originates from the concern that when we determine
ivopts_UF, we only focus on the iv cands and iv uses (whether all of
them or just the address type ones), and it's likely that some/most
statements of the loop don't qualify to show up as iv uses.  Then we
might have calculated ivopts_UF as 4 while the loop body is actually
big and the UF is 2 according to the RTL unroller's handling; since
IVOPTs runs early, we would make the loop unroll 4 times first and
then degrade the performance (causing more spilling etc.).  The
ivopts_UF computation here focuses on saving bump costs, which may not
be the critical thing for the loop, compared to the similar passes
which want to drive unrolling and truly make bigger gains (typically
more parallelism, like vectorization and SMS).  // (B)

So I think we may need to consider the UF determination of the RTL
unroller somehow, for example respecting the unrolling parameters and
the loop_unroll_adjust hook if that works in the gimple phase.  I
assume that there is some tuning work behind those parameters and
hooks.

Given the above, the approach seems to be:
  1. compute ivopts_UF
  2. adjust it according to the allowable reg-offset range.
  3. adjust it to respect the RTL UF determination. // for (B)
  4. adjust the costs according to this ivopts_UF.
  5. iv selection as usual.
  6. check that the iv set is as expected. // 4,5,6 for (A)
  7. unroll by ivopts_UF in a post pass.

For 5-7, an alternative is to halt the ivopts processing for the loop,
unroll it by ivopts_UF and rerun ivopts, but I think we would then need
to rerun the analysis on the unrolled loop.  Since we have already
decided the UF, it doesn't look different from unrolling in a post pass
or at the end of the pass, as long as we make sure the unrolling
happens.

> 
>>    For IVOPTS, it's more like that it doesn't care what the unrolling
>>    factor should be, but it needs to know what the unrolling factor
>>    would be, then do optimal IV selection based on that.  So it's not
>>    good to get it to decide the unrolling factor.
> 
> Well, currently with the stupid RTL unroller you will know it will
> be unrolled 8 times unless it is cold or doesn't iterate enough.
> There's not much costing going on in the RTL unroller (unless you
> go wild with target hooks).

But I guess there was some tuning work behind those things?
Like the condition checks, parameters and hooks that adjust the value.


BR,
Kewen

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] unroll: Add middle-end unroll factor estimation
  2021-01-27  9:43               ` Kewen.Lin
@ 2021-03-01  2:45                 ` Kewen.Lin
  0 siblings, 0 replies; 64+ messages in thread
From: Kewen.Lin @ 2021-03-01  2:45 UTC (permalink / raw)
  To: Richard Biener
  Cc: Richard Sandiford, Bill Schmidt, GCC Patches, Segher Boessenkool

on 2021/1/27 5:43 PM, Kewen.Lin via Gcc-patches wrote:
> on 2021/1/26 6:53 PM, Richard Biener wrote:
>> On Tue, 26 Jan 2021, Kewen.Lin wrote:
>>
>>> Hi Segher/Richard B./Richard S.,
>>>
>>> Many thanks for your all helps and comments on this!
>>>
>>> on 2021/1/25 3:56 PM, Richard Biener wrote:
>>>> On Fri, 22 Jan 2021, Segher Boessenkool wrote:
>>>>
>>>>> On Fri, Jan 22, 2021 at 02:47:06PM +0100, Richard Biener wrote:
>>>>>> On Thu, 21 Jan 2021, Segher Boessenkool wrote:
>>>>>>> What is holding up this patch still?  Ke Wen has pinged it every month
>>>>>>> since May, and there has still not been a review.
>>>>>
>>>>> Richard Sandiford wrote:
>>>>>> FAOD (since I'm on cc:), I don't feel qualified to review this.
>>>>>> Tree-level loop stuff isn't really my area.
>>>>>
>>>>> And Richard Biener wrote:
>>>>>> I don't like it, it feels wrong but I don't have a good suggestion
>>>>>> that had positive feedback.  Since a reviewer / approver is indirectly
>>>>>> responsible for at least the design I do not want to ack this patch.
>>>>>> Bin made forward progress on the other parts of the series but clearly
>>>>>> there's somebody missing with the appropriate privileges who feels
>>>>>> positive about the patch and its general direction.
>>>>>>
>>>>>> Sorry to be of no help here.
>>>>>
>>>>> How unfortunate :-(
>>>>>
>>>>> So, first off, this will then have to work for next stage 1 to make any
>>>>> progress.  Rats.
>>>>>
>>>>> But what could have been done differently that would have helped?  Of
>>>>> course Ke Wen could have written a better patch (aka one that is more
>>>>> acceptable); either of you could have made your current replies earlier,
>>>>> so that it is clear help needs to be sought elsewhere; and I could have
>>>>> pushed people earlier, too.  No one really did anything wrong, I'm not
>>>>> seeking who to blame, I'm just trying to find out how to prevent
>>>>> deadlocks like this in the future (where one party waits for replies
>>>>> that will never come).
>>>>>
>>>>> Is it just that we have a big gaping hole in reviewers with experience
>>>>> in such loop optimisations?
>>>>
>>>> May be.  But what I think is the biggest problem is that we do not
>>>> have a good way to achieve what the patch tries (if you review the
>>>> communications you'll see many ideas tossed around) first and foremost
>>>> because IV selection is happening early on GIMPLE and unrolling
>>>> happens late on RTL.  Both need a quite accurate estimate of costs
>>>> but unrolling has an ever harder time than IV selection where we've
>>>> got along with throwing dummy RTL at costing functions.
>>>>
>>>
>>> Yeah, exactly.
>>>
>>>> IMHO the patch is the wrong "start" to try fixing the issue and my
>>>> fear is that wiring this kind of "features" into the current
>>>> (fundamentally broken) state will make it much harder to rework
>>>> that state without introducing regressions on said features (I'm
>>>> there with trying to turn the vectorizer upside down - for three
>>>> years now, struggling to not regress any of the "features" we've
>>>> accumulated for various targets where most of them feel a
>>>> "bolted-on" rather than well-designed ;/).
>>>>
>>>
>>> OK, understandable.
>>>
>>>> I think IV selection and unrolling (and scheduling FWIW) need to move
>>>> closer together.  I do not have a good idea how that can work out
>>>> though but I very much believe that this "most wanted" GIMPLE unroller
>>>> will not be a good way of progressing here.  Maybe taking the bullet
>>>> and moving IV selection back to RTL is the answer.
>>>>
>>>
>>> I haven't looked into loop-iv.c, but IVOPTS in gimple can leverage
>>> SCEV analysis for iv detection, if moving it to RTL, it could be
>>> very heavier to detect the full set there?
>>>
>>>> For a "short term" solution I still think that trying to perform
>>>> unrolling and IV selection (for the D-form case you're targeting)
>>>> at the same time is a better design, even if it means complicating
>>>> the IV selection pass (and yeah, it'll still be at GIMPLE and w/o
>>>> any good idea about scheduling).  There are currently 20+ GIMPLE
>>>> optimization passes and 10+ RTL optimization passes between
>>>> IV selection and unrolling, the idea that you can have transform
>>>> decision and transform apply this far apart looks scary.
>>>>
>>>
>>> I have some questions in mind for this part, for "perform unrolling
>>> and IV selection at the same time", it can be interpreted to two
>>> different implementation ways to me:
>>>
>>> 1) Run one gimple unrolling pass just before IVOPTS, probably using
>>>    the same gate for IVOPTS.  The unrolling factor is computed by
>>>    the same method as that of RTL unrolling.  But this sounds very
>>>    like "most wanted gimple unrolling" which is what we want to avoid.
>>>
>>>    The positive aspect here is what IVOPTS faces is already one unrolled
>>>    loop, it can see the whole picture and get the optimal IV set.  The
>>>    downside/question is how we position these gimple unrolling and RTL
>>>    unrolling passes, whether we still need RTL unrolling.  If no, it's
>>>    doubtable that one gimple unrolling can fully replace the RTL
>>>    unrolling probably lacking some actual target information/instructions.
>>>    If yes, it's still possible to have inconsistent unrolling factors
>>>    between what IVOPTS optimizes on and what late RTL unrolling pass
>>>    ends with.
>>>
>>> 2) Make IVOPTS determine the unrolling factor by considering the
>>>    reg-offset addressing (D-form), unroll the loop and do the remainings.
>>
>> Yes, that's what I meant.
>>
>>>    I don't think you referred to this though.  Since comparing to
>>>    reg-reg addressing, reg-offset addressing can only save some bumps,
>>>    it's too weak to be one deciding factor of unrolling factor.  Unlike
>>>    vectorizer or modulo scheduling, it's more likely there are more
>>>    important factors for unrolling factor computation than this one.
>>
>> But you _are_ using that bump cost to compute the unroll estimate
>> (and IIRC you are doing the D-form IV selection then).
> 
> The patch series uses the way like the opposite direction that uses
> unrolling factor (will use UF later for short) estimated to adjust
> the bump costs.  The process looks like:
>   - estimate UF by following the similar UF determination in RTL unroller.
>   - identify the reg-offset (D-form) available iv use groups, mark some
>     reg-offset iv cands.
>   - scale up all pair costs if need, for reg-offset iv cands, the bump cost
>     (step cost) is costed by just once, for the others, it's UF-1 time.
>   - run the IV selection algorithm as usual.
> 
> From the perspective of bump cost, we can compute one UF (let's call it 
> ivopts_UF here), it means:
>   - when actual UF > ivopts_UF, the reg-offset iv selection wins with
>     fewer bump costs.
>   - while actual UF <= ivopts_UF, the reg-reg (indexed addressing) wins.
> 
> It looks we can get with the below:
> 
>   G for group cost, B for basic iv cost, S for step iv cost, N for iv count.
> 
>     G1 = (gcost<oiv,grp1> + gcost<oiv,grp2> ... ) * ivopts_UF;
>     B1 = bcost<oiv>;
>     S1 = scost<oiv> * (ivopts_UF-1);
>     N1 = 1;
>     
>     G2 = (gcost<iv1, grp1> + gcost<iv2, grp2> ...) * ivopts_UF;
>     B2 = bcost(iv1) + bcost(iv2) + ...;
>     S2 = scost(iv1) + scost(iv2) + ...;
>     N2 = count of (iv1, iv2, ...)
> 
>   Let G1 + B1 + S1 + N1 = G2 + B2 + S2 + N2, evaluate the value of ivopts_UF,
>   then use the ceiling (match 2^n) of the result.
> 
> Here it should have one upper bound by considering the range of target
> supported offset (calling it as offset_UF_bound).
> 
> I have some concerns on the ivopts_UF computed like this way, since here
> we only compute it by considering all address type iv uses, but don't
> consider the other iv uses.  It's possible that there are some iv uses
> who prefer iv cands which lead to indexed addressing, we can possibly
> miss that.  If we want to consider all iv uses globally, it's like to
> run the IV selection and conclude the ivopts_UF.  For me, it looks very
> hard in the current framework, but we probably can have one range from
> low bound to upper bound and iterate it for iv selection till we can
> find one or all finish (can be binary searching even), it's seems still
> not good due to the possible time consuming.  // (A)
> 
> Apart from this, let's assume we have determined one ivopts_UF (have
> considered the offset_UF_bound), do we need to take care of some other
> possible upper bounds? especially those ones we process in RTL unroller.
> 
> This question originates from the concern that when we determine the
> ivopts_UF, we only focus on those iv cand and iv uses (whatever all or 
> just address type), it's likely that some/most statements of the loop
> aren't qualified to show up as iv uses.  Then we can have calculated
> ivopts_UF 4, but actually the loop size is big and UF is 2 according
> to RTL unroller handlings, since IVOPTs perform early so we will make
> it unroll 4 times first then degrade the performance (causing more
> spillings etc.).  Here ivopts_UF computation focuses on bump costs
> saving, it may not be the critical thing for the loop, comparing to
> those similar passes who want to drive unrolling and truely make more
> gains (typically more parallelism like vectorization and SMS).  // (B)
> 
> So I think we may need to consider UF determination from RTL unroller
> somehow, like to respect parameters for unrolling and loop_unroll_adjust
> hook if it works in gimple phase.  I assume that there are some tuning
> work behind those parameters and hooks.
> 
> As above, then the approach seems to be like:
>   1. compute ivopts_UF
>   2. adjust it according to reg-offset allowable range.
>   3. adjust it to respect RTL UF determination. // for (B)
>   4. adjust the costs according to this ivopts_UF.
>   5. iv selection as usual.
>   6. check the iv set as expected. // 4,5,6 for (A)
>   7. post-pass unrolling as ivopts_UF.
> 
> For 5-7, it can be replaced as to halt the ivopts processing for the
> loop, unroll it as ivopts_UF, rerun the ivopts again, but I think we
> need to rerun the analysis on the unrolled loop then.  Since we already
> have decided the UF, it looks not different to unroll it in post-pass,
> or at the end of pass, just making sure the unrolling to happen.
> 

Hi Richard(s)/Segher,

I'd like to ping this to avoid the "boat" sinking again. :-)

What do you think of the above proposal and the related concerns?


BR,
Kewen

>>
>>>    For IVOPTS, it's more like that it doesn't care what the unrolling
>>>    factor should be, but it needs to know what the unrolling factor
>>>    would be, then do optimal IV selection based on that.  So it's not
>>>    good to get it to decide the unrolling factor.
>>
>> Well, currently with the stupid RTL unroller you will know it will
>> be unrolled 8 times unless it is cold or doesn't iterate enough.
>> There's not much costing going on in the RTL unroller (unless you
>> go wild with target hooks).
> 
> But I guess there were some tuning work behind those things?
> Like some condition checks, parameters and hooks to adjust the value.
> 
> 
> BR,
> Kewen
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2021-03-01  2:45 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-28 12:17 [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling Kewen.Lin
2020-05-28 12:19 ` [PATCH 1/4] unroll: Add middle-end unroll factor estimation Kewen.Lin
2020-08-31  5:49   ` PING " Kewen.Lin
2020-09-15  7:44     ` PING^2 " Kewen.Lin
2020-10-13  7:06       ` PING^3 " Kewen.Lin
2020-11-02  9:13         ` PING^4 " Kewen.Lin
2020-11-19  5:50           ` PING^5 " Kewen.Lin
2020-12-17  2:58             ` PING^6 " Kewen.Lin
2021-01-14  2:36               ` PING^7 " Kewen.Lin
2021-01-21 21:45   ` Segher Boessenkool
2021-01-22 12:50     ` Richard Sandiford
2021-01-22 13:47     ` Richard Biener
2021-01-22 21:37       ` Segher Boessenkool
2021-01-25  7:56         ` Richard Biener
2021-01-25 17:59           ` Richard Sandiford
2021-01-25 20:37             ` Segher Boessenkool
2021-01-26  8:53               ` Kewen.Lin
2021-01-26 17:31                 ` Segher Boessenkool
2021-01-26  8:43             ` Kewen.Lin
2021-01-26 10:47             ` Richard Biener
2021-01-26 17:54               ` Segher Boessenkool
2021-01-26  8:36           ` Kewen.Lin
2021-01-26 10:53             ` Richard Biener
2021-01-27  9:43               ` Kewen.Lin
2021-03-01  2:45                 ` Kewen.Lin
2020-05-28 12:23 ` [PATCH 2/4] param: Introduce one param to control ivopts reg-offset consideration Kewen.Lin
2020-05-28 12:24 ` [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling Kewen.Lin
2020-06-01 17:59   ` Richard Sandiford
2020-06-02  3:39     ` Kewen.Lin
2020-06-02  7:14       ` Richard Sandiford
2020-06-03  3:18         ` Kewen.Lin
2020-08-08  8:01   ` Bin.Cheng
2020-08-10  4:27     ` Kewen.Lin
2020-08-10 12:38       ` Bin.Cheng
2020-08-10 14:41         ` Kewen.Lin
2020-08-16  3:59           ` Bin.Cheng
2020-08-18  9:02             ` [PATCH 3/4 v2] " Kewen.Lin
2020-08-22  5:11               ` Bin.Cheng
2020-08-25 12:46                 ` [PATCH 3/4 v3] " Kewen.Lin
2020-08-31 19:41                   ` Segher Boessenkool
2020-09-02  3:16                     ` Kewen.Lin
2020-09-02 10:25                       ` Segher Boessenkool
2020-09-03  2:24                         ` Kewen.Lin
2020-09-03 22:37                           ` Segher Boessenkool
2020-09-04  8:27                             ` Bin.Cheng
2020-09-04 13:53                               ` Segher Boessenkool
2020-09-04  8:47                             ` Kewen.Lin
2020-09-04 14:16                               ` Segher Boessenkool
2020-09-04 15:47                                 ` Kewen.Lin
2020-09-17 23:12                             ` Jeff Law
2020-09-17 23:46                               ` Segher Boessenkool
2020-09-01 11:19                   ` Bin.Cheng
2020-09-02  3:50                     ` Kewen.Lin
2020-09-02  3:55                       ` Bin.Cheng
2020-09-02  4:51                         ` Kewen.Lin
2020-09-06  2:47                     ` Hans-Peter Nilsson
2020-09-15  7:41                       ` Kewen.Lin
2020-06-02 11:38 ` [PATCH 0/4] IVOPTs consider step cost for different forms when unrolling Richard Biener
2020-06-03  3:46   ` Kewen.Lin
2020-06-03  7:07     ` Richard Biener
2020-06-03  7:58       ` Kewen.Lin
2020-06-03  9:27         ` Richard Biener
2020-06-03 10:47           ` Kewen.Lin
2020-06-03 11:08             ` Richard Sandiford

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).