public inbox for gcc-patches@gcc.gnu.org
* [RFC] Tree Loop Unroller Pass
@ 2018-02-12 23:56 Kugan Vivekanandarajah
  2018-02-20 19:53 ` Andrew Pinski
  0 siblings, 1 reply; 11+ messages in thread
From: Kugan Vivekanandarajah @ 2018-02-12 23:56 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 429 bytes --]

Implements tree loop unroller using the infrastructure provided.

gcc/ChangeLog:

2018-02-12  Kugan Vivekanandarajah  <kuganv@linaro.org>

    * Makefile.in (OBJS): Add tree-ssa-loop-unroll.o.
    * common.opt (ftree-loop-unroll): New option.
    * passes.def: Add pass_tree_loop_unroll.
    * timevar.def (TV_TREE_LOOP_UNROLL): Add.
    * tree-pass.h (make_pass_tree_loop_unroll): Declare.
    * tree-ssa-loop-unroll.c: New file.

[-- Attachment #2: 0002-tree-loop-unroller.patch --]
[-- Type: text/x-patch, Size: 10746 bytes --]

From 71baaf8393dd79a98b4c0216e56d87083caf0177 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah <kugan.vivekanandarajah@linaro.org>
Date: Mon, 12 Feb 2018 10:44:00 +1100
Subject: [PATCH 2/4] tree-loop-unroller

Change-Id: I58c25b5f2e796d4166af3ea4e50a0f4a3078b6c2
---
 gcc/Makefile.in            |   1 +
 gcc/common.opt             |   4 +
 gcc/passes.def             |   1 +
 gcc/timevar.def            |   1 +
 gcc/tree-pass.h            |   1 +
 gcc/tree-ssa-loop-unroll.c | 268 +++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 276 insertions(+)
 create mode 100644 gcc/tree-ssa-loop-unroll.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 374bf3e..de3c146 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1536,6 +1536,7 @@ OBJS = \
 	tree-ssa-loop-im.o \
 	tree-ssa-loop-ivcanon.o \
 	tree-ssa-loop-ivopts.o \
+	tree-ssa-loop-unroll.o \
 	tree-ssa-loop-manip.o \
 	tree-ssa-loop-niter.o \
 	tree-ssa-loop-prefetch.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index b20a9aa..ea47b8c 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1770,6 +1770,10 @@ fivopts
 Common Report Var(flag_ivopts) Init(1) Optimization
 Optimize induction variables on trees.
 
+ftree-loop-unroll
+Common Report Var(flag_tree_loop_unroll) Init(1) Optimization
+Perform loop unrolling in gimple.
+
 fjump-tables
 Common Var(flag_jump_tables) Init(1) Optimization
 Use jump tables for sufficiently large switch statements.
diff --git a/gcc/passes.def b/gcc/passes.def
index 9802f08..57f7cc2 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -302,6 +302,7 @@ along with GCC; see the file COPYING3.  If not see
           NEXT_PASS (pass_predcom);
 	  NEXT_PASS (pass_complete_unroll);
 	  NEXT_PASS (pass_slp_vectorize);
+	  NEXT_PASS (pass_tree_loop_unroll);
 	  NEXT_PASS (pass_loop_prefetch);
 	  /* Run IVOPTs after the last pass that uses data-reference analysis
 	     as that doesn't handle TARGET_MEM_REFs.  */
diff --git a/gcc/timevar.def b/gcc/timevar.def
index 91221ae..a6bb847 100644
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -202,6 +202,7 @@ DEFTIMEVAR (TV_TREE_LOOP_DISTRIBUTION, "tree loop distribution")
 DEFTIMEVAR (TV_CHECK_DATA_DEPS       , "tree check data dependences")
 DEFTIMEVAR (TV_TREE_PREFETCH	     , "tree prefetching")
 DEFTIMEVAR (TV_TREE_LOOP_IVOPTS	     , "tree iv optimization")
+DEFTIMEVAR (TV_TREE_LOOP_UNROLL     , "tree loop unroll")
 DEFTIMEVAR (TV_PREDCOM		     , "predictive commoning")
 DEFTIMEVAR (TV_TREE_CH		     , "tree copy headers")
 DEFTIMEVAR (TV_TREE_SSA_UNCPROP	     , "tree SSA uncprop")
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 93a6a99..2c0740f 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -388,6 +388,7 @@ extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_tree_loop_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch_vect (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop-unroll.c b/gcc/tree-ssa-loop-unroll.c
new file mode 100644
index 0000000..04cf092
--- /dev/null
+++ b/gcc/tree-ssa-loop-unroll.c
@@ -0,0 +1,268 @@
+
+/* Tree Loop Unroller.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "hash-set.h"
+#include "machmode.h"
+#include "tree.h"
+#include "tree-pass.h"
+#include "target.h"
+#include "gimple.h"
+#include "cfgloop.h"
+#include "tree-ssa-loop.h"
+#include "tree-scalar-evolution.h"
+#include "tree-ssa-loop-manip.h"
+#include "tree-ssa-loop-niter.h"
+#include "tree-ssa-loop-ivopts.h"
+#include "gimple.h"
+#include "predict.h"
+#include "cfghooks.h"
+#include "tree-inline.h"
+#include "gimple-iterator.h"
+#include "fold-const.h"
+#include "tree-data-ref.h"
+#include "params.h"
+
+/* Count the load memory streams in LOOP.  If there are more streams
+   than MAX_LOAD_STREAMS, we do not need to count them all.  This is
+   used to limit the partial unrolling factor so that unrolling does
+   not create too many memory load streams.  */
+
+static signed
+count_mem_load_streams (struct loop *loop, signed max_load_streams)
+{
+  basic_block *bbs = get_loop_body (loop);
+  unsigned nbbs = loop->num_nodes;
+  gimple_stmt_iterator gsi;
+  signed count = 0;
+  hash_set <data_reference_p> *dr_set = new hash_set<data_reference_p> ();
+  vec<data_reference_p> datarefs_vec;
+  datarefs_vec.create (20);
+  unsigned k;
+
+  for (unsigned i = 0; i < nbbs; i++)
+    {
+      basic_block bb = bbs[i];
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  tree access_fn;
+	  vec<tree> access_fns;
+	  datarefs_vec.truncate (0);
+
+	  if (!find_data_references_in_stmt (loop, stmt, &datarefs_vec))
+	     continue;
+
+	  for (unsigned j = 0; j < datarefs_vec.length (); ++j)
+	    {
+	      data_reference_p dr = datarefs_vec[j];
+	      if (!DR_IS_READ (dr))
+		continue;
+	      access_fns = DR_ACCESS_FNS (dr);
+
+	      FOR_EACH_VEC_ELT (access_fns, k, access_fn)
+		{
+		  if (TREE_CODE (access_fn) == POLYNOMIAL_CHREC)
+		    {
+		      if (!dr_set->add (dr))
+			count++;
+		      break;
+		    }
+		}
+	    }
+
+	  if (count > max_load_streams / 2)
+	    break;
+	}
+    }
+  free_data_refs (datarefs_vec);
+  delete dr_set;
+  return count;
+}
+
+/* Verify that LOOP is in a form that is suitable for unrolling.  */
+
+static bool
+loop_form_ok_p (struct loop *loop)
+{
+  if (!single_exit (loop))
+    return false;
+
+  /* The loop must have a preheader.  */
+  if (!loop_preheader_edge (loop))
+    return false;
+
+  return true;
+}
+
+/* Return false if the cost model does not want LOOP to be unrolled.
+   Return true and update DESC if we decide that LOOP
+   is to be unrolled.  In that case, also determine the factor (NUNROLL)
+   by which LOOP should be unrolled.  */
+
+static bool
+decide_unroll_iterations (struct loop *loop,
+			  struct tree_niter_desc *desc,
+			  signed *nunroll)
+{
+  /* nunroll = the total number of copies of the original loop body in
+     the unrolled loop (i.e. if it is 2, the loop body is duplicated once).  */
+  signed ninsns = tree_num_loop_insns (loop, &eni_size_weights);
+  signed ninsns_outer = tree_num_loop_insns (loop_outer (loop),
+					     &eni_size_weights);
+  signed n_unroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / ninsns;
+  signed load_streams = 0;
+  signed max_load_streams = -1;
+  signed n_unroll2;
+
+  if (ninsns_outer >= PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS))
+    return false;
+  ninsns_outer -= ninsns;
+  n_unroll2 = (PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) - ninsns_outer)/ ninsns;
+  if (n_unroll2 < n_unroll)
+    n_unroll = n_unroll2;
+
+  if (n_unroll > PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
+    n_unroll = PARAM_VALUE (PARAM_MAX_UNROLL_TIMES);
+
+  /* Check for simple loops.  */
+  edge exit = single_exit (loop);
+  if (!number_of_iterations_exit (loop, exit,
+				  desc, false, false)
+      || !integer_onep (desc->assumptions)
+      || !integer_zerop (desc->may_be_zero)
+      || !desc->niter)
+    return false;
+
+  /* Now force n_unroll to be a power of 2.  */
+  n_unroll = 1 << (floor_log2 (n_unroll));
+  if (targetm.hw_max_mem_read_streams
+      && (max_load_streams = targetm.hw_max_mem_read_streams ()) > 0)
+    {
+      load_streams = count_mem_load_streams (loop, max_load_streams);
+      if (load_streams > 0)
+	{
+	  signed t = 1 << (floor_log2 (max_load_streams / load_streams));
+	  if (t < n_unroll)
+	    n_unroll = t;
+	}
+    }
+
+  if (!can_unroll_loop_p (loop, n_unroll, desc))
+    return false;
+
+  *nunroll = n_unroll;
+  return true;
+}
+
+/* Unroll LOOP based on the cost model.  */
+
+static bool
+unroll_loop (struct loop *loop)
+{
+  struct tree_niter_desc desc;
+  signed n_unroll;
+
+  if (!decide_unroll_iterations (loop, &desc, &n_unroll))
+    return false;
+  if (n_unroll <= 1)
+    return false;
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "\nunrolling loop=%d times=%d\n", loop->num, n_unroll);
+  initialize_original_copy_tables ();
+  tree_unroll_loop (loop, n_unroll,
+		    single_dom_exit (loop), &desc);
+  free_original_copy_tables ();
+  return true;
+}
+
+static unsigned
+execute_tree_loop_unroll ()
+{
+  struct loop *loop;
+  bool changed = false;
+  int todo_flags = 0;
+  if (optimize_function_for_size_p (cfun))
+    return 0;
+
+  estimate_numbers_of_iterations (cfun);
+  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+    {
+      if (!loop_form_ok_p (loop))
+	continue;
+      if (unroll_loop (loop))
+	{
+	  changed |= true;
+	  free_numbers_of_iterations_estimates (loop);
+	}
+    }
+
+  if (changed)
+    {
+      scev_reset ();
+      todo_flags |= TODO_cleanup_cfg;
+    }
+
+  return todo_flags;
+}
+
+namespace {
+const pass_data pass_data_tree_loop_unroll =
+{
+  GIMPLE_PASS, /* type */
+  "tree-loop-unroll", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_TREE_LOOP_UNROLL, /* tv_id */
+  PROP_ssa, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  (TODO_update_ssa | TODO_verify_all), /* todo_flags_finish */
+};
+
+class pass_tree_loop_unroll : public gimple_opt_pass
+{
+public:
+  pass_tree_loop_unroll (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_tree_loop_unroll, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  opt_pass * clone () { return new pass_tree_loop_unroll (m_ctxt); }
+  virtual bool gate (function *) { return flag_tree_loop_unroll != 0; }
+  virtual unsigned int execute (function *)
+    {
+      return execute_tree_loop_unroll ();
+    }
+
+}; // class pass_tree_loop_unroll
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_tree_loop_unroll (gcc::context *ctxt)
+{
+  return new pass_tree_loop_unroll (ctxt);
+}
+
-- 
2.7.4



* Re: [RFC] Tree Loop Unroller Pass
  2018-02-12 23:56 [RFC] Tree Loop Unroller Pass Kugan Vivekanandarajah
@ 2018-02-20 19:53 ` Andrew Pinski
  0 siblings, 0 replies; 11+ messages in thread
From: Andrew Pinski @ 2018-02-20 19:53 UTC (permalink / raw)
  To: Kugan Vivekanandarajah; +Cc: GCC Patches

On Mon, Feb 12, 2018 at 3:55 PM, Kugan Vivekanandarajah
<kugan.vivekanandarajah@linaro.org> wrote:
> Implements tree loop unroller using the infrastructure provided.
>
> gcc/ChangeLog:
>
> 2018-02-12  Kugan Vivekanandarajah  <kuganv@linaro.org>
>
>     * Makefile.in (OBJS): Add tree-ssa-loop-unroll.o.
>     * common.opt (ftree-loop-unroll): New option.
>     * passes.def: Add pass_tree_loop_unroll.
>     * timevar.def (TV_TREE_LOOP_UNROLL): Add.
>     * tree-pass.h (make_pass_tree_loop_unroll): Declare.
>     * tree-ssa-loop-unroll.c: New file.


I think it was decided to name new gimple passes as gimple-* rather
than tree-ssa-*.
The option should most likely just take over -funroll-loops, etc.

Shouldn't:
+  if (targetm.hw_max_mem_read_streams
+      && (max_load_streams = targetm.hw_max_mem_read_streams ()) > 0)
+    {
+      load_streams = count_mem_load_streams (loop, max_load_streams);
+      if (load_streams > 0)
+	{
+	  signed t = 1 << (floor_log2 (max_load_streams / load_streams));
+	  if (t < n_unroll)
+	    n_unroll = t;
+	}
+    }

this be a target hook instead of being done inline here?  The notion of
what a stream is seems too target specific.  Even the notion of a
stream here seems micro-arch specific and not generic enough.
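
For illustration only, one possible shape for such a hook -- the name and
signature below are made up for this sketch (the posted patch instead adds
targetm.hw_max_mem_read_streams):

  /* Hypothetical hook: let the backend clamp the unroll factor for LOOP
     based on whatever it cares about (memory streams, issue width, ...),
     so the generic pass needs no notion of a "stream" at all.  */
  unsigned
  default_loop_unroll_limit (struct loop *loop ATTRIBUTE_UNUSED,
                             unsigned n_unroll)
  {
    /* Generic default: no target-specific limit.  */
    return n_unroll;
  }

  /* decide_unroll_iterations would then simply do:
       n_unroll = targetm.loop_unroll_limit (loop, n_unroll);  */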


A few more comments about the pass.
You don't check can_duplicate_loop_p on the loop.
You use optimize_function_for_size_p on the whole function instead of
checking if the loop is cold (by checking optimize_loop_for_size_p).
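
For illustration, a sketch of how the can_duplicate_loop_p and
optimize_loop_for_size_p checks could slot into the patch's
loop_form_ok_p (assumption: nothing else in the patch changes; both
predicates already exist in GCC):

  static bool
  loop_form_ok_p (struct loop *loop)
  {
    /* Existing checks from the patch.  */
    if (!single_exit (loop) || !loop_preheader_edge (loop))
      return false;
    /* Suggested additions: the body must be duplicable and the loop
       itself must not be cold.  */
    if (!can_duplicate_loop_p (loop))
      return false;
    if (optimize_loop_for_size_p (loop))
      return false;
    return true;
  }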

In my version of gimple loop unrolling, I had to add
lpt_decision.already_unrolled and mark the loop as already unrolled so
the RTL loop unroller would not unroll it again.
-fopt-info does not report when the unrolling has happened, unlike
the RTL version.

Thanks,
Andrew


* Re: [RFC] Tree loop unroller pass
  2018-02-16 14:22     ` Wilco Dijkstra
  2018-02-16 15:00       ` Richard Biener
@ 2018-02-20 15:46       ` Michael Matz
  1 sibling, 0 replies; 11+ messages in thread
From: Michael Matz @ 2018-02-20 15:46 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Richard Biener, Kugan Vivekanandarajah, GCC Patches, nd

[-- Attachment #1: Type: text/plain, Size: 5293 bytes --]

Hi,

On Fri, 16 Feb 2018, Wilco Dijkstra wrote:

> > How so?
> 
> I thought it is well-known for many years that the rtl unroller doesn't 
> work properly. In practically all cases where LLVM beats GCC, it is due 
> to unrolling small loops.

I thought it was because of vectorizing at -O2, not due to unrolling 
itself.

> > To generate more ILP for modern out-of-order processors you need to be
> > able to do followup transforms that remove dependences.  So rather than
> > inventing magic params we should look at those transforms and key
> > unrolling on them.  Like we do in predictive commoning or other passes
> > that end up performing unrolling as part of their transform.
> 
> This is why unrolling needs to be done at the tree level. Alias info is 
> correct, addressing modes end up more optimal and the scheduler can now 
> interleave the iterations (often not possible after the rtl-unroller due 
> to bad alias info).

The better alias info at gimple level is offsetted by worse information 
about addressing modes, register pressure and precise machine 
instructions.  It's not a very obvious tradeoff between both approaches.

> > Our measurements on x86 concluded that unrolling isn't worth it, in 
> > fact it very often hurts.  That was of course with saner params than 
> > the defaults of the RTL unroller.
> >
> > Often you even have to fight with followup passes doing stuff that 
> > ends up increasing register pressure too much so we end up spilling.
> 
> Yes that's why I mentioned we should only unroll small loops where there 
> is always a benefit from reduced loop counter increments and branching.

With modern OOO processors the loop counter increments and branching cost 
about nothing on average due to speculation and bypassing.  In fact a loop 
cache might prefer small loops.

> > So _please_ first get testcases we know unrolling will be beneficial 
> > on and _also_ have a thorough description _why_.
> 
> I'm sure we can find good examples. The why will be obvious just from 
> instruction count.

The instruction count is only one part.  If the unrolling really enables 
followup transformations like cross-iteration redundancy removal, then 
it's worthwhile.  But for that we already have passes (e.g. predictive 
commoning).  If the only thing unrolling does is get rid of the N-1 loop 
counter test-and-branch sequences, it's not beneficial on OOO processors (if the 
predictor is worth anything).

The reason why we tried some unrolling measurements last year was to 
establish whether unrolling is worth it on a normal OOO x86-64 
processor (I basically wanted to have a case for implementing a generic 
unroller on gimple, and replace the RTL one (!)).  The results were so 
underwhelming that I dropped the plan.

> > With Ooo CPUs speculatively executing the next iterations I very much
> > doubt that.
> 
> OoO execution is like really dumb loop unrolling,

Actually I think of it as a rather clever way of unrolling.  The processor 
has exact knowledge of (non-)dependencies, even of those the compiler 
can't prove not to exist.

And if a dependency exists for real (such that an OOO CPU can't ignore it), 
how would you propose the compiler magically remove it to get rid of 
the involved instructions?  That can usually only be done by duplicating 
threads-of-compute, and that increases, not decreases, the instruction count 
(or leaves it the same).  For instance the RTL unroller replaces IV 
adjustments like:
  i = i + 1; use (i)
  i = i + 1; use (i)
with
  i1 = i + 1; use (i1)
  i2 = i + 2; use (i2)
So the compiler was able to get rid of one dependency chain, but at the 
expense of having to use a new constant and register.  A cost analysis needs to 
happen to be sure that this is worthwhile.
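
For illustration, the same point at the C source level (a hand-written
sketch, not compiler output):

  void
  scale (int *a, int *b, int n)
  {
    for (int i = 0; i < n; i++)
      b[i] = a[i] + 1;
  }

  /* Unrolled 2x by hand: the two statements address a[i]/a[i + 1]
     directly instead of going through two chained i = i + 1 updates,
     so they can issue in parallel -- at the cost of the extra "+ 1"
     operands and an epilogue iteration when n is odd.  */
  void
  scale_unrolled (int *a, int *b, int n)
  {
    int i;
    for (i = 0; i + 1 < n; i += 2)
      {
        b[i] = a[i] + 1;
        b[i + 1] = a[i + 1] + 1;
      }
    for (; i < n; i++)
      b[i] = a[i] + 1;
  }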

> you still have all the dependencies between iterations, all the 
> branches, all the pointer increments etc. Optimizing those reduces 
> instruction counts like vectorization. Fewer instructions implies faster 
> code.

Also not true in this generality.  We recently had a case where 
vectorization hurt very much but it wasn't obvious why.  Our current 
theory (and we had many) is that, before the vector ALU could actually 
start working, there had to be eight (strided) loads done per vector, 
for multiple streams, and all those loads eventually clogged the 
pipeline.  The scalar loop (on the OOO CPU) nicely intermingled ALU and 
memory ops.  In the end Richi got the vector loop to be only 2-3% slower 
than the scalar loop (and that was a fairly major effort), even though
from looking at the instructions it had only 70-80% of the original scalar 
dynamic instruction count.

In any case, I'm with Richi on this one about having some dependable 
testcases that speed up with a tree unroller, reasons of why this is so, 
and some cost model that reflects these reasons at least roughly.

All the above is of course only true for OOO CPUs.  For in-order CPUs 
unrolling is helpful.  But to get there it would be good to work towards 
replacing the RTL unroller with a gimple one, if need be, so as not to 
have two of them.


Ciao,
Michael.
P.S: For a very long time I also thought "hmm, gimple unroller, that would 
be nice to have; could get rid of RTL one".  Until these experiments last 
year.  Since then I'm basically 180 degrees around ;-)


* Re: [RFC] Tree loop unroller pass
  2018-02-16 11:56   ` Richard Biener
  2018-02-16 14:22     ` Wilco Dijkstra
@ 2018-02-20  0:17     ` Kugan Vivekanandarajah
  1 sibling, 0 replies; 11+ messages in thread
From: Kugan Vivekanandarajah @ 2018-02-20  0:17 UTC (permalink / raw)
  To: Richard Biener; +Cc: Wilco Dijkstra, GCC Patches, nd

Hi Richard,

On 16 February 2018 at 22:56, Richard Biener <richard.guenther@gmail.com> wrote:
> On Thu, Feb 15, 2018 at 11:30 PM, Kugan Vivekanandarajah
> <kugan.vivekanandarajah@linaro.org> wrote:
>> Hi Wilco,
>>
>> Thanks for your comments.
>>
>> On 14 February 2018 at 00:05, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>>> Hi Kugan,
>>>
>>>> Based on the previous discussions, I tried to implement a tree loop
>>>> unroller for partial unrolling. I would like to queue this RFC patches
>>>> for next stage1 review.
>>>
>>> This is a great plan - GCC urgently requires a good unroller!
>
> How so?
>
>>>> * Cost-model for selecting the loop uses the same params used
>>>> elsewhere in related optimizations. I was told that keeping this same
>>>> would allow better tuning for all the optimizations.
>>>
>>> I'd advise against using the existing params as is. Unrolling by 8x by default is
>>> way too aggressive and counterproductive. It was perhaps OK for in-order cores
>>> 20 years ago, but not today. The goal of unrolling is to create more ILP in small
>>> loops, not to generate huge blocks of repeated code which definitely won't fit in
>>> micro-op caches and loop buffers...
>>>
>> OK, I will create separate params. It is possible that I misunderstood
>> it in the first place.
>
> To generate more ILP for modern out-of-order processors you need to be
> able to do followup transforms that remove dependences.  So rather than
> inventing magic params we should look at those transforms and key
> unrolling on them.  Like we do in predictive commoning or other passes
> that end up performing unrolling as part of their transform.
>
> Our measurements on x86 concluded that unrolling isn't worth it, in fact
> it very often hurts.  That was of course with saner params than the defaults
> of the RTL unroller.

My preliminary benchmarking on x86 using the default params shows no
overall gain: some gains and some regressions. I didn't play with the
parameters to see if it improves.

But for AArch64 Falkor (with follow-up tag-collision avoidance for
prefetching), we did see gains (again we could do better here):

SPECint_base2006  1.37%
SPECint_base2006  -0.73%
SPECspeed2017_int_base -0.1%
SPECspeed2017_fp_base 0.89%
SPECrate2017_fp_base 1.72%

We also noticed that sometimes the gains from passes like prefetch
loop arrays come mainly from the unrolling rather than from the software
prefetches.

>
> Often you even have to fight with followup passes doing stuff that ends up
> inreasing register pressure too much so we end up spilling.
If we can have an approximate register pressure model that can be used
while deciding the unrolling factor, it might help to some extent. I saw
Bin posting some patches for register pressure calculation. Do you
think using that here would be helpful?

In general, I agree that the cost model can be more accurate, but getting
the right information at an acceptable computation cost is the trick.
Do you have any preference on the cost model if we decide to have a
separate loop unroller pass? I.e., what information from the loop should
we use other than the usual parameters we have?

>
>>
>>> Also we need to enable this by default, at least with -O3, maybe even for small
>>> (or rather tiny) loops in -O2 like LLVM does.
>> It is enabled for -O3 and above now.
>
> So _please_ first get testcases we know unrolling will be beneficial on
> and _also_ have a thorough description _why_.

I will try to analyse the benchmarks whose performance is improving
and create test cases.

>
>>>
>>>> * I have also implemented an option to limit loops based on memory
>>>> streams. i.e., some micro-architectures where limiting the resulting
>>>> memory streams is preferred and used  to limit unrolling factor.
>>>
>>> I'm not convinced this is needed once you tune the parameters for unrolling.
>>> If you have say 4 read streams you must have > 10 instructions already so
>>> you may want to unroll this 2x in -O3, but definitely not 8x. So I see the streams
>>> issue as a problem caused by too aggressive unroll settings. I think if you
>>> address that first, you're unlikely going to have an issue with too many streams.
>>>
>>
>> I will experiment with some microbenchmarks. I still think that it
> will be useful for some micro-architectures. That's why it is not
>> enabled by default. If a back-end thinks that it is useful, they can
>> enable limiting unroll factor based on memory streams.
>
> Note that without doing scheduling at the same time (basically interleaving
> iterations rather than pasting them after each other) I have a hard time
> believing that maxing memory streams is any good on any microarchitecture.

Sorry, I didn't mean to say that. Rather, keep the memory streams
below what the prefetchers will be happy with.

>
> So transform-wise you'd end up with "vectorizing" without "vectorizing" and you
> can share dependence analysis.
>
>>>> * I expect that there will be some cost-model changes might be needed
>>>> to handle (or provide ability to handle) various loop preferences of
>>>> the micro-architectures. I am sending this patch for review early to
>>>> get feedbacks on this.
>>>
>>> Yes it should be feasible to have settings based on backend preference
>>> and optimization level (so O3/Ofast will unroll more than O2).
>>>
>>>> * Position of the pass in passes.def can also be changed. Example,
>>>> unrolling before SLP.
>>>
>>> As long as it runs before IVOpt so we get base+immediate addressing modes.
>> That's what I am doing now.
>
> Note I believe that IVOPTs should be moved a bit later than it is
> placed right now.

Ok, I will benchmark with this.

Thanks,
Kugan

>
> Richard.
>
>> Thanks,
>> Kugan
>>
>>>
>>> Wilco


* Re: [RFC] Tree loop unroller pass
  2018-02-16 15:00       ` Richard Biener
@ 2018-02-16 15:32         ` Wilco Dijkstra
  0 siblings, 0 replies; 11+ messages in thread
From: Wilco Dijkstra @ 2018-02-16 15:32 UTC (permalink / raw)
  To: Richard Biener, Kugan Vivekanandarajah; +Cc: GCC Patches, nd

Richard Biener wrote:

> With OoO CPUs speculatively executing the next iterations I very much doubt that.

OoO execution is like really dumb loop unrolling: you still have all the dependencies
between iterations, all the branches, all the pointer increments, etc. Optimizing those
reduces instruction counts just as vectorization does. Fewer instructions imply faster code.

Wilco

    


* Re: [RFC] Tree loop unroller pass
  2018-02-16 14:22     ` Wilco Dijkstra
@ 2018-02-16 15:00       ` Richard Biener
  2018-02-16 15:32         ` Wilco Dijkstra
  2018-02-20 15:46       ` Michael Matz
  1 sibling, 1 reply; 11+ messages in thread
From: Richard Biener @ 2018-02-16 15:00 UTC (permalink / raw)
  To: Wilco Dijkstra, Kugan Vivekanandarajah; +Cc: GCC Patches, nd

On February 16, 2018 3:22:22 PM GMT+01:00, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>Richard Biener wrote:
>>> This is a great plan - GCC urgently requires a good unroller!
>>
>> How so?
>
>I thought it is well-known for many years that the rtl unroller doesn't
>work properly.
>In practically all cases where LLVM beats GCC, it is due to unrolling
>small loops.
>
>You may have noticed how people have been enabling
>-fprefetch-loop-arrays by
>default in some AArch64 configurations and then strip out most/all
>prefetches in
>order to get the effect of tree unrolling... However the unroll
>parameters of this
>pass are even worse than -funroll-loops, so it ends up using crazy
>unroll factors.
>
>> To generate more ILP for modern out-of-order processors you need to
>be
>> able to do followup transforms that remove dependences.  So rather
>than
>> inventing magic params we should look at those transforms and key
>> unrolling on them.  Like we do in predictive commoning or other
>passes
>> that end up performing unrolling as part of their transform.
>
>This is why unrolling needs to be done at the tree level. Alias info is
>correct,
>addressing modes end up more optimal and the scheduler can now
>interleave 
>the iterations (often not possible after the rtl-unroller due to bad
>alias info).
> 
>> Our measurements on x86 concluded that unrolling isn't worth it, in
>fact
>> it very often hurts.  That was of course with saner params than the
>defaults
>> of the RTL unroller.
>>
>> Often you even have to fight with followup passes doing stuff that
>ends up
>increasing register pressure too much so we end up spilling.
>
>Yes that's why I mentioned we should only unroll small loops where
>there
>is always a benefit from reduced loop counter increments and branching.
>
>> So _please_ first get testcases we know unrolling will be beneficial
>on
>> and _also_ have a thorough description _why_.
>
>I'm sure we can find good examples. The why will be obvious just from
>instruction
>count.

With OoO CPUs speculatively executing the next iterations I very much doubt that. 

Richard. 

>Wilco


* Re: [RFC] Tree loop unroller pass
  2018-02-16 11:56   ` Richard Biener
@ 2018-02-16 14:22     ` Wilco Dijkstra
  2018-02-16 15:00       ` Richard Biener
  2018-02-20 15:46       ` Michael Matz
  2018-02-20  0:17     ` Kugan Vivekanandarajah
  1 sibling, 2 replies; 11+ messages in thread
From: Wilco Dijkstra @ 2018-02-16 14:22 UTC (permalink / raw)
  To: Richard Biener, Kugan Vivekanandarajah; +Cc: GCC Patches, nd

Richard Biener wrote:
>> This is a great plan - GCC urgently requires a good unroller!
>
> How so?

I thought it has been well known for many years that the RTL unroller doesn't work properly.
In practically all cases where LLVM beats GCC, it is due to unrolling small loops.

You may have noticed how people have been enabling -fprefetch-loop-arrays by
default in some AArch64 configurations and then stripping out most/all prefetches in
order to get the effect of tree unrolling... However the unroll parameters of this
pass are even worse than -funroll-loops, so it ends up using crazy unroll factors.

> To generate more ILP for modern out-of-order processors you need to be
> able to do followup transforms that remove dependences.  So rather than
> inventing magic params we should look at those transforms and key
> unrolling on them.  Like we do in predictive commoning or other passes
> that end up performing unrolling as part of their transform.

This is why unrolling needs to be done at the tree level. Alias info is correct,
addressing modes end up more optimal and the scheduler can now interleave 
the iterations (often not possible after the rtl-unroller due to bad alias info).
 
> Our measurements on x86 concluded that unrolling isn't worth it, in fact
> it very often hurts.  That was of course with saner params than the defaults
> of the RTL unroller.
>
> Often you even have to fight with followup passes doing stuff that ends up
> increasing register pressure too much so we end up spilling.

Yes that's why I mentioned we should only unroll small loops where there
is always a benefit from reduced loop counter increments and branching.

> So _please_ first get testcases we know unrolling will be beneficial on
> and _also_ have a thorough description _why_.

I'm sure we can find good examples. The why will be obvious just from instruction
count.

Wilco


* Re: [RFC] Tree loop unroller pass
  2018-02-15 22:31 ` Kugan Vivekanandarajah
@ 2018-02-16 11:56   ` Richard Biener
  2018-02-16 14:22     ` Wilco Dijkstra
  2018-02-20  0:17     ` Kugan Vivekanandarajah
  0 siblings, 2 replies; 11+ messages in thread
From: Richard Biener @ 2018-02-16 11:56 UTC (permalink / raw)
  To: Kugan Vivekanandarajah; +Cc: Wilco Dijkstra, GCC Patches, nd

On Thu, Feb 15, 2018 at 11:30 PM, Kugan Vivekanandarajah
<kugan.vivekanandarajah@linaro.org> wrote:
> Hi Wilco,
>
> Thanks for your comments.
>
> On 14 February 2018 at 00:05, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>> Hi Kugan,
>>
>>> Based on the previous discussions, I tried to implement a tree loop
>>> unroller for partial unrolling. I would like to queue this RFC patches
>>> for next stage1 review.
>>
>> This is a great plan - GCC urgently requires a good unroller!

How so?

>>> * Cost-model for selecting the loop uses the same params used
>>> elsewhere in related optimizations. I was told that keeping this same
>>> would allow better tuning for all the optimizations.
>>
>> I'd advise against using the existing params as is. Unrolling by 8x by default is
>> way too aggressive and counterproductive. It was perhaps OK for in-order cores
>> 20 years ago, but not today. The goal of unrolling is to create more ILP in small
>> loops, not to generate huge blocks of repeated code which definitely won't fit in
>> micro-op caches and loop buffers...
>>
> OK, I will create separate params. It is possible that I misunderstood
> it in the first place.

To generate more ILP for modern out-of-order processors you need to be
able to do followup transforms that remove dependences.  So rather than
inventing magic params we should look at those transforms and key
unrolling on them.  Like we do in predictive commoning or other passes
that end up performing unrolling as part of their transform.
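
For illustration, the classic case where unrolling is keyed on such a
transform (a sketch only): predictive commoning unrolls this 2x so the
a[i + 1] loaded in one iteration is reused as the a[i] of the next,
removing a load instead of merely pasting copies of the body.

  void
  smooth (double *restrict b, const double *restrict a, int n)
  {
    for (int i = 0; i < n; i++)
      b[i] = a[i] + a[i + 1];
  }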

Our measurements on x86 concluded that unrolling isn't worth it, in fact
it very often hurts.  That was of course with saner params than the defaults
of the RTL unroller.

Often you even have to fight with followup passes doing stuff that ends up
increasing register pressure too much so we end up spilling.

>
>> Also we need to enable this by default, at least with -O3, maybe even for small
>> (or rather tiny) loops in -O2 like LLVM does.
> It is enabled for -O3 and above now.

So _please_ first get testcases we know unrolling will be beneficial on
and _also_ have a thorough description _why_.

>>
>>> * I have also implemented an option to limit loops based on memory
>>> streams. i.e., some micro-architectures where limiting the resulting
>>> memory streams is preferred and used  to limit unrolling factor.
>>
>> I'm not convinced this is needed once you tune the parameters for unrolling.
>> If you have say 4 read streams you must have > 10 instructions already so
>> you may want to unroll this 2x in -O3, but definitely not 8x. So I see the streams
>> issue as a problem caused by too aggressive unroll settings. I think if you
>> address that first, you're unlikely going to have an issue with too many streams.
>>
>
> I will experiment with some microbenchmarks. I still think that it
> will be useful for some micro-architectures. That's why it is not
> enabled by default. If a back-end thinks that it is useful, they can
> enable limiting unroll factor based on memory streams.

Note that without doing scheduling at the same time (basically interleaving
iterations rather than pasting them after each other) I have a hard time
believing that maxing memory streams is any good on any microarchitecture.

So transform-wise you'd end up with "vectorizing" without "vectorizing" and you
can share dependence analysis.

>>> * I expect that there will be some cost-model changes might be needed
>>> to handle (or provide ability to handle) various loop preferences of
>>> the micro-architectures. I am sending this patch for review early to
>>> get feedbacks on this.
>>
>> Yes it should be feasible to have settings based on backend preference
>> and optimization level (so O3/Ofast will unroll more than O2).
>>
>>> * Position of the pass in passes.def can also be changed. Example,
>>> unrolling before SLP.
>>
>> As long as it runs before IVOpt so we get base+immediate addressing modes.
> That's what I am doing now.

Note I believe that IVOPTs should be moved a bit later than it is
placed right now.

Richard.

> Thanks,
> Kugan
>
>>
>> Wilco


* Re: [RFC] Tree loop unroller pass
  2018-02-13 13:05 [RFC] Tree loop unroller pass Wilco Dijkstra
@ 2018-02-15 22:31 ` Kugan Vivekanandarajah
  2018-02-16 11:56   ` Richard Biener
  0 siblings, 1 reply; 11+ messages in thread
From: Kugan Vivekanandarajah @ 2018-02-15 22:31 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches, nd

Hi Wilco,

Thanks for your comments.

On 14 February 2018 at 00:05, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Hi Kugan,
>
>> Based on the previous discussions, I tried to implement a tree loop
>> unroller for partial unrolling. I would like to queue this RFC patches
>> for next stage1 review.
>
> This is a great plan - GCC urgently requires a good unroller!
>
>> * Cost-model for selecting the loop uses the same params used
>> elsewhere in related optimizations. I was told that keeping this same
>> would allow better tuning for all the optimizations.
>
> I'd advise against using the existing params as is. Unrolling by 8x by default is
> way too aggressive and counterproductive. It was perhaps OK for in-order cores
> 20 years ago, but not today. The goal of unrolling is to create more ILP in small
> loops, not to generate huge blocks of repeated code which definitely won't fit in
> micro-op caches and loop buffers...
>
OK, I will create separate params. It is possible that I misunderstood
it in the first place.


> Also we need to enable this by default, at least with -O3, maybe even for small
> (or rather tiny) loops in -O2 like LLVM does.
It is enabled for -O3 and above now.

>
>> * I have also implemented an option to limit loops based on memory
>> streams. i.e., some micro-architectures where limiting the resulting
>> memory streams is preferred and used  to limit unrolling factor.
>
> I'm not convinced this is needed once you tune the parameters for unrolling.
> If you have say 4 read streams you must have > 10 instructions already so
> you may want to unroll this 2x in -O3, but definitely not 8x. So I see the streams
> issue as a problem caused by too aggressive unroll settings. I think if you
> address that first, you're unlikely going to have an issue with too many streams.
>

I will experiment with some microbenchmarks. I still think that it
will be useful for some micro-architectures. That's why it is not
enabled by default. If a back-end thinks that it is useful, they can
enable limiting the unroll factor based on memory streams.

>> * I expect that there will be some cost-model changes might be needed
>> to handle (or provide ability to handle) various loop preferences of
>> the micro-architectures. I am sending this patch for review early to
>> get feedbacks on this.
>
> Yes it should be feasible to have settings based on backend preference
> and optimization level (so O3/Ofast will unroll more than O2).
>
>> * Position of the pass in passes.def can also be changed. Example,
>> unrolling before SLP.
>
> As long as it runs before IVOpt so we get base+immediate addressing modes.
That's what I am doing now.

Thanks,
Kugan

>
> Wilco


* Re: [RFC] Tree loop unroller pass
@ 2018-02-13 13:05 Wilco Dijkstra
  2018-02-15 22:31 ` Kugan Vivekanandarajah
  0 siblings, 1 reply; 11+ messages in thread
From: Wilco Dijkstra @ 2018-02-13 13:05 UTC (permalink / raw)
  To: kugan.vivekanandarajah; +Cc: GCC Patches, nd

Hi Kugan,

> Based on the previous discussions, I tried to implement a tree loop
> unroller for partial unrolling. I would like to queue this RFC patches
> for next stage1 review.

This is a great plan - GCC urgently requires a good unroller!

> * Cost-model for selecting the loop uses the same params used
> elsewhere in related optimizations. I was told that keeping this same
> would allow better tuning for all the optimizations.

I'd advise against using the existing params as is. Unrolling by 8x by default is
way too aggressive and counterproductive. It was perhaps OK for in-order cores
20 years ago, but not today. The goal of unrolling is to create more ILP in small
loops, not to generate huge blocks of repeated code which definitely won't fit in
micro-op caches and loop buffers...

Also we need to enable this by default, at least with -O3, maybe even for small
(or rather tiny) loops in -O2 like LLVM does.

> * I have also implemented an option to limit loops based on memory
> streams. i.e., some micro-architectures where limiting the resulting
> memory streams is preferred and used  to limit unrolling factor.

I'm not convinced this is needed once you tune the parameters for unrolling.
If you have say 4 read streams you must have > 10 instructions already so
you may want to unroll this 2x in -O3, but definitely not 8x. So I see the streams
issue as a problem caused by too aggressive unroll settings. I think if you
address that first, you're unlikely going to have an issue with too many streams.
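
For illustration (a sketch only), a loop with four read streams in the
sense used here; each iteration is already roughly ten instructions once
the loads, multiplies, add, store and loop control are counted, so 2x may
be reasonable while 8x would mostly bloat the body:

  void
  combine (double *restrict out, const double *restrict a,
           const double *restrict b, const double *restrict c,
           const double *restrict d, int n)
  {
    for (int i = 0; i < n; i++)
      out[i] = a[i] * b[i] + c[i] * d[i];
  }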

> * I expect that there will be some cost-model changes might be needed
> to handle (or provide ability to handle) various loop preferences of
> the micro-architectures. I am sending this patch for review early to
> get feedbacks on this.

Yes it should be feasible to have settings based on backend preference
and optimization level (so O3/Ofast will unroll more than O2).

> * Position of the pass in passes.def can also be changed. Example,
> unrolling before SLP.

As long as it runs before IVOpt so we get base+immediate addressing modes.

Wilco


* [RFC] Tree loop unroller pass
@ 2018-02-12 23:52 Kugan Vivekanandarajah
  0 siblings, 0 replies; 11+ messages in thread
From: Kugan Vivekanandarajah @ 2018-02-12 23:52 UTC (permalink / raw)
  To: GCC Patches

Hi All,

Based on the previous discussions, I tried to implement a tree loop
unroller for partial unrolling. I would like to queue this RFC patches
for next stage1 review.

In summary:

* Cost-model for selecting the loop uses the same params used
elsewhere in related optimizations. I was told that keeping this the same
would allow better tuning for all the optimizations.

* I have also implemented an option to limit loops based on memory
streams, i.e., for some micro-architectures where limiting the resulting
memory streams is preferred, and this is used to limit the unrolling factor.

* I have tested this on variants of aarch64 and the results are
promising. I am in the process of running benchmarks on x86. I will
update the results later.

* I expect that some cost-model changes might be needed
to handle (or provide the ability to handle) various loop preferences of
the micro-architectures. I am sending this patch for review early to
get feedback on this.

* Position of the pass in passes.def can also be changed. Example,
unrolling before SLP.

* I have bootstrapped and regression tested on aarch64-linux-gnu.
There are no execution errors or ICEs. There are some testsuite
differences as expected. A few of them need further evaluation and I am
doing that now.

Patches are organized as:

Patch1: Adds a target hook TARGET_HW_MAX_MEM_READ_STREAMS. The loop
unroller, if the hook is defined, will try to limit the unrolling factor based on
this.

Patch2: Implements the tree loop unroller using the infrastructure
provided. The pass itself is very simple.

Patch3: Implements target hook TARGET_HW_MAX_MEM_READ_STREAMS for aarch64.

Patch4: Implements a machine reorg pass for aarch64/Falkor to handle
prefetcher tag collisions. This is strictly not part of the loop
unroller, but for Falkor, unrolling can make the h/w prefetcher perform
badly if there are too many tag collisions, based on the discussions in
https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.

Thanks,
Kugan


end of thread

Thread overview: 11+ messages
2018-02-12 23:56 [RFC] Tree Loop Unroller Pass Kugan Vivekanandarajah
2018-02-20 19:53 ` Andrew Pinski
  -- strict thread matches above, loose matches on Subject: below --
2018-02-13 13:05 [RFC] Tree loop unroller pass Wilco Dijkstra
2018-02-15 22:31 ` Kugan Vivekanandarajah
2018-02-16 11:56   ` Richard Biener
2018-02-16 14:22     ` Wilco Dijkstra
2018-02-16 15:00       ` Richard Biener
2018-02-16 15:32         ` Wilco Dijkstra
2018-02-20 15:46       ` Michael Matz
2018-02-20  0:17     ` Kugan Vivekanandarajah
2018-02-12 23:52 Kugan Vivekanandarajah
