public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
@ 2011-07-12 12:35 Илья Энкович
  2011-07-12 17:08 ` William J. Schmidt
                   ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Илья Энкович @ 2011-07-12 12:35 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2447 bytes --]

Hello,

Here is a patch related to missed optimization opportunity in tree
reassoc phase.

Currently tree reassoc phase always generates a linear form which
requires the minimum registers but has the highest tree height and
does not allow computation to be performed in parallel. It may be
critical for performance if required operations have high latency but
can be pipelined (i.e. few execution units or low throughput). This
problem becomes important on current Atom processors which are
in-order and have many such instructions: IMUL and scalar SSE FP
instructions.

This patch introduces a new feature to tree reassoc phase to generate
computation tree with reduced height allowing to perform few
long-latency instructions in parallel. It changes only one part of
reassociation - rewrite_expr_tree. A level of parallelism is
controlled via target hook and/or command line option.

New feature is enabled for Atom only by default. Patch boosts mostly
CFP2000 geomean on Atom: +3.04% for 32 bit and +0.32% for 64 bit.

Bootstrapped and checked on x86_64-linux.

Thanks,
Ilya
--
gcc/

2011-07-12  Enkovich Ilya  <ilya.enkovich@intel.com>

	* target.def (reassociation_width): New hook.

	* doc/tm.texi.in (reassociation_width): New hook documentation.

	* doc/tm.texi (reassociation_width): Likewise.

	* hooks.h (hook_int_const_gimple_1): New default hook.

	* hooks.c (hook_int_const_gimple_1): Likewise.

	* config/i386/i386.h (ix86_tune_indices): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(TARGET_REASSOC_INT_TO_PARALLEL): New.
	(TARGET_REASSOC_FP_TO_PARALLEL): Likewise.

	* config/i386/i386.c (initial_ix86_tune_features): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(ix86_reassociation_width) implementation of
	new hook for i386 target.

	* common.opt (ftree-reassoc-width): New option added.

	* tree-ssa-reassoc.c (get_required_cycles): New function.
	(get_reassociation_width): Likewise.
	(rewrite_expr_tree_parallel): Likewise.

	(reassociate_bb): Now checks reassociation width to be used
	and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
	if needed.

	(pass_reassoc): TODO_remove_unused_locals flag added.

gcc/testsuite/

2011-07-12  Enkovich Ilya  <ilya.enkovich@intel.com>

	* gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
	-ftree-reassoc-width=1.

	* gcc.dg/tree-ssa/reassoc-24.c: New test.
	* gcc.dg/tree-ssa/reassoc-25.c: Likewise.

[-- Attachment #2: PR44382.diff --]
[-- Type: application/octet-stream, Size: 13803 bytes --]

diff --git a/gcc/common.opt b/gcc/common.opt
index f127936..1ba383f 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1993,6 +1993,11 @@ ftree-reassoc
 Common Report Var(flag_tree_reassoc) Init(1) Optimization
 Enable reassociation on tree level
 
+ftree-reassoc-width=
+Common RejectNegative Joined UInteger Report Var(flag_tree_reassoc_width) Init(-1) Optimization
+Set reassociation parallelism level to be used by tree
+reassociation.
+
 ftree-salias
 Common Ignore
 Does nothing.  Preserved for backward compatibility.
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a46101b..9a232cb 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2088,7 +2088,15 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
 
   /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
      the auto-vectorizer.  */
-  m_BDVER1
+  m_BDVER1,
+
+  /* X86_TUNE_REASSOC_INT_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of integer computation.  */
+  m_ATOM,
+
+  /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of fp computation.  */
+  m_ATOM
 };
 
 /* Feature tests against the various architecture variations.  */
@@ -33753,6 +33761,8 @@ ix86_enum_va_list (int idx, const char **pname, tree *ptree)
 #define TARGET_SCHED_DISPATCH has_dispatch
 #undef TARGET_SCHED_DISPATCH_DO
 #define TARGET_SCHED_DISPATCH_DO do_dispatch
+#undef TARGET_SCHED_REASSOCIATION_WIDTH
+#define TARGET_SCHED_REASSOCIATION_WIDTH ix86_reassociation_width
 
 /* The size of the dispatch window is the total number of bytes of
    object code allowed in a window.  */
@@ -34549,6 +34559,22 @@ has_dispatch (rtx insn, int action)
   return false;
 }
 
+static int
+ix86_reassociation_width (const_gimple stmt)
+{
+  tree rhs1 = gimple_assign_rhs1 (stmt);
+  int res = 1;
+
+  if (INTEGRAL_TYPE_P (TREE_TYPE (rhs1))
+      && TARGET_REASSOC_INT_TO_PARALLEL)
+    res = 2;
+  else if (FLOAT_TYPE_P (TREE_TYPE (rhs1))
+	   && TARGET_REASSOC_FP_TO_PARALLEL)
+    res = 2;
+
+  return res;
+}
+
 /* ??? No autovectorization into MMX or 3DNOW until we can reliably
    place emms and femms instructions.  */
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 8cef4e7..359b45a 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -314,6 +314,8 @@ enum ix86_tune_indices {
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
   X86_TUNE_AVX128_OPTIMAL,
+  X86_TUNE_REASSOC_INT_TO_PARALLEL,
+  X86_TUNE_REASSOC_FP_TO_PARALLEL,
 
   X86_TUNE_LAST
 };
@@ -412,6 +414,11 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
 #define TARGET_AVX128_OPTIMAL \
 	ix86_tune_features[X86_TUNE_AVX128_OPTIMAL]
+#define TARGET_REASSOC_INT_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_INT_TO_PARALLEL]
+#define TARGET_REASSOC_FP_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_FP_TO_PARALLEL]
+
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
   X86_ARCH_CMOVE,		/* || TARGET_SSE */
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c0648a5..901e6ce 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6802,6 +6802,11 @@ the order of instructions is important for correctness when scheduling, but
 also the latencies of operations.
 @end deftypevr
 
+@deftypefn {Target Hook} int TARGET_SCHED_REASSOCIATION_WIDTH (const_gimple @var{stmt})
+This hook is called by tree reassociator to determine a level of
+parallelism required in output calculations chain.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3660d36..908d3c1 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -6741,6 +6741,11 @@ in its second parameter.
 
 @hook TARGET_SCHED_EXPOSED_PIPELINE
 
+@hook TARGET_SCHED_REASSOCIATION_WIDTH
+This hook is called by tree reassociator to determine a level of
+parallelism required in output calculations chain.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/hooks.c b/gcc/hooks.c
index 9dfde82..a0cb560 100644
--- a/gcc/hooks.c
+++ b/gcc/hooks.c
@@ -161,6 +161,12 @@ default_can_output_mi_thunk_no_vcall (const_tree a ATTRIBUTE_UNUSED,
 }
 
 int
+hook_int_const_gimple_1 (const_gimple a ATTRIBUTE_UNUSED)
+{
+  return 1;
+}
+
+int
 hook_int_const_tree_0 (const_tree a ATTRIBUTE_UNUSED)
 {
   return 0;
diff --git a/gcc/hooks.h b/gcc/hooks.h
index a1b0112..5cc2d42 100644
--- a/gcc/hooks.h
+++ b/gcc/hooks.h
@@ -67,6 +67,7 @@ extern void hook_void_tree_treeptr (tree, tree *);
 extern void hook_void_int_int (int, int);
 extern void hook_void_gcc_optionsp (struct gcc_options *);
 
+extern int hook_int_const_gimple_1 (const_gimple);
 extern int hook_int_const_tree_0 (const_tree);
 extern int hook_int_const_tree_const_tree_1 (const_tree, const_tree);
 extern int hook_int_rtx_0 (rtx);
diff --git a/gcc/target.def b/gcc/target.def
index c67f0ba..fa13282 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -904,6 +904,15 @@ the order of instructions is important for correctness when scheduling, but\n\
 also the latencies of operations.",
 bool, false)
 
+/* The following member value is a function that returns number
+   of operations reassociator should try to put in parallel for
+   statements of the given type.  By default 1 is used.  */
+DEFHOOK
+(reassociation_width,
+"",
+int, (const_gimple stmt),
+hook_int_const_gimple_1)
+
 HOOK_VECTOR_END (sched)
 
 /* Functions relating to vectorization.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
index e787227..90c1cd4 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
@@ -1,6 +1,6 @@
 /* PR middle-end/38533 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-reassoc1" } */
+/* { dg-options "-O2 -ftree-reassoc-width=1 -fdump-tree-reassoc1" } */
 
 #define A asm volatile ("" : "=r" (f) : "0" (0)); e |= f;
 #define B A A A A A A A A A A A
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
new file mode 100644
index 0000000..bc40a1d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-reassoc-width=2 -fdump-tree-reassoc1" } */
+
+unsigned int
+foo (void)
+{
+  unsigned int a = 0;
+  unsigned int b;
+
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+
+  return a;
+}
+
+/* Verify there are two pairs of __asm__ statements with no
+   intervening stmts.  */
+/* { dg-final { scan-tree-dump-times "__asm__\[^;\n]*;\n *__asm__" 2 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
new file mode 100644
index 0000000..d308937
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-reassoc-width=3 -fdump-tree-reassoc1-details" } */
+
+unsigned int
+foo (int a, int b, int c, int d)
+{
+  unsigned int s = 0;
+
+  s += a;
+  s += b;
+  s += c;
+  s += d;
+
+  return s;
+}
+
+/* Verify reassociation width was chosen to be 2.  */
+/* { dg-final { scan-tree-dump-times "Width = 2" 1 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index d8f9e2e..0bf48cb 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "pointer-set.h"
 #include "cfgloop.h"
 #include "flags.h"
+#include "target.h"
 
 /*  This is a simple global reassociation pass.  It is, in part, based
     on the LLVM pass of the same name (They do some things more/less
@@ -1474,6 +1475,167 @@ remove_visited_stmt_chain (tree var)
     }
 }
 
+/* Find out how many cycles we need to compute whole statements
+   chain holding ops_num operands if may execute up to cpu_width
+   statements per cycle.  */
+
+static int
+get_required_cycles (int ops_num, int cpu_width)
+{
+  int res;
+  int elog;
+  unsigned int rest;
+
+  /* While we have more than 2 * cpu_width operands
+     we may reduce number of operands by cpu_width
+     per cycle.  */
+  res = ops_num / (2 * cpu_width);
+
+  /* Remained operands count may be reduced twice per cycle
+     until we have only one operand.  */
+  rest = (unsigned)(ops_num - res * cpu_width);
+  elog = exact_log2 (rest);
+  if (elog >= 0)
+    res += elog;
+  else
+    res += floor_log2 (rest) + 1;
+
+  return res;
+}
+
+/* Returns an optimal number of registers to use for computation of
+   given statements.  */
+
+static int
+get_reassociation_width (VEC(operand_entry_t, heap) * ops, gimple stmt)
+{
+  int ops_num = VEC_length (operand_entry_t, ops);
+  int width;
+  int cycles_best;
+
+  if (flag_tree_reassoc_width > 0)
+    width = flag_tree_reassoc_width;
+  else
+    width = targetm.sched.reassociation_width (stmt);
+
+  if (width == 1)
+    return width;
+
+  /* Get the minimal time required for sequence computation.  */
+  cycles_best = get_required_cycles (ops_num, width);
+
+  /* Check if we may use less width and still compute sequence for
+     the same time.  It will allow us to reduce registers usage.  */
+  while (width > 1 &&
+	 get_required_cycles (ops_num, width - 1) == cycles_best)
+    width--;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file,
+	       "Width = %d was chosen for reassociation\n", width);
+    }
+
+  return width;
+}
+
+/* Recursively rewrite our linearized statements so that the operators
+   match those in OPS[OPINDEX], putting the computation in rank
+   order and trying to allow operations to be executed in
+   parallel.  */
+
+static void
+rewrite_expr_tree_parallel (gimple stmt, int width,
+			    VEC(operand_entry_t, heap) * ops)
+{
+  enum tree_code opcode = gimple_assign_rhs_code (stmt);
+  int op_num = VEC_length (operand_entry_t, ops);
+  int stmt_num = op_num - 1;
+  gimple *stmts = XNEWVEC (gimple, stmt_num);
+  int op_index = op_num - 1;
+  int stmt_index = 0;
+  int ready_stmts_end = 0;
+  int i = 0;
+  tree last_rhs1 = gimple_assign_rhs1 (stmt);
+
+  /* We start expression rewriting from the top statements.
+     So, in this loop we create a full list of statements
+     we will work with.  */
+  stmts[stmt_num - 1] = stmt;
+  for (i = stmt_num - 2; i >= 0; i--)
+    stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
+
+  for (i = 0; i < stmt_num; i++)
+    {
+      tree op1, op2;
+
+      /* Determine whether we should use results of
+	 already handled statements or not.  */
+      if (ready_stmts_end == 0 &&
+	  (i - stmt_index >= width || op_index < 1))
+	ready_stmts_end = i;
+
+      if (ready_stmts_end > 0)
+	{
+	  op1 = gimple_assign_lhs (stmts[stmt_index++]);
+	  if (ready_stmts_end > stmt_index)
+	    op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	  else if (op_index >= 0)
+	    op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  else
+	    {
+	      gcc_assert (stmt_index < i);
+	      op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	    }
+
+	  if (stmt_index >= ready_stmts_end)
+	    ready_stmts_end = 0;
+	}
+      else
+	{
+	  op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  op1 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	}
+
+      /* If we emit the last statement then we should put
+         operands into the last statement.  It will also
+         break the loop.  */
+      if (op_index < 0 && stmt_index == i)
+	i = stmt_num - 1;
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Transforming ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+
+      /* We keep original statement only for the last one.  All
+	 others are recreated.  */
+      if (i == stmt_num - 1)
+	{
+	  gimple_assign_set_rhs1 (stmts[i], op1);
+	  gimple_assign_set_rhs2 (stmts[i], op2);
+	  update_stmt (stmts[i]);
+	}
+      else
+	{
+	  tree var = create_tmp_reg (TREE_TYPE (last_rhs1), "reassoc");
+	  add_referenced_var (var);
+	  stmts[i] = build_and_add_sum (var, op1, op2, opcode);
+	}
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+    }
+
+  remove_visited_stmt_chain (last_rhs1);
+
+  free (stmts);
+}
+
 /* Recursively rewrite our linearized statements so that the operators
    match those in OPS[OPINDEX], putting the computation in rank
    order.  */
@@ -2139,7 +2301,15 @@ reassociate_bb (basic_block bb)
 		    }
 		}
 	      else
-		rewrite_expr_tree (stmt, 0, ops, false);
+		{
+		  int width = get_reassociation_width (ops, stmt);
+
+		  if (width > 1 &&
+		      VEC_length (operand_entry_t, ops) > 3)
+		    rewrite_expr_tree_parallel (stmt, width, ops);
+		  else
+		    rewrite_expr_tree (stmt, 0, ops, false);
+		}
 
 	      VEC_free (operand_entry_t, heap, ops);
 	    }
@@ -2297,7 +2467,8 @@ struct gimple_opt_pass pass_reassoc =
   0,					/* properties_provided */
   0,					/* properties_destroyed */
   0,					/* todo_flags_start */
-  TODO_verify_ssa
+  TODO_remove_unused_locals
+    | TODO_verify_ssa
     | TODO_verify_flow
     | TODO_ggc_collect			/* todo_flags_finish */
  }

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2011-09-06 16:43 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-12 12:35 [PATCH Atom][PR middle-end/44382] Tree reassociation improvement Илья Энкович
2011-07-12 17:08 ` William J. Schmidt
2011-07-12 17:22   ` H.J. Lu
2011-07-13  8:28     ` Ilya Enkovich
2011-07-12 17:24   ` William J. Schmidt
2011-07-13  8:13     ` Ilya Enkovich
2011-07-13  8:30       ` Jakub Jelinek
2011-07-13  9:04         ` Ilya Enkovich
2011-07-13  9:52           ` Jakub Jelinek
2011-07-13  9:59             ` Ilya Enkovich
2011-07-13 11:01               ` Richard Guenther
2011-07-13 13:11   ` William J. Schmidt
2011-07-13 18:05 ` Michael Meissner
2011-07-13 20:18   ` Ilya Enkovich
2011-07-14  2:39 ` Michael Meissner
2011-07-14  9:40   ` Richard Guenther
2011-07-14  9:42     ` Richard Guenther
2011-07-14 10:04       ` Ilya Enkovich
2011-07-14 10:14         ` Richard Guenther
2011-07-14 11:03           ` Ilya Enkovich
2011-07-14 15:31             ` Ilya Enkovich
2011-07-19 12:10               ` Richard Guenther
2011-07-19 15:25                 ` Ilya Enkovich
2011-08-05 11:46                   ` Richard Guenther
2011-08-05 15:08                     ` Ilya Enkovich
2011-08-10 15:19                       ` Ilya Enkovich
2011-08-19  9:09                         ` Ilya Enkovich
2011-08-26 11:45                           ` Ilya Enkovich
2011-08-30 13:16                         ` Richard Guenther
2011-08-31 15:38                           ` Ilya Enkovich
2011-09-01  9:24                             ` Richard Guenther
2011-09-01 10:28                               ` Ilya Enkovich
2011-09-02 12:53                                 ` Uros Bizjak
2011-09-02 13:07                                   ` Richard Guenther
2011-09-02 13:46                                     ` Ilya Enkovich
2011-09-02 14:01                                       ` Richard Guenther
2011-09-02 14:13                                         ` Ilya Enkovich
2011-09-06 12:40                                   ` Ilya Enkovich
2011-09-06 15:53                                     ` Uros Bizjak
2011-09-06 16:18                                       ` Ilya Enkovich
2011-09-06 16:45                                         ` H.J. Lu
2011-07-26  9:54                 ` Ilya Enkovich
2011-08-03  8:35                   ` Ilya Enkovich
2011-07-21 20:14               ` Joseph S. Myers
2011-07-14 16:04       ` Michael Meissner
2011-07-14 16:08         ` Ilya Enkovich
2011-07-14 10:38   ` Ilya Enkovich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).