public inbox for gcc-patches@gcc.gnu.org
* [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
@ 2011-07-12 12:35 Илья Энкович
  2011-07-12 17:08 ` William J. Schmidt
                   ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Илья Энкович @ 2011-07-12 12:35 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2447 bytes --]

Hello,

Here is a patch addressing a missed optimization opportunity in the
tree reassoc phase.

Currently the tree reassoc phase always generates a linear form, which
requires the minimum number of registers but has the maximum tree
height and does not allow computations to be performed in parallel.
That can be critical for performance when the operations involved have
high latency but can be pipelined (i.e. the machine has few execution
units or limited throughput). The problem is significant on current
Atom processors, which are in-order and have many such instructions:
IMUL and the scalar SSE FP instructions.

This patch introduces a new feature in the tree reassoc phase that
generates a computation tree of reduced height, allowing several
long-latency instructions to execute in parallel. It changes only one
part of reassociation - rewrite_expr_tree. The level of parallelism is
controlled via a target hook and/or a command-line option.
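
To illustrate the idea (both functions below are hand-written sketches,
not compiler output), a sum like the following is currently emitted as
a single dependent chain of seven additions:

unsigned int
sum8 (unsigned int a, unsigned int b, unsigned int c, unsigned int d,
      unsigned int e, unsigned int f, unsigned int g, unsigned int h)
{
  return ((((((a + b) + c) + d) + e) + f) + g) + h;
}

With a reassociation width of 2 the rewrite produces roughly two
independent chains whose additions can issue in parallel, cutting the
critical path from 7 to 4 additions:

unsigned int
sum8_width2 (unsigned int a, unsigned int b, unsigned int c, unsigned int d,
             unsigned int e, unsigned int f, unsigned int g, unsigned int h)
{
  unsigned int t1 = a + b, t2 = c + d;   /* two chains started */
  t1 += e;  t2 += f;
  t1 += g;  t2 += h;
  return t1 + t2;                        /* combine the chains */
}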

The new feature is enabled by default only for Atom. The patch mostly
boosts the CFP2000 geomean on Atom: +3.04% for 32 bit and +0.32% for
64 bit.

Bootstrapped and checked on x86_64-linux.

Thanks,
Ilya
--
gcc/

2011-07-12  Enkovich Ilya  <ilya.enkovich@intel.com>

	* target.def (reassociation_width): New hook.

	* doc/tm.texi.in (reassociation_width): New hook documentation.

	* doc/tm.texi (reassociation_width): Likewise.

	* hooks.h (hook_int_const_gimple_1): New default hook.

	* hooks.c (hook_int_const_gimple_1): Likewise.

	* config/i386/i386.h (ix86_tune_indices): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(TARGET_REASSOC_INT_TO_PARALLEL): New.
	(TARGET_REASSOC_FP_TO_PARALLEL): Likewise.

	* config/i386/i386.c (initial_ix86_tune_features): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(ix86_reassociation_width): Implementation of
	new hook for i386 target.

	* common.opt (ftree-reassoc-width): New option added.

	* tree-ssa-reassoc.c (get_required_cycles): New function.
	(get_reassociation_width): Likewise.
	(rewrite_expr_tree_parallel): Likewise.

	(reassociate_bb): Now checks the reassociation width to be used
	and calls rewrite_expr_tree_parallel instead of rewrite_expr_tree
	if needed.

	(pass_reassoc): TODO_remove_unused_locals flag added.

gcc/testsuite/

2011-07-12  Enkovich Ilya  <ilya.enkovich@intel.com>

	* gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
	-ftree-reassoc-width=1.

	* gcc.dg/tree-ssa/reassoc-24.c: New test.
	* gcc.dg/tree-ssa/reassoc-25.c: Likewise.

[-- Attachment #2: PR44382.diff --]
[-- Type: application/octet-stream, Size: 13803 bytes --]

diff --git a/gcc/common.opt b/gcc/common.opt
index f127936..1ba383f 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1993,6 +1993,11 @@ ftree-reassoc
 Common Report Var(flag_tree_reassoc) Init(1) Optimization
 Enable reassociation on tree level
 
+ftree-reassoc-width=
+Common RejectNegative Joined UInteger Report Var(flag_tree_reassoc_width) Init(-1) Optimization
+Set reassociation parallelism level to be used by tree
+reassociation.
+
 ftree-salias
 Common Ignore
 Does nothing.  Preserved for backward compatibility.
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a46101b..9a232cb 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2088,7 +2088,15 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
 
   /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
      the auto-vectorizer.  */
-  m_BDVER1
+  m_BDVER1,
+
+  /* X86_TUNE_REASSOC_INT_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of integer computation.  */
+  m_ATOM,
+
+  /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of fp computation.  */
+  m_ATOM
 };
 
 /* Feature tests against the various architecture variations.  */
@@ -33753,6 +33761,8 @@ ix86_enum_va_list (int idx, const char **pname, tree *ptree)
 #define TARGET_SCHED_DISPATCH has_dispatch
 #undef TARGET_SCHED_DISPATCH_DO
 #define TARGET_SCHED_DISPATCH_DO do_dispatch
+#undef TARGET_SCHED_REASSOCIATION_WIDTH
+#define TARGET_SCHED_REASSOCIATION_WIDTH ix86_reassociation_width
 
 /* The size of the dispatch window is the total number of bytes of
    object code allowed in a window.  */
@@ -34549,6 +34559,22 @@ has_dispatch (rtx insn, int action)
   return false;
 }
 
+static int
+ix86_reassociation_width (const_gimple stmt)
+{
+  tree rhs1 = gimple_assign_rhs1 (stmt);
+  int res = 1;
+
+  if (INTEGRAL_TYPE_P (TREE_TYPE (rhs1))
+      && TARGET_REASSOC_INT_TO_PARALLEL)
+    res = 2;
+  else if (FLOAT_TYPE_P (TREE_TYPE (rhs1))
+	   && TARGET_REASSOC_FP_TO_PARALLEL)
+    res = 2;
+
+  return res;
+}
+
 /* ??? No autovectorization into MMX or 3DNOW until we can reliably
    place emms and femms instructions.  */
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 8cef4e7..359b45a 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -314,6 +314,8 @@ enum ix86_tune_indices {
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
   X86_TUNE_AVX128_OPTIMAL,
+  X86_TUNE_REASSOC_INT_TO_PARALLEL,
+  X86_TUNE_REASSOC_FP_TO_PARALLEL,
 
   X86_TUNE_LAST
 };
@@ -412,6 +414,11 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
 #define TARGET_AVX128_OPTIMAL \
 	ix86_tune_features[X86_TUNE_AVX128_OPTIMAL]
+#define TARGET_REASSOC_INT_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_INT_TO_PARALLEL]
+#define TARGET_REASSOC_FP_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_FP_TO_PARALLEL]
+
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
   X86_ARCH_CMOVE,		/* || TARGET_SSE */
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c0648a5..901e6ce 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6802,6 +6802,11 @@ the order of instructions is important for correctness when scheduling, but
 also the latencies of operations.
 @end deftypevr
 
+@deftypefn {Target Hook} int TARGET_SCHED_REASSOCIATION_WIDTH (const_gimple @var{stmt})
+This hook is called by tree reassociator to determine a level of
+parallelism required in output calculations chain.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3660d36..908d3c1 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -6741,6 +6741,11 @@ in its second parameter.
 
 @hook TARGET_SCHED_EXPOSED_PIPELINE
 
+@hook TARGET_SCHED_REASSOCIATION_WIDTH
+This hook is called by tree reassociator to determine a level of
+parallelism required in output calculations chain.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/hooks.c b/gcc/hooks.c
index 9dfde82..a0cb560 100644
--- a/gcc/hooks.c
+++ b/gcc/hooks.c
@@ -161,6 +161,12 @@ default_can_output_mi_thunk_no_vcall (const_tree a ATTRIBUTE_UNUSED,
 }
 
 int
+hook_int_const_gimple_1 (const_gimple a ATTRIBUTE_UNUSED)
+{
+  return 1;
+}
+
+int
 hook_int_const_tree_0 (const_tree a ATTRIBUTE_UNUSED)
 {
   return 0;
diff --git a/gcc/hooks.h b/gcc/hooks.h
index a1b0112..5cc2d42 100644
--- a/gcc/hooks.h
+++ b/gcc/hooks.h
@@ -67,6 +67,7 @@ extern void hook_void_tree_treeptr (tree, tree *);
 extern void hook_void_int_int (int, int);
 extern void hook_void_gcc_optionsp (struct gcc_options *);
 
+extern int hook_int_const_gimple_1 (const_gimple);
 extern int hook_int_const_tree_0 (const_tree);
 extern int hook_int_const_tree_const_tree_1 (const_tree, const_tree);
 extern int hook_int_rtx_0 (rtx);
diff --git a/gcc/target.def b/gcc/target.def
index c67f0ba..fa13282 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -904,6 +904,15 @@ the order of instructions is important for correctness when scheduling, but\n\
 also the latencies of operations.",
 bool, false)
 
+/* The following member value is a function that returns number
+   of operations reassociator should try to put in parallel for
+   statements of the given type.  By default 1 is used.  */
+DEFHOOK
+(reassociation_width,
+"",
+int, (const_gimple stmt),
+hook_int_const_gimple_1)
+
 HOOK_VECTOR_END (sched)
 
 /* Functions relating to vectorization.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
index e787227..90c1cd4 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
@@ -1,6 +1,6 @@
 /* PR middle-end/38533 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-reassoc1" } */
+/* { dg-options "-O2 -ftree-reassoc-width=1 -fdump-tree-reassoc1" } */
 
 #define A asm volatile ("" : "=r" (f) : "0" (0)); e |= f;
 #define B A A A A A A A A A A A
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
new file mode 100644
index 0000000..bc40a1d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-reassoc-width=2 -fdump-tree-reassoc1" } */
+
+unsigned int
+foo (void)
+{
+  unsigned int a = 0;
+  unsigned int b;
+
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+
+  return a;
+}
+
+/* Verify there are two pairs of __asm__ statements with no
+   intervening stmts.  */
+/* { dg-final { scan-tree-dump-times "__asm__\[^;\n]*;\n *__asm__" 2 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
new file mode 100644
index 0000000..d308937
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-reassoc-width=3 -fdump-tree-reassoc1-details" } */
+
+unsigned int
+foo (int a, int b, int c, int d)
+{
+  unsigned int s = 0;
+
+  s += a;
+  s += b;
+  s += c;
+  s += d;
+
+  return s;
+}
+
+/* Verify reassociation width was chosen to be 2.  */
+/* { dg-final { scan-tree-dump-times "Width = 2" 1 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index d8f9e2e..0bf48cb 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "pointer-set.h"
 #include "cfgloop.h"
 #include "flags.h"
+#include "target.h"
 
 /*  This is a simple global reassociation pass.  It is, in part, based
     on the LLVM pass of the same name (They do some things more/less
@@ -1474,6 +1475,167 @@ remove_visited_stmt_chain (tree var)
     }
 }
 
+/* Find out how many cycles we need to compute whole statements
+   chain holding ops_num operands if may execute up to cpu_width
+   statements per cycle.  */
+
+static int
+get_required_cycles (int ops_num, int cpu_width)
+{
+  int res;
+  int elog;
+  unsigned int rest;
+
+  /* While we have more than 2 * cpu_width operands
+     we may reduce number of operands by cpu_width
+     per cycle.  */
+  res = ops_num / (2 * cpu_width);
+
+  /* Remained operands count may be reduced twice per cycle
+     until we have only one operand.  */
+  rest = (unsigned)(ops_num - res * cpu_width);
+  elog = exact_log2 (rest);
+  if (elog >= 0)
+    res += elog;
+  else
+    res += floor_log2 (rest) + 1;
+
+  return res;
+}
+
+/* Returns an optimal number of registers to use for computation of
+   given statements.  */
+
+static int
+get_reassociation_width (VEC(operand_entry_t, heap) * ops, gimple stmt)
+{
+  int ops_num = VEC_length (operand_entry_t, ops);
+  int width;
+  int cycles_best;
+
+  if (flag_tree_reassoc_width > 0)
+    width = flag_tree_reassoc_width;
+  else
+    width = targetm.sched.reassociation_width (stmt);
+
+  if (width == 1)
+    return width;
+
+  /* Get the minimal time required for sequence computation.  */
+  cycles_best = get_required_cycles (ops_num, width);
+
+  /* Check if we may use less width and still compute sequence for
+     the same time.  It will allow us to reduce registers usage.  */
+  while (width > 1 &&
+	 get_required_cycles (ops_num, width - 1) == cycles_best)
+    width--;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file,
+	       "Width = %d was chosen for reassociation\n", width);
+    }
+
+  return width;
+}
+
+/* Recursively rewrite our linearized statements so that the operators
+   match those in OPS[OPINDEX], putting the computation in rank
+   order and trying to allow operations to be executed in
+   parallel.  */
+
+static void
+rewrite_expr_tree_parallel (gimple stmt, int width,
+			    VEC(operand_entry_t, heap) * ops)
+{
+  enum tree_code opcode = gimple_assign_rhs_code (stmt);
+  int op_num = VEC_length (operand_entry_t, ops);
+  int stmt_num = op_num - 1;
+  gimple *stmts = XNEWVEC (gimple, stmt_num);
+  int op_index = op_num - 1;
+  int stmt_index = 0;
+  int ready_stmts_end = 0;
+  int i = 0;
+  tree last_rhs1 = gimple_assign_rhs1 (stmt);
+
+  /* We start expression rewriting from the top statements.
+     So, in this loop we create a full list of statements
+     we will work with.  */
+  stmts[stmt_num - 1] = stmt;
+  for (i = stmt_num - 2; i >= 0; i--)
+    stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
+
+  for (i = 0; i < stmt_num; i++)
+    {
+      tree op1, op2;
+
+      /* Determine whether we should use results of
+	 already handled statements or not.  */
+      if (ready_stmts_end == 0 &&
+	  (i - stmt_index >= width || op_index < 1))
+	ready_stmts_end = i;
+
+      if (ready_stmts_end > 0)
+	{
+	  op1 = gimple_assign_lhs (stmts[stmt_index++]);
+	  if (ready_stmts_end > stmt_index)
+	    op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	  else if (op_index >= 0)
+	    op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  else
+	    {
+	      gcc_assert (stmt_index < i);
+	      op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	    }
+
+	  if (stmt_index >= ready_stmts_end)
+	    ready_stmts_end = 0;
+	}
+      else
+	{
+	  op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  op1 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	}
+
+      /* If we emit the last statement then we should put
+         operands into the last statement.  It will also
+         break the loop.  */
+      if (op_index < 0 && stmt_index == i)
+	i = stmt_num - 1;
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Transforming ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+
+      /* We keep original statement only for the last one.  All
+	 others are recreated.  */
+      if (i == stmt_num - 1)
+	{
+	  gimple_assign_set_rhs1 (stmts[i], op1);
+	  gimple_assign_set_rhs2 (stmts[i], op2);
+	  update_stmt (stmts[i]);
+	}
+      else
+	{
+	  tree var = create_tmp_reg (TREE_TYPE (last_rhs1), "reassoc");
+	  add_referenced_var (var);
+	  stmts[i] = build_and_add_sum (var, op1, op2, opcode);
+	}
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+    }
+
+  remove_visited_stmt_chain (last_rhs1);
+
+  free (stmts);
+}
+
 /* Recursively rewrite our linearized statements so that the operators
    match those in OPS[OPINDEX], putting the computation in rank
    order.  */
@@ -2139,7 +2301,15 @@ reassociate_bb (basic_block bb)
 		    }
 		}
 	      else
-		rewrite_expr_tree (stmt, 0, ops, false);
+		{
+		  int width = get_reassociation_width (ops, stmt);
+
+		  if (width > 1 &&
+		      VEC_length (operand_entry_t, ops) > 3)
+		    rewrite_expr_tree_parallel (stmt, width, ops);
+		  else
+		    rewrite_expr_tree (stmt, 0, ops, false);
+		}
 
 	      VEC_free (operand_entry_t, heap, ops);
 	    }
@@ -2297,7 +2467,8 @@ struct gimple_opt_pass pass_reassoc =
   0,					/* properties_provided */
   0,					/* properties_destroyed */
   0,					/* todo_flags_start */
-  TODO_verify_ssa
+  TODO_remove_unused_locals
+    | TODO_verify_ssa
     | TODO_verify_flow
     | TODO_ggc_collect			/* todo_flags_finish */
  }


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-12 12:35 [PATCH Atom][PR middle-end/44382] Tree reassociation improvement Илья Энкович
@ 2011-07-12 17:08 ` William J. Schmidt
  2011-07-12 17:22   ` H.J. Lu
                     ` (2 more replies)
  2011-07-13 18:05 ` Michael Meissner
  2011-07-14  2:39 ` Michael Meissner
  2 siblings, 3 replies; 47+ messages in thread
From: William J. Schmidt @ 2011-07-12 17:08 UTC (permalink / raw)
  To: Илья
	Энкович
  Cc: gcc-patches

Ilya, thanks for posting this!  This patch is also useful on powerpc64.
Applying it solved a performance degradation with bwaves due to a loss of
reassociation somewhere between 4.5 and 4.6 (still tracking it down).
When we apply -ftree-reassoc-width=2 to bwaves, the more optimal code
generation returns.

Bill

On Tue, 2011-07-12 at 16:30 +0400, Илья Энкович wrote:
> Hello,
> 
> Here is a patch related to missed optimization opportunity in tree
> reassoc phase.
> 
> Currently tree reassoc phase always generates a linear form which
> requires the minimum registers but has the highest tree height and
> does not allow computation to be performed in parallel. It may be
> critical for performance if required operations have high latency but
> can be pipelined (i.e. few execution units or low throughput). This
> problem becomes important on current Atom processors which are
> in-order and have many such instructions: IMUL and scalar SSE FP
> instructions.
> 
> This patch introduces a new feature to tree reassoc phase to generate
> computation tree with reduced height allowing to perform few
> long-latency instructions in parallel. It changes only one part of
> reassociation - rewrite_expr_tree. A level of parallelism is
> controlled via target hook and/or command line option.
> 
> New feature is enabled for Atom only by default. Patch boosts mostly
> CFP2000 geomean on Atom: +3.04% for 32 bit and +0.32% for 64 bit.
> 
> Bootstrapped and checked on x86_64-linux.
> 
> Thanks,
> Ilya
> --
> gcc/
> 
> 2011-07-12  Enkovich Ilya  <ilya.enkovich@intel.com>
> 
> 	* target.def (reassociation_width): New hook.
> 
> 	* doc/tm.texi.in (reassociation_width): New hook documentation.
> 
> 	* doc/tm.texi (reassociation_width): Likewise.
> 
> 	* hooks.h (hook_int_const_gimple_1): New default hook.
> 
> 	* hooks.c (hook_int_const_gimple_1): Likewise.
> 
> 	* config/i386/i386.h (ix86_tune_indices): Add
> 	X86_TUNE_REASSOC_INT_TO_PARALLEL and
> 	X86_TUNE_REASSOC_FP_TO_PARALLEL.
> 
> 	(TARGET_REASSOC_INT_TO_PARALLEL): New.
> 	(TARGET_REASSOC_FP_TO_PARALLEL): Likewise.
> 
> 	* config/i386/i386.c (initial_ix86_tune_features): Add
> 	X86_TUNE_REASSOC_INT_TO_PARALLEL and
> 	X86_TUNE_REASSOC_FP_TO_PARALLEL.
> 
> 	(ix86_reassociation_width) implementation of
> 	new hook for i386 target.
> 
> 	* common.opt (ftree-reassoc-width): New option added.
> 
> 	* tree-ssa-reassoc.c (get_required_cycles): New function.
> 	(get_reassociation_width): Likewise.
> 	(rewrite_expr_tree_parallel): Likewise.
> 
> 	(reassociate_bb): Now checks reassociation width to be used
> 	and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
> 	if needed.
> 
> 	(pass_reassoc): TODO_remove_unused_locals flag added.
> 
> gcc/testsuite/
> 
> 2011-07-12  Enkovich Ilya  <ilya.enkovich@intel.com>
> 
> 	* gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
> 	-ftree-reassoc-width=1.
> 
> 	* gcc.dg/tree-ssa/reassoc-24.c: New test.
> 	* gcc.dg/tree-ssa/reassoc-25.c: Likewise.



* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-12 17:08 ` William J. Schmidt
@ 2011-07-12 17:22   ` H.J. Lu
  2011-07-13  8:28     ` Ilya Enkovich
  2011-07-12 17:24   ` William J. Schmidt
  2011-07-13 13:11   ` William J. Schmidt
  2 siblings, 1 reply; 47+ messages in thread
From: H.J. Lu @ 2011-07-12 17:22 UTC (permalink / raw)
  To: William J. Schmidt
  Cc: Илья
	Энкович,
	gcc-patches

On Tue, Jul 12, 2011 at 9:50 AM, William J. Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> Ilya, thanks for posting this!  This patch is useful also on powerpc64.
> Applying it solved a performance degradation with bwaves due to loss of
> reassociation somewhere between 4.5 and 4.6 (still tracking it down).
> When we apply -ftree-reassoc-width=2 to bwaves, the more optimal code
> generation returns.


It is good to know.  Ilya, please mention PR middle-end/44382
in the ChangeLog.

Thanks.

H.J.
--


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-12 17:08 ` William J. Schmidt
  2011-07-12 17:22   ` H.J. Lu
@ 2011-07-12 17:24   ` William J. Schmidt
  2011-07-13  8:13     ` Ilya Enkovich
  2011-07-13 13:11   ` William J. Schmidt
  2 siblings, 1 reply; 47+ messages in thread
From: William J. Schmidt @ 2011-07-12 17:24 UTC (permalink / raw)
  To: Илья
	Энкович
  Cc: gcc-patches

On Tue, 2011-07-12 at 11:50 -0500, William J. Schmidt wrote:
> Ilya, thanks for posting this!  This patch is useful also on powerpc64.
> Applying it solved a performance degradation with bwaves due to loss of
> reassociation somewhere between 4.5 and 4.6 (still tracking it down).
> When we apply -ftree-reassoc-width=2 to bwaves, the more optimal code
> generation returns.
> 
> Bill
> 

However, it does not fix http://gcc.gnu.org/PR45671, which surprises me
as it was marked as a duplicate of this one.  Any thoughts on why this
isn't sufficient to reassociate the linear chain of adds?

Test case:

int myfunction (int a, int b, int c, int d, int e, int f, int g, int h)
{
  int ret;

  ret = a + b + c + d + e + f + g + h;
  return ret;

}



* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-12 17:24   ` William J. Schmidt
@ 2011-07-13  8:13     ` Ilya Enkovich
  2011-07-13  8:30       ` Jakub Jelinek
  0 siblings, 1 reply; 47+ messages in thread
From: Ilya Enkovich @ 2011-07-13  8:13 UTC (permalink / raw)
  To: William J. Schmidt; +Cc: gcc-patches

Hello William,

> However, it does not fix http://gcc.gnu.org/PR45671, which surprises me
> as it was marked as a duplicate of this one.  Any thoughts on why this
> isn't sufficient to reassociate the linear chain of adds?
>
> Test case:
>
> int myfunction (int a, int b, int c, int d, int e, int f, int g, int h)
> {
>  int ret;
>
>  ret = a + b + c + d + e + f + g + h;
>  return ret;
>
> }
>
>
>

Reassociation does not work for signed integers because a signed
integer is not a wrap-around type in C. You can change that by passing
the -fwrapv option, but it will disable other useful optimizations.
Reassociating signed integers without this option is not a trivial
question, because in that case you may introduce overflows and
therefore undefined behavior.
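
For example (a hand-written illustration, not from the patch), the
unsigned variant of your test case has wrap-around semantics and so can
be reassociated freely:

unsigned int myfunction (unsigned int a, unsigned int b, unsigned int c,
                         unsigned int d, unsigned int e, unsigned int f,
                         unsigned int g, unsigned int h)
{
  unsigned int ret;

  /* Unsigned overflow is defined, so any association order is valid.  */
  ret = a + b + c + d + e + f + g + h;
  return ret;
}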

BR
Ilya


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-12 17:22   ` H.J. Lu
@ 2011-07-13  8:28     ` Ilya Enkovich
  0 siblings, 0 replies; 47+ messages in thread
From: Ilya Enkovich @ 2011-07-13  8:28 UTC (permalink / raw)
  To: H.J. Lu; +Cc: William J. Schmidt, gcc-patches

> Ilya, please mention PR middle-end/44382
> in ChangeLog.
>

Thanks for the notice. Here is the corrected ChangeLog:

gcc/

2011-07-12  Enkovich Ilya  <ilya.enkovich@intel.com>

       PR middle-end/44382
       * target.def (reassociation_width): New hook.

       * doc/tm.texi.in (reassociation_width): New hook documentation.

       * doc/tm.texi (reassociation_width): Likewise.

       * hooks.h (hook_int_const_gimple_1): New default hook.

       * hooks.c (hook_int_const_gimple_1): Likewise.

       * config/i386/i386.h (ix86_tune_indices): Add
       X86_TUNE_REASSOC_INT_TO_PARALLEL and
       X86_TUNE_REASSOC_FP_TO_PARALLEL.

       (TARGET_REASSOC_INT_TO_PARALLEL): New.
       (TARGET_REASSOC_FP_TO_PARALLEL): Likewise.

       * config/i386/i386.c (initial_ix86_tune_features): Add
       X86_TUNE_REASSOC_INT_TO_PARALLEL and
       X86_TUNE_REASSOC_FP_TO_PARALLEL.

       (ix86_reassociation_width): Implementation of
       new hook for i386 target.

       * common.opt (ftree-reassoc-width): New option added.

       * tree-ssa-reassoc.c (get_required_cycles): New function.
       (get_reassociation_width): Likewise.
       (rewrite_expr_tree_parallel): Likewise.

       (reassociate_bb): Now checks the reassociation width to be used
       and calls rewrite_expr_tree_parallel instead of rewrite_expr_tree
       if needed.

       (pass_reassoc): TODO_remove_unused_locals flag added.

gcc/testsuite/

2011-07-12  Enkovich Ilya  <ilya.enkovich@intel.com>

       * gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
       -ftree-reassoc-width=1.

       * gcc.dg/tree-ssa/reassoc-24.c: New test.
       * gcc.dg/tree-ssa/reassoc-25.c: Likewise.


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-13  8:13     ` Ilya Enkovich
@ 2011-07-13  8:30       ` Jakub Jelinek
  2011-07-13  9:04         ` Ilya Enkovich
  0 siblings, 1 reply; 47+ messages in thread
From: Jakub Jelinek @ 2011-07-13  8:30 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: William J. Schmidt, gcc-patches

On Wed, Jul 13, 2011 at 11:52:25AM +0400, Ilya Enkovich wrote:
> > However, it does not fix http://gcc.gnu.org/PR45671, which surprises me
> > as it was marked as a duplicate of this one.  Any thoughts on why this
> > isn't sufficient to reassociate the linear chain of adds?
> >
> > Test case:
> >
> > int myfunction (int a, int b, int c, int d, int e, int f, int g, int h)
> > {
> >  int ret;
> >
> >  ret = a + b + c + d + e + f + g + h;
> >  return ret;
> >
> > }
> >
> >
> >
> 
> Reassociation does not work for signed integers because signed integer
> is not wrap-around type in C. You can change it by passing -fwrapv
> option but it will disable other useful optimization. Reassociation of
> signed integers without this option is not a trivial question because
> in that case you may introduce overflows and therefore undefined
> behavior.

Well, if it is clearly a win to reassociate, you can always reassociate
them by doing the arithmetic in the corresponding unsigned type and
afterwards converting back to the signed type.
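
For the earlier example that would mean something like the following
source-level sketch of the idea (the pass would do this internally on
GIMPLE):

int myfunction (int a, int b, int c, int d)
{
  /* Do the additions in unsigned arithmetic, where overflow wraps,
     then convert back.  The final conversion is implementation-defined
     in C but is reduced modulo 2^N by GCC, and it yields the same value
     as the original expression whenever the original did not overflow.  */
  return (int) (((unsigned int) a + (unsigned int) b)
                + ((unsigned int) c + (unsigned int) d));
}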

	Jakub


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-13  8:30       ` Jakub Jelinek
@ 2011-07-13  9:04         ` Ilya Enkovich
  2011-07-13  9:52           ` Jakub Jelinek
  0 siblings, 1 reply; 47+ messages in thread
From: Ilya Enkovich @ 2011-07-13  9:04 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: William J. Schmidt, gcc-patches

>
> Well, if it is clearly a win to reassociate, you can always reassociate
> them by doing arithmetics in corresponding unsigned type and afterwards
> converting back to the signed type.
>
>        Jakub
>

You are right. But in this case we again make all operands have a
wrap-around type and thus disable some other optimizations. It would be
nice to have the opportunity to reassociate and still keep undefined
behavior on overflow available for optimizations. One way to do that
for add/sub is to use a wider type (long long instead of int).
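
A sketch of that idea for a four-operand example (again only a
hand-written illustration):

int myfunction (int a, int b, int c, int d)
{
  /* No sum of four ints can overflow long long, so any association
     order is safe in the wider type; the truncation back to int gives
     the original result whenever the original expression did not
     overflow.  */
  return (int) (((long long) a + b) + ((long long) c + d));
}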

Ilya


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-13  9:04         ` Ilya Enkovich
@ 2011-07-13  9:52           ` Jakub Jelinek
  2011-07-13  9:59             ` Ilya Enkovich
  0 siblings, 1 reply; 47+ messages in thread
From: Jakub Jelinek @ 2011-07-13  9:52 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: William J. Schmidt, gcc-patches

On Wed, Jul 13, 2011 at 01:01:59PM +0400, Ilya Enkovich wrote:
> > Well, if it is clearly a win to reassociate, you can always reassociate
> > them by doing arithmetics in corresponding unsigned type and afterwards
> > converting back to the signed type.
> 
> You are right. But in this case we again make all operands have
> wrap-around type and thus disable some other optimization. It would be
> nice to have opportunity to reassociate and still have undefined
> behavior on overflow for optimizations. One way to do it for add/sub
> is to use wider type (long long instead of int).

I disagree.  Widening would result in worse code in most cases, as you need
to sign-extend all the operands.  On the other hand, I doubt you can
actually usefully use the undefinedness of signed overflow for a series of
3 or more operands of the associative operation.

	Jakub


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-13  9:52           ` Jakub Jelinek
@ 2011-07-13  9:59             ` Ilya Enkovich
  2011-07-13 11:01               ` Richard Guenther
  0 siblings, 1 reply; 47+ messages in thread
From: Ilya Enkovich @ 2011-07-13  9:59 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: William J. Schmidt, gcc-patches

2011/7/13 Jakub Jelinek <jakub@redhat.com>:
>
> I disagree.  Widening would result in worse code in most cases, as you need
> to sign extend all the operands.  On the other side, I doubt you can
> actually usefully use the undefinedness of signed overflow for a series of
> 3 or more operands of the associative operation.
>
>        Jakub
>

Sounds reasonable. Casting to unsigned should be a better solution here.

Ilya


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-13  9:59             ` Ilya Enkovich
@ 2011-07-13 11:01               ` Richard Guenther
  0 siblings, 0 replies; 47+ messages in thread
From: Richard Guenther @ 2011-07-13 11:01 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: Jakub Jelinek, William J. Schmidt, gcc-patches

On Wed, Jul 13, 2011 at 11:18 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> 2011/7/13 Jakub Jelinek <jakub@redhat.com>:
>>
>> I disagree.  Widening would result in worse code in most cases, as you need
>> to sign extend all the operands.  On the other side, I doubt you can
>> actually usefully use the undefinedness of signed overflow for a series of
>> 3 or more operands of the associative operation.
>>
>>        Jakub
>>
>
> Sounds reasonable. Type casting to unsigned should be a better solution here.

Well, the solution of course lies in the no-undefined-overflow branch
where we have separate tree codes for arithmetic with/without
undefined overflow.

Richard.

> Ilya
>


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-12 17:08 ` William J. Schmidt
  2011-07-12 17:22   ` H.J. Lu
  2011-07-12 17:24   ` William J. Schmidt
@ 2011-07-13 13:11   ` William J. Schmidt
  2 siblings, 0 replies; 47+ messages in thread
From: William J. Schmidt @ 2011-07-13 13:11 UTC (permalink / raw)
  To: Илья
	Энкович
  Cc: gcc-patches

On Tue, 2011-07-12 at 11:50 -0500, William J. Schmidt wrote:
> Ilya, thanks for posting this!  This patch is useful also on powerpc64.
> Applying it solved a performance degradation with bwaves due to loss of
> reassociation somewhere between 4.5 and 4.6 (still tracking it down).
> When we apply -ftree-reassoc-width=2 to bwaves, the more optimal code
> generation returns.

On further investigation, this is improving the code generation but not
recovering all of the performance loss.  We'll open a bug on this one
once we have it narrowed down a little further.

Bill
> 
> Bill
> 



* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-12 12:35 [PATCH Atom][PR middle-end/44382] Tree reassociation improvement Илья Энкович
  2011-07-12 17:08 ` William J. Schmidt
@ 2011-07-13 18:05 ` Michael Meissner
  2011-07-13 20:18   ` Ilya Enkovich
  2011-07-14  2:39 ` Michael Meissner
  2 siblings, 1 reply; 47+ messages in thread
From: Michael Meissner @ 2011-07-13 18:05 UTC (permalink / raw)
  To: Илья
	Энкович
  Cc: gcc-patches

I just ran a SPEC 2006 run on powerpc (32-bit) last night with the
reassociation width set to 2.  I do see a win in bwaves, but unfortunately
it is not enough of a win, and it is still a regression relative to GCC 4.5.
However, I see regressions in 3 other benchmarks (I tend to omit differences
of less than 2%):

401.bzip2          97.99%
410.bwaves        113.88%
436.cactusADM      93.96%
444.namd           93.74%

The profile differences are as follows.  Unfortunately, I'm not sure I can
post sample counts under SPEC rules:

Bzip2:

	GCC 4.7		GCC 4.7 with patches	Function
	=======		====================	========
	28.96%		28.39%			mainSort
	15.94%		15.49%			BZ2_decompress
	12.56%		12.35%			mainGtU.part.0
	11.59%		11.54%			generateMTFValues
	8.89%		9.04%			fallbackSort
	6.60%		8.28%			BZ2_compressBlock
	7.48%		7.21%			handle_compress.isra.2
	6.24%		5.95%			BZ2_bzDecompress
	0.55%		0.58%			add_pair_to_block
	0.54%		0.54%			BZ2_hbMakeCodeLengths

Bwaves:

	GCC 4.7		GCC 4.7 with patches	Function
	=======		====================	========
	78.70%		74.73%			mat_times_vec_
	11.68%		13.21%			bi_cgstab_block_
	 6.72%		 8.47%			shell_
	 2.11%		 2.62%			jacobian_
	 0.79%		 0.96%			flux_

CactusADM:

	GCC 4.7		GCC 4.7 with patches	Function
	=======		====================	========
	99.67%		99.69%			bench_staggeredleapfrog2_

Namd:

	GCC 4.7		GCC 4.7 with patches	Function
	=======		====================	========
	15.43%		14.71%			_ZN20ComputeNonbondedUtil26calc_pair_energy_fullelectEP9nonbonded.part.39
	11.94%		11.80%			_ZN20ComputeNonbondedUtil19calc_pair_fullelectEP9nonbonded.part.40
	10.18%		11.52%			_ZN20ComputeNonbondedUtil32calc_pair_energy_merge_fullelectEP9nonbonded.part.37
	9.87%		9.02%			_ZN20ComputeNonbondedUtil16calc_pair_energyEP9nonbonded.part.41
	9.55%		8.85%			_ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded.part.42
	9.52%		9.05%			_ZN20ComputeNonbondedUtil25calc_pair_merge_fullelectEP9nonbonded.part.38
	7.24%		8.72%			_ZN20ComputeNonbondedUtil26calc_self_energy_fullelectEP9nonbonded.part.31
	6.28%		6.42%			_ZN20ComputeNonbondedUtil19calc_self_fullelectEP9nonbonded.part.32
	5.23%		6.18%			_ZN20ComputeNonbondedUtil32calc_self_energy_merge_fullelectEP9nonbonded.part.29
	5.13%		4.66%			_ZN20ComputeNonbondedUtil16calc_self_energyEP9nonbonded.part.33
	4.72%		4.43%			_ZN20ComputeNonbondedUtil25calc_self_merge_fullelectEP9nonbonded.part.30
	4.60%		4.37%			_ZN20ComputeNonbondedUtil9calc_selfEP9nonbonded.part.34

-- 
Michael Meissner, IBM
5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
meissner@linux.vnet.ibm.com	fax +1 (978) 399-6899


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-13 18:05 ` Michael Meissner
@ 2011-07-13 20:18   ` Ilya Enkovich
  0 siblings, 0 replies; 47+ messages in thread
From: Ilya Enkovich @ 2011-07-13 20:18 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches

2011/7/13 Michael Meissner <meissner@linux.vnet.ibm.com>:
> I just ran a spec 2006 run on the powerpc (32-bit) last night setting the
> reassociation to 2.  I do see a win in bwaves, but unfortunately it is not
> enough of a win, and it is still a regression to GCC 4.5.  However, I see some
> regressions in 3 other benchmarks (I tend to omit differences of less than 2%):
>
> 401.bzip2          97.99%
> 410.bwaves        113.88%
> 436.cactusADM      93.96%
> 444.namd           93.74%
>

I tried it on AMD, Core i5, ARM (Cortex A9) and Atom using the SPEC2000
and EEMBC suites. All platforms showed both gains and regressions. The
only platform whose gains outweighed the regressions enough without
additional tuning was Atom. Surely we should avoid tree height
reduction in some cases and may tune this later, but I think it would
be nice to have the feature checked in first, since at least one
platform benefits from it.

Ilya


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-12 12:35 [PATCH Atom][PR middle-end/44382] Tree reassociation improvement Илья Энкович
  2011-07-12 17:08 ` William J. Schmidt
  2011-07-13 18:05 ` Michael Meissner
@ 2011-07-14  2:39 ` Michael Meissner
  2011-07-14  9:40   ` Richard Guenther
  2011-07-14 10:38   ` Ilya Enkovich
  2 siblings, 2 replies; 47+ messages in thread
From: Michael Meissner @ 2011-07-14  2:39 UTC (permalink / raw)
  To: Илья
	Энкович
  Cc: gcc-patches

One minor note, you will need to update doc/invoke.texi to document the new
switch before checkin: -ftree-reassoc-width=<n>

-- 
Michael Meissner, IBM
5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
meissner@linux.vnet.ibm.com	fax +1 (978) 399-6899


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-14  2:39 ` Michael Meissner
@ 2011-07-14  9:40   ` Richard Guenther
  2011-07-14  9:42     ` Richard Guenther
  2011-07-14 10:38   ` Ilya Enkovich
  1 sibling, 1 reply; 47+ messages in thread
From: Richard Guenther @ 2011-07-14  9:40 UTC (permalink / raw)
  To: Michael Meissner,
	Илья
	Энкович,
	gcc-patches

2011/7/14 Michael Meissner <meissner@linux.vnet.ibm.com>:
> One minor note, you will need to update doc/invoke.texi to document the new
> switch before checkin: -ftree-reassoc-width=<n>

You also need approval and somebody to review the patch before checkin.

Richard.

> --
> Michael Meissner, IBM
> 5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
> meissner@linux.vnet.ibm.com     fax +1 (978) 399-6899
>


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-14  9:40   ` Richard Guenther
@ 2011-07-14  9:42     ` Richard Guenther
  2011-07-14 10:04       ` Ilya Enkovich
  2011-07-14 16:04       ` Michael Meissner
  0 siblings, 2 replies; 47+ messages in thread
From: Richard Guenther @ 2011-07-14  9:42 UTC (permalink / raw)
  To: Michael Meissner,
	Илья
	Энкович,
	gcc-patches

On Thu, Jul 14, 2011 at 11:31 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> 2011/7/14 Michael Meissner <meissner@linux.vnet.ibm.com>:
>> One minor note, you will need to update doc/invoke.texi to document the new
>> switch before checkin: -ftree-reassoc-width=<n>
>
> You also need approval and somebody to review the patch before checkin.

Btw, rather than a new target hook and an option I suggest using
a --param whose default you can modify in the backend.
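
Something roughly like the following params.def entry (the exact name
and wording are of course up to you):

DEFPARAM (PARAM_TREE_REASSOC_WIDTH,
	  "tree-reassoc-width",
	  "Set the maximum number of instructions executed in parallel "
	  "in reassociated tree.  If 0, use the target dependent heuristic.",
	  0, 0, 0)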

Richard.

> Richard.
>
>> --
>> Michael Meissner, IBM
>> 5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
>> meissner@linux.vnet.ibm.com     fax +1 (978) 399-6899
>>
>


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-14  9:42     ` Richard Guenther
@ 2011-07-14 10:04       ` Ilya Enkovich
  2011-07-14 10:14         ` Richard Guenther
  2011-07-14 16:04       ` Michael Meissner
  1 sibling, 1 reply; 47+ messages in thread
From: Ilya Enkovich @ 2011-07-14 10:04 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Michael Meissner, gcc-patches

2011/7/14 Richard Guenther <richard.guenther@gmail.com>:
>
> Btw, rather than a new target hook and an option I suggest to use
> a --param which default you can modify in the backend.
>
> Richard.
>

The introduced target hook does not just define a default value for the
option. It is meant to make the decision in each particular case. For
now it returns a constant value, but march-dependent heuristics will be
added later.

Ilya


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-14 10:04       ` Ilya Enkovich
@ 2011-07-14 10:14         ` Richard Guenther
  2011-07-14 11:03           ` Ilya Enkovich
  0 siblings, 1 reply; 47+ messages in thread
From: Richard Guenther @ 2011-07-14 10:14 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: Michael Meissner, gcc-patches

On Thu, Jul 14, 2011 at 11:59 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> 2011/7/14 Richard Guenther <richard.guenther@gmail.com>:
>>
>> Btw, rather than a new target hook and an option I suggest to use
>> a --param which default you can modify in the backend.
>>
>> Richard.
>>
>
> Introduced target hook does not just define default value for option.
> It is meant to make decision in each particular case. For now it
> returns a constant value but march dependent heuristics will be added
> later.

But then how come the option to override it is useful?  It isn't dependent
on the particular case.  At least the option should be a --param.

Richard.

> Ilya
>


* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-14  2:39 ` Michael Meissner
  2011-07-14  9:40   ` Richard Guenther
@ 2011-07-14 10:38   ` Ilya Enkovich
  1 sibling, 0 replies; 47+ messages in thread
From: Ilya Enkovich @ 2011-07-14 10:38 UTC (permalink / raw)
  To: Michael Meissner,
	Илья
	Энкович,
	gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1664 bytes --]

> One minor note, you will need to update doc/invoke.texi to document the new
> switch before checkin: -ftree-reassoc-width=<n>
>
> --
> Michael Meissner, IBM

Thanks for the note! Here is the fixed patch.

Ilya
--
gcc/

2011-07-14  Enkovich Ilya  <ilya.enkovich@intel.com>

	PR middle-end/44382
	* target.def (reassociation_width): New hook.

	* doc/tm.texi.in (reassociation_width): New hook documentation.

	* doc/tm.texi (reassociation_width): Likewise.

	* doc/invoke.texi (ftree-reassoc-width): New option documented.

	* hooks.h (hook_int_const_gimple_1): New default hook.

	* hooks.c (hook_int_const_gimple_1): Likewise.

	* config/i386/i386.h (ix86_tune_indices): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(TARGET_REASSOC_INT_TO_PARALLEL): New.
	(TARGET_REASSOC_FP_TO_PARALLEL): Likewise.

	* config/i386/i386.c (initial_ix86_tune_features): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(ix86_reassociation_width): Implementation of
	new hook for i386 target.

	* common.opt (ftree-reassoc-width): New option added.

	* tree-ssa-reassoc.c (get_required_cycles): New function.
	(get_reassociation_width): Likewise.
	(rewrite_expr_tree_parallel): Likewise.

	(reassociate_bb): Now checks the reassociation width to be used
	and calls rewrite_expr_tree_parallel instead of rewrite_expr_tree
	if needed.

	(pass_reassoc): TODO_remove_unused_locals flag added.

gcc/testsuite/

2011-07-14  Enkovich Ilya  <ilya.enkovich@intel.com>

	* gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
	-ftree-reassoc-width=1.

	* gcc.dg/tree-ssa/reassoc-24.c: New test.
	* gcc.dg/tree-ssa/reassoc-25.c: Likewise.

[-- Attachment #2: PR44382.diff --]
[-- Type: application/octet-stream, Size: 15046 bytes --]

diff --git a/gcc/common.opt b/gcc/common.opt
index f127936..1ba383f 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1993,6 +1993,11 @@ ftree-reassoc
 Common Report Var(flag_tree_reassoc) Init(1) Optimization
 Enable reassociation on tree level
 
+ftree-reassoc-width=
+Common RejectNegative Joined UInteger Report Var(flag_tree_reassoc_width) Init(-1) Optimization
+Set reassociation parallelism level to be used by tree
+reassociation.
+
 ftree-salias
 Common Ignore
 Does nothing.  Preserved for backward compatibility.
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a46101b..9a232cb 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2088,7 +2088,15 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
 
   /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
      the auto-vectorizer.  */
-  m_BDVER1
+  m_BDVER1,
+
+  /* X86_TUNE_REASSOC_INT_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of integer computation.  */
+  m_ATOM,
+
+  /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of fp computation.  */
+  m_ATOM
 };
 
 /* Feature tests against the various architecture variations.  */
@@ -33753,6 +33761,8 @@ ix86_enum_va_list (int idx, const char **pname, tree *ptree)
 #define TARGET_SCHED_DISPATCH has_dispatch
 #undef TARGET_SCHED_DISPATCH_DO
 #define TARGET_SCHED_DISPATCH_DO do_dispatch
+#undef TARGET_SCHED_REASSOCIATION_WIDTH
+#define TARGET_SCHED_REASSOCIATION_WIDTH ix86_reassociation_width
 
 /* The size of the dispatch window is the total number of bytes of
    object code allowed in a window.  */
@@ -34549,6 +34559,22 @@ has_dispatch (rtx insn, int action)
   return false;
 }
 
+static int
+ix86_reassociation_width (const_gimple stmt)
+{
+  tree rhs1 = gimple_assign_rhs1 (stmt);
+  int res = 1;
+
+  if (INTEGRAL_TYPE_P (TREE_TYPE (rhs1))
+      && TARGET_REASSOC_INT_TO_PARALLEL)
+    res = 2;
+  else if (FLOAT_TYPE_P (TREE_TYPE (rhs1))
+	   && TARGET_REASSOC_FP_TO_PARALLEL)
+    res = 2;
+
+  return res;
+}
+
 /* ??? No autovectorization into MMX or 3DNOW until we can reliably
    place emms and femms instructions.  */
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 8cef4e7..359b45a 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -314,6 +314,8 @@ enum ix86_tune_indices {
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
   X86_TUNE_AVX128_OPTIMAL,
+  X86_TUNE_REASSOC_INT_TO_PARALLEL,
+  X86_TUNE_REASSOC_FP_TO_PARALLEL,
 
   X86_TUNE_LAST
 };
@@ -412,6 +414,11 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
 #define TARGET_AVX128_OPTIMAL \
 	ix86_tune_features[X86_TUNE_AVX128_OPTIMAL]
+#define TARGET_REASSOC_INT_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_INT_TO_PARALLEL]
+#define TARGET_REASSOC_FP_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_FP_TO_PARALLEL]
+
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
   X86_ARCH_CMOVE,		/* || TARGET_SSE */
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index f146cc5..ff44ba8 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -404,7 +404,7 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-phiprop -ftree-loop-distribution -ftree-loop-distribute-patterns @gol
 -ftree-loop-ivcanon -ftree-loop-linear -ftree-loop-optimize @gol
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-pta -ftree-reassoc @gol
--ftree-sink -ftree-sra -ftree-switch-conversion @gol
+-ftree-reassoc-width=@var{n} -ftree-sink -ftree-sra -ftree-switch-conversion @gol
 -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
 -funit-at-a-time -funroll-all-loops -funroll-loops @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
@@ -6864,6 +6864,11 @@ and the @option{large-stack-frame-growth} parameter to 400.
 Perform reassociation on trees.  This flag is enabled by default
 at @option{-O} and higher.
 
+@item -ftree-reassoc-width=n
+@opindex ftree-reassoc-width
+Set the maximum number of instructions executed in parallel in
+reassociated tree. By default target dependent heuristics are used.
+
 @item -ftree-pre
 @opindex ftree-pre
 Perform partial redundancy elimination (PRE) on trees.  This flag is
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c0648a5..901e6ce 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6802,6 +6802,11 @@ the order of instructions is important for correctness when scheduling, but
 also the latencies of operations.
 @end deftypevr
 
+@deftypefn {Target Hook} int TARGET_SCHED_REASSOCIATION_WIDTH (const_gimple @var{stmt})
+This hook is called by tree reassociator to determine a level of
+parallelism required in output calculations chain.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3660d36..908d3c1 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -6741,6 +6741,11 @@ in its second parameter.
 
 @hook TARGET_SCHED_EXPOSED_PIPELINE
 
+@hook TARGET_SCHED_REASSOCIATION_WIDTH
+This hook is called by tree reassociator to determine a level of
+parallelism required in output calculations chain.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/hooks.c b/gcc/hooks.c
index 9dfde82..a0cb560 100644
--- a/gcc/hooks.c
+++ b/gcc/hooks.c
@@ -161,6 +161,12 @@ default_can_output_mi_thunk_no_vcall (const_tree a ATTRIBUTE_UNUSED,
 }
 
 int
+hook_int_const_gimple_1 (const_gimple a ATTRIBUTE_UNUSED)
+{
+  return 1;
+}
+
+int
 hook_int_const_tree_0 (const_tree a ATTRIBUTE_UNUSED)
 {
   return 0;
diff --git a/gcc/hooks.h b/gcc/hooks.h
index a1b0112..5cc2d42 100644
--- a/gcc/hooks.h
+++ b/gcc/hooks.h
@@ -67,6 +67,7 @@ extern void hook_void_tree_treeptr (tree, tree *);
 extern void hook_void_int_int (int, int);
 extern void hook_void_gcc_optionsp (struct gcc_options *);
 
+extern int hook_int_const_gimple_1 (const_gimple);
 extern int hook_int_const_tree_0 (const_tree);
 extern int hook_int_const_tree_const_tree_1 (const_tree, const_tree);
 extern int hook_int_rtx_0 (rtx);
diff --git a/gcc/target.def b/gcc/target.def
index c67f0ba..fa13282 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -904,6 +904,15 @@ the order of instructions is important for correctness when scheduling, but\n\
 also the latencies of operations.",
 bool, false)
 
+/* The following member value is a function that returns number
+   of operations reassociator should try to put in parallel for
+   statements of the given type.  By default 1 is used.  */
+DEFHOOK
+(reassociation_width,
+"",
+int, (const_gimple stmt),
+hook_int_const_gimple_1)
+
 HOOK_VECTOR_END (sched)
 
 /* Functions relating to vectorization.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
index e787227..90c1cd4 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
@@ -1,6 +1,6 @@
 /* PR middle-end/38533 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-reassoc1" } */
+/* { dg-options "-O2 -ftree-reassoc-width=1 -fdump-tree-reassoc1" } */
 
 #define A asm volatile ("" : "=r" (f) : "0" (0)); e |= f;
 #define B A A A A A A A A A A A
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
new file mode 100644
index 0000000..bc40a1d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-reassoc-width=2 -fdump-tree-reassoc1" } */
+
+unsigned int
+foo (void)
+{
+  unsigned int a = 0;
+  unsigned int b;
+
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+
+  return a;
+}
+
+/* Verify there are two pairs of __asm__ statements with no
+   intervening stmts.  */
+/* { dg-final { scan-tree-dump-times "__asm__\[^;\n]*;\n *__asm__" 2 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
new file mode 100644
index 0000000..d308937
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-reassoc-width=3 -fdump-tree-reassoc1-details" } */
+
+unsigned int
+foo (int a, int b, int c, int d)
+{
+  unsigned int s = 0;
+
+  s += a;
+  s += b;
+  s += c;
+  s += d;
+
+  return s;
+}
+
+/* Verify reassociation width was chosen to be 2.  */
+/* { dg-final { scan-tree-dump-times "Width = 2" 1 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index d8f9e2e..0bf48cb 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "pointer-set.h"
 #include "cfgloop.h"
 #include "flags.h"
+#include "target.h"
 
 /*  This is a simple global reassociation pass.  It is, in part, based
     on the LLVM pass of the same name (They do some things more/less
@@ -1474,6 +1475,167 @@ remove_visited_stmt_chain (tree var)
     }
 }
 
+/* Find out how many cycles we need to compute whole statements
+   chain holding ops_num operands if may execute up to cpu_width
+   statements per cycle.  */
+
+static int
+get_required_cycles (int ops_num, int cpu_width)
+{
+  int res;
+  int elog;
+  unsigned int rest;
+
+  /* While we have more than 2 * cpu_width operands
+     we may reduce number of operands by cpu_width
+     per cycle.  */
+  res = ops_num / (2 * cpu_width);
+
+  /* Remained operands count may be reduced twice per cycle
+     until we have only one operand.  */
+  rest = (unsigned)(ops_num - res * cpu_width);
+  elog = exact_log2 (rest);
+  if (elog >= 0)
+    res += elog;
+  else
+    res += floor_log2 (rest) + 1;
+
+  return res;
+}
+
+/* Returns an optimal number of registers to use for computation of
+   given statements.  */
+
+static int
+get_reassociation_width (VEC(operand_entry_t, heap) * ops, gimple stmt)
+{
+  int ops_num = VEC_length (operand_entry_t, ops);
+  int width;
+  int cycles_best;
+
+  if (flag_tree_reassoc_width > 0)
+    width = flag_tree_reassoc_width;
+  else
+    width = targetm.sched.reassociation_width (stmt);
+
+  if (width == 1)
+    return width;
+
+  /* Get the minimal time required for sequence computation.  */
+  cycles_best = get_required_cycles (ops_num, width);
+
+  /* Check if we may use less width and still compute sequence for
+     the same time.  It will allow us to reduce registers usage.  */
+  while (width > 1 &&
+	 get_required_cycles (ops_num, width - 1) == cycles_best)
+    width--;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file,
+	       "Width = %d was chosen for reassociation\n", width);
+    }
+
+  return width;
+}
+
+/* Recursively rewrite our linearized statements so that the operators
+   match those in OPS[OPINDEX], putting the computation in rank
+   order and trying to allow operations to be executed in
+   parallel.  */
+
+static void
+rewrite_expr_tree_parallel (gimple stmt, int width,
+			    VEC(operand_entry_t, heap) * ops)
+{
+  enum tree_code opcode = gimple_assign_rhs_code (stmt);
+  int op_num = VEC_length (operand_entry_t, ops);
+  int stmt_num = op_num - 1;
+  gimple *stmts = XNEWVEC (gimple, stmt_num);
+  int op_index = op_num - 1;
+  int stmt_index = 0;
+  int ready_stmts_end = 0;
+  int i = 0;
+  tree last_rhs1 = gimple_assign_rhs1 (stmt);
+
+  /* We start expression rewriting from the top statements.
+     So, in this loop we create a full list of statements
+     we will work with.  */
+  stmts[stmt_num - 1] = stmt;
+  for (i = stmt_num - 2; i >= 0; i--)
+    stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
+
+  for (i = 0; i < stmt_num; i++)
+    {
+      tree op1, op2;
+
+      /* Determine whether we should use results of
+	 already handled statements or not.  */
+      if (ready_stmts_end == 0 &&
+	  (i - stmt_index >= width || op_index < 1))
+	ready_stmts_end = i;
+
+      if (ready_stmts_end > 0)
+	{
+	  op1 = gimple_assign_lhs (stmts[stmt_index++]);
+	  if (ready_stmts_end > stmt_index)
+	    op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	  else if (op_index >= 0)
+	    op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  else
+	    {
+	      gcc_assert (stmt_index < i);
+	      op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	    }
+
+	  if (stmt_index >= ready_stmts_end)
+	    ready_stmts_end = 0;
+	}
+      else
+	{
+	  op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  op1 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	}
+
+      /* If we emit the last statement then we should put
+         operands into the last statement.  It will also
+         break the loop.  */
+      if (op_index < 0 && stmt_index == i)
+	i = stmt_num - 1;
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Transforming ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+
+      /* We keep original statement only for the last one.  All
+	 others are recreated.  */
+      if (i == stmt_num - 1)
+	{
+	  gimple_assign_set_rhs1 (stmts[i], op1);
+	  gimple_assign_set_rhs2 (stmts[i], op2);
+	  update_stmt (stmts[i]);
+	}
+      else
+	{
+	  tree var = create_tmp_reg (TREE_TYPE (last_rhs1), "reassoc");
+	  add_referenced_var (var);
+	  stmts[i] = build_and_add_sum (var, op1, op2, opcode);
+	}
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+    }
+
+  remove_visited_stmt_chain (last_rhs1);
+
+  free (stmts);
+}
+
 /* Recursively rewrite our linearized statements so that the operators
    match those in OPS[OPINDEX], putting the computation in rank
    order.  */
@@ -2139,7 +2301,15 @@ reassociate_bb (basic_block bb)
 		    }
 		}
 	      else
-		rewrite_expr_tree (stmt, 0, ops, false);
+		{
+		  int width = get_reassociation_width (ops, stmt);
+
+		  if (width > 1 &&
+		      VEC_length (operand_entry_t, ops) > 3)
+		    rewrite_expr_tree_parallel (stmt, width, ops);
+		  else
+		    rewrite_expr_tree (stmt, 0, ops, false);
+		}
 
 	      VEC_free (operand_entry_t, heap, ops);
 	    }
@@ -2297,7 +2467,8 @@ struct gimple_opt_pass pass_reassoc =
   0,					/* properties_provided */
   0,					/* properties_destroyed */
   0,					/* todo_flags_start */
-  TODO_verify_ssa
+  TODO_remove_unused_locals
+    | TODO_verify_ssa
     | TODO_verify_flow
     | TODO_ggc_collect			/* todo_flags_finish */
  }
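As a source-level analogy of what rewrite_expr_tree_parallel above is
meant to achieve (hand-written C with made-up names, not actual pass
output), a six-operand chain rewritten for width 2 trades the tall
linear chain for independent partial sums:

unsigned int
sum6_linear (unsigned int a, unsigned int b, unsigned int c,
             unsigned int d, unsigned int e, unsigned int f)
{
  /* Height-5 chain: every addition depends on the previous one.  */
  return ((((a + b) + c) + d) + e) + f;
}

unsigned int
sum6_parallel (unsigned int a, unsigned int b, unsigned int c,
               unsigned int d, unsigned int e, unsigned int f)
{
  /* Two independent partial sums can issue in parallel on a 2-wide
     in-order machine; the critical path drops from 5 adds to 3.  */
  unsigned int t1 = a + b;
  unsigned int t2 = c + d;
  unsigned int t3 = t1 + t2;
  unsigned int t4 = e + f;
  return t3 + t4;
}

For unsigned arithmetic the reassociation is always valid; for floating
point it is only done when the usual unsafe-math flags permit it.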

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-14 10:14         ` Richard Guenther
@ 2011-07-14 11:03           ` Ilya Enkovich
  2011-07-14 15:31             ` Ilya Enkovich
  0 siblings, 1 reply; 47+ messages in thread
From: Ilya Enkovich @ 2011-07-14 11:03 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Michael Meissner, gcc-patches

2011/7/14 Richard Guenther <richard.guenther@gmail.com>:
>
> But then how comes the option to override it is useful?  It isn't dependent
> on the particular case.  At least the option should be a --param.
>
> Richard.
>

The option is still useful if you want to try the feature on a platform
with no hook implemented and for other performance experiments. I agree
that a --param would be better here. I'll fix it.

Ilya

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-14 11:03           ` Ilya Enkovich
@ 2011-07-14 15:31             ` Ilya Enkovich
  2011-07-19 12:10               ` Richard Guenther
  2011-07-21 20:14               ` Joseph S. Myers
  0 siblings, 2 replies; 47+ messages in thread
From: Ilya Enkovich @ 2011-07-14 15:31 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Michael Meissner, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2070 bytes --]

> 2011/7/14 Richard Guenther <richard.guenther@gmail.com>:
>>
>> But then how comes the option to override it is useful?  It isn't dependent
>> on the particular case.  At least the option should be a --param.
>>
>> Richard.
>>
>
> The option is still useful if you want to try the feature on a platform
> with no hook implemented and for other performance experiments. I agree
> that a --param would be better here. I'll fix it.
>
> Ilya
>

Here is a patch with a new param instead of the new option. Bootstrapped
and checked on x86_64-linux.

Ilya
--
gcc/

2011-07-14  Enkovich Ilya  <ilya.enkovich@intel.com>

	PR middle-end/44382
	* target.def (reassociation_width): New hook.

	* doc/tm.texi.in (reassociation_width): New hook documentation.

	* doc/tm.texi (reassociation_width): Likewise.

	* doc/invoke.texi (tree-reassoc-width): New param documented.

	* hooks.h (hook_int_const_gimple_1): New default hook.

	* hooks.c (hook_int_const_gimple_1): Likewise.

	* config/i386/i386.h (ix86_tune_indices): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(TARGET_REASSOC_INT_TO_PARALLEL): New.
	(TARGET_REASSOC_FP_TO_PARALLEL): Likewise.

	* config/i386/i386.c (initial_ix86_tune_features): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(ix86_reassociation_width) implementation of
	new hook for i386 target.

	* params.def (PARAM_TREE_REASSOC_WIDTH): New param added.

	* tree-ssa-reassoc.c (get_required_cycles): New function.
	(get_reassociation_width): Likewise.
	(rewrite_expr_tree_parallel): Likewise.

	(reassociate_bb): Now checks reassociation width to be used
	and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
	if needed.

	(pass_reassoc): TODO_remove_unused_locals flag added.

gcc/testsuite/

2011-07-14  Enkovich Ilya  <ilya.enkovich@intel.com>

	* gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
	--param tree-reassoc-width=1.

	* gcc.dg/tree-ssa/reassoc-24.c: New test.
	* gcc.dg/tree-ssa/reassoc-25.c: Likewise.

[-- Attachment #2: PR44382.diff --]
[-- Type: application/octet-stream, Size: 14494 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a46101b..9a232cb 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2088,7 +2088,15 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
 
   /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
      the auto-vectorizer.  */
-  m_BDVER1
+  m_BDVER1,
+
+  /* X86_TUNE_REASSOC_INT_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of integer computation.  */
+  m_ATOM,
+
+  /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of fp computation.  */
+  m_ATOM
 };
 
 /* Feature tests against the various architecture variations.  */
@@ -33753,6 +33761,8 @@ ix86_enum_va_list (int idx, const char **pname, tree *ptree)
 #define TARGET_SCHED_DISPATCH has_dispatch
 #undef TARGET_SCHED_DISPATCH_DO
 #define TARGET_SCHED_DISPATCH_DO do_dispatch
+#undef TARGET_SCHED_REASSOCIATION_WIDTH
+#define TARGET_SCHED_REASSOCIATION_WIDTH ix86_reassociation_width
 
 /* The size of the dispatch window is the total number of bytes of
    object code allowed in a window.  */
@@ -34549,6 +34559,22 @@ has_dispatch (rtx insn, int action)
   return false;
 }
 
+static int
+ix86_reassociation_width (const_gimple stmt)
+{
+  tree rhs1 = gimple_assign_rhs1 (stmt);
+  int res = 1;
+
+  if (INTEGRAL_TYPE_P (TREE_TYPE (rhs1))
+      && TARGET_REASSOC_INT_TO_PARALLEL)
+    res = 2;
+  else if (FLOAT_TYPE_P (TREE_TYPE (rhs1))
+	   && TARGET_REASSOC_FP_TO_PARALLEL)
+    res = 2;
+
+  return res;
+}
+
 /* ??? No autovectorization into MMX or 3DNOW until we can reliably
    place emms and femms instructions.  */
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 8cef4e7..359b45a 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -314,6 +314,8 @@ enum ix86_tune_indices {
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
   X86_TUNE_AVX128_OPTIMAL,
+  X86_TUNE_REASSOC_INT_TO_PARALLEL,
+  X86_TUNE_REASSOC_FP_TO_PARALLEL,
 
   X86_TUNE_LAST
 };
@@ -412,6 +414,11 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
 #define TARGET_AVX128_OPTIMAL \
 	ix86_tune_features[X86_TUNE_AVX128_OPTIMAL]
+#define TARGET_REASSOC_INT_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_INT_TO_PARALLEL]
+#define TARGET_REASSOC_FP_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_FP_TO_PARALLEL]
+
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
   X86_ARCH_CMOVE,		/* || TARGET_SSE */
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index f146cc5..be6b126 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -9031,6 +9031,11 @@ The smallest number of different values for which it is best to use a
 jump-table instead of a tree of conditional branches.  If the value is
 0, use the default for the machine.  The default is 0.
 
+@item tree-reassoc-width
+Set the maximum number of instructions executed in parallel in
+a reassociated tree. This parameter overrides the target-dependent
+heuristics used by default if it has a non-zero value.
+
 @end table
 @end table
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c0648a5..901e6ce 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6802,6 +6802,11 @@ the order of instructions is important for correctness when scheduling, but
 also the latencies of operations.
 @end deftypevr
 
+@deftypefn {Target Hook} int TARGET_SCHED_REASSOCIATION_WIDTH (const_gimple @var{stmt})
+This hook is called by tree reassociator to determine a level of
+parallelism required in output calculations chain.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3660d36..908d3c1 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -6741,6 +6741,11 @@ in its second parameter.
 
 @hook TARGET_SCHED_EXPOSED_PIPELINE
 
+@hook TARGET_SCHED_REASSOCIATION_WIDTH
+This hook is called by tree reassociator to determine a level of
+parallelism required in output calculations chain.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/hooks.c b/gcc/hooks.c
index 9dfde82..a0cb560 100644
--- a/gcc/hooks.c
+++ b/gcc/hooks.c
@@ -161,6 +161,12 @@ default_can_output_mi_thunk_no_vcall (const_tree a ATTRIBUTE_UNUSED,
 }
 
 int
+hook_int_const_gimple_1 (const_gimple a ATTRIBUTE_UNUSED)
+{
+  return 1;
+}
+
+int
 hook_int_const_tree_0 (const_tree a ATTRIBUTE_UNUSED)
 {
   return 0;
diff --git a/gcc/hooks.h b/gcc/hooks.h
index a1b0112..5cc2d42 100644
--- a/gcc/hooks.h
+++ b/gcc/hooks.h
@@ -67,6 +67,7 @@ extern void hook_void_tree_treeptr (tree, tree *);
 extern void hook_void_int_int (int, int);
 extern void hook_void_gcc_optionsp (struct gcc_options *);
 
+extern int hook_int_const_gimple_1 (const_gimple);
 extern int hook_int_const_tree_0 (const_tree);
 extern int hook_int_const_tree_const_tree_1 (const_tree, const_tree);
 extern int hook_int_rtx_0 (rtx);
diff --git a/gcc/params.def b/gcc/params.def
index 78601f6..be25619 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -902,6 +902,12 @@ DEFPARAM (PARAM_CASE_VALUES_THRESHOLD,
 	  "if 0, use the default for the machine",
           0, 0, 0)
 
+/* Reassociation width to be used by tree reassoc optimization.  */
+DEFPARAM (PARAM_TREE_REASSOC_WIDTH,
+	  "tree-reassoc-width",
+	  "Set the maximum number of instructions executed in parallel in "
+	  "reassociated tree. If 0, use the target dependent heuristic.",
+	  0, 0, 0)
 
 /*
 Local variables:
diff --git a/gcc/target.def b/gcc/target.def
index c67f0ba..fa13282 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -904,6 +904,15 @@ the order of instructions is important for correctness when scheduling, but\n\
 also the latencies of operations.",
 bool, false)
 
+/* The following member value is a function that returns number
+   of operations reassociator should try to put in parallel for
+   statements of the given type.  By default 1 is used.  */
+DEFHOOK
+(reassociation_width,
+"",
+int, (const_gimple stmt),
+hook_int_const_gimple_1)
+
 HOOK_VECTOR_END (sched)
 
 /* Functions relating to vectorization.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
index e787227..a80a5a8 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
@@ -1,6 +1,6 @@
 /* PR middle-end/38533 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-reassoc1" } */
+/* { dg-options "-O2 --param tree-reassoc-width=1 -fdump-tree-reassoc1" } */
 
 #define A asm volatile ("" : "=r" (f) : "0" (0)); e |= f;
 #define B A A A A A A A A A A A
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
new file mode 100644
index 0000000..c871628
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param tree-reassoc-width=2 -fdump-tree-reassoc1" } */
+
+unsigned int
+foo (void)
+{
+  unsigned int a = 0;
+  unsigned int b;
+
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+
+  return a;
+}
+
+/* Verify there are two pairs of __asm__ statements with no
+   intervening stmts.  */
+/* { dg-final { scan-tree-dump-times "__asm__\[^;\n]*;\n *__asm__" 2 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
new file mode 100644
index 0000000..4ff66ef
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param tree-reassoc-width=3 -fdump-tree-reassoc1-details" } */
+
+unsigned int
+foo (int a, int b, int c, int d)
+{
+  unsigned int s = 0;
+
+  s += a;
+  s += b;
+  s += c;
+  s += d;
+
+  return s;
+}
+
+/* Verify reassociation width was chosen to be 2.  */
+/* { dg-final { scan-tree-dump-times "Width = 2" 1 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index d8f9e2e..c448ed4 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -40,6 +40,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "pointer-set.h"
 #include "cfgloop.h"
 #include "flags.h"
+#include "target.h"
+#include "params.h"
 
 /*  This is a simple global reassociation pass.  It is, in part, based
     on the LLVM pass of the same name (They do some things more/less
@@ -1474,6 +1476,168 @@ remove_visited_stmt_chain (tree var)
     }
 }
 
+/* Find out how many cycles we need to compute whole statements
+   chain holding ops_num operands if may execute up to cpu_width
+   statements per cycle.  */
+
+static int
+get_required_cycles (int ops_num, int cpu_width)
+{
+  int res;
+  int elog;
+  unsigned int rest;
+
+  /* While we have more than 2 * cpu_width operands
+     we may reduce number of operands by cpu_width
+     per cycle.  */
+  res = ops_num / (2 * cpu_width);
+
+  /* Remained operands count may be reduced twice per cycle
+     until we have only one operand.  */
+  rest = (unsigned)(ops_num - res * cpu_width);
+  elog = exact_log2 (rest);
+  if (elog >= 0)
+    res += elog;
+  else
+    res += floor_log2 (rest) + 1;
+
+  return res;
+}
+
+/* Returns an optimal number of registers to use for computation of
+   given statements.  */
+
+static int
+get_reassociation_width (VEC(operand_entry_t, heap) * ops, gimple stmt)
+{
+  int param_width = PARAM_VALUE (PARAM_TREE_REASSOC_WIDTH);
+  int ops_num = VEC_length (operand_entry_t, ops);
+  int width;
+  int cycles_best;
+
+  if (param_width > 0)
+    width = param_width;
+  else
+    width = targetm.sched.reassociation_width (stmt);
+
+  if (width == 1)
+    return width;
+
+  /* Get the minimal time required for sequence computation.  */
+  cycles_best = get_required_cycles (ops_num, width);
+
+  /* Check if we may use less width and still compute sequence for
+     the same time.  It will allow us to reduce registers usage.  */
+  while (width > 1 &&
+	 get_required_cycles (ops_num, width - 1) == cycles_best)
+    width--;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file,
+	       "Width = %d was chosen for reassociation\n", width);
+    }
+
+  return width;
+}
+
+/* Recursively rewrite our linearized statements so that the operators
+   match those in OPS[OPINDEX], putting the computation in rank
+   order and trying to allow operations to be executed in
+   parallel.  */
+
+static void
+rewrite_expr_tree_parallel (gimple stmt, int width,
+			    VEC(operand_entry_t, heap) * ops)
+{
+  enum tree_code opcode = gimple_assign_rhs_code (stmt);
+  int op_num = VEC_length (operand_entry_t, ops);
+  int stmt_num = op_num - 1;
+  gimple *stmts = XNEWVEC (gimple, stmt_num);
+  int op_index = op_num - 1;
+  int stmt_index = 0;
+  int ready_stmts_end = 0;
+  int i = 0;
+  tree last_rhs1 = gimple_assign_rhs1 (stmt);
+
+  /* We start expression rewriting from the top statements.
+     So, in this loop we create a full list of statements
+     we will work with.  */
+  stmts[stmt_num - 1] = stmt;
+  for (i = stmt_num - 2; i >= 0; i--)
+    stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
+
+  for (i = 0; i < stmt_num; i++)
+    {
+      tree op1, op2;
+
+      /* Determine whether we should use results of
+	 already handled statements or not.  */
+      if (ready_stmts_end == 0 &&
+	  (i - stmt_index >= width || op_index < 1))
+	ready_stmts_end = i;
+
+      if (ready_stmts_end > 0)
+	{
+	  op1 = gimple_assign_lhs (stmts[stmt_index++]);
+	  if (ready_stmts_end > stmt_index)
+	    op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	  else if (op_index >= 0)
+	    op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  else
+	    {
+	      gcc_assert (stmt_index < i);
+	      op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	    }
+
+	  if (stmt_index >= ready_stmts_end)
+	    ready_stmts_end = 0;
+	}
+      else
+	{
+	  op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  op1 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	}
+
+      /* If we emit the last statement then we should put
+         operands into the last statement.  It will also
+         break the loop.  */
+      if (op_index < 0 && stmt_index == i)
+	i = stmt_num - 1;
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Transforming ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+
+      /* We keep original statement only for the last one.  All
+	 others are recreated.  */
+      if (i == stmt_num - 1)
+	{
+	  gimple_assign_set_rhs1 (stmts[i], op1);
+	  gimple_assign_set_rhs2 (stmts[i], op2);
+	  update_stmt (stmts[i]);
+	}
+      else
+	{
+	  tree var = create_tmp_reg (TREE_TYPE (last_rhs1), "reassoc");
+	  add_referenced_var (var);
+	  stmts[i] = build_and_add_sum (var, op1, op2, opcode);
+	}
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+    }
+
+  remove_visited_stmt_chain (last_rhs1);
+
+  free (stmts);
+}
+
 /* Recursively rewrite our linearized statements so that the operators
    match those in OPS[OPINDEX], putting the computation in rank
    order.  */
@@ -2139,7 +2303,15 @@ reassociate_bb (basic_block bb)
 		    }
 		}
 	      else
-		rewrite_expr_tree (stmt, 0, ops, false);
+		{
+		  int width = get_reassociation_width (ops, stmt);
+
+		  if (width > 1 &&
+		      VEC_length (operand_entry_t, ops) > 3)
+		    rewrite_expr_tree_parallel (stmt, width, ops);
+		  else
+		    rewrite_expr_tree (stmt, 0, ops, false);
+		}
 
 	      VEC_free (operand_entry_t, heap, ops);
 	    }
@@ -2297,7 +2469,8 @@ struct gimple_opt_pass pass_reassoc =
   0,					/* properties_provided */
   0,					/* properties_destroyed */
   0,					/* todo_flags_start */
-  TODO_verify_ssa
+  TODO_remove_unused_locals
+    | TODO_verify_ssa
     | TODO_verify_flow
     | TODO_ggc_collect			/* todo_flags_finish */
  }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-14  9:42     ` Richard Guenther
  2011-07-14 10:04       ` Ilya Enkovich
@ 2011-07-14 16:04       ` Michael Meissner
  2011-07-14 16:08         ` Ilya Enkovich
  1 sibling, 1 reply; 47+ messages in thread
From: Michael Meissner @ 2011-07-14 16:04 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Michael Meissner,
	Илья
	Энкович,
	gcc-patches

On Thu, Jul 14, 2011 at 11:32:59AM +0200, Richard Guenther wrote:
> On Thu, Jul 14, 2011 at 11:31 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
> > 2011/7/14 Michael Meissner <meissner@linux.vnet.ibm.com>:
> >> One minor note, you will need to update doc/invoke.texi to document the new
> >> switch before checkin: -ftree-reassoc-width=<n>
> >
> > You also need approval and somebody to review the patch before checkin.
> 
> Btw, rather than a new target hook and an option I suggest to use
> a --param which default you can modify in the backend.

You need the target hook that tells how wide the reassociation should be
based on the type.  Machines have different numbers of functional units,
for example maybe 3 integer units and 2 floating point units.  In the
PowerPC, for example, I would set the basic integer and binary floating
point types to 2 to match the dispatch rules, but for decimal floating
point I would set it to 1, since the machines currently have only 1
decimal unit.

With the hook, I could see eliminating the switch and/or --param altogether,
and doing only in the hook.
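Just to sketch that idea (a hypothetical example using the const_gimple
signature of the current patch, not a tested rs6000 change), such a hook
could look something like this:

static int
rs6000_reassociation_width (const_gimple stmt)
{
  tree type = TREE_TYPE (gimple_assign_rhs1 (stmt));

  /* Hypothetical widths only, following the dispatch reasoning above:
     a single decimal FP unit, two-way dispatch for integer and binary
     floating point.  */
  if (DECIMAL_FLOAT_TYPE_P (type))
    return 1;
  if (INTEGRAL_TYPE_P (type) || FLOAT_TYPE_P (type))
    return 2;
  return 1;
}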

-- 
Michael Meissner, IBM
5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
meissner@linux.vnet.ibm.com	fax +1 (978) 399-6899

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-14 16:04       ` Michael Meissner
@ 2011-07-14 16:08         ` Ilya Enkovich
  0 siblings, 0 replies; 47+ messages in thread
From: Ilya Enkovich @ 2011-07-14 16:08 UTC (permalink / raw)
  To: Michael Meissner, Richard Guenther,
	Илья
	Энкович,
	gcc-patches

>
> You need the target hook that tells how wide the reassociation should be
> based on the type.  Machines have different numbers of functional units,
> for example maybe 3 integer units and 2 floating point units.  In the
> PowerPC, for example, I would set the basic integer and binary floating
> point types to 2 to match the dispatch rules, but for decimal floating
> point I would set it to 1, since the machines currently have only 1
> decimal unit.
>
> With the hook, I could see eliminating the switch and/or --param altogether,
> and doing only in the hook.
>

The hook introduced by this patch meets these requirements. But I think
it is useful to have an option to override the hook because you cannot
tune it perfectly for each case.

Ilya

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-14 15:31             ` Ilya Enkovich
@ 2011-07-19 12:10               ` Richard Guenther
  2011-07-19 15:25                 ` Ilya Enkovich
  2011-07-26  9:54                 ` Ilya Enkovich
  2011-07-21 20:14               ` Joseph S. Myers
  1 sibling, 2 replies; 47+ messages in thread
From: Richard Guenther @ 2011-07-19 12:10 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: Michael Meissner, gcc-patches

On Thu, Jul 14, 2011 at 5:00 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>> 2011/7/14 Richard Guenther <richard.guenther@gmail.com>:
>>>
>>> But then how comes the option to override it is useful?  It isn't dependent
>>> on the particular case.  At least the option should be a --param.
>>>
>>> Richard.
>>>
>>
>> The option is still useful if you want to try the feature on a platform
>> with no hook implemented and for other performance experiments. I agree
>> that a --param would be better here. I'll fix it.
>>
>> Ilya
>>
>
> Here is a patch with a new param instead of the new option. Bootstrapped
> and checked on x86_64-linux.

 }

+static int
+ix86_reassociation_width (const_gimple stmt)

all functions need comments describing what they do and what parameters
they get.

+@deftypefn {Target Hook} int TARGET_SCHED_REASSOCIATION_WIDTH
(const_gimple @var{stmt})
+This hook is called by tree reassociator to determine a level of
+parallelism required in output calculations chain.

I don't think we should get an actual statement here, but an
enum tree_code and a machine mode.

+/* Find out how many cycles we need to compute whole statements
+   chain holding ops_num operands if may execute up to cpu_width
+   statements per cycle.  */

I have a hard time parsing this sentence; maybe a native English
speaker can help here.

+static int
+get_reassociation_width (VEC(operand_entry_t, heap) * ops, gimple stmt)
+{
+  int param_width = PARAM_VALUE (PARAM_TREE_REASSOC_WIDTH);
+  int ops_num = VEC_length (operand_entry_t, ops);
+  int width;
+  int cycles_best;
+
+  if (param_width > 0)
+    width = param_width;
+  else
+    width = targetm.sched.reassociation_width (stmt);

this is the only place you need stmt, with tree-code and mode args
you'd not need it (and it's odd anyway).

+  while (width > 1 &&
+        get_required_cycles (ops_num, width - 1) == cycles_best)

&&s go on the next line, similar cases in your patch as well

+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file,
+              "Width = %d was chosen for reassociation\n", width);
+    }

no {}s around single-stmts

+static void
+rewrite_expr_tree_parallel (gimple stmt, int width,
+                           VEC(operand_entry_t, heap) * ops)
+{

it's not easy to follow the flow of this function, esp. I wonder

+      else
+       {
+         tree var = create_tmp_reg (TREE_TYPE (last_rhs1), "reassoc");
+         add_referenced_var (var);
+         stmts[i] = build_and_add_sum (var, op1, op2, opcode);
+       }

what you need to create new stmts for, and if you possibly create
multiple temporary variables here.

You don't seem to handle the special-cases of rewrite_expr_tree
for the leafs of your tree, especially the PHI node special-casing.
I think you may run into vectorization issues here.
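The case in question (hand-written here just for illustration) is a
reduction whose running sum flows through a loop PHI; keeping the PHI
result as an operand of the last statement in the chain is what lets the
vectorizer still recognize the sum reduction:

unsigned int
reduction (const unsigned int *a, const unsigned int *b, int n)
{
  unsigned int sum = 0;   /* carried by a loop PHI on each iteration */
  int i;

  for (i = 0; i < n; i++)
    sum = sum + a[i] + b[i];
  return sum;
}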

-  TODO_verify_ssa
+  TODO_remove_unused_locals
+    | TODO_verify_ssa

why?

I think the patch looks reasonable overall; please update it with the
above comments and re-post it.

Thanks,
Richard.

> Ilya
> --
> gcc/
>
> 2011-07-14  Enkovich Ilya  <ilya.enkovich@intel.com>
>
>        PR middle-end/44382
>        * target.def (reassociation_width): New hook.
>
>        * doc/tm.texi.in (reassociation_width): New hook documentation.
>
>        * doc/tm.texi (reassociation_width): Likewise.
>
>        * doc/invoke.texi (tree-reassoc-width): New param documented.
>
>        * hooks.h (hook_int_const_gimple_1): New default hook.
>
>        * hooks.c (hook_int_const_gimple_1): Likewise.
>
>        * config/i386/i386.h (ix86_tune_indices): Add
>        X86_TUNE_REASSOC_INT_TO_PARALLEL and
>        X86_TUNE_REASSOC_FP_TO_PARALLEL.
>
>        (TARGET_REASSOC_INT_TO_PARALLEL): New.
>        (TARGET_REASSOC_FP_TO_PARALLEL): Likewise.
>
>        * config/i386/i386.c (initial_ix86_tune_features): Add
>        X86_TUNE_REASSOC_INT_TO_PARALLEL and
>        X86_TUNE_REASSOC_FP_TO_PARALLEL.
>
>        (ix86_reassociation_width) implementation of
>        new hook for i386 target.
>
>        * params.def (PARAM_TREE_REASSOC_WIDTH): New param added.
>
>        * tree-ssa-reassoc.c (get_required_cycles): New function.
>        (get_reassociation_width): Likewise.
>        (rewrite_expr_tree_parallel): Likewise.
>
>        (reassociate_bb): Now checks reassociation width to be used
>        and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
>        if needed.
>
>        (pass_reassoc): TODO_remove_unused_locals flag added.
>
> gcc/testsuite/
>
> 2011-07-14  Enkovich Ilya  <ilya.enkovich@intel.com>
>
>        * gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
>        --param tree-reassoc-width=1.
>
>        * gcc.dg/tree-ssa/reassoc-24.c: New test.
>        * gcc.dg/tree-ssa/reassoc-25.c: Likewise.
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-19 12:10               ` Richard Guenther
@ 2011-07-19 15:25                 ` Ilya Enkovich
  2011-08-05 11:46                   ` Richard Guenther
  2011-07-26  9:54                 ` Ilya Enkovich
  1 sibling, 1 reply; 47+ messages in thread
From: Ilya Enkovich @ 2011-07-19 15:25 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Michael Meissner, gcc-patches

Hello Richard,

Thanks a lot for the review!

> it's not easy to follow the flow of this function, esp. I wonder
>
> +      else
> +       {
> +         tree var = create_tmp_reg (TREE_TYPE (last_rhs1), "reassoc");
> +         add_referenced_var (var);
> +         stmts[i] = build_and_add_sum (var, op1, op2, opcode);
> +       }
>
> what you need to create new stmts for, and if you possibly create
> multiple temporary variables here.
>
> -  TODO_verify_ssa
> +  TODO_remove_unused_locals
> +    | TODO_verify_ssa
>
> why?
>

The current rewrite_expr_tree uses the original statements and just
changes the order in which the operands (tree leaves) are used. To keep
the data flow correct it moves all statements down to the last one. I
tried that approach but it did not work well because of increased
register pressure in some cases (all operands are live at the same time
right before the first statement of our sequence). Then I tried to move
the existing statements to the appropriate places (just after the last
operand definition), but that turned out not to be so easy. It is easy
to guarantee consistency in the final result but not easy to keep the
data flow consistent after each statement move, and if we do not keep
the code consistent after each move then we may generate wrong debug
statements. After a few tries I decided that the easiest way, both to
understand and to implement, is to recreate the statements and remove
the old chain. So I used build_and_add_sum to create all statements
except the last one, which never needs recreation. I added
TODO_remove_unused_locals since the temporaries used in the original
statements should be removed. Is this approach OK?
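To picture the register-pressure point, here is a hand-written sketch
(with a hypothetical in() standing for whatever defines each operand;
this is not actual pass output). Recreating each sum right after its
operands become available, as build_and_add_sum does, consumes the
operands as the chain is built instead of keeping all of them live until
a block of adds sunk to the end:

extern unsigned int in (int i);   /* hypothetical operand producer */

unsigned int
recreated_shape (void)
{
  unsigned int a = in (0), b = in (1);
  unsigned int t1 = a + b;        /* a and b are dead from here on */
  unsigned int c = in (2), d = in (3);
  unsigned int t2 = c + d;        /* c and d are dead from here on */
  unsigned int e = in (4), f = in (5);
  unsigned int t3 = e + f;        /* e and f are dead from here on */
  return (t1 + t2) + t3;
}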

> You don't seem to handle the special-cases of rewrite_expr_tree
> for the leafs of your tree, especially the PHI node special-casing.
> I think you may run into vectorization issues here.
>
I do not see any reason why these cases are checked in
rewrite_expr_tree. It is an optimization of operand order, and maybe it
would be better to move it somewhere else. Is it OK if I move the
related code to the end of the optimize_ops_list method?

Thanks
Ilya

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-14 15:31             ` Ilya Enkovich
  2011-07-19 12:10               ` Richard Guenther
@ 2011-07-21 20:14               ` Joseph S. Myers
  1 sibling, 0 replies; 47+ messages in thread
From: Joseph S. Myers @ 2011-07-21 20:14 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: Richard Guenther, Michael Meissner, gcc-patches

On Thu, 14 Jul 2011, Ilya Enkovich wrote:

> 	* doc/tm.texi.in (reassociation_width): New hook documentation.

Unless the documentation for a new hook is based on pre-existing GFDL-only 
text, it should go in target.def not tm.texi.in, with the only thing added 
to tm.texi.in being a single @hook line.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-19 12:10               ` Richard Guenther
  2011-07-19 15:25                 ` Ilya Enkovich
@ 2011-07-26  9:54                 ` Ilya Enkovich
  2011-08-03  8:35                   ` Ilya Enkovich
  1 sibling, 1 reply; 47+ messages in thread
From: Ilya Enkovich @ 2011-07-26  9:54 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1913 bytes --]

Hello,

Here is an updated patch for the tree reassoc phase. The update includes
coding style fixes, comment updates, and a change to the target hook
arguments. I also moved part of the code from rewrite_expr_tree into a
separate function so it can be reused in rewrite_expr_tree_parallel for
grouping operands with the same rank.

Bootstrapped and checked on x86_64-linux.

Ilya
--
gcc/

2011-07-26  Enkovich Ilya  <ilya.enkovich@intel.com>

	PR middle-end/44382
	* target.def (reassociation_width): New hook.

	* doc/tm.texi.in (reassociation_width): Likewise.

	* doc/tm.texi (reassociation_width): Likewise.

	* doc/invoke.texi (tree-reassoc-width): New param documented.

	* hooks.h (hook_int_uint_mode_1): New default hook.

	* hooks.c (hook_int_uint_mode_1): Likewise.

	* config/i386/i386.h (ix86_tune_indices): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(TARGET_REASSOC_INT_TO_PARALLEL): New.
	(TARGET_REASSOC_FP_TO_PARALLEL): Likewise.

	* config/i386/i386.c (initial_ix86_tune_features): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(ix86_reassociation_width) implementation of
	new hook for i386 target.

	* params.def (PARAM_TREE_REASSOC_WIDTH): New param added.

	* tree-ssa-reassoc.c (get_required_cycles): New function.
	(get_reassociation_width): Likewise.
	(swap_ops_for_binary_stmt): Likewise.
	(rewrite_expr_tree_parallel): Likewise.

	(rewrite_expr_tree): Refactored. Part of code moved into
	swap_ops_for_binary_stmt.

	(reassociate_bb): Now checks reassociation width to be used
	and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
	if needed.

	(pass_reassoc): TODO_remove_unused_locals flag added.

gcc/testsuite/

2011-07-26  Enkovich Ilya  <ilya.enkovich@intel.com>

	* gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
	--param tree-reassoc-width=1.

	* gcc.dg/tree-ssa/reassoc-24.c: New test.
	* gcc.dg/tree-ssa/reassoc-25.c: Likewise.

[-- Attachment #2: PR44382.diff --]
[-- Type: application/octet-stream, Size: 19448 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 96263ed..1d1103a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2168,7 +2168,15 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
 
   /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
      the auto-vectorizer.  */
-  m_BDVER 
+  m_BDVER,
+
+  /* X86_TUNE_REASSOC_INT_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of integer computation.  */
+  m_ATOM,
+
+  /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of fp computation.  */
+  m_ATOM
 };
 
 /* Feature tests against the various architecture variations.  */
@@ -33949,6 +33957,8 @@ ix86_enum_va_list (int idx, const char **pname, tree *ptree)
 #define TARGET_SCHED_DISPATCH has_dispatch
 #undef TARGET_SCHED_DISPATCH_DO
 #define TARGET_SCHED_DISPATCH_DO do_dispatch
+#undef TARGET_SCHED_REASSOCIATION_WIDTH
+#define TARGET_SCHED_REASSOCIATION_WIDTH ix86_reassociation_width
 
 /* The size of the dispatch window is the total number of bytes of
    object code allowed in a window.  */
@@ -34746,6 +34756,25 @@ has_dispatch (rtx insn, int action)
   return false;
 }
 
+/* Implementation of reassociation_width target hook used by
+   reassoc phase to identify parallelism level in reassociated
+   tree.  Statements tree_code is passed in OPC.  Arguments type
+   is passed in MODE.  */
+
+static int
+ix86_reassociation_width (unsigned int opc ATTRIBUTE_UNUSED,
+			  enum machine_mode mode)
+{
+  int res = 1;
+
+  if (INTEGRAL_MODE_P (mode) && TARGET_REASSOC_INT_TO_PARALLEL)
+    res = 2;
+  else if (FLOAT_MODE_P (mode) && TARGET_REASSOC_FP_TO_PARALLEL)
+    res = 2;
+
+  return res;
+}
+
 /* ??? No autovectorization into MMX or 3DNOW until we can reliably
    place emms and femms instructions.  */
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 20c9a8f..e95e244 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -315,6 +315,8 @@ enum ix86_tune_indices {
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
   X86_TUNE_AVX128_OPTIMAL,
+  X86_TUNE_REASSOC_INT_TO_PARALLEL,
+  X86_TUNE_REASSOC_FP_TO_PARALLEL,
 
   X86_TUNE_LAST
 };
@@ -413,6 +415,11 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
 #define TARGET_AVX128_OPTIMAL \
 	ix86_tune_features[X86_TUNE_AVX128_OPTIMAL]
+#define TARGET_REASSOC_INT_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_INT_TO_PARALLEL]
+#define TARGET_REASSOC_FP_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_FP_TO_PARALLEL]
+
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
   X86_ARCH_CMOVE,		/* || TARGET_SSE */
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 7783786..8c0396b 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -9063,6 +9063,11 @@ The smallest number of different values for which it is best to use a
 jump-table instead of a tree of conditional branches.  If the value is
 0, use the default for the machine.  The default is 0.
 
+@item tree-reassoc-width
+Set the maximum number of instructions executed in parallel in
+a reassociated tree. This parameter overrides the target-dependent
+heuristics used by default if it has a non-zero value.
+
 @end table
 @end table
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 097531f..ce1b95b 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6826,6 +6826,11 @@ the order of instructions is important for correctness when scheduling, but
 also the latencies of operations.
 @end deftypevr
 
+@deftypefn {Target Hook} int TARGET_SCHED_REASSOCIATION_WIDTH (unsigned int @var{opc}, enum machine_mode @var{mode})
+This hook is called by the tree reassociator to determine the level of
+parallelism required in the output calculation chain.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 01beeb4..33d2231 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -6760,6 +6760,8 @@ in its second parameter.
 
 @hook TARGET_SCHED_EXPOSED_PIPELINE
 
+@hook TARGET_SCHED_REASSOCIATION_WIDTH
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/hooks.c b/gcc/hooks.c
index 9dfde82..41903cf 100644
--- a/gcc/hooks.c
+++ b/gcc/hooks.c
@@ -161,6 +161,13 @@ default_can_output_mi_thunk_no_vcall (const_tree a ATTRIBUTE_UNUSED,
 }
 
 int
+hook_int_uint_mode_1 (unsigned int a ATTRIBUTE_UNUSED,
+		      enum machine_mode b ATTRIBUTE_UNUSED)
+{
+  return 1;
+}
+
+int
 hook_int_const_tree_0 (const_tree a ATTRIBUTE_UNUSED)
 {
   return 0;
diff --git a/gcc/hooks.h b/gcc/hooks.h
index a1b0112..d5e2836 100644
--- a/gcc/hooks.h
+++ b/gcc/hooks.h
@@ -67,6 +67,7 @@ extern void hook_void_tree_treeptr (tree, tree *);
 extern void hook_void_int_int (int, int);
 extern void hook_void_gcc_optionsp (struct gcc_options *);
 
+extern int hook_int_uint_mode_1 (unsigned int, enum machine_mode);
 extern int hook_int_const_tree_0 (const_tree);
 extern int hook_int_const_tree_const_tree_1 (const_tree, const_tree);
 extern int hook_int_rtx_0 (rtx);
diff --git a/gcc/params.def b/gcc/params.def
index 6bfb40f..08529d4 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -908,6 +908,12 @@ DEFPARAM (PARAM_CASE_VALUES_THRESHOLD,
 	  "if 0, use the default for the machine",
           0, 0, 0)
 
+/* Reassociation width to be used by tree reassoc optimization.  */
+DEFPARAM (PARAM_TREE_REASSOC_WIDTH,
+	  "tree-reassoc-width",
+	  "Set the maximum number of instructions executed in parallel in "
+	  "reassociated tree. If 0, use the target dependent heuristic.",
+	  0, 0, 0)
 
 /*
 Local variables:
diff --git a/gcc/target.def b/gcc/target.def
index 9ff97e6..e038588 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -913,6 +913,16 @@ the order of instructions is important for correctness when scheduling, but\n\
 also the latencies of operations.",
 bool, false)
 
+/* The following member value is a function that returns number
+   of operations reassociator should try to put in parallel for
+   statements of the given type.  By default 1 is used.  */
+DEFHOOK
+(reassociation_width,
+"This hook is called by the tree reassociator to determine the level of\n\
+parallelism required in the output calculation chain.",
+int, (unsigned int opc, enum machine_mode mode),
+hook_int_uint_mode_1)
+
 HOOK_VECTOR_END (sched)
 
 /* Functions relating to vectorization.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
index e787227..a80a5a8 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
@@ -1,6 +1,6 @@
 /* PR middle-end/38533 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-reassoc1" } */
+/* { dg-options "-O2 --param tree-reassoc-width=1 -fdump-tree-reassoc1" } */
 
 #define A asm volatile ("" : "=r" (f) : "0" (0)); e |= f;
 #define B A A A A A A A A A A A
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
new file mode 100644
index 0000000..c871628
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param tree-reassoc-width=2 -fdump-tree-reassoc1" } */
+
+unsigned int
+foo (void)
+{
+  unsigned int a = 0;
+  unsigned int b;
+
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+
+  return a;
+}
+
+/* Verify there are two pairs of __asm__ statements with no
+   intervening stmts.  */
+/* { dg-final { scan-tree-dump-times "__asm__\[^;\n]*;\n *__asm__" 2 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
new file mode 100644
index 0000000..4ff66ef
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param tree-reassoc-width=3 -fdump-tree-reassoc1-details" } */
+
+unsigned int
+foo (int a, int b, int c, int d)
+{
+  unsigned int s = 0;
+
+  s += a;
+  s += b;
+  s += c;
+  s += d;
+
+  return s;
+}
+
+/* Verify reassociation width was chosen to be 2.  */
+/* { dg-final { scan-tree-dump-times "Width = 2" 1 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index 86d26fb..fcc1d5f 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -40,6 +40,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "pointer-set.h"
 #include "cfgloop.h"
 #include "flags.h"
+#include "target.h"
+#include "params.h"
 
 /*  This is a simple global reassociation pass.  It is, in part, based
     on the LLVM pass of the same name (They do some things more/less
@@ -1473,6 +1475,62 @@ remove_visited_stmt_chain (tree var)
     }
 }
 
+/* This function checks three consecutive operands in
+   the passed operand vector OPS starting from OPINDEX and
+   swaps two operands if it is profitable for the binary operation
+   consuming the OPINDEX + 1 and OPINDEX + 2 operands.
+
+   We pair ops with the same rank if possible.
+
+   The alternative we try is to see if STMT is a destructive
+   update style statement, which is like:
+   b = phi (a, ...)
+   a = c + b;
+   In that case, we want to use the destructive update form to
+   expose the possible vectorizer sum reduction opportunity.
+   In that case, the third operand will be the phi node. This
+   check is not performed if STMT is null.
+
+   We could, of course, try to be better as noted above, and do a
+   lot of work to try to find these opportunities in >3 operand
+   cases, but it is unlikely to be worth it.  */
+
+static void
+swap_ops_for_binary_stmt (VEC(operand_entry_t, heap) * ops,
+			  unsigned int opindex, gimple stmt)
+{
+  operand_entry_t oe1, oe2, oe3;
+
+  oe1 = VEC_index (operand_entry_t, ops, opindex);
+  oe2 = VEC_index (operand_entry_t, ops, opindex + 1);
+  oe3 = VEC_index (operand_entry_t, ops, opindex + 2);
+
+  if ((oe1->rank == oe2->rank
+       && oe2->rank != oe3->rank)
+      || (stmt && is_phi_for_stmt (stmt, oe3->op)
+	  && !is_phi_for_stmt (stmt, oe1->op)
+	  && !is_phi_for_stmt (stmt, oe2->op)))
+    {
+      struct operand_entry temp = *oe3;
+      oe3->op = oe1->op;
+      oe3->rank = oe1->rank;
+      oe1->op = temp.op;
+      oe1->rank= temp.rank;
+    }
+  else if ((oe1->rank == oe3->rank
+	    && oe2->rank != oe3->rank)
+	   || (stmt && is_phi_for_stmt (stmt, oe2->op)
+	       && !is_phi_for_stmt (stmt, oe1->op)
+	       && !is_phi_for_stmt (stmt, oe3->op)))
+    {
+      struct operand_entry temp = *oe2;
+      oe2->op = oe1->op;
+      oe2->rank = oe1->rank;
+      oe1->op = temp.op;
+      oe1->rank= temp.rank;
+    }
+}
+
 /* Recursively rewrite our linearized statements so that the operators
    match those in OPS[OPINDEX], putting the computation in rank
    order.  */
@@ -1485,53 +1543,10 @@ rewrite_expr_tree (gimple stmt, unsigned int opindex,
   tree rhs2 = gimple_assign_rhs2 (stmt);
   operand_entry_t oe;
 
-  /* If we have three operands left, then we want to make sure the one
-     that gets the double binary op are the ones with the same rank.
-
-     The alternative we try is to see if this is a destructive
-     update style statement, which is like:
-     b = phi (a, ...)
-     a = c + b;
-     In that case, we want to use the destructive update form to
-     expose the possible vectorizer sum reduction opportunity.
-     In that case, the third operand will be the phi node.
-
-     We could, of course, try to be better as noted above, and do a
-     lot of work to try to find these opportunities in >3 operand
-     cases, but it is unlikely to be worth it.  */
+  /* If we have three operands left, then we want to make sure the ones
+     that get the double binary op are chosen wisely.  */
   if (opindex + 3 == VEC_length (operand_entry_t, ops))
-    {
-      operand_entry_t oe1, oe2, oe3;
-
-      oe1 = VEC_index (operand_entry_t, ops, opindex);
-      oe2 = VEC_index (operand_entry_t, ops, opindex + 1);
-      oe3 = VEC_index (operand_entry_t, ops, opindex + 2);
-
-      if ((oe1->rank == oe2->rank
-	   && oe2->rank != oe3->rank)
-	  || (is_phi_for_stmt (stmt, oe3->op)
-	      && !is_phi_for_stmt (stmt, oe1->op)
-	      && !is_phi_for_stmt (stmt, oe2->op)))
-	{
-	  struct operand_entry temp = *oe3;
-	  oe3->op = oe1->op;
-	  oe3->rank = oe1->rank;
-	  oe1->op = temp.op;
-	  oe1->rank= temp.rank;
-	}
-      else if ((oe1->rank == oe3->rank
-		&& oe2->rank != oe3->rank)
-	       || (is_phi_for_stmt (stmt, oe2->op)
-		   && !is_phi_for_stmt (stmt, oe1->op)
-		   && !is_phi_for_stmt (stmt, oe3->op)))
-	{
-	  struct operand_entry temp = *oe2;
-	  oe2->op = oe1->op;
-	  oe2->rank = oe1->rank;
-	  oe1->op = temp.op;
-	  oe1->rank= temp.rank;
-	}
-    }
+    swap_ops_for_binary_stmt (ops, opindex, stmt);
 
   /* The final recursion case for this function is that you have
      exactly two operations left.
@@ -1616,6 +1631,172 @@ rewrite_expr_tree (gimple stmt, unsigned int opindex,
   rewrite_expr_tree (SSA_NAME_DEF_STMT (rhs1), opindex + 1, ops, moved);
 }
 
+/* Find out how many cycles we need to compute the whole statement chain.
+   OPS_NUM holds the number of operands in the chain.  CPU_WIDTH is the
+   maximum number of independent statements we may execute per cycle.  */
+
+static int
+get_required_cycles (int ops_num, int cpu_width)
+{
+  int res;
+  int elog;
+  unsigned int rest;
+
+  /* While we have more than 2 * cpu_width operands
+     we may reduce number of operands by cpu_width
+     per cycle.  */
+  res = ops_num / (2 * cpu_width);
+
+  /* The remaining operand count may be halved each cycle
+     until we have only one operand.  */
+  rest = (unsigned)(ops_num - res * cpu_width);
+  elog = exact_log2 (rest);
+  if (elog >= 0)
+    res += elog;
+  else
+    res += floor_log2 (rest) + 1;
+
+  return res;
+}
+
+/* Returns an optimal number of registers to use for computation of
+   given statements.  */
+
+static int
+get_reassociation_width (VEC(operand_entry_t, heap) * ops,
+			 enum tree_code opc, enum machine_mode mode)
+{
+  int param_width = PARAM_VALUE (PARAM_TREE_REASSOC_WIDTH);
+  int ops_num = VEC_length (operand_entry_t, ops);
+  int width;
+  int cycles_best;
+
+  if (param_width > 0)
+    width = param_width;
+  else
+    width = targetm.sched.reassociation_width (opc, mode);
+
+  if (width == 1)
+    return width;
+
+  /* Get the minimal time required for sequence computation.  */
+  cycles_best = get_required_cycles (ops_num, width);
+
+  /* Check if we may use less width and still compute sequence for
+     the same time.  It will allow us to reduce registers usage.  */
+  while (width > 1
+	 && get_required_cycles (ops_num, width - 1) == cycles_best)
+    width--;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file,
+	     "Width = %d was chosen for reassociation\n", width);
+
+  return width;
+}
+
+/* Recursively rewrite our linearized statements so that the operators
+   match those in OPS[OPINDEX], putting the computation in rank
+   order and trying to allow operations to be executed in
+   parallel.  */
+
+static void
+rewrite_expr_tree_parallel (gimple stmt, int width,
+			    VEC(operand_entry_t, heap) * ops)
+{
+  enum tree_code opcode = gimple_assign_rhs_code (stmt);
+  int op_num = VEC_length (operand_entry_t, ops);
+  int stmt_num = op_num - 1;
+  gimple *stmts = XNEWVEC (gimple, stmt_num);
+  int op_index = op_num - 1;
+  int stmt_index = 0;
+  int ready_stmts_end = 0;
+  int i = 0;
+  tree last_rhs1 = gimple_assign_rhs1 (stmt);
+
+  /* We start expression rewriting from the top statements.
+     So, in this loop we create a full list of statements
+     we will work with.  */
+  stmts[stmt_num - 1] = stmt;
+  for (i = stmt_num - 2; i >= 0; i--)
+    stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
+
+  for (i = 0; i < stmt_num; i++)
+    {
+      tree op1, op2;
+
+      /* Determine whether we should use results of
+	 already handled statements or not.  */
+      if (ready_stmts_end == 0 &&
+	  (i - stmt_index >= width || op_index < 1))
+	ready_stmts_end = i;
+
+      /* Now we choose operands for the next statement.  A non-zero
+	 value in ready_stmts_end means that we should use the
+	 results of already generated statements as new operands.  */
+      if (ready_stmts_end > 0)
+	{
+	  op1 = gimple_assign_lhs (stmts[stmt_index++]);
+	  if (ready_stmts_end > stmt_index)
+	    op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	  else if (op_index >= 0)
+	    op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  else
+	    {
+	      gcc_assert (stmt_index < i);
+	      op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	    }
+
+	  if (stmt_index >= ready_stmts_end)
+	    ready_stmts_end = 0;
+	}
+      else
+	{
+	  if (op_index > 1)
+	    swap_ops_for_binary_stmt (ops, op_index - 2, NULL);
+	  op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  op1 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	}
+
+      /* If we emit the last statement then we should put
+	 operands into the last statement.  It will also
+	 break the loop.  */
+      if (op_index < 0 && stmt_index == i)
+	i = stmt_num - 1;
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Transforming ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+
+      /* We keep original statement only for the last one.  All
+	 others are recreated.  */
+      if (i == stmt_num - 1)
+	{
+	  gimple_assign_set_rhs1 (stmts[i], op1);
+	  gimple_assign_set_rhs2 (stmts[i], op2);
+	  update_stmt (stmts[i]);
+	}
+      else
+	{
+	  tree var = create_tmp_reg (TREE_TYPE (last_rhs1), "reassoc");
+	  add_referenced_var (var);
+	  stmts[i] = build_and_add_sum (var, op1, op2, opcode);
+	}
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+    }
+
+  remove_visited_stmt_chain (last_rhs1);
+
+  free (stmts);
+}
+
 /* Transform STMT, which is really (A +B) + (C + D) into the left
    linear form, ((A+B)+C)+D.
    Recurse on D if necessary.  */
@@ -2138,7 +2319,16 @@ reassociate_bb (basic_block bb)
 		    }
 		}
 	      else
-		rewrite_expr_tree (stmt, 0, ops, false);
+		{
+		  enum machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
+		  int width = get_reassociation_width (ops, rhs_code, mode);
+
+		  if (width > 1
+		      && VEC_length (operand_entry_t, ops) > 3)
+		    rewrite_expr_tree_parallel (stmt, width, ops);
+		  else
+		    rewrite_expr_tree (stmt, 0, ops, false);
+		}
 
 	      VEC_free (operand_entry_t, heap, ops);
 	    }
@@ -2296,7 +2486,8 @@ struct gimple_opt_pass pass_reassoc =
   0,					/* properties_provided */
   0,					/* properties_destroyed */
   0,					/* todo_flags_start */
-  TODO_verify_ssa
+  TODO_remove_unused_locals
+    | TODO_verify_ssa
     | TODO_verify_flow
     | TODO_ggc_collect			/* todo_flags_finish */
  }
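As a side note (not part of the patch), the interaction between
get_required_cycles and the width-narrowing loop in
get_reassociation_width can be checked with a small standalone sketch
that mirrors the patch's arithmetic, with the ceil-log2 tail written as
a plain loop:

#include <stdio.h>

static int
required_cycles (int ops_num, int cpu_width)
{
  /* While more than 2 * cpu_width operands remain, each cycle retires
     cpu_width of them.  */
  int res = ops_num / (2 * cpu_width);
  int rest = ops_num - res * cpu_width;

  /* The rest can then be halved each cycle: ceil (log2 (rest)).  */
  while (rest > 1)
    {
      rest = (rest + 1) / 2;
      res++;
    }
  return res;
}

int
main (void)
{
  /* The 4-operand chain of reassoc-25.c with --param tree-reassoc-width=3:
     widths 3 and 2 both need 2 cycles while width 1 needs 3, so the pass
     narrows the width to 2, which is the "Width = 2" the test scans for.  */
  printf ("4 ops: w3=%d w2=%d w1=%d cycles\n",
          required_cycles (4, 3), required_cycles (4, 2),
          required_cycles (4, 1));
  return 0;
}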

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-26  9:54                 ` Ilya Enkovich
@ 2011-08-03  8:35                   ` Ilya Enkovich
  0 siblings, 0 replies; 47+ messages in thread
From: Ilya Enkovich @ 2011-08-03  8:35 UTC (permalink / raw)
  To: gcc-patches

Ping.

2011/7/26 Ilya Enkovich <enkovich.gnu@gmail.com>:
> Hello,
>
> Here is an updated patch for the tree reassoc phase. The update includes
> coding style fixes, comment updates, and a change to the target hook
> arguments. I also moved part of the code from rewrite_expr_tree into a
> separate function so it can be reused in rewrite_expr_tree_parallel for
> grouping operands with the same rank.
>
> Bootstrapped and checked on x86_64-linux.
>
> Ilya
> --
> gcc/
>
> 2011-07-26  Enkovich Ilya  <ilya.enkovich@intel.com>
>
>        PR middle-end/44382
>        * target.def (reassociation_width): New hook.
>
>        * doc/tm.texi.in (reassociation_width): Likewise.
>
>        * doc/tm.texi (reassociation_width): Likewise.
>
>        * doc/invoke.texi (tree-reassoc-width): New param documented.
>
>        * hooks.h (hook_int_uint_mode_1): New default hook.
>
>        * hooks.c (hook_int_uint_mode_1): Likewise.
>
>        * config/i386/i386.h (ix86_tune_indices): Add
>        X86_TUNE_REASSOC_INT_TO_PARALLEL and
>        X86_TUNE_REASSOC_FP_TO_PARALLEL.
>
>        (TARGET_REASSOC_INT_TO_PARALLEL): New.
>        (TARGET_REASSOC_FP_TO_PARALLEL): Likewise.
>
>        * config/i386/i386.c (initial_ix86_tune_features): Add
>        X86_TUNE_REASSOC_INT_TO_PARALLEL and
>        X86_TUNE_REASSOC_FP_TO_PARALLEL.
>
>        (ix86_reassociation_width) implementation of
>        new hook for i386 target.
>
>        * params.def (PARAM_TREE_REASSOC_WIDTH): New param added.
>
>        * tree-ssa-reassoc.c (get_required_cycles): New function.
>        (get_reassociation_width): Likewise.
>        (swap_ops_for_binary_stmt): Likewise.
>        (rewrite_expr_tree_parallel): Likewise.
>
>        (rewrite_expr_tree): Refactored. Part of code moved into
>        swap_ops_for_binary_stmt.
>
>        (reassociate_bb): Now checks reassociation width to be used
>        and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
>        if needed.
>
>        (pass_reassoc): TODO_remove_unused_locals flag added.
>
> gcc/testsuite/
>
> 2011-07-26  Enkovich Ilya  <ilya.enkovich@intel.com>
>
>        * gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
>        --param tree-reassoc-width=1.
>
>        * gcc.dg/tree-ssa/reassoc-24.c: New test.
>        * gcc.dg/tree-ssa/reassoc-25.c: Likewise.
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-07-19 15:25                 ` Ilya Enkovich
@ 2011-08-05 11:46                   ` Richard Guenther
  2011-08-05 15:08                     ` Ilya Enkovich
  0 siblings, 1 reply; 47+ messages in thread
From: Richard Guenther @ 2011-08-05 11:46 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: Michael Meissner, gcc-patches

On Tue, Jul 19, 2011 at 4:46 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> Hello Richard,
>
> Thanks a lot for the review!
>
>> it's not easy to follow the flow of this function, esp. I wonder
>>
>> +      else
>> +       {
>> +         tree var = create_tmp_reg (TREE_TYPE (last_rhs1), "reassoc");
>> +         add_referenced_var (var);
>> +         stmts[i] = build_and_add_sum (var, op1, op2, opcode);
>> +       }
>>
>> what you need to create new stmts for, and if you possibly create
>> multiple temporary variables here.
>>
>> -  TODO_verify_ssa
>> +  TODO_remove_unused_locals
>> +    | TODO_verify_ssa
>>
>> why?
>>
>
> The current rewrite_expr_tree uses the original statements and just
> changes the order in which the operands (tree leaves) are used. To keep
> the data flow correct it moves all statements down to the last one. I
> tried that approach, but it did not work well because of increased
> register pressure in some cases (all operands are live at the same time
> right before the first statement of our sequence). Then I tried to move
> the existing statements to the appropriate places (just after the last
> operand definition), but that turned out to be not so easy. It is easy
> to guarantee consistency in the final result but not easy to keep the
> data flow consistent after each statement move. And if we do not keep
> the code consistent after each move then we may get wrong debug
> statements generated. After a few tries I decided that the easiest way,
> both to understand and to implement, is to recreate the statements and
> remove the old chain. So, I used build_and_add_sum to create all
> statements except the last one, which never needs recreation. I added
> TODO_remove_unused_locals since the temps used in the original
> statements should be removed. Is this approach OK?

Well, it's enough to delay that to later passes that do this, so I'd prefer
to not change this in this patch.

>> You don't seem to handle the special-cases of rewrite_expr_tree
>> for the leafs of your tree, especially the PHI node special-casing.
>> I think you may run into vectorization issues here.
>>
> I do not see any reason why these cases are checked in
> rewrite_expr_tree. It is an optimization of operand order, and maybe it
> would be good to move it somewhere else. Is it OK if I move the related
> code to the end of the optimize_ops_list method?

I suppose yes.  Maybe the cases are also obsoleted by the improved
loop PHI handling.

Richard.

> Thanks
> Ilya
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-08-05 11:46                   ` Richard Guenther
@ 2011-08-05 15:08                     ` Ilya Enkovich
  2011-08-10 15:19                       ` Ilya Enkovich
  0 siblings, 1 reply; 47+ messages in thread
From: Ilya Enkovich @ 2011-08-05 15:08 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Michael Meissner, gcc-patches

Hello Richard,

Thanks for the reply!

>
> Well, it's enough to delay that to later passes that do this, so I'd prefer
> to not change this in this patch.
>
OK, I'll fix it in the next patch version.

> I suppose yes.  Maybe the cases are also obsoleted by the improved
> loop PHI handling.
>
I noticed that the PHI node special cases are caught only when we have
three operands in the operands vector, which is not the case for the new
expr rewriter. Grouping operands with the same rank is a good thing to
do though, so I moved this part of the code into a separate function.
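
Just to illustrate the rank pairing itself, here is a toy standalone
model (made-up operand names and ranks; only the rank-based case, the
PHI special case is left out): out of three operands, the two with equal
rank should feed the inner binary operation, so the odd one is moved
into the first slot.

  #include <stdio.h>

  struct op { const char *name; int rank; };

  /* Mirrors the swapping rule of swap_ops_for_binary_stmt for the
     rank-based case.  */
  static void
  pair_equal_ranks (struct op *o)
  {
    struct op tmp;
    if (o[0].rank == o[1].rank && o[1].rank != o[2].rank)
      { tmp = o[2]; o[2] = o[0]; o[0] = tmp; }   /* oe1 <-> oe3 */
    else if (o[0].rank == o[2].rank && o[1].rank != o[2].rank)
      { tmp = o[1]; o[1] = o[0]; o[0] = tmp; }   /* oe1 <-> oe2 */
  }

  int
  main (void)
  {
    struct op ops[3] = { { "a", 1 }, { "b", 1 }, { "c", 2 } };
    pair_equal_ranks (ops);
    /* The second and third operands now share a rank: prints "c b a".  */
    printf ("%s %s %s\n", ops[0].name, ops[1].name, ops[2].name);
    return 0;
  }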

I've sent the new patch in this mail:
http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02240.html
It has all comments fixed except the TODO_remove_unused_locals flag.

Thanks
Ilya

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-08-05 15:08                     ` Ilya Enkovich
@ 2011-08-10 15:19                       ` Ilya Enkovich
  2011-08-19  9:09                         ` Ilya Enkovich
  2011-08-30 13:16                         ` Richard Guenther
  0 siblings, 2 replies; 47+ messages in thread
From: Ilya Enkovich @ 2011-08-10 15:19 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Michael Meissner, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1813 bytes --]

Hello,

Here is a new version of the patch. Changes from the previous version
(http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02240.html):
 - updated to trunk
 - TODO_remove_unused_locals flag was removed from todo_flags_finish
of reassoc pass

Bootstrapped and checked on x86_64-linux.

Thanks,
Ilya
---
gcc/

2011-08-10  Enkovich Ilya  <ilya.enkovich@intel.com>

	PR middle-end/44382
	* target.def (reassociation_width): New hook.

	* doc/tm.texi.in (reassociation_width): Likewise.

	* doc/tm.texi (reassociation_width): Likewise.

	* doc/invoke.texi (tree-reassoc-width): New param documented.

	* hooks.h (hook_int_uint_mode_1): New default hook.

	* hooks.c (hook_int_uint_mode_1): Likewise.

	* config/i386/i386.h (ix86_tune_indices): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(TARGET_REASSOC_INT_TO_PARALLEL): New.
	(TARGET_REASSOC_FP_TO_PARALLEL): Likewise.

	* config/i386/i386.c (initial_ix86_tune_features): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(ix86_reassociation_width): Implementation of
	new hook for i386 target.

	* params.def (PARAM_TREE_REASSOC_WIDTH): New param added.

	* tree-ssa-reassoc.c (get_required_cycles): New function.
	(get_reassociation_width): Likewise.
	(swap_ops_for_binary_stmt): Likewise.
	(rewrite_expr_tree_parallel): Likewise.

	(rewrite_expr_tree): Refactored. Part of code moved into
	swap_ops_for_binary_stmt.

	(reassociate_bb): Now checks reassociation width to be used
	and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
	if needed.

gcc/testsuite/

2011-08-10  Enkovich Ilya  <ilya.enkovich@intel.com>

	* gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
	--param tree-reassoc-width=1.

	* gcc.dg/tree-ssa/reassoc-24.c: New test.
	* gcc.dg/tree-ssa/reassoc-25.c: Likewise.

[-- Attachment #2: PR44382.diff --]
[-- Type: application/octet-stream, Size: 19138 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 05dd57c..527d786 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2168,7 +2168,15 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
 
   /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
      the auto-vectorizer.  */
-  m_BDVER
+  m_BDVER,
+
+  /* X86_TUNE_REASSOC_INT_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of integer computation.  */
+  m_ATOM,
+
+  /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of fp computation.  */
+  m_ATOM
 };
 
 /* Feature tests against the various architecture variations.  */
@@ -34159,6 +34167,8 @@ ix86_enum_va_list (int idx, const char **pname, tree *ptree)
 #define TARGET_SCHED_DISPATCH has_dispatch
 #undef TARGET_SCHED_DISPATCH_DO
 #define TARGET_SCHED_DISPATCH_DO do_dispatch
+#undef TARGET_SCHED_REASSOCIATION_WIDTH
+#define TARGET_SCHED_REASSOCIATION_WIDTH ix86_reassociation_width
 
 /* The size of the dispatch window is the total number of bytes of
    object code allowed in a window.  */
@@ -34956,6 +34966,25 @@ has_dispatch (rtx insn, int action)
   return false;
 }
 
+/* Implementation of reassociation_width target hook used by
+   the reassoc phase to identify the parallelism level in a reassociated
+   tree.  The statement's tree_code is passed in OPC.  The arguments'
+   type is passed in MODE.  */
+
+static int
+ix86_reassociation_width (unsigned int opc ATTRIBUTE_UNUSED,
+			  enum machine_mode mode)
+{
+  int res = 1;
+
+  if (INTEGRAL_MODE_P (mode) && TARGET_REASSOC_INT_TO_PARALLEL)
+    res = 2;
+  else if (FLOAT_MODE_P (mode) && TARGET_REASSOC_FP_TO_PARALLEL)
+    res = 2;
+
+  return res;
+}
+
 /* ??? No autovectorization into MMX or 3DNOW until we can reliably
    place emms and femms instructions.  */
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index f43586d..f54a031 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -316,6 +316,8 @@ enum ix86_tune_indices {
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
   X86_TUNE_AVX128_OPTIMAL,
+  X86_TUNE_REASSOC_INT_TO_PARALLEL,
+  X86_TUNE_REASSOC_FP_TO_PARALLEL,
 
   X86_TUNE_LAST
 };
@@ -414,6 +416,11 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
 #define TARGET_AVX128_OPTIMAL \
 	ix86_tune_features[X86_TUNE_AVX128_OPTIMAL]
+#define TARGET_REASSOC_INT_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_INT_TO_PARALLEL]
+#define TARGET_REASSOC_FP_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_FP_TO_PARALLEL]
+
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
   X86_ARCH_CMOVE,		/* || TARGET_SSE */
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index f5d53d1..73a625a 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -9068,6 +9068,11 @@ The smallest number of different values for which it is best to use a
 jump-table instead of a tree of conditional branches.  If the value is
 0, use the default for the machine.  The default is 0.
 
+@item tree-reassoc-width
+Set the maximum number of instructions executed in parallel in
+a reassociated tree.  This parameter overrides the target-dependent
+heuristic used by default if it has a non-zero value.
+
 @end table
 @end table
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 74a2324..581383d 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6844,6 +6844,11 @@ the order of instructions is important for correctness when scheduling, but
 also the latencies of operations.
 @end deftypevr
 
+@deftypefn {Target Hook} int TARGET_SCHED_REASSOCIATION_WIDTH (unsigned int @var{opc}, enum machine_mode @var{mode})
+This hook is called by the tree reassociator to determine the level of
+parallelism required in an output calculation chain.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index f63fe4a..f383a8c 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -6778,6 +6778,8 @@ in its second parameter.
 
 @hook TARGET_SCHED_EXPOSED_PIPELINE
 
+@hook TARGET_SCHED_REASSOCIATION_WIDTH
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/hooks.c b/gcc/hooks.c
index 9dfde82..41903cf 100644
--- a/gcc/hooks.c
+++ b/gcc/hooks.c
@@ -161,6 +161,13 @@ default_can_output_mi_thunk_no_vcall (const_tree a ATTRIBUTE_UNUSED,
 }
 
 int
+hook_int_uint_mode_1 (unsigned int a ATTRIBUTE_UNUSED,
+		      enum machine_mode b ATTRIBUTE_UNUSED)
+{
+  return 1;
+}
+
+int
 hook_int_const_tree_0 (const_tree a ATTRIBUTE_UNUSED)
 {
   return 0;
diff --git a/gcc/hooks.h b/gcc/hooks.h
index a1b0112..d5e2836 100644
--- a/gcc/hooks.h
+++ b/gcc/hooks.h
@@ -67,6 +67,7 @@ extern void hook_void_tree_treeptr (tree, tree *);
 extern void hook_void_int_int (int, int);
 extern void hook_void_gcc_optionsp (struct gcc_options *);
 
+extern int hook_int_uint_mode_1 (unsigned int, enum machine_mode);
 extern int hook_int_const_tree_0 (const_tree);
 extern int hook_int_const_tree_const_tree_1 (const_tree, const_tree);
 extern int hook_int_rtx_0 (rtx);
diff --git a/gcc/params.def b/gcc/params.def
index e8372ed..c376c80 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -914,6 +914,13 @@ DEFPARAM (PARAM_ALLOW_STORE_DATA_RACES,
 	  "Allow new data races on stores to be introduced",
 	  1, 0, 1)
 
+/* Reassociation width to be used by tree reassoc optimization.  */
+DEFPARAM (PARAM_TREE_REASSOC_WIDTH,
+	  "tree-reassoc-width",
+	  "Set the maximum number of instructions executed in parallel in "
+	  "reassociated tree. If 0, use the target dependent heuristic.",
+	  0, 0, 0)
+
 
 /*
 Local variables:
diff --git a/gcc/target.def b/gcc/target.def
index 9ff97e6..e038588 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -913,6 +913,16 @@ the order of instructions is important for correctness when scheduling, but\n\
 also the latencies of operations.",
 bool, false)
 
+/* The following member value is a function that returns number
+   of operations reassociator should try to put in parallel for
+   statements of the given type.  By default 1 is used.  */
+DEFHOOK
+(reassociation_width,
+"This hook is called by tree reassociator to determine a level of\n\
+parallelism required in output calculations chain.",
+int, (unsigned int opc, enum machine_mode mode),
+hook_int_uint_mode_1)
+
 HOOK_VECTOR_END (sched)
 
 /* Functions relating to vectorization.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
index e787227..a80a5a8 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
@@ -1,6 +1,6 @@
 /* PR middle-end/38533 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-reassoc1" } */
+/* { dg-options "-O2 --param tree-reassoc-width=1 -fdump-tree-reassoc1" } */
 
 #define A asm volatile ("" : "=r" (f) : "0" (0)); e |= f;
 #define B A A A A A A A A A A A
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
new file mode 100644
index 0000000..c871628
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param tree-reassoc-width=2 -fdump-tree-reassoc1" } */
+
+unsigned int
+foo (void)
+{
+  unsigned int a = 0;
+  unsigned int b;
+
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+
+  return a;
+}
+
+/* Verify there are two pairs of __asm__ statements with no
+   intervening stmts.  */
+/* { dg-final { scan-tree-dump-times "__asm__\[^;\n]*;\n *__asm__" 2 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
new file mode 100644
index 0000000..4ff66ef
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param tree-reassoc-width=3 -fdump-tree-reassoc1-details" } */
+
+unsigned int
+foo (int a, int b, int c, int d)
+{
+  unsigned int s = 0;
+
+  s += a;
+  s += b;
+  s += c;
+  s += d;
+
+  return s;
+}
+
+/* Verify reassociation width was chosen to be 2.  */
+/* { dg-final { scan-tree-dump-times "Width = 2" 1 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index 51f7ef8..6b3ee15 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -40,6 +40,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "pointer-set.h"
 #include "cfgloop.h"
 #include "flags.h"
+#include "target.h"
+#include "params.h"
 
 /*  This is a simple global reassociation pass.  It is, in part, based
     on the LLVM pass of the same name (They do some things more/less
@@ -1617,6 +1619,62 @@ remove_visited_stmt_chain (tree var)
     }
 }
 
+/* This function checks three consecutive operands in
+   the passed operands vector OPS starting from OPINDEX and
+   swaps two operands if it is profitable for the binary operation
+   consuming the OPINDEX + 1 and OPINDEX + 2 operands.
+
+   We pair ops with the same rank if possible.
+
+   The alternative we try is to see if STMT is a destructive
+   update style statement, which is like:
+   b = phi (a, ...)
+   a = c + b;
+   In that case, we want to use the destructive update form to
+   expose the possible vectorizer sum reduction opportunity.
+   In that case, the third operand will be the phi node. This
+   check is not performed if STMT is null.
+
+   We could, of course, try to be better as noted above, and do a
+   lot of work to try to find these opportunities in >3 operand
+   cases, but it is unlikely to be worth it.  */
+
+static void
+swap_ops_for_binary_stmt (VEC(operand_entry_t, heap) * ops,
+			  unsigned int opindex, gimple stmt)
+{
+  operand_entry_t oe1, oe2, oe3;
+
+  oe1 = VEC_index (operand_entry_t, ops, opindex);
+  oe2 = VEC_index (operand_entry_t, ops, opindex + 1);
+  oe3 = VEC_index (operand_entry_t, ops, opindex + 2);
+
+  if ((oe1->rank == oe2->rank
+       && oe2->rank != oe3->rank)
+      || (stmt && is_phi_for_stmt (stmt, oe3->op)
+	  && !is_phi_for_stmt (stmt, oe1->op)
+	  && !is_phi_for_stmt (stmt, oe2->op)))
+    {
+      struct operand_entry temp = *oe3;
+      oe3->op = oe1->op;
+      oe3->rank = oe1->rank;
+      oe1->op = temp.op;
+      oe1->rank= temp.rank;
+    }
+  else if ((oe1->rank == oe3->rank
+	    && oe2->rank != oe3->rank)
+	   || (stmt && is_phi_for_stmt (stmt, oe2->op)
+	       && !is_phi_for_stmt (stmt, oe1->op)
+	       && !is_phi_for_stmt (stmt, oe3->op)))
+    {
+      struct operand_entry temp = *oe2;
+      oe2->op = oe1->op;
+      oe2->rank = oe1->rank;
+      oe1->op = temp.op;
+      oe1->rank= temp.rank;
+    }
+}
+
 /* Recursively rewrite our linearized statements so that the operators
    match those in OPS[OPINDEX], putting the computation in rank
    order.  */
@@ -1629,53 +1687,10 @@ rewrite_expr_tree (gimple stmt, unsigned int opindex,
   tree rhs2 = gimple_assign_rhs2 (stmt);
   operand_entry_t oe;
 
-  /* If we have three operands left, then we want to make sure the one
-     that gets the double binary op are the ones with the same rank.
-
-     The alternative we try is to see if this is a destructive
-     update style statement, which is like:
-     b = phi (a, ...)
-     a = c + b;
-     In that case, we want to use the destructive update form to
-     expose the possible vectorizer sum reduction opportunity.
-     In that case, the third operand will be the phi node.
-
-     We could, of course, try to be better as noted above, and do a
-     lot of work to try to find these opportunities in >3 operand
-     cases, but it is unlikely to be worth it.  */
+  /* If we have three operands left, then we want to make sure the ones
+     that get the double binary op are chosen wisely.  */
   if (opindex + 3 == VEC_length (operand_entry_t, ops))
-    {
-      operand_entry_t oe1, oe2, oe3;
-
-      oe1 = VEC_index (operand_entry_t, ops, opindex);
-      oe2 = VEC_index (operand_entry_t, ops, opindex + 1);
-      oe3 = VEC_index (operand_entry_t, ops, opindex + 2);
-
-      if ((oe1->rank == oe2->rank
-	   && oe2->rank != oe3->rank)
-	  || (is_phi_for_stmt (stmt, oe3->op)
-	      && !is_phi_for_stmt (stmt, oe1->op)
-	      && !is_phi_for_stmt (stmt, oe2->op)))
-	{
-	  struct operand_entry temp = *oe3;
-	  oe3->op = oe1->op;
-	  oe3->rank = oe1->rank;
-	  oe1->op = temp.op;
-	  oe1->rank= temp.rank;
-	}
-      else if ((oe1->rank == oe3->rank
-		&& oe2->rank != oe3->rank)
-	       || (is_phi_for_stmt (stmt, oe2->op)
-		   && !is_phi_for_stmt (stmt, oe1->op)
-		   && !is_phi_for_stmt (stmt, oe3->op)))
-	{
-	  struct operand_entry temp = *oe2;
-	  oe2->op = oe1->op;
-	  oe2->rank = oe1->rank;
-	  oe1->op = temp.op;
-	  oe1->rank= temp.rank;
-	}
-    }
+    swap_ops_for_binary_stmt (ops, opindex, stmt);
 
   /* The final recursion case for this function is that you have
      exactly two operations left.
@@ -1760,6 +1775,172 @@ rewrite_expr_tree (gimple stmt, unsigned int opindex,
   rewrite_expr_tree (SSA_NAME_DEF_STMT (rhs1), opindex + 1, ops, moved);
 }
 
+/* Find out how many cycles we need to compute a statement chain.
+   OPS_NUM holds the number of statements in a chain.  CPU_WIDTH is the
+   maximum number of independent statements we may execute per cycle.  */
+
+static int
+get_required_cycles (int ops_num, int cpu_width)
+{
+  int res;
+  int elog;
+  unsigned int rest;
+
+  /* While we have more than 2 * cpu_width operands
+     we may reduce number of operands by cpu_width
+     per cycle.  */
+  res = ops_num / (2 * cpu_width);
+
+  /* The remaining operand count may be halved each cycle
+     until we have only one operand.  */
+  rest = (unsigned)(ops_num - res * cpu_width);
+  elog = exact_log2 (rest);
+  if (elog >= 0)
+    res += elog;
+  else
+    res += floor_log2 (rest) + 1;
+
+  return res;
+}
+
+/* Returns an optimal number of registers to use for computation of
+   given statements.  */
+
+static int
+get_reassociation_width (VEC(operand_entry_t, heap) * ops,
+			 enum tree_code opc, enum machine_mode mode)
+{
+  int param_width = PARAM_VALUE (PARAM_TREE_REASSOC_WIDTH);
+  int ops_num = VEC_length (operand_entry_t, ops);
+  int width;
+  int cycles_best;
+
+  if (param_width > 0)
+    width = param_width;
+  else
+    width = targetm.sched.reassociation_width (opc, mode);
+
+  if (width == 1)
+    return width;
+
+  /* Get the minimal time required for sequence computation.  */
+  cycles_best = get_required_cycles (ops_num, width);
+
+  /* Check if we may use less width and still compute sequence for
+     the same time.  It will allow us to reduce registers usage.  */
+  while (width > 1
+	 && get_required_cycles (ops_num, width - 1) == cycles_best)
+    width--;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file,
+	     "Width = %d was chosen for reassociation\n", width);
+
+  return width;
+}
+
+/* Rewrite our linearized statements so that the operators
+   match those in OPS, putting the computation in rank
+   order and trying to allow operations to be executed in
+   parallel.  */
+
+static void
+rewrite_expr_tree_parallel (gimple stmt, int width,
+			    VEC(operand_entry_t, heap) * ops)
+{
+  enum tree_code opcode = gimple_assign_rhs_code (stmt);
+  int op_num = VEC_length (operand_entry_t, ops);
+  int stmt_num = op_num - 1;
+  gimple *stmts = XNEWVEC (gimple, stmt_num);
+  int op_index = op_num - 1;
+  int stmt_index = 0;
+  int ready_stmts_end = 0;
+  int i = 0;
+  tree last_rhs1 = gimple_assign_rhs1 (stmt);
+
+  /* We start expression rewriting from the top statements.
+     So, in this loop we create a full list of statements
+     we will work with.  */
+  stmts[stmt_num - 1] = stmt;
+  for (i = stmt_num - 2; i >= 0; i--)
+    stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
+
+  for (i = 0; i < stmt_num; i++)
+    {
+      tree op1, op2;
+
+      /* Determine whether we should use results of
+	 already handled statements or not.  */
+      if (ready_stmts_end == 0 &&
+	  (i - stmt_index >= width || op_index < 1))
+	ready_stmts_end = i;
+
+      /* Now we choose operands for the next statement.  Non zero
+	 value in ready_stmts_end means here that we should use
+	 the result of already generated statements as new operand.  */
+      if (ready_stmts_end > 0)
+	{
+	  op1 = gimple_assign_lhs (stmts[stmt_index++]);
+	  if (ready_stmts_end > stmt_index)
+	    op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	  else if (op_index >= 0)
+	    op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  else
+	    {
+	      gcc_assert (stmt_index < i);
+	      op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	    }
+
+	  if (stmt_index >= ready_stmts_end)
+	    ready_stmts_end = 0;
+	}
+      else
+	{
+	  if (op_index > 1)
+	    swap_ops_for_binary_stmt (ops, op_index - 2, NULL);
+	  op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  op1 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	}
+
+      /* If we emit the last statement then we should put
+	 operands into the last statement.  It will also
+	 break the loop.  */
+      if (op_index < 0 && stmt_index == i)
+	i = stmt_num - 1;
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Transforming ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+
+      /* We keep original statement only for the last one.  All
+	 others are recreated.  */
+      if (i == stmt_num - 1)
+	{
+	  gimple_assign_set_rhs1 (stmts[i], op1);
+	  gimple_assign_set_rhs2 (stmts[i], op2);
+	  update_stmt (stmts[i]);
+	}
+      else
+	{
+	  tree var = create_tmp_reg (TREE_TYPE (last_rhs1), "reassoc");
+	  add_referenced_var (var);
+	  stmts[i] = build_and_add_sum (var, op1, op2, opcode);
+	}
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+    }
+
+  remove_visited_stmt_chain (last_rhs1);
+
+  free (stmts);
+}
+
 /* Transform STMT, which is really (A +B) + (C + D) into the left
    linear form, ((A+B)+C)+D.
    Recurse on D if necessary.  */
@@ -2282,7 +2463,16 @@ reassociate_bb (basic_block bb)
 		    }
 		}
 	      else
-		rewrite_expr_tree (stmt, 0, ops, false);
+		{
+		  enum machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
+		  int width = get_reassociation_width (ops, rhs_code, mode);
+
+		  if (width > 1
+		      && VEC_length (operand_entry_t, ops) > 3)
+		    rewrite_expr_tree_parallel (stmt, width, ops);
+		  else
+		    rewrite_expr_tree (stmt, 0, ops, false);
+		}
 
 	      VEC_free (operand_entry_t, heap, ops);
 	    }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-08-10 15:19                       ` Ilya Enkovich
@ 2011-08-19  9:09                         ` Ilya Enkovich
  2011-08-26 11:45                           ` Ilya Enkovich
  2011-08-30 13:16                         ` Richard Guenther
  1 sibling, 1 reply; 47+ messages in thread
From: Ilya Enkovich @ 2011-08-19  9:09 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Michael Meissner, gcc-patches

Ping.

2011/8/10 Ilya Enkovich <enkovich.gnu@gmail.com>:
> Hello,
>
> Here is a new version of the patch. Changes from the previous version
> (http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02240.html):
>  - updated to trunk
>  - TODO_remove_unused_locals flag was removed from todo_flags_finish
> of reassoc pass
>
> Bootstrapped and checked on x86_64-linux.
>
> Thanks,
> Ilya
> ---
> gcc/
>
> 2011-08-10  Enkovich Ilya  <ilya.enkovich@intel.com>
>
>        PR middle-end/44382
>        * target.def (reassociation_width): New hook.
>
>        * doc/tm.texi.in (reassociation_width): Likewise.
>
>        * doc/tm.texi (reassociation_width): Likewise.
>
>        * doc/invoke.texi (tree-reassoc-width): New param documented.
>
>        * hooks.h (hook_int_uint_mode_1): New default hook.
>
>        * hooks.c (hook_int_uint_mode_1): Likewise.
>
>        * config/i386/i386.h (ix86_tune_indices): Add
>        X86_TUNE_REASSOC_INT_TO_PARALLEL and
>        X86_TUNE_REASSOC_FP_TO_PARALLEL.
>
>        (TARGET_REASSOC_INT_TO_PARALLEL): New.
>        (TARGET_REASSOC_FP_TO_PARALLEL): Likewise.
>
>        * config/i386/i386.c (initial_ix86_tune_features): Add
>        X86_TUNE_REASSOC_INT_TO_PARALLEL and
>        X86_TUNE_REASSOC_FP_TO_PARALLEL.
>
>        (ix86_reassociation_width) implementation of
>        new hook for i386 target.
>
>        * params.def (PARAM_TREE_REASSOC_WIDTH): New param added.
>
>        * tree-ssa-reassoc.c (get_required_cycles): New function.
>        (get_reassociation_width): Likewise.
>        (swap_ops_for_binary_stmt): Likewise.
>        (rewrite_expr_tree_parallel): Likewise.
>
>        (rewrite_expr_tree): Refactored. Part of code moved into
>        swap_ops_for_binary_stmt.
>
>        (reassociate_bb): Now checks reassociation width to be used
>        and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
>        if needed.
>
> gcc/testsuite/
>
> 2011-08-10  Enkovich Ilya  <ilya.enkovich@intel.com>
>
>        * gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
>        --param tree-reassoc-width=1.
>
>        * gcc.dg/tree-ssa/reassoc-24.c: New test.
>        * gcc.dg/tree-ssa/reassoc-25.c: Likewise.
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-08-19  9:09                         ` Ilya Enkovich
@ 2011-08-26 11:45                           ` Ilya Enkovich
  0 siblings, 0 replies; 47+ messages in thread
From: Ilya Enkovich @ 2011-08-26 11:45 UTC (permalink / raw)
  To: Richard Guenther, Michael Meissner, gcc-patches

Double ping.

2011/8/19 Ilya Enkovich <enkovich.gnu@gmail.com>:
> Ping.
>
> 2011/8/10 Ilya Enkovich <enkovich.gnu@gmail.com>:
>> Hello,
>>
>> Here is a new version of the patch. Changes from the previous version
>> (http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02240.html):
>>  - updated to trunk
>>  - TODO_remove_unused_locals flag was removed from todo_flags_finish
>> of reassoc pass
>>
>> Bootstrapped and checked on x86_64-linux.
>>
>> Thanks,
>> Ilya
>> ---
>> gcc/
>>
>> 2011-08-10  Enkovich Ilya  <ilya.enkovich@intel.com>
>>
>>        PR middle-end/44382
>>        * target.def (reassociation_width): New hook.
>>
>>        * doc/tm.texi.in (reassociation_width): Likewise.
>>
>>        * doc/tm.texi (reassociation_width): Likewise.
>>
>>        * doc/invoke.texi (tree-reassoc-width): New param documented.
>>
>>        * hooks.h (hook_int_uint_mode_1): New default hook.
>>
>>        * hooks.c (hook_int_uint_mode_1): Likewise.
>>
>>        * config/i386/i386.h (ix86_tune_indices): Add
>>        X86_TUNE_REASSOC_INT_TO_PARALLEL and
>>        X86_TUNE_REASSOC_FP_TO_PARALLEL.
>>
>>        (TARGET_REASSOC_INT_TO_PARALLEL): New.
>>        (TARGET_REASSOC_FP_TO_PARALLEL): Likewise.
>>
>>        * config/i386/i386.c (initial_ix86_tune_features): Add
>>        X86_TUNE_REASSOC_INT_TO_PARALLEL and
>>        X86_TUNE_REASSOC_FP_TO_PARALLEL.
>>
>>        (ix86_reassociation_width) implementation of
>>        new hook for i386 target.
>>
>>        * params.def (PARAM_TREE_REASSOC_WIDTH): New param added.
>>
>>        * tree-ssa-reassoc.c (get_required_cycles): New function.
>>        (get_reassociation_width): Likewise.
>>        (swap_ops_for_binary_stmt): Likewise.
>>        (rewrite_expr_tree_parallel): Likewise.
>>
>>        (rewrite_expr_tree): Refactored. Part of code moved into
>>        swap_ops_for_binary_stmt.
>>
>>        (reassociate_bb): Now checks reassociation width to be used
>>        and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
>>        if needed.
>>
>> gcc/testsuite/
>>
>> 2011-08-10  Enkovich Ilya  <ilya.enkovich@intel.com>
>>
>>        * gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
>>        --param tree-reassoc-width=1.
>>
>>        * gcc.dg/tree-ssa/reassoc-24.c: New test.
>>        * gcc.dg/tree-ssa/reassoc-25.c: Likewise.
>>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-08-10 15:19                       ` Ilya Enkovich
  2011-08-19  9:09                         ` Ilya Enkovich
@ 2011-08-30 13:16                         ` Richard Guenther
  2011-08-31 15:38                           ` Ilya Enkovich
  1 sibling, 1 reply; 47+ messages in thread
From: Richard Guenther @ 2011-08-30 13:16 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: Michael Meissner, gcc-patches

On Wed, Aug 10, 2011 at 4:49 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> Hello,
>
> Here is a new version of the patch. Changes from the previous version
> (http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02240.html):
>  - updated to trunk
>  - TODO_remove_unused_locals flag was removed from todo_flags_finish
> of reassoc pass
>
> Bootstrapped and checked on x86_64-linux.

The fact that you have to adjust gcc.dg/tree-ssa/pr38533.c looks problematic
to me.  Can you investigate the problem report, look at the generated
code with the atom default of the param and see whether it's still
reasonably "fixed" (maybe you'd already done this)?

+  /* Check if we may use less width and still compute sequence for
+     the same time.  It will allow us to reduce registers usage.  */
+  while (width > 1
+        && get_required_cycles (ops_num, width - 1) == cycles_best)
+    width--;

I suppose get_required_cycles () is monotonic in width?  Would it
make sense to binary search the best width then (I realize the
default is 1 and the only target differing has 2, but ...)?  Maybe at
least add a comment to this effect?  Or not decrement by one
but instead divide by two on each iteration (which is the same for 1 and 2 ...)?
It's also all a mapping that is constant - we should be able to
pre-compute it for all values, eventually "compressing" it to a
much simpler formula width = f (cpu_width, ops_num)?
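
E.g. something like the following (completely untested sketch with a
hypothetical find_min_width helper, assuming get_required_cycles is
non-increasing in the width):

static int
find_min_width (int ops_num, int max_width, int cycles_best)
{
  /* Smallest width in [1, max_width] that still reaches cycles_best.  */
  int lo = 1, hi = max_width;
  while (lo < hi)
    {
      int mid = (lo + hi) / 2;
      if (get_required_cycles (ops_num, mid) == cycles_best)
        hi = mid;
      else
        lo = mid + 1;
    }
  return lo;
}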

+static void
+rewrite_expr_tree_parallel (gimple stmt, int width,
+                           VEC(operand_entry_t, heap) * ops)
+{
+  enum tree_code opcode = gimple_assign_rhs_code (stmt);
+  int op_num = VEC_length (operand_entry_t, ops);
+  int stmt_num = op_num - 1;
+  gimple *stmts = XNEWVEC (gimple, stmt_num);

use XALLOCAVEC here and remove the free call.

+      if (ready_stmts_end == 0 &&
+         (i - stmt_index >= width || op_index < 1))

&&s go to the next line, not at the end

+      else
+       {
+         tree var = create_tmp_reg (TREE_TYPE (last_rhs1), "reassoc");
+         add_referenced_var (var);

please re-use a common variable for the whole chain, they will
all necessarily have compatible type.  Creating and maintaining
decls is expensive (also avoid giving them names, thus just
pass NULL instead of "reassoc").

+               {
+                 enum machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
+                 int width = get_reassociation_width (ops, rhs_code, mode);

as you are passing in the mode here, consider passing the number
of operands in ops as well instead of ops.  This way we might consider
moving this whole function to the target or to generic code.  Similar
move the dump printing in get_reassociation_width

+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file,
+            "Width = %d was chosen for reassociation\n", width);

here to the caller.

Sorry for taking so long to review this again.  I'll promise to be quick
once you post an update.

Thanks,
Richard.

> Thanks,
> Ilya
> ---
> gcc/
>
> 2011-08-10  Enkovich Ilya  <ilya.enkovich@intel.com>
>
>        PR middle-end/44382
>        * target.def (reassociation_width): New hook.
>
>        * doc/tm.texi.in (reassociation_width): Likewise.
>
>        * doc/tm.texi (reassociation_width): Likewise.
>
>        * doc/invoke.texi (tree-reassoc-width): New param documented.
>
>        * hooks.h (hook_int_uint_mode_1): New default hook.
>
>        * hooks.c (hook_int_uint_mode_1): Likewise.
>
>        * config/i386/i386.h (ix86_tune_indices): Add
>        X86_TUNE_REASSOC_INT_TO_PARALLEL and
>        X86_TUNE_REASSOC_FP_TO_PARALLEL.
>
>        (TARGET_REASSOC_INT_TO_PARALLEL): New.
>        (TARGET_REASSOC_FP_TO_PARALLEL): Likewise.
>
>        * config/i386/i386.c (initial_ix86_tune_features): Add
>        X86_TUNE_REASSOC_INT_TO_PARALLEL and
>        X86_TUNE_REASSOC_FP_TO_PARALLEL.
>
>        (ix86_reassociation_width) implementation of
>        new hook for i386 target.
>
>        * params.def (PARAM_TREE_REASSOC_WIDTH): New param added.
>
>        * tree-ssa-reassoc.c (get_required_cycles): New function.
>        (get_reassociation_width): Likewise.
>        (swap_ops_for_binary_stmt): Likewise.
>        (rewrite_expr_tree_parallel): Likewise.
>
>        (rewrite_expr_tree): Refactored. Part of code moved into
>        swap_ops_for_binary_stmt.
>
>        (reassociate_bb): Now checks reassociation width to be used
>        and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
>        if needed.
>
> gcc/testsuite/
>
> 2011-08-10  Enkovich Ilya  <ilya.enkovich@intel.com>
>
>        * gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
>        --param tree-reassoc-width=1.
>
>        * gcc.dg/tree-ssa/reassoc-24.c: New test.
>        * gcc.dg/tree-ssa/reassoc-25.c: Likewise.
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-08-30 13:16                         ` Richard Guenther
@ 2011-08-31 15:38                           ` Ilya Enkovich
  2011-09-01  9:24                             ` Richard Guenther
  0 siblings, 1 reply; 47+ messages in thread
From: Ilya Enkovich @ 2011-08-31 15:38 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3941 bytes --]

Hello Richard,

Thanks for the review!

> The fact that you have to adjust gcc.dg/tree-ssa/pr38533.c looks problematic
> to me.  Can you investigate the problem report, look at the generated
> code with the atom default of the param and see whether it's still
> reasonably "fixed" (maybe you'd already done this)?
This test was created as a regression test for a problem in
linearize_expr_tree, which moves all statements down to the first
modified one during reassociation, increasing register pressure. The
test has a lot of definitions which are all ORed together like this:
  def r1
  def r2
  s = r1 or r2
  def r3
  s = s or r3
  def r4
  s = s or r4
  ...
and it checks that register pressure is not increased after
reassociation by a simple scan for two sequential defs. If we use a
reassociation width higher than 1 then the test fails because we need
to increase register pressure to get parallelism and end up with code
like this:
  def r1
  def r2
  def r3
  t1 = r1 or r2
  s = s or r3
  def r4
  def r5
  s = s or t1
  t1 = r4 or r5
  ...
So, I had to fix the test.
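
The trade-off is easier to see on a minimal standalone example (plain C
with made-up names, not the actual GIMPLE): the parallel form cuts the
dependence height of the chain from three to two, at the cost of keeping
both partial sums live at the same time.

  unsigned int
  sum_linear (unsigned int a, unsigned int b, unsigned int c, unsigned int d)
  {
    unsigned int t1 = a + b;  /* each add depends on the previous one */
    unsigned int t2 = t1 + c;
    return t2 + d;            /* dependence height 3 */
  }

  unsigned int
  sum_parallel (unsigned int a, unsigned int b, unsigned int c, unsigned int d)
  {
    unsigned int t1 = a + b;  /* t1 and t2 are independent and may */
    unsigned int t2 = c + d;  /* issue in the same cycle           */
    return t1 + t2;           /* dependence height 2 */
  }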
>
> +  /* Check if we may use less width and still compute sequence for
> +     the same time.  It will allow us to reduce registers usage.  */
> +  while (width > 1
> +        && get_required_cycles (ops_num, width - 1) == cycles_best)
> +    width--;
>
> I suppose get_required_cycles () is monotonic in width?  Would it
> make sense to binary search the best width then (I realize the
> default is 1 and the only target differing has 2, but ...)?  Maybe at
> least add a comment to this effect?  Or not decrement by one
> but instead divide by two on each iteration (which is the same for 1 and 2 ...)?
> It's also all a mapping that is constant - we should be able to
> pre-compute it for all values, eventually "compressing" it to a
> much simpler formula width = f (cpu_width, ops_num)?
I replaced the sequential search with a binary search. I did not pay
much attention to this function because I do not think it is really time
consuming compared to other parts of the reassociation phase. Am I too
optimistic? If you think it might significantly affect compile time I
can introduce a table of pre-computed values (or do that later as a
separate fix).
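
For reference, here is a small standalone program reproducing the cycle
model of get_required_cycles (the two log2 helpers are simple stand-ins
for GCC's exact_log2/floor_log2). For ops_num = 4 it prints 3, 2 and 2
cycles for widths 1, 2 and 3, which is why width 2 ends up being chosen
in reassoc-25.c.

  #include <stdio.h>

  static int
  my_floor_log2 (unsigned int x)
  {
    int res = -1;
    while (x)
      {
        x >>= 1;
        res++;
      }
    return res;
  }

  static int
  my_exact_log2 (unsigned int x)
  {
    /* Log2 of exact powers of two, -1 otherwise.  */
    return (x && !(x & (x - 1))) ? my_floor_log2 (x) : -1;
  }

  static int
  required_cycles (int ops_num, int cpu_width)
  {
    /* While more than 2 * cpu_width operands remain, cpu_width
       operations retire per cycle; the rest is then halved each cycle.  */
    int res = ops_num / (2 * cpu_width);
    unsigned int rest = (unsigned int) (ops_num - res * cpu_width);
    int elog = my_exact_log2 (rest);
    return res + (elog >= 0 ? elog : my_floor_log2 (rest) + 1);
  }

  int
  main (void)
  {
    int width;
    for (width = 1; width <= 3; width++)
      printf ("ops_num = 4, width = %d -> %d cycles\n",
              width, required_cycles (4, width));
    return 0;
  }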

I made all other fixes as you suggested.

Bootstrapped and checked on x86_64-linux.

Thanks,
Ilya
---
gcc/

2011-08-31  Enkovich Ilya  <ilya.enkovich@intel.com>

	PR middle-end/44382
	* target.def (reassociation_width): New hook.

	* doc/tm.texi.in (reassociation_width): Likewise.

	* doc/tm.texi (reassociation_width): Likewise.

	* doc/invoke.texi (tree-reassoc-width): New param documented.

	* hooks.h (hook_int_uint_mode_1): New default hook.

	* hooks.c (hook_int_uint_mode_1): Likewise.

	* config/i386/i386.h (ix86_tune_indices): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(TARGET_REASSOC_INT_TO_PARALLEL): New.
	(TARGET_REASSOC_FP_TO_PARALLEL): Likewise.

	* config/i386/i386.c (initial_ix86_tune_features): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(ix86_reassociation_width): Implementation of
	new hook for i386 target.

	* params.def (PARAM_TREE_REASSOC_WIDTH): New param added.

	* tree-ssa-reassoc.c (get_required_cycles): New function.
	(get_reassociation_width): Likewise.
	(swap_ops_for_binary_stmt): Likewise.
	(rewrite_expr_tree_parallel): Likewise.

	(rewrite_expr_tree): Refactored. Part of code moved into
	swap_ops_for_binary_stmt.

	(reassociate_bb): Now checks reassociation width to be used
	and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
	if needed.

gcc/testsuite/

2011-08-31  Enkovich Ilya  <ilya.enkovich@intel.com>

	* gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
	--param tree-reassoc-width=1.

	* gcc.dg/tree-ssa/reassoc-24.c: New test.
	* gcc.dg/tree-ssa/reassoc-25.c: Likewise.

[-- Attachment #2: PR44382.diff --]
[-- Type: application/octet-stream, Size: 19328 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 48b9be0..8d9c513 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2168,7 +2168,15 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
 
   /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
      the auto-vectorizer.  */
-  m_BDVER
+  m_BDVER,
+
+  /* X86_TUNE_REASSOC_INT_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of integer computation.  */
+  m_ATOM,
+
+  /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of fp computation.  */
+  m_ATOM
 };
 
 /* Feature tests against the various architecture variations.  */
@@ -34823,6 +34831,8 @@ ix86_enum_va_list (int idx, const char **pname, tree *ptree)
 #define TARGET_SCHED_DISPATCH has_dispatch
 #undef TARGET_SCHED_DISPATCH_DO
 #define TARGET_SCHED_DISPATCH_DO do_dispatch
+#undef TARGET_SCHED_REASSOCIATION_WIDTH
+#define TARGET_SCHED_REASSOCIATION_WIDTH ix86_reassociation_width
 
 /* The size of the dispatch window is the total number of bytes of
    object code allowed in a window.  */
@@ -35620,6 +35630,25 @@ has_dispatch (rtx insn, int action)
   return false;
 }
 
+/* Implementation of reassociation_width target hook used by
+   the reassoc phase to identify the parallelism level in a reassociated
+   tree.  The statement's tree_code is passed in OPC.  The arguments'
+   type is passed in MODE.  */
+
+static int
+ix86_reassociation_width (unsigned int opc ATTRIBUTE_UNUSED,
+			  enum machine_mode mode)
+{
+  int res = 1;
+
+  if (INTEGRAL_MODE_P (mode) && TARGET_REASSOC_INT_TO_PARALLEL)
+    res = 2;
+  else if (FLOAT_MODE_P (mode) && TARGET_REASSOC_FP_TO_PARALLEL)
+    res = 2;
+
+  return res;
+}
+
 /* ??? No autovectorization into MMX or 3DNOW until we can reliably
    place emms and femms instructions.  */
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 47442a0..21f5075 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -318,6 +318,8 @@ enum ix86_tune_indices {
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
   X86_TUNE_AVX128_OPTIMAL,
+  X86_TUNE_REASSOC_INT_TO_PARALLEL,
+  X86_TUNE_REASSOC_FP_TO_PARALLEL,
 
   X86_TUNE_LAST
 };
@@ -416,6 +418,11 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
 #define TARGET_AVX128_OPTIMAL \
 	ix86_tune_features[X86_TUNE_AVX128_OPTIMAL]
+#define TARGET_REASSOC_INT_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_INT_TO_PARALLEL]
+#define TARGET_REASSOC_FP_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_FP_TO_PARALLEL]
+
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
   X86_ARCH_CMOVE,		/* || TARGET_SSE */
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 62a841c..ba1e095 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -9075,6 +9075,11 @@ The smallest number of different values for which it is best to use a
 jump-table instead of a tree of conditional branches.  If the value is
 0, use the default for the machine.  The default is 0.
 
+@item tree-reassoc-width
+Set the maximum number of instructions executed in parallel in
+a reassociated tree.  This parameter overrides the target-dependent
+heuristic used by default if it has a non-zero value.
+
 @end table
 @end table
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 7364aa1..335c1d1 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6840,6 +6840,11 @@ the order of instructions is important for correctness when scheduling, but
 also the latencies of operations.
 @end deftypevr
 
+@deftypefn {Target Hook} int TARGET_SCHED_REASSOCIATION_WIDTH (unsigned int @var{opc}, enum machine_mode @var{mode})
+This hook is called by the tree reassociator to determine the level of
+parallelism required in an output calculation chain.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 4535fd6..6783826 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -6774,6 +6774,8 @@ in its second parameter.
 
 @hook TARGET_SCHED_EXPOSED_PIPELINE
 
+@hook TARGET_SCHED_REASSOCIATION_WIDTH
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/hooks.c b/gcc/hooks.c
index 9025183..1ba44f9 100644
--- a/gcc/hooks.c
+++ b/gcc/hooks.c
@@ -161,6 +161,13 @@ default_can_output_mi_thunk_no_vcall (const_tree a ATTRIBUTE_UNUSED,
 }
 
 int
+hook_int_uint_mode_1 (unsigned int a ATTRIBUTE_UNUSED,
+		      enum machine_mode b ATTRIBUTE_UNUSED)
+{
+  return 1;
+}
+
+int
 hook_int_const_tree_0 (const_tree a ATTRIBUTE_UNUSED)
 {
   return 0;
diff --git a/gcc/hooks.h b/gcc/hooks.h
index 156d708..54ace24 100644
--- a/gcc/hooks.h
+++ b/gcc/hooks.h
@@ -68,6 +68,7 @@ extern void hook_void_tree_treeptr (tree, tree *);
 extern void hook_void_int_int (int, int);
 extern void hook_void_gcc_optionsp (struct gcc_options *);
 
+extern int hook_int_uint_mode_1 (unsigned int, enum machine_mode);
 extern int hook_int_const_tree_0 (const_tree);
 extern int hook_int_const_tree_const_tree_1 (const_tree, const_tree);
 extern int hook_int_rtx_0 (rtx);
diff --git a/gcc/params.def b/gcc/params.def
index e8372ed..c376c80 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -914,6 +914,13 @@ DEFPARAM (PARAM_ALLOW_STORE_DATA_RACES,
 	  "Allow new data races on stores to be introduced",
 	  1, 0, 1)
 
+/* Reassociation width to be used by tree reassoc optimization.  */
+DEFPARAM (PARAM_TREE_REASSOC_WIDTH,
+	  "tree-reassoc-width",
+	  "Set the maximum number of instructions executed in parallel in "
+	  "reassociated tree. If 0, use the target dependent heuristic.",
+	  0, 0, 0)
+
 
 /*
 Local variables:
diff --git a/gcc/target.def b/gcc/target.def
index 857f463..1e09ba7 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -913,6 +913,16 @@ the order of instructions is important for correctness when scheduling, but\n\
 also the latencies of operations.",
 bool, false)
 
+/* The following member value is a function that returns number
+   of operations reassociator should try to put in parallel for
+   statements of the given type.  By default 1 is used.  */
+DEFHOOK
+(reassociation_width,
+"This hook is called by tree reassociator to determine a level of\n\
+parallelism required in output calculations chain.",
+int, (unsigned int opc, enum machine_mode mode),
+hook_int_uint_mode_1)
+
 HOOK_VECTOR_END (sched)
 
 /* Functions relating to vectorization.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
index e787227..a80a5a8 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
@@ -1,6 +1,6 @@
 /* PR middle-end/38533 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-reassoc1" } */
+/* { dg-options "-O2 --param tree-reassoc-width=1 -fdump-tree-reassoc1" } */
 
 #define A asm volatile ("" : "=r" (f) : "0" (0)); e |= f;
 #define B A A A A A A A A A A A
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
new file mode 100644
index 0000000..c871628
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param tree-reassoc-width=2 -fdump-tree-reassoc1" } */
+
+unsigned int
+foo (void)
+{
+  unsigned int a = 0;
+  unsigned int b;
+
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+
+  return a;
+}
+
+/* Verify there are two pairs of __asm__ statements with no
+   intervening stmts.  */
+/* { dg-final { scan-tree-dump-times "__asm__\[^;\n]*;\n *__asm__" 2 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
new file mode 100644
index 0000000..4ff66ef
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param tree-reassoc-width=3 -fdump-tree-reassoc1-details" } */
+
+unsigned int
+foo (int a, int b, int c, int d)
+{
+  unsigned int s = 0;
+
+  s += a;
+  s += b;
+  s += c;
+  s += d;
+
+  return s;
+}
+
+/* Verify reassociation width was chosen to be 2.  */
+/* { dg-final { scan-tree-dump-times "Width = 2" 1 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index 51f7ef8..65f6387 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -40,6 +40,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "pointer-set.h"
 #include "cfgloop.h"
 #include "flags.h"
+#include "target.h"
+#include "params.h"
 
 /*  This is a simple global reassociation pass.  It is, in part, based
     on the LLVM pass of the same name (They do some things more/less
@@ -1617,6 +1619,62 @@ remove_visited_stmt_chain (tree var)
     }
 }
 
+/* This function checks three consecutive operands in
+   the passed operands vector OPS starting from OPINDEX and
+   swaps two operands if it is profitable for the binary operation
+   consuming the OPINDEX + 1 and OPINDEX + 2 operands.
+
+   We pair ops with the same rank if possible.
+
+   The alternative we try is to see if STMT is a destructive
+   update style statement, which is like:
+   b = phi (a, ...)
+   a = c + b;
+   In that case, we want to use the destructive update form to
+   expose the possible vectorizer sum reduction opportunity.
+   In that case, the third operand will be the phi node. This
+   check is not performed if STMT is null.
+
+   We could, of course, try to be better as noted above, and do a
+   lot of work to try to find these opportunities in >3 operand
+   cases, but it is unlikely to be worth it.  */
+
+static void
+swap_ops_for_binary_stmt (VEC(operand_entry_t, heap) * ops,
+			  unsigned int opindex, gimple stmt)
+{
+  operand_entry_t oe1, oe2, oe3;
+
+  oe1 = VEC_index (operand_entry_t, ops, opindex);
+  oe2 = VEC_index (operand_entry_t, ops, opindex + 1);
+  oe3 = VEC_index (operand_entry_t, ops, opindex + 2);
+
+  if ((oe1->rank == oe2->rank
+       && oe2->rank != oe3->rank)
+      || (stmt && is_phi_for_stmt (stmt, oe3->op)
+	  && !is_phi_for_stmt (stmt, oe1->op)
+	  && !is_phi_for_stmt (stmt, oe2->op)))
+    {
+      struct operand_entry temp = *oe3;
+      oe3->op = oe1->op;
+      oe3->rank = oe1->rank;
+      oe1->op = temp.op;
+      oe1->rank= temp.rank;
+    }
+  else if ((oe1->rank == oe3->rank
+	    && oe2->rank != oe3->rank)
+	   || (stmt && is_phi_for_stmt (stmt, oe2->op)
+	       && !is_phi_for_stmt (stmt, oe1->op)
+	       && !is_phi_for_stmt (stmt, oe3->op)))
+    {
+      struct operand_entry temp = *oe2;
+      oe2->op = oe1->op;
+      oe2->rank = oe1->rank;
+      oe1->op = temp.op;
+      oe1->rank= temp.rank;
+    }
+}
+
 /* Recursively rewrite our linearized statements so that the operators
    match those in OPS[OPINDEX], putting the computation in rank
    order.  */
@@ -1629,53 +1687,10 @@ rewrite_expr_tree (gimple stmt, unsigned int opindex,
   tree rhs2 = gimple_assign_rhs2 (stmt);
   operand_entry_t oe;
 
-  /* If we have three operands left, then we want to make sure the one
-     that gets the double binary op are the ones with the same rank.
-
-     The alternative we try is to see if this is a destructive
-     update style statement, which is like:
-     b = phi (a, ...)
-     a = c + b;
-     In that case, we want to use the destructive update form to
-     expose the possible vectorizer sum reduction opportunity.
-     In that case, the third operand will be the phi node.
-
-     We could, of course, try to be better as noted above, and do a
-     lot of work to try to find these opportunities in >3 operand
-     cases, but it is unlikely to be worth it.  */
+  /* If we have three operands left, then we want to make sure the ones
+     that get the double binary op are chosen wisely.  */
   if (opindex + 3 == VEC_length (operand_entry_t, ops))
-    {
-      operand_entry_t oe1, oe2, oe3;
-
-      oe1 = VEC_index (operand_entry_t, ops, opindex);
-      oe2 = VEC_index (operand_entry_t, ops, opindex + 1);
-      oe3 = VEC_index (operand_entry_t, ops, opindex + 2);
-
-      if ((oe1->rank == oe2->rank
-	   && oe2->rank != oe3->rank)
-	  || (is_phi_for_stmt (stmt, oe3->op)
-	      && !is_phi_for_stmt (stmt, oe1->op)
-	      && !is_phi_for_stmt (stmt, oe2->op)))
-	{
-	  struct operand_entry temp = *oe3;
-	  oe3->op = oe1->op;
-	  oe3->rank = oe1->rank;
-	  oe1->op = temp.op;
-	  oe1->rank= temp.rank;
-	}
-      else if ((oe1->rank == oe3->rank
-		&& oe2->rank != oe3->rank)
-	       || (is_phi_for_stmt (stmt, oe2->op)
-		   && !is_phi_for_stmt (stmt, oe1->op)
-		   && !is_phi_for_stmt (stmt, oe3->op)))
-	{
-	  struct operand_entry temp = *oe2;
-	  oe2->op = oe1->op;
-	  oe2->rank = oe1->rank;
-	  oe1->op = temp.op;
-	  oe1->rank= temp.rank;
-	}
-    }
+    swap_ops_for_binary_stmt (ops, opindex, stmt);
 
   /* The final recursion case for this function is that you have
      exactly two operations left.
@@ -1760,6 +1775,175 @@ rewrite_expr_tree (gimple stmt, unsigned int opindex,
   rewrite_expr_tree (SSA_NAME_DEF_STMT (rhs1), opindex + 1, ops, moved);
 }
 
+/* Find out how many cycles we need to compute a statement chain.
+   OPS_NUM holds the number of statements in a chain.  CPU_WIDTH is the
+   maximum number of independent statements we may execute per cycle.  */
+
+static int
+get_required_cycles (int ops_num, int cpu_width)
+{
+  int res;
+  int elog;
+  unsigned int rest;
+
+  /* While we have more than 2 * cpu_width operands
+     we may reduce number of operands by cpu_width
+     per cycle.  */
+  res = ops_num / (2 * cpu_width);
+
+  /* The remaining operand count may be halved each cycle
+     until we have only one operand.  */
+  rest = (unsigned)(ops_num - res * cpu_width);
+  elog = exact_log2 (rest);
+  if (elog >= 0)
+    res += elog;
+  else
+    res += floor_log2 (rest) + 1;
+
+  return res;
+}
+
+/* Returns an optimal number of registers to use for computation of
+   given statements.  */
+
+static int
+get_reassociation_width (int ops_num, enum tree_code opc,
+			 enum machine_mode mode)
+{
+  int param_width = PARAM_VALUE (PARAM_TREE_REASSOC_WIDTH);
+  int width;
+  int width_min;
+  int cycles_best;
+
+  if (param_width > 0)
+    width = param_width;
+  else
+    width = targetm.sched.reassociation_width (opc, mode);
+
+  if (width == 1)
+    return width;
+
+  /* Get the minimal time required for sequence computation.  */
+  cycles_best = get_required_cycles (ops_num, width);
+
+  /* Check if we may use less width and still compute sequence for
+     the same time.  It will allow us to reduce registers usage.  */
+  width_min = 1;
+  while (width > width_min)
+    {
+      int width_mid = (width + width_min) / 2;
+
+      if (get_required_cycles (ops_num, width_mid) == cycles_best)
+	width = width_mid;
+      else if (width_min < width_mid)
+	width_min = width_mid;
+      else
+	break;
+    }
+
+  return width;
+}
+
+/* Rewrite our linearized statements so that the operators
+   match those in OPS, putting the computation in rank
+   order and trying to allow operations to be executed in
+   parallel.  */
+
+static void
+rewrite_expr_tree_parallel (gimple stmt, int width,
+			    VEC(operand_entry_t, heap) * ops)
+{
+  enum tree_code opcode = gimple_assign_rhs_code (stmt);
+  int op_num = VEC_length (operand_entry_t, ops);
+  int stmt_num = op_num - 1;
+  gimple *stmts = XALLOCAVEC (gimple, stmt_num);
+  int op_index = op_num - 1;
+  int stmt_index = 0;
+  int ready_stmts_end = 0;
+  int i = 0;
+  tree last_rhs1 = gimple_assign_rhs1 (stmt);
+  tree lhs_var;
+
+  /* We start expression rewriting from the top statements.
+     So, in this loop we create a full list of statements
+     we will work with.  */
+  stmts[stmt_num - 1] = stmt;
+  for (i = stmt_num - 2; i >= 0; i--)
+    stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
+
+  lhs_var = create_tmp_reg (TREE_TYPE (last_rhs1), NULL);
+  add_referenced_var (lhs_var);
+
+  for (i = 0; i < stmt_num; i++)
+    {
+      tree op1, op2;
+
+      /* Determine whether we should use results of
+	 already handled statements or not.  */
+      if (ready_stmts_end == 0
+	  && (i - stmt_index >= width || op_index < 1))
+	ready_stmts_end = i;
+
+      /* Now we choose operands for the next statement.  A nonzero
+	 value in ready_stmts_end means that we should use
+	 the result of already generated statements as a new operand.  */
+      if (ready_stmts_end > 0)
+	{
+	  op1 = gimple_assign_lhs (stmts[stmt_index++]);
+	  if (ready_stmts_end > stmt_index)
+	    op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	  else if (op_index >= 0)
+	    op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  else
+	    {
+	      gcc_assert (stmt_index < i);
+	      op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	    }
+
+	  if (stmt_index >= ready_stmts_end)
+	    ready_stmts_end = 0;
+	}
+      else
+	{
+	  if (op_index > 1)
+	    swap_ops_for_binary_stmt (ops, op_index - 2, NULL);
+	  op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  op1 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	}
+
+      /* If we emit the last statement then we should put
+	 operands into the last statement.  It will also
+	 break the loop.  */
+      if (op_index < 0 && stmt_index == i)
+	i = stmt_num - 1;
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Transforming ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+
+      /* We keep original statement only for the last one.  All
+	 others are recreated.  */
+      if (i == stmt_num - 1)
+	{
+	  gimple_assign_set_rhs1 (stmts[i], op1);
+	  gimple_assign_set_rhs2 (stmts[i], op2);
+	  update_stmt (stmts[i]);
+	}
+      else
+	stmts[i] = build_and_add_sum (lhs_var, op1, op2, opcode);
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+    }
+
+  remove_visited_stmt_chain (last_rhs1);
+}
+
 /* Transform STMT, which is really (A +B) + (C + D) into the left
    linear form, ((A+B)+C)+D.
    Recurse on D if necessary.  */
@@ -2282,7 +2466,21 @@ reassociate_bb (basic_block bb)
 		    }
 		}
 	      else
-		rewrite_expr_tree (stmt, 0, ops, false);
+		{
+		  enum machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
+		  int ops_num = VEC_length (operand_entry_t, ops);
+		  int width = get_reassociation_width (ops_num, rhs_code, mode);
+
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    fprintf (dump_file,
+			     "Width = %d was chosen for reassociation\n", width);
+
+		  if (width > 1
+		      && VEC_length (operand_entry_t, ops) > 3)
+		    rewrite_expr_tree_parallel (stmt, width, ops);
+		  else
+		    rewrite_expr_tree (stmt, 0, ops, false);
+		}
 
 	      VEC_free (operand_entry_t, heap, ops);
 	    }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-08-31 15:38                           ` Ilya Enkovich
@ 2011-09-01  9:24                             ` Richard Guenther
  2011-09-01 10:28                               ` Ilya Enkovich
  0 siblings, 1 reply; 47+ messages in thread
From: Richard Guenther @ 2011-09-01  9:24 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: gcc-patches

On Wed, Aug 31, 2011 at 2:17 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> Hello Richard,
>
> Thanks for the review!
>
>> The fact that you have to adjust gcc.dg/tree-ssa/pr38533.c looks problematic
>> to me.  Can you investigate the problem report, look at the geneated
>> code with the atom default of the param and see whether it's still
>> reasonably "fixed" (maybe you'd already done this)?
> This test was created as a regression test for a problem in
> linearize_expr_tree, which moves all statements down to the first
> modified one during reassociation, increasing register pressure. The test
> has a lot of definitions which are all ORed together like this:
>  def r1
>  def r2
>  s = r1 or r2
>  def r3
>  s = s or r3
>  def r4
>  s = s or r4
>  ...
> and it checks that register pressure is not increased after
> reassociation by a simple scan for two sequential defs. If we use
> a reassociation width higher than 1, the test fails because we need
> to increase register pressure to get parallelism and finally get code
> like this:
>  def r1
>  def r2
>  def r3
>  t1 = r1 or r2
>  s = s or r3
>  def r4
>  def r5
>  s = s or t1
>  t1 = r4 or r5
>  ...
> So, I had to fix the test.

Ok.  Thanks for explaining.

>> +  /* Check if we may use less width and still compute sequence for
>> +     the same time.  It will allow us to reduce registers usage.  */
>> +  while (width > 1
>> +        && get_required_cycles (ops_num, width - 1) == cycles_best)
>> +    width--;
>>
>> I suppose get_required_cycles () is monotonic in width?  Would it
>> make sense to binary search the best width then (I realize the
>> default is 1 and the only target differing has 2, but ...)?  Maybe at
>> least add a comment to this effect?  Or not decrement by one
>> but instead divide by two on each iteration (which is the same for 1 and 2 ...)?
>> It's also all a mapping that is constant - we should be able to
>> pre-compute it for all values, eventually "compressing" it to a
>> much simpler formula width = f (cpu_width, ops_num)?
> I replaced the sequential search with a binary search. I did not pay much
> attention to this function because I do not think it is really time
> consuming compared to other parts of the reassociation phase. Am I too
> optimistic? If you think it might significantly affect compile time I
> can introduce a table of pre-computed values (or do it later as a
> separate fix).

+  /* Get the minimal time required for sequence computation.  */
+  cycles_best = get_required_cycles (ops_num, width);
+
+  /* Check if we may use less width and still compute sequence for
+     the same time.  It will allow us to reduce registers usage.  */
+  width_min = 1;
+  while (width > width_min)
+    {
+      int width_mid = (width + width_min) / 2;
+
+      if (get_required_cycles (ops_num, width_mid) == cycles_best)
+       width = width_mid;
+      else if (width_min < width_mid)
+       width_min = width_mid;
+      else
+       break;
+    }

this seems to not allow cycles_best to drop with lower width, but
that it can't should be an implementation detail of get_required_cycles.
To make it not so, can you add a comment before the loop, like

  /* get_required_cycles is monotonically increasing with lower width
     so we can perform a binary search for the minimal width that still
     results in the optimal cycle count.  */
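
For reference, a minimal standalone C sketch of the width-selection model
being discussed; it re-implements get_required_cycles and the binary search
outside of GCC, local_floor_log2/local_exact_log2 stand in for GCC's
floor_log2/exact_log2 helpers, and the example values mirror reassoc-25.c.
This is only an illustration, not part of the patch:

#include <stdio.h>

static int
local_floor_log2 (unsigned int x)
{
  int l = -1;
  while (x)
    {
      x >>= 1;
      l++;
    }
  return l;
}

static int
local_exact_log2 (unsigned int x)
{
  /* Log2 of X if X is a power of two, -1 otherwise.  */
  return (x && !(x & (x - 1))) ? local_floor_log2 (x) : -1;
}

static int
required_cycles (int ops_num, int cpu_width)
{
  /* Same model as get_required_cycles above.  */
  int res = ops_num / (2 * cpu_width);
  unsigned int rest = (unsigned) (ops_num - res * cpu_width);
  int elog = local_exact_log2 (rest);
  return res + (elog >= 0 ? elog : local_floor_log2 (rest) + 1);
}

int
main (void)
{
  int ops_num = 5;     /* s = 0 + a + b + c + d, as in reassoc-25.c.  */
  int width = 3;       /* As if --param tree-reassoc-width=3 were given.  */
  int width_min = 1;
  int cycles_best = required_cycles (ops_num, width);

  /* Binary search for the smallest width that still gives the
     optimal cycle count.  */
  while (width > width_min)
    {
      int width_mid = (width + width_min) / 2;

      if (required_cycles (ops_num, width_mid) == cycles_best)
        width = width_mid;
      else if (width_min < width_mid)
        width_min = width_mid;
      else
        break;
    }

  printf ("chosen width = %d\n", width);   /* Prints 2.  */
  return 0;
}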

> I made all other fixes as you suggested.

With the above change the non-x86 specific parts are ok.  Please get
approval for them from an x86 maintainer.

Thanks,
Richard.


> Bootstrapped and checked on x86_64-linux.
>
> Thanks,
> Ilya
> ---
> gcc/
>
> 2011-08-31  Enkovich Ilya  <ilya.enkovich@intel.com>
>
>        PR middle-end/44382
>        * target.def (reassociation_width): New hook.
>
>        * doc/tm.texi.in (reassociation_width): Likewise.
>
>        * doc/tm.texi (reassociation_width): Likewise.
>
>        * doc/invoke.texi (tree-reassoc-width): New param documented.
>
>        * hooks.h (hook_int_uint_mode_1): New default hook.
>
>        * hooks.c (hook_int_uint_mode_1): Likewise.
>
>        * config/i386/i386.h (ix86_tune_indices): Add
>        X86_TUNE_REASSOC_INT_TO_PARALLEL and
>        X86_TUNE_REASSOC_FP_TO_PARALLEL.
>
>        (TARGET_REASSOC_INT_TO_PARALLEL): New.
>        (TARGET_REASSOC_FP_TO_PARALLEL): Likewise.
>
>        * config/i386/i386.c (initial_ix86_tune_features): Add
>        X86_TUNE_REASSOC_INT_TO_PARALLEL and
>        X86_TUNE_REASSOC_FP_TO_PARALLEL.
>
>        (ix86_reassociation_width) implementation of
>        new hook for i386 target.
>
>        * params.def (PARAM_TREE_REASSOC_WIDTH): New param added.
>
>        * tree-ssa-reassoc.c (get_required_cycles): New function.
>        (get_reassociation_width): Likewise.
>        (swap_ops_for_binary_stmt): Likewise.
>        (rewrite_expr_tree_parallel): Likewise.
>
>        (rewrite_expr_tree): Refactored. Part of code moved into
>        swap_ops_for_binary_stmt.
>
>        (reassociate_bb): Now checks reassociation width to be used
>        and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
>        if needed.
>
> gcc/testsuite/
>
> 2011-08-31  Enkovich Ilya  <ilya.enkovich@intel.com>
>
>        * gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
>        --param tree-reassoc-width=1.
>
>        * gcc.dg/tree-ssa/reassoc-24.c: New test.
>        * gcc.dg/tree-ssa/reassoc-25.c: Likewise.
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-09-01  9:24                             ` Richard Guenther
@ 2011-09-01 10:28                               ` Ilya Enkovich
  2011-09-02 12:53                                 ` Uros Bizjak
  0 siblings, 1 reply; 47+ messages in thread
From: Ilya Enkovich @ 2011-09-01 10:28 UTC (permalink / raw)
  To: Richard Guenther, Uros Bizjak; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2181 bytes --]

>
> this seems to not allow cycles_best to drop with lower width, but
> that it can't should be an implementation detail of get_required_cycles.
> To make it not so, can you add a comment before the loop, like
>
>  /* get_required_cycles is monotonically increasing with lower width
>     so we can perform a binary search for the minimal width that still
>     results in the optimal cycle count.  */
>

Fixed. Thanks!

>
> With the above change the non-x86 specifc parts are ok.  Please get
> approval for them from a x86 maintainer.
>

Could someone please review the x86 part?

Thanks,
Ilya
---
gcc/

2011-09-01  Enkovich Ilya  <ilya.enkovich@intel.com>

	PR middle-end/44382
	* target.def (reassociation_width): New hook.

	* doc/tm.texi.in (reassociation_width): Likewise.

	* doc/tm.texi (reassociation_width): Likewise.

	* doc/invoke.texi (tree-reassoc-width): New param documented.

	* hooks.h (hook_int_uint_mode_1): New default hook.

	* hooks.c (hook_int_uint_mode_1): Likewise.

	* config/i386/i386.h (ix86_tune_indices): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(TARGET_REASSOC_INT_TO_PARALLEL): New.
	(TARGET_REASSOC_FP_TO_PARALLEL): Likewise.

	* config/i386/i386.c (initial_ix86_tune_features): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(ix86_reassociation_width) implementation of
	new hook for i386 target.

	* params.def (PARAM_TREE_REASSOC_WIDTH): New param added.

	* tree-ssa-reassoc.c (get_required_cycles): New function.
	(get_reassociation_width): Likewise.
	(swap_ops_for_binary_stmt): Likewise.
	(rewrite_expr_tree_parallel): Likewise.

	(rewrite_expr_tree): Refactored. Part of code moved into
	swap_ops_for_binary_stmt.

	(reassociate_bb): Now checks reassociation width to be used
	and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
	if needed.

gcc/testsuite/

2011-09-01  Enkovich Ilya  <ilya.enkovich@intel.com>

	* gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
	--param tree-reassoc-width=1.

	* gcc.dg/tree-ssa/reassoc-24.c: New test.
	* gcc.dg/tree-ssa/reassoc-25.c: Likewise.

[-- Attachment #2: PR44382.diff --]
[-- Type: application/octet-stream, Size: 19514 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 48b9be0..8d9c513 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2168,7 +2168,15 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
 
   /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
      the auto-vectorizer.  */
-  m_BDVER
+  m_BDVER,
+
+  /* X86_TUNE_REASSOC_INT_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of integer computation.  */
+  m_ATOM,
+
+  /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of fp computation.  */
+  m_ATOM
 };
 
 /* Feature tests against the various architecture variations.  */
@@ -34823,6 +34831,8 @@ ix86_enum_va_list (int idx, const char **pname, tree *ptree)
 #define TARGET_SCHED_DISPATCH has_dispatch
 #undef TARGET_SCHED_DISPATCH_DO
 #define TARGET_SCHED_DISPATCH_DO do_dispatch
+#undef TARGET_SCHED_REASSOCIATION_WIDTH
+#define TARGET_SCHED_REASSOCIATION_WIDTH ix86_reassociation_width
 
 /* The size of the dispatch window is the total number of bytes of
    object code allowed in a window.  */
@@ -35620,6 +35630,25 @@ has_dispatch (rtx insn, int action)
   return false;
 }
 
+/* Implementation of the reassociation_width target hook used by
+   the reassoc phase to identify the parallelism level in a reassociated
+   tree.  The statement's tree_code is passed in OPC.  The arguments'
+   type is passed in MODE.  */
+
+static int
+ix86_reassociation_width (unsigned int opc ATTRIBUTE_UNUSED,
+			  enum machine_mode mode)
+{
+  int res = 1;
+
+  if (INTEGRAL_MODE_P (mode) && TARGET_REASSOC_INT_TO_PARALLEL)
+    res = 2;
+  else if (FLOAT_MODE_P (mode) && TARGET_REASSOC_FP_TO_PARALLEL)
+    res = 2;
+
+  return res;
+}
+
 /* ??? No autovectorization into MMX or 3DNOW until we can reliably
    place emms and femms instructions.  */
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 47442a0..21f5075 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -318,6 +318,8 @@ enum ix86_tune_indices {
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
   X86_TUNE_AVX128_OPTIMAL,
+  X86_TUNE_REASSOC_INT_TO_PARALLEL,
+  X86_TUNE_REASSOC_FP_TO_PARALLEL,
 
   X86_TUNE_LAST
 };
@@ -416,6 +418,11 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
 #define TARGET_AVX128_OPTIMAL \
 	ix86_tune_features[X86_TUNE_AVX128_OPTIMAL]
+#define TARGET_REASSOC_INT_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_INT_TO_PARALLEL]
+#define TARGET_REASSOC_FP_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_FP_TO_PARALLEL]
+
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
   X86_ARCH_CMOVE,		/* || TARGET_SSE */
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 62a841c..ba1e095 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -9075,6 +9075,11 @@ The smallest number of different values for which it is best to use a
 jump-table instead of a tree of conditional branches.  If the value is
 0, use the default for the machine.  The default is 0.
 
+@item tree-reassoc-width
+Set the maximum number of instructions executed in parallel in
+a reassociated tree.  This parameter overrides the target-dependent
+heuristics used by default if it has a nonzero value.
+
 @end table
 @end table
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 7364aa1..335c1d1 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6840,6 +6840,11 @@ the order of instructions is important for correctness when scheduling, but
 also the latencies of operations.
 @end deftypevr
 
+@deftypefn {Target Hook} int TARGET_SCHED_REASSOCIATION_WIDTH (unsigned int @var{opc}, enum machine_mode @var{mode})
+This hook is called by the tree reassociator to determine the level of
+parallelism required in an output calculation chain.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 4535fd6..6783826 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -6774,6 +6774,8 @@ in its second parameter.
 
 @hook TARGET_SCHED_EXPOSED_PIPELINE
 
+@hook TARGET_SCHED_REASSOCIATION_WIDTH
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/hooks.c b/gcc/hooks.c
index 9025183..1ba44f9 100644
--- a/gcc/hooks.c
+++ b/gcc/hooks.c
@@ -161,6 +161,13 @@ default_can_output_mi_thunk_no_vcall (const_tree a ATTRIBUTE_UNUSED,
 }
 
 int
+hook_int_uint_mode_1 (unsigned int a ATTRIBUTE_UNUSED,
+		      enum machine_mode b ATTRIBUTE_UNUSED)
+{
+  return 1;
+}
+
+int
 hook_int_const_tree_0 (const_tree a ATTRIBUTE_UNUSED)
 {
   return 0;
diff --git a/gcc/hooks.h b/gcc/hooks.h
index 156d708..54ace24 100644
--- a/gcc/hooks.h
+++ b/gcc/hooks.h
@@ -68,6 +68,7 @@ extern void hook_void_tree_treeptr (tree, tree *);
 extern void hook_void_int_int (int, int);
 extern void hook_void_gcc_optionsp (struct gcc_options *);
 
+extern int hook_int_uint_mode_1 (unsigned int, enum machine_mode);
 extern int hook_int_const_tree_0 (const_tree);
 extern int hook_int_const_tree_const_tree_1 (const_tree, const_tree);
 extern int hook_int_rtx_0 (rtx);
diff --git a/gcc/params.def b/gcc/params.def
index e8372ed..c376c80 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -914,6 +914,13 @@ DEFPARAM (PARAM_ALLOW_STORE_DATA_RACES,
 	  "Allow new data races on stores to be introduced",
 	  1, 0, 1)
 
+/* Reassociation width to be used by tree reassoc optimization.  */
+DEFPARAM (PARAM_TREE_REASSOC_WIDTH,
+	  "tree-reassoc-width",
+	  "Set the maximum number of instructions executed in parallel in "
+	  "reassociated tree. If 0, use the target dependent heuristic.",
+	  0, 0, 0)
+
 
 /*
 Local variables:
diff --git a/gcc/target.def b/gcc/target.def
index 857f463..1e09ba7 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -913,6 +913,16 @@ the order of instructions is important for correctness when scheduling, but\n\
 also the latencies of operations.",
 bool, false)
 
+/* The following member value is a function that returns the number
+   of operations the reassociator should try to put in parallel for
+   statements of the given type.  By default 1 is used.  */
+DEFHOOK
+(reassociation_width,
+"This hook is called by tree reassociator to determine a level of\n\
+parallelism required in output calculations chain.",
+int, (unsigned int opc, enum machine_mode mode),
+hook_int_uint_mode_1)
+
 HOOK_VECTOR_END (sched)
 
 /* Functions relating to vectorization.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
index e787227..a80a5a8 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
@@ -1,6 +1,6 @@
 /* PR middle-end/38533 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-reassoc1" } */
+/* { dg-options "-O2 --param tree-reassoc-width=1 -fdump-tree-reassoc1" } */
 
 #define A asm volatile ("" : "=r" (f) : "0" (0)); e |= f;
 #define B A A A A A A A A A A A
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
new file mode 100644
index 0000000..c871628
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param tree-reassoc-width=2 -fdump-tree-reassoc1" } */
+
+unsigned int
+foo (void)
+{
+  unsigned int a = 0;
+  unsigned int b;
+
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+
+  return a;
+}
+
+/* Verify there are two pairs of __asm__ statements with no
+   intervening stmts.  */
+/* { dg-final { scan-tree-dump-times "__asm__\[^;\n]*;\n *__asm__" 2 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
new file mode 100644
index 0000000..4ff66ef
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param tree-reassoc-width=3 -fdump-tree-reassoc1-details" } */
+
+unsigned int
+foo (int a, int b, int c, int d)
+{
+  unsigned int s = 0;
+
+  s += a;
+  s += b;
+  s += c;
+  s += d;
+
+  return s;
+}
+
+/* Verify reassociation width was chosen to be 2.  */
+/* { dg-final { scan-tree-dump-times "Width = 2" 1 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index 51f7ef8..03e0672 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -40,6 +40,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "pointer-set.h"
 #include "cfgloop.h"
 #include "flags.h"
+#include "target.h"
+#include "params.h"
 
 /*  This is a simple global reassociation pass.  It is, in part, based
     on the LLVM pass of the same name (They do some things more/less
@@ -1617,6 +1619,62 @@ remove_visited_stmt_chain (tree var)
     }
 }
 
+/* This function checks three consecutive operands in
+   the passed operand vector OPS starting from OPINDEX and
+   swaps two operands if it is profitable for the binary operation
+   consuming the OPINDEX + 1 and OPINDEX + 2 operands.
+
+   We pair ops with the same rank if possible.
+
+   The alternative we try is to see if STMT is a destructive
+   update style statement, which is like:
+   b = phi (a, ...)
+   a = c + b;
+   In that case, we want to use the destructive update form to
+   expose the possible vectorizer sum reduction opportunity.
+   In that case, the third operand will be the phi node. This
+   check is not performed if STMT is null.
+
+   We could, of course, try to be better as noted above, and do a
+   lot of work to try to find these opportunities in >3 operand
+   cases, but it is unlikely to be worth it.  */
+
+static void
+swap_ops_for_binary_stmt (VEC(operand_entry_t, heap) * ops,
+			  unsigned int opindex, gimple stmt)
+{
+  operand_entry_t oe1, oe2, oe3;
+
+  oe1 = VEC_index (operand_entry_t, ops, opindex);
+  oe2 = VEC_index (operand_entry_t, ops, opindex + 1);
+  oe3 = VEC_index (operand_entry_t, ops, opindex + 2);
+
+  if ((oe1->rank == oe2->rank
+       && oe2->rank != oe3->rank)
+      || (stmt && is_phi_for_stmt (stmt, oe3->op)
+	  && !is_phi_for_stmt (stmt, oe1->op)
+	  && !is_phi_for_stmt (stmt, oe2->op)))
+    {
+      struct operand_entry temp = *oe3;
+      oe3->op = oe1->op;
+      oe3->rank = oe1->rank;
+      oe1->op = temp.op;
+      oe1->rank= temp.rank;
+    }
+  else if ((oe1->rank == oe3->rank
+	    && oe2->rank != oe3->rank)
+	   || (stmt && is_phi_for_stmt (stmt, oe2->op)
+	       && !is_phi_for_stmt (stmt, oe1->op)
+	       && !is_phi_for_stmt (stmt, oe3->op)))
+    {
+      struct operand_entry temp = *oe2;
+      oe2->op = oe1->op;
+      oe2->rank = oe1->rank;
+      oe1->op = temp.op;
+      oe1->rank= temp.rank;
+    }
+}
+
 /* Recursively rewrite our linearized statements so that the operators
    match those in OPS[OPINDEX], putting the computation in rank
    order.  */
@@ -1629,53 +1687,10 @@ rewrite_expr_tree (gimple stmt, unsigned int opindex,
   tree rhs2 = gimple_assign_rhs2 (stmt);
   operand_entry_t oe;
 
-  /* If we have three operands left, then we want to make sure the one
-     that gets the double binary op are the ones with the same rank.
-
-     The alternative we try is to see if this is a destructive
-     update style statement, which is like:
-     b = phi (a, ...)
-     a = c + b;
-     In that case, we want to use the destructive update form to
-     expose the possible vectorizer sum reduction opportunity.
-     In that case, the third operand will be the phi node.
-
-     We could, of course, try to be better as noted above, and do a
-     lot of work to try to find these opportunities in >3 operand
-     cases, but it is unlikely to be worth it.  */
+  /* If we have three operands left, then we want to make sure the ones
+     that get the double binary op are chosen wisely.  */
   if (opindex + 3 == VEC_length (operand_entry_t, ops))
-    {
-      operand_entry_t oe1, oe2, oe3;
-
-      oe1 = VEC_index (operand_entry_t, ops, opindex);
-      oe2 = VEC_index (operand_entry_t, ops, opindex + 1);
-      oe3 = VEC_index (operand_entry_t, ops, opindex + 2);
-
-      if ((oe1->rank == oe2->rank
-	   && oe2->rank != oe3->rank)
-	  || (is_phi_for_stmt (stmt, oe3->op)
-	      && !is_phi_for_stmt (stmt, oe1->op)
-	      && !is_phi_for_stmt (stmt, oe2->op)))
-	{
-	  struct operand_entry temp = *oe3;
-	  oe3->op = oe1->op;
-	  oe3->rank = oe1->rank;
-	  oe1->op = temp.op;
-	  oe1->rank= temp.rank;
-	}
-      else if ((oe1->rank == oe3->rank
-		&& oe2->rank != oe3->rank)
-	       || (is_phi_for_stmt (stmt, oe2->op)
-		   && !is_phi_for_stmt (stmt, oe1->op)
-		   && !is_phi_for_stmt (stmt, oe3->op)))
-	{
-	  struct operand_entry temp = *oe2;
-	  oe2->op = oe1->op;
-	  oe2->rank = oe1->rank;
-	  oe1->op = temp.op;
-	  oe1->rank= temp.rank;
-	}
-    }
+    swap_ops_for_binary_stmt (ops, opindex, stmt);
 
   /* The final recursion case for this function is that you have
      exactly two operations left.
@@ -1760,6 +1775,178 @@ rewrite_expr_tree (gimple stmt, unsigned int opindex,
   rewrite_expr_tree (SSA_NAME_DEF_STMT (rhs1), opindex + 1, ops, moved);
 }
 
+/* Find out how many cycles we need to compute a statement chain.
+   OPS_NUM holds the number of statements in the chain.  CPU_WIDTH is the
+   maximum number of independent statements we may execute per cycle.  */
+
+static int
+get_required_cycles (int ops_num, int cpu_width)
+{
+  int res;
+  int elog;
+  unsigned int rest;
+
+  /* While we have more than 2 * cpu_width operands
+     we may reduce number of operands by cpu_width
+     per cycle.  */
+  res = ops_num / (2 * cpu_width);
+
+  /* The remaining operand count may be halved each cycle
+     until we have only one operand.  */
+  rest = (unsigned)(ops_num - res * cpu_width);
+  elog = exact_log2 (rest);
+  if (elog >= 0)
+    res += elog;
+  else
+    res += floor_log2 (rest) + 1;
+
+  return res;
+}
+
+/* Returns an optimal number of registers to use for computation of
+   given statements.  */
+
+static int
+get_reassociation_width (int ops_num, enum tree_code opc,
+			 enum machine_mode mode)
+{
+  int param_width = PARAM_VALUE (PARAM_TREE_REASSOC_WIDTH);
+  int width;
+  int width_min;
+  int cycles_best;
+
+  if (param_width > 0)
+    width = param_width;
+  else
+    width = targetm.sched.reassociation_width (opc, mode);
+
+  if (width == 1)
+    return width;
+
+  /* Get the minimal time required for sequence computation.  */
+  cycles_best = get_required_cycles (ops_num, width);
+
+  /* Check if we may use less width and still compute the sequence in
+     the same time.  It will allow us to reduce register usage.
+     get_required_cycles is monotonically increasing with lower width
+     so we can perform a binary search for the minimal width that still
+     results in the optimal cycle count.  */
+  width_min = 1;
+  while (width > width_min)
+    {
+      int width_mid = (width + width_min) / 2;
+
+      if (get_required_cycles (ops_num, width_mid) == cycles_best)
+	width = width_mid;
+      else if (width_min < width_mid)
+	width_min = width_mid;
+      else
+	break;
+    }
+
+  return width;
+}
+
+/* Recursively rewrite our linearized statements so that the operators
+   match those in OPS[OPINDEX], putting the computation in rank
+   order and trying to allow operations to be executed in
+   parallel.  */
+
+static void
+rewrite_expr_tree_parallel (gimple stmt, int width,
+			    VEC(operand_entry_t, heap) * ops)
+{
+  enum tree_code opcode = gimple_assign_rhs_code (stmt);
+  int op_num = VEC_length (operand_entry_t, ops);
+  int stmt_num = op_num - 1;
+  gimple *stmts = XALLOCAVEC (gimple, stmt_num);
+  int op_index = op_num - 1;
+  int stmt_index = 0;
+  int ready_stmts_end = 0;
+  int i = 0;
+  tree last_rhs1 = gimple_assign_rhs1 (stmt);
+  tree lhs_var;
+
+  /* We start expression rewriting from the top statements.
+     So, in this loop we create a full list of statements
+     we will work with.  */
+  stmts[stmt_num - 1] = stmt;
+  for (i = stmt_num - 2; i >= 0; i--)
+    stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
+
+  lhs_var = create_tmp_reg (TREE_TYPE (last_rhs1), NULL);
+  add_referenced_var (lhs_var);
+
+  for (i = 0; i < stmt_num; i++)
+    {
+      tree op1, op2;
+
+      /* Determine whether we should use results of
+	 already handled statements or not.  */
+      if (ready_stmts_end == 0
+	  && (i - stmt_index >= width || op_index < 1))
+	ready_stmts_end = i;
+
+      /* Now we choose operands for the next statement.  A nonzero
+	 value in ready_stmts_end means that we should use
+	 the result of already generated statements as a new operand.  */
+      if (ready_stmts_end > 0)
+	{
+	  op1 = gimple_assign_lhs (stmts[stmt_index++]);
+	  if (ready_stmts_end > stmt_index)
+	    op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	  else if (op_index >= 0)
+	    op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  else
+	    {
+	      gcc_assert (stmt_index < i);
+	      op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	    }
+
+	  if (stmt_index >= ready_stmts_end)
+	    ready_stmts_end = 0;
+	}
+      else
+	{
+	  if (op_index > 1)
+	    swap_ops_for_binary_stmt (ops, op_index - 2, NULL);
+	  op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  op1 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	}
+
+      /* If we emit the last statement then we should put
+	 operands into the last statement.  It will also
+	 break the loop.  */
+      if (op_index < 0 && stmt_index == i)
+	i = stmt_num - 1;
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Transforming ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+
+      /* We keep original statement only for the last one.  All
+	 others are recreated.  */
+      if (i == stmt_num - 1)
+	{
+	  gimple_assign_set_rhs1 (stmts[i], op1);
+	  gimple_assign_set_rhs2 (stmts[i], op2);
+	  update_stmt (stmts[i]);
+	}
+      else
+	stmts[i] = build_and_add_sum (lhs_var, op1, op2, opcode);
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+    }
+
+  remove_visited_stmt_chain (last_rhs1);
+}
+
 /* Transform STMT, which is really (A +B) + (C + D) into the left
    linear form, ((A+B)+C)+D.
    Recurse on D if necessary.  */
@@ -2282,7 +2469,21 @@ reassociate_bb (basic_block bb)
 		    }
 		}
 	      else
-		rewrite_expr_tree (stmt, 0, ops, false);
+		{
+		  enum machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
+		  int ops_num = VEC_length (operand_entry_t, ops);
+		  int width = get_reassociation_width (ops_num, rhs_code, mode);
+
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    fprintf (dump_file,
+			     "Width = %d was chosen for reassociation\n", width);
+
+		  if (width > 1
+		      && VEC_length (operand_entry_t, ops) > 3)
+		    rewrite_expr_tree_parallel (stmt, width, ops);
+		  else
+		    rewrite_expr_tree (stmt, 0, ops, false);
+		}
 
 	      VEC_free (operand_entry_t, heap, ops);
 	    }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-09-01 10:28                               ` Ilya Enkovich
@ 2011-09-02 12:53                                 ` Uros Bizjak
  2011-09-02 13:07                                   ` Richard Guenther
  2011-09-06 12:40                                   ` Ilya Enkovich
  0 siblings, 2 replies; 47+ messages in thread
From: Uros Bizjak @ 2011-09-02 12:53 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: Richard Guenther, gcc-patches

On Thu, Sep 1, 2011 at 12:27 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>
>> this seems to not allow cycles_best to drop with lower width, but
>> that it can't should be an implementation detail of get_required_cycles.
>> To make it not so, can you add a comment before the loop, like
>>
>>  /* get_required_cycles is monotonically increasing with lower width
>>     so we can perform a binary search for the minimal width that still
>>     results in the optimal cycle count.  */
>>
>
> Fixed. Thanks!
>
>>
>> With the above change the non-x86 specifc parts are ok.  Please get
>> approval for them from a x86 maintainer.
>>
>
> Could someone please review the x86 part?

I assume that you needed to split the tune attribute into int and FP
parts to handle reassociation for other targets, since Atom handles both
in the same way.

Please also describe the function's return value in the comment (and
perhaps in the documentation, too).

OK with this addition.

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-09-02 12:53                                 ` Uros Bizjak
@ 2011-09-02 13:07                                   ` Richard Guenther
  2011-09-02 13:46                                     ` Ilya Enkovich
  2011-09-06 12:40                                   ` Ilya Enkovich
  1 sibling, 1 reply; 47+ messages in thread
From: Richard Guenther @ 2011-09-02 13:07 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Ilya Enkovich, gcc-patches

On Fri, Sep 2, 2011 at 2:52 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Thu, Sep 1, 2011 at 12:27 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>>
>>> this seems to not allow cycles_best to drop with lower width, but
>>> that it can't should be an implementation detail of get_required_cycles.
>>> To make it not so, can you add a comment before the loop, like
>>>
>>>  /* get_required_cycles is monotonically increasing with lower width
>>>     so we can perform a binary search for the minimal width that still
>>>     results in the optimal cycle count.  */
>>>
>>
>> Fixed. Thanks!
>>
>>>
>>> With the above change the non-x86 specifc parts are ok.  Please get
>>> approval for them from a x86 maintainer.
>>>
>>
>> Could please someone review x86 part?
>
> I assume that you need to split tune attribute to int and FP part to
> handle reassociation for other targets, since Atom handles both in the
> same way.
>
> Please also describe function return value in the comment (and perhaps
> in documentation, too).
>
> OK with this addition.

Btw, I would expect integer add and integer multiply to have different
settings for some targets which would mean splitting this up even
further ...

Richard.

> Thanks,
> Uros.
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-09-02 13:07                                   ` Richard Guenther
@ 2011-09-02 13:46                                     ` Ilya Enkovich
  2011-09-02 14:01                                       ` Richard Guenther
  0 siblings, 1 reply; 47+ messages in thread
From: Ilya Enkovich @ 2011-09-02 13:46 UTC (permalink / raw)
  To: richard.guenther, Uros Bizjak; +Cc: gcc-patches

2011/9/2 Richard Guenther <richard.guenther@gmail.com>:
>On Fri, Sep 2, 2011 at 2:52 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>
>> I assume that you need to split tune attribute to int and FP part to
>> handle reassociation for other targets, since Atom handles both in the
>> same way.
>>
>> Please also describe function return value in the comment (and perhaps
>> in documentation, too).
>>
>> OK with this addition.
>
> Btw, I would expect integer add and integer multiply to have different
> settings for some targets which would mean splitting this up even
> further ...

Which tune attributes are meant here? Is it the X86_TUNE_REASSOC_* flags
or the new command-line param?

Thanks,
Ilya
>
> Richard.
>
>> Thanks,
>> Uros.
>>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-09-02 13:46                                     ` Ilya Enkovich
@ 2011-09-02 14:01                                       ` Richard Guenther
  2011-09-02 14:13                                         ` Ilya Enkovich
  0 siblings, 1 reply; 47+ messages in thread
From: Richard Guenther @ 2011-09-02 14:01 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: Uros Bizjak, gcc-patches

On Fri, Sep 2, 2011 at 3:45 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> 2011/9/2 Richard Guenther <richard.guenther@gmail.com>:
>>On Fri, Sep 2, 2011 at 2:52 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>>
>>> I assume that you need to split tune attribute to int and FP part to
>>> handle reassociation for other targets, since Atom handles both in the
>>> same way.
>>>
>>> Please also describe function return value in the comment (and perhaps
>>> in documentation, too).
>>>
>>> OK with this addition.
>>
>> Btw, I would expect integer add and integer multiply to have different
>> settings for some targets which would mean splitting this up even
>> further ...
>
> Which tune attributes are meant here? Is it X86_TUNE_REASSOC_* flags
> or new command line param?

The X86_TUNE_REASSOC_* flags.  The setting surely depends on the
number of available execution units and/or whether the instructions
are pipelined or not.

Richard.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-09-02 14:01                                       ` Richard Guenther
@ 2011-09-02 14:13                                         ` Ilya Enkovich
  0 siblings, 0 replies; 47+ messages in thread
From: Ilya Enkovich @ 2011-09-02 14:13 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Uros Bizjak, gcc-patches

2011/9/2 Richard Guenther <richard.guenther@gmail.com>:
> On Fri, Sep 2, 2011 at 3:45 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>> 2011/9/2 Richard Guenther <richard.guenther@gmail.com>:
>>>On Fri, Sep 2, 2011 at 2:52 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>>>
>>>> I assume that you need to split tune attribute to int and FP part to
>>>> handle reassociation for other targets, since Atom handles both in the
>>>> same way.
>>>>
>>>> Please also describe function return value in the comment (and perhaps
>>>> in documentation, too).
>>>>
>>>> OK with this addition.
>>>
>>> Btw, I would expect integer add and integer multiply to have different
>>> settings for some targets which would mean splitting this up even
>>> further ...
>>
>> Which tune attributes are meant here? Is it X86_TUNE_REASSOC_* flags
>> or new command line param?
>
> The X86_TUNE_REASSOC_* flags.  The setting surely depends on the
> number of available execution units and/or whether the instructions
> are pipelined or not.

Now I'm not sure it is a good idea to use these tune flags at all,
because we may need plenty of them. We have a lot of types, including
vector ones, and a variety of opcodes. Any combination may require
delicate tuning. Some sort of separate table would suit better here,
though I doubt we should introduce it right now filled with 1s.
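
(A purely illustrative sketch, not from the patch or from GCC: such a
separate table could be indexed by a coarse operation class and a mode
class, with made-up names and contents along these lines.)

/* Hypothetical table-driven width lookup; the enums, table contents
   and helper are illustrative assumptions only.  */
enum reassoc_op_class { ROC_PLUS, ROC_MULT, ROC_OTHER, ROC_LAST };
enum reassoc_mode_class { RMC_INT, RMC_FLOAT, RMC_VECTOR, RMC_LAST };

static const int reassoc_width_table[ROC_LAST][RMC_LAST] = {
  /*            INT  FLOAT  VECTOR  */
  /* PLUS  */ {  2,    2,     1 },
  /* MULT  */ {  2,    2,     1 },
  /* OTHER */ {  1,    1,     1 },
};

static int
table_reassociation_width (enum reassoc_op_class op_class,
                           enum reassoc_mode_class mode_class)
{
  return reassoc_width_table[op_class][mode_class];
}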

Ilya

>
> Richard.
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-09-02 12:53                                 ` Uros Bizjak
  2011-09-02 13:07                                   ` Richard Guenther
@ 2011-09-06 12:40                                   ` Ilya Enkovich
  2011-09-06 15:53                                     ` Uros Bizjak
  1 sibling, 1 reply; 47+ messages in thread
From: Ilya Enkovich @ 2011-09-06 12:40 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1953 bytes --]

Hello,

2011/9/2 Uros Bizjak <ubizjak@gmail.com>:
> I assume that you need to split tune attribute to int and FP part to
> handle reassociation for other targets, since Atom handles both in the
> same way.
>
> Please also describe function return value in the comment (and perhaps
> in documentation, too).
>
> OK with this addition.
>
> Thanks,
> Uros.
>

Here is the patch with the added comment for the hook implementation. Is it OK?

Thanks,
Ilya
---
gcc/

2011-09-06  Enkovich Ilya  <ilya.enkovich@intel.com>

	PR middle-end/44382
	* target.def (reassociation_width): New hook.

	* doc/tm.texi.in (reassociation_width): Likewise.

	* doc/tm.texi (reassociation_width): Likewise.

	* doc/invoke.texi (tree-reassoc-width): New param documented.

	* hooks.h (hook_int_uint_mode_1): New default hook.

	* hooks.c (hook_int_uint_mode_1): Likewise.

	* config/i386/i386.h (ix86_tune_indices): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(TARGET_REASSOC_INT_TO_PARALLEL): New.
	(TARGET_REASSOC_FP_TO_PARALLEL): Likewise.

	* config/i386/i386.c (initial_ix86_tune_features): Add
	X86_TUNE_REASSOC_INT_TO_PARALLEL and
	X86_TUNE_REASSOC_FP_TO_PARALLEL.

	(ix86_reassociation_width) implementation of
	new hook for i386 target.

	* params.def (PARAM_TREE_REASSOC_WIDTH): New param added.

	* tree-ssa-reassoc.c (get_required_cycles): New function.
	(get_reassociation_width): Likewise.
	(swap_ops_for_binary_stmt): Likewise.
	(rewrite_expr_tree_parallel): Likewise.

	(rewrite_expr_tree): Refactored. Part of code moved into
	swap_ops_for_binary_stmt.

	(reassociate_bb): Now checks reassociation width to be used
	and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
	if needed.

gcc/testsuite/

2011-09-06  Enkovich Ilya  <ilya.enkovich@intel.com>

	* gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
	--param tree-reassoc-width=1.

	* gcc.dg/tree-ssa/reassoc-24.c: New test.
	* gcc.dg/tree-ssa/reassoc-25.c: Likewise.

[-- Attachment #2: PR44382.diff --]
[-- Type: application/octet-stream, Size: 19789 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a9c0aa7..555db59 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2168,7 +2168,15 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
 
   /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
      the auto-vectorizer.  */
-  m_BDVER
+  m_BDVER,
+
+  /* X86_TUNE_REASSOC_INT_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of integer computation.  */
+  m_ATOM,
+
+  /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations
+     during reassociation of fp computation.  */
+  m_ATOM
 };
 
 /* Feature tests against the various architecture variations.  */
@@ -34843,6 +34851,8 @@ ix86_enum_va_list (int idx, const char **pname, tree *ptree)
 #define TARGET_SCHED_DISPATCH has_dispatch
 #undef TARGET_SCHED_DISPATCH_DO
 #define TARGET_SCHED_DISPATCH_DO do_dispatch
+#undef TARGET_SCHED_REASSOCIATION_WIDTH
+#define TARGET_SCHED_REASSOCIATION_WIDTH ix86_reassociation_width
 
 /* The size of the dispatch window is the total number of bytes of
    object code allowed in a window.  */
@@ -35640,6 +35650,32 @@ has_dispatch (rtx insn, int action)
   return false;
 }
 
+/* Implementation of the reassociation_width target hook used by
+   the reassoc phase to identify the parallelism level in a reassociated
+   tree.  The statement's tree_code is passed in OPC.  The arguments'
+   type is passed in MODE.
+
+   Currently parallel reassociation is enabled for Atom
+   processors only and we set reassociation width to be 2
+   because Atom may issue up to 2 instructions per cycle.
+
+   The return value should be adjusted if parallel reassociation is
+   enabled for other processors.  */
+
+static int
+ix86_reassociation_width (unsigned int opc ATTRIBUTE_UNUSED,
+			  enum machine_mode mode)
+{
+  int res = 1;
+
+  if (INTEGRAL_MODE_P (mode) && TARGET_REASSOC_INT_TO_PARALLEL)
+    res = 2;
+  else if (FLOAT_MODE_P (mode) && TARGET_REASSOC_FP_TO_PARALLEL)
+    res = 2;
+
+  return res;
+}
+
 /* ??? No autovectorization into MMX or 3DNOW until we can reliably
    place emms and femms instructions.  */
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 47442a0..21f5075 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -318,6 +318,8 @@ enum ix86_tune_indices {
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
   X86_TUNE_AVX128_OPTIMAL,
+  X86_TUNE_REASSOC_INT_TO_PARALLEL,
+  X86_TUNE_REASSOC_FP_TO_PARALLEL,
 
   X86_TUNE_LAST
 };
@@ -416,6 +418,11 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
 #define TARGET_AVX128_OPTIMAL \
 	ix86_tune_features[X86_TUNE_AVX128_OPTIMAL]
+#define TARGET_REASSOC_INT_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_INT_TO_PARALLEL]
+#define TARGET_REASSOC_FP_TO_PARALLEL \
+	ix86_tune_features[X86_TUNE_REASSOC_FP_TO_PARALLEL]
+
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
   X86_ARCH_CMOVE,		/* || TARGET_SSE */
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index c5b19eb..b1ff187 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -9076,6 +9076,11 @@ The smallest number of different values for which it is best to use a
 jump-table instead of a tree of conditional branches.  If the value is
 0, use the default for the machine.  The default is 0.
 
+@item tree-reassoc-width
+Set the maximum number of instructions executed in parallel in
+a reassociated tree.  This parameter overrides the target-dependent
+heuristics used by default if it has a nonzero value.
+
 @end table
 @end table
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 7364aa1..335c1d1 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6840,6 +6840,11 @@ the order of instructions is important for correctness when scheduling, but
 also the latencies of operations.
 @end deftypevr
 
+@deftypefn {Target Hook} int TARGET_SCHED_REASSOCIATION_WIDTH (unsigned int @var{opc}, enum machine_mode @var{mode})
+This hook is called by the tree reassociator to determine the level of
+parallelism required in an output calculation chain.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 4535fd6..6783826 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -6774,6 +6774,8 @@ in its second parameter.
 
 @hook TARGET_SCHED_EXPOSED_PIPELINE
 
+@hook TARGET_SCHED_REASSOCIATION_WIDTH
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
diff --git a/gcc/hooks.c b/gcc/hooks.c
index 9025183..1ba44f9 100644
--- a/gcc/hooks.c
+++ b/gcc/hooks.c
@@ -161,6 +161,13 @@ default_can_output_mi_thunk_no_vcall (const_tree a ATTRIBUTE_UNUSED,
 }
 
 int
+hook_int_uint_mode_1 (unsigned int a ATTRIBUTE_UNUSED,
+		      enum machine_mode b ATTRIBUTE_UNUSED)
+{
+  return 1;
+}
+
+int
 hook_int_const_tree_0 (const_tree a ATTRIBUTE_UNUSED)
 {
   return 0;
diff --git a/gcc/hooks.h b/gcc/hooks.h
index 156d708..54ace24 100644
--- a/gcc/hooks.h
+++ b/gcc/hooks.h
@@ -68,6 +68,7 @@ extern void hook_void_tree_treeptr (tree, tree *);
 extern void hook_void_int_int (int, int);
 extern void hook_void_gcc_optionsp (struct gcc_options *);
 
+extern int hook_int_uint_mode_1 (unsigned int, enum machine_mode);
 extern int hook_int_const_tree_0 (const_tree);
 extern int hook_int_const_tree_const_tree_1 (const_tree, const_tree);
 extern int hook_int_rtx_0 (rtx);
diff --git a/gcc/params.def b/gcc/params.def
index e8372ed..c376c80 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -914,6 +914,13 @@ DEFPARAM (PARAM_ALLOW_STORE_DATA_RACES,
 	  "Allow new data races on stores to be introduced",
 	  1, 0, 1)
 
+/* Reassociation width to be used by tree reassoc optimization.  */
+DEFPARAM (PARAM_TREE_REASSOC_WIDTH,
+	  "tree-reassoc-width",
+	  "Set the maximum number of instructions executed in parallel in "
+	  "reassociated tree. If 0, use the target dependent heuristic.",
+	  0, 0, 0)
+
 
 /*
 Local variables:
diff --git a/gcc/target.def b/gcc/target.def
index 857f463..1e09ba7 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -913,6 +913,16 @@ the order of instructions is important for correctness when scheduling, but\n\
 also the latencies of operations.",
 bool, false)
 
+/* The following member value is a function that returns the number
+   of operations the reassociator should try to put in parallel for
+   statements of the given type.  By default 1 is used.  */
+DEFHOOK
+(reassociation_width,
+"This hook is called by tree reassociator to determine a level of\n\
+parallelism required in output calculations chain.",
+int, (unsigned int opc, enum machine_mode mode),
+hook_int_uint_mode_1)
+
 HOOK_VECTOR_END (sched)
 
 /* Functions relating to vectorization.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
index e787227..a80a5a8 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
@@ -1,6 +1,6 @@
 /* PR middle-end/38533 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-reassoc1" } */
+/* { dg-options "-O2 --param tree-reassoc-width=1 -fdump-tree-reassoc1" } */
 
 #define A asm volatile ("" : "=r" (f) : "0" (0)); e |= f;
 #define B A A A A A A A A A A A
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
new file mode 100644
index 0000000..c871628
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-24.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param tree-reassoc-width=2 -fdump-tree-reassoc1" } */
+
+unsigned int
+foo (void)
+{
+  unsigned int a = 0;
+  unsigned int b;
+
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+  asm volatile ("" : "=r" (b) : "0" (0));
+  a += b;
+
+  return a;
+}
+
+/* Verify there are two pairs of __asm__ statements with no
+   intervening stmts.  */
+/* { dg-final { scan-tree-dump-times "__asm__\[^;\n]*;\n *__asm__" 2 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
new file mode 100644
index 0000000..4ff66ef
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/reassoc-25.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param tree-reassoc-width=3 -fdump-tree-reassoc1-details" } */
+
+unsigned int
+foo (int a, int b, int c, int d)
+{
+  unsigned int s = 0;
+
+  s += a;
+  s += b;
+  s += c;
+  s += d;
+
+  return s;
+}
+
+/* Verify reassociation width was chosen to be 2.  */
+/* { dg-final { scan-tree-dump-times "Width = 2" 1 "reassoc1"} } */
+/* { dg-final { cleanup-tree-dump "reassoc1" } } */
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index 51f7ef8..03e0672 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -40,6 +40,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "pointer-set.h"
 #include "cfgloop.h"
 #include "flags.h"
+#include "target.h"
+#include "params.h"
 
 /*  This is a simple global reassociation pass.  It is, in part, based
     on the LLVM pass of the same name (They do some things more/less
@@ -1617,6 +1619,62 @@ remove_visited_stmt_chain (tree var)
     }
 }
 
+/* This function checks three consecutive operands in
+   the passed operand vector OPS starting from OPINDEX and
+   swaps two operands if it is profitable for the binary operation
+   consuming the OPINDEX + 1 and OPINDEX + 2 operands.
+
+   We pair ops with the same rank if possible.
+
+   The alternative we try is to see if STMT is a destructive
+   update style statement, which is like:
+   b = phi (a, ...)
+   a = c + b;
+   In that case, we want to use the destructive update form to
+   expose the possible vectorizer sum reduction opportunity.
+   In that case, the third operand will be the phi node. This
+   check is not performed if STMT is null.
+
+   We could, of course, try to be better as noted above, and do a
+   lot of work to try to find these opportunities in >3 operand
+   cases, but it is unlikely to be worth it.  */
+
+static void
+swap_ops_for_binary_stmt (VEC(operand_entry_t, heap) * ops,
+			  unsigned int opindex, gimple stmt)
+{
+  operand_entry_t oe1, oe2, oe3;
+
+  oe1 = VEC_index (operand_entry_t, ops, opindex);
+  oe2 = VEC_index (operand_entry_t, ops, opindex + 1);
+  oe3 = VEC_index (operand_entry_t, ops, opindex + 2);
+
+  if ((oe1->rank == oe2->rank
+       && oe2->rank != oe3->rank)
+      || (stmt && is_phi_for_stmt (stmt, oe3->op)
+	  && !is_phi_for_stmt (stmt, oe1->op)
+	  && !is_phi_for_stmt (stmt, oe2->op)))
+    {
+      struct operand_entry temp = *oe3;
+      oe3->op = oe1->op;
+      oe3->rank = oe1->rank;
+      oe1->op = temp.op;
+      oe1->rank= temp.rank;
+    }
+  else if ((oe1->rank == oe3->rank
+	    && oe2->rank != oe3->rank)
+	   || (stmt && is_phi_for_stmt (stmt, oe2->op)
+	       && !is_phi_for_stmt (stmt, oe1->op)
+	       && !is_phi_for_stmt (stmt, oe3->op)))
+    {
+      struct operand_entry temp = *oe2;
+      oe2->op = oe1->op;
+      oe2->rank = oe1->rank;
+      oe1->op = temp.op;
+      oe1->rank= temp.rank;
+    }
+}
+
 /* Recursively rewrite our linearized statements so that the operators
    match those in OPS[OPINDEX], putting the computation in rank
    order.  */
@@ -1629,53 +1687,10 @@ rewrite_expr_tree (gimple stmt, unsigned int opindex,
   tree rhs2 = gimple_assign_rhs2 (stmt);
   operand_entry_t oe;
 
-  /* If we have three operands left, then we want to make sure the one
-     that gets the double binary op are the ones with the same rank.
-
-     The alternative we try is to see if this is a destructive
-     update style statement, which is like:
-     b = phi (a, ...)
-     a = c + b;
-     In that case, we want to use the destructive update form to
-     expose the possible vectorizer sum reduction opportunity.
-     In that case, the third operand will be the phi node.
-
-     We could, of course, try to be better as noted above, and do a
-     lot of work to try to find these opportunities in >3 operand
-     cases, but it is unlikely to be worth it.  */
+  /* If we have three operands left, then we want to make sure the ones
+     that get the double binary op are chosen wisely.  */
   if (opindex + 3 == VEC_length (operand_entry_t, ops))
-    {
-      operand_entry_t oe1, oe2, oe3;
-
-      oe1 = VEC_index (operand_entry_t, ops, opindex);
-      oe2 = VEC_index (operand_entry_t, ops, opindex + 1);
-      oe3 = VEC_index (operand_entry_t, ops, opindex + 2);
-
-      if ((oe1->rank == oe2->rank
-	   && oe2->rank != oe3->rank)
-	  || (is_phi_for_stmt (stmt, oe3->op)
-	      && !is_phi_for_stmt (stmt, oe1->op)
-	      && !is_phi_for_stmt (stmt, oe2->op)))
-	{
-	  struct operand_entry temp = *oe3;
-	  oe3->op = oe1->op;
-	  oe3->rank = oe1->rank;
-	  oe1->op = temp.op;
-	  oe1->rank= temp.rank;
-	}
-      else if ((oe1->rank == oe3->rank
-		&& oe2->rank != oe3->rank)
-	       || (is_phi_for_stmt (stmt, oe2->op)
-		   && !is_phi_for_stmt (stmt, oe1->op)
-		   && !is_phi_for_stmt (stmt, oe3->op)))
-	{
-	  struct operand_entry temp = *oe2;
-	  oe2->op = oe1->op;
-	  oe2->rank = oe1->rank;
-	  oe1->op = temp.op;
-	  oe1->rank= temp.rank;
-	}
-    }
+    swap_ops_for_binary_stmt (ops, opindex, stmt);
 
   /* The final recursion case for this function is that you have
      exactly two operations left.
@@ -1760,6 +1775,178 @@ rewrite_expr_tree (gimple stmt, unsigned int opindex,
   rewrite_expr_tree (SSA_NAME_DEF_STMT (rhs1), opindex + 1, ops, moved);
 }
 
+/* Find out how many cycles we need to compute a statement chain.
+   OPS_NUM holds the number of statements in the chain.  CPU_WIDTH is the
+   maximum number of independent statements we may execute per cycle.  */
+
+static int
+get_required_cycles (int ops_num, int cpu_width)
+{
+  int res;
+  int elog;
+  unsigned int rest;
+
+  /* While we have more than 2 * cpu_width operands
+     we may reduce number of operands by cpu_width
+     per cycle.  */
+  res = ops_num / (2 * cpu_width);
+
+  /* The remaining operand count may be halved each cycle
+     until we have only one operand.  */
+  rest = (unsigned)(ops_num - res * cpu_width);
+  elog = exact_log2 (rest);
+  if (elog >= 0)
+    res += elog;
+  else
+    res += floor_log2 (rest) + 1;
+
+  return res;
+}
+
+/* Returns an optimal number of registers to use for computation of
+   given statements.  */
+
+static int
+get_reassociation_width (int ops_num, enum tree_code opc,
+			 enum machine_mode mode)
+{
+  int param_width = PARAM_VALUE (PARAM_TREE_REASSOC_WIDTH);
+  int width;
+  int width_min;
+  int cycles_best;
+
+  if (param_width > 0)
+    width = param_width;
+  else
+    width = targetm.sched.reassociation_width (opc, mode);
+
+  if (width == 1)
+    return width;
+
+  /* Get the minimal time required for sequence computation.  */
+  cycles_best = get_required_cycles (ops_num, width);
+
+  /* Check if we may use less width and still compute the sequence in
+     the same time.  It will allow us to reduce register usage.
+     get_required_cycles is monotonically increasing with lower width
+     so we can perform a binary search for the minimal width that still
+     results in the optimal cycle count.  */
+  width_min = 1;
+  while (width > width_min)
+    {
+      int width_mid = (width + width_min) / 2;
+
+      if (get_required_cycles (ops_num, width_mid) == cycles_best)
+	width = width_mid;
+      else if (width_min < width_mid)
+	width_min = width_mid;
+      else
+	break;
+    }
+
+  return width;
+}
+
+/* Rewrite the linearized statement chain rooted at STMT so that its
+   operands come from OPS, arranging the computation to allow up to
+   WIDTH operations to execute in parallel.  */
+
+static void
+rewrite_expr_tree_parallel (gimple stmt, int width,
+			    VEC(operand_entry_t, heap) * ops)
+{
+  enum tree_code opcode = gimple_assign_rhs_code (stmt);
+  int op_num = VEC_length (operand_entry_t, ops);
+  int stmt_num = op_num - 1;
+  gimple *stmts = XALLOCAVEC (gimple, stmt_num);
+  int op_index = op_num - 1;
+  int stmt_index = 0;
+  int ready_stmts_end = 0;
+  int i = 0;
+  tree last_rhs1 = gimple_assign_rhs1 (stmt);
+  tree lhs_var;
+
+  /* We start expression rewriting from the topmost statement, so
+     this loop collects the full list of statements we will
+     work with.  */
+  stmts[stmt_num - 1] = stmt;
+  for (i = stmt_num - 2; i >= 0; i--)
+    stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
+
+  lhs_var = create_tmp_reg (TREE_TYPE (last_rhs1), NULL);
+  add_referenced_var (lhs_var);
+
+  for (i = 0; i < stmt_num; i++)
+    {
+      tree op1, op2;
+
+      /* Determine whether we should use results of
+	 already handled statements or not.  */
+      if (ready_stmts_end == 0
+	  && (i - stmt_index >= width || op_index < 1))
+	ready_stmts_end = i;
+
+      /* Now we choose operands for the next statement.  A nonzero
+	 ready_stmts_end means that we should use the results of
+	 already generated statements as the new operands.  */
+      if (ready_stmts_end > 0)
+	{
+	  op1 = gimple_assign_lhs (stmts[stmt_index++]);
+	  if (ready_stmts_end > stmt_index)
+	    op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	  else if (op_index >= 0)
+	    op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  else
+	    {
+	      gcc_assert (stmt_index < i);
+	      op2 = gimple_assign_lhs (stmts[stmt_index++]);
+	    }
+
+	  if (stmt_index >= ready_stmts_end)
+	    ready_stmts_end = 0;
+	}
+      else
+	{
+	  if (op_index > 1)
+	    swap_ops_for_binary_stmt (ops, op_index - 2, NULL);
+	  op2 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	  op1 = VEC_index (operand_entry_t, ops, op_index--)->op;
+	}
+
+      /* If we are emitting the last statement, put the operands
+	 into it; this also terminates the loop.  */
+      if (op_index < 0 && stmt_index == i)
+	i = stmt_num - 1;
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Transforming ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+
+      /* Only the last statement keeps the original; all others
+	 are recreated.  */
+      if (i == stmt_num - 1)
+	{
+	  gimple_assign_set_rhs1 (stmts[i], op1);
+	  gimple_assign_set_rhs2 (stmts[i], op2);
+	  update_stmt (stmts[i]);
+	}
+      else
+	stmts[i] = build_and_add_sum (lhs_var, op1, op2, opcode);
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0, 0);
+	}
+    }
+
+  remove_visited_stmt_chain (last_rhs1);
+}
+
 /* Transform STMT, which is really (A +B) + (C + D) into the left
    linear form, ((A+B)+C)+D.
    Recurse on D if necessary.  */
@@ -2282,7 +2469,21 @@ reassociate_bb (basic_block bb)
 		    }
 		}
 	      else
-		rewrite_expr_tree (stmt, 0, ops, false);
+		{
+		  enum machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
+		  int ops_num = VEC_length (operand_entry_t, ops);
+		  int width = get_reassociation_width (ops_num, rhs_code, mode);
+
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    fprintf (dump_file,
+			     "Width = %d was chosen for reassociation\n", width);
+
+		  if (width > 1
+		      && VEC_length (operand_entry_t, ops) > 3)
+		    rewrite_expr_tree_parallel (stmt, width, ops);
+		  else
+		    rewrite_expr_tree (stmt, 0, ops, false);
+		}
 
 	      VEC_free (operand_entry_t, heap, ops);
 	    }

^ permalink raw reply	[flat|nested] 47+ messages in thread
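
For readers who want to follow the arithmetic in get_required_cycles and the
width selection in get_reassociation_width from the patch above, the
standalone C sketch below reimplements the same logic with the GCC param and
target-hook lookups replaced by plain function arguments.  It is illustrative
only (the names required_cycles and reassociation_width are editorial
stand-ins, not GCC's), but the numbers it prints follow the formulas in the
patch.

#include <stdio.h>

/* Sketch of get_required_cycles: the number of cycles needed to reduce
   OPS_NUM operands when up to CPU_WIDTH independent operations can be
   issued per cycle.  */
static int
required_cycles (int ops_num, int cpu_width)
{
  /* While more than 2 * cpu_width operands remain, each cycle retires
     cpu_width operations and so removes cpu_width operands.  */
  int res = ops_num / (2 * cpu_width);

  /* The remaining operands form a balanced tree and are halved each
     cycle; that takes ceil(log2(rest)) further cycles.  */
  int rest = ops_num - res * cpu_width;
  int log = 0;
  while ((1 << log) < rest)
    log++;

  return res + log;
}

/* Sketch of the binary search in get_reassociation_width: pick the
   smallest width that still reaches the optimal cycle count, to keep
   register pressure down.  */
static int
reassociation_width (int ops_num, int max_width)
{
  int cycles_best = required_cycles (ops_num, max_width);
  int width_min = 1;
  int width = max_width;

  while (width > width_min)
    {
      int width_mid = (width + width_min) / 2;

      if (required_cycles (ops_num, width_mid) == cycles_best)
	width = width_mid;
      else if (width_min < width_mid)
	width_min = width_mid;
      else
	break;
    }

  return width;
}

int
main (void)
{
  /* Worked example: a sum of 6 operands on a machine that could issue 4
     independent adds per cycle.  The minimum is 3 cycles, and width 2
     already achieves 3 cycles, so the narrower width is chosen.  */
  printf ("cycles (6, 4) = %d\n", required_cycles (6, 4));      /* 3 */
  printf ("cycles (6, 2) = %d\n", required_cycles (6, 2));      /* 3 */
  printf ("width  (6, 4) = %d\n", reassociation_width (6, 4));  /* 2 */
  return 0;
}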
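
At the source level, the effect rewrite_expr_tree_parallel aims for can be
pictured with a hypothetical example like the one below (this is not one of
the new reassoc-24.c/reassoc-25.c tests, whose contents are not shown in this
thread).  Assuming FP reassociation is permitted (e.g. -O2 -ffast-math) and
the chosen width is 2, the pass replaces the fully serial chain with shorter
chains that an in-order target such as Atom can overlap:

/* Hypothetical illustration only.  With the default linear form the sum
   is emitted as ((((a + b) + c) + d) + e) + f, so every add waits for
   the previous one.  With a reassociation width of 2 the rewrite emits
   roughly:

       t1 = a + b;     t2 = c + d;     <- independent, can issue together
       t1 = t1 + t2;   t2 = e + f;     <- independent, can issue together
       sum = t1 + t2;                  <- final combine

   which needs 3 dependent steps instead of 5 for the 5 additions.  */
double
sum6 (double a, double b, double c, double d, double e, double f)
{
  return a + b + c + d + e + f;
}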

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-09-06 12:40                                   ` Ilya Enkovich
@ 2011-09-06 15:53                                     ` Uros Bizjak
  2011-09-06 16:18                                       ` Ilya Enkovich
  0 siblings, 1 reply; 47+ messages in thread
From: Uros Bizjak @ 2011-09-06 15:53 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: gcc-patches

On Tue, Sep 6, 2011 at 2:39 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:

>> I assume that you need to split the tune attribute into int and FP parts
>> to handle reassociation for other targets, since Atom handles both in the
>> same way.
>>
>> Please also describe function return value in the comment (and perhaps
>> in documentation, too).
>>
>> OK with this addition.
>>
>> Thanks,
>> Uros.
>>
>
> Here is the patch with the added comment for the hook implementation. Is it OK?
>
> Thanks,
> Ilya
> ---
> gcc/
>
> 2011-09-06  Enkovich Ilya  <ilya.enkovich@intel.com>
>
>        PR middle-end/44382
>        * target.def (reassociation_width): New hook.
>
>        * doc/tm.texi.in (reassociation_width): Likewise.
>
>        * doc/tm.texi (reassociation_width): Likewise.
>
>        * doc/invoke.texi (tree-reassoc-width): New param documented.
>
>        * hooks.h (hook_int_uint_mode_1): New default hook.
>
>        * hooks.c (hook_int_uint_mode_1): Likewise.
>
>        * config/i386/i386.h (ix86_tune_indices): Add
>        X86_TUNE_REASSOC_INT_TO_PARALLEL and
>        X86_TUNE_REASSOC_FP_TO_PARALLEL.
>
>        (TARGET_REASSOC_INT_TO_PARALLEL): New.
>        (TARGET_REASSOC_FP_TO_PARALLEL): Likewise.
>
>        * config/i386/i386.c (initial_ix86_tune_features): Add
>        X86_TUNE_REASSOC_INT_TO_PARALLEL and
>        X86_TUNE_REASSOC_FP_TO_PARALLEL.
>
>        (ix86_reassociation_width): Implementation of
>        new hook for i386 target.
>
>        * params.def (PARAM_TREE_REASSOC_WIDTH): New param added.
>
>        * tree-ssa-reassoc.c (get_required_cycles): New function.
>        (get_reassociation_width): Likewise.
>        (swap_ops_for_binary_stmt): Likewise.
>        (rewrite_expr_tree_parallel): Likewise.
>
>        (rewrite_expr_tree): Refactored. Part of code moved into
>        swap_ops_for_binary_stmt.
>
>        (reassociate_bb): Now checks reassociation width to be used
>        and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
>        if needed.
>
> gcc/testsuite/
>
> 2011-09-06  Enkovich Ilya  <ilya.enkovich@intel.com>
>
>        * gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
>        --param tree-reassoc-width=1.
>
>        * gcc.dg/tree-ssa/reassoc-24.c: New test.
>        * gcc.dg/tree-ssa/reassoc-25.c: Likewise.
>

OK.

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 47+ messages in thread
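
For context on the hook being reviewed here, the fragment below sketches the
general shape of a reassociation_width implementation that distinguishes
integer from FP modes, in line with Uros's comment about splitting the tune
attribute.  It is a schematic of GCC backend code (it cannot be compiled on
its own) rather than a quote of the committed i386 change, and the returned
value of 2 is only an example.

/* Schematic TARGET_SCHED_REASSOCIATION_WIDTH hook.  Return how many
   parallel chains the reassociation pass should build for operations
   with code OPC in MODE; returning 1 keeps the existing linear form.
   The TARGET_REASSOC_*_TO_PARALLEL macros stand for the new tuning
   flags added by the patch.  */
static int
ix86_reassociation_width (unsigned int opc ATTRIBUTE_UNUSED,
			  enum machine_mode mode)
{
  int res = 1;

  if (INTEGRAL_MODE_P (mode) && TARGET_REASSOC_INT_TO_PARALLEL)
    res = 2;
  else if (FLOAT_MODE_P (mode) && TARGET_REASSOC_FP_TO_PARALLEL)
    res = 2;

  return res;
}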

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-09-06 15:53                                     ` Uros Bizjak
@ 2011-09-06 16:18                                       ` Ilya Enkovich
  2011-09-06 16:45                                         ` H.J. Lu
  0 siblings, 1 reply; 47+ messages in thread
From: Ilya Enkovich @ 2011-09-06 16:18 UTC (permalink / raw)
  To: gcc-patches

2011/9/6 Uros Bizjak <ubizjak@gmail.com>:
>
> OK.
>
> Thanks,
> Uros.
>

Could someone please check it in for me?

Thanks,
Ilya

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH Atom][PR middle-end/44382] Tree reassociation improvement
  2011-09-06 16:18                                       ` Ilya Enkovich
@ 2011-09-06 16:45                                         ` H.J. Lu
  0 siblings, 0 replies; 47+ messages in thread
From: H.J. Lu @ 2011-09-06 16:45 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: gcc-patches

On Tue, Sep 6, 2011 at 9:07 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> 2011/9/6 Uros Bizjak <ubizjak@gmail.com>:
>>
>> OK.
>>
>> Thanks,
>> Uros.
>>
>
> Could please someone check it in for me?
>

I checked it in for you.

-- 
H.J.

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2011-09-06 16:43 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-12 12:35 [PATCH Atom][PR middle-end/44382] Tree reassociation improvement Илья Энкович
2011-07-12 17:08 ` William J. Schmidt
2011-07-12 17:22   ` H.J. Lu
2011-07-13  8:28     ` Ilya Enkovich
2011-07-12 17:24   ` William J. Schmidt
2011-07-13  8:13     ` Ilya Enkovich
2011-07-13  8:30       ` Jakub Jelinek
2011-07-13  9:04         ` Ilya Enkovich
2011-07-13  9:52           ` Jakub Jelinek
2011-07-13  9:59             ` Ilya Enkovich
2011-07-13 11:01               ` Richard Guenther
2011-07-13 13:11   ` William J. Schmidt
2011-07-13 18:05 ` Michael Meissner
2011-07-13 20:18   ` Ilya Enkovich
2011-07-14  2:39 ` Michael Meissner
2011-07-14  9:40   ` Richard Guenther
2011-07-14  9:42     ` Richard Guenther
2011-07-14 10:04       ` Ilya Enkovich
2011-07-14 10:14         ` Richard Guenther
2011-07-14 11:03           ` Ilya Enkovich
2011-07-14 15:31             ` Ilya Enkovich
2011-07-19 12:10               ` Richard Guenther
2011-07-19 15:25                 ` Ilya Enkovich
2011-08-05 11:46                   ` Richard Guenther
2011-08-05 15:08                     ` Ilya Enkovich
2011-08-10 15:19                       ` Ilya Enkovich
2011-08-19  9:09                         ` Ilya Enkovich
2011-08-26 11:45                           ` Ilya Enkovich
2011-08-30 13:16                         ` Richard Guenther
2011-08-31 15:38                           ` Ilya Enkovich
2011-09-01  9:24                             ` Richard Guenther
2011-09-01 10:28                               ` Ilya Enkovich
2011-09-02 12:53                                 ` Uros Bizjak
2011-09-02 13:07                                   ` Richard Guenther
2011-09-02 13:46                                     ` Ilya Enkovich
2011-09-02 14:01                                       ` Richard Guenther
2011-09-02 14:13                                         ` Ilya Enkovich
2011-09-06 12:40                                   ` Ilya Enkovich
2011-09-06 15:53                                     ` Uros Bizjak
2011-09-06 16:18                                       ` Ilya Enkovich
2011-09-06 16:45                                         ` H.J. Lu
2011-07-26  9:54                 ` Ilya Enkovich
2011-08-03  8:35                   ` Ilya Enkovich
2011-07-21 20:14               ` Joseph S. Myers
2011-07-14 16:04       ` Michael Meissner
2011-07-14 16:08         ` Ilya Enkovich
2011-07-14 10:38   ` Ilya Enkovich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).