public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* RFC: Faster for_each_rtx-like iterators
@ 2014-05-07 20:52 Richard Sandiford
  2014-05-07 21:31 ` Mike Stump
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Richard Sandiford @ 2014-05-07 20:52 UTC (permalink / raw)
  To: gcc-patches

I noticed for_each_rtx showing up in profiles and thought I'd have a go
at using worklist-based iterators instead.  So far I have three:

  FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
  FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
  FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *

with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.

I made FOR_EACH_SUBRTX the "default" (unsuffixed) version because
most walks really don't modify the structure.  I think we should
encourage const_rtxes to be used whereever possible.  E.g. it might
make it easier to have non-GC storage for temporary rtxes in future.

I've locally replaced all for_each_rtx calls in the generic code with
these iterators and they make things reproducably faster.  The speed-up
on full --enable-checking=release ./cc1 and ./cc1plus times is only about 1%,
but maybe that's enough to justify the churn.

Implementation-wise, the main observation is that most subrtxes are part
of a single contiguous sequence of "e" fields.  E.g. when compiling an
oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
subrtxes of 7,636,542 rtxes.  Of those:

(A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
(B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
                      no "E" fields, and
(C)    43,532 (00.6%) are more complicated.

(A) is really a special case of (B) in which the block has zero length.
Those are the only two cases that really need to be handled inline.
The implementation does this by having a mapping from an rtx code to the
bounds of its "e" sequence, in the form of a start index and count.

Out of (C), the vast majority (43,509) are PARALLELs.  However, as you'd
probably expect, bloating the inline code with that case made things
slower rather than faster.

The vast majority (in fact all in the combine.ii run above) of iterations
can be done with a 16-element stack worklist.  We obviously still need a
heap fallback for the pathological cases though.

I spent a bit of time trying different iterator implementations and
seeing which produced the best code.  Specific results from that were:

- The storage used for the worklist is separate from the iterator,
  in order to avoid capturing iterator fields.

- Although the natural type of the storage would be auto_vec <..., 16>,
  that produced some overhead compared with a separate stack array and heap
  vector pointer.  With the heap vector pointer, the only overhead is an
  assignment in the constructor and an "if (x) release (x)"-style sequence
  in the destructor.  I think the extra complication over auto_vec is worth
  it because in this case the heap version is so very rarely needed.

- Several existing for_each_rtx callbacks have something like:

    if (GET_CODE (x) == CONST)
      return -1;

  or:

    if (CONSTANT_P (x))
      return -1;

  to avoid walking subrtxes of constants.  That can be done without
  extra code checks and branches by having a separate code->bound
  mapping in which all constants are treated as leaf rtxes.  This usage
  should be common enough to outweigh the cache penalty of two arrays.

  The choice between iterating over constants or not is given in the
  final parameter of the FOR_EACH_* iterator.

- The maximum number of fields in (B)-type rtxes is 3.  We get better
  code by making that explicit rather than having a general loop.

- (C) codes map to an "e" count of UCHAR_MAX, so we can use a single
  check to test for that and for cases where the stack worklist is
  too small.

To give an example:

/* Callback for for_each_rtx, that returns 1 upon encountering a VALUE
   whose UID is greater than the int uid that D points to.  */

static int
refs_newer_value_cb (rtx *x, void *d)
{
  if (GET_CODE (*x) == VALUE && CSELIB_VAL_PTR (*x)->uid > *(int *)d)
    return 1;

  return 0;
}

/* Return TRUE if EXPR refers to a VALUE whose uid is greater than
   that of V.  */

static bool
refs_newer_value_p (rtx expr, rtx v)
{
  int minuid = CSELIB_VAL_PTR (v)->uid;

  return for_each_rtx (&expr, refs_newer_value_cb, &minuid);
}

becomes:

/* Return TRUE if EXPR refers to a VALUE whose uid is greater than
   that of V.  */

static bool
refs_newer_value_p (const_rtx expr, rtx v)
{
  int minuid = CSELIB_VAL_PTR (v)->uid;
  subrtx_iterator::array_type array;
  FOR_EACH_SUBRTX (iter, array, expr, NONCONST)
    if (GET_CODE (*iter) == VALUE && CSELIB_VAL_PTR (*iter)->uid > minuid)
      return true;
  return false;
}

The iterator also allows subrtxes of a specific rtx to be skipped;
this is the equivalent of returning -1 from a for_each_rtx callback.
It also allows the current rtx to be replaced in the worklist by
another.  E.g.:

static void
mark_constants_in_pattern (rtx insn)
{
  subrtx_iterator::array_type array;
  FOR_EACH_SUBRTX (iter, array, PATTERN (insn), ALL)
    {
      const_rtx x = *iter;
      if (GET_CODE (x) == SYMBOL_REF)
	{
	  if (CONSTANT_POOL_ADDRESS_P (x))
	    {
	      struct constant_descriptor_rtx *desc = SYMBOL_REF_CONSTANT (x);
	      if (desc->mark == 0)
		{
		  desc->mark = 1;
		  iter.substitute (desc->constant);
		}
	    }
	  else if (TREE_CONSTANT_POOL_ADDRESS_P (x))
	    {
	      tree decl = SYMBOL_REF_DECL (x);
	      if (!TREE_ASM_WRITTEN (DECL_INITIAL (decl)))
		{
		  n_deferred_constants--;
		  output_constant_def_contents (CONST_CAST_RTX (x));
		}
	    }
	}
    }
}

where the iter.substitute avoids a recursive walk.  (The CONST_CAST_RTX
is "temporary".)

Does this look like it's worth pursuing?  If so, is it OK to start
submitting now or should it wait until 4.9.1 is out?  If it is OK,
I plan to convert the ports too and get rid of for_each_rtx.

Thanks,
Richard


diff --git a/gcc/rtl-iter.h b/gcc/rtl-iter.h
new file mode 100644
index 0000000..1bc171d
--- /dev/null
+++ b/gcc/rtl-iter.h
@@ -0,0 +1,293 @@
+/* RTL iterators
+   Copyright (C) 2014 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+/* This structure describes the subrtxes of an rtx as follows:
+
+   - if the rtx has no subrtxes, START and COUNT are both 0.
+
+   - if all the subrtxes of an rtx are stored in a contiguous block
+     of XEXPs ("e"s), START is the index of the first XEXP and COUNT
+     is the number of them.
+
+   - otherwise START is arbitrary and COUNT is UCHAR_MAX.  */
+struct rtx_subrtx_bound_info {
+  unsigned char start;
+  unsigned char count;
+};
+extern rtx_subrtx_bound_info rtx_all_subrtx_bounds[];
+extern rtx_subrtx_bound_info rtx_nonconst_subrtx_bounds[];
+
+/* Return true if CODE has no subrtxes.  */
+
+static inline bool
+leaf_code_p (enum rtx_code code)
+{
+  return rtx_all_subrtx_bounds[code].count == 0;
+}
+
+/* Used to iterate over subrtxes of an rtx.  T abstracts the type of
+   access.  */
+template <typename T>
+class generic_subrtx_iterator
+{
+  static const size_t LOCAL_ELEMS = 16;
+  typedef typename T::value_type value_type;
+  typedef typename T::rtx_type rtx_type;
+  typedef typename T::rtunion_type rtunion_type;
+
+public:
+  struct array_type
+  {
+    array_type ();
+    ~array_type ();
+    value_type stack[LOCAL_ELEMS];
+    vec <value_type, va_heap, vl_embed> *heap;
+  };
+  generic_subrtx_iterator (array_type &, value_type,
+			   const rtx_subrtx_bound_info *);
+  ~generic_subrtx_iterator ();
+
+  value_type operator * () const;
+  bool at_end () const;
+  void next ();
+  void skip_subrtxes ();
+  void substitute (value_type);
+
+private:
+  /* The bounds to use for iterating over subrtxes.  */
+  const rtx_subrtx_bound_info *m_bounds;
+
+  /* The storage used for the worklist.  */
+  array_type &m_array;
+
+  /* The current rtx.  */
+  value_type m_current;
+
+  /* The base of the current worklist.  */
+  value_type *m_base;
+
+  /* The number of subrtxes in M_BASE.  */
+  size_t m_end;
+
+  /* The following booleans shouldn't end up in registers or memory
+     but just direct control flow.  */
+
+  /* True if the iteration is over.  */
+  bool m_done;
+
+  /* True if we should skip the subrtxes of M_CURRENT.  */
+  bool m_skip;
+
+  /* True if M_CURRENT has been replaced with a different rtx.  */
+  bool m_substitute;
+
+  static void free_array (array_type &);
+  static size_t add_subrtxes_to_queue (array_type &, value_type *, size_t,
+				       rtx_type);
+  static value_type *add_single_to_queue (array_type &, value_type *, size_t,
+					  value_type);
+};
+
+template <typename T>
+inline generic_subrtx_iterator <T>::array_type::array_type () : heap (0) {}
+
+template <typename T>
+inline generic_subrtx_iterator <T>::array_type::~array_type ()
+{
+  if (__builtin_expect (heap != 0, false))
+    free_array (*this);
+}
+
+/* Iterate over X and its subrtxes, in arbitrary order.  Use ARRAY to
+   store the worklist.  We use an external array in order to avoid
+   capturing the fields of this structure when taking the address of
+   the array.  Use BOUNDS to find the bounds of simple "e"-string codes.  */
+
+template <typename T>
+inline generic_subrtx_iterator <T>::
+generic_subrtx_iterator (array_type &array, value_type x,
+			 const rtx_subrtx_bound_info *bounds)
+  : m_bounds (bounds),
+    m_array (array),
+    m_current (x),
+    m_base (m_array.stack),
+    m_end (0),
+    m_done (false),
+    m_skip (false),
+    m_substitute (false)
+{
+}
+
+template <typename T>
+inline generic_subrtx_iterator <T>::~generic_subrtx_iterator ()
+{
+}
+
+/* Return the current subrtx.  */
+
+template <typename T>
+inline typename T::value_type
+generic_subrtx_iterator <T>::operator * () const
+{
+  return m_current;
+}
+
+/* Return true if the iteration has finished.  */
+
+template <typename T>
+inline bool
+generic_subrtx_iterator <T>::at_end () const
+{
+  return m_done;
+}
+
+/* Move on to the next subrtx.  */
+
+template <typename T>
+inline void
+generic_subrtx_iterator <T>::next ()
+{
+  if (m_substitute)
+    {
+      m_substitute = false;
+      m_skip = false;
+      return;
+    }
+  if (!m_skip)
+    {
+      /* Add the subrtxes of M_CURRENT.  */
+      rtx_type x = T::get_rtx (m_current);
+      if (__builtin_expect (x != 0, true))
+	{
+	  enum rtx_code code = GET_CODE (x);
+	  ssize_t count = m_bounds[code].count;
+	  if (count > 0)
+	    {
+	      /* Handle the simple case of a single "e" block that is known
+		 to fit into the current array.  */
+	      if (__builtin_expect (m_end + count <= LOCAL_ELEMS + 1, true))
+		{
+		  /* Set M_CURRENT to the first subrtx and queue the rest.  */
+		  ssize_t start = m_bounds[code].start;
+		  rtunion_type *src = &x->u.fld[start];
+		  if (__builtin_expect (count > 2, false))
+		    m_base[m_end++] = T::get_value (src[2].rt_rtx);
+		  if (count > 1)
+		    m_base[m_end++] = T::get_value (src[1].rt_rtx);
+		  m_current = T::get_value (src[0].rt_rtx);
+		  return;
+		}
+	      /* Handle cases which aren't simple "e" sequences or where
+		 the sequence might overrun M_BASE.  */
+	      count = add_subrtxes_to_queue (m_array, m_base, m_end, x);
+	      if (count > 0)
+		{
+		  m_end += count;
+		  if (m_end > LOCAL_ELEMS)
+		    m_base = m_array.heap->address ();
+		  m_current = m_base[--m_end];
+		  return;
+		}
+	    }
+	}
+    }
+  else
+    m_skip = false;
+  if (m_end == 0)
+    m_done = true;
+  else
+    m_current = m_base[--m_end];
+}
+
+/* Skip the subrtxes of the current rtx.  */
+
+template <typename T>
+inline void
+generic_subrtx_iterator <T>::skip_subrtxes ()
+{
+  m_skip = true;
+}
+
+/* Ignore the subrtxes of the current rtx and look at X instead.  */
+
+template <typename T>
+inline void
+generic_subrtx_iterator <T>::substitute (value_type x)
+{
+  m_substitute = true;
+  m_current = x;
+}
+
+/* Iterators for const_rtx.  */
+struct const_rtx_accessor
+{
+  typedef const_rtx value_type;
+  typedef const_rtx rtx_type;
+  typedef const rtunion rtunion_type;
+  static rtx_type get_rtx (value_type x) { return x; }
+  static value_type get_value (rtx_type x) { return x; }
+};
+typedef generic_subrtx_iterator <const_rtx_accessor> subrtx_iterator;
+
+/* Iterators for non-constant rtx.  */
+struct rtx_var_accessor
+{
+  typedef rtx value_type;
+  typedef rtx rtx_type;
+  typedef rtunion rtunion_type;
+  static rtx_type get_rtx (value_type x) { return x; }
+  static value_type get_value (rtx_type x) { return x; }
+};
+typedef generic_subrtx_iterator <rtx_var_accessor> subrtx_var_iterator;
+
+/* Iterators for rtx *.  */
+struct rtx_ptr_accessor
+{
+  typedef rtx *value_type;
+  typedef rtx rtx_type;
+  typedef rtunion rtunion_type;
+  static rtx_type get_rtx (value_type ptr) { return *ptr; }
+  static value_type get_value (rtx_type &x) { return &x; }
+};
+typedef generic_subrtx_iterator <rtx_ptr_accessor> subrtx_ptr_iterator;
+
+#define ALL_BOUNDS rtx_all_subrtx_bounds
+#define NONCONST_BOUNDS rtx_nonconst_subrtx_bounds
+
+/* Use ITER to iterate over const_rtx X and its recursive subrtxes,
+   using subrtx_iterator::array ARRAY as the storage for the worklist.
+   ARRAY can be reused for multiple consecutive iterations but shouldn't
+   be shared by two concurrent iterations.  TYPE is ALL if all subrtxes
+   are of interest or NONCONST if it is safe to ignore subrtxes of
+   constants.  */
+#define FOR_EACH_SUBRTX(ITER, ARRAY, X, TYPE) \
+  for (subrtx_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \
+       ITER.next ())
+
+/* Like FOR_EACH_SUBRTX, but iterate over subrtxes of an rtx X.  */
+#define FOR_EACH_SUBRTX_VAR(ITER, ARRAY, X, TYPE) \
+  for (subrtx_var_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \
+       ITER.next ())
+
+/* Like FOR_EACH_SUBRTX, but iterate over subrtx pointers of rtx pointer X.
+   For example, if X is &PATTERN (insn) and the pattern is a SET, iterate
+   over &PATTERN (insn), &SET_DEST (PATTERN (insn)), etc.  */
+#define FOR_EACH_SUBRTX_PTR(ITER, ARRAY, X, TYPE) \
+  for (subrtx_ptr_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \
+       ITER.next ())
diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index f3471b1..df52788 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -37,6 +37,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree.h"
 #include "emit-rtl.h"  /* FIXME: Can go away once crtl is moved to rtl.h.  */
 #include "addresses.h"
+#include "rtl-iter.h"
 
 /* Forward declarations */
 static void set_of_1 (rtx, const_rtx, void *);
@@ -62,6 +63,9 @@ static unsigned int num_sign_bit_copies1 (const_rtx, enum machine_mode, const_rt
    -1 if a code has no such operand.  */
 static int non_rtx_starting_operands[NUM_RTX_CODE];
 
+rtx_subrtx_bound_info rtx_all_subrtx_bounds[NUM_RTX_CODE];
+rtx_subrtx_bound_info rtx_nonconst_subrtx_bounds[NUM_RTX_CODE];
+
 /* Truncation narrows the mode from SOURCE mode to DESTINATION mode.
    If TARGET_MODE_REP_EXTENDED (DESTINATION, DESTINATION_REP) is
    SIGN_EXTEND then while narrowing we also have to enforce the
@@ -78,6 +82,93 @@ static int non_rtx_starting_operands[NUM_RTX_CODE];
 static unsigned int
 num_sign_bit_copies_in_rep[MAX_MODE_INT + 1][MAX_MODE_INT + 1];
 \f
+/* Store X into index I of ARRAY.  ARRAY is known to have at least I
+   elements.  Return the new base of ARRAY.  */
+
+template <typename T>
+typename T::value_type *
+generic_subrtx_iterator <T>::add_single_to_queue (array_type &array,
+						  value_type *base,
+						  size_t i, value_type x)
+{
+  if (base == array.stack)
+    {
+      if (i < LOCAL_ELEMS)
+	{
+	  base[i] = x;
+	  return base;
+	}
+      gcc_checking_assert (i == LOCAL_ELEMS);
+      vec_safe_grow (array.heap, i + 1);
+      base = array.heap->address ();
+      memcpy (base, array.stack, sizeof (array.stack));
+      base[LOCAL_ELEMS] = x;
+      return base;
+    }
+  unsigned int length = array.heap->length ();
+  if (length > i)
+    {
+      gcc_checking_assert (base == array.heap->address ());
+      base[i] = x;
+      return base;
+    }
+  else
+    {
+      gcc_checking_assert (i == length);
+      vec_safe_push (array.heap, x);
+      return array.heap->address ();
+    }
+}
+
+/* Add the subrtxes of X to worklist ARRAY, starting at END.  Return the
+   number of elements added to the worklist.  */
+
+template <typename T>
+size_t
+generic_subrtx_iterator <T>::add_subrtxes_to_queue (array_type &array,
+						    value_type *base,
+						    size_t end, rtx_type x)
+{
+  const char *format = GET_RTX_FORMAT (GET_CODE (x));
+  size_t orig_end = end;
+  for (int i = 0; format[i]; ++i)
+    if (format[i] == 'e')
+      {
+	value_type subx = T::get_value (x->u.fld[i].rt_rtx);
+	if (__builtin_expect (end < LOCAL_ELEMS, true))
+	  base[end++] = subx;
+	else
+	  base = add_single_to_queue (array, base, end++, subx);
+      }
+    else if (format[i] == 'E')
+      {
+	int length = GET_NUM_ELEM (x->u.fld[i].rt_rtvec);
+	rtx *vec = x->u.fld[i].rt_rtvec->elem;
+	if (__builtin_expect (end + length <= LOCAL_ELEMS, true))
+	  for (int j = 0; j < length; j++)
+	    base[end++] = T::get_value (vec[j]);
+	else
+	  for (int j = 0; j < length; j++)
+	    base = add_single_to_queue (array, base, end++,
+					T::get_value (vec[j]));
+      }
+  return end - orig_end;
+}
+
+template <typename T>
+void
+generic_subrtx_iterator <T>::free_array (array_type &array)
+{
+  vec_free (array.heap);
+}
+
+template <typename T>
+const size_t generic_subrtx_iterator <T>::LOCAL_ELEMS;
+
+template class generic_subrtx_iterator <const_rtx_accessor>;
+template class generic_subrtx_iterator <rtx_var_accessor>;
+template class generic_subrtx_iterator <rtx_ptr_accessor>;
+
 /* Return 1 if the value of X is unstable
    (would be different at a different point in the program).
    The frame pointer, arg pointer, etc. are considered stable
@@ -5324,8 +5415,42 @@ truncated_to_mode (enum machine_mode mode, const_rtx x)
   return false;
 }
 \f
+/* Return true if RTX code CODE has a single sequence of zero or more
+   "e" operands and no rtvec operands.  Initialize its rtx_all_subrtx_bounds
+   entry in that case.  */
+
+static bool
+setup_reg_subrtx_bounds (unsigned int code)
+{
+  const char *format = GET_RTX_FORMAT ((enum rtx_code) code);
+  unsigned int i = 0;
+  for (; format[i] != 'e'; ++i)
+    {
+      if (!format[i])
+	/* No subrtxes.  Leave start and count as 0.  */
+	return true;
+      if (format[i] == 'E' || format[i] == 'V')
+	return false;
+    }
+
+  /* Record the sequence of 'e's.  */
+  rtx_all_subrtx_bounds[code].start = i;
+  do
+    ++i;
+  while (format[i] == 'e');
+  rtx_all_subrtx_bounds[code].count = i - rtx_all_subrtx_bounds[code].start;
+  /* rtl-iter.h relies on this.  */
+  gcc_checking_assert (rtx_all_subrtx_bounds[code].count <= 3);
+
+  for (; format[i]; ++i)
+    if (format[i] == 'E' || format[i] == 'V' || format[i] == 'e')
+      return false;
+
+  return true;
+}
+
 /* Initialize non_rtx_starting_operands, which is used to speed up
-   for_each_rtx.  */
+   for_each_rtx, and rtx_all_subrtx_bounds.  */
 void
 init_rtlanal (void)
 {
@@ -5335,6 +5460,10 @@ init_rtlanal (void)
       const char *format = GET_RTX_FORMAT (i);
       const char *first = strpbrk (format, "eEV");
       non_rtx_starting_operands[i] = first ? first - format : -1;
+      if (!setup_reg_subrtx_bounds (i))
+	rtx_all_subrtx_bounds[i].count = UCHAR_MAX;
+      if (GET_RTX_CLASS (i) != RTX_CONST_OBJ)
+	rtx_nonconst_subrtx_bounds[i] = rtx_all_subrtx_bounds[i];
     }
 
   init_num_sign_bit_copies_in_rep ();

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-05-19 18:46 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-07 20:52 RFC: Faster for_each_rtx-like iterators Richard Sandiford
2014-05-07 21:31 ` Mike Stump
2014-05-08  1:20 ` Trevor Saunders
2014-05-08  6:25   ` Richard Sandiford
2014-05-08  6:30     ` Richard Sandiford
2014-05-08 11:20     ` Trevor Saunders
2014-05-09  6:18 ` Jeff Law
2014-05-09 10:41   ` Richard Biener
2014-05-17  7:34   ` Richard Sandiford
2014-05-19 18:19     ` Jeff Law
2014-05-19 18:46       ` Richard Sandiford

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).