On Thu, May 08, 2014 at 07:25:50AM +0100, Richard Sandiford wrote:
> Trevor Saunders <tsaunders@mozilla.com> writes:
> > On Wed, May 07, 2014 at 09:52:49PM +0100, Richard Sandiford wrote:
> >> I noticed for_each_rtx showing up in profiles and thought I'd have a go
> >> at using worklist-based iterators instead.  So far I have three:
> >> 
> >>   FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
> >>   FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
> >>   FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *
> >> 
> >> with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.
> >> 
> >> I made FOR_EACH_SUBRTX the "default" (unsuffixed) version because
> >> most walks really don't modify the structure.  I think we should
> >> encourage const_rtxes to be used whereever possible.  E.g. it might
> >> make it easier to have non-GC storage for temporary rtxes in future.
> >> 
> >> I've locally replaced all for_each_rtx calls in the generic code with
> >> these iterators and they make things reproducably faster.  The speed-up
> >> on full --enable-checking=release ./cc1 and ./cc1plus times is only about 1%,
> >> but maybe that's enough to justify the churn.
> >
> > seems pretty nice, and it seems like it'll make code a little more
> > readable too :)
> >
> >> Implementation-wise, the main observation is that most subrtxes are part
> >> of a single contiguous sequence of "e" fields.  E.g. when compiling an
> >> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
> >> subrtxes of 7,636,542 rtxes.  Of those:
> >> 
> >> (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
> >> (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
> >>                       no "E" fields, and
> >> (C)    43,532 (00.6%) are more complicated.
> >> 
> >> (A) is really a special case of (B) in which the block has zero length.
> >> Those are the only two cases that really need to be handled inline.
> >> The implementation does this by having a mapping from an rtx code to the
> >> bounds of its "e" sequence, in the form of a start index and count.
> >> 
> >> Out of (C), the vast majority (43,509) are PARALLELs.  However, as you'd
> >> probably expect, bloating the inline code with that case made things
> >> slower rather than faster.
> >> 
> >> The vast majority (in fact all in the combine.ii run above) of iterations
> >> can be done with a 16-element stack worklist.  We obviously still need a
> >> heap fallback for the pathological cases though.
> >> 
> >> I spent a bit of time trying different iterator implementations and
> >> seeing which produced the best code.  Specific results from that were:
> >> 
> >> - The storage used for the worklist is separate from the iterator,
> >>   in order to avoid capturing iterator fields.
> >> 
> >> - Although the natural type of the storage would be auto_vec <..., 16>,
> >>   that produced some overhead compared with a separate stack array and heap
> >>   vector pointer.  With the heap vector pointer, the only overhead is an
> >>   assignment in the constructor and an "if (x) release (x)"-style sequence
> >>   in the destructor.  I think the extra complication over auto_vec is worth
> >>   it because in this case the heap version is so very rarely needed.
> >
> > hm, where does the overhead come from exactly? it seems like if  its
> >  faster to use vec<T, va_heap, vl_embedd> *foo; we should fix something
> >  about vectors since this isn't the only place it could matter.  does it
> >  matter if you use vec<T, va_heap, vl_embedd> * or vec<T> ? the second
> >  is basically just a wrapper around the former I'd expect has no effect.
> >  I'm not saying you're doing the wrong thing here, but if we can make
> >  generic vectors faster we probably should ;) or is the issue the
> >  __builtin_expect()s you can add?
> 
> Part of the problem is that by having an array in the vec itself,
> the other fields effectively have their address taken too.
> So m_alloc, m_num and m_using_auto_storage need to be set up
> and maintained on the stack, even though we're almost sure that they
> will never be used.

ok

> >> - The maximum number of fields in (B)-type rtxes is 3.  We get better
> >>   code by making that explicit rather than having a general loop.
> >> 
> >> - (C) codes map to an "e" count of UCHAR_MAX, so we can use a single
> >>   check to test for that and for cases where the stack worklist is
> >>   too small.
> >
> >  can we use uint8_t?
> 
> We don't really use that in GCC yet.  I don't mind setting a precedent
> though :-)
> 
> >> To give an example:
> >> 
> >> /* Callback for for_each_rtx, that returns 1 upon encountering a VALUE
> >>    whose UID is greater than the int uid that D points to.  */
> >> 
> >> static int
> >> refs_newer_value_cb (rtx *x, void *d)
> >> {
> >>   if (GET_CODE (*x) == VALUE && CSELIB_VAL_PTR (*x)->uid > *(int *)d)
> >>     return 1;
> >> 
> >>   return 0;
> >> }
> >> 
> >> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
> >>    that of V.  */
> >> 
> >> static bool
> >> refs_newer_value_p (rtx expr, rtx v)
> >> {
> >>   int minuid = CSELIB_VAL_PTR (v)->uid;
> >> 
> >>   return for_each_rtx (&expr, refs_newer_value_cb, &minuid);
> >> }
> >> 
> >> becomes:
> >> 
> >> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
> >>    that of V.  */
> >> 
> >> static bool
> >> refs_newer_value_p (const_rtx expr, rtx v)
> >> {
> >>   int minuid = CSELIB_VAL_PTR (v)->uid;
> >>   subrtx_iterator::array_type array;
> >
> > some reason to not hide it in the macro?
> 
> I wanted to make the macro a true for loop, with no extra baggage.
> AFAIK there's no way of declaring both the array and the iterator
> in the first part of the for loop without making them part of the
> same object (and thus capturing the iterator fields again).
> A "do ... while (0)" wrapper might be too surprising.

yeah, for some reason I thought you could do for (T x, U y; ..) but
you seem to be right.

> Also, keeping the array separate allows you to reuse it between
> iterations, so you only do the allocation and free once.  E.g.:

fair.
Trev

> 
>   subrtx_iterator::array_type array;
>   for (rtx insn = ...; insn; insn = NEXT_INSN (insn))
>     if (INSN_P (insn))
>       FOR_EACH_SUBRTX (...)
> 
> >> +template <typename T>
> >> +inline generic_subrtx_iterator <T>::~generic_subrtx_iterator ()
> >> +{
> >> +}
> >
> >  What's wrong with just letting the compiler generate that for you?
> 
> Bah, thanks for catching that.  It was left over from an earlier
> experiment in which the destructor did do some cleanup of the array.
> 
> >> @@ -78,6 +82,93 @@ static int non_rtx_starting_operands[NUM_RTX_CODE];
> >>  static unsigned int
> >>  num_sign_bit_copies_in_rep[MAX_MODE_INT + 1][MAX_MODE_INT + 1];
> >>  
> >> +/* Store X into index I of ARRAY.  ARRAY is known to have at least I
> >> +   elements.  Return the new base of ARRAY.  */
> >> +
> >> +template <typename T>
> >> +typename T::value_type *
> >> +generic_subrtx_iterator <T>::add_single_to_queue (array_type &array,
> >> +						  value_type *base,
> >> +						  size_t i, value_type x)
> >> +{
> >> +  if (base == array.stack)
> >> +    {
> >> +      if (i < LOCAL_ELEMS)
> >> +	{
> >> +	  base[i] = x;
> >> +	  return base;
> >> +	}
> >
> > it seems like this first little bit might be worth inlining, but I guess
> > you tested and it wasn't.
> 
> Yeah.  The cases where an "e" or entire "E" fit within the stack buffer
> are already handled inline.  So in practice we only get here if we know
> we have blown or are about to blow the stack buffer.  The case it
> handles is where part of an "E" field fits within the buffer and part of
> it doesn't.
> 
> One option would have been to force the heap buffer to be used when
> entering this function.  But that would then make the simple check:
> 
> 		  if (m_end > LOCAL_ELEMS)
> 		    m_base = m_array.heap->address ();
> 
> unsafe.  It seemed better to make add_single_to_queue general and
> handle both stack and heap pushes.
> 
> Thanks,
> Richard