On Thu, May 08, 2014 at 07:25:50AM +0100, Richard Sandiford wrote: > Trevor Saunders writes: > > On Wed, May 07, 2014 at 09:52:49PM +0100, Richard Sandiford wrote: > >> I noticed for_each_rtx showing up in profiles and thought I'd have a go > >> at using worklist-based iterators instead. So far I have three: > >> > >> FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx > >> FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx > >> FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx * > >> > >> with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement. > >> > >> I made FOR_EACH_SUBRTX the "default" (unsuffixed) version because > >> most walks really don't modify the structure. I think we should > >> encourage const_rtxes to be used whereever possible. E.g. it might > >> make it easier to have non-GC storage for temporary rtxes in future. > >> > >> I've locally replaced all for_each_rtx calls in the generic code with > >> these iterators and they make things reproducably faster. The speed-up > >> on full --enable-checking=release ./cc1 and ./cc1plus times is only about 1%, > >> but maybe that's enough to justify the churn. > > > > seems pretty nice, and it seems like it'll make code a little more > > readable too :) > > > >> Implementation-wise, the main observation is that most subrtxes are part > >> of a single contiguous sequence of "e" fields. E.g. when compiling an > >> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the > >> subrtxes of 7,636,542 rtxes. Of those: > >> > >> (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields, > >> (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and > >> no "E" fields, and > >> (C) 43,532 (00.6%) are more complicated. > >> > >> (A) is really a special case of (B) in which the block has zero length. > >> Those are the only two cases that really need to be handled inline. > >> The implementation does this by having a mapping from an rtx code to the > >> bounds of its "e" sequence, in the form of a start index and count. > >> > >> Out of (C), the vast majority (43,509) are PARALLELs. However, as you'd > >> probably expect, bloating the inline code with that case made things > >> slower rather than faster. > >> > >> The vast majority (in fact all in the combine.ii run above) of iterations > >> can be done with a 16-element stack worklist. We obviously still need a > >> heap fallback for the pathological cases though. > >> > >> I spent a bit of time trying different iterator implementations and > >> seeing which produced the best code. Specific results from that were: > >> > >> - The storage used for the worklist is separate from the iterator, > >> in order to avoid capturing iterator fields. > >> > >> - Although the natural type of the storage would be auto_vec <..., 16>, > >> that produced some overhead compared with a separate stack array and heap > >> vector pointer. With the heap vector pointer, the only overhead is an > >> assignment in the constructor and an "if (x) release (x)"-style sequence > >> in the destructor. I think the extra complication over auto_vec is worth > >> it because in this case the heap version is so very rarely needed. > > > > hm, where does the overhead come from exactly? it seems like if its > > faster to use vec *foo; we should fix something > > about vectors since this isn't the only place it could matter. does it > > matter if you use vec * or vec ? the second > > is basically just a wrapper around the former I'd expect has no effect. > > I'm not saying you're doing the wrong thing here, but if we can make > > generic vectors faster we probably should ;) or is the issue the > > __builtin_expect()s you can add? > > Part of the problem is that by having an array in the vec itself, > the other fields effectively have their address taken too. > So m_alloc, m_num and m_using_auto_storage need to be set up > and maintained on the stack, even though we're almost sure that they > will never be used. ok > >> - The maximum number of fields in (B)-type rtxes is 3. We get better > >> code by making that explicit rather than having a general loop. > >> > >> - (C) codes map to an "e" count of UCHAR_MAX, so we can use a single > >> check to test for that and for cases where the stack worklist is > >> too small. > > > > can we use uint8_t? > > We don't really use that in GCC yet. I don't mind setting a precedent > though :-) > > >> To give an example: > >> > >> /* Callback for for_each_rtx, that returns 1 upon encountering a VALUE > >> whose UID is greater than the int uid that D points to. */ > >> > >> static int > >> refs_newer_value_cb (rtx *x, void *d) > >> { > >> if (GET_CODE (*x) == VALUE && CSELIB_VAL_PTR (*x)->uid > *(int *)d) > >> return 1; > >> > >> return 0; > >> } > >> > >> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than > >> that of V. */ > >> > >> static bool > >> refs_newer_value_p (rtx expr, rtx v) > >> { > >> int minuid = CSELIB_VAL_PTR (v)->uid; > >> > >> return for_each_rtx (&expr, refs_newer_value_cb, &minuid); > >> } > >> > >> becomes: > >> > >> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than > >> that of V. */ > >> > >> static bool > >> refs_newer_value_p (const_rtx expr, rtx v) > >> { > >> int minuid = CSELIB_VAL_PTR (v)->uid; > >> subrtx_iterator::array_type array; > > > > some reason to not hide it in the macro? > > I wanted to make the macro a true for loop, with no extra baggage. > AFAIK there's no way of declaring both the array and the iterator > in the first part of the for loop without making them part of the > same object (and thus capturing the iterator fields again). > A "do ... while (0)" wrapper might be too surprising. yeah, for some reason I thought you could do for (T x, U y; ..) but you seem to be right. > Also, keeping the array separate allows you to reuse it between > iterations, so you only do the allocation and free once. E.g.: fair. Trev > > subrtx_iterator::array_type array; > for (rtx insn = ...; insn; insn = NEXT_INSN (insn)) > if (INSN_P (insn)) > FOR_EACH_SUBRTX (...) > > >> +template > >> +inline generic_subrtx_iterator ::~generic_subrtx_iterator () > >> +{ > >> +} > > > > What's wrong with just letting the compiler generate that for you? > > Bah, thanks for catching that. It was left over from an earlier > experiment in which the destructor did do some cleanup of the array. > > >> @@ -78,6 +82,93 @@ static int non_rtx_starting_operands[NUM_RTX_CODE]; > >> static unsigned int > >> num_sign_bit_copies_in_rep[MAX_MODE_INT + 1][MAX_MODE_INT + 1]; > >> > >> +/* Store X into index I of ARRAY. ARRAY is known to have at least I > >> + elements. Return the new base of ARRAY. */ > >> + > >> +template > >> +typename T::value_type * > >> +generic_subrtx_iterator ::add_single_to_queue (array_type &array, > >> + value_type *base, > >> + size_t i, value_type x) > >> +{ > >> + if (base == array.stack) > >> + { > >> + if (i < LOCAL_ELEMS) > >> + { > >> + base[i] = x; > >> + return base; > >> + } > > > > it seems like this first little bit might be worth inlining, but I guess > > you tested and it wasn't. > > Yeah. The cases where an "e" or entire "E" fit within the stack buffer > are already handled inline. So in practice we only get here if we know > we have blown or are about to blow the stack buffer. The case it > handles is where part of an "E" field fits within the buffer and part of > it doesn't. > > One option would have been to force the heap buffer to be used when > entering this function. But that would then make the simple check: > > if (m_end > LOCAL_ELEMS) > m_base = m_array.heap->address (); > > unsafe. It seemed better to make add_single_to_queue general and > handle both stack and heap pushes. > > Thanks, > Richard