Hi,
PR52272 reported a performance regression in spec2006/410.bwaves once GCC is
prevented from representing address of one memory object using address of
another memory object.  Also as I commented in that PR, we have two possible
fixes for this:
1) Improve how TMR.base is deduced, so that we can represent addr of mem obj
using another one, while not breaking PR50955.
2) Add iv candidates with base object stripped.  In this way, we use the
common base-stripped part to represent all address expressions, in the form
of [base_1 + common], [base_2 + common], ..., [base_n + common].

In terms of code generation, method 2) is at least as good as 1), actually
better in my opinion.  The problem of 2) is we need to tell when iv
candidates should be added for the common part and when shouldn't.  This
issue can be generalized and described as: We know IVO tries to add
candidates by deriving from iv uses.  One disadvantage is that candidates
are derived from iv use independently.  It doesn't take common sub
expression among different iv uses into consideration.  As a result,
candidate for common sub expression is not added, while many useless
candidates are added.

As a matter of fact, candidate derived from iv use is useful only if it's
common enough and could be shared among different uses.  A candidate is most
likely useless if it's derived from a single use and could not be shared by
others.  This patch works in this way by firstly recording all kinds
candidates derived from iv uses, then adding candidates for common ones.

The patch improves 410.bwaves by 3-4% on x86_64.  I also saw regression for
400.perlbench and small regression for 401.bzip on x86_64, but I can confirm
they are false alarms caused by align issues. 
For aarch64, fp cases are obviously improved for both spec2000 and spec2006.
Also the patch causes 2-3% regression for 459.GemsFDTD, which I think is
another irrelevant issue caused by heuristic candidate selecting algorithm.
Unfortunately, I don't have fix to it currently.

This patch may add more candidates in some cases, but generally candidates
number is smaller because we don't need to add useless candidates now.
Statistic data shows there are quite fewer loops with more than 30
candidates when building spec2k6 on x86_64 using this patch.

Bootstrap and test on x86_64.  I will re-test it against latest trunk on
AArch64.  Is it OK?

Thanks,
bin

2015-11-03  Bin Cheng  <bin.cheng@arm.com>

	PR tree-optimization/52272
	* tree-ssa-loop-ivopts.c (struct iv_common_cand): New struct.
	(struct iv_common_cand_hasher): New struct.
	(iv_common_cand_hasher::hash): New function.
	(iv_common_cand_hasher::equal): New function.
	(struct ivopts_data): New fields, iv_common_cand_tab and
	iv_common_cands.
	(tree_ssa_iv_optimize_init): Initialize above fields.
	(record_common_cand, common_cand_cmp): New functions.
	(add_iv_candidate_derived_from_uses): New function.
	(add_iv_candidate_for_use): Record iv_common_cands derived from
	iv use in hash table, instead of adding candidates directly.
	(add_iv_candidate_for_uses): Call
add_iv_candidate_derived_from_uses.
	(record_important_candidates): Add important candidates to iv uses'
	related_cands.  Always keep related_cands for future use.
	(try_add_cand_for): Use iv uses' related_cands.
	(free_loop_data, tree_ssa_iv_optimize_finalize): Release new fields
	in struct ivopts_data, iv_common_cand_tab and iv_common_cands.