Hi

I am posting this patch on behalf of Carey (cc'ed). I also have some
review comments that I will make as a reply to this later.


This implements a new AArch64 specific back-end pass that helps optimize
branch-dense code, which can be a bottleneck for performance on some Arm
cores. This is achieved by padding out the branch-dense sections of the
instruction stream with nops.

This has proven to show up to a 2.61%~ improvement on the Cortex A-72
(SPEC CPU 2006: sjeng).

The implementation includes the addition of a new RTX instruction class
FILLER_INSN, which has been white listed to allow placement of NOPs
outside of a basic block. This is to allow padding after unconditional
branches. This is favorable so that any performance gained from
diluting branches is not paid straight back via excessive eating of nops.

It was deemed that a new RTX class was less invasive than modifying
behavior in regards to standard UNSPEC nops.

## Command Line Options

Three new target-specific options are provided:
- mbranch-dilution
- mbranch-dilution-granularity={num}
- mbranch-dilution-max-branches={num}

A number of cores known to be able to benefit from this pass have been
given default tuning values for their granularity and max-branches.
Each affected core has a very specific granule size and associated
max-branch limit. This is a microarchitecture specific optimization. 
Typical usage should be -mdilute-branches with a specificed -mcpu. Cores 
with a granularity tuned to 0 will be ignored. Options are provided for 
experimentation.

## Algorithm and Heuristic

The pass takes a very simple 'sliding window' approach to the problem. 
We crawl through each instruction (starting at the first branch) and 
keep track of the number of branches within the current "granule" (or 
window). When this exceeds the max-branch value, the pass will dilute 
the current granule, inserting nops to push out some of the branches. 
The heuristic will favour unconditonal branches (for performance 
reasons), or branches that are between two other branches (in order to 
decrease the likelihood of another dilution call being needed).

Each branch type required a different method for nop insertion due to 
RTL/basic_block restrictions:

- Returning calls do not end a basic block so can be handled by emitting
a generic nop.
- Unconditional branches must be the end of a basic block, and nops 
cannot be outside of a basic block.
   Thus the need for FILLER_INSN, which allows placement outside of a 
basic block - and translates to a nop.
- For most conditional branches we've taken a simple approach and only 
handle the fallthru edge for simplicity,
   which we do by inserting a "nop block" of nops on the fallthru edge, 
mapping that back to the original destination block.
- asm gotos and pcsets are going to be tricky to analyse from a dilution 
perspective so are ignored at present.


## Changelog

gcc/testsuite/ChangeLog:

2018-11-09  Carey Williams  <Carey.Williams@arm.com>

	* gcc.target/aarch64/branch-dilution-off.c: New test.
	* gcc.target/aarch64/branch-dilution-on.c: New test.


gcc/ChangeLog:

2018-11-09  Carey Williams  <Carey.Williams@arm.com>

	* cfgbuild.c (inside_basic_block_p): Add FILLER_INSN case.
	* cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
	basic blocks.
	* config.gcc (extra_objs): Add aarch64-branch-dilution.o.
	* config/aarch64/aarch64-branch-dilution.c: New file.
	* config/aarch64/aarch64-passes.def (branch-dilution): Register
	pass.
	* config/aarch64/aarch64-protos.h (struct tune_params): Declare
	tuning parameters bdilution_gsize and bdilution_maxb.
	(make_pass_branch_dilution): New declaration.
	* config/aarch64/aarch64.c (generic_tunings,cortexa35_tunings,
	cortexa53_tunings,cortexa57_tunings,cortexa72_tunings,
	cortexa73_tunings,exynosm1_tunings,thunderxt88_tunings,
	thunderx_tunings,tsv110_tunings,xgene1_tunings,
	qdf24xx_tunings,saphira_tunings,thunderx2t99_tunings):
	Provide default tunings for bdilution_gsize and bdilution_maxb.
	* config/aarch64/aarch64.md (filler_insn): Define new insn.
	* config/aarch64/aarch64.opt (mbranch-dilution,
	mbranch-dilution-granularity,
	mbranch-dilution-max-branches): Define new branch dilution
	options.
	* config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
	for aarch64-branch-dilution.c.
	* coretypes.h (rtx_filler_insn): New rtx class.
	* doc/invoke.texi (mbranch-dilution,
	mbranch-dilution-granularity,
	mbranch-dilution-max-branches): Document branch dilution
	options.
	* emit-rtl.c (emit_filler_after): New emit function.
	* rtl.def (FILLER_INSN): New RTL EXPR of type RTX_INSN.
	* rtl.h (class GTY): New class for rtx_filler_insn.
	(is_a_helper ::test): New test helper for rtx_filler_insn.
	(macro FILLER_INSN_P(X)): New predicate.
	* target-insns.def (filler_insn): Add target insn def.

### Testing
- Successful compilation of 3 stage bootstrap with the pass forced on 
(for stage 2, 3)
- No additional compilation failures (SPEC CPU 2006 and SPEC CPU 2017)
- No 'make check' regressions

Is this ok for trunk?

Thanks
Sudi