Hi,

I am posting this patch on behalf of Carey (cc'ed). I also have some review comments that I will make as a reply to this later.

This implements a new AArch64-specific back-end pass that helps optimize branch-dense code, which can be a performance bottleneck on some Arm cores. This is achieved by padding out the branch-dense sections of the instruction stream with nops.

This has been shown to give up to a ~2.61% improvement on the Cortex-A72 (SPEC CPU 2006: sjeng).

The implementation includes the addition of a new RTX instruction class, FILLER_INSN, which has been whitelisted to allow placement of nops outside of a basic block. This is to allow padding after unconditional branches. This is favorable so that any performance gained from diluting branches is not paid straight back through executing an excessive number of nops.

A new RTX class was deemed less invasive than modifying the behavior of standard UNSPEC nops.

## Command Line Options

Three new target-specific options are provided:

- -mbranch-dilution
- -mbranch-dilution-granularity={num}
- -mbranch-dilution-max-branches={num}

A number of cores known to benefit from this pass have been given default tuning values for their granularity and max-branches. Each affected core has a very specific granule size and associated max-branch limit. This is a microarchitecture-specific optimization. Typical usage should be -mbranch-dilution with a specified -mcpu. Cores with a granularity tuned to 0 will be ignored. The granularity and max-branches options are provided for experimentation.

## Algorithm and Heuristic

The pass takes a very simple 'sliding window' approach to the problem. We crawl through each instruction (starting at the first branch) and keep track of the number of branches within the current "granule" (or window). When this exceeds the max-branches value, the pass dilutes the current granule, inserting nops to push out some of the branches.
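
To make the window idea concrete, here is a minimal standalone sketch in C. It is not the code in aarch64-branch-dilution.c: it models the instruction stream as a string in which 'b' is a branch and anything else is a non-branch instruction, treats every instruction as one slot, ignores granule alignment, and always pads directly in front of the offending branch rather than choosing which branch to push out. The names `dilute`, `branches_in_window` and `DILUTE_FAIL` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical failure value for this sketch.  */
#define DILUTE_FAIL ((size_t) -1)

/* Count the branches ('b') in the last WIDTH slots of OUT[0..LEN).  */
static size_t
branches_in_window (const char *out, size_t len, size_t width)
{
  size_t start = len > width ? len - width : 0;
  size_t count = 0;
  for (size_t i = start; i < len; i++)
    if (out[i] == 'b')
      count++;
  return count;
}

/* Copy IN to OUT, inserting 'n' (nop) slots so that no run of GRANULE
   consecutive slots in OUT holds more than MAX_BRANCHES branches.
   A GRANULE of 0 disables dilution and copies IN unchanged.
   OUT_CAP is the capacity of OUT including the terminating NUL.
   Returns the number of nops inserted, or DILUTE_FAIL if OUT is too
   small or MAX_BRANCHES is 0 while dilution is enabled.  */
size_t
dilute (const char *in, char *out, size_t out_cap,
        size_t granule, size_t max_branches)
{
  assert (in != NULL && out != NULL);
  if (out_cap == 0 || (granule != 0 && max_branches == 0))
    return DILUTE_FAIL;

  size_t n = strlen (in), len = 0, nops = 0;
  for (size_t i = 0; i < n; i++)
    {
      if (granule != 0 && in[i] == 'b')
        /* The new branch would share a window with the previous
           GRANULE - 1 slots; pad until that window has room for it.  */
        while (branches_in_window (out, len, granule - 1) >= max_branches)
          {
            if (len + 1 >= out_cap)
              return DILUTE_FAIL;
            out[len++] = 'n';
            nops++;
          }
      if (len + 1 >= out_cap)
        return DILUTE_FAIL;
      out[len++] = in[i];
    }
  out[len] = '\0';
  return nops;
}
```

With a granule of 4 slots and a limit of 2 branches, the input "bbbxbb" becomes "bbnnbxbnb" (three nops inserted). Padding in front of the branch keeps the sketch short; a real pass would also have to decide where the nops are cheapest to place.
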
The heuristic will favour unconditional branches (for performance reasons), or branches that sit between two other branches (in order to decrease the likelihood of another dilution call being needed).

Each branch type requires a different method of nop insertion due to RTL/basic_block restrictions:

- Returning calls do not end a basic block, so they can be handled by emitting a generic nop.
- Unconditional branches must be the end of a basic block, and nops cannot be outside of a basic block. Hence the need for FILLER_INSN, which allows placement outside of a basic block and translates to a nop.
- For most conditional branches we have taken a simple approach and only handle the fallthru edge. We do this by inserting a "nop block" of nops on the fallthru edge and mapping that back to the original destination block.
- asm gotos and pcsets are going to be tricky to analyse from a dilution perspective, so they are ignored at present.

## Changelog

gcc/testsuite/ChangeLog:

2018-11-09  Carey Williams

* gcc.target/aarch64/branch-dilution-off.c: New test.
* gcc.target/aarch64/branch-dilution-on.c: New test.

gcc/ChangeLog:

2018-11-09  Carey Williams

* cfgbuild.c (inside_basic_block_p): Add FILLER_INSN case.
* cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
  basic blocks.
* config.gcc (extra_objs): Add aarch64-branch-dilution.o.
* config/aarch64/aarch64-branch-dilution.c: New file.
* config/aarch64/aarch64-passes.def (branch-dilution): Register
  pass.
* config/aarch64/aarch64-protos.h (struct tune_params): Declare
  tuning parameters bdilution_gsize and bdilution_maxb.
  (make_pass_branch_dilution): New declaration.
* config/aarch64/aarch64.c (generic_tunings, cortexa35_tunings,
  cortexa53_tunings, cortexa57_tunings, cortexa72_tunings,
  cortexa73_tunings, exynosm1_tunings, thunderxt88_tunings,
  thunderx_tunings, tsv110_tunings, xgene1_tunings,
  qdf24xx_tunings, saphira_tunings, thunderx2t99_tunings): Provide
  default tunings for bdilution_gsize and bdilution_maxb.
* config/aarch64/aarch64.md (filler_insn): Define new insn.
* config/aarch64/aarch64.opt (mbranch-dilution,
  mbranch-dilution-granularity, mbranch-dilution-max-branches):
  Define new branch dilution options.
* config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
  for aarch64-branch-dilution.c.
* coretypes.h (rtx_filler_insn): New rtx class.
* doc/invoke.texi (mbranch-dilution, mbranch-dilution-granularity,
  mbranch-dilution-max-branches): Document branch dilution options.
* emit-rtl.c (emit_filler_after): New emit function.
* rtl.def (FILLER_INSN): New RTL EXPR of type RTX_INSN.
* rtl.h (class GTY): New class for rtx_filler_insn.
  (is_a_helper ::test): New test helper for rtx_filler_insn.
  (macro FILLER_INSN_P(X)): New predicate.
* target-insns.def (filler_insn): Add target insn def.

## Testing

- Successful compilation of 3 stage bootstrap with the pass forced on (for stage 2, 3)
- No additional compilation failures (SPEC CPU 2006 and SPEC CPU 2017)
- No 'make check' regressions

Is this ok for trunk?

Thanks
Sudi