From: Carey Williams
To: Richard Earnshaw, Kyrill Tkachov, Richard Biener, Sudakshina Das
CC: GCC Patches, nd
Subject: Re: [PATCH, GCC, AArch64] Branch Dilution Pass
Date: Fri, 23 Nov 2018 17:57:00 -0000
References: <12813115-bfeb-ba2a-cd99-85aa1ff27921@arm.com> <5BE998A1.1030500@foss.arm.com>,<182bffbc-ea27-a7c5-d94e-719a7909ca98@arm.com>
In-Reply-To: <182bffbc-ea27-a7c5-d94e-719a7909ca98@arm.com>

Hi all,

Thank you for the comments.

For the same reasons as stated by Kyrill and Richard E., I do believe it has
to be the compiler that implements this. I will look to split the patch and
address the points made by Andrew.

________________________________
From: Richard Earnshaw (lists)
Sent: 12 November 2018 15:55:58
To: Kyrill Tkachov; Richard Biener; Sudakshina Das
Cc: GCC Patches; nd; Carey Williams
Subject: Re: [PATCH, GCC, AArch64] Branch Dilution Pass

On 12/11/2018 15:13, Kyrill Tkachov wrote:
> Hi Richard,
>
> On 12/11/18 14:13, Richard Biener wrote:
>> On Fri, Nov 9, 2018 at 6:23 PM Sudakshina Das wrote:
>> >
>> > Hi
>> >
>> > I am posting this patch on behalf of Carey (cc'ed). I also have some
>> > review comments that I will make as a reply to this later.
>> >
>> >
>> > This implements a new AArch64 specific back-end pass that helps optimize
>> > branch-dense code, which can be a bottleneck for performance on some Arm
>> > cores. This is achieved by padding out the branch-dense sections of the
>> > instruction stream with nops.
>>
>> Wouldn't this be more suitable for implementing inside the assembler?
>>
>
> The number of NOPs to insert to get the performance benefits varies from
> core to core, so I don't think we want to add such CPU-specific
> optimisation logic to the assembler.

Additionally, the compiler has to keep track of branch ranges. It can't
do this properly if the assembler is emitting more instructions than the
compiler thinks it is.

R.

>
> Thanks,
> Kyrill
>
>> > This has shown an improvement of up to ~2.61% on the Cortex-A72
>> > (SPEC CPU 2006: sjeng).
>> >
>> > The implementation includes the addition of a new RTX instruction class
>> > FILLER_INSN, which has been whitelisted to allow placement of NOPs
>> > outside of a basic block.
>> > This is to allow padding after unconditional
>> > branches. This is favorable so that any performance gained from
>> > diluting branches is not paid straight back by executing excess nops.
>> >
>> > It was deemed that a new RTX class was less invasive than modifying
>> > behavior with regard to standard UNSPEC nops.
>> >
>> > ## Command Line Options
>> >
>> > Three new target-specific options are provided:
>> > - mbranch-dilution
>> > - mbranch-dilution-granularity={num}
>> > - mbranch-dilution-max-branches={num}
>> >
>> > A number of cores known to be able to benefit from this pass have been
>> > given default tuning values for their granularity and max-branches.
>> > Each affected core has a very specific granule size and associated
>> > max-branch limit. This is a microarchitecture specific optimization.
>> > Typical usage should be -mbranch-dilution with a specified -mcpu. Cores
>> > with a granularity tuned to 0 will be ignored. Options are provided for
>> > experimentation.
>> >
>> > ## Algorithm and Heuristic
>> >
>> > The pass takes a very simple 'sliding window' approach to the problem.
>> > We crawl through each instruction (starting at the first branch) and
>> > keep track of the number of branches within the current "granule" (or
>> > window). When this exceeds the max-branch value, the pass will dilute
>> > the current granule, inserting nops to push out some of the branches.
>> > The heuristic will favour unconditional branches (for performance
>> > reasons), or branches that are between two other branches (in order to
>> > decrease the likelihood of another dilution call being needed).
>> >
>> > Each branch type required a different method for nop insertion due to
>> > RTL/basic_block restrictions:
>> >
>> > - Returning calls do not end a basic block so can be handled by emitting
>> >   a generic nop.
>> > - Unconditional branches must be the end of a basic block, and nops
>> >   cannot be outside of a basic block.
>> >   Thus the need for FILLER_INSN, which allows placement outside of a
>> >   basic block - and translates to a nop.
>> > - For most conditional branches we've taken a simple approach and only
>> >   handle the fallthru edge, which we do by inserting a "nop block" of
>> >   nops on the fallthru edge, mapping that back to the original
>> >   destination block.
>> > - asm gotos and pcsets are going to be tricky to analyse from a dilution
>> >   perspective so are ignored at present.
>> >
>> >
>> > ## Changelog
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> > 2018-11-09  Carey Williams
>> >
>> >     * gcc.target/aarch64/branch-dilution-off.c: New test.
>> >     * gcc.target/aarch64/branch-dilution-on.c: New test.
>> >
>> >
>> > gcc/ChangeLog:
>> >
>> > 2018-11-09  Carey Williams
>> >
>> >     * cfgbuild.c (inside_basic_block_p): Add FILLER_INSN case.
>> >     * cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
>> >     basic blocks.
>> >     * config.gcc (extra_objs): Add aarch64-branch-dilution.o.
>> >     * config/aarch64/aarch64-branch-dilution.c: New file.
>> >     * config/aarch64/aarch64-passes.def (branch-dilution): Register
>> >     pass.
>> >     * config/aarch64/aarch64-protos.h (struct tune_params): Declare
>> >     tuning parameters bdilution_gsize and bdilution_maxb.
>> >     (make_pass_branch_dilution): New declaration.
>> >     * config/aarch64/aarch64.c (generic_tunings,cortexa35_tunings,
>> >     cortexa53_tunings,cortexa57_tunings,cortexa72_tunings,
>> >     cortexa73_tunings,exynosm1_tunings,thunderxt88_tunings,
>> >     thunderx_tunings,tsv110_tunings,xgene1_tunings,
>> >     qdf24xx_tunings,saphira_tunings,thunderx2t99_tunings):
>> >     Provide default tunings for bdilution_gsize and bdilution_maxb.
>> >     * config/aarch64/aarch64.md (filler_insn): Define new insn.
>> >     * config/aarch64/aarch64.opt (mbranch-dilution,
>> >     mbranch-dilution-granularity,
>> >     mbranch-dilution-max-branches): Define new branch dilution
>> >     options.
>> >     * config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
>> >     for aarch64-branch-dilution.c.
>> >     * coretypes.h (rtx_filler_insn): New rtx class.
>> >     * doc/invoke.texi (mbranch-dilution,
>> >     mbranch-dilution-granularity,
>> >     mbranch-dilution-max-branches): Document branch dilution
>> >     options.
>> >     * emit-rtl.c (emit_filler_after): New emit function.
>> >     * rtl.def (FILLER_INSN): New RTL EXPR of type RTX_INSN.
>> >     * rtl.h (class GTY): New class for rtx_filler_insn.
>> >     (is_a_helper ::test): New test helper for rtx_filler_insn.
>> >     (macro FILLER_INSN_P(X)): New predicate.
>> >     * target-insns.def (filler_insn): Add target insn def.
>> >
>> > ### Testing
>> > - Successful compilation of 3 stage bootstrap with the pass forced on
>> >   (for stage 2, 3)
>> > - No additional compilation failures (SPEC CPU 2006 and SPEC CPU 2017)
>> > - No 'make check' regressions
>> >
>> > Is this ok for trunk?
>> >
>> > Thanks
>> > Sudi
>
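
For readers following the "Algorithm and Heuristic" section quoted above, here
is a rough, standalone sketch of the sliding-window idea. It is not the code
from the patch: the names (insn_kind, dilute_branches) are hypothetical, the
instruction stream is modelled as a flat array rather than RTL, and it always
pads immediately before the offending branch instead of applying the
heuristic that favours unconditional or sandwiched branches. It only assumes
that "granule" means any run of that many consecutive instruction slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an instruction stream (not GCC's RTL).  */
enum insn_kind { INSN_OTHER, INSN_BRANCH, INSN_NOP };

/* Copy IN (N_IN entries) to OUT, inserting INSN_NOP so that no run of
   GRANULE consecutive output slots holds more than MAX_BRANCHES branches.
   A GRANULE of 0 disables padding.  Returns the output length, or
   SIZE_MAX if OUT (capacity OUT_CAP) is too small.  */
size_t
dilute_branches (const enum insn_kind *in, size_t n_in,
                 enum insn_kind *out, size_t out_cap,
                 size_t granule, size_t max_branches)
{
  assert (max_branches > 0 && max_branches <= 16);

  /* Ring buffer of the output positions of the last MAX_BRANCHES branches.  */
  size_t recent[16];
  size_t n_branches = 0;
  size_t n_out = 0;

  for (size_t i = 0; i < n_in; i++)
    {
      if (in[i] == INSN_BRANCH)
        {
          if (n_branches >= max_branches)
            {
              /* Position of the branch MAX_BRANCHES back; this branch must
                 land at least GRANULE slots after it.  */
              size_t oldest = recent[n_branches % max_branches];
              while (n_out - oldest < granule)
                {
                  if (n_out == out_cap)
                    return SIZE_MAX;
                  out[n_out++] = INSN_NOP;
                }
            }
          recent[n_branches % max_branches] = n_out;
          n_branches++;
        }

      if (n_out == out_cap)
        return SIZE_MAX;
      out[n_out++] = in[i];
    }

  return n_out;
}
```

With a granule of 4 and a limit of 2, three back-to-back branches come out as
branch, branch, nop, nop, branch, so every 4-slot window holds at most two
branches. A granule of 0 leaves the stream untouched, matching the note that
cores tuned to 0 are ignored.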