From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-490792-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 113797 invoked by alias); 23 Nov 2018 17:57:04 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 113344 invoked by uid 89); 23 Nov 2018 17:57:04 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=BAYES_00,HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_HELO_PASS,SPF_PASS autolearn=ham version=3.3.2 spammy=williams, Williams, deemed, straight
X-HELO: EUR01-VE1-obe.outbound.protection.outlook.com
Received: from mail-eopbgr140052.outbound.protection.outlook.com (HELO EUR01-VE1-obe.outbound.protection.outlook.com) (40.107.14.52) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 23 Nov 2018 17:57:00 +0000
Received: from VI1PR08MB2813.eurprd08.prod.outlook.com (10.170.236.150) by VI1PR08MB3774.eurprd08.prod.outlook.com (20.178.15.10) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.1361.16; Fri, 23 Nov 2018 17:56:56 +0000
Received: from VI1PR08MB2813.eurprd08.prod.outlook.com ([fe80::ec59:34fe:cea0:ec06]) by VI1PR08MB2813.eurprd08.prod.outlook.com ([fe80::ec59:34fe:cea0:ec06%3]) with mapi id 15.20.1294.048; Fri, 23 Nov 2018 17:56:56 +0000
From: Carey Williams <Carey.Williams@arm.com>
To: Richard Earnshaw <Richard.Earnshaw@arm.com>, Kyrill Tkachov	<kyrylo.tkachov@foss.arm.com>, Richard Biener <richard.guenther@gmail.com>,	Sudakshina Das <Sudi.Das@arm.com>
CC: GCC Patches <gcc-patches@gcc.gnu.org>, nd <nd@arm.com>
Subject: Re: [PATCH, GCC, AArch64] Branch Dilution Pass
Date: Fri, 23 Nov 2018 17:57:00 -0000
Message-ID: <VI1PR08MB2813CD0C91020D4AD3C3A45681D40@VI1PR08MB2813.eurprd08.prod.outlook.com>
References: <12813115-bfeb-ba2a-cd99-85aa1ff27921@arm.com> <CAFiYyc1nDu5hqTEU70YD7rHRzRuFa8g6q1APa22WsUe3nOik-A@mail.gmail.com> <5BE998A1.1030500@foss.arm.com>,<182bffbc-ea27-a7c5-d94e-719a7909ca98@arm.com>
In-Reply-To: <182bffbc-ea27-a7c5-d94e-719a7909ca98@arm.com>
authentication-results: spf=none (sender IP is ) smtp.mailfrom=Carey.Williams@arm.com;
received-spf: None (protection.outlook.com: arm.com does not designate permitted sender hosts)
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2018-11/txt/msg01994.txt.bz2

Hi all,

Thank you for the comments. For the same reasons as stated by Kyrill and
Richard E., I do believe it has to be the compiler that implements this. I
will look to split the patch and address the points made by Andrew.

________________________________
From: Richard Earnshaw (lists) <Richard.Earnshaw@arm.com>
Sent: 12 November 2018 15:55:58
To: Kyrill Tkachov; Richard Biener; Sudakshina Das
Cc: GCC Patches; nd; Carey Williams
Subject: Re: [PATCH, GCC, AArch64] Branch Dilution Pass

On 12/11/2018 15:13, Kyrill Tkachov wrote:
> Hi Richard,
>
> On 12/11/18 14:13, Richard Biener wrote:
>> On Fri, Nov 9, 2018 at 6:23 PM Sudakshina Das <Sudi.Das@arm.com> wrote:
>> >
>> > Hi
>> >
>> > I am posting this patch on behalf of Carey (cc'ed). I also have some
>> > review comments that I will make as a reply to this later.
>> >
>> >
>> > This implements a new AArch64-specific back-end pass that helps optimize
>> > branch-dense code, which can be a bottleneck for performance on some Arm
>> > cores. This is achieved by padding out the branch-dense sections of the
>> > instruction stream with nops.
>>
>> Wouldn't this be more suitable for implementing inside the assembler?
>>
>
> The number of NOPs to insert to get the performance benefits varies from
> core to core,
> I don't think we want to add such CPU-specific optimisation logic to the
> assembler.

Additionally, the compiler has to keep track of branch ranges.  It can't
do this properly if the assembler is emitting more instructions than the
compiler thinks it is.

R.
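
To illustrate the branch-range point, here is a small sketch in Python. It
relies on one ISA fact that is not stated in this thread: AArch64 TBZ/TBNZ
encode a signed 14-bit word offset, so they reach -32768 to +32764 bytes.
The distance and nop count are made-up numbers, and tbz_in_range is a
hypothetical helper, not a GCC function.

```python
# Reach of an AArch64 TBZ/TBNZ: signed 14-bit immediate, scaled by 4 bytes.
TBZ_MIN_BYTES = -(1 << 13) * 4      # -32768
TBZ_MAX_BYTES = ((1 << 13) - 1) * 4  # +32764
INSN_BYTES = 4                       # every AArch64 instruction is 4 bytes


def tbz_in_range(offset_bytes):
    """Return True if a TBZ/TBNZ can encode this byte offset."""
    aligned = offset_bytes % INSN_BYTES == 0
    return aligned and TBZ_MIN_BYTES <= offset_bytes <= TBZ_MAX_BYTES


# Hypothetical case: the compiler computed a distance that just fits ...
compiler_view = 32760
# ... but the assembler then padded 4 nops between branch and target.
hidden_nops = 4
actual = compiler_view + hidden_nops * INSN_BYTES
```

With these numbers the compiler's view fits the encoding while the actual
distance does not, which is the mismatch that inserting the nops in the
compiler avoids.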

>
> Thanks,
> Kyrill
>
>> > This has been shown to give an improvement of up to ~2.61% on the
>> > Cortex-A72 (SPEC CPU 2006: sjeng).
>> >
>> > The implementation includes the addition of a new RTX instruction class
>> > FILLER_INSN, which has been whitelisted to allow placement of NOPs
>> > outside of a basic block. This is to allow padding after unconditional
>> > branches. This is favorable so that any performance gained from
>> > diluting branches is not paid straight back by executing the inserted
>> > nops.
>> >
>> > It was deemed that a new RTX class was less invasive than modifying
>> > behavior with regard to standard UNSPEC nops.
>> >
>> > ## Command Line Options
>> >
>> > Three new target-specific options are provided:
>> > - mbranch-dilution
>> > - mbranch-dilution-granularity={num}
>> > - mbranch-dilution-max-branches={num}
>> >
>> > A number of cores known to be able to benefit from this pass have been
>> > given default tuning values for their granularity and max-branches.
>> > Each affected core has a very specific granule size and associated
>> > max-branch limit. This is a microarchitecture-specific optimization.
>> > Typical usage should be -mbranch-dilution with a specified -mcpu. Cores
>> > with a granularity tuned to 0 will be ignored. Options are provided for
>> > experimentation.
>> >
>> > ## Algorithm and Heuristic
>> >
>> > The pass takes a very simple 'sliding window' approach to the problem.
>> > We crawl through each instruction (starting at the first branch) and
>> > keep track of the number of branches within the current "granule" (or
>> > window). When this exceeds the max-branch value, the pass will dilute
>> > the current granule, inserting nops to push out some of the branches.
>> > The heuristic will favour unconditional branches (for performance
>> > reasons), or branches that are between two other branches (in order to
>> > decrease the likelihood of another dilution call being needed).
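
To make the sliding-window description above concrete, here is a minimal
sketch in Python. It is not the code from the patch. It assumes the granule
is measured in instructions rather than bytes, models instructions as plain
strings, and simply pads in front of the branch that would overflow the
window instead of applying the unconditional/middle-branch heuristic. The
names dilute, granule and max_branches are hypothetical.

```python
def dilute(insns, granule, max_branches):
    """Return a copy of insns with "nop" entries inserted so that no run of
    `granule` consecutive instructions holds more than `max_branches`
    branches. Branches are the string "br"; anything else is a non-branch.
    """
    if granule == 0:
        # Mirrors the stated rule that a granularity of 0 disables the pass.
        return list(insns)
    if max_branches < 1:
        raise ValueError("max_branches must be at least 1")

    out = []
    for insn in insns:
        if insn == "br":
            # The window ending at this branch is the previous granule - 1
            # instructions plus the branch itself. Pad until the branch fits.
            while True:
                tail = out[max(0, len(out) - (granule - 1)):]
                if tail.count("br") < max_branches:
                    break
                out.append("nop")
        out.append(insn)
    return out
```

For example, dilute(["br", "br", "br", "add", "br"], granule=4,
max_branches=2) pushes the third branch out of the first window by placing
two nops in front of it. Padding before the offending branch is the simplest
choice for a sketch; the pass described above instead chooses which branch
to push out.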
>> >
>> > Each branch type required a different method for nop insertion due to
>> > RTL/basic_block restrictions:
>> >
>> > - Returning calls do not end a basic block so can be handled by emitting
>> >   a generic nop.
>> > - Unconditional branches must be the end of a basic block, and nops
>> >   cannot be outside of a basic block. Thus the need for FILLER_INSN,
>> >   which allows placement outside of a basic block and translates to a
>> >   nop.
>> > - For most conditional branches we've taken a simple approach and only
>> >   handle the fallthru edge, which we do by inserting a "nop block" of
>> >   nops on the fallthru edge, mapping that back to the original
>> >   destination block.
>> > - asm gotos and pcsets are going to be tricky to analyse from a dilution
>> >   perspective so are ignored at present.
>> >
>> >
>> > ## Changelog
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> > 2018-11-09  Carey Williams <Carey.Williams@arm.com>
>> >
>> >         * gcc.target/aarch64/branch-dilution-off.c: New test.
>> >         * gcc.target/aarch64/branch-dilution-on.c: New test.
>> >
>> >
>> > gcc/ChangeLog:
>> >
>> > 2018-11-09  Carey Williams <Carey.Williams@arm.com>
>> >
>> >         * cfgbuild.c (inside_basic_block_p): Add FILLER_INSN case.
>> >         * cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
>> >         basic blocks.
>> >         * config.gcc (extra_objs): Add aarch64-branch-dilution.o.
>> >         * config/aarch64/aarch64-branch-dilution.c: New file.
>> >         * config/aarch64/aarch64-passes.def (branch-dilution): Register
>> >         pass.
>> >         * config/aarch64/aarch64-protos.h (struct tune_params): Declare
>> >         tuning parameters bdilution_gsize and bdilution_maxb.
>> >         (make_pass_branch_dilution): New declaration.
>> >         * config/aarch64/aarch64.c (generic_tunings, cortexa35_tunings,
>> >         cortexa53_tunings, cortexa57_tunings, cortexa72_tunings,
>> >         cortexa73_tunings, exynosm1_tunings, thunderxt88_tunings,
>> >         thunderx_tunings, tsv110_tunings, xgene1_tunings,
>> >         qdf24xx_tunings, saphira_tunings, thunderx2t99_tunings):
>> >         Provide default tunings for bdilution_gsize and bdilution_maxb.
>> >         * config/aarch64/aarch64.md (filler_insn): Define new insn.
>> >         * config/aarch64/aarch64.opt (mbranch-dilution,
>> >         mbranch-dilution-granularity,
>> >         mbranch-dilution-max-branches): Define new branch dilution
>> >         options.
>> >         * config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
>> >         for aarch64-branch-dilution.c.
>> >         * coretypes.h (rtx_filler_insn): New rtx class.
>> >         * doc/invoke.texi (mbranch-dilution,
>> >         mbranch-dilution-granularity,
>> >         mbranch-dilution-max-branches): Document branch dilution
>> >         options.
>> >         * emit-rtl.c (emit_filler_after): New emit function.
>> >         * rtl.def (FILLER_INSN): New RTL EXPR of type RTX_INSN.
>> >         * rtl.h (class GTY): New class for rtx_filler_insn.
>> >         (is_a_helper ::test): New test helper for rtx_filler_insn.
>> >         (macro FILLER_INSN_P(X)): New predicate.
>> >         * target-insns.def (filler_insn): Add target insn def.
>> >
>> > ## Testing
>> > - Successful compilation of a 3-stage bootstrap with the pass forced on
>> > (for stages 2 and 3)
>> > - No additional compilation failures (SPEC CPU 2006 and SPEC CPU 2017)
>> > - No 'make check' regressions
>> >
>> > Is this ok for trunk?
>> >
>> > Thanks
>> > Sudi
>