From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 11451 invoked by alias); 9 Aug 2016 09:43:13 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 10582 invoked by uid 89); 9 Aug 2016 09:43:13 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=1.2 required=5.0 tests=BAYES_50,KAM_ASCII_DIVIDERS,RP_MATCHES_RCVD,SPF_PASS autolearn=no version=3.3.2 spammy=benchmarks, nonexistent, pinskia@gmail.com, pinskiagmailcom X-HELO: foss.arm.com Received: from foss.arm.com (HELO foss.arm.com) (217.140.101.70) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 09 Aug 2016 09:43:11 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id EB5662F; Tue, 9 Aug 2016 02:44:35 -0700 (PDT) Received: from e105689-lin.cambridge.arm.com (e105689-lin.cambridge.arm.com [10.2.207.32]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 926333F459; Tue, 9 Aug 2016 02:43:07 -0700 (PDT) Subject: Re: [PATCH/AARCH64] Improve ThunderX code generation slightly with load/store pair To: Andrew Pinski , GCC Patches References: From: "Richard Earnshaw (lists)" Message-ID: <6c55b703-5b5f-eba2-7234-4f846738e9a5@arm.com> Date: Tue, 09 Aug 2016 09:43:00 -0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-SW-Source: 2016-08/txt/msg00677.txt.bz2 On 08/08/16 18:48, Andrew Pinski wrote: > On Fri, Aug 5, 2016 at 12:18 AM, Andrew Pinski wrote: >> Hi, >> On ThunderX, load (and store) pair that does a pair of two word >> (32bits) load/stores is slower in some cases than doing two >> load/stores. For some internal benchmarks, it provides a 2-5% >> improvement. >> >> This patch disables the forming of the load/store pairs for SImode if >> we are tuning for ThunderX. I used the tuning flags route so it can >> be overridden if needed later on or if someone else wants to use the >> same method for their core. >> >> OK? Bootstrapped and tested on aarch64-linux-gnu with no regressions. > > > Here is a new version based on feedback both on the list and off. > I added a check for alignment to greater than 8 bytes as that is > alignment < 8 causes the slow down. > I also added two new testcases testing this to make sure it did the > load pair optimization when it is profitable. > > OK? Bootstrapped and tested on aarch64-linux-gnu with no regressions. > > Thanks, > Andrew Pinski > > ChangeLog: > * config/aarch64/aarch64-tuning-flags.def (slow_ldpw): New tuning option. > * config/aarch64/aarch64.c (thunderx_tunings): Enable > AARCH64_EXTRA_TUNE_SLOW_LDPW. > (aarch64_operands_ok_for_ldpstp): Return false if > AARCH64_EXTRA_TUNE_SLOW_LDPW and the mode was SImode > and the alignment is less than 8 byte. > (aarch64_operands_adjust_ok_for_ldpstp): Likewise. > > testsuite/ChangeLog: > * gcc.target/aarch64/thunderxloadpair.c: New testcase. > * gcc.target/aarch64/thunderxnoloadpair.c: New testcase. > >> >> Thanks, >> Andrew Pinski >> >> ChangeLog: >> * config/aarch64/aarch64-tuning-flags.def (slow_ldpw): New tuning option. >> * config/aarch64/aarch64.c (thunderx_tunings): Enable >> AARCH64_EXTRA_TUNE_SLOW_LDPW. >> (aarch64_operands_ok_for_ldpstp): Return false if >> AARCH64_EXTRA_TUNE_SLOW_LDPW and the mode was SImode. >> (aarch64_operands_adjust_ok_for_ldpstp): Likewise. >> OK with the changes noted below. R. >> stldpw.diff.txt >> >> >> Index: config/aarch64/aarch64-tuning-flags.def >> =================================================================== >> --- config/aarch64/aarch64-tuning-flags.def (revision 239228) >> +++ config/aarch64/aarch64-tuning-flags.def (working copy) >> @@ -29,3 +29,4 @@ >> AARCH64_TUNE_ to give an enum name. */ >> >> AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS) >> +AARCH64_EXTRA_TUNING_OPTION ("slow_ldpw", SLOW_LDPW) I think this should now be renamed SLOW_UNALIGNED_LDPW. I think it also should be commented as to what it does at this point. >> Index: config/aarch64/aarch64.c >> =================================================================== >> --- config/aarch64/aarch64.c (revision 239228) >> +++ config/aarch64/aarch64.c (working copy) >> @@ -712,7 +712,7 @@ >> 0, /* max_case_values. */ >> 0, /* cache_line_size. */ >> tune_params::AUTOPREFETCHER_OFF, /* autoprefetcher_model. */ >> - (AARCH64_EXTRA_TUNE_NONE) /* tune_flags. */ >> + (AARCH64_EXTRA_TUNE_SLOW_LDPW) /* tune_flags. */ >> }; >> >> static const struct tune_params xgene1_tunings = >> @@ -13593,6 +13593,15 @@ >> if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2)) >> return false; >> >> + /* If we have SImode and slow ldp, check the alignment to be greater >> + than 8 byte. */ Comment doesn't match code. Should be "at least 8 byte alignment". >> + if (mode == SImode >> + && (aarch64_tune_params.extra_tuning_flags >> + & AARCH64_EXTRA_TUNE_SLOW_LDPW) >> + && !optimize_size >> + && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT) >> + return false; >> + >> /* Check if the addresses are in the form of [base+offset]. */ >> extract_base_offset_in_addr (mem_1, &base_1, &offset_1); >> if (base_1 == NULL_RTX || offset_1 == NULL_RTX) >> @@ -13752,6 +13761,15 @@ >> return false; >> } >> >> + /* If we have SImode and slow ldp, check the alignment to be greater >> + than 8 byte. */ Likewise. >> + if (mode == SImode >> + && (aarch64_tune_params.extra_tuning_flags >> + & AARCH64_EXTRA_TUNE_SLOW_LDPW) >> + && !optimize_size >> + && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT) >> + return false; >> + >> if (REG_P (reg_1) && FP_REGNUM_P (REGNO (reg_1))) >> rclass_1 = FP_REGS; >> else >> Index: testsuite/gcc.target/aarch64/thunderxloadpair.c >> =================================================================== >> --- testsuite/gcc.target/aarch64/thunderxloadpair.c (nonexistent) >> +++ testsuite/gcc.target/aarch64/thunderxloadpair.c (working copy) >> @@ -0,0 +1,20 @@ >> +/* { dg-do compile } */ >> +/* { dg-options "-O2 -mcpu=thunderx" } */ >> + >> +struct ldp >> +{ >> + long long c; >> + int a, b; >> +}; >> + >> + >> +int f(struct ldp *a) >> +{ >> + return a->a + a->b; >> +} >> + >> + >> +/* We know the alignement of a->a to be 8 byte aligned so it is profitable >> + to do ldp. */ >> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 1 } } */ >> + >> Index: testsuite/gcc.target/aarch64/thunderxnoloadpair.c >> =================================================================== >> --- testsuite/gcc.target/aarch64/thunderxnoloadpair.c (nonexistent) >> +++ testsuite/gcc.target/aarch64/thunderxnoloadpair.c (working copy) >> @@ -0,0 +1,17 @@ >> +/* { dg-do compile } */ >> +/* { dg-options "-O2 -mcpu=thunderx" } */ >> + >> +struct noldp >> +{ >> + int a, b; >> +}; >> + >> + >> +int f(struct noldp *a) >> +{ >> + return a->a + a->b; >> +} >> + >> +/* We know the alignement of a->a to be 4 byte aligned so it is not profitable >> + to do ldp. */ >> +/* { dg-final { scan-assembler-time "ldp\tw\[0-9\]+, w\[0-9\]" 1 } } */