From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-433552-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 11451 invoked by alias); 9 Aug 2016 09:43:13 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 10582 invoked by uid 89); 9 Aug 2016 09:43:13 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=1.2 required=5.0 tests=BAYES_50,KAM_ASCII_DIVIDERS,RP_MATCHES_RCVD,SPF_PASS autolearn=no version=3.3.2 spammy=benchmarks, nonexistent, pinskia@gmail.com, pinskiagmailcom
X-HELO: foss.arm.com
Received: from foss.arm.com (HELO foss.arm.com) (217.140.101.70) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 09 Aug 2016 09:43:11 +0000
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249])	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id EB5662F;	Tue,  9 Aug 2016 02:44:35 -0700 (PDT)
Received: from e105689-lin.cambridge.arm.com (e105689-lin.cambridge.arm.com [10.2.207.32])	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 926333F459;	Tue,  9 Aug 2016 02:43:07 -0700 (PDT)
Subject: Re: [PATCH/AARCH64] Improve ThunderX code generation slightly with load/store pair
To: Andrew Pinski <pinskia@gmail.com>, GCC Patches <gcc-patches@gcc.gnu.org>
References: <CA+=Sn1mTJPVOke20gDmH=RrjEA-qvdg=b+CgSJReoi3rvzM25g@mail.gmail.com> <CA+=Sn1mn=G2L0L49-rTSMyOLrMDuJdmsGhFCzH-DyXz6Sh4JWg@mail.gmail.com>
From: "Richard Earnshaw (lists)" <Richard.Earnshaw@arm.com>
Message-ID: <6c55b703-5b5f-eba2-7234-4f846738e9a5@arm.com>
Date: Tue, 09 Aug 2016 09:43:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0
MIME-Version: 1.0
In-Reply-To: <CA+=Sn1mn=G2L0L49-rTSMyOLrMDuJdmsGhFCzH-DyXz6Sh4JWg@mail.gmail.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-SW-Source: 2016-08/txt/msg00677.txt.bz2

On 08/08/16 18:48, Andrew Pinski wrote:
> On Fri, Aug 5, 2016 at 12:18 AM, Andrew Pinski <pinskia@gmail.com> wrote:
>> Hi,
>>   On ThunderX, load (and store) pair that does a pair of two word
>> (32bits) load/stores is slower in some cases than doing two
>> load/stores.  For some internal benchmarks, it provides a 2-5%
>> improvement.
>>
>> This patch disables the forming of the load/store pairs for SImode if
>> we are tuning for ThunderX.  I used the tuning flags route so it can
>> be overridden if needed later on or if someone else wants to use the
>> same method for their core.
>>
>> OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.
> 
> 
> Here is a new version based on feedback both on the list and off.
> I added a check for alignment to greater than 8 bytes as that is
> alignment < 8 causes the slow down.
> I also added two new testcases testing this to make sure it did the
> load pair optimization when it is profitable.
> 
> OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.
> 
> Thanks,
> Andrew Pinski
> 
> ChangeLog:
> * config/aarch64/aarch64-tuning-flags.def (slow_ldpw): New tuning option.
> * config/aarch64/aarch64.c (thunderx_tunings): Enable
> AARCH64_EXTRA_TUNE_SLOW_LDPW.
> (aarch64_operands_ok_for_ldpstp): Return false if
> AARCH64_EXTRA_TUNE_SLOW_LDPW and the mode was SImode
> and the alignment is less than 8 byte.
> (aarch64_operands_adjust_ok_for_ldpstp): Likewise.
> 
> testsuite/ChangeLog:
> * gcc.target/aarch64/thunderxloadpair.c: New testcase.
> * gcc.target/aarch64/thunderxnoloadpair.c: New testcase.
> 
>>
>> Thanks,
>> Andrew Pinski
>>
>> ChangeLog:
>> * config/aarch64/aarch64-tuning-flags.def (slow_ldpw): New tuning option.
>> * config/aarch64/aarch64.c (thunderx_tunings): Enable
>> AARCH64_EXTRA_TUNE_SLOW_LDPW.
>> (aarch64_operands_ok_for_ldpstp): Return false if
>> AARCH64_EXTRA_TUNE_SLOW_LDPW and the mode was SImode.
>> (aarch64_operands_adjust_ok_for_ldpstp): Likewise.
>>

OK with the changes noted below.

R.

>> stldpw.diff.txt
>>
>>
>> Index: config/aarch64/aarch64-tuning-flags.def
>> ===================================================================
>> --- config/aarch64/aarch64-tuning-flags.def	(revision 239228)
>> +++ config/aarch64/aarch64-tuning-flags.def	(working copy)
>> @@ -29,3 +29,4 @@
>>       AARCH64_TUNE_ to give an enum name. */
>>  
>>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
>> +AARCH64_EXTRA_TUNING_OPTION ("slow_ldpw", SLOW_LDPW)

I think this should now be renamed SLOW_UNALIGNED_LDPW.  I think it also
should be commented as to what it does at this point.

>> Index: config/aarch64/aarch64.c
>> ===================================================================
>> --- config/aarch64/aarch64.c	(revision 239228)
>> +++ config/aarch64/aarch64.c	(working copy)
>> @@ -712,7 +712,7 @@
>>    0,	/* max_case_values.  */
>>    0,	/* cache_line_size.  */
>>    tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
>> -  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
>> +  (AARCH64_EXTRA_TUNE_SLOW_LDPW)	/* tune_flags.  */
>>  };
>>  
>>  static const struct tune_params xgene1_tunings =
>> @@ -13593,6 +13593,15 @@
>>    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
>>      return false;
>>  
>> +  /* If we have SImode and slow ldp, check the alignment to be greater
>> +     than 8 byte. */

Comment doesn't match code.  Should be "at least 8 byte alignment".

>> +  if (mode == SImode
>> +      && (aarch64_tune_params.extra_tuning_flags
>> +          & AARCH64_EXTRA_TUNE_SLOW_LDPW)
>> +      && !optimize_size
>> +      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
>> +    return false;
>> +
>>    /* Check if the addresses are in the form of [base+offset].  */
>>    extract_base_offset_in_addr (mem_1, &base_1, &offset_1);
>>    if (base_1 == NULL_RTX || offset_1 == NULL_RTX)
>> @@ -13752,6 +13761,15 @@
>>  	return false;
>>      }
>>  
>> +  /* If we have SImode and slow ldp, check the alignment to be greater
>> +     than 8 byte. */

Likewise.

>> +  if (mode == SImode
>> +      && (aarch64_tune_params.extra_tuning_flags
>> +          & AARCH64_EXTRA_TUNE_SLOW_LDPW)
>> +      && !optimize_size
>> +      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
>> +    return false;
>> +
>>    if (REG_P (reg_1) && FP_REGNUM_P (REGNO (reg_1)))
>>      rclass_1 = FP_REGS;
>>    else
>> Index: testsuite/gcc.target/aarch64/thunderxloadpair.c
>> ===================================================================
>> --- testsuite/gcc.target/aarch64/thunderxloadpair.c	(nonexistent)
>> +++ testsuite/gcc.target/aarch64/thunderxloadpair.c	(working copy)
>> @@ -0,0 +1,20 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -mcpu=thunderx" } */
>> +
>> +struct ldp
>> +{
>> +  long long c;
>> +  int a, b;
>> +};
>> +
>> +
>> +int f(struct ldp *a)
>> +{
>> +  return a->a + a->b;
>> +}
>> +
>> +
>> +/* We know the alignement of a->a to be 8 byte aligned so it is profitable
>> +   to do ldp. */
>> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 1 } } */
>> +
>> Index: testsuite/gcc.target/aarch64/thunderxnoloadpair.c
>> ===================================================================
>> --- testsuite/gcc.target/aarch64/thunderxnoloadpair.c	(nonexistent)
>> +++ testsuite/gcc.target/aarch64/thunderxnoloadpair.c	(working copy)
>> @@ -0,0 +1,17 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -mcpu=thunderx" } */
>> +
>> +struct noldp
>> +{
>> +  int a, b;
>> +};
>> +
>> +
>> +int f(struct noldp *a)
>> +{
>> +  return a->a + a->b;
>> +}
>> +
>> +/* We know the alignement of a->a to be 4 byte aligned so it is not profitable
>> +   to do ldp. */
>> +/* { dg-final { scan-assembler-time "ldp\tw\[0-9\]+, w\[0-9\]" 1 } } */