public inbox for gcc-patches@gcc.gnu.org
* [PATCH/AARCH64] Improve ThunderX code generation slightly with load/store pair
@ 2016-08-05  7:18 Andrew Pinski
  2016-08-05 20:47 ` Jim Wilson
  2016-08-08 17:48 ` Andrew Pinski
  0 siblings, 2 replies; 5+ messages in thread
From: Andrew Pinski @ 2016-08-05  7:18 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 863 bytes --]

Hi,
  On ThunderX, a load (or store) pair of two word (32-bit) values is
in some cases slower than two separate load/stores.  For some
internal benchmarks, avoiding such pairs provides a 2-5%
improvement.

This patch disables the forming of the load/store pairs for SImode if
we are tuning for ThunderX.  I used the tuning flags route so it can
be overridden if needed later on or if someone else wants to use the
same method for their core.
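
To illustrate the kind of code affected (a hand-written sketch; the
assembly below is illustrative rather than output captured from an
actual build):

  /* Two adjacent 32-bit loads that the pair-forming peephole would
     normally fuse.  */
  int
  sum (int *p)
  {
    return p[0] + p[1];
  }

  /* With pair formation:
         ldp  w1, w0, [x0]
         add  w0, w1, w0
         ret
     Without pair formation (what this patch produces when tuning
     for ThunderX):
         ldr  w1, [x0]
         ldr  w0, [x0, 4]
         add  w0, w1, w0
         ret  */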

OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.

Thanks,
Andrew Pinski

ChangeLog:
* config/aarch64/aarch64-tuning-flags.def (slow_ldpw): New tuning option.
* config/aarch64/aarch64.c (thunderx_tunings): Enable
AARCH64_EXTRA_TUNE_SLOW_LDPW.
(aarch64_operands_ok_for_ldpstp): Return false if
AARCH64_EXTRA_TUNE_SLOW_LDPW and the mode was SImode.
(aarch64_operands_adjust_ok_for_ldpstp): Likewise.

[-- Attachment #2: stldpw.diff.txt --]
[-- Type: text/plain, Size: 1555 bytes --]

Index: gcc/config/aarch64/aarch64-tuning-flags.def
===================================================================
--- gcc/config/aarch64/aarch64-tuning-flags.def	(revision 239150)
+++ gcc/config/aarch64/aarch64-tuning-flags.def	(working copy)
@@ -29,3 +29,4 @@
      AARCH64_TUNE_ to give an enum name. */
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
+AARCH64_EXTRA_TUNING_OPTION ("slow_ldpw", SLOW_LDPW)
Index: gcc/config/aarch64/aarch64.c
===================================================================
--- gcc/config/aarch64/aarch64.c	(revision 239150)
+++ gcc/config/aarch64/aarch64.c	(working copy)
@@ -712,7 +712,7 @@
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_SLOW_LDPW)	/* tune_flags.  */
 };
 
 static const struct tune_params xgene1_tunings =
@@ -13574,6 +13574,11 @@
   enum reg_class rclass_1, rclass_2;
   rtx mem_1, mem_2, reg_1, reg_2, base_1, base_2, offset_1, offset_2;
 
+  if (mode == SImode
+      && AARCH64_EXTRA_TUNE_SLOW_LDPW
+      && !optimize_size)
+    return false;
+
   if (load)
     {
       mem_1 = operands[1];
@@ -13673,6 +13678,11 @@
   rtx mem_1, mem_2, mem_3, mem_4, reg_1, reg_2, reg_3, reg_4;
   rtx base_1, base_2, base_3, base_4, offset_1, offset_2, offset_3, offset_4;
 
+  if (mode == SImode
+      && AARCH64_EXTRA_TUNE_SLOW_LDPW
+      && !optimize_size)
+    return false;
+
   if (load)
     {
       reg_1 = operands[0];


* Re: [PATCH/AARCH64] Improve ThunderX code generation slightly with load/store pair
  2016-08-05  7:18 [PATCH/AARCH64] Improve ThunderX code generation slightly with load/store pair Andrew Pinski
@ 2016-08-05 20:47 ` Jim Wilson
  2016-08-08 17:48 ` Andrew Pinski
  1 sibling, 0 replies; 5+ messages in thread
From: Jim Wilson @ 2016-08-05 20:47 UTC (permalink / raw)
  To: Andrew Pinski, GCC Patches

On 08/05/2016 12:18 AM, Andrew Pinski wrote:
> This patch disables the forming of the load/store pairs for SImode if
> we are tuning for ThunderX.  I used the tuning flags route so it can
> be overridden if needed later on or if someone else wants to use the
> same method for their core.

+  if (mode == SImode
+      && AARCH64_EXTRA_TUNE_SLOW_LDPW
+      && !optimize_size)
+    return false;

AARCH64_EXTRA_TUNE_SLOW_LDPW is a non-zero bit-mask.  That will always 
be true.  This is present in two places in the patch.  You need 
something more like

    && (aarch64_tune_params.extra_tuning_flags
        & AARCH64_EXTRA_TUNE_SLOW_LDPW)

You should verify that the patch disables the optimization for ThunderX 
but does not disable it for other targets.
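
As a self-contained illustration of why the bare constant always
passes (a simplified sketch with stand-in values, not the actual GCC
definitions):

   #include <stdio.h>

   /* Stand-in for the real definition; the actual value differs, but
      the point is only that the macro expands to a non-zero bit.  */
   #define AARCH64_EXTRA_TUNE_SLOW_LDPW (1u << 1)

   struct tune_params { unsigned int extra_tuning_flags; };

   /* Pretend we are tuning for a core that does NOT set the flag.  */
   static const struct tune_params aarch64_tune_params = { 0 };

   int
   main (void)
   {
     /* Wrong: a non-zero constant is always true, so every core
        takes this path.  */
     if (AARCH64_EXTRA_TUNE_SLOW_LDPW)
       printf ("bare constant: always taken\n");

     /* Right: test the bit against the flags of the core being
        tuned for.  */
     if (aarch64_tune_params.extra_tuning_flags
         & AARCH64_EXTRA_TUNE_SLOW_LDPW)
       printf ("masked test: taken only when the flag is set\n");
     else
       printf ("masked test: not taken for this core\n");

     return 0;
   }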

Jim


* Re: [PATCH/AARCH64] Improve ThunderX code generation slightly with load/store pair
  2016-08-05  7:18 [PATCH/AARCH64] Improve ThunderX code generation slightly with load/store pair Andrew Pinski
  2016-08-05 20:47 ` Jim Wilson
@ 2016-08-08 17:48 ` Andrew Pinski
  2016-08-09  9:43   ` Richard Earnshaw (lists)
  1 sibling, 1 reply; 5+ messages in thread
From: Andrew Pinski @ 2016-08-08 17:48 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1855 bytes --]

On Fri, Aug 5, 2016 at 12:18 AM, Andrew Pinski <pinskia@gmail.com> wrote:
> Hi,
>   On ThunderX, a load (or store) pair of two word (32-bit) values is
> in some cases slower than two separate load/stores.  For some
> internal benchmarks, avoiding such pairs provides a 2-5%
> improvement.
>
> This patch disables the forming of the load/store pairs for SImode if
> we are tuning for ThunderX.  I used the tuning flags route so it can
> be overridden if needed later on or if someone else wants to use the
> same method for their core.
>
> OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.


Here is a new version based on feedback both on the list and off.
I added a check that the alignment is at least 8 bytes, since it is
alignment < 8 that causes the slowdown.
I also added two new testcases to make sure the load pair
optimization is still done when it is profitable.
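
The two layouts the testcases use look like this (the reasoning
assumes the standard AArch64 LP64 ABI):

  /* alignof (struct ldp) is 8 because of the long long member, and
     'a' sits at offset 8, so a->a and a->b are known to be 8-byte
     aligned and the ldp is still formed.  */
  struct ldp   { long long c; int a, b; };

  /* alignof (struct noldp) is only 4, so nothing better than 4-byte
     alignment is known for a->a and a->b, and the pair is not formed
     when tuning for ThunderX.  */
  struct noldp { int a, b; };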

OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.

Thanks,
Andrew Pinski

ChangeLog:
* config/aarch64/aarch64-tuning-flags.def (slow_ldpw): New tuning option.
* config/aarch64/aarch64.c (thunderx_tunings): Enable
AARCH64_EXTRA_TUNE_SLOW_LDPW.
(aarch64_operands_ok_for_ldpstp): Return false if
AARCH64_EXTRA_TUNE_SLOW_LDPW and the mode was SImode
and the alignment is less than 8 byte.
(aarch64_operands_adjust_ok_for_ldpstp): Likewise.

testsuite/ChangeLog:
* gcc.target/aarch64/thunderxloadpair.c: New testcase.
* gcc.target/aarch64/thunderxnoloadpair.c: New testcase.

>
> Thanks,
> Andrew Pinski
>
> ChangeLog:
> * config/aarch64/aarch64-tuning-flags.def (slow_ldpw): New tuning option.
> * config/aarch64/aarch64.c (thunderx_tunings): Enable
> AARCH64_EXTRA_TUNE_SLOW_LDPW.
> (aarch64_operands_ok_for_ldpstp): Return false if
> AARCH64_EXTRA_TUNE_SLOW_LDPW and the mode was SImode.
> (aarch64_operands_adjust_ok_for_ldpstp): Likewise.

[-- Attachment #2: stldpw.diff.txt --]
[-- Type: text/plain, Size: 3172 bytes --]

Index: config/aarch64/aarch64-tuning-flags.def
===================================================================
--- config/aarch64/aarch64-tuning-flags.def	(revision 239228)
+++ config/aarch64/aarch64-tuning-flags.def	(working copy)
@@ -29,3 +29,4 @@
      AARCH64_TUNE_ to give an enum name. */
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
+AARCH64_EXTRA_TUNING_OPTION ("slow_ldpw", SLOW_LDPW)
Index: config/aarch64/aarch64.c
===================================================================
--- config/aarch64/aarch64.c	(revision 239228)
+++ config/aarch64/aarch64.c	(working copy)
@@ -712,7 +712,7 @@
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_SLOW_LDPW)	/* tune_flags.  */
 };
 
 static const struct tune_params xgene1_tunings =
@@ -13593,6 +13593,15 @@
   if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
     return false;
 
+  /* If we have SImode and slow ldp, check the alignment to be greater
+     than 8 byte. */
+  if (mode == SImode
+      && (aarch64_tune_params.extra_tuning_flags
+          & AARCH64_EXTRA_TUNE_SLOW_LDPW)
+      && !optimize_size
+      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
+    return false;
+
   /* Check if the addresses are in the form of [base+offset].  */
   extract_base_offset_in_addr (mem_1, &base_1, &offset_1);
   if (base_1 == NULL_RTX || offset_1 == NULL_RTX)
@@ -13752,6 +13761,15 @@
 	return false;
     }
 
+  /* If we have SImode and slow ldp, check the alignment to be greater
+     than 8 byte. */
+  if (mode == SImode
+      && (aarch64_tune_params.extra_tuning_flags
+          & AARCH64_EXTRA_TUNE_SLOW_LDPW)
+      && !optimize_size
+      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
+    return false;
+
   if (REG_P (reg_1) && FP_REGNUM_P (REGNO (reg_1)))
     rclass_1 = FP_REGS;
   else
Index: testsuite/gcc.target/aarch64/thunderxloadpair.c
===================================================================
--- testsuite/gcc.target/aarch64/thunderxloadpair.c	(nonexistent)
+++ testsuite/gcc.target/aarch64/thunderxloadpair.c	(working copy)
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=thunderx" } */
+
+struct ldp
+{
+  long long c;
+  int a, b;
+};
+
+
+int f(struct ldp *a)
+{
+  return a->a + a->b;
+}
+
+
+/* We know the alignement of a->a to be 8 byte aligned so it is profitable
+   to do ldp. */
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 1 } } */
+
Index: testsuite/gcc.target/aarch64/thunderxnoloadpair.c
===================================================================
--- testsuite/gcc.target/aarch64/thunderxnoloadpair.c	(nonexistent)
+++ testsuite/gcc.target/aarch64/thunderxnoloadpair.c	(working copy)
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=thunderx" } */
+
+struct noldp
+{
+  int a, b;
+};
+
+
+int f(struct noldp *a)
+{
+  return a->a + a->b;
+}
+
+/* We know the alignement of a->a to be 4 byte aligned so it is not profitable
+   to do ldp. */
+/* { dg-final { scan-assembler-time "ldp\tw\[0-9\]+, w\[0-9\]" 1 } } */


* Re: [PATCH/AARCH64] Improve ThunderX code generation slightly with load/store pair
  2016-08-08 17:48 ` Andrew Pinski
@ 2016-08-09  9:43   ` Richard Earnshaw (lists)
  2016-09-12 21:29     ` Andrew Pinski
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Earnshaw (lists) @ 2016-08-09  9:43 UTC (permalink / raw)
  To: Andrew Pinski, GCC Patches

On 08/08/16 18:48, Andrew Pinski wrote:
> On Fri, Aug 5, 2016 at 12:18 AM, Andrew Pinski <pinskia@gmail.com> wrote:
>> Hi,
>>   On ThunderX, a load (or store) pair of two word (32-bit) values is
>> in some cases slower than two separate load/stores.  For some
>> internal benchmarks, avoiding such pairs provides a 2-5%
>> improvement.
>>
>> This patch disables the forming of the load/store pairs for SImode if
>> we are tuning for ThunderX.  I used the tuning flags route so it can
>> be overridden if needed later on or if someone else wants to use the
>> same method for their core.
>>
>> OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.
> 
> 
> Here is a new version based on feedback both on the list and off.
> I added a check that the alignment is at least 8 bytes, since it is
> alignment < 8 that causes the slowdown.
> I also added two new testcases to make sure the load pair
> optimization is still done when it is profitable.
> 
> OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.
> 
> Thanks,
> Andrew Pinski
> 
> ChangeLog:
> * config/aarch64/aarch64-tuning-flags.def (slow_ldpw): New tuning option.
> * config/aarch64/aarch64.c (thunderx_tunings): Enable
> AARCH64_EXTRA_TUNE_SLOW_LDPW.
> (aarch64_operands_ok_for_ldpstp): Return false if
> AARCH64_EXTRA_TUNE_SLOW_LDPW and the mode was SImode
> and the alignment is less than 8 byte.
> (aarch64_operands_adjust_ok_for_ldpstp): Likewise.
> 
> testsuite/ChangeLog:
> * gcc.target/aarch64/thunderxloadpair.c: New testcase.
> * gcc.target/aarch64/thunderxnoloadpair.c: New testcase.
> 
>>
>> Thanks,
>> Andrew Pinski
>>
>> ChangeLog:
>> * config/aarch64/aarch64-tuning-flags.def (slow_ldpw): New tuning option.
>> * config/aarch64/aarch64.c (thunderx_tunings): Enable
>> AARCH64_EXTRA_TUNE_SLOW_LDPW.
>> (aarch64_operands_ok_for_ldpstp): Return false if
>> AARCH64_EXTRA_TUNE_SLOW_LDPW and the mode was SImode.
>> (aarch64_operands_adjust_ok_for_ldpstp): Likewise.
>>

OK with the changes noted below.

R.

>> stldpw.diff.txt
>>
>>
>> Index: config/aarch64/aarch64-tuning-flags.def
>> ===================================================================
>> --- config/aarch64/aarch64-tuning-flags.def	(revision 239228)
>> +++ config/aarch64/aarch64-tuning-flags.def	(working copy)
>> @@ -29,3 +29,4 @@
>>       AARCH64_TUNE_ to give an enum name. */
>>  
>>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
>> +AARCH64_EXTRA_TUNING_OPTION ("slow_ldpw", SLOW_LDPW)

I think this should now be renamed SLOW_UNALIGNED_LDPW.  I think it also
should be commented as to what it does at this point.
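
Something along these lines, perhaps (just a sketch of the sort of
comment I have in mind):

  /* Don't form a load/store pair of two 32-bit accesses unless the
     pair is known to be at least 8-byte aligned.  */
  AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)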

>> Index: config/aarch64/aarch64.c
>> ===================================================================
>> --- config/aarch64/aarch64.c	(revision 239228)
>> +++ config/aarch64/aarch64.c	(working copy)
>> @@ -712,7 +712,7 @@
>>    0,	/* max_case_values.  */
>>    0,	/* cache_line_size.  */
>>    tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
>> -  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
>> +  (AARCH64_EXTRA_TUNE_SLOW_LDPW)	/* tune_flags.  */
>>  };
>>  
>>  static const struct tune_params xgene1_tunings =
>> @@ -13593,6 +13593,15 @@
>>    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
>>      return false;
>>  
>> +  /* If we have SImode and slow ldp, check the alignment to be greater
>> +     than 8 byte. */

Comment doesn't match code.  Should be "at least 8 byte alignment".

>> +  if (mode == SImode
>> +      && (aarch64_tune_params.extra_tuning_flags
>> +          & AARCH64_EXTRA_TUNE_SLOW_LDPW)
>> +      && !optimize_size
>> +      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
>> +    return false;
>> +
>>    /* Check if the addresses are in the form of [base+offset].  */
>>    extract_base_offset_in_addr (mem_1, &base_1, &offset_1);
>>    if (base_1 == NULL_RTX || offset_1 == NULL_RTX)
>> @@ -13752,6 +13761,15 @@
>>  	return false;
>>      }
>>  
>> +  /* If we have SImode and slow ldp, check the alignment to be greater
>> +     than 8 byte. */

Likewise.

>> +  if (mode == SImode
>> +      && (aarch64_tune_params.extra_tuning_flags
>> +          & AARCH64_EXTRA_TUNE_SLOW_LDPW)
>> +      && !optimize_size
>> +      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
>> +    return false;
>> +
>>    if (REG_P (reg_1) && FP_REGNUM_P (REGNO (reg_1)))
>>      rclass_1 = FP_REGS;
>>    else
>> Index: testsuite/gcc.target/aarch64/thunderxloadpair.c
>> ===================================================================
>> --- testsuite/gcc.target/aarch64/thunderxloadpair.c	(nonexistent)
>> +++ testsuite/gcc.target/aarch64/thunderxloadpair.c	(working copy)
>> @@ -0,0 +1,20 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -mcpu=thunderx" } */
>> +
>> +struct ldp
>> +{
>> +  long long c;
>> +  int a, b;
>> +};
>> +
>> +
>> +int f(struct ldp *a)
>> +{
>> +  return a->a + a->b;
>> +}
>> +
>> +
>> +/* We know the alignement of a->a to be 8 byte aligned so it is profitable
>> +   to do ldp. */
>> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 1 } } */
>> +
>> Index: testsuite/gcc.target/aarch64/thunderxnoloadpair.c
>> ===================================================================
>> --- testsuite/gcc.target/aarch64/thunderxnoloadpair.c	(nonexistent)
>> +++ testsuite/gcc.target/aarch64/thunderxnoloadpair.c	(working copy)
>> @@ -0,0 +1,17 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -mcpu=thunderx" } */
>> +
>> +struct noldp
>> +{
>> +  int a, b;
>> +};
>> +
>> +
>> +int f(struct noldp *a)
>> +{
>> +  return a->a + a->b;
>> +}
>> +
>> +/* We know the alignement of a->a to be 4 byte aligned so it is not profitable
>> +   to do ldp. */
>> +/* { dg-final { scan-assembler-time "ldp\tw\[0-9\]+, w\[0-9\]" 1 } } */


* Re: [PATCH/AARCH64] Improve ThunderX code generation slightly with load/store pair
  2016-08-09  9:43   ` Richard Earnshaw (lists)
@ 2016-09-12 21:29     ` Andrew Pinski
  0 siblings, 0 replies; 5+ messages in thread
From: Andrew Pinski @ 2016-09-12 21:29 UTC (permalink / raw)
  To: Richard Earnshaw (lists); +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 6760 bytes --]

On Tue, Aug 9, 2016 at 10:43 AM, Richard Earnshaw (lists)
<Richard.Earnshaw@arm.com> wrote:
> On 08/08/16 18:48, Andrew Pinski wrote:
>> On Fri, Aug 5, 2016 at 12:18 AM, Andrew Pinski <pinskia@gmail.com> wrote:
>>> Hi,
>>>   On ThunderX, a load (or store) pair of two word (32-bit) values is
>>> in some cases slower than two separate load/stores.  For some
>>> internal benchmarks, avoiding such pairs provides a 2-5%
>>> improvement.
>>>
>>> This patch disables the forming of the load/store pairs for SImode if
>>> we are tuning for ThunderX.  I used the tuning flags route so it can
>>> be overridden if needed later on or if someone else wants to use the
>>> same method for their core.
>>>
>>> OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.
>>
>>
>> Here is a new version based on feedback both on the list and off.
>> I added a check that the alignment is at least 8 bytes, since it is
>> alignment < 8 that causes the slowdown.
>> I also added two new testcases to make sure the load pair
>> optimization is still done when it is profitable.
>>
>> OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.
>>
>> Thanks,
>> Andrew Pinski
>>
>> ChangeLog:
>> * config/aarch64/aarch64-tuning-flags.def (slow_ldpw): New tuning option.
>> * config/aarch64/aarch64.c (thunderx_tunings): Enable
>> AARCH64_EXTRA_TUNE_SLOW_LDPW.
>> (aarch64_operands_ok_for_ldpstp): Return false if
>> AARCH64_EXTRA_TUNE_SLOW_LDPW and the mode was SImode
>> and the alignment is less than 8 byte.
>> (aarch64_operands_adjust_ok_for_ldpstp): Likewise.
>>
>> testsuite/ChangeLog:
>> * gcc.target/aarch64/thunderxloadpair.c: New testcase.
>> * gcc.target/aarch64/thunderxnoloadpair.c: New testcase.
>>
>>>
>>> Thanks,
>>> Andrew Pinski
>>>
>>> ChangeLog:
>>> * config/aarch64/aarch64-tuning-flags.def (slow_ldpw): New tuning option.
>>> * config/aarch64/aarch64.c (thunderx_tunings): Enable
>>> AARCH64_EXTRA_TUNE_SLOW_LDPW.
>>> (aarch64_operands_ok_for_ldpstp): Return false if
>>> AARCH64_EXTRA_TUNE_SLOW_LDPW and the mode was SImode.
>>> (aarch64_operands_adjust_ok_for_ldpstp): Likewise.
>>>
>
> OK with the changes noted below.

This is what I committed after a bootstrap/test.

2016-09-12  Andrew Pinski  <apinski@cavium.com>

        * config/aarch64/aarch64-tuning-flags.def (SLOW_UNALIGNED_LDPW):
        New tuning option.
        * config/aarch64/aarch64.c (thunderx_tunings): Enable
        AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW.
        (aarch64_operands_ok_for_ldpstp): Return false if
        AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW and the mode
        was SImode and the alignment is less than 8 byte.
        (aarch64_operands_adjust_ok_for_ldpstp): Likewise.
2016-09-12  Andrew Pinski  <apinski@cavium.com>

        * gcc.target/aarch64/thunderxloadpair.c: New testcase.
        * gcc.target/aarch64/thunderxnoloadpair.c: New testcase.

Thanks,
Andrew

>
> R.
>
>>> stldpw.diff.txt
>>>
>>>
>>> Index: config/aarch64/aarch64-tuning-flags.def
>>> ===================================================================
>>> --- config/aarch64/aarch64-tuning-flags.def  (revision 239228)
>>> +++ config/aarch64/aarch64-tuning-flags.def  (working copy)
>>> @@ -29,3 +29,4 @@
>>>       AARCH64_TUNE_ to give an enum name. */
>>>
>>>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
>>> +AARCH64_EXTRA_TUNING_OPTION ("slow_ldpw", SLOW_LDPW)
>
> I think this should now be renamed SLOW_UNALIGNED_LDPW.  I think it also
> should be commented as to what it does at this point.
>
>>> Index: config/aarch64/aarch64.c
>>> ===================================================================
>>> --- config/aarch64/aarch64.c (revision 239228)
>>> +++ config/aarch64/aarch64.c (working copy)
>>> @@ -712,7 +712,7 @@
>>>    0,        /* max_case_values.  */
>>>    0,        /* cache_line_size.  */
>>>    tune_params::AUTOPREFETCHER_OFF,  /* autoprefetcher_model.  */
>>> -  (AARCH64_EXTRA_TUNE_NONE) /* tune_flags.  */
>>> +  (AARCH64_EXTRA_TUNE_SLOW_LDPW)    /* tune_flags.  */
>>>  };
>>>
>>>  static const struct tune_params xgene1_tunings =
>>> @@ -13593,6 +13593,15 @@
>>>    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
>>>      return false;
>>>
>>> +  /* If we have SImode and slow ldp, check the alignment to be greater
>>> +     than 8 byte. */
>
> Comment doesn't match code.  Should be "at least 8 byte alignment".
>
>>> +  if (mode == SImode
>>> +      && (aarch64_tune_params.extra_tuning_flags
>>> +          & AARCH64_EXTRA_TUNE_SLOW_LDPW)
>>> +      && !optimize_size
>>> +      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
>>> +    return false;
>>> +
>>>    /* Check if the addresses are in the form of [base+offset].  */
>>>    extract_base_offset_in_addr (mem_1, &base_1, &offset_1);
>>>    if (base_1 == NULL_RTX || offset_1 == NULL_RTX)
>>> @@ -13752,6 +13761,15 @@
>>>      return false;
>>>      }
>>>
>>> +  /* If we have SImode and slow ldp, check the alignment to be greater
>>> +     than 8 byte. */
>
> Likewise.
>
>>> +  if (mode == SImode
>>> +      && (aarch64_tune_params.extra_tuning_flags
>>> +          & AARCH64_EXTRA_TUNE_SLOW_LDPW)
>>> +      && !optimize_size
>>> +      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
>>> +    return false;
>>> +
>>>    if (REG_P (reg_1) && FP_REGNUM_P (REGNO (reg_1)))
>>>      rclass_1 = FP_REGS;
>>>    else
>>> Index: testsuite/gcc.target/aarch64/thunderxloadpair.c
>>> ===================================================================
>>> --- testsuite/gcc.target/aarch64/thunderxloadpair.c  (nonexistent)
>>> +++ testsuite/gcc.target/aarch64/thunderxloadpair.c  (working copy)
>>> @@ -0,0 +1,20 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O2 -mcpu=thunderx" } */
>>> +
>>> +struct ldp
>>> +{
>>> +  long long c;
>>> +  int a, b;
>>> +};
>>> +
>>> +
>>> +int f(struct ldp *a)
>>> +{
>>> +  return a->a + a->b;
>>> +}
>>> +
>>> +
>>> +/* We know the alignement of a->a to be 8 byte aligned so it is profitable
>>> +   to do ldp. */
>>> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 1 } } */
>>> +
>>> Index: testsuite/gcc.target/aarch64/thunderxnoloadpair.c
>>> ===================================================================
>>> --- testsuite/gcc.target/aarch64/thunderxnoloadpair.c        (nonexistent)
>>> +++ testsuite/gcc.target/aarch64/thunderxnoloadpair.c        (working copy)
>>> @@ -0,0 +1,17 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O2 -mcpu=thunderx" } */
>>> +
>>> +struct noldp
>>> +{
>>> +  int a, b;
>>> +};
>>> +
>>> +
>>> +int f(struct noldp *a)
>>> +{
>>> +  return a->a + a->b;
>>> +}
>>> +
>>> +/* We know the alignement of a->a to be 4 byte aligned so it is not profitable
>>> +   to do ldp. */
>>> +/* { dg-final { scan-assembler-time "ldp\tw\[0-9\]+, w\[0-9\]" 1 } } */
>

[-- Attachment #2: unalignedldpw.diff.txt --]
[-- Type: text/plain, Size: 4641 bytes --]

Index: gcc/ChangeLog
===================================================================
--- gcc/ChangeLog	(revision 240100)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,14 @@
+2016-09-12  Andrew Pinski  <apinski@cavium.com>
+
+	* config/aarch64/aarch64-tuning-flags.def (SLOW_UNALIGNED_LDPW):
+	New tuning option.
+	* config/aarch64/aarch64.c (thunderx_tunings): Enable
+	AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW.
+	(aarch64_operands_ok_for_ldpstp): Return false if
+	AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW and the mode
+	was SImode and the alignment is less than 8 byte.
+	(aarch64_operands_adjust_ok_for_ldpstp): Likewise.
+
 2016-09-12  Marek Polacek  <polacek@redhat.com>
 
 	* doc/extend.texi: Use lowercase "boolean".
Index: gcc/config/aarch64/aarch64-tuning-flags.def
===================================================================
--- gcc/config/aarch64/aarch64-tuning-flags.def	(revision 240100)
+++ gcc/config/aarch64/aarch64-tuning-flags.def	(working copy)
@@ -29,3 +29,8 @@
      AARCH64_TUNE_ to give an enum name. */
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
+
+/* Don't create non-8 byte aligned load/store pair.  That is if the
+two load/stores are not at least 8 byte aligned don't create load/store
+pairs.   */
+AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
Index: gcc/config/aarch64/aarch64.c
===================================================================
--- gcc/config/aarch64/aarch64.c	(revision 240100)
+++ gcc/config/aarch64/aarch64.c	(working copy)
@@ -712,7 +712,7 @@
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)	/* tune_flags.  */
 };
 
 static const struct tune_params xgene1_tunings =
@@ -13629,6 +13629,15 @@
   if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
     return false;
 
+  /* If we have SImode and slow unaligned ldp,
+     check the alignment to be at least 8 byte. */
+  if (mode == SImode
+      && (aarch64_tune_params.extra_tuning_flags
+          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
+      && !optimize_size
+      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
+    return false;
+
   /* Check if the addresses are in the form of [base+offset].  */
   extract_base_offset_in_addr (mem_1, &base_1, &offset_1);
   if (base_1 == NULL_RTX || offset_1 == NULL_RTX)
@@ -13788,6 +13797,15 @@
 	return false;
     }
 
+  /* If we have SImode and slow unaligned ldp,
+     check the alignment to be at least 8 byte. */
+  if (mode == SImode
+      && (aarch64_tune_params.extra_tuning_flags
+          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
+      && !optimize_size
+      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
+    return false;
+
   if (REG_P (reg_1) && FP_REGNUM_P (REGNO (reg_1)))
     rclass_1 = FP_REGS;
   else
Index: gcc/testsuite/ChangeLog
===================================================================
--- gcc/testsuite/ChangeLog	(revision 240100)
+++ gcc/testsuite/ChangeLog	(working copy)
@@ -1,3 +1,8 @@
+2016-09-12  Andrew Pinski  <apinski@cavium.com>
+
+	* gcc.target/aarch64/thunderxloadpair.c: New testcase.
+	* gcc.target/aarch64/thunderxnoloadpair.c: New testcase.
+
 2016-09-12  Uros Bizjak  <ubizjak@gmail.com>
 
 	* gcc.dg/compat/scalar-by-value-4_x.c: Also test passing of
Index: gcc/testsuite/gcc.target/aarch64/thunderxloadpair.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/thunderxloadpair.c	(nonexistent)
+++ gcc/testsuite/gcc.target/aarch64/thunderxloadpair.c	(working copy)
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=thunderx" } */
+
+struct ldp
+{
+  long long c;
+  int a, b;
+};
+
+
+int f(struct ldp *a)
+{
+  return a->a + a->b;
+}
+
+
+/* We know the alignement of a->a to be 8 byte aligned so it is profitable
+   to do ldp. */
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 1 } } */
+
Index: gcc/testsuite/gcc.target/aarch64/thunderxnoloadpair.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/thunderxnoloadpair.c	(nonexistent)
+++ gcc/testsuite/gcc.target/aarch64/thunderxnoloadpair.c	(working copy)
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=thunderx" } */
+
+struct noldp
+{
+  int a, b;
+};
+
+
+int f(struct noldp *a)
+{
+  return a->a + a->b;
+}
+
+/* We know the alignement of a->a to be 4 byte aligned so it is not profitable
+   to do ldp. */
+/* { dg-final { scan-assembler-not "ldp\tw\[0-9\]+, w\[0-9\]" } } */


