public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable
@ 2016-10-24 14:27 Kyrill Tkachov
  2016-10-24 16:16 ` Andrew Pinski
  2019-08-21 13:57 ` James Greenhalgh
  0 siblings, 2 replies; 13+ messages in thread
From: Kyrill Tkachov @ 2016-10-24 14:27 UTC (permalink / raw)
  To: GCC Patches; +Cc: Marcus Shawcroft, Richard Earnshaw, James Greenhalgh

[-- Attachment #1: Type: text/plain, Size: 1753 bytes --]

Hi all,

When storing a 64-bit immediate that has equal bottom and top halves we currently
synthesize the repeating 32-bit pattern twice and perform a single X-store.
With this patch we synthesize the 32-bit pattern once into a W register and store
that twice using an STP. This reduces codesize bloat from synthesising the same
constant multiple times at the expense of converting a store to a store-pair.
It will only trigger if we can save two or more instructions, so it will only transform:
         mov     x1, 49370
         movk    x1, 0xc0da, lsl 32
         str     x1, [x0]

into:

         mov     w1, 49370
         stp     w1, w1, [x0]

when optimising for -Os, whereas it will always transform a 4-insn synthesis
sequence into a two-insn sequence + STP (see comments in the patch).

This patch triggers already but will trigger more with the store merging pass
that I'm working on since that will generate more of these repeating 64-bit constants.
This helps improve codegen on 456.hmmer where store merging can sometimes create very
complex repeating constants and target-specific expand needs to break them down.

Bootstrapped and tested on aarch64-none-linux-gnu.

Ok for trunk?

Thanks,
Kyrill

2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * config/aarch64/aarch64.md (mov<mode>): Call
     aarch64_split_dimode_const_store on DImode constant stores.
     * config/aarch64/aarch64-protos.h (aarch64_split_dimode_const_store):
     New prototype.
     * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
     function.

2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.target/aarch64/store_repeating_constant_1.c: New test.
     * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.

[-- Attachment #2: aarch64-dimode-const.patch --]
[-- Type: text/x-patch, Size: 4842 bytes --]

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 07a8cd0455d64a861cb919083a9d369bf23724b7..4c551ef143d3b32e94bd58989c85ebd3352cdd9b 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -337,6 +337,7 @@ bool aarch64_simd_scalar_immediate_valid_for_move (rtx, machine_mode);
 bool aarch64_simd_shift_imm_p (rtx, machine_mode, bool);
 bool aarch64_simd_valid_immediate (rtx, machine_mode, bool,
 				   struct simd_immediate_info *);
+bool aarch64_split_dimode_const_store (rtx, rtx);
 bool aarch64_symbolic_address_p (rtx);
 bool aarch64_uimm12_shift (HOST_WIDE_INT);
 bool aarch64_use_return_insn_p (void);
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 3861e7d9ed82ebaffe93471fec08a72c70cfd133..0980befa42be65a760ad06e43decec77bff1b762 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -13141,6 +13141,61 @@ aarch64_expand_movmem (rtx *operands)
   return true;
 }
 
+/* Split a DImode store of a CONST_INT SRC to MEM DST as two
+   SImode stores.  Handle the case when the constant has identical
+   bottom and top halves.  This is beneficial when the two stores can be
+   merged into an STP and we avoid synthesising potentially expensive
+   immediates twice.  Return true if such a split is possible.  */
+
+bool
+aarch64_split_dimode_const_store (rtx dst, rtx src)
+{
+  rtx lo = gen_lowpart (SImode, src);
+  rtx hi = gen_highpart_mode (SImode, DImode, src);
+
+  bool size_p = optimize_function_for_size_p (cfun);
+
+  if (!rtx_equal_p (lo, hi))
+    return false;
+
+  unsigned int orig_cost
+    = aarch64_internal_mov_immediate (NULL_RTX, src, false, DImode);
+  unsigned int lo_cost
+    = aarch64_internal_mov_immediate (NULL_RTX, lo, false, SImode);
+
+  /* We want to transform:
+     MOV	x1, 49370
+     MOVK	x1, 0x140, lsl 16
+     MOVK	x1, 0xc0da, lsl 32
+     MOVK	x1, 0x140, lsl 48
+     STR	x1, [x0]
+   into:
+     MOV	w1, 49370
+     MOVK	w1, 0x140, lsl 16
+     STP	w1, w1, [x0]
+   So we want to perform this only when we save two instructions
+   or more.  When optimizing for size, however, accept any code size
+   savings we can.  */
+  if (size_p && orig_cost <= lo_cost)
+    return false;
+
+  if (!size_p
+      && (orig_cost <= lo_cost + 1))
+    return false;
+
+  rtx mem_lo = adjust_address (dst, SImode, 0);
+  if (!aarch64_mem_pair_operand (mem_lo, SImode))
+    return false;
+
+  rtx tmp_reg = gen_reg_rtx (SImode);
+  aarch64_expand_mov_immediate (tmp_reg, lo);
+  rtx mem_hi = aarch64_move_pointer (mem_lo, GET_MODE_SIZE (SImode));
+
+  emit_insn (gen_store_pairsi (mem_lo, tmp_reg, mem_hi, tmp_reg));
+
+  return true;
+}
+
 /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
 
 static unsigned HOST_WIDE_INT
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 8861ac18f4e33ada3bc6dde0e4667fd040b1c213..ec423eb4b048609daaf2f81b0ae874f2e4350f69 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1010,6 +1010,11 @@ (define_expand "mov<mode>"
 	(match_operand:GPI 1 "general_operand" ""))]
   ""
   "
+    if (MEM_P (operands[0]) && CONST_INT_P (operands[1])
+	&& <MODE>mode == DImode
+	&& aarch64_split_dimode_const_store (operands[0], operands[1]))
+      DONE;
+
     if (GET_CODE (operands[0]) == MEM && operands[1] != const0_rtx)
       operands[1] = force_reg (<MODE>mode, operands[1]);
 
diff --git a/gcc/testsuite/gcc.target/aarch64/store_repeating_constant_1.c b/gcc/testsuite/gcc.target/aarch64/store_repeating_constant_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..fc964272eb45dc067fac40a390cf67121b51c180
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/store_repeating_constant_1.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+void
+foo (unsigned long long *a)
+{
+  a[0] = 0x0140c0da0140c0daULL;
+}
+
+/* { dg-final { scan-assembler-times "movk\\tw.*" 1 } } */
+/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]+.*" 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/store_repeating_constant_2.c b/gcc/testsuite/gcc.target/aarch64/store_repeating_constant_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..c421277989adcf446ad8a7b3ab9060602c03a7ea
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/store_repeating_constant_2.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+/* Check that for -Os we synthesize only the bottom half and then
+   store it twice with an STP rather than synthesizing it twice in each
+   half of an X-reg.  */
+
+void
+foo (unsigned long long *a)
+{
+  a[0] = 0xc0da0000c0daULL;
+}
+
+/* { dg-final { scan-assembler-times "mov\\tw.*" 1 } } */
+/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]+.*" 1 } } */

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable
  2016-10-24 14:27 [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable Kyrill Tkachov
@ 2016-10-24 16:16 ` Andrew Pinski
  2016-10-31 11:54   ` Kyrill Tkachov
  2019-08-21 13:57 ` James Greenhalgh
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Pinski @ 2016-10-24 16:16 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh

On Mon, Oct 24, 2016 at 7:27 AM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Hi all,
>
> When storing a 64-bit immediate that has equal bottom and top halves we
> currently
> synthesize the repeating 32-bit pattern twice and perform a single X-store.
> With this patch we synthesize the 32-bit pattern once into a W register and
> store
> that twice using an STP. This reduces codesize bloat from synthesising the
> same
> constant multiple times at the expense of converting a store to a
> store-pair.
> It will only trigger if we can save two or more instructions, so it will
> only transform:
>         mov     x1, 49370
>         movk    x1, 0xc0da, lsl 32
>         str     x1, [x0]
>
> into:
>
>         mov     w1, 49370
>         stp     w1, w1, [x0]
>
> when optimising for -Os, whereas it will always transform a 4-insn synthesis
> sequence into a two-insn sequence + STP (see comments in the patch).
>
> This patch triggers already but will trigger more with the store merging
> pass
> that I'm working on since that will generate more of these repeating 64-bit
> constants.
> This helps improve codegen on 456.hmmer where store merging can sometimes
> create very
> complex repeating constants and target-specific expand needs to break them
> down.


Doing STP might be worse on ThunderX 1 than the mov/movk.  Or this
might cause an ICE with -mcpu=thunderx; I can't remember if the check
for slow unaligned store pair word is with the pattern or not.
Basically the rule is
1) if 4 byte aligned, then it is better to do two str.
2) If 8 byte aligned, then doing stp is good
3) Otherwise it is better to do two str.

Thanks,
Andrew


>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Ok for trunk?
>
> Thanks,
> Kyrill
>
> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * config/aarch64/aarch64.md (mov<mode>): Call
>     aarch64_split_dimode_const_store on DImode constant stores.
>     * config/aarch64/aarch64-protos.h (aarch64_split_dimode_const_store):
>     New prototype.
>     * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>     function.
>
> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>     * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable
  2016-10-24 16:16 ` Andrew Pinski
@ 2016-10-31 11:54   ` Kyrill Tkachov
  2016-10-31 13:42     ` Richard Earnshaw (lists)
  2016-11-01 15:21     ` Kyrill Tkachov
  0 siblings, 2 replies; 13+ messages in thread
From: Kyrill Tkachov @ 2016-10-31 11:54 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh


On 24/10/16 17:15, Andrew Pinski wrote:
> On Mon, Oct 24, 2016 at 7:27 AM, Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>> Hi all,
>>
>> When storing a 64-bit immediate that has equal bottom and top halves we
>> currently
>> synthesize the repeating 32-bit pattern twice and perform a single X-store.
>> With this patch we synthesize the 32-bit pattern once into a W register and
>> store
>> that twice using an STP. This reduces codesize bloat from synthesising the
>> same
>> constant multiple times at the expense of converting a store to a
>> store-pair.
>> It will only trigger if we can save two or more instructions, so it will
>> only transform:
>>          mov     x1, 49370
>>          movk    x1, 0xc0da, lsl 32
>>          str     x1, [x0]
>>
>> into:
>>
>>          mov     w1, 49370
>>          stp     w1, w1, [x0]
>>
>> when optimising for -Os, whereas it will always transform a 4-insn synthesis
>> sequence into a two-insn sequence + STP (see comments in the patch).
>>
>> This patch triggers already but will trigger more with the store merging
>> pass
>> that I'm working on since that will generate more of these repeating 64-bit
>> constants.
>> This helps improve codegen on 456.hmmer where store merging can sometimes
>> create very
>> complex repeating constants and target-specific expand needs to break them
>> down.
>
> Doing STP might be worse on ThunderX 1 than the mov/movk.  Or this
> might cause an ICE with -mcpu=thunderx; I can't remember if the check
> for slow unaligned store pair word is with the pattern or not.

I can't get it to ICE with -mcpu=thunderx.
The restriction is just on the STP forming code in the sched-fusion hooks AFAIK.

> Basically the rule is
> 1) if 4 byte aligned, then it is better to do two str.
> 2) If 8 byte aligned, then doing stp is good
> 3) Otherwise it is better to do two str.

Ok, then I'll make the function just emit two stores and depend on the sched-fusion
machinery to fuse them into an STP when appropriate since that has the logic that
takes thunderx into account.

Thanks for the info.
Kyrill

>
> Thanks,
> Andrew
>
>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>
>> Ok for trunk?
>>
>> Thanks,
>> Kyrill
>>
>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>      * config/aarch64/aarch64.md (mov<mode>): Call
>>      aarch64_split_dimode_const_store on DImode constant stores.
>>      * config/aarch64/aarch64-protos.h (aarch64_split_dimode_const_store):
>>      New prototype.
>>      * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>>      function.
>>
>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>      * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>>      * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable
  2016-10-31 11:54   ` Kyrill Tkachov
@ 2016-10-31 13:42     ` Richard Earnshaw (lists)
  2016-10-31 14:26       ` Kyrill Tkachov
  2016-11-01 15:21     ` Kyrill Tkachov
  1 sibling, 1 reply; 13+ messages in thread
From: Richard Earnshaw (lists) @ 2016-10-31 13:42 UTC (permalink / raw)
  To: Kyrill Tkachov, Andrew Pinski
  Cc: GCC Patches, Marcus Shawcroft, James Greenhalgh

On 31/10/16 11:54, Kyrill Tkachov wrote:
> 
> On 24/10/16 17:15, Andrew Pinski wrote:
>> On Mon, Oct 24, 2016 at 7:27 AM, Kyrill Tkachov
>> <kyrylo.tkachov@foss.arm.com> wrote:
>>> Hi all,
>>>
>>> When storing a 64-bit immediate that has equal bottom and top halves we
>>> currently
>>> synthesize the repeating 32-bit pattern twice and perform a single
>>> X-store.
>>> With this patch we synthesize the 32-bit pattern once into a W
>>> register and
>>> store
>>> that twice using an STP. This reduces codesize bloat from
>>> synthesising the
>>> same
>>> constant multiple times at the expense of converting a store to a
>>> store-pair.
>>> It will only trigger if we can save two or more instructions, so it will
>>> only transform:
>>>          mov     x1, 49370
>>>          movk    x1, 0xc0da, lsl 32
>>>          str     x1, [x0]
>>>
>>> into:
>>>
>>>          mov     w1, 49370
>>>          stp     w1, w1, [x0]
>>>
>>> when optimising for -Os, whereas it will always transform a 4-insn
>>> synthesis
>>> sequence into a two-insn sequence + STP (see comments in the patch).
>>>
>>> This patch triggers already but will trigger more with the store merging
>>> pass
>>> that I'm working on since that will generate more of these repeating
>>> 64-bit
>>> constants.
>>> This helps improve codegen on 456.hmmer where store merging can
>>> sometimes
>>> create very
>>> complex repeating constants and target-specific expand needs to break
>>> them
>>> down.
>>
>> Doing STP might be worse on ThunderX 1 than the mov/movk.  Or this
>> might cause an ICE with -mcpu=thunderx; I can't remember if the check
>> for slow unaligned store pair word is with the pattern or not.
> 
> I can't get it to ICE with -mcpu=thunderx.
> The restriction is just on the STP forming code in the sched-fusion
> hooks AFAIK.
> 
>> Basically the rule is
>> 1) if 4 byte aligned, then it is better to do two str.
>> 2) If 8 byte aligned, then doing stp is good
>> 3) Otherwise it is better to do two str.
> 
> Ok, then I'll make the function just emit two stores and depend on the
> sched-fusion
> machinery to fuse them into an STP when appropriate since that has the
> logic that
> takes thunderx into account.

If the mode is DImode (ie the pattern is 'movdi', then surely we must
have a 64-bit aligned store.

R.

> 
> Thanks for the info.
> Kyrill
> 
>>
>> Thanks,
>> Andrew
>>
>>
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>
>>> Ok for trunk?
>>>
>>> Thanks,
>>> Kyrill
>>>
>>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>      * config/aarch64/aarch64.md (mov<mode>): Call
>>>      aarch64_split_dimode_const_store on DImode constant stores.
>>>      * config/aarch64/aarch64-protos.h
>>> (aarch64_split_dimode_const_store):
>>>      New prototype.
>>>      * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>>>      function.
>>>
>>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>      * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>>>      * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable
  2016-10-31 13:42     ` Richard Earnshaw (lists)
@ 2016-10-31 14:26       ` Kyrill Tkachov
  0 siblings, 0 replies; 13+ messages in thread
From: Kyrill Tkachov @ 2016-10-31 14:26 UTC (permalink / raw)
  To: Richard Earnshaw (lists), Andrew Pinski
  Cc: GCC Patches, Marcus Shawcroft, James Greenhalgh


On 31/10/16 13:42, Richard Earnshaw (lists) wrote:
> On 31/10/16 11:54, Kyrill Tkachov wrote:
>> On 24/10/16 17:15, Andrew Pinski wrote:
>>> On Mon, Oct 24, 2016 at 7:27 AM, Kyrill Tkachov
>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>> Hi all,
>>>>
>>>> When storing a 64-bit immediate that has equal bottom and top halves we
>>>> currently
>>>> synthesize the repeating 32-bit pattern twice and perform a single
>>>> X-store.
>>>> With this patch we synthesize the 32-bit pattern once into a W
>>>> register and
>>>> store
>>>> that twice using an STP. This reduces codesize bloat from
>>>> synthesising the
>>>> same
>>>> constant multiple times at the expense of converting a store to a
>>>> store-pair.
>>>> It will only trigger if we can save two or more instructions, so it will
>>>> only transform:
>>>>           mov     x1, 49370
>>>>           movk    x1, 0xc0da, lsl 32
>>>>           str     x1, [x0]
>>>>
>>>> into:
>>>>
>>>>           mov     w1, 49370
>>>>           stp     w1, w1, [x0]
>>>>
>>>> when optimising for -Os, whereas it will always transform a 4-insn
>>>> synthesis
>>>> sequence into a two-insn sequence + STP (see comments in the patch).
>>>>
>>>> This patch triggers already but will trigger more with the store merging
>>>> pass
>>>> that I'm working on since that will generate more of these repeating
>>>> 64-bit
>>>> constants.
>>>> This helps improve codegen on 456.hmmer where store merging can
>>>> sometimes
>>>> create very
>>>> complex repeating constants and target-specific expand needs to break
>>>> them
>>>> down.
>>> Doing STP might be worse on ThunderX 1 than the mov/movk.  Or this
>>> might cause an ICE with -mcpu=thunderx; I can't remember if the check
>>> for slow unaligned store pair word is with the pattern or not.
>> I can't get it to ICE with -mcpu=thunderx.
>> The restriction is just on the STP forming code in the sched-fusion
>> hooks AFAIK.
>>
>>> Basically the rule is
>>> 1) if 4 byte aligned, then it is better to do two str.
>>> 2) If 8 byte aligned, then doing stp is good
>>> 3) Otherwise it is better to do two str.
>> Ok, then I'll make the function just emit two stores and depend on the
>> sched-fusion
>> machinery to fuse them into an STP when appropriate since that has the
>> logic that
>> takes thunderx into account.
> If the mode is DImode (ie the pattern is 'movdi', then surely we must
> have a 64-bit aligned store.

I don't think that's guaranteed. At least the Standard Names documentation
doesn't mention it. I've gone through an example with gdb where store merging
produces an unaligned store and the gen_movdi expander is called with a source
of (const_int 8589934593 [0x200000001]) and a destination of:
(mem:DI (reg/v/f:DI 73 [ a ]) [1 MEM[(int *)a_2(D)]+0 S8 A32])

i.e. a 32-bit aligned DImode MEM.

Thanks,
Kyrill

> R.
>
>> Thanks for the info.
>> Kyrill
>>
>>> Thanks,
>>> Andrew
>>>
>>>
>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>>
>>>> Ok for trunk?
>>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>
>>>>       * config/aarch64/aarch64.md (mov<mode>): Call
>>>>       aarch64_split_dimode_const_store on DImode constant stores.
>>>>       * config/aarch64/aarch64-protos.h
>>>> (aarch64_split_dimode_const_store):
>>>>       New prototype.
>>>>       * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>>>>       function.
>>>>
>>>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>
>>>>       * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>>>>       * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable
  2016-10-31 11:54   ` Kyrill Tkachov
  2016-10-31 13:42     ` Richard Earnshaw (lists)
@ 2016-11-01 15:21     ` Kyrill Tkachov
  2016-11-01 15:22       ` Kyrill Tkachov
                         ` (2 more replies)
  1 sibling, 3 replies; 13+ messages in thread
From: Kyrill Tkachov @ 2016-11-01 15:21 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh


On 31/10/16 11:54, Kyrill Tkachov wrote:
>
> On 24/10/16 17:15, Andrew Pinski wrote:
>> On Mon, Oct 24, 2016 at 7:27 AM, Kyrill Tkachov
>> <kyrylo.tkachov@foss.arm.com> wrote:
>>> Hi all,
>>>
>>> When storing a 64-bit immediate that has equal bottom and top halves we
>>> currently
>>> synthesize the repeating 32-bit pattern twice and perform a single X-store.
>>> With this patch we synthesize the 32-bit pattern once into a W register and
>>> store
>>> that twice using an STP. This reduces codesize bloat from synthesising the
>>> same
>>> constant multiple times at the expense of converting a store to a
>>> store-pair.
>>> It will only trigger if we can save two or more instructions, so it will
>>> only transform:
>>>          mov     x1, 49370
>>>          movk    x1, 0xc0da, lsl 32
>>>          str     x1, [x0]
>>>
>>> into:
>>>
>>>          mov     w1, 49370
>>>          stp     w1, w1, [x0]
>>>
>>> when optimising for -Os, whereas it will always transform a 4-insn synthesis
>>> sequence into a two-insn sequence + STP (see comments in the patch).
>>>
>>> This patch triggers already but will trigger more with the store merging
>>> pass
>>> that I'm working on since that will generate more of these repeating 64-bit
>>> constants.
>>> This helps improve codegen on 456.hmmer where store merging can sometimes
>>> create very
>>> complex repeating constants and target-specific expand needs to break them
>>> down.
>>
>> Doing STP might be worse on ThunderX 1 than the mov/movk.  Or this
>> might cause an ICE with -mcpu=thunderx; I can't remember if the check
>> for slow unaligned store pair word is with the pattern or not.
>
> I can't get it to ICE with -mcpu=thunderx.
> The restriction is just on the STP forming code in the sched-fusion hooks AFAIK.
>
>> Basically the rule is
>> 1) if 4 byte aligned, then it is better to do two str.
>> 2) If 8 byte aligned, then doing stp is good
>> 3) Otherwise it is better to do two str.
>
> Ok, then I'll make the function just emit two stores and depend on the sched-fusion
> machinery to fuse them into an STP when appropriate since that has the logic that
> takes thunderx into account.
>

Here it is.
I've confirmed that it emits to STRs for 4 byte aligned stores when -mtune=thunderx
and still generates STP for the other tunings, though now sched-fusion is responsible for
merging them, which is ok by me.

Bootstrapped and tested on aarch64.
Ok for trunk?

Thanks,
Kyril


2016-11-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * config/aarch64/aarch64.md (mov<mode>): Call
     aarch64_split_dimode_const_store on DImode constant stores.
     * config/aarch64/aarch64-protos.h (aarch64_split_dimode_const_store):
     New prototype.
     * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
     function.

2016-11-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.target/aarch64/store_repeating_constant_1.c: New test.
     * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.

>
>
>>
>> Thanks,
>> Andrew
>>
>>
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>
>>> Ok for trunk?
>>>
>>> Thanks,
>>> Kyrill
>>>
>>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>      * config/aarch64/aarch64.md (mov<mode>): Call
>>>      aarch64_split_dimode_const_store on DImode constant stores.
>>>      * config/aarch64/aarch64-protos.h (aarch64_split_dimode_const_store):
>>>      New prototype.
>>>      * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>>>      function.
>>>
>>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>      * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>>>      * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable
  2016-11-01 15:21     ` Kyrill Tkachov
@ 2016-11-01 15:22       ` Kyrill Tkachov
  2016-11-10  9:04       ` Kyrill Tkachov
  2016-11-17 12:24       ` James Greenhalgh
  2 siblings, 0 replies; 13+ messages in thread
From: Kyrill Tkachov @ 2016-11-01 15:22 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh

[-- Attachment #1: Type: text/plain, Size: 3916 bytes --]

And here is the patch itself.


On 01/11/16 15:21, Kyrill Tkachov wrote:
>
> On 31/10/16 11:54, Kyrill Tkachov wrote:
>>
>> On 24/10/16 17:15, Andrew Pinski wrote:
>>> On Mon, Oct 24, 2016 at 7:27 AM, Kyrill Tkachov
>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>> Hi all,
>>>>
>>>> When storing a 64-bit immediate that has equal bottom and top halves we
>>>> currently
>>>> synthesize the repeating 32-bit pattern twice and perform a single X-store.
>>>> With this patch we synthesize the 32-bit pattern once into a W register and
>>>> store
>>>> that twice using an STP. This reduces codesize bloat from synthesising the
>>>> same
>>>> constant multiple times at the expense of converting a store to a
>>>> store-pair.
>>>> It will only trigger if we can save two or more instructions, so it will
>>>> only transform:
>>>>          mov     x1, 49370
>>>>          movk    x1, 0xc0da, lsl 32
>>>>          str     x1, [x0]
>>>>
>>>> into:
>>>>
>>>>          mov     w1, 49370
>>>>          stp     w1, w1, [x0]
>>>>
>>>> when optimising for -Os, whereas it will always transform a 4-insn synthesis
>>>> sequence into a two-insn sequence + STP (see comments in the patch).
>>>>
>>>> This patch triggers already but will trigger more with the store merging
>>>> pass
>>>> that I'm working on since that will generate more of these repeating 64-bit
>>>> constants.
>>>> This helps improve codegen on 456.hmmer where store merging can sometimes
>>>> create very
>>>> complex repeating constants and target-specific expand needs to break them
>>>> down.
>>>
>>> Doing STP might be worse on ThunderX 1 than the mov/movk.  Or this
>>> might cause an ICE with -mcpu=thunderx; I can't remember if the check
>>> for slow unaligned store pair word is with the pattern or not.
>>
>> I can't get it to ICE with -mcpu=thunderx.
>> The restriction is just on the STP forming code in the sched-fusion hooks AFAIK.
>>
>>> Basically the rule is
>>> 1) if 4 byte aligned, then it is better to do two str.
>>> 2) If 8 byte aligned, then doing stp is good
>>> 3) Otherwise it is better to do two str.
>>
>> Ok, then I'll make the function just emit two stores and depend on the sched-fusion
>> machinery to fuse them into an STP when appropriate since that has the logic that
>> takes thunderx into account.
>>
>
> Here it is.
> I've confirmed that it emits to STRs for 4 byte aligned stores when -mtune=thunderx
> and still generates STP for the other tunings, though now sched-fusion is responsible for
> merging them, which is ok by me.
>
> Bootstrapped and tested on aarch64.
> Ok for trunk?
>
> Thanks,
> Kyril
>
>
> 2016-11-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * config/aarch64/aarch64.md (mov<mode>): Call
>     aarch64_split_dimode_const_store on DImode constant stores.
>     * config/aarch64/aarch64-protos.h (aarch64_split_dimode_const_store):
>     New prototype.
>     * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>     function.
>
> 2016-11-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>     * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.
>
>>
>>
>>>
>>> Thanks,
>>> Andrew
>>>
>>>
>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>>
>>>> Ok for trunk?
>>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>
>>>>      * config/aarch64/aarch64.md (mov<mode>): Call
>>>>      aarch64_split_dimode_const_store on DImode constant stores.
>>>>      * config/aarch64/aarch64-protos.h (aarch64_split_dimode_const_store):
>>>>      New prototype.
>>>>      * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>>>>      function.
>>>>
>>>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>
>>>>      * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>>>>      * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.
>>
>


[-- Attachment #2: aarch64-str-const.patch --]
[-- Type: text/x-patch, Size: 5006 bytes --]

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 4f4989d8b0da91096000d0b51736eaa5b0aa9474..b6ca3dfacb0dc88e5d688905d9d013263d4e8d7f 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -337,6 +337,7 @@ bool aarch64_simd_scalar_immediate_valid_for_move (rtx, machine_mode);
 bool aarch64_simd_shift_imm_p (rtx, machine_mode, bool);
 bool aarch64_simd_valid_immediate (rtx, machine_mode, bool,
 				   struct simd_immediate_info *);
+bool aarch64_split_dimode_const_store (rtx, rtx);
 bool aarch64_symbolic_address_p (rtx);
 bool aarch64_uimm12_shift (HOST_WIDE_INT);
 bool aarch64_use_return_insn_p (void);
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index a5c72a65451db639a5a623d15ecc61ceb60d1707..ec77d1cb2a6c63ac1703efc75fc67946e7d24c8e 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -13178,6 +13178,63 @@ aarch64_expand_movmem (rtx *operands)
   return true;
 }
 
+/* Split a DImode store of a CONST_INT SRC to MEM DST as two
+   SImode stores.  Handle the case when the constant has identical
+   bottom and top halves.  This is beneficial when the two stores can be
+   merged into an STP and we avoid synthesising potentially expensive
+   immediates twice.  Return true if such a split is possible.  */
+
+bool
+aarch64_split_dimode_const_store (rtx dst, rtx src)
+{
+  rtx lo = gen_lowpart (SImode, src);
+  rtx hi = gen_highpart_mode (SImode, DImode, src);
+
+  bool size_p = optimize_function_for_size_p (cfun);
+
+  if (!rtx_equal_p (lo, hi))
+    return false;
+
+  unsigned int orig_cost
+    = aarch64_internal_mov_immediate (NULL_RTX, src, false, DImode);
+  unsigned int lo_cost
+    = aarch64_internal_mov_immediate (NULL_RTX, lo, false, SImode);
+
+  /* We want to transform:
+     MOV	x1, 49370
+     MOVK	x1, 0x140, lsl 16
+     MOVK	x1, 0xc0da, lsl 32
+     MOVK	x1, 0x140, lsl 48
+     STR	x1, [x0]
+   into:
+     MOV	w1, 49370
+     MOVK	w1, 0x140, lsl 16
+     STP	w1, w1, [x0]
+   So we want to perform this only when we save two instructions
+   or more.  When optimizing for size, however, accept any code size
+   savings we can.  */
+  if (size_p && orig_cost <= lo_cost)
+    return false;
+
+  if (!size_p
+      && (orig_cost <= lo_cost + 1))
+    return false;
+
+  rtx mem_lo = adjust_address (dst, SImode, 0);
+  if (!aarch64_mem_pair_operand (mem_lo, SImode))
+    return false;
+
+  rtx tmp_reg = gen_reg_rtx (SImode);
+  aarch64_expand_mov_immediate (tmp_reg, lo);
+  rtx mem_hi = aarch64_move_pointer (mem_lo, GET_MODE_SIZE (SImode));
+  /* Don't emit an explicit store pair as this may not be always profitable.
+     Let the sched-fusion logic decide whether to merge them.  */
+  emit_move_insn (mem_lo, tmp_reg);
+  emit_move_insn (mem_hi, tmp_reg);
+
+  return true;
+}
+
 /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
 
 static unsigned HOST_WIDE_INT
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 8861ac18f4e33ada3bc6dde0e4667fd040b1c213..ec423eb4b048609daaf2f81b0ae874f2e4350f69 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1010,6 +1010,11 @@ (define_expand "mov<mode>"
 	(match_operand:GPI 1 "general_operand" ""))]
   ""
   "
+    if (MEM_P (operands[0]) && CONST_INT_P (operands[1])
+	&& <MODE>mode == DImode
+	&& aarch64_split_dimode_const_store (operands[0], operands[1]))
+      DONE;
+
     if (GET_CODE (operands[0]) == MEM && operands[1] != const0_rtx)
       operands[1] = force_reg (<MODE>mode, operands[1]);
 
diff --git a/gcc/testsuite/gcc.target/aarch64/store_repeating_constant_1.c b/gcc/testsuite/gcc.target/aarch64/store_repeating_constant_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..50d456834bcea10760f6cccecc89e3c21f53bb4c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/store_repeating_constant_1.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic" } */
+
+void
+foo (unsigned long long *a)
+{
+  a[0] = 0x0140c0da0140c0daULL;
+}
+
+/* { dg-final { scan-assembler-times "movk\\tw.*" 1 } } */
+/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]+.*" 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/store_repeating_constant_2.c b/gcc/testsuite/gcc.target/aarch64/store_repeating_constant_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..c421277989adcf446ad8a7b3ab9060602c03a7ea
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/store_repeating_constant_2.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+/* Check that for -Os we synthesize only the bottom half and then
+   store it twice with an STP rather than synthesizing it twice in each
+   half of an X-reg.  */
+
+void
+foo (unsigned long long *a)
+{
+  a[0] = 0xc0da0000c0daULL;
+}
+
+/* { dg-final { scan-assembler-times "mov\\tw.*" 1 } } */
+/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]+.*" 1 } } */

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable
  2016-11-01 15:21     ` Kyrill Tkachov
  2016-11-01 15:22       ` Kyrill Tkachov
@ 2016-11-10  9:04       ` Kyrill Tkachov
  2016-11-10  9:08         ` Andrew Pinski
  2016-11-17 12:24       ` James Greenhalgh
  2 siblings, 1 reply; 13+ messages in thread
From: Kyrill Tkachov @ 2016-11-10  9:04 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh

Ping.
https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00040.html

Andrew, do you have any objections to this version?
Thanks,
Kyrill

On 01/11/16 15:21, Kyrill Tkachov wrote:
>
> On 31/10/16 11:54, Kyrill Tkachov wrote:
>>
>> On 24/10/16 17:15, Andrew Pinski wrote:
>>> On Mon, Oct 24, 2016 at 7:27 AM, Kyrill Tkachov
>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>> Hi all,
>>>>
>>>> When storing a 64-bit immediate that has equal bottom and top halves we
>>>> currently
>>>> synthesize the repeating 32-bit pattern twice and perform a single X-store.
>>>> With this patch we synthesize the 32-bit pattern once into a W register and
>>>> store
>>>> that twice using an STP. This reduces codesize bloat from synthesising the
>>>> same
>>>> constant multiple times at the expense of converting a store to a
>>>> store-pair.
>>>> It will only trigger if we can save two or more instructions, so it will
>>>> only transform:
>>>>          mov     x1, 49370
>>>>          movk    x1, 0xc0da, lsl 32
>>>>          str     x1, [x0]
>>>>
>>>> into:
>>>>
>>>>          mov     w1, 49370
>>>>          stp     w1, w1, [x0]
>>>>
>>>> when optimising for -Os, whereas it will always transform a 4-insn synthesis
>>>> sequence into a two-insn sequence + STP (see comments in the patch).
>>>>
>>>> This patch triggers already but will trigger more with the store merging
>>>> pass
>>>> that I'm working on since that will generate more of these repeating 64-bit
>>>> constants.
>>>> This helps improve codegen on 456.hmmer where store merging can sometimes
>>>> create very
>>>> complex repeating constants and target-specific expand needs to break them
>>>> down.
>>>
>>> Doing STP might be worse on ThunderX 1 than the mov/movk.  Or this
>>> might cause an ICE with -mcpu=thunderx; I can't remember if the check
>>> for slow unaligned store pair word is with the pattern or not.
>>
>> I can't get it to ICE with -mcpu=thunderx.
>> The restriction is just on the STP forming code in the sched-fusion hooks AFAIK.
>>
>>> Basically the rule is
>>> 1) if 4 byte aligned, then it is better to do two str.
>>> 2) If 8 byte aligned, then doing stp is good
>>> 3) Otherwise it is better to do two str.
>>
>> Ok, then I'll make the function just emit two stores and depend on the sched-fusion
>> machinery to fuse them into an STP when appropriate since that has the logic that
>> takes thunderx into account.
>>
>
> Here it is.
> I've confirmed that it emits to STRs for 4 byte aligned stores when -mtune=thunderx
> and still generates STP for the other tunings, though now sched-fusion is responsible for
> merging them, which is ok by me.
>
> Bootstrapped and tested on aarch64.
> Ok for trunk?
>
> Thanks,
> Kyril
>
>
> 2016-11-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * config/aarch64/aarch64.md (mov<mode>): Call
>     aarch64_split_dimode_const_store on DImode constant stores.
>     * config/aarch64/aarch64-protos.h (aarch64_split_dimode_const_store):
>     New prototype.
>     * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>     function.
>
> 2016-11-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>     * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.
>
>>
>>
>>>
>>> Thanks,
>>> Andrew
>>>
>>>
>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>>
>>>> Ok for trunk?
>>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>
>>>>      * config/aarch64/aarch64.md (mov<mode>): Call
>>>>      aarch64_split_dimode_const_store on DImode constant stores.
>>>>      * config/aarch64/aarch64-protos.h (aarch64_split_dimode_const_store):
>>>>      New prototype.
>>>>      * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>>>>      function.
>>>>
>>>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>
>>>>      * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>>>>      * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.
>>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable
  2016-11-10  9:04       ` Kyrill Tkachov
@ 2016-11-10  9:08         ` Andrew Pinski
  2016-11-17 10:58           ` Kyrill Tkachov
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Pinski @ 2016-11-10  9:08 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh

On Thu, Nov 10, 2016 at 1:04 AM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Ping.
> https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00040.html
>
> Andrew, do you have any objections to this version?

Not really.

Thanks,
Andrew

> Thanks,
> Kyrill
>
> On 01/11/16 15:21, Kyrill Tkachov wrote:
>>
>>
>> On 31/10/16 11:54, Kyrill Tkachov wrote:
>>>
>>>
>>> On 24/10/16 17:15, Andrew Pinski wrote:
>>>>
>>>> On Mon, Oct 24, 2016 at 7:27 AM, Kyrill Tkachov
>>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> When storing a 64-bit immediate that has equal bottom and top halves we
>>>>> currently
>>>>> synthesize the repeating 32-bit pattern twice and perform a single
>>>>> X-store.
>>>>> With this patch we synthesize the 32-bit pattern once into a W register
>>>>> and
>>>>> store
>>>>> that twice using an STP. This reduces codesize bloat from synthesising
>>>>> the
>>>>> same
>>>>> constant multiple times at the expense of converting a store to a
>>>>> store-pair.
>>>>> It will only trigger if we can save two or more instructions, so it
>>>>> will
>>>>> only transform:
>>>>>          mov     x1, 49370
>>>>>          movk    x1, 0xc0da, lsl 32
>>>>>          str     x1, [x0]
>>>>>
>>>>> into:
>>>>>
>>>>>          mov     w1, 49370
>>>>>          stp     w1, w1, [x0]
>>>>>
>>>>> when optimising for -Os, whereas it will always transform a 4-insn
>>>>> synthesis
>>>>> sequence into a two-insn sequence + STP (see comments in the patch).
>>>>>
>>>>> This patch triggers already but will trigger more with the store
>>>>> merging
>>>>> pass
>>>>> that I'm working on since that will generate more of these repeating
>>>>> 64-bit
>>>>> constants.
>>>>> This helps improve codegen on 456.hmmer where store merging can
>>>>> sometimes
>>>>> create very
>>>>> complex repeating constants and target-specific expand needs to break
>>>>> them
>>>>> down.
>>>>
>>>>
>>>> Doing STP might be worse on ThunderX 1 than the mov/movk.  Or this
>>>> might cause an ICE with -mcpu=thunderx; I can't remember if the check
>>>> for slow unaligned store pair word is with the pattern or not.
>>>
>>>
>>> I can't get it to ICE with -mcpu=thunderx.
>>> The restriction is just on the STP forming code in the sched-fusion hooks
>>> AFAIK.
>>>
>>>> Basically the rule is
>>>> 1) if 4 byte aligned, then it is better to do two str.
>>>> 2) If 8 byte aligned, then doing stp is good
>>>> 3) Otherwise it is better to do two str.
>>>
>>>
>>> Ok, then I'll make the function just emit two stores and depend on the
>>> sched-fusion
>>> machinery to fuse them into an STP when appropriate since that has the
>>> logic that
>>> takes thunderx into account.
>>>
>>
>> Here it is.
>> I've confirmed that it emits to STRs for 4 byte aligned stores when
>> -mtune=thunderx
>> and still generates STP for the other tunings, though now sched-fusion is
>> responsible for
>> merging them, which is ok by me.
>>
>> Bootstrapped and tested on aarch64.
>> Ok for trunk?
>>
>> Thanks,
>> Kyril
>>
>>
>> 2016-11-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>     * config/aarch64/aarch64.md (mov<mode>): Call
>>     aarch64_split_dimode_const_store on DImode constant stores.
>>     * config/aarch64/aarch64-protos.h (aarch64_split_dimode_const_store):
>>     New prototype.
>>     * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>>     function.
>>
>> 2016-11-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>     * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>>     * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.
>>
>>>
>>>
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>>>
>>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>>>
>>>>> Ok for trunk?
>>>>>
>>>>> Thanks,
>>>>> Kyrill
>>>>>
>>>>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>>
>>>>>      * config/aarch64/aarch64.md (mov<mode>): Call
>>>>>      aarch64_split_dimode_const_store on DImode constant stores.
>>>>>      * config/aarch64/aarch64-protos.h
>>>>> (aarch64_split_dimode_const_store):
>>>>>      New prototype.
>>>>>      * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>>>>>      function.
>>>>>
>>>>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>>
>>>>>      * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>>>>>      * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.
>>>
>>>
>>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable
  2016-11-10  9:08         ` Andrew Pinski
@ 2016-11-17 10:58           ` Kyrill Tkachov
  0 siblings, 0 replies; 13+ messages in thread
From: Kyrill Tkachov @ 2016-11-17 10:58 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh

Ping.

Thanks,
Kyrill

On 10/11/16 09:08, Andrew Pinski wrote:
> On Thu, Nov 10, 2016 at 1:04 AM, Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>> Ping.
>> https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00040.html
>>
>> Andrew, do you have any objections to this version?
> Not really.
>
> Thanks,
> Andrew
>
>> Thanks,
>> Kyrill
>>
>> On 01/11/16 15:21, Kyrill Tkachov wrote:
>>>
>>> On 31/10/16 11:54, Kyrill Tkachov wrote:
>>>>
>>>> On 24/10/16 17:15, Andrew Pinski wrote:
>>>>> On Mon, Oct 24, 2016 at 7:27 AM, Kyrill Tkachov
>>>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> When storing a 64-bit immediate that has equal bottom and top halves we
>>>>>> currently
>>>>>> synthesize the repeating 32-bit pattern twice and perform a single
>>>>>> X-store.
>>>>>> With this patch we synthesize the 32-bit pattern once into a W register
>>>>>> and
>>>>>> store
>>>>>> that twice using an STP. This reduces codesize bloat from synthesising
>>>>>> the
>>>>>> same
>>>>>> constant multiple times at the expense of converting a store to a
>>>>>> store-pair.
>>>>>> It will only trigger if we can save two or more instructions, so it
>>>>>> will
>>>>>> only transform:
>>>>>>           mov     x1, 49370
>>>>>>           movk    x1, 0xc0da, lsl 32
>>>>>>           str     x1, [x0]
>>>>>>
>>>>>> into:
>>>>>>
>>>>>>           mov     w1, 49370
>>>>>>           stp     w1, w1, [x0]
>>>>>>
>>>>>> when optimising for -Os, whereas it will always transform a 4-insn
>>>>>> synthesis
>>>>>> sequence into a two-insn sequence + STP (see comments in the patch).
>>>>>>
>>>>>> This patch triggers already but will trigger more with the store
>>>>>> merging
>>>>>> pass
>>>>>> that I'm working on since that will generate more of these repeating
>>>>>> 64-bit
>>>>>> constants.
>>>>>> This helps improve codegen on 456.hmmer where store merging can
>>>>>> sometimes
>>>>>> create very
>>>>>> complex repeating constants and target-specific expand needs to break
>>>>>> them
>>>>>> down.
>>>>>
>>>>> Doing STP might be worse on ThunderX 1 than the mov/movk.  Or this
>>>>> might cause an ICE with -mcpu=thunderx; I can't remember if the check
>>>>> for slow unaligned store pair word is with the pattern or not.
>>>>
>>>> I can't get it to ICE with -mcpu=thunderx.
>>>> The restriction is just on the STP forming code in the sched-fusion hooks
>>>> AFAIK.
>>>>
>>>>> Basically the rule is
>>>>> 1) if 4 byte aligned, then it is better to do two str.
>>>>> 2) If 8 byte aligned, then doing stp is good
>>>>> 3) Otherwise it is better to do two str.
>>>>
>>>> Ok, then I'll make the function just emit two stores and depend on the
>>>> sched-fusion
>>>> machinery to fuse them into an STP when appropriate since that has the
>>>> logic that
>>>> takes thunderx into account.
>>>>
>>> Here it is.
>>> I've confirmed that it emits to STRs for 4 byte aligned stores when
>>> -mtune=thunderx
>>> and still generates STP for the other tunings, though now sched-fusion is
>>> responsible for
>>> merging them, which is ok by me.
>>>
>>> Bootstrapped and tested on aarch64.
>>> Ok for trunk?
>>>
>>> Thanks,
>>> Kyril
>>>
>>>
>>> 2016-11-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>      * config/aarch64/aarch64.md (mov<mode>): Call
>>>      aarch64_split_dimode_const_store on DImode constant stores.
>>>      * config/aarch64/aarch64-protos.h (aarch64_split_dimode_const_store):
>>>      New prototype.
>>>      * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>>>      function.
>>>
>>> 2016-11-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>      * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>>>      * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.
>>>
>>>>
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>>
>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>>>>
>>>>>> Ok for trunk?
>>>>>>
>>>>>> Thanks,
>>>>>> Kyrill
>>>>>>
>>>>>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>>>
>>>>>>       * config/aarch64/aarch64.md (mov<mode>): Call
>>>>>>       aarch64_split_dimode_const_store on DImode constant stores.
>>>>>>       * config/aarch64/aarch64-protos.h
>>>>>> (aarch64_split_dimode_const_store):
>>>>>>       New prototype.
>>>>>>       * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>>>>>>       function.
>>>>>>
>>>>>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>>>
>>>>>>       * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>>>>>>       * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.
>>>>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable
  2016-11-01 15:21     ` Kyrill Tkachov
  2016-11-01 15:22       ` Kyrill Tkachov
  2016-11-10  9:04       ` Kyrill Tkachov
@ 2016-11-17 12:24       ` James Greenhalgh
  2 siblings, 0 replies; 13+ messages in thread
From: James Greenhalgh @ 2016-11-17 12:24 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: Andrew Pinski, GCC Patches, Marcus Shawcroft, Richard Earnshaw, nd

On Tue, Nov 01, 2016 at 03:21:29PM +0000, Kyrill Tkachov wrote:
> Here it is.
> I've confirmed that it emits to STRs for 4 byte aligned stores when -mtune=thunderx
> and still generates STP for the other tunings, though now sched-fusion is responsible for
> merging them, which is ok by me.
> 
> Bootstrapped and tested on aarch64.
> Ok for trunk?

OK.

Thanks,
James

> 2016-11-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> 
>     * config/aarch64/aarch64.md (mov<mode>): Call
>     aarch64_split_dimode_const_store on DImode constant stores.
>     * config/aarch64/aarch64-protos.h (aarch64_split_dimode_const_store):
>     New prototype.
>     * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>     function.
> 
> 2016-11-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> 
>     * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>     * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable
  2016-10-24 14:27 [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable Kyrill Tkachov
  2016-10-24 16:16 ` Andrew Pinski
@ 2019-08-21 13:57 ` James Greenhalgh
  2019-08-21 17:32   ` Kyrill Tkachov
  1 sibling, 1 reply; 13+ messages in thread
From: James Greenhalgh @ 2019-08-21 13:57 UTC (permalink / raw)
  To: Kyrill Tkachov; +Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, nd

On Mon, Oct 24, 2016 at 03:27:10PM +0100, Kyrill Tkachov wrote:
> Hi all,
> 
> When storing a 64-bit immediate that has equal bottom and top halves we currently
> synthesize the repeating 32-bit pattern twice and perform a single X-store.
> With this patch we synthesize the 32-bit pattern once into a W register and store
> that twice using an STP. This reduces codesize bloat from synthesising the same
> constant multiple times at the expense of converting a store to a store-pair.
> It will only trigger if we can save two or more instructions, so it will only transform:
>          mov     x1, 49370
>          movk    x1, 0xc0da, lsl 32
>          str     x1, [x0]
> 
> into:
> 
>          mov     w1, 49370
>          stp     w1, w1, [x0]
> 
> when optimising for -Os, whereas it will always transform a 4-insn synthesis
> sequence into a two-insn sequence + STP (see comments in the patch).
> 
> This patch triggers already but will trigger more with the store merging pass
> that I'm working on since that will generate more of these repeating 64-bit constants.
> This helps improve codegen on 456.hmmer where store merging can sometimes create very
> complex repeating constants and target-specific expand needs to break them down.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> 
> Ok for trunk?

Hi Kyrill,

Does this do the right thing for:

  void bar(u64 *x)
  {
	*(volatile u64 *)x = 0xabcdef10abcdef10;
  }

C.f. https://lore.kernel.org/lkml/20190821103200.kpufwtviqhpbuv2n@willie-the-truck/

i.e. is this optimization still valid for volatile?

Thanks,
James

> 
> Thanks,
> Kyrill
> 
> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> 
>      * config/aarch64/aarch64.md (mov<mode>): Call
>      aarch64_split_dimode_const_store on DImode constant stores.
>      * config/aarch64/aarch64-protos.h (aarch64_split_dimode_const_store):
>      New prototype.
>      * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>      function.
> 
> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> 
>      * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>      * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable
  2019-08-21 13:57 ` James Greenhalgh
@ 2019-08-21 17:32   ` Kyrill Tkachov
  0 siblings, 0 replies; 13+ messages in thread
From: Kyrill Tkachov @ 2019-08-21 17:32 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw, nd

Hi James,

On 8/21/19 1:48 PM, James Greenhalgh wrote:
> On Mon, Oct 24, 2016 at 03:27:10PM +0100, Kyrill Tkachov wrote:
>> Hi all,
>>
>> When storing a 64-bit immediate that has equal bottom and top halves we currently
>> synthesize the repeating 32-bit pattern twice and perform a single X-store.
>> With this patch we synthesize the 32-bit pattern once into a W register and store
>> that twice using an STP. This reduces codesize bloat from synthesising the same
>> constant multiple times at the expense of converting a store to a store-pair.
>> It will only trigger if we can save two or more instructions, so it will only transform:
>>           mov     x1, 49370
>>           movk    x1, 0xc0da, lsl 32
>>           str     x1, [x0]
>>
>> into:
>>
>>           mov     w1, 49370
>>           stp     w1, w1, [x0]
>>
>> when optimising for -Os, whereas it will always transform a 4-insn synthesis
>> sequence into a two-insn sequence + STP (see comments in the patch).
>>
>> This patch triggers already but will trigger more with the store merging pass
>> that I'm working on since that will generate more of these repeating 64-bit constants.
>> This helps improve codegen on 456.hmmer where store merging can sometimes create very
>> complex repeating constants and target-specific expand needs to break them down.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>
>> Ok for trunk?
> Hi Kyrill,
>
> Does this do the right thing for:
>
>    void bar(u64 *x)
>    {
> 	*(volatile u64 *)x = 0xabcdef10abcdef10;
>    }
>
> C.f. https://lore.kernel.org/lkml/20190821103200.kpufwtviqhpbuv2n@willie-the-truck/
>
> i.e. is this optimization still valid for volatile?

Not sure if it's valid, but it's most likely undesirable for whatever 
gain it gives in this scenario.

I'm testing a patch to disable it for volatile destinations.

Thanks,

Kyrill

> Thanks,
> James
>
>> Thanks,
>> Kyrill
>>
>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>       * config/aarch64/aarch64.md (mov<mode>): Call
>>       aarch64_split_dimode_const_store on DImode constant stores.
>>       * config/aarch64/aarch64-protos.h (aarch64_split_dimode_const_store):
>>       New prototype.
>>       * config/aarch64/aarch64.c (aarch64_split_dimode_const_store): New
>>       function.
>>
>> 2016-10-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>       * gcc.target/aarch64/store_repeating_constant_1.c: New test.
>>       * gcc.target/aarch64/store_repeating_constant_2.c: Likewise.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-08-21 16:57 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-24 14:27 [PATCH][AArch64] Expand DImode constant stores to two SImode stores when profitable Kyrill Tkachov
2016-10-24 16:16 ` Andrew Pinski
2016-10-31 11:54   ` Kyrill Tkachov
2016-10-31 13:42     ` Richard Earnshaw (lists)
2016-10-31 14:26       ` Kyrill Tkachov
2016-11-01 15:21     ` Kyrill Tkachov
2016-11-01 15:22       ` Kyrill Tkachov
2016-11-10  9:04       ` Kyrill Tkachov
2016-11-10  9:08         ` Andrew Pinski
2016-11-17 10:58           ` Kyrill Tkachov
2016-11-17 12:24       ` James Greenhalgh
2019-08-21 13:57 ` James Greenhalgh
2019-08-21 17:32   ` Kyrill Tkachov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).