* arm memcpy of aligned data
@ 2015-05-28 21:36 Mike Stump
2015-05-29 8:22 ` Oleg Endo
2015-05-29 10:15 ` Kyrill Tkachov
0 siblings, 2 replies; 8+ messages in thread
From: Mike Stump @ 2015-05-28 21:36 UTC (permalink / raw)
To: gcc-patches
[-- Attachment #1: Type: text/plain, Size: 2010 bytes --]
So, the arm memcpy code of aligned data isn’t as good as it can be.
void *memcpy(void *dest, const void *src, unsigned int n);
void foo(char *dst, int i) {
memcpy (dst, &i, sizeof (i));
}
generates horrible code, but, it we are willing to notice the src or the destination are aligned, we can do much better:
$ ./cc1 -fschedule-fusion -fdump-tree-all-all -da -march=armv7ve -mcpu=cortex-m4 -fomit-frame-pointer -quiet -O2 /tmp/t.c -o t.s
$ cat t.s
[ … ]
foo:
@ args = 0, pretend = 0, frame = 4
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
sub sp, sp, #4
str r1, [r0] @ unaligned
add sp, sp, #4
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c (revision 223842)
+++ gcc/config/arm/arm.c (working copy)
@@ -14376,7 +14376,10 @@ arm_block_move_unaligned_straight (rtx d
srcoffset + j * UNITS_PER_WORD - src_autoinc);
mem = adjust_automodify_address (srcbase, SImode, addr,
srcoffset + j * UNITS_PER_WORD);
- emit_insn (gen_unaligned_loadsi (regs[j], mem));
+ if (src_aligned)
+ emit_move_insn (regs[j], mem);
+ else
+ emit_insn (gen_unaligned_loadsi (regs[j], mem));
}
srcoffset += words * UNITS_PER_WORD;
}
@@ -14395,7 +14398,10 @@ arm_block_move_unaligned_straight (rtx d
dstoffset + j * UNITS_PER_WORD - dst_autoinc);
mem = adjust_automodify_address (dstbase, SImode, addr,
dstoffset + j * UNITS_PER_WORD);
- emit_insn (gen_unaligned_storesi (mem, regs[j]));
+ if (dst_aligned)
+ emit_move_insn (mem, regs[j]);
+ else
+ emit_insn (gen_unaligned_storesi (mem, regs[j]));
}
dstoffset += words * UNITS_PER_WORD;
}
Ok?
Can someone spin this through an arm test suite run for me, I was doing this by inspection and cross compile on a system with no arm bits. Bonus points if you can check it in with the test case above marked up as appropriate.
[-- Attachment #2: arm.diffs.txt --]
[-- Type: text/plain, Size: 1081 bytes --]
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c (revision 223842)
+++ gcc/config/arm/arm.c (working copy)
@@ -14376,7 +14376,10 @@ arm_block_move_unaligned_straight (rtx d
srcoffset + j * UNITS_PER_WORD - src_autoinc);
mem = adjust_automodify_address (srcbase, SImode, addr,
srcoffset + j * UNITS_PER_WORD);
- emit_insn (gen_unaligned_loadsi (regs[j], mem));
+ if (src_aligned)
+ emit_move_insn (regs[j], mem);
+ else
+ emit_insn (gen_unaligned_loadsi (regs[j], mem));
}
srcoffset += words * UNITS_PER_WORD;
}
@@ -14395,7 +14398,10 @@ arm_block_move_unaligned_straight (rtx d
dstoffset + j * UNITS_PER_WORD - dst_autoinc);
mem = adjust_automodify_address (dstbase, SImode, addr,
dstoffset + j * UNITS_PER_WORD);
- emit_insn (gen_unaligned_storesi (mem, regs[j]));
+ if (dst_aligned)
+ emit_move_insn (mem, regs[j]);
+ else
+ emit_insn (gen_unaligned_storesi (mem, regs[j]));
}
dstoffset += words * UNITS_PER_WORD;
}
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: arm memcpy of aligned data
2015-05-28 21:36 arm memcpy of aligned data Mike Stump
@ 2015-05-29 8:22 ` Oleg Endo
2015-05-29 10:15 ` Kyrill Tkachov
1 sibling, 0 replies; 8+ messages in thread
From: Oleg Endo @ 2015-05-29 8:22 UTC (permalink / raw)
To: Mike Stump; +Cc: gcc-patches
On 28 May 2015, at 23:15, Mike Stump <mikestump@comcast.net> wrote:
> So, the arm memcpy code of aligned data isn’t as good as it can be.
>
> void *memcpy(void *dest, const void *src, unsigned int n);
>
> void foo(char *dst, int i) {
> memcpy (dst, &i, sizeof (i));
> }
>
> generates horrible code, but, it we are willing to notice the src or the destination are aligned, we can do much better:
>
This looks like PR 50417, doesn't it?
Cheers,
Oleg
> $ ./cc1 -fschedule-fusion -fdump-tree-all-all -da -march=armv7ve -mcpu=cortex-m4 -fomit-frame-pointer -quiet -O2 /tmp/t.c -o t.s
> $ cat t.s
> [ … ]
> foo:
> @ args = 0, pretend = 0, frame = 4
> @ frame_needed = 0, uses_anonymous_args = 0
> @ link register save eliminated.
> sub sp, sp, #4
> str r1, [r0] @ unaligned
> add sp, sp, #4
>
> Index: gcc/config/arm/arm.c
> ===================================================================
> --- gcc/config/arm/arm.c (revision 223842)
> +++ gcc/config/arm/arm.c (working copy)
> @@ -14376,7 +14376,10 @@ arm_block_move_unaligned_straight (rtx d
> srcoffset + j * UNITS_PER_WORD - src_autoinc);
> mem = adjust_automodify_address (srcbase, SImode, addr,
> srcoffset + j * UNITS_PER_WORD);
> - emit_insn (gen_unaligned_loadsi (regs[j], mem));
> + if (src_aligned)
> + emit_move_insn (regs[j], mem);
> + else
> + emit_insn (gen_unaligned_loadsi (regs[j], mem));
> }
> srcoffset += words * UNITS_PER_WORD;
> }
> @@ -14395,7 +14398,10 @@ arm_block_move_unaligned_straight (rtx d
> dstoffset + j * UNITS_PER_WORD - dst_autoinc);
> mem = adjust_automodify_address (dstbase, SImode, addr,
> dstoffset + j * UNITS_PER_WORD);
> - emit_insn (gen_unaligned_storesi (mem, regs[j]));
> + if (dst_aligned)
> + emit_move_insn (mem, regs[j]);
> + else
> + emit_insn (gen_unaligned_storesi (mem, regs[j]));
> }
> dstoffset += words * UNITS_PER_WORD;
> }
>
>
> Ok?
>
> Can someone spin this through an arm test suite run for me, I was doing this by inspection and cross compile on a system with no arm bits. Bonus points if you can check it in with the test case above marked up as appropriate.
>
> <arm.diffs.txt>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: arm memcpy of aligned data
2015-05-28 21:36 arm memcpy of aligned data Mike Stump
2015-05-29 8:22 ` Oleg Endo
@ 2015-05-29 10:15 ` Kyrill Tkachov
2015-05-29 10:40 ` Kyrill Tkachov
1 sibling, 1 reply; 8+ messages in thread
From: Kyrill Tkachov @ 2015-05-29 10:15 UTC (permalink / raw)
To: Mike Stump, gcc-patches
Hi Mike,
On 28/05/15 22:15, Mike Stump wrote:
> So, the arm memcpy code of aligned data isn’t as good as it can be.
>
> void *memcpy(void *dest, const void *src, unsigned int n);
>
> void foo(char *dst, int i) {
> memcpy (dst, &i, sizeof (i));
> }
>
> generates horrible code, but, it we are willing to notice the src or the destination are aligned, we can do much better:
>
> $ ./cc1 -fschedule-fusion -fdump-tree-all-all -da -march=armv7ve -mcpu=cortex-m4 -fomit-frame-pointer -quiet -O2 /tmp/t.c -o t.s
> $ cat t.s
> [ … ]
> foo:
> @ args = 0, pretend = 0, frame = 4
> @ frame_needed = 0, uses_anonymous_args = 0
> @ link register save eliminated.
> sub sp, sp, #4
> str r1, [r0] @ unaligned
> add sp, sp, #4
I think there's something to do with cpu tuning here as well.
For the code you've given compiled with -O2 -mcpu=cortex-a53 I get:
sub sp, sp, #8
mov r2, r0
add r3, sp, #8
str r1, [r3, #-4]!
ldr r0, [r3] @ unaligned
str r0, [r2] @ unaligned
add sp, sp, #8
@ sp needed
bx lr
whereas for -O2 -mcpu=cortex-a57 I get the much better:
sub sp, sp, #8
str r1, [r0] @ unaligned
add sp, sp, #8
@ sp needed
bx lr
Kyrill
>
> Index: gcc/config/arm/arm.c
> ===================================================================
> --- gcc/config/arm/arm.c (revision 223842)
> +++ gcc/config/arm/arm.c (working copy)
> @@ -14376,7 +14376,10 @@ arm_block_move_unaligned_straight (rtx d
> srcoffset + j * UNITS_PER_WORD - src_autoinc);
> mem = adjust_automodify_address (srcbase, SImode, addr,
> srcoffset + j * UNITS_PER_WORD);
> - emit_insn (gen_unaligned_loadsi (regs[j], mem));
> + if (src_aligned)
> + emit_move_insn (regs[j], mem);
> + else
> + emit_insn (gen_unaligned_loadsi (regs[j], mem));
> }
> srcoffset += words * UNITS_PER_WORD;
> }
> @@ -14395,7 +14398,10 @@ arm_block_move_unaligned_straight (rtx d
> dstoffset + j * UNITS_PER_WORD - dst_autoinc);
> mem = adjust_automodify_address (dstbase, SImode, addr,
> dstoffset + j * UNITS_PER_WORD);
> - emit_insn (gen_unaligned_storesi (mem, regs[j]));
> + if (dst_aligned)
> + emit_move_insn (mem, regs[j]);
> + else
> + emit_insn (gen_unaligned_storesi (mem, regs[j]));
> }
> dstoffset += words * UNITS_PER_WORD;
> }
>
>
> Ok?
>
> Can someone spin this through an arm test suite run for me, I was doing this by inspection and cross compile on a system with no arm bits. Bonus points if you can check it in with the test case above marked up as appropriate.
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: arm memcpy of aligned data
2015-05-29 10:15 ` Kyrill Tkachov
@ 2015-05-29 10:40 ` Kyrill Tkachov
2015-06-15 14:41 ` Kyrill Tkachov
0 siblings, 1 reply; 8+ messages in thread
From: Kyrill Tkachov @ 2015-05-29 10:40 UTC (permalink / raw)
To: Mike Stump, gcc-patches
On 29/05/15 10:08, Kyrill Tkachov wrote:
> Hi Mike,
>
> On 28/05/15 22:15, Mike Stump wrote:
>> So, the arm memcpy code of aligned data isn’t as good as it can be.
>>
>> void *memcpy(void *dest, const void *src, unsigned int n);
>>
>> void foo(char *dst, int i) {
>> memcpy (dst, &i, sizeof (i));
>> }
>>
>> generates horrible code, but, it we are willing to notice the src or the destination are aligned, we can do much better:
>>
>> $ ./cc1 -fschedule-fusion -fdump-tree-all-all -da -march=armv7ve -mcpu=cortex-m4 -fomit-frame-pointer -quiet -O2 /tmp/t.c -o t.s
>> $ cat t.s
>> [ … ]
>> foo:
>> @ args = 0, pretend = 0, frame = 4
>> @ frame_needed = 0, uses_anonymous_args = 0
>> @ link register save eliminated.
>> sub sp, sp, #4
>> str r1, [r0] @ unaligned
>> add sp, sp, #4
> I think there's something to do with cpu tuning here as well.
That being said, I do think this is a good idea.
I'll give it a test.
Kyrill
> For the code you've given compiled with -O2 -mcpu=cortex-a53 I get:
> sub sp, sp, #8
> mov r2, r0
> add r3, sp, #8
> str r1, [r3, #-4]!
> ldr r0, [r3] @ unaligned
> str r0, [r2] @ unaligned
> add sp, sp, #8
> @ sp needed
> bx lr
>
> whereas for -O2 -mcpu=cortex-a57 I get the much better:
> sub sp, sp, #8
> str r1, [r0] @ unaligned
> add sp, sp, #8
> @ sp needed
> bx lr
>
> Kyrill
>
>
>> Index: gcc/config/arm/arm.c
>> ===================================================================
>> --- gcc/config/arm/arm.c (revision 223842)
>> +++ gcc/config/arm/arm.c (working copy)
>> @@ -14376,7 +14376,10 @@ arm_block_move_unaligned_straight (rtx d
>> srcoffset + j * UNITS_PER_WORD - src_autoinc);
>> mem = adjust_automodify_address (srcbase, SImode, addr,
>> srcoffset + j * UNITS_PER_WORD);
>> - emit_insn (gen_unaligned_loadsi (regs[j], mem));
>> + if (src_aligned)
>> + emit_move_insn (regs[j], mem);
>> + else
>> + emit_insn (gen_unaligned_loadsi (regs[j], mem));
>> }
>> srcoffset += words * UNITS_PER_WORD;
>> }
>> @@ -14395,7 +14398,10 @@ arm_block_move_unaligned_straight (rtx d
>> dstoffset + j * UNITS_PER_WORD - dst_autoinc);
>> mem = adjust_automodify_address (dstbase, SImode, addr,
>> dstoffset + j * UNITS_PER_WORD);
>> - emit_insn (gen_unaligned_storesi (mem, regs[j]));
>> + if (dst_aligned)
>> + emit_move_insn (mem, regs[j]);
>> + else
>> + emit_insn (gen_unaligned_storesi (mem, regs[j]));
>> }
>> dstoffset += words * UNITS_PER_WORD;
>> }
>>
>>
>> Ok?
>>
>> Can someone spin this through an arm test suite run for me, I was doing this by inspection and cross compile on a system with no arm bits. Bonus points if you can check it in with the test case above marked up as appropriate.
>>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: arm memcpy of aligned data
2015-05-29 10:40 ` Kyrill Tkachov
@ 2015-06-15 14:41 ` Kyrill Tkachov
2015-06-15 15:25 ` Richard Earnshaw
2015-08-16 19:24 ` Mike Stump
0 siblings, 2 replies; 8+ messages in thread
From: Kyrill Tkachov @ 2015-06-15 14:41 UTC (permalink / raw)
To: Mike Stump, gcc-patches
[-- Attachment #1: Type: text/plain, Size: 3840 bytes --]
On 29/05/15 11:15, Kyrill Tkachov wrote:
> On 29/05/15 10:08, Kyrill Tkachov wrote:
>> Hi Mike,
>>
>> On 28/05/15 22:15, Mike Stump wrote:
>>> So, the arm memcpy code of aligned data isn’t as good as it can be.
>>>
>>> void *memcpy(void *dest, const void *src, unsigned int n);
>>>
>>> void foo(char *dst, int i) {
>>> memcpy (dst, &i, sizeof (i));
>>> }
>>>
>>> generates horrible code, but, it we are willing to notice the src or the destination are aligned, we can do much better:
>>>
>>> $ ./cc1 -fschedule-fusion -fdump-tree-all-all -da -march=armv7ve -mcpu=cortex-m4 -fomit-frame-pointer -quiet -O2 /tmp/t.c -o t.s
>>> $ cat t.s
>>> [ … ]
>>> foo:
>>> @ args = 0, pretend = 0, frame = 4
>>> @ frame_needed = 0, uses_anonymous_args = 0
>>> @ link register save eliminated.
>>> sub sp, sp, #4
>>> str r1, [r0] @ unaligned
>>> add sp, sp, #4
>> I think there's something to do with cpu tuning here as well.
> That being said, I do think this is a good idea.
> I'll give it a test.
The patch passes bootstrap and testing ok and I've seen it
improve codegen in a few places in SPEC.
I've added a testcase all marked up.
Mike, I'll commit the attached patch in 24 hours unless somebody objects.
Thanks,
Kyrill
2015-06-15 Mike Stump <mikestump@comcast.net>
* config/arm/arm.c (arm_block_move_unaligned_straight):
Emit normal move instead of unaligned load when source or destination
are appropriately aligned.
2015-06-15 Mike Stump <mikestump@comcast.net>
Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* gcc.target/arm/memcpy-aligned-1.c: New test.
>
> Kyrill
>
>> For the code you've given compiled with -O2 -mcpu=cortex-a53 I get:
>> sub sp, sp, #8
>> mov r2, r0
>> add r3, sp, #8
>> str r1, [r3, #-4]!
>> ldr r0, [r3] @ unaligned
>> str r0, [r2] @ unaligned
>> add sp, sp, #8
>> @ sp needed
>> bx lr
>>
>> whereas for -O2 -mcpu=cortex-a57 I get the much better:
>> sub sp, sp, #8
>> str r1, [r0] @ unaligned
>> add sp, sp, #8
>> @ sp needed
>> bx lr
>>
>> Kyrill
>>
>>
>>> Index: gcc/config/arm/arm.c
>>> ===================================================================
>>> --- gcc/config/arm/arm.c (revision 223842)
>>> +++ gcc/config/arm/arm.c (working copy)
>>> @@ -14376,7 +14376,10 @@ arm_block_move_unaligned_straight (rtx d
>>> srcoffset + j * UNITS_PER_WORD - src_autoinc);
>>> mem = adjust_automodify_address (srcbase, SImode, addr,
>>> srcoffset + j * UNITS_PER_WORD);
>>> - emit_insn (gen_unaligned_loadsi (regs[j], mem));
>>> + if (src_aligned)
>>> + emit_move_insn (regs[j], mem);
>>> + else
>>> + emit_insn (gen_unaligned_loadsi (regs[j], mem));
>>> }
>>> srcoffset += words * UNITS_PER_WORD;
>>> }
>>> @@ -14395,7 +14398,10 @@ arm_block_move_unaligned_straight (rtx d
>>> dstoffset + j * UNITS_PER_WORD - dst_autoinc);
>>> mem = adjust_automodify_address (dstbase, SImode, addr,
>>> dstoffset + j * UNITS_PER_WORD);
>>> - emit_insn (gen_unaligned_storesi (mem, regs[j]));
>>> + if (dst_aligned)
>>> + emit_move_insn (mem, regs[j]);
>>> + else
>>> + emit_insn (gen_unaligned_storesi (mem, regs[j]));
>>> }
>>> dstoffset += words * UNITS_PER_WORD;
>>> }
>>>
>>>
>>> Ok?
>>>
>>> Can someone spin this through an arm test suite run for me, I was doing this by inspection and cross compile on a system with no arm bits. Bonus points if you can check it in with the test case above marked up as appropriate.
>>>
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: arm-memcpy-aligned.patch --]
[-- Type: text/x-patch; name=arm-memcpy-aligned.patch, Size: 1855 bytes --]
commit 77191f4224c8729d014a9150bd9364f95ff704b0
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date: Fri May 29 10:44:21 2015 +0100
[ARM] arm memcpy of aligned data
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 638d659..3a33c26 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -14283,7 +14283,10 @@ arm_block_move_unaligned_straight (rtx dstbase, rtx srcbase,
srcoffset + j * UNITS_PER_WORD - src_autoinc);
mem = adjust_automodify_address (srcbase, SImode, addr,
srcoffset + j * UNITS_PER_WORD);
- emit_insn (gen_unaligned_loadsi (regs[j], mem));
+ if (src_aligned)
+ emit_move_insn (regs[j], mem);
+ else
+ emit_insn (gen_unaligned_loadsi (regs[j], mem));
}
srcoffset += words * UNITS_PER_WORD;
}
@@ -14302,7 +14305,10 @@ arm_block_move_unaligned_straight (rtx dstbase, rtx srcbase,
dstoffset + j * UNITS_PER_WORD - dst_autoinc);
mem = adjust_automodify_address (dstbase, SImode, addr,
dstoffset + j * UNITS_PER_WORD);
- emit_insn (gen_unaligned_storesi (mem, regs[j]));
+ if (dst_aligned)
+ emit_move_insn (mem, regs[j]);
+ else
+ emit_insn (gen_unaligned_storesi (mem, regs[j]));
}
dstoffset += words * UNITS_PER_WORD;
}
diff --git a/gcc/testsuite/gcc.target/arm/memcpy-aligned-1.c b/gcc/testsuite/gcc.target/arm/memcpy-aligned-1.c
new file mode 100644
index 0000000..852b391
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/memcpy-aligned-1.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -save-temps" } */
+
+void *memcpy (void *dest, const void *src, unsigned int n);
+
+void foo (char *dst, int i)
+{
+ memcpy (dst, &i, sizeof (i));
+}
+
+/* { dg-final { scan-assembler-times "str\t" 1 } } */
+/* { dg-final { scan-assembler-not "ldr\t" } } */
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: arm memcpy of aligned data
2015-06-15 14:41 ` Kyrill Tkachov
@ 2015-06-15 15:25 ` Richard Earnshaw
2015-08-16 19:24 ` Mike Stump
1 sibling, 0 replies; 8+ messages in thread
From: Richard Earnshaw @ 2015-06-15 15:25 UTC (permalink / raw)
To: Kyrill Tkachov, Mike Stump, gcc-patches
On 15/06/15 15:30, Kyrill Tkachov wrote:
>
> On 29/05/15 11:15, Kyrill Tkachov wrote:
>> On 29/05/15 10:08, Kyrill Tkachov wrote:
>>> Hi Mike,
>>>
>>> On 28/05/15 22:15, Mike Stump wrote:
>>>> So, the arm memcpy code of aligned data isnât as good as it can be.
>>>>
>>>> void *memcpy(void *dest, const void *src, unsigned int n);
>>>>
>>>> void foo(char *dst, int i) {
>>>> memcpy (dst, &i, sizeof (i));
>>>> }
>>>>
>>>> generates horrible code, but, it we are willing to notice the src or
>>>> the destination are aligned, we can do much better:
>>>>
>>>> $ ./cc1 -fschedule-fusion -fdump-tree-all-all -da -march=armv7ve
>>>> -mcpu=cortex-m4 -fomit-frame-pointer -quiet -O2 /tmp/t.c -o t.s
>>>> $ cat t.s
>>>> [ ⦠]
>>>> foo:
>>>> @ args = 0, pretend = 0, frame = 4
>>>> @ frame_needed = 0, uses_anonymous_args = 0
>>>> @ link register save eliminated.
>>>> sub sp, sp, #4
>>>> str r1, [r0] @ unaligned
>>>> add sp, sp, #4
>>> I think there's something to do with cpu tuning here as well.
>> That being said, I do think this is a good idea.
>> I'll give it a test.
>
> The patch passes bootstrap and testing ok and I've seen it
> improve codegen in a few places in SPEC.
> I've added a testcase all marked up.
>
> Mike, I'll commit the attached patch in 24 hours unless somebody objects.
>
> Thanks,
> Kyrill
>
> 2015-06-15 Mike Stump <mikestump@comcast.net>
>
> * config/arm/arm.c (arm_block_move_unaligned_straight):
> Emit normal move instead of unaligned load when source or destination
> are appropriately aligned.
>
> 2015-06-15 Mike Stump <mikestump@comcast.net>
> Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>
> * gcc.target/arm/memcpy-aligned-1.c: New test.
>
My only question would be whether this should be pushed down into
gen_unaligned_{load|store}si, so that all callers would benefit?
R.
>>
>> Kyrill
>>
>>> For the code you've given compiled with -O2 -mcpu=cortex-a53 I get:
>>> sub sp, sp, #8
>>> mov r2, r0
>>> add r3, sp, #8
>>> str r1, [r3, #-4]!
>>> ldr r0, [r3] @ unaligned
>>> str r0, [r2] @ unaligned
>>> add sp, sp, #8
>>> @ sp needed
>>> bx lr
>>>
>>> whereas for -O2 -mcpu=cortex-a57 I get the much better:
>>> sub sp, sp, #8
>>> str r1, [r0] @ unaligned
>>> add sp, sp, #8
>>> @ sp needed
>>> bx lr
>>>
>>> Kyrill
>>>
>>>
>>>> Index: gcc/config/arm/arm.c
>>>> ===================================================================
>>>> --- gcc/config/arm/arm.c (revision 223842)
>>>> +++ gcc/config/arm/arm.c (working copy)
>>>> @@ -14376,7 +14376,10 @@ arm_block_move_unaligned_straight (rtx d
>>>> srcoffset + j * UNITS_PER_WORD - src_autoinc);
>>>> mem = adjust_automodify_address (srcbase, SImode, addr,
>>>> srcoffset + j * UNITS_PER_WORD);
>>>> - emit_insn (gen_unaligned_loadsi (regs[j], mem));
>>>> + if (src_aligned)
>>>> + emit_move_insn (regs[j], mem);
>>>> + else
>>>> + emit_insn (gen_unaligned_loadsi (regs[j], mem));
>>>> }
>>>> srcoffset += words * UNITS_PER_WORD;
>>>> }
>>>> @@ -14395,7 +14398,10 @@ arm_block_move_unaligned_straight (rtx d
>>>> dstoffset + j * UNITS_PER_WORD - dst_autoinc);
>>>> mem = adjust_automodify_address (dstbase, SImode, addr,
>>>> dstoffset + j * UNITS_PER_WORD);
>>>> - emit_insn (gen_unaligned_storesi (mem, regs[j]));
>>>> + if (dst_aligned)
>>>> + emit_move_insn (mem, regs[j]);
>>>> + else
>>>> + emit_insn (gen_unaligned_storesi (mem, regs[j]));
>>>> }
>>>> dstoffset += words * UNITS_PER_WORD;
>>>> }
>>>>
>>>>
>>>> Ok?
>>>>
>>>> Can someone spin this through an arm test suite run for me, I was
>>>> doing this by inspection and cross compile on a system with no arm
>>>> bits. Bonus points if you can check it in with the test case above
>>>> marked up as appropriate.
>>>>
>
>
> arm-memcpy-aligned.patch
>
>
> commit 77191f4224c8729d014a9150bd9364f95ff704b0
> Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
> Date: Fri May 29 10:44:21 2015 +0100
>
> [ARM] arm memcpy of aligned data
>
> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> index 638d659..3a33c26 100644
> --- a/gcc/config/arm/arm.c
> +++ b/gcc/config/arm/arm.c
> @@ -14283,7 +14283,10 @@ arm_block_move_unaligned_straight (rtx dstbase, rtx srcbase,
> srcoffset + j * UNITS_PER_WORD - src_autoinc);
> mem = adjust_automodify_address (srcbase, SImode, addr,
> srcoffset + j * UNITS_PER_WORD);
> - emit_insn (gen_unaligned_loadsi (regs[j], mem));
> + if (src_aligned)
> + emit_move_insn (regs[j], mem);
> + else
> + emit_insn (gen_unaligned_loadsi (regs[j], mem));
> }
> srcoffset += words * UNITS_PER_WORD;
> }
> @@ -14302,7 +14305,10 @@ arm_block_move_unaligned_straight (rtx dstbase, rtx srcbase,
> dstoffset + j * UNITS_PER_WORD - dst_autoinc);
> mem = adjust_automodify_address (dstbase, SImode, addr,
> dstoffset + j * UNITS_PER_WORD);
> - emit_insn (gen_unaligned_storesi (mem, regs[j]));
> + if (dst_aligned)
> + emit_move_insn (mem, regs[j]);
> + else
> + emit_insn (gen_unaligned_storesi (mem, regs[j]));
> }
> dstoffset += words * UNITS_PER_WORD;
> }
> diff --git a/gcc/testsuite/gcc.target/arm/memcpy-aligned-1.c b/gcc/testsuite/gcc.target/arm/memcpy-aligned-1.c
> new file mode 100644
> index 0000000..852b391
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/memcpy-aligned-1.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -save-temps" } */
> +
> +void *memcpy (void *dest, const void *src, unsigned int n);
> +
> +void foo (char *dst, int i)
> +{
> + memcpy (dst, &i, sizeof (i));
> +}
> +
> +/* { dg-final { scan-assembler-times "str\t" 1 } } */
> +/* { dg-final { scan-assembler-not "ldr\t" } } */
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: arm memcpy of aligned data
2015-06-15 14:41 ` Kyrill Tkachov
2015-06-15 15:25 ` Richard Earnshaw
@ 2015-08-16 19:24 ` Mike Stump
2015-08-17 10:01 ` Kyrill Tkachov
1 sibling, 1 reply; 8+ messages in thread
From: Mike Stump @ 2015-08-16 19:24 UTC (permalink / raw)
To: Kyrill Tkachov; +Cc: gcc-patches
On Jun 15, 2015, at 7:30 AM, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
>
> On 29/05/15 11:15, Kyrill Tkachov wrote:
>> On 29/05/15 10:08, Kyrill Tkachov wrote:
>>> Hi Mike,
>>>
>>> On 28/05/15 22:15, Mike Stump wrote:
>>>> So, the arm memcpy code of aligned data isn’t as good as it can be.
>>>>
>>>> void *memcpy(void *dest, const void *src, unsigned int n);
>>>>
>>>> void foo(char *dst, int i) {
>>>> memcpy (dst, &i, sizeof (i));
>>>> }
>>>>
>>>> generates horrible code, but, it we are willing to notice the src or the destination are aligned, we can do much better:
>>>>
>>>> $ ./cc1 -fschedule-fusion -fdump-tree-all-all -da -march=armv7ve -mcpu=cortex-m4 -fomit-frame-pointer -quiet -O2 /tmp/t.c -o t.s
>>>> $ cat t.s
>>>> [ … ]
>>>> foo:
>>>> @ args = 0, pretend = 0, frame = 4
>>>> @ frame_needed = 0, uses_anonymous_args = 0
>>>> @ link register save eliminated.
>>>> sub sp, sp, #4
>>>> str r1, [r0] @ unaligned
>>>> add sp, sp, #4
>>> I think there's something to do with cpu tuning here as well.
>> That being said, I do think this is a good idea.
>> I'll give it a test.
>
> The patch passes bootstrap and testing ok and I've seen it
> improve codegen in a few places in SPEC.
> I've added a testcase all marked up.
>
> Mike, I'll commit the attached patch in 24 hours unless somebody objects.
Was this ever applied?
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: arm memcpy of aligned data
2015-08-16 19:24 ` Mike Stump
@ 2015-08-17 10:01 ` Kyrill Tkachov
0 siblings, 0 replies; 8+ messages in thread
From: Kyrill Tkachov @ 2015-08-17 10:01 UTC (permalink / raw)
To: Mike Stump; +Cc: gcc-patches
On 16/08/15 20:01, Mike Stump wrote:
> On Jun 15, 2015, at 7:30 AM, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
>> On 29/05/15 11:15, Kyrill Tkachov wrote:
>>> On 29/05/15 10:08, Kyrill Tkachov wrote:
>>>> Hi Mike,
>>>>
>>>> On 28/05/15 22:15, Mike Stump wrote:
>>>>> So, the arm memcpy code of aligned data isnÂ’t as good as it can be.
>>>>>
>>>>> void *memcpy(void *dest, const void *src, unsigned int n);
>>>>>
>>>>> void foo(char *dst, int i) {
>>>>> memcpy (dst, &i, sizeof (i));
>>>>> }
>>>>>
>>>>> generates horrible code, but, it we are willing to notice the src or the destination are aligned, we can do much better:
>>>>>
>>>>> $ ./cc1 -fschedule-fusion -fdump-tree-all-all -da -march=armv7ve -mcpu=cortex-m4 -fomit-frame-pointer -quiet -O2 /tmp/t.c -o t.s
>>>>> $ cat t.s
>>>>> [ Â… ]
>>>>> foo:
>>>>> @ args = 0, pretend = 0, frame = 4
>>>>> @ frame_needed = 0, uses_anonymous_args = 0
>>>>> @ link register save eliminated.
>>>>> sub sp, sp, #4
>>>>> str r1, [r0] @ unaligned
>>>>> add sp, sp, #4
>>>> I think there's something to do with cpu tuning here as well.
>>> That being said, I do think this is a good idea.
>>> I'll give it a test.
>> The patch passes bootstrap and testing ok and I've seen it
>> improve codegen in a few places in SPEC.
>> I've added a testcase all marked up.
>>
>> Mike, I'll commit the attached patch in 24 hours unless somebody objects.
> Was this ever applied?
Sorry, slipped through the cracks.
Committed with r226935.
Thanks,
Kyrill
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2015-08-17 9:41 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-28 21:36 arm memcpy of aligned data Mike Stump
2015-05-29 8:22 ` Oleg Endo
2015-05-29 10:15 ` Kyrill Tkachov
2015-05-29 10:40 ` Kyrill Tkachov
2015-06-15 14:41 ` Kyrill Tkachov
2015-06-15 15:25 ` Richard Earnshaw
2015-08-16 19:24 ` Mike Stump
2015-08-17 10:01 ` Kyrill Tkachov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).