[PATCH v4] ARM: Improve armv7 memcpy performance.
From: Will Newton
Date: 2013-09-16 08:37 UTC
To: libc-ports; Cc: patches
Only enter the aligned copy loop with buffers that can be 8-byte
aligned. This improves performance slightly on Cortex-A9 and
Cortex-A15 cores for large copies with buffers that are 4-byte
aligned but not 8-byte aligned.
ports/ChangeLog.arm:
2013-08-30 Will Newton <will.newton@linaro.org>
* sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
on entry to aligned copy loop to improve performance.
---
ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
Changes in v4:
- More comment fixes
The output of the cortex-strings benchmark can be found here (where "this" is the new code
and "old" is the previous version):
http://people.linaro.org/~will.newton/glibc_memcpy/
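For readers who do not follow ARM assembly, the guard being changed is easier to see in C. Below is an illustrative sketch only, not code from memcpy_impl.S; memcpy_sketch, copy_aligned and copy_unaligned are hypothetical names standing in for the entry check, the LDRD/STRD loop and the byte-wise fallback. The patch widens the mutual-alignment test from the low two bits to the low three, so the doubleword loop is only entered when a single pre-copy can bring both pointers to 8-byte alignment at once:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the aligned (LDRD/STRD) copy loop and
   the unaligned byte-wise fallback. */
static void copy_aligned(char *dst, const char *src, size_t n)   { memmove(dst, src, n); }
static void copy_unaligned(char *dst, const char *src, size_t n) { memmove(dst, src, n); }

void memcpy_sketch(char *dst, const char *src, size_t n)
{
    uintptr_t s = (uintptr_t) src;
    uintptr_t d = (uintptr_t) dst;

    /* Old check: (s & 3) == (d & 3), i.e. same mutual 32-bit alignment.
       New check: the low three bits must match, so copying (-d & 7)
       bytes makes both pointers 8-byte aligned at the same time.  */
    if ((s & 7) == (d & 7)) {
        size_t pre = (size_t) (-d & 7);   /* bytes to reach 8-byte alignment */
        if (pre > n)
            pre = n;
        copy_unaligned(dst, src, pre);    /* short byte-wise pre-copy */
        copy_aligned(dst + pre, src + pre, n - pre);
    } else {
        copy_unaligned(dst, src, n);
    }
}

For example, src = 0x1004 with dst = 0x2008 passes the old test (both have the low two bits clear) but fails the new one: previously that pair entered the LDRD/STRD loop with SRC still only 4-byte aligned after the pre-copy, whereas it now takes the unaligned path.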
diff --git a/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S b/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
index 3decad6..ad43a3d 100644
--- a/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
+++ b/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
@@ -24,7 +24,6 @@
ARMv6 (ARMv7-a if using Neon)
ARM state
Unaligned accesses
- LDRD/STRD support unaligned word accesses
*/
@@ -369,8 +368,8 @@ ENTRY(memcpy)
cfi_adjust_cfa_offset (FRAME_SIZE)
cfi_rel_offset (tmp2, 0)
cfi_remember_state
- and tmp2, src, #3
- and tmp1, dst, #3
+ and tmp2, src, #7
+ and tmp1, dst, #7
cmp tmp1, tmp2
bne .Lcpy_notaligned
@@ -381,9 +380,9 @@ ENTRY(memcpy)
vmov.f32 s0, s0
#endif
- /* SRC and DST have the same mutual 32-bit alignment, but we may
+ /* SRC and DST have the same mutual 64-bit alignment, but we may
still need to pre-copy some bytes to get to natural alignment.
- We bring DST into full 64-bit alignment. */
+ We bring SRC and DST into full 64-bit alignment. */
lsls tmp2, dst, #29
beq 1f
rsbs tmp2, tmp2, #0
@@ -515,7 +514,7 @@ ENTRY(memcpy)
.Ltail63aligned: /* Count in tmp2. */
/* Copy up to 7 d-words of data. Similar to Ltail63unaligned, but
- we know that the src and dest are 32-bit aligned so we can use
+ we know that the src and dest are 64-bit aligned so we can use
LDRD/STRD to improve efficiency. */
/* TMP2 is now negative, but we don't care about that. The bottom
six bits still tell us how many bytes are left to copy. */
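The .Ltail63aligned comment change records the same invariant: once the aligned path is entered, the pre-copy has brought both pointers to full 64-bit alignment, so the LDRD/STRD tail never issues unaligned doubleword accesses, which is also why the "LDRD/STRD support unaligned word accesses" requirement is dropped from the header comment. A rough C equivalent of such a tail loop, as a sketch only (tail63_aligned is a hypothetical name and not the assembly's actual structure):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Drain a tail of n < 64 bytes when src and dst are both 8-byte
   aligned: one 8-byte load/store per iteration, mirroring the
   LDRD/STRD pairs, then at most 7 single-byte copies.  */
void tail63_aligned(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n >= 8) {
        uint64_t v;
        memcpy(&v, src, sizeof v);   /* one aligned 8-byte load  (LDRD) */
        memcpy(dst, &v, sizeof v);   /* one aligned 8-byte store (STRD) */
        src += 8;
        dst += 8;
        n -= 8;
    }
    while (n-- > 0)
        *dst++ = *src++;
}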
--
1.8.1.4
Re: [PATCH v4] ARM: Improve armv7 memcpy performance.
From: Joseph S. Myers
Date: 2013-09-16 15:25 UTC
To: Will Newton; Cc: libc-ports, patches
On Mon, 16 Sep 2013, Will Newton wrote:
> Only enter the aligned copy loop with buffers that can be 8-byte
> aligned. This improves performance slightly on Cortex-A9 and
> Cortex-A15 cores for large copies with buffers that are 4-byte
> aligned but not 8-byte aligned.
>
> ports/ChangeLog.arm:
>
> 2013-08-30 Will Newton <will.newton@linaro.org>
>
> * sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
> on entry to aligned copy loop to improve performance.
OK.
--
Joseph S. Myers
joseph@codesourcery.com