[PATCH v3] ARM: Improve armv7 memcpy performance.

public inbox for libc-ports@sourceware.org

* [PATCH v3] ARM: Improve armv7 memcpy performance.
@ 2013-09-09  9:40 Will Newton
  2013-09-09 13:39 ` Joseph S. Myers
  0 siblings, 1 reply; 6+ messages in thread
From: Will Newton @ 2013-09-09  9:40 UTC
  To: libc-ports; +Cc: patches


Only enter the aligned copy loop with buffers that can be 8-byte
aligned. This improves performance slightly on Cortex-A9 and
Cortex-A15 cores for large copies with buffers that are 4-byte
aligned but not 8-byte aligned.

ports/ChangeLog.arm:

2013-08-30  Will Newton  <will.newton@linaro.org>

	* sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
	on entry to aligned copy loop to improve performance.
---
 ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Changes in v3:
 - Fixed comments

diff --git a/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S b/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
index 3decad6..330bb2d 100644
--- a/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
+++ b/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
@@ -369,8 +369,8 @@ ENTRY(memcpy)
 	cfi_adjust_cfa_offset (FRAME_SIZE)
 	cfi_rel_offset (tmp2, 0)
 	cfi_remember_state
-	and	tmp2, src, #3
-	and	tmp1, dst, #3
+	and	tmp2, src, #7
+	and	tmp1, dst, #7
 	cmp	tmp1, tmp2
 	bne	.Lcpy_notaligned

@@ -381,9 +381,9 @@ ENTRY(memcpy)
 	vmov.f32	s0, s0
 #endif

-	/* SRC and DST have the same mutual 32-bit alignment, but we may
+	/* SRC and DST have the same mutual 64-bit alignment, but we may
 	   still need to pre-copy some bytes to get to natural alignment.
-	   We bring DST into full 64-bit alignment.  */
+	   We bring SRC and DST into full 64-bit alignment.  */
 	lsls	tmp2, dst, #29
 	beq	1f
 	rsbs	tmp2, tmp2, #0
-- 
1.8.1.4
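
A minimal C sketch of the tightened entry check in the diff above, for
readers who do not read ARM assembly. The helper names
(same_mutual_alignment, bytes_to_align8) are illustrative and do not
appear in glibc; the second helper only approximates the intent of the
lsls/rsbs sequence visible in the hunk context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper (not part of the patch): true when SRC and DST
   share the same offset within a (mask + 1)-byte block.  The old check
   used mask 3, the patched check uses mask 7.  */
static int
same_mutual_alignment (uintptr_t src, uintptr_t dst, uintptr_t mask)
{
  assert ((mask & (mask + 1)) == 0);	/* mask must be 2^k - 1 */
  return (src & mask) == (dst & mask);
}

/* Illustrative helper: number of bytes to pre-copy so that ADDR
   reaches the next 8-byte boundary (0 if it is already aligned).  */
static size_t
bytes_to_align8 (uintptr_t addr)
{
  return (size_t) ((0 - addr) & 7);
}
```

With a source address ending in 0x4 and a destination ending in 0x8, the
old mask of 3 reports a match while the new mask of 7 does not, so such
copies now branch to .Lcpy_notaligned instead of entering the aligned
loop.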


* Re: [PATCH v3] ARM: Improve armv7 memcpy performance.
  2013-09-09  9:40 [PATCH v3] ARM: Improve armv7 memcpy performance Will Newton
@ 2013-09-09 13:39 ` Joseph S. Myers
  2013-09-09 16:06   ` Will Newton
  0 siblings, 1 reply; 6+ messages in thread
From: Joseph S. Myers @ 2013-09-09 13:39 UTC
  To: Will Newton; +Cc: libc-ports, patches

On Mon, 9 Sep 2013, Will Newton wrote:

> Only enter the aligned copy loop with buffers that can be 8-byte
> aligned. This improves performance slightly on Cortex-A9 and
> Cortex-A15 cores for large copies with buffers that are 4-byte
> aligned but not 8-byte aligned.

Did you conclude that the comment about needing unaligned word access for 
ldrd/strd is still accurate after this patch (and if so, for which uses)?

There was a long discussion on benchmarking starting from this patch.  
Could you summarise the conclusions of that discussion as they relate to 
the appropriate benchmarks to apply to this patch, and give pointers to 
your before-and-after performance results?

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH v3] ARM: Improve armv7 memcpy performance.
  2013-09-09 13:39 ` Joseph S. Myers
@ 2013-09-09 16:06   ` Will Newton
  2013-09-09 17:11     ` Joseph S. Myers
  0 siblings, 1 reply; 6+ messages in thread
From: Will Newton @ 2013-09-09 16:06 UTC
  To: Joseph S. Myers; +Cc: libc-ports, Patch Tracking

On 9 September 2013 14:39, Joseph S. Myers <joseph@codesourcery.com> wrote:
> On Mon, 9 Sep 2013, Will Newton wrote:
>
>> Only enter the aligned copy loop with buffers that can be 8-byte
>> aligned. This improves performance slightly on Cortex-A9 and
>> Cortex-A15 cores for large copies with buffers that are 4-byte
>> aligned but not 8-byte aligned.
>
> Did you conclude that the comment about needing unaligned word access for
> ldrd/strd is still accurate after this patch (and if so, for which uses)?

No, I overlooked that. I'll submit a new patch.

> There was a long discussion on benchmarking starting from this patch.
> Could you summarise the conclusions of that discussion as they relate to
> the appropriate benchmarks to apply to this patch, and give pointers to
> your before-and-after performance results?

I believe the glibc memcpy benchmark is not capable in its present
form of showing the difference between this version of the code and
the current one:

1. The variety of alignments benchmarked is not adequate
2. The variability of the benchmark results is quite high (more runs
required and page allocation issue)
3. The output of the benchmark contains no measure of variance
4. There is no means of showing graphically the output of the
benchmark (for subtle differences this is necessary IMO)
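
As a rough illustration of point 3, the following C sketch shows one
way a benchmark could report a measure of variance alongside the mean.
It is not taken from the glibc benchmark or from cortex-strings; the
names run_stats and summarize_runs are hypothetical, and the samples are
assumed to be per-run timings collected elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of repeated timing runs.  */
struct run_stats
{
  double mean;
  double variance;	/* unbiased sample variance */
};

/* Compute the mean and sample variance of N timings in a single pass
   using Welford's algorithm, which avoids the cancellation problems of
   the naive sum-of-squares formula.  */
static struct run_stats
summarize_runs (const double *samples, size_t n)
{
  assert (n >= 2);	/* sample variance needs at least two runs */
  double mean = 0.0, m2 = 0.0;
  for (size_t i = 0; i < n; i++)
    {
      double delta = samples[i] - mean;
      mean += delta / (double) (i + 1);
      m2 += delta * (samples[i] - mean);
    }
  struct run_stats s = { mean, m2 / (double) (n - 1) };
  return s;
}
```

Reporting the variance per size and alignment would make it easier to
judge whether a small difference between two memcpy versions is larger
than the run-to-run noise.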

These are all surmountable problems but I would rather not gate
acceptance of this code on a satisfactory resolution of the above
issues. I can provide output from the cortex-strings benchmark
instead though.

-- 
Will Newton
Toolchain Working Group, Linaro


* Re: [PATCH v3] ARM: Improve armv7 memcpy performance.
  2013-09-09 16:06   ` Will Newton
@ 2013-09-09 17:11     ` Joseph S. Myers
  2013-09-09 17:46       ` Ondřej Bílka
  2013-09-09 21:02       ` Carlos O'Donell
  0 siblings, 2 replies; 6+ messages in thread
From: Joseph S. Myers @ 2013-09-09 17:11 UTC
  To: Will Newton; +Cc: libc-ports, Patch Tracking

On Mon, 9 Sep 2013, Will Newton wrote:

> I believe the glibc memcpy benchmark is not capable in its present
> form of showing the difference between this version of the code and
> the current one:
> 
> 1. The variety of alignments benchmarked is not adequate
> 2. The variability of the benchmark results is quite high (more runs
> required and page allocation issue)
> 3. The output of the benchmark contains no measure of variance
> 4. There is no means of showing graphically the output of the
> benchmark (for subtle differences this is necessary IMO)

Please make sure the wiki todo list 
<https://sourceware.org/glibc/wiki/Development_Todo/Master> includes all 
these areas for improvement of the benchmarks.

> These are all surmountable problems but I would rather not gate
> acceptance of this code on a satisfactory resolution of the above
> issues. I can provide output from the cortex-strings benchmark
> instead though.

If your summary of the benchmarking discussion indicates that the existing 
glibc benchmark is not relevant for the cases addressed by the patch, then 
it's indeed appropriate to give such results from another benchmark.

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH v3] ARM: Improve armv7 memcpy performance.
  2013-09-09 17:11     ` Joseph S. Myers
@ 2013-09-09 17:46       ` Ondřej Bílka
  2013-09-09 21:02       ` Carlos O'Donell
  1 sibling, 0 replies; 6+ messages in thread
From: Ondřej Bílka @ 2013-09-09 17:46 UTC
  To: Joseph S. Myers; +Cc: Will Newton, libc-ports, Patch Tracking


On Mon, Sep 09, 2013 at 05:11:36PM +0000, Joseph S. Myers wrote:
> On Mon, 9 Sep 2013, Will Newton wrote:
> 
> > I believe the glibc memcpy benchmark is not capable in its present
> > form of showing the difference between this version of the code and
> > the current one:
> > 
> > 1. The variety of alignments benchmarked is not adequate
> > 2. The variability of the benchmark results is quite high (more runs
> > required and page allocation issue)
> > 3. The output of the benchmark contains no measure of variance
> > 4. There is no means of showing graphically the output of the
> > benchmark (for subtle differences this is necessary IMO)
> 
> Please make sure the wiki todo list 
> <https://sourceware.org/glibc/wiki/Development_Todo/Master> includes all 
> these areas for improvement of the benchmarks.
> 
> > These are all surmountable problems but I would rather not gate
> > acceptance of this code on a satisfactory resolution of the above
> > issues. I can provide output from the cortex-strings benchmark
> > instead though.
> 
> If your summary of the benchmarking discussion indicates that the existing 
> glibc benchmark is not relevant for the cases addressed by the patch, then 
> it's indeed appropriate to give such results from another benchmark.
> 
I would prefer to get profiling results on ARM. I wrote a simple tool
that measures the time, and its variance, that gcc takes to compile
with different memcpy versions.

To use it, you need to compile the old and new memcpy as separate
standalone shared libraries that will be preloaded, like

gcc -fPIC -shared   old_memcpy.S -o old.so
gcc -fPIC -shared   new_memcpy.S -o new.so

and place old.so and new.so in the benchmark directory, then run
./benchmark

It may take a long time until the results become statistically
significant, so it is better to run it overnight.

If you want to check another command, copy and modify the benchmark script.
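
For readers without the attachment, here is a hedged C sketch of the
general idea: run one command with a given library preloaded and
measure its wall-clock time. This is not the attached tool, whose
contents are not shown in this thread; the function name
time_with_preload is hypothetical, and only standard POSIX calls are
used.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: run ARGV with LD_PRELOAD set to PRELOAD_LIB and
   return the elapsed wall-clock time in seconds, or -1.0 if the command
   could not be run or exited with a non-zero status.  */
static double
time_with_preload (const char *preload_lib, char *const argv[])
{
  struct timespec start, end;
  assert (preload_lib != NULL && argv != NULL && argv[0] != NULL);
  if (clock_gettime (CLOCK_MONOTONIC, &start) != 0)
    return -1.0;
  pid_t pid = fork ();
  if (pid < 0)
    return -1.0;
  if (pid == 0)
    {
      /* Child: only this process sees the preloaded memcpy.  */
      if (setenv ("LD_PRELOAD", preload_lib, 1) != 0)
	_exit (127);
      execvp (argv[0], argv);
      _exit (127);	/* exec failed */
    }
  int status;
  if (waitpid (pid, &status, 0) != pid)
    return -1.0;
  if (clock_gettime (CLOCK_MONOTONIC, &end) != 0)
    return -1.0;
  if (!WIFEXITED (status) || WEXITSTATUS (status) != 0)
    return -1.0;
  return (double) (end.tv_sec - start.tv_sec)
	 + (double) (end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling this many times, alternating between ./old.so and ./new.so for
the same command, yields two sets of timings whose means and variances
can then be compared.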

[-- Attachment #2: test_memcpy.tar.bz2 --]
[-- Type: application/octet-stream, Size: 4209 bytes --]


* Re: [PATCH v3] ARM: Improve armv7 memcpy performance.
  2013-09-09 17:11     ` Joseph S. Myers
  2013-09-09 17:46       ` Ondřej Bílka
@ 2013-09-09 21:02       ` Carlos O'Donell
  1 sibling, 0 replies; 6+ messages in thread
From: Carlos O'Donell @ 2013-09-09 21:02 UTC
  To: Joseph S. Myers; +Cc: Will Newton, libc-ports, Patch Tracking

>> These are all surmountable problems but I would rather not gate
>> acceptance of this code on a satisfactory resolution of the above
>> issues. I can provide output from the cortex-strings benchmark
>> instead though.
> 
> If your summary of the benchmarking discussion indicates that the existing 
> glibc benchmark is not relevant for the cases addressed by the patch, then 
> it's indeed appropriate to give such results from another benchmark.

Agreed.

Cheers,
Carlos.
