public inbox for libc-alpha@sourceware.org
* [PATCH 1/2] aarch64,falkor: Ignore prefetcher hints for memmove tail
  2018-05-03 17:52 [PATCH 0/2] aarch64,falkor: memcpy/memmove performance improvements Siddhesh Poyarekar
  2018-05-03 17:52 ` [PATCH 2/2] Ignore prefetcher tagging for smaller copies Siddhesh Poyarekar
@ 2018-05-03 17:52 ` Siddhesh Poyarekar
  2018-05-10 10:29   ` Szabolcs Nagy
  2018-05-10  2:59 ` [PING][PATCH 0/2] aarch64,falkor: memcpy/memmove performance improvements Siddhesh Poyarekar
  2 siblings, 1 reply; 6+ messages in thread
From: Siddhesh Poyarekar @ 2018-05-03 17:52 UTC (permalink / raw)
  To: libc-alpha

The tails of the copy loops are unable to train the falkor hardware
prefetcher because they load from a different base register than the
hot loop.  In this case, avoid serializing the instructions by loading
the data into different registers.  Also peel the last iteration of
the loop into the tail (again using different registers), since that
gives better performance for medium sizes.
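
For illustration, here is a minimal sketch of the two patterns
(hand-written for this description using the register names from
memmove_falkor.S, not taken verbatim from the patch):

	/* Reusing one register pair ties each store to the load
	   that has just overwritten A_l/A_h, chaining the accesses.  */
	ldp	A_l, A_h, [srcend, -32]
	stp	A_l, A_h, [dstend, -32]
	ldp	A_l, A_h, [srcend, -16]
	stp	A_l, A_h, [dstend, -16]

	/* With distinct pairs the false dependency is gone: both
	   loads can issue back to back and the stores can drain
	   independently.  */
	ldp	A_l, A_h, [srcend, -32]
	ldp	D_l, D_h, [srcend, -16]
	stp	A_l, A_h, [dstend, -32]
	stp	D_l, D_h, [dstend, -16]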

This results in performance improvements of between 3% and 20% over
the current falkor implementation for sizes between 128 bytes and 1K
on the memmove-walk benchmark, thus mostly recovering the regressions
seen against the generic memmove.

	* sysdeps/aarch64/multiarch/memmove_falkor.S
	(__memmove_falkor): Use multiple registers to move data in
	loop tail.
---
 sysdeps/aarch64/multiarch/memmove_falkor.S | 48 ++++++++++++++++++------------
 1 file changed, 29 insertions(+), 19 deletions(-)

diff --git a/sysdeps/aarch64/multiarch/memmove_falkor.S b/sysdeps/aarch64/multiarch/memmove_falkor.S
index 3375adf2de..c0d9560301 100644
--- a/sysdeps/aarch64/multiarch/memmove_falkor.S
+++ b/sysdeps/aarch64/multiarch/memmove_falkor.S
@@ -150,7 +150,6 @@ L(copy96):
 
 	.p2align 4
 L(copy_long):
-	sub	count, count, 64 + 16	/* Test and readjust count.  */
 	mov	B_l, Q_l
 	mov	B_h, Q_h
 	ldp	A_l, A_h, [src]
@@ -161,6 +160,8 @@ L(copy_long):
 	ldp	Q_l, Q_h, [src, 16]!
 	stp	A_l, A_h, [dstin]
 	ldp	A_l, A_h, [src, 16]!
+	subs	count, count, 32 + 64 + 16	/* Test and readjust count.  */
+	b.ls	L(last64)
 
 L(loop64):
 	subs	count, count, 32
@@ -170,18 +171,22 @@ L(loop64):
 	ldp	A_l, A_h, [src, 16]!
 	b.hi	L(loop64)
 
-	/* Write the last full set of 32 bytes.  The remainder is at most 32
-	   bytes, so it is safe to always copy 32 bytes from the end even if
-	   there is just 1 byte left.  */
+	/* Write the last full set of 64 bytes.  The remainder is at most 64
+	   bytes and at least 33 bytes, so it is safe to always copy 64 bytes
+	   from the end.  */
 L(last64):
-	ldp	C_l, C_h, [srcend, -32]
+	ldp	C_l, C_h, [srcend, -64]
 	stp	Q_l, Q_h, [dst, 16]
-	ldp	Q_l, Q_h, [srcend, -16]
+	mov	Q_l, B_l
+	mov	Q_h, B_h
+	ldp	B_l, B_h, [srcend, -48]
 	stp	A_l, A_h, [dst, 32]
-	stp	C_l, C_h, [dstend, -32]
-	stp	Q_l, Q_h, [dstend, -16]
-	mov	Q_l, B_l
-	mov	Q_h, B_h
+	ldp	A_l, A_h, [srcend, -32]
+	ldp	D_l, D_h, [srcend, -16]
+	stp	C_l, C_h, [dstend, -64]
+	stp	B_l, B_h, [dstend, -48]
+	stp	A_l, A_h, [dstend, -32]
+	stp	D_l, D_h, [dstend, -16]
 	ret
 
 	.p2align 4
@@ -204,7 +209,8 @@ L(move_long):
 	sub	count, count, tmp1
 	ldp	A_l, A_h, [srcend, -16]!
 	sub	dstend, dstend, tmp1
-	sub	count, count, 64
+	subs	count, count, 32 + 64
+	b.ls	2f
 
 1:
 	subs	count, count, 32
@@ -214,18 +220,22 @@ L(move_long):
 	ldp	A_l, A_h, [srcend, -16]!
 	b.hi	1b
 
-	/* Write the last full set of 32 bytes.  The remainder is at most 32
-	   bytes, so it is safe to always copy 32 bytes from the start even if
-	   there is just 1 byte left.  */
+	/* Write the last full set of 64 bytes.  The remainder is at most 64
+	   bytes and at least 33 bytes, so it is safe to always copy 64 bytes
+	   from the start.  */
 2:
-	ldp	C_l, C_h, [src, 16]
+	ldp	C_l, C_h, [src, 48]
 	stp	Q_l, Q_h, [dstend, -16]
-	ldp	Q_l, Q_h, [src]
-	stp	A_l, A_h, [dstend, -32]
-	stp	C_l, C_h, [dstin, 16]
-	stp	Q_l, Q_h, [dstin]
 	mov	Q_l, B_l
 	mov	Q_h, B_h
+	ldp	B_l, B_h, [src, 32]
+	stp	A_l, A_h, [dstend, -32]
+	ldp	A_l, A_h, [src, 16]
+	ldp	D_l, D_h, [src]
+	stp	C_l, C_h, [dstin, 48]
+	stp	B_l, B_h, [dstin, 32]
+	stp	A_l, A_h, [dstin, 16]
+	stp	D_l, D_h, [dstin]
 3:	ret
 
 END (__memmove_falkor)
-- 
2.14.3


* [PATCH 0/2] aarch64,falkor: memcpy/memmove performance improvements
@ 2018-05-03 17:52 Siddhesh Poyarekar
  2018-05-03 17:52 ` [PATCH 2/2] Ignore prefetcher tagging for smaller copies Siddhesh Poyarekar
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Siddhesh Poyarekar @ 2018-05-03 17:52 UTC (permalink / raw)
  To: libc-alpha

Hi,

Here are a couple of patches that improve the performance of the falkor
memcpy and memmove implementations, based on testing on the latest
hardware.  The theme of the optimization is to avoid trying to train the
hardware prefetcher for smaller sizes and in the loop tail, since that
only mis-trains the prefetcher.  Instead, use multiple registers to aid
instruction reordering wherever possible.  Testing showed that the
regressions at these sizes relative to the generic memcpy are resolved
by these patches.

Siddhesh

Siddhesh Poyarekar (2):
  aarch64,falkor: Ignore prefetcher hints for memmove tail
  Ignore prefetcher tagging for smaller copies

 sysdeps/aarch64/multiarch/memcpy_falkor.S  | 68 ++++++++++++++++++------------
 sysdeps/aarch64/multiarch/memmove_falkor.S | 48 ++++++++++++---------
 2 files changed, 70 insertions(+), 46 deletions(-)

-- 
2.14.3


* [PATCH 2/2] Ignore prefetcher tagging for smaller copies
  2018-05-03 17:52 [PATCH 0/2] aarch64,falkor: memcpy/memmove performance improvements Siddhesh Poyarekar
@ 2018-05-03 17:52 ` Siddhesh Poyarekar
  2018-05-10 10:29   ` Szabolcs Nagy
  2018-05-03 17:52 ` [PATCH 1/2] aarch64,falkor: Ignore prefetcher hints for memmove tail Siddhesh Poyarekar
  2018-05-10  2:59 ` [PING][PATCH 0/2] aarch64,falkor: memcpy/memmove performance improvements Siddhesh Poyarekar
  2 siblings, 1 reply; 6+ messages in thread
From: Siddhesh Poyarekar @ 2018-05-03 17:52 UTC (permalink / raw)
  To: libc-alpha

For small and medium sized copies, the effect of hardware prefetching
is not as dominant as instruction level parallelism.  Hence it makes
more sense to load data into multiple registers than to try to route
it to the same prefetch unit.  This is also the case at the loop exit,
where we are unable to latch on to the same prefetch unit anyway, so
it makes more sense to have the data loaded in parallel.
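
As a rough sketch, the 16-31 byte case from the diff below shows the
pattern (register macros as defined in memcpy_falkor.S):

	/* Before: A_l/A_h are reused, so the accesses form a chain
	   of load/store/load/store.  */
	ldp	A_l, A_h, [src]
	stp	A_l, A_h, [dstin]
	ldp	A_l, A_h, [srcend, -16]
	stp	A_l, A_h, [dstend, -16]

	/* After: independent registers let the two loads and the
	   two stores each proceed in parallel.  */
	ldp	A_l, A_h, [src]
	ldp	B_l, B_h, [srcend, -16]
	stp	A_l, A_h, [dstin]
	stp	B_l, B_h, [dstend, -16]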

The performance results are a bit mixed with memcpy-random, with
numbers jumping between -1% and +3%, i.e. the numbers don't seem
repeatable.  memcpy-walk sees a 70% improvement (i.e. > 2x) for 128
bytes, and that improvement tapers off as the impact of the tail copy
diminishes relative to the loop.

	* sysdeps/aarch64/multiarch/memcpy_falkor.S (B_l, B_lw, B_h,
	C_l, C_h, D_l, D_h, E_l, E_h, F_l, F_h, G_l, G_h): New
	macros.
	(__memcpy_falkor): Use multiple registers to copy data in loop
	tail.
---
 sysdeps/aarch64/multiarch/memcpy_falkor.S | 68 +++++++++++++++++++------------
 1 file changed, 41 insertions(+), 27 deletions(-)

diff --git a/sysdeps/aarch64/multiarch/memcpy_falkor.S b/sysdeps/aarch64/multiarch/memcpy_falkor.S
index 8dd8c1e03a..2fe9937f11 100644
--- a/sysdeps/aarch64/multiarch/memcpy_falkor.S
+++ b/sysdeps/aarch64/multiarch/memcpy_falkor.S
@@ -35,6 +35,20 @@
 #define A_hw	w7
 #define tmp1	x14
 
+#define B_l	x8
+#define B_lw	w8
+#define B_h	x9
+#define C_l	x10
+#define C_h	x11
+#define D_l	x12
+#define D_h	x13
+#define E_l	dst
+#define E_h	tmp1
+#define F_l	src
+#define F_h	count
+#define G_l	srcend
+#define G_h	x15
+
 /* Copies are split into 3 main cases:
 
    1. Small copies of up to 32 bytes
@@ -74,21 +88,21 @@ ENTRY_ALIGN (__memcpy_falkor, 6)
 	/* Medium copies: 33..128 bytes.  */
 	sub	tmp1, count, 1
 	ldp	A_l, A_h, [src, 16]
-	stp	A_l, A_h, [dstin, 16]
+	ldp	B_l, B_h, [srcend, -32]
+	ldp	C_l, C_h, [srcend, -16]
 	tbz	tmp1, 6, 1f
-	ldp	A_l, A_h, [src, 32]
-	stp	A_l, A_h, [dstin, 32]
-	ldp	A_l, A_h, [src, 48]
-	stp	A_l, A_h, [dstin, 48]
-	ldp	A_l, A_h, [srcend, -64]
-	stp	A_l, A_h, [dstend, -64]
-	ldp	A_l, A_h, [srcend, -48]
-	stp	A_l, A_h, [dstend, -48]
+	ldp	D_l, D_h, [src, 32]
+	ldp	E_l, E_h, [src, 48]
+	stp	D_l, D_h, [dstin, 32]
+	stp	E_l, E_h, [dstin, 48]
+	ldp	F_l, F_h, [srcend, -64]
+	ldp	G_l, G_h, [srcend, -48]
+	stp	F_l, F_h, [dstend, -64]
+	stp	G_l, G_h, [dstend, -48]
 1:
-	ldp	A_l, A_h, [srcend, -32]
-	stp	A_l, A_h, [dstend, -32]
-	ldp	A_l, A_h, [srcend, -16]
-	stp	A_l, A_h, [dstend, -16]
+	stp	A_l, A_h, [dstin, 16]
+	stp	B_l, B_h, [dstend, -32]
+	stp	C_l, C_h, [dstend, -16]
 	ret
 
 	.p2align 4
@@ -98,36 +112,36 @@ L(copy32):
 	cmp	count, 16
 	b.lo	1f
 	ldp	A_l, A_h, [src]
+	ldp	B_l, B_h, [srcend, -16]
 	stp	A_l, A_h, [dstin]
-	ldp	A_l, A_h, [srcend, -16]
-	stp	A_l, A_h, [dstend, -16]
+	stp	B_l, B_h, [dstend, -16]
 	ret
 	.p2align 4
 1:
 	/* 8-15 */
 	tbz	count, 3, 1f
 	ldr	A_l, [src]
+	ldr	B_l, [srcend, -8]
 	str	A_l, [dstin]
-	ldr	A_l, [srcend, -8]
-	str	A_l, [dstend, -8]
+	str	B_l, [dstend, -8]
 	ret
 	.p2align 4
 1:
 	/* 4-7 */
 	tbz	count, 2, 1f
 	ldr	A_lw, [src]
+	ldr	B_lw, [srcend, -4]
 	str	A_lw, [dstin]
-	ldr	A_lw, [srcend, -4]
-	str	A_lw, [dstend, -4]
+	str	B_lw, [dstend, -4]
 	ret
 	.p2align 4
 1:
 	/* 2-3 */
 	tbz	count, 1, 1f
 	ldrh	A_lw, [src]
+	ldrh	B_lw, [srcend, -2]
 	strh	A_lw, [dstin]
-	ldrh	A_lw, [srcend, -2]
-	strh	A_lw, [dstend, -2]
+	strh	B_lw, [dstend, -2]
 	ret
 	.p2align 4
 1:
@@ -171,12 +185,12 @@ L(loop64):
 L(last64):
 	ldp	A_l, A_h, [srcend, -64]
 	stnp	A_l, A_h, [dstend, -64]
-	ldp	A_l, A_h, [srcend, -48]
-	stnp	A_l, A_h, [dstend, -48]
-	ldp	A_l, A_h, [srcend, -32]
-	stnp	A_l, A_h, [dstend, -32]
-	ldp	A_l, A_h, [srcend, -16]
-	stnp	A_l, A_h, [dstend, -16]
+	ldp	B_l, B_h, [srcend, -48]
+	stnp	B_l, B_h, [dstend, -48]
+	ldp	C_l, C_h, [srcend, -32]
+	stnp	C_l, C_h, [dstend, -32]
+	ldp	D_l, D_h, [srcend, -16]
+	stnp	D_l, D_h, [dstend, -16]
 	ret
 
 END (__memcpy_falkor)
-- 
2.14.3


* [PING][PATCH 0/2] aarch64,falkor: memcpy/memmove performance improvements
  2018-05-03 17:52 [PATCH 0/2] aarch64,falkor: memcpy/memmove performance improvements Siddhesh Poyarekar
  2018-05-03 17:52 ` [PATCH 2/2] Ignore prefetcher tagging for smaller copies Siddhesh Poyarekar
  2018-05-03 17:52 ` [PATCH 1/2] aarch64,falkor: Ignore prefetcher hints for memmove tail Siddhesh Poyarekar
@ 2018-05-10  2:59 ` Siddhesh Poyarekar
  2 siblings, 0 replies; 6+ messages in thread
From: Siddhesh Poyarekar @ 2018-05-10  2:59 UTC (permalink / raw)
  To: libc-alpha, Szabolcs Nagy

Ping!

On 05/03/2018 11:22 PM, Siddhesh Poyarekar wrote:
> Hi,
> 
> Here are a couple of patches that improve the performance of the falkor
> memcpy and memmove implementations, based on testing on the latest
> hardware.  The theme of the optimization is to avoid trying to train the
> hardware prefetcher for smaller sizes and in the loop tail, since that
> only mis-trains the prefetcher.  Instead, use multiple registers to aid
> instruction reordering wherever possible.  Testing showed that the
> regressions at these sizes relative to the generic memcpy are resolved
> by these patches.
> 
> Siddhesh
> 
> Siddhesh Poyarekar (2):
>    aarch64,falkor: Ignore prefetcher hints for memmove tail
>    Ignore prefetcher tagging for smaller copies
> 
>   sysdeps/aarch64/multiarch/memcpy_falkor.S  | 68 ++++++++++++++++++------------
>   sysdeps/aarch64/multiarch/memmove_falkor.S | 48 ++++++++++++---------
>   2 files changed, 70 insertions(+), 46 deletions(-)
> 


* Re: [PATCH 1/2] aarch64,falkor: Ignore prefetcher hints for memmove tail
  2018-05-03 17:52 ` [PATCH 1/2] aarch64,falkor: Ignore prefetcher hints for memmove tail Siddhesh Poyarekar
@ 2018-05-10 10:29   ` Szabolcs Nagy
  0 siblings, 0 replies; 6+ messages in thread
From: Szabolcs Nagy @ 2018-05-10 10:29 UTC (permalink / raw)
  To: Siddhesh Poyarekar, libc-alpha; +Cc: nd

On 03/05/18 18:52, Siddhesh Poyarekar wrote:
> The tails of the copy loops are unable to train the falkor hardware
> prefetcher because they load from a different base register than the
> hot loop.  In this case, avoid serializing the instructions by loading
> the data into different registers.  Also peel the last iteration of
> the loop into the tail (again using different registers), since that
> gives better performance for medium sizes.
> 
> This results in performance improvements of between 3% and 20% over
> the current falkor implementation for sizes between 128 bytes and 1K
> on the memmove-walk benchmark, thus mostly recovering the regressions
> seen against the generic memmove.
> 
> 	* sysdeps/aarch64/multiarch/memmove_falkor.S
> 	(__memmove_falkor): Use multiple registers to move data in
> 	loop tail.

OK to commit.


* Re: [PATCH 2/2] Ignore prefetcher tagging for smaller copies
  2018-05-03 17:52 ` [PATCH 2/2] Ignore prefetcher tagging for smaller copies Siddhesh Poyarekar
@ 2018-05-10 10:29   ` Szabolcs Nagy
  0 siblings, 0 replies; 6+ messages in thread
From: Szabolcs Nagy @ 2018-05-10 10:29 UTC (permalink / raw)
  To: Siddhesh Poyarekar, libc-alpha; +Cc: nd

On 03/05/18 18:52, Siddhesh Poyarekar wrote:
> For small and medium sized copies, the effect of hardware prefetching
> is not as dominant as instruction level parallelism.  Hence it makes
> more sense to load data into multiple registers than to try to route
> it to the same prefetch unit.  This is also the case at the loop exit,
> where we are unable to latch on to the same prefetch unit anyway, so
> it makes more sense to have the data loaded in parallel.
> 
> The performance results are a bit mixed with memcpy-random, with
> numbers jumping between -1% and +3%, i.e. the numbers don't seem
> repeatable.  memcpy-walk sees a 70% improvement (i.e. > 2x) for 128
> bytes, and that improvement tapers off as the impact of the tail copy
> diminishes relative to the loop.
> 
> 	* sysdeps/aarch64/multiarch/memcpy_falkor.S (B_l, B_lw, B_h,
> 	C_l, C_h, D_l, D_h, E_l, E_h, F_l, F_h, G_l, G_h): New
> 	macros.
> 	(__memcpy_falkor): Use multiple registers to copy data in loop
> 	tail.

OK to commit.

