* [PATCH 1/2] aarch64,falkor: Ignore prefetcher hints for memmove tail
2018-05-03 17:52 [PATCH 0/2] aarch64,falkor: memcpy/memmove performance improvements Siddhesh Poyarekar
2018-05-03 17:52 ` [PATCH 2/2] Ignore prefetcher tagging for smaller copies Siddhesh Poyarekar
@ 2018-05-03 17:52 ` Siddhesh Poyarekar
2018-05-10 10:29 ` Szabolcs Nagy
2018-05-10 2:59 ` [PING][PATCH 0/2] aarch64,falkor: memcpy/memmove performance improvements Siddhesh Poyarekar
2 siblings, 1 reply; 6+ messages in thread
From: Siddhesh Poyarekar @ 2018-05-03 17:52 UTC (permalink / raw)
To: libc-alpha
The tails of the copy loops are unable to train the falkor hardware
prefetcher because they load from a different base register than the
hot loop.  In this case, avoid serializing the loads by putting them
into different registers.  Also peel the last iteration of the loop
into the tail (and have it use different registers), since that gives
better performance for medium sizes.
This results in performance improvements of between 3% and 20% over
the current falkor implementation for sizes between 128 bytes and 1K
on the memmove-walk benchmark, thus mostly covering the regressions
seen against the generic memmove.
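The overlap trick the new L(last64) code relies on can be sketched in C
(a hypothetical illustration, not the glibc code): since the remainder
after the loop is between 33 and 64 bytes, copying a fixed 64 bytes back
from the end is always safe, and loading each 16-byte chunk into its own
pair of temporaries mirrors the distinct register pairs (C, B, A, D)
used in the patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical C sketch of the peeled tail (not the glibc code): the
   remainder after the main loop is at most 64 and at least 33 bytes, so
   it is always safe to copy a fixed 64 bytes back from the end, even
   though the stores may overlap bytes the loop already wrote.  Each
   16-byte chunk gets its own pair of temporaries, so the loads do not
   serialize through a single destination register.  */
static void copy_tail64 (unsigned char *dstend, const unsigned char *srcend)
{
  uint64_t c0, c1, b0, b1, a0, a1, d0, d1;
  /* Issue all loads first, into independent temporaries.  */
  memcpy (&c0, srcend - 64, 8); memcpy (&c1, srcend - 56, 8);
  memcpy (&b0, srcend - 48, 8); memcpy (&b1, srcend - 40, 8);
  memcpy (&a0, srcend - 32, 8); memcpy (&a1, srcend - 24, 8);
  memcpy (&d0, srcend - 16, 8); memcpy (&d1, srcend - 8, 8);
  /* Then all stores; overlap with previously written bytes is harmless.  */
  memcpy (dstend - 64, &c0, 8); memcpy (dstend - 56, &c1, 8);
  memcpy (dstend - 48, &b0, 8); memcpy (dstend - 40, &b1, 8);
  memcpy (dstend - 32, &a0, 8); memcpy (dstend - 24, &a1, 8);
  memcpy (dstend - 16, &d0, 8); memcpy (dstend - 8, &d1, 8);
}
```

Because every load completes before any store, the sketch is also safe
when the final 64-byte source and destination windows overlap, as they
may in memmove.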
* sysdeps/aarch64/multiarch/memmove_falkor.S
(__memmove_falkor): Use multiple registers to move data in
loop tail.
---
sysdeps/aarch64/multiarch/memmove_falkor.S | 48 ++++++++++++++++++------------
1 file changed, 29 insertions(+), 19 deletions(-)
diff --git a/sysdeps/aarch64/multiarch/memmove_falkor.S b/sysdeps/aarch64/multiarch/memmove_falkor.S
index 3375adf2de..c0d9560301 100644
--- a/sysdeps/aarch64/multiarch/memmove_falkor.S
+++ b/sysdeps/aarch64/multiarch/memmove_falkor.S
@@ -150,7 +150,6 @@ L(copy96):
.p2align 4
L(copy_long):
- sub count, count, 64 + 16 /* Test and readjust count. */
mov B_l, Q_l
mov B_h, Q_h
ldp A_l, A_h, [src]
@@ -161,6 +160,8 @@ L(copy_long):
ldp Q_l, Q_h, [src, 16]!
stp A_l, A_h, [dstin]
ldp A_l, A_h, [src, 16]!
+ subs count, count, 32 + 64 + 16 /* Test and readjust count. */
+ b.ls L(last64)
L(loop64):
subs count, count, 32
@@ -170,18 +171,22 @@ L(loop64):
ldp A_l, A_h, [src, 16]!
b.hi L(loop64)
- /* Write the last full set of 32 bytes. The remainder is at most 32
- bytes, so it is safe to always copy 32 bytes from the end even if
- there is just 1 byte left. */
+ /* Write the last full set of 64 bytes. The remainder is at most 64
+ bytes and at least 33 bytes, so it is safe to always copy 64 bytes
+ from the end. */
L(last64):
- ldp C_l, C_h, [srcend, -32]
+ ldp C_l, C_h, [srcend, -64]
stp Q_l, Q_h, [dst, 16]
- ldp Q_l, Q_h, [srcend, -16]
+ mov Q_l, B_l
+ mov Q_h, B_h
+ ldp B_l, B_h, [srcend, -48]
stp A_l, A_h, [dst, 32]
- stp C_l, C_h, [dstend, -32]
- stp Q_l, Q_h, [dstend, -16]
- mov Q_l, B_l
- mov Q_h, B_h
+ ldp A_l, A_h, [srcend, -32]
+ ldp D_l, D_h, [srcend, -16]
+ stp C_l, C_h, [dstend, -64]
+ stp B_l, B_h, [dstend, -48]
+ stp A_l, A_h, [dstend, -32]
+ stp D_l, D_h, [dstend, -16]
ret
.p2align 4
@@ -204,7 +209,8 @@ L(move_long):
sub count, count, tmp1
ldp A_l, A_h, [srcend, -16]!
sub dstend, dstend, tmp1
- sub count, count, 64
+ subs count, count, 32 + 64
+ b.ls 2f
1:
subs count, count, 32
@@ -214,18 +220,22 @@ L(move_long):
ldp A_l, A_h, [srcend, -16]!
b.hi 1b
- /* Write the last full set of 32 bytes. The remainder is at most 32
- bytes, so it is safe to always copy 32 bytes from the start even if
- there is just 1 byte left. */
+ /* Write the last full set of 64 bytes. The remainder is at most 64
+ bytes and at least 33 bytes, so it is safe to always copy 64 bytes
+ from the start. */
2:
- ldp C_l, C_h, [src, 16]
+ ldp C_l, C_h, [src, 48]
stp Q_l, Q_h, [dstend, -16]
- ldp Q_l, Q_h, [src]
- stp A_l, A_h, [dstend, -32]
- stp C_l, C_h, [dstin, 16]
- stp Q_l, Q_h, [dstin]
mov Q_l, B_l
mov Q_h, B_h
+ ldp B_l, B_h, [src, 32]
+ stp A_l, A_h, [dstend, -32]
+ ldp A_l, A_h, [src, 16]
+ ldp D_l, D_h, [src]
+ stp C_l, C_h, [dstin, 48]
+ stp B_l, B_h, [dstin, 32]
+ stp A_l, A_h, [dstin, 16]
+ stp D_l, D_h, [dstin]
3: ret
END (__memmove_falkor)
--
2.14.3
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 0/2] aarch64,falkor: memcpy/memmove performance improvements
@ 2018-05-03 17:52 Siddhesh Poyarekar
2018-05-03 17:52 ` [PATCH 2/2] Ignore prefetcher tagging for smaller copies Siddhesh Poyarekar
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Siddhesh Poyarekar @ 2018-05-03 17:52 UTC (permalink / raw)
To: libc-alpha
Hi,
Here are a couple of patches to improve performance of the falkor memcpy
and memmove implementations based on testing on the latest hardware.
The theme of the optimization is to avoid trying to train the hardware
prefetcher for smaller sizes and in the loop tail since that just
mis-trains the prefetcher. Instead, use multiple registers to aid
reordering wherever possible.  Testing showed that the regressions at
these sizes compared to the generic memcpy are resolved by these patches.
Siddhesh
Siddhesh Poyarekar (2):
aarch64,falkor: Ignore prefetcher hints for memmove tail
Ignore prefetcher tagging for smaller copies
sysdeps/aarch64/multiarch/memcpy_falkor.S | 68 ++++++++++++++++++------------
sysdeps/aarch64/multiarch/memmove_falkor.S | 48 ++++++++++++---------
2 files changed, 70 insertions(+), 46 deletions(-)
--
2.14.3
* [PATCH 2/2] Ignore prefetcher tagging for smaller copies
2018-05-03 17:52 [PATCH 0/2] aarch64,falkor: memcpy/memmove performance improvements Siddhesh Poyarekar
@ 2018-05-03 17:52 ` Siddhesh Poyarekar
2018-05-10 10:29 ` Szabolcs Nagy
2018-05-03 17:52 ` [PATCH 1/2] aarch64,falkor: Ignore prefetcher hints for memmove tail Siddhesh Poyarekar
2018-05-10 2:59 ` [PING][PATCH 0/2] aarch64,falkor: memcpy/memmove performance improvements Siddhesh Poyarekar
2 siblings, 1 reply; 6+ messages in thread
From: Siddhesh Poyarekar @ 2018-05-03 17:52 UTC (permalink / raw)
To: libc-alpha
For small and medium sized copies, the effect of hardware prefetching
is not as dominant as instruction level parallelism.  Hence it makes
more sense to load data into multiple registers than to try to route
it all through the same prefetch unit.  This is also the case at the
loop exit, where we are unable to latch on to the same prefetch unit
anyway, so it makes more sense to load the data in parallel.
The performance results are a bit mixed with memcpy-random, with
numbers jumping between -1% and +3%, i.e. the numbers don't seem
repeatable.  memcpy-walk sees a 70% improvement (i.e. > 2x) for 128
bytes, and that improvement tapers off as the impact of the tail copy
decreases relative to the loop.
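The load-into-separate-registers idea can be illustrated with the
16..32 byte case in C (a hypothetical sketch, not the actual glibc
code, assuming `memcpy` into local temporaries compiles to plain
loads): all loads are issued into independent temporaries before any
store, the analogue of the patch replacing the reused A_l/A_h pair
with a separate B_l/B_h pair.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the 16..32 byte copy: both 16-byte chunks are
   loaded into separate temporaries before either store, so the two
   load/store pairs can execute in parallel instead of serializing
   through one register pair.  The chunks overlap when count < 32,
   which is harmless for a forward copy.  */
static void copy16_32 (unsigned char *dst, const unsigned char *src,
		       size_t count)
{
  uint64_t a0, a1, b0, b1;
  memcpy (&a0, src, 8);              memcpy (&a1, src + 8, 8);
  memcpy (&b0, src + count - 16, 8); memcpy (&b1, src + count - 8, 8);
  memcpy (dst, &a0, 8);              memcpy (dst + 8, &a1, 8);
  memcpy (dst + count - 16, &b0, 8); memcpy (dst + count - 8, &b1, 8);
}
```

The same first-from-the-start, last-from-the-end pattern covers every
count in 16..32 with exactly four loads and four stores, with no
branches on the exact size.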
* sysdeps/aarch64/multiarch/memcpy_falkor.S (B_l, B_lw, C_l,
D_l, E_l, F_l, G_l, A_h, B_h, C_h, D_h, E_h, F_h, G_h): New
macros.
(__memcpy_falkor): Use multiple registers to copy data in loop
tail.
sysdeps/aarch64/multiarch/memcpy_falkor.S | 68 +++++++++++++++++++------------
1 file changed, 41 insertions(+), 27 deletions(-)
diff --git a/sysdeps/aarch64/multiarch/memcpy_falkor.S b/sysdeps/aarch64/multiarch/memcpy_falkor.S
index 8dd8c1e03a..2fe9937f11 100644
--- a/sysdeps/aarch64/multiarch/memcpy_falkor.S
+++ b/sysdeps/aarch64/multiarch/memcpy_falkor.S
@@ -35,6 +35,20 @@
#define A_hw w7
#define tmp1 x14
+#define B_l x8
+#define B_lw w8
+#define B_h x9
+#define C_l x10
+#define C_h x11
+#define D_l x12
+#define D_h x13
+#define E_l dst
+#define E_h tmp1
+#define F_l src
+#define F_h count
+#define G_l srcend
+#define G_h x15
+
/* Copies are split into 3 main cases:
1. Small copies of up to 32 bytes
@@ -74,21 +88,21 @@ ENTRY_ALIGN (__memcpy_falkor, 6)
/* Medium copies: 33..128 bytes. */
sub tmp1, count, 1
ldp A_l, A_h, [src, 16]
- stp A_l, A_h, [dstin, 16]
+ ldp B_l, B_h, [srcend, -32]
+ ldp C_l, C_h, [srcend, -16]
tbz tmp1, 6, 1f
- ldp A_l, A_h, [src, 32]
- stp A_l, A_h, [dstin, 32]
- ldp A_l, A_h, [src, 48]
- stp A_l, A_h, [dstin, 48]
- ldp A_l, A_h, [srcend, -64]
- stp A_l, A_h, [dstend, -64]
- ldp A_l, A_h, [srcend, -48]
- stp A_l, A_h, [dstend, -48]
+ ldp D_l, D_h, [src, 32]
+ ldp E_l, E_h, [src, 48]
+ stp D_l, D_h, [dstin, 32]
+ stp E_l, E_h, [dstin, 48]
+ ldp F_l, F_h, [srcend, -64]
+ ldp G_l, G_h, [srcend, -48]
+ stp F_l, F_h, [dstend, -64]
+ stp G_l, G_h, [dstend, -48]
1:
- ldp A_l, A_h, [srcend, -32]
- stp A_l, A_h, [dstend, -32]
- ldp A_l, A_h, [srcend, -16]
- stp A_l, A_h, [dstend, -16]
+ stp A_l, A_h, [dstin, 16]
+ stp B_l, B_h, [dstend, -32]
+ stp C_l, C_h, [dstend, -16]
ret
.p2align 4
@@ -98,36 +112,36 @@ L(copy32):
cmp count, 16
b.lo 1f
ldp A_l, A_h, [src]
+ ldp B_l, B_h, [srcend, -16]
stp A_l, A_h, [dstin]
- ldp A_l, A_h, [srcend, -16]
- stp A_l, A_h, [dstend, -16]
+ stp B_l, B_h, [dstend, -16]
ret
.p2align 4
1:
/* 8-15 */
tbz count, 3, 1f
ldr A_l, [src]
+ ldr B_l, [srcend, -8]
str A_l, [dstin]
- ldr A_l, [srcend, -8]
- str A_l, [dstend, -8]
+ str B_l, [dstend, -8]
ret
.p2align 4
1:
/* 4-7 */
tbz count, 2, 1f
ldr A_lw, [src]
+ ldr B_lw, [srcend, -4]
str A_lw, [dstin]
- ldr A_lw, [srcend, -4]
- str A_lw, [dstend, -4]
+ str B_lw, [dstend, -4]
ret
.p2align 4
1:
/* 2-3 */
tbz count, 1, 1f
ldrh A_lw, [src]
+ ldrh B_lw, [srcend, -2]
strh A_lw, [dstin]
- ldrh A_lw, [srcend, -2]
- strh A_lw, [dstend, -2]
+ strh B_lw, [dstend, -2]
ret
.p2align 4
1:
@@ -171,12 +185,12 @@ L(loop64):
L(last64):
ldp A_l, A_h, [srcend, -64]
stnp A_l, A_h, [dstend, -64]
- ldp A_l, A_h, [srcend, -48]
- stnp A_l, A_h, [dstend, -48]
- ldp A_l, A_h, [srcend, -32]
- stnp A_l, A_h, [dstend, -32]
- ldp A_l, A_h, [srcend, -16]
- stnp A_l, A_h, [dstend, -16]
+ ldp B_l, B_h, [srcend, -48]
+ stnp B_l, B_h, [dstend, -48]
+ ldp C_l, C_h, [srcend, -32]
+ stnp C_l, C_h, [dstend, -32]
+ ldp D_l, D_h, [srcend, -16]
+ stnp D_l, D_h, [dstend, -16]
ret
END (__memcpy_falkor)
--
2.14.3
* [PING][PATCH 0/2] aarch64,falkor: memcpy/memmove performance improvements
2018-05-03 17:52 [PATCH 0/2] aarch64,falkor: memcpy/memmove performance improvements Siddhesh Poyarekar
2018-05-03 17:52 ` [PATCH 2/2] Ignore prefetcher tagging for smaller copies Siddhesh Poyarekar
2018-05-03 17:52 ` [PATCH 1/2] aarch64,falkor: Ignore prefetcher hints for memmove tail Siddhesh Poyarekar
@ 2018-05-10 2:59 ` Siddhesh Poyarekar
2 siblings, 0 replies; 6+ messages in thread
From: Siddhesh Poyarekar @ 2018-05-10 2:59 UTC (permalink / raw)
To: libc-alpha, Szabolcs Nagy
Ping!
On 05/03/2018 11:22 PM, Siddhesh Poyarekar wrote:
> Hi,
>
> Here are a couple of patches to improve performance of the falkor memcpy
> and memmove implementations based on testing on the latest hardware.
> The theme of the optimization is to avoid trying to train the hardware
> prefetcher for smaller sizes and in the loop tail since that just
> mis-trains the prefetcher. Instead, use multiple registers to aid
> reordering wherever possible. Testing showed that regressions in these
> sizes compared to generic memcpy are resolved with this patch.
>
> Siddhesh
>
> Siddhesh Poyarekar (2):
> aarch64,falkor: Ignore prefetcher hints for memmove tail
> Ignore prefetcher tagging for smaller copies
>
> sysdeps/aarch64/multiarch/memcpy_falkor.S | 68 ++++++++++++++++++------------
> sysdeps/aarch64/multiarch/memmove_falkor.S | 48 ++++++++++++---------
> 2 files changed, 70 insertions(+), 46 deletions(-)
>
* Re: [PATCH 1/2] aarch64,falkor: Ignore prefetcher hints for memmove tail
2018-05-03 17:52 ` [PATCH 1/2] aarch64,falkor: Ignore prefetcher hints for memmove tail Siddhesh Poyarekar
@ 2018-05-10 10:29 ` Szabolcs Nagy
0 siblings, 0 replies; 6+ messages in thread
From: Szabolcs Nagy @ 2018-05-10 10:29 UTC (permalink / raw)
To: Siddhesh Poyarekar, libc-alpha; +Cc: nd
On 03/05/18 18:52, Siddhesh Poyarekar wrote:
> The tails of the copy loops are unable to train the falkor hardware
> prefetcher because they load from a different base register than the
> hot loop.  In this case, avoid serializing the loads by putting them
> into different registers.  Also peel the last iteration of the loop
> into the tail (and have it use different registers), since that gives
> better performance for medium sizes.
>
> This results in performance improvements of between 3% and 20% over
> the current falkor implementation for sizes between 128 bytes and 1K
> on the memmove-walk benchmark, thus mostly covering the regressions
> seen against the generic memmove.
>
> * sysdeps/aarch64/multiarch/memmove_falkor.S
> (__memmove_falkor): Use multiple registers to move data in
> loop tail.
OK to commit.
* Re: [PATCH 2/2] Ignore prefetcher tagging for smaller copies
2018-05-03 17:52 ` [PATCH 2/2] Ignore prefetcher tagging for smaller copies Siddhesh Poyarekar
@ 2018-05-10 10:29 ` Szabolcs Nagy
0 siblings, 0 replies; 6+ messages in thread
From: Szabolcs Nagy @ 2018-05-10 10:29 UTC (permalink / raw)
To: Siddhesh Poyarekar, libc-alpha; +Cc: nd
On 03/05/18 18:52, Siddhesh Poyarekar wrote:
> For small and medium sized copies, the effect of hardware prefetching
> is not as dominant as instruction level parallelism.  Hence it makes
> more sense to load data into multiple registers than to try to route
> it all through the same prefetch unit.  This is also the case at the
> loop exit, where we are unable to latch on to the same prefetch unit
> anyway, so it makes more sense to load the data in parallel.
>
> The performance results are a bit mixed with memcpy-random, with
> numbers jumping between -1% and +3%, i.e. the numbers don't seem
> repeatable.  memcpy-walk sees a 70% improvement (i.e. > 2x) for 128
> bytes, and that improvement tapers off as the impact of the tail copy
> decreases relative to the loop.
>
> * sysdeps/aarch64/multiarch/memcpy_falkor.S (B_l, B_lw, C_l,
> D_l, E_l, F_l, G_l, A_h, B_h, C_h, D_h, E_h, F_h, G_h): New
> macros.
> (__memcpy_falkor): Use multiple registers to copy data in loop
> tail.
OK to commit.