* RE: [RFC PATCH] aarch64: improve memset
From: Wilco Dijkstra @ 2014-11-07 16:14 UTC (permalink / raw)
To: 'Richard Henderson'; +Cc: will.newton, marcus.shawcroft, libc-alpha
> Richard Henderson wrote:
> On 11/05/2014 03:35 PM, Will Newton wrote:
> > On 30 September 2014 12:03, Marcus Shawcroft <marcus.shawcroft@gmail.com> wrote:
> >> On 14 June 2014 08:06, Richard Henderson <rth@twiddle.net> wrote:
> >>> The major idea here is to use IFUNC to check the zva line size once, and use
> >>> that to select different entry points. This saves 3 branches during startup,
> >>> and allows significantly more flexibility.
> >>>
> >>> Also, I've cribbed several of the unaligned store ideas that Ondrej has done
> >>> with the x86 versions.
> >>>
> >>> I've done some performance testing using cachebench, which suggests that the
> >>> unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
> >>> bytes and above. The non-zva path appears to be largely unchanged.
> >>
> >>
> >> OK Thanks /Marcus
> >
> > It looks like this patch has slipped through the cracks. Richard, are
> > you happy to apply this or do you think it warrants further
> > discussion?
>
> Sorry for the radio silence.
>
> Just before I went to apply it I thought I spotted a bug that would affect
> ld.so. I haven't had time to make sure one way or another.
I've got a few comments on this patch:
* Do we really need variants for cache line sizes that are never going to be used?
I'd say just support 64 and 128, and default higher sizes to no_zva.
* Why special case line size=64 only? Unrolling might not help for 128 but should not
harm either, and the alignment overhead only increases with larger line sizes, so you
want to bypass the zva code in all cases if N < 3-4x line size.
* Is the no-ifunc variant still required/used? We now have at least 4 different
variants, which all need to be tested and maintained...
* Finally, which version is used when linking statically? I presume there is some
makefile magic that causes the no-zva version to be used, however that might not be
optimal for all targets.
Wilco
* Re: [RFC PATCH] aarch64: improve memset
From: Richard Henderson @ 2014-11-08 10:05 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: will.newton, marcus.shawcroft, libc-alpha
On 11/07/2014 05:14 PM, Wilco Dijkstra wrote:
> * Do we really need variants for cache line sizes that are never going to be used?
> I'd say just support 64 and 128, and default higher sizes to no_zva.
We could.
> * Why special case line size=64 only?
Because that was all I could test.
> * Is the no-ifunc variant still required/used? We now have at least 4 different
> variants, which all need to be tested and maintained...
It's used within ld.so itself, though of course that too could go no_zva.
> * Finally, which version is used when linking statically? I presume there is some
> makefile magic that causes the no-zva version to be used, however that might not be
> optimal for all targets.
One can have ifuncs in statically linked programs. They get resolved at
startup; I forget the exact mechanism.
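As a concrete illustration of that point, here is a minimal C-level ifunc that also links and runs with -static (all names are made up for the example; the IRELATIVE relocation emitted for the ifunc is, as far as I recall, applied by the startup code before main runs):

#include <stddef.h>
#include <stdio.h>

typedef void *(*set_fn) (void *, int, size_t);

/* Trivial stand-in implementation; a real memset ifunc would have
   several assembly entry points to choose from.  */
static void *
set_bytes_basic (void *dst, int c, size_t n)
{
  unsigned char *p = dst;
  while (n--)
    *p++ = (unsigned char) c;
  return dst;
}

/* Runs once at startup: via ld.so for dynamic links and, as far as I
   recall, via the startup relocation code for static links.  */
static set_fn
resolve_set_bytes (void)
{
  /* A real resolver would inspect dczid_el0 or AT_HWCAP here.  */
  return set_bytes_basic;
}

void *set_bytes (void *, int, size_t)
  __attribute__ ((ifunc ("resolve_set_bytes")));

int
main (void)
{
  char buf[32];
  set_bytes (buf, 0, sizeof buf);
  puts ("ifunc resolved and called");
  return 0;
}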
r~
* Re: [RFC PATCH] aarch64: improve memset
From: Richard Henderson @ 2014-11-09 8:19 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: will.newton, marcus.shawcroft, libc-alpha
[-- Attachment #1: Type: text/plain, Size: 1352 bytes --]
On 11/07/2014 05:14 PM, Wilco Dijkstra wrote:
> I've got a few comments on this patch:
>
> * Do we really need variants for cache line sizes that are never going to be used?
> I'd say just support 64 and 128, and default higher sizes to no_zva.
>
> * Why special case line size=64 only? Unrolling might not help for 128 but should not
> harm either, and the alignment overhead only increases with larger line sizes, so you
> want to bypass the zva code in all cases if N < 3-4x line size.
>
> * Is the no-ifunc variant still required/used? We now have at least 4 different
> variants, which all need to be tested and maintained...
>
> * Finally, which version is used when linking statically? I presume there is some
> makefile magic that causes the no-zva version to be used, however that might not be
> optimal for all targets.
Here's a version which only implements zva for 64 and 128-byte line sizes.
It also removes the version that loaded the zva data each time, which had
been used by ld.so and no-ifunc. That was the path I had been concerned
about back in September.
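In C terms, the selection that the new indirect-function resolver makes is roughly the following (a sketch with a made-up helper name, runnable on AArch64; the real resolver is the short assembly sequence at the top of the attached patch):

#include <stdint.h>
#include <stdio.h>

/* dczid_el0: bits [3:0] give log2 of the block size in words, bit 4 is
   the "ZVA prohibited" flag, so (dczid & 31) == 4 means 64-byte lines
   with zva enabled and == 5 means 128-byte lines with zva enabled.  */
static const char *
pick_memset_variant (void)
{
  uint64_t dczid;
  __asm__ ("mrs %0, dczid_el0" : "=r" (dczid));

  switch (dczid & 31)
    {
    case 4:  return "memset_zva_64";
    case 5:  return "memset_zva_128";
    default: return "memset_nozva";   /* disabled or unhandled size */
    }
}

int
main (void)
{
  printf ("this cpu would get %s\n", pick_memset_variant ());
  return 0;
}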
That leaves ld.so using the no-zva path, which is perhaps a tad unfortunate
given that it needs to zero partial .bss pages during startup, and on a
system with 64k pages, we probably wind up with larger clears more often
than not...
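For reference, the clear in question is the partial-page zeroing ld.so does at the end of a segment's file-backed data. A simplified sketch, paraphrased from memory of the segment-mapping code (names simplified, not the actual source):

#include <string.h>
#include <stdint.h>

static void
clear_partial_bss_page (uintptr_t dataend, uintptr_t allocend,
                        size_t pagesize)
{
  uintptr_t zero = dataend;                      /* first .bss byte */
  uintptr_t zeropage = (zero + pagesize - 1) & ~(pagesize - 1);
  uintptr_t zeroend = allocend < zeropage ? allocend : zeropage;

  if (zeroend > zero)
    /* val == 0, so this is exactly the case a zva path would help;
       with 64 KiB pages the clear can approach 64 KiB.  */
    memset ((void *) zero, 0, zeroend - zero);
}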
Thoughts?
r~
[-- Attachment #2: z --]
[-- Type: text/plain, Size: 9576 bytes --]
diff --git a/sysdeps/aarch64/memset.S b/sysdeps/aarch64/memset.S
index 06f04be..9a3d932 100644
--- a/sysdeps/aarch64/memset.S
+++ b/sysdeps/aarch64/memset.S
@@ -20,23 +20,14 @@
*
* ARMv8-a, AArch64
* Unaligned accesses
- *
*/
#include <sysdep.h>
-/* By default we assume that the DC instruction can be used to zero
- data blocks more efficiently. In some circumstances this might be
- unsafe, for example in an asymmetric multiprocessor environment with
- different DC clear lengths (neither the upper nor lower lengths are
- safe to use). The feature can be disabled by defining DONT_USE_DC.
-
- If code may be run in a virtualized environment, then define
- MAYBE_VIRT. This will cause the code to cache the system register
- values rather than re-reading them each call. */
-
#define dstin x0
-#define val w1
+#define dstin_w w0
+#define val x1
+#define valw w1
#define count x2
#define tmp1 x3
#define tmp1w w3
@@ -44,186 +35,186 @@
#define tmp2w w4
#define zva_len_x x5
#define zva_len w5
-#define zva_bits_x x6
-
-#define A_l x7
-#define A_lw w7
+#define zva_mask_x x6
+#define zva_mask w6
#define dst x8
-#define tmp3w w9
-
-ENTRY_ALIGN (__memset, 6)
-
- mov dst, dstin /* Preserve return value. */
- ands A_lw, val, #255
-#ifndef DONT_USE_DC
- b.eq L(zero_mem)
-#endif
- orr A_lw, A_lw, A_lw, lsl #8
- orr A_lw, A_lw, A_lw, lsl #16
- orr A_l, A_l, A_l, lsl #32
-L(tail_maybe_long):
- cmp count, #64
- b.ge L(not_short)
-L(tail_maybe_tiny):
- cmp count, #15
- b.le L(tail15tiny)
-L(tail63):
- ands tmp1, count, #0x30
- b.eq L(tail15)
- add dst, dst, tmp1
- cmp tmp1w, #0x20
- b.eq 1f
- b.lt 2f
- stp A_l, A_l, [dst, #-48]
-1:
- stp A_l, A_l, [dst, #-32]
-2:
- stp A_l, A_l, [dst, #-16]
-
-L(tail15):
- and count, count, #15
- add dst, dst, count
- stp A_l, A_l, [dst, #-16] /* Repeat some/all of last store. */
+#define dst_w w8
+#define dstend x9
+
+ .globl memset
+ cfi_startproc
+
+#if HAVE_IFUNC && !defined (IS_IN_rtld)
+/* Rather than decode dczid_el0 every time, checking for zva disabled and
+ unpacking the line size, do this once in the indirect function and choose
+ an appropriate entry point which encodes these values as constants. */
+
+ .type memset, %gnu_indirect_function
+memset:
+ mrs x1, dczid_el0
+ and x1, x1, #31 /* isolate line size + disable bit */
+
+ cmp x1, #4 /* 64 byte line size, enabled */
+ b.ne 1f
+ adr x0, memset_zva_64
RET
-L(tail15tiny):
- /* Set up to 15 bytes. Does not assume earlier memory
- being set. */
- tbz count, #3, 1f
- str A_l, [dst], #8
-1:
- tbz count, #2, 1f
- str A_lw, [dst], #4
-1:
- tbz count, #1, 1f
- strh A_lw, [dst], #2
-1:
- tbz count, #0, 1f
- strb A_lw, [dst]
-1:
+1: cmp x1, #5 /* 128 byte line size, enabled */
+ b.ne 1f
+ adr x0, memset_zva_128
RET
- /* Critical loop. Start at a new cache line boundary. Assuming
- * 64 bytes per line, this ensures the entire loop is in one line. */
- .p2align 6
-L(not_short):
- neg tmp2, dst
- ands tmp2, tmp2, #15
- b.eq 2f
- /* Bring DST to 128-bit (16-byte) alignment. We know that there's
- * more than that to set, so we simply store 16 bytes and advance by
- * the amount required to reach alignment. */
- sub count, count, tmp2
- stp A_l, A_l, [dst]
- add dst, dst, tmp2
- /* There may be less than 63 bytes to go now. */
- cmp count, #63
- b.le L(tail63)
-2:
- sub dst, dst, #16 /* Pre-bias. */
- sub count, count, #64
-1:
- stp A_l, A_l, [dst, #16]
- stp A_l, A_l, [dst, #32]
- stp A_l, A_l, [dst, #48]
- stp A_l, A_l, [dst, #64]!
- subs count, count, #64
- b.ge 1b
- tst count, #0x3f
- add dst, dst, #16
- b.ne L(tail63)
+1: adr x0, memset_nozva /* Don't use zva at all */
+ RET
+ .size memset, .-memset
+
+.macro do_zva size
+ .balign 64
+ .type memset_zva_\size, %function
+memset_zva_\size:
+ CALL_MCOUNT
+ and valw, valw, #255
+ cmp count, #4*\size
+ ccmp valw, #0, #0, hs /* hs ? cmp val,0 : !z */
+ b.ne L(nz_or_small)
+
+ stp xzr, xzr, [dstin] /* first 16 aligned 1. */
+ and tmp2, dstin, #-16
+ and dst, dstin, #-\size
+
+ stp xzr, xzr, [tmp2, #16] /* first 64 aligned 16. */
+ add dstend, dstin, count
+ add dst, dst, #\size
+
+ stp xzr, xzr, [tmp2, #32]
+ sub count, dstend, dst /* recompute for misalign */
+ add tmp1, dst, #\size
+
+ stp xzr, xzr, [tmp2, #48]
+ sub count, count, #2*\size /* pre-bias */
+
+ stp xzr, xzr, [tmp2, #64]
+
+ /* Store up to first SIZE, aligned 16. */
+.ifgt \size - 64
+ stp xzr, xzr, [tmp2, #80]
+ stp xzr, xzr, [tmp2, #96]
+ stp xzr, xzr, [tmp2, #112]
+ stp xzr, xzr, [tmp2, #128]
+.ifgt \size - 128
+.err
+.endif
+.endif
+
+ .balign 64,,24
+0: dc zva, dst
+ subs count, count, #2*\size
+ dc zva, tmp1
+ add dst, dst, #2*\size
+ add tmp1, tmp1, #2*\size
+ b.hs 0b
+
+ adds count, count, #2*\size /* undo pre-bias */
+ b.ne L(zva_tail)
RET
-#ifndef DONT_USE_DC
- /* For zeroing memory, check to see if we can use the ZVA feature to
- * zero entire 'cache' lines. */
-L(zero_mem):
- mov A_l, #0
- cmp count, #63
- b.le L(tail_maybe_tiny)
- neg tmp2, dst
- ands tmp2, tmp2, #15
- b.eq 1f
- sub count, count, tmp2
- stp A_l, A_l, [dst]
- add dst, dst, tmp2
- cmp count, #63
- b.le L(tail63)
-1:
- /* For zeroing small amounts of memory, it's not worth setting up
- * the line-clear code. */
- cmp count, #128
- b.lt L(not_short)
-#ifdef MAYBE_VIRT
- /* For efficiency when virtualized, we cache the ZVA capability. */
- adrp tmp2, L(cache_clear)
- ldr zva_len, [tmp2, #:lo12:L(cache_clear)]
- tbnz zva_len, #31, L(not_short)
- cbnz zva_len, L(zero_by_line)
- mrs tmp1, dczid_el0
- tbz tmp1, #4, 1f
- /* ZVA not available. Remember this for next time. */
- mov zva_len, #~0
- str zva_len, [tmp2, #:lo12:L(cache_clear)]
- b L(not_short)
-1:
- mov tmp3w, #4
- and zva_len, tmp1w, #15 /* Safety: other bits reserved. */
- lsl zva_len, tmp3w, zva_len
- str zva_len, [tmp2, #:lo12:L(cache_clear)]
+ .size memset_zva_\size, . - memset_zva_\size
+.endm
+
+do_zva 64
+do_zva 128
#else
- mrs tmp1, dczid_el0
- tbnz tmp1, #4, L(not_short)
- mov tmp3w, #4
- and zva_len, tmp1w, #15 /* Safety: other bits reserved. */
- lsl zva_len, tmp3w, zva_len
-#endif
-
-L(zero_by_line):
- /* Compute how far we need to go to become suitably aligned. We're
- * already at quad-word alignment. */
- cmp count, zva_len_x
- b.lt L(not_short) /* Not enough to reach alignment. */
- sub zva_bits_x, zva_len_x, #1
- neg tmp2, dst
- ands tmp2, tmp2, zva_bits_x
- b.eq 1f /* Already aligned. */
- /* Not aligned, check that there's enough to copy after alignment. */
- sub tmp1, count, tmp2
- cmp tmp1, #64
- ccmp tmp1, zva_len_x, #8, ge /* NZCV=0b1000 */
- b.lt L(not_short)
- /* We know that there's at least 64 bytes to zero and that it's safe
- * to overrun by 64 bytes. */
- mov count, tmp1
-2:
- stp A_l, A_l, [dst]
- stp A_l, A_l, [dst, #16]
- stp A_l, A_l, [dst, #32]
- subs tmp2, tmp2, #64
- stp A_l, A_l, [dst, #48]
- add dst, dst, #64
- b.ge 2b
- /* We've overrun a bit, so adjust dst downwards. */
- add dst, dst, tmp2
-1:
- sub count, count, zva_len_x
-3:
- dc zva, dst
- add dst, dst, zva_len_x
- subs count, count, zva_len_x
- b.ge 3b
- ands count, count, zva_bits_x
- b.ne L(tail_maybe_long)
+/* If we don't have ifunc (e.g. ld.so) don't bother with the zva. */
+# define memset_nozva memset
+#endif /* IFUNC */
+
+/* The non-zva path. */
+
+ .balign 64
+ .type memset_nozva, %function
+memset_nozva:
+ CALL_MCOUNT
+ and valw, valw, #255
+L(nz_or_small):
+ orr valw, valw, valw, lsl #8 /* replicate the byte */
+ cmp count, #64
+ orr valw, valw, valw, lsl #16
+ add dstend, dstin, count /* remember end of buffer */
+ orr val, val, val, lsl #32
+ b.hs L(ge_64)
+
+ /* Small data -- original count is less than 64 bytes. */
+L(le_63):
+ cmp count, #16
+ b.lo L(le_15)
+ stp val, val, [dstin]
+ tbz count, #5, L(le_31)
+ stp val, val, [dstin, #16]
+ stp val, val, [dstend, #-32]
+L(le_31):
+ stp val, val, [dstend, #-16]
+ RET
+ .balign 64,,16
+L(le_15):
+ tbz count, #3, L(le_7)
+ str val, [dstin]
+ str val, [dstend, #-8]
+ RET
+ .balign 64,,16
+L(le_7):
+ tbz count, #2, L(le_3)
+ str valw, [dstin]
+ str valw, [dstend, #-4]
+ RET
+ .balign 64,,20
+L(le_3):
+ tbz count, #1, L(le_1)
+ strh valw, [dstend, #-2]
+L(le_1):
+ tbz count, #0, L(le_0)
+ strb valw, [dstin]
+L(le_0):
+ RET
+
+ .balign 64
+L(ge_64):
+ and dst, dstin, #-16 /* align the pointer / pre-bias. */
+ stp val, val, [dstin] /* first 16 align 1 */
+ sub count, dstend, dst /* begin misalign recompute */
+ subs count, count, #16+64 /* finish recompute + pre-bias */
+ b.ls L(loop_tail)
+
+ .balign 64,,24
+L(loop):
+ stp val, val, [dst, #16]
+ stp val, val, [dst, #32]
+ subs count, count, #64
+ stp val, val, [dst, #48]
+ stp val, val, [dst, #64]!
+ b.hs L(loop)
+
+ adds count, count, #64 /* undo pre-bias */
+ b.ne L(loop_tail)
+ RET
+
+ /* Tail of the zva loop. Less than ZVA bytes, but possibly lots
+ more than 64. Note that dst is aligned but unbiased. */
+L(zva_tail):
+ subs count, count, #64 /* pre-bias */
+ sub dst, dst, #16 /* pre-bias */
+ b.hi L(loop)
+
+ /* Tail of the stp loop; less than 64 bytes left (from loop)
+ or less-than-or-equal to 64 bytes left (from ge_64/zva_tail). */
+L(loop_tail):
+ stp val, val, [dstend, #-64]
+ stp val, val, [dstend, #-48]
+ stp val, val, [dstend, #-32]
+ stp val, val, [dstend, #-16]
RET
-#ifdef MAYBE_VIRT
- .bss
- .p2align 2
-L(cache_clear):
- .space 4
-#endif
-#endif /* DONT_USE_DC */
-
-END (__memset)
-weak_alias (__memset, memset)
+
+ .size memset_nozva, . - memset_nozva
+ cfi_endproc
+
+strong_alias (memset, __memset)
libc_hidden_builtin_def (memset)
* RE: [RFC PATCH] aarch64: improve memset
From: Wilco Dijkstra @ 2014-11-10 20:09 UTC (permalink / raw)
To: 'Richard Henderson'; +Cc: will.newton, marcus.shawcroft, libc-alpha
> Richard Henderson wrote:
> On 11/07/2014 05:14 PM, Wilco Dijkstra wrote:
> >
> > * Finally, which version is used when linking statically? I presume there is some
> > makefile magic that causes the no-zva version to be used, however that might not be
> > optimal for all targets.
So it turns out ifuncs are used even with static linking.
> That leaves ld.so using the no-zva path, which is perhaps a tad unfortunate
> given that it needs to zero partial .bss pages during startup, and on a
> system with 64k pages, we probably wind up with larger clears more often
> than not...
I'm not sure how often ld.so calls memset but I'm guessing it is minor compared
to the total time to load.
> Thoughts?
I spotted one issue in the alignment code:
+ stp xzr, xzr, [tmp2, #64]
+
+ /* Store up to first SIZE, aligned 16. */
+.ifgt \size - 64
+ stp xzr, xzr, [tmp2, #80]
+ stp xzr, xzr, [tmp2, #96]
+ stp xzr, xzr, [tmp2, #112]
+ stp xzr, xzr, [tmp2, #128]
+.ifgt \size - 128
+.err
+.endif
+.endif
This should be:
+ /* Store up to first SIZE, aligned 16. */
+.ifgt \size - 64
+ stp xzr, xzr, [tmp2, #64]
+ stp xzr, xzr, [tmp2, #80]
+ stp xzr, xzr, [tmp2, #96]
+ stp xzr, xzr, [tmp2, #112]
+.ifgt \size - 128
+.err
+.endif
+.endif
Other than that it looks good to me.
Wilco
* Re: [RFC PATCH] aarch64: improve memset
From: Richard Henderson @ 2014-11-11 8:13 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: will.newton, marcus.shawcroft, libc-alpha
On 11/10/2014 09:09 PM, Wilco Dijkstra wrote:
> I spotted one issue in the alignment code:
>
> + stp xzr, xzr, [tmp2, #64]
> +
> + /* Store up to first SIZE, aligned 16. */
> +.ifgt \size - 64
> + stp xzr, xzr, [tmp2, #80]
> + stp xzr, xzr, [tmp2, #96]
> + stp xzr, xzr, [tmp2, #112]
> + stp xzr, xzr, [tmp2, #128]
> +.ifgt \size - 128
> +.err
> +.endif
> +.endif
>
> This should be:
>
> + /* Store up to first SIZE, aligned 16. */
> +.ifgt \size - 64
> + stp xzr, xzr, [tmp2, #64]
> + stp xzr, xzr, [tmp2, #80]
> + stp xzr, xzr, [tmp2, #96]
> + stp xzr, xzr, [tmp2, #112]
> +.ifgt \size - 128
> +.err
> +.endif
Incorrect.
tmp2 is backward aligned from dstin, which means that tmp2+0 may be before
dstin. Thus we write the first 16 bytes, unaligned, then write to tmp2+16
through tmp2+N to clear the first N+1 to N+16 bytes.
However, if we stop at tmp2+48 (or tmp2+112) we could be leaving up to 15 bytes
uninitialized.
r~
* RE: [RFC PATCH] aarch64: improve memset
From: Wilco Dijkstra @ 2014-11-11 12:52 UTC (permalink / raw)
To: 'Richard Henderson'; +Cc: will.newton, marcus.shawcroft, libc-alpha
> Richard wrote:
> On 11/10/2014 09:09 PM, Wilco Dijkstra wrote:
> > I spotted one issue in the alignment code:
> >
> > + stp xzr, xzr, [tmp2, #64]
> > +
> > + /* Store up to first SIZE, aligned 16. */
> > +.ifgt \size - 64
> > + stp xzr, xzr, [tmp2, #80]
> > + stp xzr, xzr, [tmp2, #96]
> > + stp xzr, xzr, [tmp2, #112]
> > + stp xzr, xzr, [tmp2, #128]
> > +.ifgt \size - 128
> > +.err
> > +.endif
> > +.endif
> >
> > This should be:
> >
> > + /* Store up to first SIZE, aligned 16. */
> > +.ifgt \size - 64
> > + stp xzr, xzr, [tmp2, #64]
> > + stp xzr, xzr, [tmp2, #80]
> > + stp xzr, xzr, [tmp2, #96]
> > + stp xzr, xzr, [tmp2, #112]
> > +.ifgt \size - 128
> > +.err
> > +.endif
>
> Incorrect.
>
> tmp2 is backward aligned from dstin, which means that tmp2+0 may be before
> dstin. Thus we write the first 16 bytes, unaligned, then write to tmp2+16
> through tmp2+N to clear the first N+1 to N+16 bytes.
>
> However, if we stop at tmp2+48 (or tmp2+112) we could be leaving up to 15 bytes
> uninitialized.
No - in the worst case we need to write 64 bytes. The proof is trivial,
dst = x0 & -64, tmp2 = x0 & -16, so tmp2 = dst + (x0 & 0x30) or tmp2 >= dst.
Since we start doing the dc's at dst + 64, the stp to [tmp2 + 64] is redundant.
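That relation is easy to check exhaustively for the 64-byte case; a throwaway sketch (not part of the patch) that models the prologue stores for every possible misalignment:

#include <string.h>
#include <stdio.h>

int
main (void)
{
  /* '1' marks bytes written before the dc zva loop, which begins at
     (x0 & -64) + 64.  */
  for (unsigned x0 = 64; x0 < 128; x0++)
    {
      unsigned char buf[256] = { 0 };
      unsigned dst  = x0 & ~63u;          /* dst  = x0 & -64 */
      unsigned tmp2 = x0 & ~15u;          /* tmp2 = x0 & -16
                                             == dst + (x0 & 0x30) */
      memset (buf + x0, 1, 16);           /* stp xzr, xzr, [dstin]     */
      memset (buf + tmp2 + 16, 1, 16);    /* stp xzr, xzr, [tmp2, #16] */
      memset (buf + tmp2 + 32, 1, 16);    /* stp xzr, xzr, [tmp2, #32] */
      memset (buf + tmp2 + 48, 1, 16);    /* stp xzr, xzr, [tmp2, #48] */

      for (unsigned i = x0; i < dst + 64; i++)
        if (buf[i] != 1)
          {
            printf ("hole at %u for x0 mod 64 == %u\n", i, x0 & 63);
            return 1;
          }
    }
  puts ("no holes: the [tmp2, #64] store is indeed redundant");
  return 0;
}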
Wilco
* Re: [RFC PATCH] aarch64: improve memset
From: Marcus Shawcroft @ 2014-11-11 13:05 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: Richard Henderson, will.newton, GNU C Library
On 7 November 2014 16:14, Wilco Dijkstra <wdijkstr@arm.com> wrote:
>> Richard Henderson wrote:
> I've got a few comments on this patch:
>
> * Do we really need variants for cache line sizes that are never going to be used?
> I'd say just support 64 and 128, and default higher sizes to no_zva.
We shouldn't be removing support for the other sizes already supported
by the existing implementation. If the other sizes were deprecated
from the architecture then fair game, but that is not the case. From
offline conversation with Wilco I gather part of the motivation to
remove is that the non-64 cases cannot be readily tested on HW.
That particular issue was solved in the original implementation using
a hacked qemu.
Cheers
/Marcus
> * Why special case line size=64 only? Unrolling might not help for 128 but should not
> harm either, and the alignment overhead only increases with larger line sizes, so you
> want to bypass the zva code in all cases if N < 3-4x line size.
>
> * Is the no-ifunc variant still required/used? We now have at least 4 different
> variants, which all need to be tested and maintained...
>
> * Finally, which version is used when linking statically? I presume there is some
> makefile magic that causes the no-zva version to be used, however that might not be
> optimal for all targets.
>
> Wilco
>
>
* RE: [RFC PATCH] aarch64: improve memset
From: Wilco Dijkstra @ 2014-11-11 14:16 UTC (permalink / raw)
To: 'Marcus Shawcroft'; +Cc: Richard Henderson, will.newton, GNU C Library
> Marcus Shawcroft wrote:
> On 7 November 2014 16:14, Wilco Dijkstra <wdijkstr@arm.com> wrote:
> >> Richard Henderson wrote:
>
> > I've got a few comments on this patch:
> >
> > * Do we really need variants for cache line sizes that are never going to be used?
> > I'd say just support 64 and 128, and default higher sizes to no_zva.
>
> We shouldn't be removing support for the other sizes already supported
> by the existing implementation. If the other sizes were deprecated
> from the architecture then fair game, but that is not the case. From
> offline conversation with Wilco I gather part of the motivation to
> remove is that the none 64 cases cannot be readily tested on HW.
> That particular issue was solved in the original implementation using
> a hacked qemu.
The architecture allows dc zva of 4..2048 bytes. Most of these are useless and would
not result in a performance gain. Sizes 4-16 cannot be useful as an stp can write
more data... Larger sizes incur an ever-increasing alignment overhead, and there are
fewer memsets where dc zva could be used.
It would certainly be a good idea to deprecate useless small and overly large sizes,
but I don't see the reasoning for supporting every legal size without evidence of a
performance gain on an actual implementation. It's not like memset will crash on
an implementation with an unsupported size, it just won't use dc.
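To put numbers on that, here is a throwaway sketch (my own illustration, not from the patch or the thread) enumerating the architecturally legal block sizes, with the rough minimum length at which the zva path can pay off under the 4x-line-size entry check the patch uses:

#include <stdio.h>

int
main (void)
{
  /* DCZID_EL0.BS is log2 of the block size in words; the architected
     maximum block size is 2 KiB (BS == 9).  */
  for (unsigned bs = 0; bs <= 9; bs++)
    {
      unsigned line = 4u << bs;
      printf ("BS=%u line=%4u bytes: %s, zva considered only for n >= %u\n",
              bs, line,
              line < 64 ? "too small to beat plain stp stores"
              : line <= 128 ? "handled by the current patch"
              : "falls back to no-zva here",
              4 * line);
    }
  return 0;
}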
Wilco
* Re: [RFC PATCH] aarch64: improve memset
From: Richard Henderson @ 2014-11-11 14:30 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: will.newton, marcus.shawcroft, libc-alpha
On 11/11/2014 01:52 PM, Wilco Dijkstra wrote:
> No - in the worst case we need to write 64 bytes. The proof is trivial,
> dst = x0 & -64, tmp2 = x0 & -16, so tmp2 = dst + (x0 & 0x30) or tmp2 >= dst.
> Since we start doing the dc's at dst + 64, the stp to [tmp2 + 64] is redundant.
Quite right, my mistake.
r~
* Re: [RFC PATCH] aarch64: improve memset
From: Andrew Pinski @ 2014-11-11 17:22 UTC (permalink / raw)
To: Marcus Shawcroft
Cc: Wilco Dijkstra, Richard Henderson, Will Newton, GNU C Library
On Tue, Nov 11, 2014 at 5:05 AM, Marcus Shawcroft
<marcus.shawcroft@gmail.com> wrote:
> On 7 November 2014 16:14, Wilco Dijkstra <wdijkstr@arm.com> wrote:
>>> Richard Henderson wrote:
>
>> I've got a few comments on this patch:
>>
>> * Do we really need variants for cache line sizes that are never going to be used?
>> I'd say just support 64 and 128, and default higher sizes to no_zva.
>
> We shouldn't be removing support for the other sizes already supported
> by the existing implementation. If the other sizes were deprecated
> from the architecture then fair game, but that is not the case. From
> offline conversation with Wilco I gather part of the motivation to
> remove is that the non-64 cases cannot be readily tested on HW.
> That particular issue was solved in the original implementation using
> a hacked qemu.
I will have the ability to test on hardware which uses the 128-byte case
soon. I am already testing using a simulator which sets it to 128 bytes
(though I use it mainly for performance analysis).
Thanks,
Andrew
>
> Cheers
> /Marcus
>
>> * Why special case line size=64 only? Unrolling might not help for 128 but should not
>> harm either, and the alignment overhead only increases with larger line sizes, so you
>> want to bypass the zva code in all cases if N < 3-4x line size.
>>
>> * Is the no-ifunc variant still required/used? We now have at least 4 different
>> variants, which all need to be tested and maintained...
>>
>> * Finally, which version is used when linking statically? I presume there is some
>> makefile magic that causes the no-zva version to be used, however that might not be
>> optimal for all targets.
>>
>> Wilco
>>
>>
* Re: [RFC PATCH] aarch64: improve memset
From: Andrew Pinski @ 2015-02-18 2:41 UTC (permalink / raw)
To: Richard Henderson; +Cc: libc-alpha, Ondřej Bílka, Marcus Shawcroft
On Sat, Jun 14, 2014 at 12:06 AM, Richard Henderson <rth@twiddle.net> wrote:
> The major idea here is to use IFUNC to check the zva line size once, and use
> that to select different entry points. This saves 3 branches during startup,
> and allows significantly more flexibility.
>
> Also, I've cribbed several of the unaligned store ideas that Ondrej has done
> with the x86 versions.
>
> I've done some performance testing using cachebench, which suggests that the
> unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
> bytes and above. The non-zva path appears to be largely unchanged.
>
> I'd like to use some of Ondrej's benchmarks+data, but I couldn't locate them in
> a quick search of the mailing list. Pointers?
>
> Comments?
Yes, I see a performance regression on ThunderX with this patch, and still
with the newer versions, due to the placement of subs in the innermost loop
of the non-zero case. It is around a 20% regression.
Thanks,
Andrew Pinski
>
>
> r~
* Re: [RFC PATCH] aarch64: improve memset
From: Richard Henderson @ 2014-11-06 6:55 UTC (permalink / raw)
To: Will Newton, Marcus Shawcroft; +Cc: libc-alpha
On 11/05/2014 03:35 PM, Will Newton wrote:
> On 30 September 2014 12:03, Marcus Shawcroft <marcus.shawcroft@gmail.com> wrote:
>> On 14 June 2014 08:06, Richard Henderson <rth@twiddle.net> wrote:
>>> The major idea here is to use IFUNC to check the zva line size once, and use
>>> that to select different entry points. This saves 3 branches during startup,
>>> and allows significantly more flexibility.
>>>
>>> Also, I've cribbed several of the unaligned store ideas that Ondrej has done
>>> with the x86 versions.
>>>
>>> I've done some performance testing using cachebench, which suggests that the
>>> unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
>>> bytes and above. The non-zva path appears to be largely unchanged.
>>
>>
>> OK Thanks /Marcus
>
> It looks like this patch has slipped through the cracks. Richard, are
> you happy to apply this or do you think it warrants further
> discussion?
Sorry for the radio silence.
Just before I went to apply it I thought I spotted a bug that would affect
ld.so. I haven't had time to make sure one way or another.
r~
* Re: [RFC PATCH] aarch64: improve memset
From: Will Newton @ 2014-11-05 14:35 UTC (permalink / raw)
To: Marcus Shawcroft; +Cc: Richard Henderson, libc-alpha
On 30 September 2014 12:03, Marcus Shawcroft <marcus.shawcroft@gmail.com> wrote:
> On 14 June 2014 08:06, Richard Henderson <rth@twiddle.net> wrote:
>> The major idea here is to use IFUNC to check the zva line size once, and use
>> that to select different entry points. This saves 3 branches during startup,
>> and allows significantly more flexibility.
>>
>> Also, I've cribbed several of the unaligned store ideas that Ondrej has done
>> with the x86 versions.
>>
>> I've done some performance testing using cachebench, which suggests that the
>> unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
>> bytes and above. The non-zva path appears to be largely unchanged.
>
>
> OK Thanks /Marcus
It looks like this patch has slipped through the cracks. Richard, are
you happy to apply this or do you think it warrants further
discussion?
Thanks,
--
Will Newton
Toolchain Working Group, Linaro
* Re: [RFC PATCH] aarch64: improve memset
From: Marcus Shawcroft @ 2014-09-30 11:03 UTC (permalink / raw)
To: Richard Henderson; +Cc: libc-alpha
On 14 June 2014 08:06, Richard Henderson <rth@twiddle.net> wrote:
> The major idea here is to use IFUNC to check the zva line size once, and use
> that to select different entry points. This saves 3 branches during startup,
> and allows significantly more flexibility.
>
> Also, I've cribbed several of the unaligned store ideas that Ondrej has done
> with the x86 versions.
>
> I've done some performance testing using cachebench, which suggests that the
> unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
> bytes and above. The non-zva path appears to be largely unchanged.
OK Thanks /Marcus
* Re: [RFC PATCH] aarch64: improve memset
From: Will Newton @ 2014-09-18 0:25 UTC (permalink / raw)
To: Carlos O'Donell; +Cc: Richard Henderson, libc-alpha
On 11 September 2014 17:14, Carlos O'Donell <carlos@redhat.com> wrote:
> On 09/11/2014 04:40 PM, Richard Henderson wrote:
>>> https://sourceware.org/ml/libc-alpha/2014-06/msg00376.html
>>
>> Ping.
>
> From my perspective your patch looks fine. Given the lack of
> ARMv8 hardware I can test on, I feel these optimizations are
> a little bit like spinning our wheels. We are moving in the right
> direction though, and that's good.
>
> Out of curiosity did bench/bench-memset show any improvements?
I agree, this looks good to me too, but it would be useful to see what
performance changes were seen and, if possible, which hardware they were
seen on.
--
Will Newton
Toolchain Working Group, Linaro
* Re: [RFC PATCH] aarch64: improve memset
From: Carlos O'Donell @ 2014-09-12 0:15 UTC (permalink / raw)
To: Richard Henderson, libc-alpha
On 09/11/2014 04:40 PM, Richard Henderson wrote:
>> https://sourceware.org/ml/libc-alpha/2014-06/msg00376.html
>
> Ping.
From my perspective your patch looks fine. Given the lack of
ARMv8 hardware I can test on, I feel these optimizations are
a little bit like spinning our wheels. We are moving in the right
direction though, and that's good.
Out of curiosity did bench/bench-memset show any improvements?
Cheers,
Carlos.
* Re: [RFC PATCH] aarch64: improve memset
From: Richard Henderson @ 2014-09-11 20:41 UTC (permalink / raw)
To: libc-alpha
> https://sourceware.org/ml/libc-alpha/2014-06/msg00376.html
Ping.
r~
* Re: [RFC PATCH] aarch64: improve memset
From: Ondřej Bílka @ 2014-06-20 11:05 UTC (permalink / raw)
To: Richard Henderson; +Cc: libc-alpha, Marcus Shawcroft
On Sat, Jun 14, 2014 at 12:06:39AM -0700, Richard Henderson wrote:
> The major idea here is to use IFUNC to check the zva line size once, and use
> that to select different entry points. This saves 3 branches during startup,
> and allows significantly more flexibility.
>
> Also, I've cribbed several of the unaligned store ideas that Ondrej has done
> with the x86 versions.
>
> I've done some performance testing using cachebench, which suggests that the
> unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
> bytes and above. The non-zva path appears to be largely unchanged.
>
> I'd like to use some of Ondrej's benchmarks+data, but I couldn't locate them in
> a quick search of the mailing list. Pointers?
>
> Comments?
>
The benchmark I currently use is below; it simply measures the running time
of a given command with different implementations. You need to generate a .so
with memset for each memset variant, then run ./benchmark and wait. I am not
sure about the performance impact of unrolling, as these sizes tend to be
relatively rare in the apps that I measured.
http://kam.mff.cuni.cz/~ondra/memset_consistency_benchmark.tar.bz2
What I got from that is a bit chaotic; for example on AMD, gcc runs fastest with a simple rep stosq loop,
but other benchmarks say otherwise. It is on my priority list to update
memset based on that.
Then I have a profiler, however it is currently x86-specific and it would
take some work to make it cross-platform. It also has the limitation that it
does not measure the effects of memset on the caches, which could skew the results.
An important part here is the characteristics of the data, which are here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memset_profile/results_gcc/result.html
It shows things like the data being almost always 8-byte aligned, and
similar. The latest source is here.
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile130813.tar.bz2
* [RFC PATCH] aarch64: improve memset
From: Richard Henderson @ 2014-06-14 7:06 UTC (permalink / raw)
To: libc-alpha; +Cc: Ondřej Bílka, Marcus Shawcroft
[-- Attachment #1: Type: text/plain, Size: 664 bytes --]
The major idea here is to use IFUNC to check the zva line size once, and use
that to select different entry points. This saves 3 branches during startup,
and allows significantly more flexibility.
Also, I've cribbed several of the unaligned store ideas that Ondrej has done
with the x86 versions.
I've done some performance testing using cachebench, which suggests that the
unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
bytes and above. The non-zva path appears to be largely unchanged.
I'd like to use some of Ondrej's benchmarks+data, but I couldn't locate them in
a quick search of the mailing list. Pointers?
Comments?
r~
[-- Attachment #2: z --]
[-- Type: text/plain, Size: 11779 bytes --]
diff --git a/sysdeps/aarch64/memset.S b/sysdeps/aarch64/memset.S
index 06f04be..523406d 100644
--- a/sysdeps/aarch64/memset.S
+++ b/sysdeps/aarch64/memset.S
@@ -20,23 +20,14 @@
*
* ARMv8-a, AArch64
* Unaligned accesses
- *
*/
#include <sysdep.h>
-/* By default we assume that the DC instruction can be used to zero
- data blocks more efficiently. In some circumstances this might be
- unsafe, for example in an asymmetric multiprocessor environment with
- different DC clear lengths (neither the upper nor lower lengths are
- safe to use). The feature can be disabled by defining DONT_USE_DC.
-
- If code may be run in a virtualized environment, then define
- MAYBE_VIRT. This will cause the code to cache the system register
- values rather than re-reading them each call. */
-
#define dstin x0
-#define val w1
+#define dstin_w w0
+#define val x1
+#define valw w1
#define count x2
#define tmp1 x3
#define tmp1w w3
@@ -44,186 +35,280 @@
#define tmp2w w4
#define zva_len_x x5
#define zva_len w5
-#define zva_bits_x x6
-
-#define A_l x7
-#define A_lw w7
+#define zva_mask_x x6
+#define zva_mask w6
#define dst x8
-#define tmp3w w9
-
-ENTRY_ALIGN (__memset, 6)
-
- mov dst, dstin /* Preserve return value. */
- ands A_lw, val, #255
-#ifndef DONT_USE_DC
- b.eq L(zero_mem)
-#endif
- orr A_lw, A_lw, A_lw, lsl #8
- orr A_lw, A_lw, A_lw, lsl #16
- orr A_l, A_l, A_l, lsl #32
-L(tail_maybe_long):
- cmp count, #64
- b.ge L(not_short)
-L(tail_maybe_tiny):
- cmp count, #15
- b.le L(tail15tiny)
-L(tail63):
- ands tmp1, count, #0x30
- b.eq L(tail15)
- add dst, dst, tmp1
- cmp tmp1w, #0x20
- b.eq 1f
- b.lt 2f
- stp A_l, A_l, [dst, #-48]
-1:
- stp A_l, A_l, [dst, #-32]
-2:
- stp A_l, A_l, [dst, #-16]
-
-L(tail15):
- and count, count, #15
- add dst, dst, count
- stp A_l, A_l, [dst, #-16] /* Repeat some/all of last store. */
- RET
+#define dst_w w8
+#define dstend x9
+
+ .globl memset
+ cfi_startproc
+
+#if HAVE_IFUNC && !defined (IS_IN_rtld)
+/* Rather than decode dczid_el0 every time, checking for zva disabled and
+ unpacking the line size, do this once in the indirect function and choose
+ an appropriate entry point which encodes these values as constants. */
-L(tail15tiny):
- /* Set up to 15 bytes. Does not assume earlier memory
- being set. */
- tbz count, #3, 1f
- str A_l, [dst], #8
-1:
- tbz count, #2, 1f
- str A_lw, [dst], #4
-1:
- tbz count, #1, 1f
- strh A_lw, [dst], #2
-1:
- tbz count, #0, 1f
- strb A_lw, [dst]
-1:
+ .type memset, %gnu_indirect_function
+memset:
+ mrs x1, dczid_el0
+ adrp x0, 1f
+ tst x1, #16 /* test for zva disabled */
+ and x1, x1, #15
+ add x0, x0, #:lo12:1f
+ csel x1, xzr, x1, ne /* squash index to 0 if so */
+ ldrsw x2, [x0, x1, lsl #2]
+ add x0, x0, x2
RET
+ .size memset, .-memset
+
+ .section .rodata
+1: .long memset_nozva - 1b // 0
+ .long memset_nozva - 1b // 1
+ .long memset_nozva - 1b // 2
+ .long memset_nozva - 1b // 3
+ .long memset_zva_64 - 1b // 4
+ .long memset_zva_128 - 1b // 5
+ .long memset_zva_256 - 1b // 6
+ .long memset_zva_512 - 1b // 7
+ .long memset_zva_1024 - 1b // 8
+ .long memset_zva_2048 - 1b // 9
+ .long memset_zva_4096 - 1b // 10
+ .long memset_zva_8192 - 1b // 11
+ .long memset_zva_16384 - 1b // 12
+ .long memset_zva_32768 - 1b // 13
+ .long memset_zva_65536 - 1b // 14
+ .long memset_zva_131072 - 1b // 15
+ .previous
+
+/* The 64 byte zva size is too small, and needs unrolling for efficiency. */
- /* Critical loop. Start at a new cache line boundary. Assuming
- * 64 bytes per line, this ensures the entire loop is in one line. */
.p2align 6
-L(not_short):
- neg tmp2, dst
- ands tmp2, tmp2, #15
- b.eq 2f
- /* Bring DST to 128-bit (16-byte) alignment. We know that there's
- * more than that to set, so we simply store 16 bytes and advance by
- * the amount required to reach alignment. */
- sub count, count, tmp2
- stp A_l, A_l, [dst]
- add dst, dst, tmp2
- /* There may be less than 63 bytes to go now. */
- cmp count, #63
- b.le L(tail63)
-2:
- sub dst, dst, #16 /* Pre-bias. */
- sub count, count, #64
-1:
- stp A_l, A_l, [dst, #16]
- stp A_l, A_l, [dst, #32]
- stp A_l, A_l, [dst, #48]
- stp A_l, A_l, [dst, #64]!
- subs count, count, #64
- b.ge 1b
- tst count, #0x3f
- add dst, dst, #16
- b.ne L(tail63)
+ .type memset_zva_64, %function
+memset_zva_64:
+ CALL_MCOUNT
+ and valw, valw, #255
+ cmp count, #256
+ ccmp valw, #0, #0, hs /* hs ? cmp val,0 : !z */
+ b.ne L(nz_or_small)
+
+ stp xzr, xzr, [dstin] /* first 16 aligned 1. */
+ and tmp2, dstin, #-16
+ and dst, dstin, #-64
+
+ stp xzr, xzr, [tmp2, #16] /* first 64 aligned 16. */
+ add dstend, dstin, count
+ add dst, dst, #64
+
+ stp xzr, xzr, [tmp2, #32]
+ sub count, dstend, dst /* recompute for misalign */
+ add tmp1, dst, #64
+
+ stp xzr, xzr, [tmp2, #48]
+ sub count, count, #128 /* pre-bias */
+
+ stp xzr, xzr, [tmp2, #64]
+
+ .p2align 6,,24
+0: dc zva, dst
+ subs count, count, #128
+ dc zva, tmp1
+ add dst, dst, #128
+ add tmp1, tmp1, #128
+ b.hs 0b
+
+ adds count, count, #128 /* undo pre-bias */
+ b.ne L(zva_tail)
RET
-#ifndef DONT_USE_DC
- /* For zeroing memory, check to see if we can use the ZVA feature to
- * zero entire 'cache' lines. */
-L(zero_mem):
- mov A_l, #0
- cmp count, #63
- b.le L(tail_maybe_tiny)
- neg tmp2, dst
- ands tmp2, tmp2, #15
- b.eq 1f
- sub count, count, tmp2
- stp A_l, A_l, [dst]
- add dst, dst, tmp2
- cmp count, #63
- b.le L(tail63)
-1:
- /* For zeroing small amounts of memory, it's not worth setting up
- * the line-clear code. */
- cmp count, #128
- b.lt L(not_short)
-#ifdef MAYBE_VIRT
- /* For efficiency when virtualized, we cache the ZVA capability. */
- adrp tmp2, L(cache_clear)
- ldr zva_len, [tmp2, #:lo12:L(cache_clear)]
- tbnz zva_len, #31, L(not_short)
- cbnz zva_len, L(zero_by_line)
- mrs tmp1, dczid_el0
- tbz tmp1, #4, 1f
- /* ZVA not available. Remember this for next time. */
- mov zva_len, #~0
- str zva_len, [tmp2, #:lo12:L(cache_clear)]
- b L(not_short)
-1:
- mov tmp3w, #4
- and zva_len, tmp1w, #15 /* Safety: other bits reserved. */
- lsl zva_len, tmp3w, zva_len
- str zva_len, [tmp2, #:lo12:L(cache_clear)]
+ .size memset_zva_64, . - memset_zva_64
+
+/* For larger zva sizes, a simple loop ought to suffice. */
+/* ??? Needs performance testing, when such hardware becomes available. */
+
+.macro do_zva len
+ .p2align 4
+ .type memset_zva_\len, %function
+memset_zva_\len:
+ CALL_MCOUNT
+ and valw, valw, #255
+ cmp count, #\len
+ ccmp valw, #0, #0, hs /* hs ? cmp val,0 : !z */
+ b.ne L(nz_or_small)
+
+ add dstend, dstin, count
+ mov zva_len, #\len
+ mov zva_mask, #\len-1
+ b memset_zva_n
+
+ .size memset_zva_\len, . - memset_zva_\len
+.endm
+
+ do_zva 128 // 5
+ do_zva 256 // 6
+ do_zva 512 // 7
+ do_zva 1024 // 8
+ do_zva 2048 // 9
+ do_zva 4096 // 10
+ do_zva 8192 // 11
+ do_zva 16384 // 12
+ do_zva 32768 // 13
+ do_zva 65536 // 14
+ do_zva 131072 // 15
+
+ .p2align 6
#else
+/* Without IFUNC, we must load the zva data from the dczid register. */
+
+ .p2align 6
+ .type memset, %function
+memset:
+ and valw, valw, #255
+ cmp count, #256
+ ccmp valw, #0, #0, hs /* hs ? cmp val,0 : !z */
+ b.ne L(nz_or_small)
+
mrs tmp1, dczid_el0
- tbnz tmp1, #4, L(not_short)
- mov tmp3w, #4
- and zva_len, tmp1w, #15 /* Safety: other bits reserved. */
- lsl zva_len, tmp3w, zva_len
-#endif
-
-L(zero_by_line):
- /* Compute how far we need to go to become suitably aligned. We're
- * already at quad-word alignment. */
+ tbnz tmp1, #4, L(nz_or_small)
+
+ and tmp1w, tmp1w, #15
+ mov zva_len, #4
+ add dstend, dstin, count
+ lsl zva_len, zva_len, tmp1w
cmp count, zva_len_x
- b.lt L(not_short) /* Not enough to reach alignment. */
- sub zva_bits_x, zva_len_x, #1
- neg tmp2, dst
- ands tmp2, tmp2, zva_bits_x
- b.eq 1f /* Already aligned. */
- /* Not aligned, check that there's enough to copy after alignment. */
- sub tmp1, count, tmp2
- cmp tmp1, #64
- ccmp tmp1, zva_len_x, #8, ge /* NZCV=0b1000 */
- b.lt L(not_short)
- /* We know that there's at least 64 bytes to zero and that it's safe
- * to overrun by 64 bytes. */
- mov count, tmp1
-2:
- stp A_l, A_l, [dst]
- stp A_l, A_l, [dst, #16]
- stp A_l, A_l, [dst, #32]
- subs tmp2, tmp2, #64
- stp A_l, A_l, [dst, #48]
- add dst, dst, #64
- b.ge 2b
- /* We've overrun a bit, so adjust dst downwards. */
- add dst, dst, tmp2
-1:
- sub count, count, zva_len_x
-3:
- dc zva, dst
- add dst, dst, zva_len_x
+ sub zva_mask, zva_len, #1
+ b.lo L(ge_64)
+
+ /* Fall through into memset_zva_n. */
+ .size memset, . - memset
+#endif /* HAVE_IFUNC */
+
+/* Main part of the zva path. On arrival here, we've already checked for
+ minimum size and that VAL is zero. Also, we've set up zva_len and mask. */
+
+ .type memset_zva_n, %function
+memset_zva_n:
+ stp xzr, xzr, [dstin] /* first 16 aligned 1. */
+ neg tmp1w, dstin_w
+ sub count, count, zva_len_x /* pre-bias */
+ mov dst, dstin
+ ands tmp1w, tmp1w, zva_mask
+ b.ne 3f
+
+ .p2align 6,,16
+2: dc zva, dst
subs count, count, zva_len_x
- b.ge 3b
- ands count, count, zva_bits_x
- b.ne L(tail_maybe_long)
+ add dst, dst, zva_len_x
+ b.hs 2b
+
+ adds count, count, zva_len_x /* undo pre-bias */
+ b.ne L(zva_tail)
RET
-#ifdef MAYBE_VIRT
- .bss
- .p2align 2
-L(cache_clear):
- .space 4
-#endif
-#endif /* DONT_USE_DC */
-
-END (__memset)
-weak_alias (__memset, memset)
+
+ .p2align 4
+3: and tmp2, dstin, #-16
+ sub count, count, tmp1 /* account for misalign */
+ add dst, dstin, tmp1
+
+ .p2align 6,,24
+4: stp xzr, xzr, [tmp2, #16]
+ stp xzr, xzr, [tmp2, #32]
+ subs tmp1w, tmp1w, #64
+ stp xzr, xzr, [tmp2, #48]
+ stp xzr, xzr, [tmp2, #64]!
+ b.hi 4b
+
+ b 2b
+
+ .size memset_zva_n, . - memset_zva_n
+
+/* The non-zva path. */
+
+ .p2align 6
+ .type memset_nozva, %function
+memset_nozva:
+ CALL_MCOUNT
+ and valw, valw, #255
+L(nz_or_small):
+ orr valw, valw, valw, lsl #8 /* replicate the byte */
+ cmp count, #64
+ orr valw, valw, valw, lsl #16
+ add dstend, dstin, count /* remember end of buffer */
+ orr val, val, val, lsl #32
+ b.hs L(ge_64)
+
+ /* Small data -- original count is less than 64 bytes. */
+L(le_63):
+ cmp count, #16
+ b.lo L(le_15)
+ stp val, val, [dstin]
+ tbz count, #5, L(le_31)
+ stp val, val, [dstin, #16]
+ stp val, val, [dstend, #-32]
+L(le_31):
+ stp val, val, [dstend, #-16]
+ RET
+ .p2align 6,,16
+L(le_15):
+ tbz count, #3, L(le_7)
+ str val, [dstin]
+ str val, [dstend, #-8]
+ RET
+ .p2align 6,,16
+L(le_7):
+ tbz count, #2, L(le_3)
+ str valw, [dstin]
+ str valw, [dstend, #-4]
+ RET
+ .p2align 6,,20
+L(le_3):
+ tbz count, #1, L(le_1)
+ strh valw, [dstend, #-2]
+L(le_1):
+ tbz count, #0, L(le_0)
+ strb valw, [dstin]
+L(le_0):
+ RET
+
+ .p2align 6
+L(ge_64):
+ and dst, dstin, #-16 /* align the pointer / pre-bias. */
+ stp val, val, [dstin] /* first 16 align 1 */
+ sub count, dstend, dst /* begin misalign recompute */
+ subs count, count, #16+64 /* finish recompute + pre-bias */
+ b.ls L(loop_tail)
+
+ .p2align 6,,24
+L(loop):
+ stp val, val, [dst, #16]
+ stp val, val, [dst, #32]
+ subs count, count, #64
+ stp val, val, [dst, #48]
+ stp val, val, [dst, #64]!
+ b.hs L(loop)
+
+ adds count, count, #64 /* undo pre-bias */
+ b.ne L(loop_tail)
+ RET
+
+ /* Tail of the zva loop. Less than ZVA bytes, but possibly lots
+ more than 64. Note that dst is aligned but unbiased. */
+L(zva_tail):
+ subs count, count, #64 /* pre-bias */
+ sub dst, dst, #16 /* pre-bias */
+ b.hi L(loop)
+
+ /* Tail of the stp loop; less than 64 bytes left.
+ Note that dst is still aligned and biased by -16. */
+L(loop_tail):
+ stp val, val, [dstend, #-64]
+ stp val, val, [dstend, #-48]
+ stp val, val, [dstend, #-32]
+ stp val, val, [dstend, #-16]
+ RET
+
+ .size memset_nozva, . - memset_nozva
+ cfi_endproc
+
+strong_alias (memset, __memset)
libc_hidden_builtin_def (memset)