public inbox for libc-alpha@sourceware.org
* RE: [RFC PATCH] aarch64: improve memset
       [not found] <002701cffaa0$77623570$6626a050$@com>
@ 2014-11-07 16:14 ` Wilco Dijkstra
  2014-11-08 10:05   ` Richard Henderson
                     ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Wilco Dijkstra @ 2014-11-07 16:14 UTC (permalink / raw)
  To: 'Richard Henderson'; +Cc: will.newton, marcus.shawcroft, libc-alpha

> Richard Henderson wrote:
> On 11/05/2014 03:35 PM, Will Newton wrote:
> > On 30 September 2014 12:03, Marcus Shawcroft <marcus.shawcroft@gmail.com> wrote:
> >> On 14 June 2014 08:06, Richard Henderson <rth@twiddle.net> wrote:
> >>> The major idea here is to use IFUNC to check the zva line size once, and use
> >>> that to select different entry points.  This saves 3 branches during startup,
> >>> and allows significantly more flexibility.
> >>>
> >>> Also, I've cribbed several of the unaligned store ideas that Ondrej has done
> >>> with the x86 versions.
> >>>
> >>> I've done some performance testing using cachebench, which suggests that the
> >>> unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
> >>> bytes and above.  The non-zva path appears to be largely unchanged.
> >>
> >>
> >> OK Thanks /Marcus
> >
> > It looks like this patch has slipped through the cracks. Richard, are
> > you happy to apply this or do you think it warrants further
> > discussion?
> 
> Sorry for the radio silence.
> 
> Just before I went to apply it I thought I spotted a bug that would affect
> ld.so.  I haven't had time to make sure one way or another.

I've got a few comments on this patch:

* Do we really need variants for cache line sizes that are never going to be used?
  I'd say just support 64 and 128, and default higher sizes to no_zva.

* Why special case line size=64 only? Unrolling might not help for 128 but should not
  harm either, and the alignment overhead only increases with larger line sizes, so you
  want to bypass the zva code in all cases if N < 3-4x line size (see the sketch below).

* Is the no-ifunc variant still required/used? We now have at least 4 different
  variants which all need to be tested and maintained...

* Finally, which version is used when linking statically? I presume there is some 
  makefile magic that causes the no-zva version to be used, however that might not be 
  optimal for all targets.
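
As a rough rendering of that second point, the dispatch rule could look like the
following C sketch; the 4x threshold is an assumed tuning point rather than a
measured one, and zva_line would come from decoding DCZID_EL0 once:

#include <stddef.h>

/* Only worth taking the DC ZVA path when zeroing and when the request
   spans several ZVA lines; otherwise the alignment prologue dominates.  */
static int
use_zva_p (size_t n, int c, size_t zva_line)
{
  return c == 0 && zva_line != 0 && n >= 4 * zva_line;
}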

Wilco



* Re: [RFC PATCH] aarch64: improve memset
  2014-11-07 16:14 ` [RFC PATCH] aarch64: improve memset Wilco Dijkstra
@ 2014-11-08 10:05   ` Richard Henderson
  2014-11-09  8:19   ` Richard Henderson
  2014-11-11 13:05   ` Marcus Shawcroft
  2 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2014-11-08 10:05 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: will.newton, marcus.shawcroft, libc-alpha

On 11/07/2014 05:14 PM, Wilco Dijkstra wrote:
> * Do we really need variants for cache line sizes that are never going to be used?
>   I'd say just support 64 and 128, and default higher sizes to no_zva.

We could.

> * Why special case line size=64 only?

Because that was all I could test.

> * Is the no-ifunc variant still required/used? We now have at least 4 different
>   variants which all need to be tested and maintained...

It's used within ld.so itself, though of course that too could go no_zva.

> * Finally, which version is used when linking statically? I presume there is some 
>   makefile magic that causes the no-zva version to be used, however that might not be 
>   optimal for all targets.

One can have ifuncs in statically linked programs.  They get resolved at
startup; I forget the exact mechanism.
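
For what it's worth, the same selection can be sketched in plain C with GCC's
ifunc attribute; the resolver below runs once at startup (in static binaries
via IRELATIVE relocations applied early on) and never again on the hot path.
The my_memset_* names and the C stand-in bodies are illustrative, not the
glibc code:

#include <stddef.h>
#include <string.h>

static void *
my_memset_nozva (void *s, int c, size_t n)
{
  return memset (s, c, n);      /* stand-in for the stp-only path */
}

static void *
my_memset_zva_64 (void *s, int c, size_t n)
{
  return memset (s, c, n);      /* stand-in for the DC ZVA path */
}

typedef void *(*memset_fn) (void *, int, size_t);

/* Resolver: read DCZID_EL0 once and pick an entry point.  */
static memset_fn
resolve_my_memset (void)
{
  unsigned long dczid;
  asm ("mrs %0, dczid_el0" : "=r" (dczid));
  if ((dczid & 31) == 4)        /* 64-byte blocks, DC ZVA enabled */
    return my_memset_zva_64;
  return my_memset_nozva;       /* disabled, or another block size */
}

void *my_memset (void *, int, size_t)
     __attribute__ ((ifunc ("resolve_my_memset")));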


r~


* Re: [RFC PATCH] aarch64: improve memset
  2014-11-07 16:14 ` [RFC PATCH] aarch64: improve memset Wilco Dijkstra
  2014-11-08 10:05   ` Richard Henderson
@ 2014-11-09  8:19   ` Richard Henderson
  2014-11-10 20:09     ` Wilco Dijkstra
  2014-11-11 13:05   ` Marcus Shawcroft
  2 siblings, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2014-11-09  8:19 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: will.newton, marcus.shawcroft, libc-alpha

[-- Attachment #1: Type: text/plain, Size: 1352 bytes --]

On 11/07/2014 05:14 PM, Wilco Dijkstra wrote:
> I've got a few comments on this patch:
> 
> * Do we really need variants for cache line sizes that are never going to be used?
>   I'd say just support 64 and 128, and default higher sizes to no_zva.
> 
> * Why special case line size=64 only? Unrolling might not help for 128 but should not
>   harm either, and the alignment overhead only increases with larger line sizes, so you
>   want to bypass the zva code in all cases if N < 3-4x line size.
> 
> * Is the no-ifunc variant still required/used? We now have at least 4 different
>   variants which all need to be tested and maintained...
> 
> * Finally, which version is used when linking statically? I presume there is some 
>   makefile magic that causes the no-zva version to be used, however that might not be 
>   optimal for all targets.


Here's a version which only implements zva for 64 and 128-byte line sizes.

It also removes the version that loaded the zva data each time, which had
been used by ld.so and no-ifunc.  That was the path I had been concerned
about back in September.

That leaves ld.so using the no-zva path, which is perhaps a tad unfortunate
given that it needs to zero partial .bss pages during startup, and on a
system with 64k pages, we probably wind up with larger clears more often
than not...

Thoughts?


r~

[-- Attachment #2: z --]
[-- Type: text/plain, Size: 9576 bytes --]

diff --git a/sysdeps/aarch64/memset.S b/sysdeps/aarch64/memset.S
index 06f04be..9a3d932 100644
--- a/sysdeps/aarch64/memset.S
+++ b/sysdeps/aarch64/memset.S
@@ -20,23 +20,14 @@
  *
  * ARMv8-a, AArch64
  * Unaligned accesses
- *
  */
 
 #include <sysdep.h>
 
-/* By default we assume that the DC instruction can be used to zero
-   data blocks more efficiently.  In some circumstances this might be
-   unsafe, for example in an asymmetric multiprocessor environment with
-   different DC clear lengths (neither the upper nor lower lengths are
-   safe to use).  The feature can be disabled by defining DONT_USE_DC.
-
-   If code may be run in a virtualized environment, then define
-   MAYBE_VIRT.  This will cause the code to cache the system register
-   values rather than re-reading them each call.  */
-
 #define dstin		x0
-#define val		w1
+#define dstin_w		w0
+#define val		x1
+#define valw		w1
 #define count		x2
 #define tmp1		x3
 #define tmp1w		w3
@@ -44,186 +35,186 @@
 #define tmp2w		w4
 #define zva_len_x	x5
 #define zva_len		w5
-#define zva_bits_x	x6
-
-#define A_l		x7
-#define A_lw		w7
+#define zva_mask_x	x6
+#define zva_mask	w6
 #define dst		x8
-#define tmp3w		w9
-
-ENTRY_ALIGN (__memset, 6)
-
-	mov	dst, dstin		/* Preserve return value.  */
-	ands	A_lw, val, #255
-#ifndef DONT_USE_DC
-	b.eq	L(zero_mem)
-#endif
-	orr	A_lw, A_lw, A_lw, lsl #8
-	orr	A_lw, A_lw, A_lw, lsl #16
-	orr	A_l, A_l, A_l, lsl #32
-L(tail_maybe_long):
-	cmp	count, #64
-	b.ge	L(not_short)
-L(tail_maybe_tiny):
-	cmp	count, #15
-	b.le	L(tail15tiny)
-L(tail63):
-	ands	tmp1, count, #0x30
-	b.eq	L(tail15)
-	add	dst, dst, tmp1
-	cmp	tmp1w, #0x20
-	b.eq	1f
-	b.lt	2f
-	stp	A_l, A_l, [dst, #-48]
-1:
-	stp	A_l, A_l, [dst, #-32]
-2:
-	stp	A_l, A_l, [dst, #-16]
-
-L(tail15):
-	and	count, count, #15
-	add	dst, dst, count
-	stp	A_l, A_l, [dst, #-16]	/* Repeat some/all of last store. */
+#define dst_w		w8
+#define dstend		x9
+
+	.globl	memset
+	cfi_startproc
+
+#if HAVE_IFUNC && !defined (IS_IN_rtld)
+/* Rather than decode dczid_el0 every time, checking for zva disabled and
+   unpacking the line size, do this once in the indirect function and choose
+   an appropriate entry point which encodes these values as constants.  */
+
+	.type	memset, %gnu_indirect_function
+memset:
+	mrs	x1, dczid_el0
+	and	x1, x1, #31		/* isolate line size + disable bit */
+
+	cmp	x1, #4			/* 64 byte line size, enabled */
+	b.ne	1f
+	adr	x0, memset_zva_64
 	RET
 
-L(tail15tiny):
-	/* Set up to 15 bytes.  Does not assume earlier memory
-	   being set.  */
-	tbz	count, #3, 1f
-	str	A_l, [dst], #8
-1:
-	tbz	count, #2, 1f
-	str	A_lw, [dst], #4
-1:
-	tbz	count, #1, 1f
-	strh	A_lw, [dst], #2
-1:
-	tbz	count, #0, 1f
-	strb	A_lw, [dst]
-1:
+1:	cmp	x1, #5			/* 128 byte line size, enabled */
+	b.ne	1f
+	adr	x0, memset_zva_128
 	RET
 
-	/* Critical loop.  Start at a new cache line boundary.  Assuming
-	 * 64 bytes per line, this ensures the entire loop is in one line.  */
-	.p2align 6
-L(not_short):
-	neg	tmp2, dst
-	ands	tmp2, tmp2, #15
-	b.eq	2f
-	/* Bring DST to 128-bit (16-byte) alignment.  We know that there's
-	 * more than that to set, so we simply store 16 bytes and advance by
-	 * the amount required to reach alignment.  */
-	sub	count, count, tmp2
-	stp	A_l, A_l, [dst]
-	add	dst, dst, tmp2
-	/* There may be less than 63 bytes to go now.  */
-	cmp	count, #63
-	b.le	L(tail63)
-2:
-	sub	dst, dst, #16		/* Pre-bias.  */
-	sub	count, count, #64
-1:
-	stp	A_l, A_l, [dst, #16]
-	stp	A_l, A_l, [dst, #32]
-	stp	A_l, A_l, [dst, #48]
-	stp	A_l, A_l, [dst, #64]!
-	subs	count, count, #64
-	b.ge	1b
-	tst	count, #0x3f
-	add	dst, dst, #16
-	b.ne	L(tail63)
+1:	adr	x0, memset_nozva	/* Don't use zva at all */
+	RET
+	.size	memset, .-memset
+
+.macro do_zva size
+	.balign 64
+	.type	memset_zva_\size, %function
+memset_zva_\size:
+	CALL_MCOUNT
+	and	valw, valw, #255
+	cmp	count, #4*\size
+	ccmp	valw, #0, #0, hs	/* hs ? cmp val,0 : !z */
+	b.ne	L(nz_or_small)
+
+	stp	xzr, xzr, [dstin]	/* first 16 aligned 1.  */
+	and	tmp2, dstin, #-16
+	and	dst, dstin, #-\size
+
+	stp	xzr, xzr, [tmp2, #16]	/* first 64 aligned 16.  */
+	add	dstend, dstin, count
+	add	dst, dst, #\size
+
+	stp	xzr, xzr, [tmp2, #32]
+	sub	count, dstend, dst	/* recompute for misalign */
+	add	tmp1, dst, #\size
+
+	stp	xzr, xzr, [tmp2, #48]
+	sub	count, count, #2*\size	/* pre-bias */
+
+	stp	xzr, xzr, [tmp2, #64]
+
+	/* Store up to first SIZE, aligned 16.  */
+.ifgt	\size - 64
+	stp	xzr, xzr, [tmp2, #80]
+	stp	xzr, xzr, [tmp2, #96]
+	stp	xzr, xzr, [tmp2, #112]
+	stp	xzr, xzr, [tmp2, #128]
+.ifgt	\size - 128
+.err
+.endif
+.endif
+
+	.balign 64,,24
+0:	dc	zva, dst
+	subs	count, count, #2*\size
+	dc	zva, tmp1
+	add	dst, dst, #2*\size
+	add	tmp1, tmp1, #2*\size
+	b.hs	0b
+
+	adds	count, count, #2*\size	/* undo pre-bias */
+	b.ne	L(zva_tail)
 	RET
 
-#ifndef DONT_USE_DC
-	/* For zeroing memory, check to see if we can use the ZVA feature to
-	 * zero entire 'cache' lines.  */
-L(zero_mem):
-	mov	A_l, #0
-	cmp	count, #63
-	b.le	L(tail_maybe_tiny)
-	neg	tmp2, dst
-	ands	tmp2, tmp2, #15
-	b.eq	1f
-	sub	count, count, tmp2
-	stp	A_l, A_l, [dst]
-	add	dst, dst, tmp2
-	cmp	count, #63
-	b.le	L(tail63)
-1:
-	/* For zeroing small amounts of memory, it's not worth setting up
-	 * the line-clear code.  */
-	cmp	count, #128
-	b.lt	L(not_short)
-#ifdef MAYBE_VIRT
-	/* For efficiency when virtualized, we cache the ZVA capability.  */
-	adrp	tmp2, L(cache_clear)
-	ldr	zva_len, [tmp2, #:lo12:L(cache_clear)]
-	tbnz	zva_len, #31, L(not_short)
-	cbnz	zva_len, L(zero_by_line)
-	mrs	tmp1, dczid_el0
-	tbz	tmp1, #4, 1f
-	/* ZVA not available.  Remember this for next time.  */
-	mov	zva_len, #~0
-	str	zva_len, [tmp2, #:lo12:L(cache_clear)]
-	b	L(not_short)
-1:
-	mov	tmp3w, #4
-	and	zva_len, tmp1w, #15	/* Safety: other bits reserved.  */
-	lsl	zva_len, tmp3w, zva_len
-	str	zva_len, [tmp2, #:lo12:L(cache_clear)]
+	.size	memset_zva_\size, . - memset_zva_\size
+.endm
+
+do_zva	64
+do_zva	128
 #else
-	mrs	tmp1, dczid_el0
-	tbnz	tmp1, #4, L(not_short)
-	mov	tmp3w, #4
-	and	zva_len, tmp1w, #15	/* Safety: other bits reserved.  */
-	lsl	zva_len, tmp3w, zva_len
-#endif
-
-L(zero_by_line):
-	/* Compute how far we need to go to become suitably aligned.  We're
-	 * already at quad-word alignment.  */
-	cmp	count, zva_len_x
-	b.lt	L(not_short)		/* Not enough to reach alignment.  */
-	sub	zva_bits_x, zva_len_x, #1
-	neg	tmp2, dst
-	ands	tmp2, tmp2, zva_bits_x
-	b.eq	1f			/* Already aligned.  */
-	/* Not aligned, check that there's enough to copy after alignment.  */
-	sub	tmp1, count, tmp2
-	cmp	tmp1, #64
-	ccmp	tmp1, zva_len_x, #8, ge	/* NZCV=0b1000 */
-	b.lt	L(not_short)
-	/* We know that there's at least 64 bytes to zero and that it's safe
-	 * to overrun by 64 bytes.  */
-	mov	count, tmp1
-2:
-	stp	A_l, A_l, [dst]
-	stp	A_l, A_l, [dst, #16]
-	stp	A_l, A_l, [dst, #32]
-	subs	tmp2, tmp2, #64
-	stp	A_l, A_l, [dst, #48]
-	add	dst, dst, #64
-	b.ge	2b
-	/* We've overrun a bit, so adjust dst downwards.  */
-	add	dst, dst, tmp2
-1:
-	sub	count, count, zva_len_x
-3:
-	dc	zva, dst
-	add	dst, dst, zva_len_x
-	subs	count, count, zva_len_x
-	b.ge	3b
-	ands	count, count, zva_bits_x
-	b.ne	L(tail_maybe_long)
+/* If we don't have ifunc (e.g. ld.so) don't bother with the zva.  */
+# define memset_nozva  memset
+#endif /* IFUNC */
+
+/* The non-zva path.  */
+
+	.balign 64
+	.type	memset_nozva, %function
+memset_nozva:
+	CALL_MCOUNT
+	and	valw, valw, #255
+L(nz_or_small):
+	orr	valw, valw, valw, lsl #8  /* replicate the byte */
+	cmp	count, #64
+	orr	valw, valw, valw, lsl #16
+	add	dstend, dstin, count	  /* remember end of buffer */
+	orr	val, val, val, lsl #32
+	b.hs	L(ge_64)
+
+	/* Small data -- original count is less than 64 bytes.  */
+L(le_63):
+	cmp	count, #16
+	b.lo	L(le_15)
+	stp	val, val, [dstin]
+	tbz	count, #5, L(le_31)
+	stp	val, val, [dstin, #16]
+	stp	val, val, [dstend, #-32]
+L(le_31):
+	stp	val, val, [dstend, #-16]
+	RET
+	.balign 64,,16
+L(le_15):
+	tbz	count, #3, L(le_7)
+	str	val, [dstin]
+	str	val, [dstend, #-8]
+	RET
+	.balign 64,,16
+L(le_7):
+	tbz	count, #2, L(le_3)
+	str	valw, [dstin]
+	str	valw, [dstend, #-4]
+	RET
+	.balign 64,,20
+L(le_3):
+	tbz	count, #1, L(le_1)
+	strh	valw, [dstend, #-2]
+L(le_1):
+	tbz	count, #0, L(le_0)
+	strb	valw, [dstin]
+L(le_0):
+	RET
+
+	.balign 64
+L(ge_64):
+	and	dst, dstin, #-16	/* align the pointer / pre-bias.  */
+	stp	val, val, [dstin]	/* first 16 align 1 */
+	sub	count, dstend, dst	/* begin misalign recompute */
+	subs	count, count, #16+64	/* finish recompute + pre-bias */
+	b.ls	L(loop_tail)
+
+	.balign 64,,24
+L(loop):
+	stp	val, val, [dst, #16]
+	stp	val, val, [dst, #32]
+	subs	count, count, #64
+	stp	val, val, [dst, #48]
+	stp	val, val, [dst, #64]!
+	b.hs	L(loop)
+
+	adds	count, count, #64	/* undo pre-bias */
+	b.ne	L(loop_tail)
+	RET
+
+	/* Tail of the zva loop.  Less than ZVA bytes, but possibly lots
+	   more than 64.  Note that dst is aligned but unbiased.  */
+L(zva_tail):
+	subs	count, count, #64	/* pre-bias */
+	sub	dst, dst, #16		/* pre-bias */
+	b.hi	L(loop)
+
+	/* Tail of the stp loop; less than 64 bytes left (from loop)
+           or less-than-or-equal to 64 bytes left (from ge_64/zva_tail).  */
+L(loop_tail):
+	stp	val, val, [dstend, #-64]
+	stp	val, val, [dstend, #-48]
+	stp	val, val, [dstend, #-32]
+	stp	val, val, [dstend, #-16]
 	RET
-#ifdef MAYBE_VIRT
-	.bss
-	.p2align 2
-L(cache_clear):
-	.space 4
-#endif
-#endif /* DONT_USE_DC */
-
-END (__memset)
-weak_alias (__memset, memset)
+
+	.size	memset_nozva, . - memset_nozva
+	cfi_endproc
+
+strong_alias (memset, __memset)
 libc_hidden_builtin_def (memset)
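
In C terms, the do_zva entry points above implement roughly the following
strategy (a sketch only, assuming a 64-byte block with DC ZVA enabled, and
without the overlapping head stores and branch scheduling of the real code):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ZVA_LINE 64u                    /* assumed block size */

static inline void
dc_zva (void *p)
{
  asm volatile ("dc zva, %0" : : "r" (p) : "memory");
}

/* Zero n bytes; caller has checked val == 0 and n >= 4 * ZVA_LINE.  */
static void
zero_with_zva (void *dst, size_t n)
{
  char *p = dst;
  char *end = p + n;
  char *blk = (char *) (((uintptr_t) p + ZVA_LINE - 1)
                        & ~(uintptr_t) (ZVA_LINE - 1));

  memset (p, 0, blk - p);               /* head: up to one line */
  while ((size_t) (end - blk) >= ZVA_LINE)
    {
      dc_zva (blk);                     /* zero a whole line at once */
      blk += ZVA_LINE;
    }
  memset (blk, 0, end - blk);           /* tail: less than one line */
}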


* RE: [RFC PATCH] aarch64: improve memset
  2014-11-09  8:19   ` Richard Henderson
@ 2014-11-10 20:09     ` Wilco Dijkstra
  2014-11-11  8:13       ` Richard Henderson
  0 siblings, 1 reply; 19+ messages in thread
From: Wilco Dijkstra @ 2014-11-10 20:09 UTC (permalink / raw)
  To: 'Richard Henderson'; +Cc: will.newton, marcus.shawcroft, libc-alpha

> Richard Henderson wrote:
> On 11/07/2014 05:14 PM, Wilco Dijkstra wrote:
> > 
> > * Finally, which version is used when linking statically? I presume there is some
> >   makefile magic that causes the no-zva version to be used, however that might not be
> >   optimal for all targets.

So it turns out ifuncs are used even with static linking.

> That leaves ld.so using the no-zva path, which is perhaps a tad unfortunate
> given that it needs to zero partial .bss pages during startup, and on a
> system with 64k pages, we probably wind up with larger clears more often
> than not...

I'm not sure how often ld.so calls memset but I'm guessing it is minor compared
to the total time to load.

> Thoughts?

I spotted one issue in the alignment code:

+	stp	xzr, xzr, [tmp2, #64]
+
+	/* Store up to first SIZE, aligned 16.  */
+.ifgt	\size - 64
+	stp	xzr, xzr, [tmp2, #80]
+	stp	xzr, xzr, [tmp2, #96]
+	stp	xzr, xzr, [tmp2, #112]
+	stp	xzr, xzr, [tmp2, #128]
+.ifgt	\size - 128
+.err
+.endif
+.endif

This should be:

+	/* Store up to first SIZE, aligned 16.  */
+.ifgt	\size - 64
+	stp	xzr, xzr, [tmp2, #64]
+	stp	xzr, xzr, [tmp2, #80]
+	stp	xzr, xzr, [tmp2, #96]
+	stp	xzr, xzr, [tmp2, #112]
+.ifgt	\size - 128
+.err
+.endif
+.endif

Other than that it looks good to me.

Wilco



* Re: [RFC PATCH] aarch64: improve memset
  2014-11-10 20:09     ` Wilco Dijkstra
@ 2014-11-11  8:13       ` Richard Henderson
  2014-11-11 12:52         ` Wilco Dijkstra
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2014-11-11  8:13 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: will.newton, marcus.shawcroft, libc-alpha

On 11/10/2014 09:09 PM, Wilco Dijkstra wrote:
> I spotted one issue in the alignment code:
> 
> +	stp	xzr, xzr, [tmp2, #64]
> +
> +	/* Store up to first SIZE, aligned 16.  */
> +.ifgt	\size - 64
> +	stp	xzr, xzr, [tmp2, #80]
> +	stp	xzr, xzr, [tmp2, #96]
> +	stp	xzr, xzr, [tmp2, #112]
> +	stp	xzr, xzr, [tmp2, #128]
> +.ifgt	\size - 128
> +.err
> +.endif
> +.endif
> 
> This should be:
> 
> +	/* Store up to first SIZE, aligned 16.  */
> +.ifgt	\size - 64
> +	stp	xzr, xzr, [tmp2, #64]
> +	stp	xzr, xzr, [tmp2, #80]
> +	stp	xzr, xzr, [tmp2, #96]
> +	stp	xzr, xzr, [tmp2, #112]
> +.ifgt	\size - 128
> +.err
> +.endif

Incorrect.

tmp2 is backward aligned from dst_in, which means that tmp2+0 may be before
dst_in.  Thus we write the first 16 bytes, unaligned, then write to tmp2+16
through tmp2+N to clear the first N+1 to N+16 bytes.

However, if we stop at tmp2+48 (or tmp2+112) we could be leaving up to 15 bytes
uninitialized.


r~


* RE: [RFC PATCH] aarch64: improve memset
  2014-11-11  8:13       ` Richard Henderson
@ 2014-11-11 12:52         ` Wilco Dijkstra
  2014-11-11 14:30           ` Richard Henderson
  0 siblings, 1 reply; 19+ messages in thread
From: Wilco Dijkstra @ 2014-11-11 12:52 UTC (permalink / raw)
  To: 'Richard Henderson'; +Cc: will.newton, marcus.shawcroft, libc-alpha

> Richard wrote:
> On 11/10/2014 09:09 PM, Wilco Dijkstra wrote:
> > I spotted one issue in the alignment code:
> >
> > +	stp	xzr, xzr, [tmp2, #64]
> > +
> > +	/* Store up to first SIZE, aligned 16.  */
> > +.ifgt	\size - 64
> > +	stp	xzr, xzr, [tmp2, #80]
> > +	stp	xzr, xzr, [tmp2, #96]
> > +	stp	xzr, xzr, [tmp2, #112]
> > +	stp	xzr, xzr, [tmp2, #128]
> > +.ifgt	\size - 128
> > +.err
> > +.endif
> > +.endif
> >
> > This should be:
> >
> > +	/* Store up to first SIZE, aligned 16.  */
> > +.ifgt	\size - 64
> > +	stp	xzr, xzr, [tmp2, #64]
> > +	stp	xzr, xzr, [tmp2, #80]
> > +	stp	xzr, xzr, [tmp2, #96]
> > +	stp	xzr, xzr, [tmp2, #112]
> > +.ifgt	\size - 128
> > +.err
> > +.endif
> 
> Incorrect.
> 
> tmp2 is backward aligned from dst_in, which means that tmp2+0 may be before
> dst_in.  Thus we write the first 16 bytes, unaligned, then write to tmp2+16
> through tmp2+N to clear the first N+1 to N+16 bytes.
> 
> However, if we stop at tmp2+48 (or tmp2+112) we could be leaving up to 15 bytes
> uninitialized.

No - in the worst case we need to write 64 bytes. The proof is trivial, 
dst = x0 & -64, tmp2 = x0 & -16, so tmp2 = dst + (x0 & 0x30) or tmp2 >= dst. 
Since we start doing the dc's at dst + 64, the stp to [tmp2 + 64] is redundant.
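
The identity can also be checked mechanically; a small sketch (illustrative
only) that exercises every low-bit pattern:

#include <assert.h>
#include <stdint.h>

int
main (void)
{
  for (uint64_t x = 0; x < (1u << 20); x++)
    {
      uint64_t dst  = x & ~(uint64_t) 63;       /* x & -64 */
      uint64_t tmp2 = x & ~(uint64_t) 15;       /* x & -16 */
      assert (tmp2 == dst + (x & 0x30));        /* tmp2 = dst + (x & 0x30) */
      assert (tmp2 >= dst && tmp2 <= dst + 48); /* tmp2 never precedes dst */
    }
  return 0;
}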

Wilco



* Re: [RFC PATCH] aarch64: improve memset
  2014-11-07 16:14 ` [RFC PATCH] aarch64: improve memset Wilco Dijkstra
  2014-11-08 10:05   ` Richard Henderson
  2014-11-09  8:19   ` Richard Henderson
@ 2014-11-11 13:05   ` Marcus Shawcroft
  2014-11-11 14:16     ` Wilco Dijkstra
  2014-11-11 17:22     ` Andrew Pinski
  2 siblings, 2 replies; 19+ messages in thread
From: Marcus Shawcroft @ 2014-11-11 13:05 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Richard Henderson, will.newton, GNU C Library

On 7 November 2014 16:14, Wilco Dijkstra <wdijkstr@arm.com> wrote:
>> Richard Henderson wrote:

> I've got a few comments on this patch:
>
> * Do we really need variants for cache line sizes that are never going to be used?
>   I'd say just support 64 and 128, and default higher sizes to no_zva.

We shouldn't be removing support for the other sizes already supported
by the existing implementation.  If the other sizes were deprecated
from the architecture then fair game, but that is not the case.  From
offline conversation with Wilco I gather part of the motivation to
remove is that the non-64 cases cannot be readily tested on HW.
That particular issue was solved in the original implementation using
a hacked qemu.

Cheers
/Marcus

> * Why special case line size=64 only? Unrolling might not help for 128 but should not
>   harm either, and the alignment overhead only increases with larger line sizes, so you
>   want to bypass the zva code in all cases if N < 3-4x line size.
>
> * Is the no-ifunc variant still required/used? We now have at least 4 different
>   variants which all need to be tested and maintained...
>
> * Finally, which version is used when linking statically? I presume there is some
>   makefile magic that causes the no-zva version to be used, however that might not be
>   optimal for all targets.
>
> Wilco
>
>


* RE: [RFC PATCH] aarch64: improve memset
  2014-11-11 13:05   ` Marcus Shawcroft
@ 2014-11-11 14:16     ` Wilco Dijkstra
  2014-11-11 17:22     ` Andrew Pinski
  1 sibling, 0 replies; 19+ messages in thread
From: Wilco Dijkstra @ 2014-11-11 14:16 UTC (permalink / raw)
  To: 'Marcus Shawcroft'; +Cc: Richard Henderson, will.newton, GNU C Library

> Marcus Shawcroft wrote:
> On 7 November 2014 16:14, Wilco Dijkstra <wdijkstr@arm.com> wrote:
> >> Richard Henderson wrote:
> 
> > I've got a few comments on this patch:
> >
> > * Do we really need variants for cache line sizes that are never going to be used?
> >   I'd say just support 64 and 128, and default higher sizes to no_zva.
> 
> We shouldn't be removing support for the other sizes already supported
> by the existing implementation.  If the other sizes were deprecated
> from the architecture then fair game, but that is not the case.  From
> offline conversation with Wilco I gather part of the motivation to
> remove is that the non-64 cases cannot be readily tested on HW.
> That particular issue was solved in the original implementation using
> a hacked qemu.

The architecture allows dc zva of 4..2048 bytes. Most of these are useless and would
not result in a performance gain. Sizes 4-16 cannot be useful as an stp can write
more data... Larger sizes incur an ever increasing alignment overhead and there are 
fewer memsets where dc zva could be used.

It would certainly be a good idea to deprecate useless small and overly large sizes,
but I don't see the reasoning for supporting every legal size without evidence of a 
performance gain on an actual implementation. It's not like memset will crash on
an implementation with an unsupported size; it just won't use dc.
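
For reference, the reported block size can be inspected from EL0 with the same
decode the original patch used (block size = 4 << DCZID_EL0[3:0] bytes, with
bit 4 set meaning DC ZVA is prohibited); a minimal sketch:

#include <stdio.h>

int
main (void)
{
  unsigned long dczid;
  asm ("mrs %0, dczid_el0" : "=r" (dczid));
  printf ("DC ZVA block size: %lu bytes%s\n",
          4ul << (dczid & 15),
          (dczid & 16) ? " (DC ZVA prohibited)" : "");
  return 0;
}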

Wilco



* Re: [RFC PATCH] aarch64: improve memset
  2014-11-11 12:52         ` Wilco Dijkstra
@ 2014-11-11 14:30           ` Richard Henderson
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2014-11-11 14:30 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: will.newton, marcus.shawcroft, libc-alpha

On 11/11/2014 01:52 PM, Wilco Dijkstra wrote:
> No - in the worst case we need to write 64 bytes. The proof is trivial, 
> dst = x0 & -64, tmp2 = x0 & -16, so tmp2 = dst + (x0 & 0x30) or tmp2 >= dst. 
> Since we start doing the dc's at dst + 64, the stp to [tmp2 + 64] is redundant.

Quite right, my mistake.


r~


* Re: [RFC PATCH] aarch64: improve memset
  2014-11-11 13:05   ` Marcus Shawcroft
  2014-11-11 14:16     ` Wilco Dijkstra
@ 2014-11-11 17:22     ` Andrew Pinski
  1 sibling, 0 replies; 19+ messages in thread
From: Andrew Pinski @ 2014-11-11 17:22 UTC (permalink / raw)
  To: Marcus Shawcroft
  Cc: Wilco Dijkstra, Richard Henderson, Will Newton, GNU C Library

On Tue, Nov 11, 2014 at 5:05 AM, Marcus Shawcroft
<marcus.shawcroft@gmail.com> wrote:
> On 7 November 2014 16:14, Wilco Dijkstra <wdijkstr@arm.com> wrote:
>>> Richard Henderson wrote:
>
>> I've got a few comments on this patch:
>>
>> * Do we really need variants for cache line sizes that are never going to be used?
>>   I'd say just support 64 and 128, and default higher sizes to no_zva.
>
> We shouldn't be removing support for the other sizes already supported
> by the existing implementation.  If the other sizes were deprecated
> from the architecture then fair game, but that is not the case.  From
> offline conversation with Wilco I gather part of the motivation to
> remove is that the non-64 cases cannot be readily tested on HW.
> That particular issue was solved in the original implementation using
> a hacked qemu.

I will soon have the ability to test on hardware which uses the 128-byte
case.  I am already testing using a simulator which sets it to 128 bytes
(though I use it for performance analysis).

Thanks,
Andrew


>
> Cheers
> /Marcus
>
>> * Why special case line size=64 only? Unrolling might not help for 128 but should not
>>   harm either, and the alignment overhead only increases with larger line sizes, so you
>>   want to bypass the zva code in all cases if N < 3-4x line size.
>>
>> * Is the no-ifunc variant still required/used? We now have at least 4 different
>>   variants which all need to be tested and maintained...
>>
>> * Finally, which version is used when linking statically? I presume there is some
>>   makefile magic that causes the no-zva version to be used, however that might not be
>>   optimal for all targets.
>>
>> Wilco
>>
>>


* Re: [RFC PATCH] aarch64: improve memset
  2014-06-14  7:06 Richard Henderson
  2014-06-20 11:05 ` Ondřej Bílka
  2014-09-30 11:03 ` Marcus Shawcroft
@ 2015-02-18  2:41 ` Andrew Pinski
  2 siblings, 0 replies; 19+ messages in thread
From: Andrew Pinski @ 2015-02-18  2:41 UTC (permalink / raw)
  To: Richard Henderson; +Cc: libc-alpha, Ondřej Bílka, Marcus Shawcroft

On Sat, Jun 14, 2014 at 12:06 AM, Richard Henderson <rth@twiddle.net> wrote:
> The major idea here is to use IFUNC to check the zva line size once, and use
> that to select different entry points.  This saves 3 branches during startup,
> and allows significantly more flexibility.
>
> Also, I've cribbed several of the unaligned store ideas that Ondrej has done
> with the x86 versions.
>
> I've done some performance testing using cachebench, which suggests that the
> unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
> bytes and above.  The non-zva path appears to be largely unchanged.
>
> I'd like to use some of Ondrej's benchmarks+data, but I couldn't locate them in
> a quick search of the mailing list.  Pointers?
>
> Comments?

Yes, I still see a performance regression on ThunderX with this patch and
the newer versions, due to the placement of subs in the innermost loop
of the non-zero case.  It is around a 20% regression.

Thanks,
Andrew Pinski



>
>
> r~


* Re: [RFC PATCH] aarch64: improve memset
  2014-11-05 14:35   ` Will Newton
@ 2014-11-06  6:55     ` Richard Henderson
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2014-11-06  6:55 UTC (permalink / raw)
  To: Will Newton, Marcus Shawcroft; +Cc: libc-alpha

On 11/05/2014 03:35 PM, Will Newton wrote:
> On 30 September 2014 12:03, Marcus Shawcroft <marcus.shawcroft@gmail.com> wrote:
>> On 14 June 2014 08:06, Richard Henderson <rth@twiddle.net> wrote:
>>> The major idea here is to use IFUNC to check the zva line size once, and use
>>> that to select different entry points.  This saves 3 branches during startup,
>>> and allows significantly more flexibility.
>>>
>>> Also, I've cribbed several of the unaligned store ideas that Ondrej has done
>>> with the x86 versions.
>>>
>>> I've done some performance testing using cachebench, which suggests that the
>>> unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
>>> bytes and above.  The non-zva path appears to be largely unchanged.
>>
>>
>> OK Thanks /Marcus
> 
> It looks like this patch has slipped through the cracks. Richard, are
> you happy to apply this or do you think it warrants further
> discussion?

Sorry for the radio silence.

Just before I went to apply it I thought I spotted a bug that would affect
ld.so.  I haven't had time to make sure one way or another.


r~


* Re: [RFC PATCH] aarch64: improve memset
  2014-09-30 11:03 ` Marcus Shawcroft
@ 2014-11-05 14:35   ` Will Newton
  2014-11-06  6:55     ` Richard Henderson
  0 siblings, 1 reply; 19+ messages in thread
From: Will Newton @ 2014-11-05 14:35 UTC (permalink / raw)
  To: Marcus Shawcroft; +Cc: Richard Henderson, libc-alpha

On 30 September 2014 12:03, Marcus Shawcroft <marcus.shawcroft@gmail.com> wrote:
> On 14 June 2014 08:06, Richard Henderson <rth@twiddle.net> wrote:
>> The major idea here is to use IFUNC to check the zva line size once, and use
>> that to select different entry points.  This saves 3 branches during startup,
>> and allows significantly more flexibility.
>>
>> Also, I've cribbed several of the unaligned store ideas that Ondrej has done
>> with the x86 versions.
>>
>> I've done some performance testing using cachebench, which suggests that the
>> unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
>> bytes and above.  The non-zva path appears to be largely unchanged.
>
>
> OK Thanks /Marcus

It looks like this patch has slipped through the cracks. Richard, are
you happy to apply this or do you think it warrants further
discussion?

Thanks,

-- 
Will Newton
Toolchain Working Group, Linaro


* Re: [RFC PATCH] aarch64: improve memset
  2014-06-14  7:06 Richard Henderson
  2014-06-20 11:05 ` Ondřej Bílka
@ 2014-09-30 11:03 ` Marcus Shawcroft
  2014-11-05 14:35   ` Will Newton
  2015-02-18  2:41 ` Andrew Pinski
  2 siblings, 1 reply; 19+ messages in thread
From: Marcus Shawcroft @ 2014-09-30 11:03 UTC (permalink / raw)
  To: Richard Henderson; +Cc: libc-alpha

On 14 June 2014 08:06, Richard Henderson <rth@twiddle.net> wrote:
> The major idea here is to use IFUNC to check the zva line size once, and use
> that to select different entry points.  This saves 3 branches during startup,
> and allows significantly more flexibility.
>
> Also, I've cribbed several of the unaligned store ideas that Ondrej has done
> with the x86 versions.
>
> I've done some performance testing using cachebench, which suggests that the
> unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
> bytes and above.  The non-zva path appears to be largely unchanged.


OK Thanks /Marcus


* Re: [RFC PATCH] aarch64: improve memset
  2014-09-12  0:15 ` Carlos O'Donell
@ 2014-09-18  0:25   ` Will Newton
  0 siblings, 0 replies; 19+ messages in thread
From: Will Newton @ 2014-09-18  0:25 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Richard Henderson, libc-alpha

On 11 September 2014 17:14, Carlos O'Donell <carlos@redhat.com> wrote:
> On 09/11/2014 04:40 PM, Richard Henderson wrote:
>>> https://sourceware.org/ml/libc-alpha/2014-06/msg00376.html
>>
>> Ping.
>
> From my perspective your patch looks fine. Given the lack of
> ARMv8 hardware I can test on, I feel these optimizations are
> a little bit like spinning our wheels. We move in the right
> direction though and that's good.
>
> Out of curiosity did bench/bench-memset show any improvements?

I agree, this looks good to me too but it would be useful to see what
performance changes were seen and if possible which hardware that was
on.

-- 
Will Newton
Toolchain Working Group, Linaro


* Re: [RFC PATCH] aarch64: improve memset
  2014-09-11 20:41 Richard Henderson
@ 2014-09-12  0:15 ` Carlos O'Donell
  2014-09-18  0:25   ` Will Newton
  0 siblings, 1 reply; 19+ messages in thread
From: Carlos O'Donell @ 2014-09-12  0:15 UTC (permalink / raw)
  To: Richard Henderson, libc-alpha

On 09/11/2014 04:40 PM, Richard Henderson wrote:
>> https://sourceware.org/ml/libc-alpha/2014-06/msg00376.html
> 
> Ping.

From my perspective your patch looks fine. Given the lack of
ARMv8 hardware I can test on, I feel these optimizations are
a little bit like spinning our wheels. We move in the right
direction though and that's good.

Out of curiosity did bench/bench-memset show any improvements?

Cheers,
Carlos.


* Re: [RFC PATCH] aarch64: improve memset
@ 2014-09-11 20:41 Richard Henderson
  2014-09-12  0:15 ` Carlos O'Donell
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2014-09-11 20:41 UTC (permalink / raw)
  To: libc-alpha

> https://sourceware.org/ml/libc-alpha/2014-06/msg00376.html

Ping.


r~


* Re: [RFC PATCH] aarch64: improve memset
  2014-06-14  7:06 Richard Henderson
@ 2014-06-20 11:05 ` Ondřej Bílka
  2014-09-30 11:03 ` Marcus Shawcroft
  2015-02-18  2:41 ` Andrew Pinski
  2 siblings, 0 replies; 19+ messages in thread
From: Ondřej Bílka @ 2014-06-20 11:05 UTC (permalink / raw)
  To: Richard Henderson; +Cc: libc-alpha, Marcus Shawcroft

On Sat, Jun 14, 2014 at 12:06:39AM -0700, Richard Henderson wrote:
> The major idea here is to use IFUNC to check the zva line size once, and use
> that to select different entry points.  This saves 3 branches during startup,
> and allows significantly more flexibility.
> 
> Also, I've cribbed several of the unaligned store ideas that Ondrej has done
> with the x86 versions.
> 
> I've done some performance testing using cachebench, which suggests that the
> unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
> bytes and above.  The non-zva path appears to be largely unchanged.
> 
> I'd like to use some of Ondrej's benchmarks+data, but I couldn't locate them in
> a quick search of the mailing list.  Pointers?
> 
> Comments?
>
A benchmark that I currently use is here; it simply measures the running
time of a given command with different implementations. You need to
generate a .so with memset for each variant, then run ./benchmark
and wait. I am not sure about the performance impact of unrolling, as these
sizes tend to be relatively rare in the apps that I measured.

http://kam.mff.cuni.cz/~ondra/memset_consistency_benchmark.tar.bz2
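
A minimal self-contained harness in the same spirit as the benchmark described
above (not Ondřej's tool itself) might look like this in C; swap memset
implementations via LD_PRELOAD or by linking different objects, and note that
the sizes and iteration count here are arbitrary choices:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int
main (void)
{
  enum { ITERS = 200000 };
  static const size_t sizes[] = { 16, 64, 256, 1024, 4096, 65536 };
  char *buf = malloc (65536 + 64);
  if (buf == NULL)
    return 1;

  for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
    {
      struct timespec t0, t1;
      clock_gettime (CLOCK_MONOTONIC, &t0);
      for (int j = 0; j < ITERS; j++)
        memset (buf + (j & 63), 0, sizes[i]);   /* vary alignment a little */
      clock_gettime (CLOCK_MONOTONIC, &t1);
      double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
      printf ("%6zu bytes: %.1f ns/call\n", sizes[i], ns / ITERS);
    }
  free (buf);
  return 0;
}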

What I got from that is a bit chaotic: for example, on AMD, gcc runs fastest with a simple rep stosq loop,
but other benchmarks say otherwise. It is on my priority list to update
memset based on that.


Then I have a profiler, however it is currently x86-specific and it would
take some work to make it cross-platform. It also has the limitation that it
does not measure the effects of memset on caches, which could skew the results.

An important part here is the characteristics of the data, which are here:

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memset_profile/results_gcc/result.html

It shows, for example, that the data is almost always 8-byte aligned,
along with similar properties. The latest source is here.

http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile130813.tar.bz2 



* [RFC PATCH] aarch64: improve memset
@ 2014-06-14  7:06 Richard Henderson
  2014-06-20 11:05 ` Ondřej Bílka
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Richard Henderson @ 2014-06-14  7:06 UTC (permalink / raw)
  To: libc-alpha; +Cc: Ondřej Bílka, Marcus Shawcroft

[-- Attachment #1: Type: text/plain, Size: 664 bytes --]

The major idea here is to use IFUNC to check the zva line size once, and use
that to select different entry points.  This saves 3 branches during startup,
and allows significantly more flexibility.

Also, I've cribbed several of the unaligned store ideas that Ondrej has done
with the x86 versions.

I've done some performance testing using cachebench, which suggests that the
unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
bytes and above.  The non-zva path appears to be largely unchanged.

I'd like to use some of Ondrej's benchmarks+data, but I couldn't locate them in
a quick search of the mailing list.  Pointers?

Comments?


r~

[-- Attachment #2: z --]
[-- Type: text/plain, Size: 11779 bytes --]

diff --git a/sysdeps/aarch64/memset.S b/sysdeps/aarch64/memset.S
index 06f04be..523406d 100644
--- a/sysdeps/aarch64/memset.S
+++ b/sysdeps/aarch64/memset.S
@@ -20,23 +20,14 @@
  *
  * ARMv8-a, AArch64
  * Unaligned accesses
- *
  */
 
 #include <sysdep.h>
 
-/* By default we assume that the DC instruction can be used to zero
-   data blocks more efficiently.  In some circumstances this might be
-   unsafe, for example in an asymmetric multiprocessor environment with
-   different DC clear lengths (neither the upper nor lower lengths are
-   safe to use).  The feature can be disabled by defining DONT_USE_DC.
-
-   If code may be run in a virtualized environment, then define
-   MAYBE_VIRT.  This will cause the code to cache the system register
-   values rather than re-reading them each call.  */
-
 #define dstin		x0
-#define val		w1
+#define dstin_w		w0
+#define val		x1
+#define valw		w1
 #define count		x2
 #define tmp1		x3
 #define tmp1w		w3
@@ -44,186 +35,280 @@
 #define tmp2w		w4
 #define zva_len_x	x5
 #define zva_len		w5
-#define zva_bits_x	x6
-
-#define A_l		x7
-#define A_lw		w7
+#define zva_mask_x	x6
+#define zva_mask	w6
 #define dst		x8
-#define tmp3w		w9
-
-ENTRY_ALIGN (__memset, 6)
-
-	mov	dst, dstin		/* Preserve return value.  */
-	ands	A_lw, val, #255
-#ifndef DONT_USE_DC
-	b.eq	L(zero_mem)
-#endif
-	orr	A_lw, A_lw, A_lw, lsl #8
-	orr	A_lw, A_lw, A_lw, lsl #16
-	orr	A_l, A_l, A_l, lsl #32
-L(tail_maybe_long):
-	cmp	count, #64
-	b.ge	L(not_short)
-L(tail_maybe_tiny):
-	cmp	count, #15
-	b.le	L(tail15tiny)
-L(tail63):
-	ands	tmp1, count, #0x30
-	b.eq	L(tail15)
-	add	dst, dst, tmp1
-	cmp	tmp1w, #0x20
-	b.eq	1f
-	b.lt	2f
-	stp	A_l, A_l, [dst, #-48]
-1:
-	stp	A_l, A_l, [dst, #-32]
-2:
-	stp	A_l, A_l, [dst, #-16]
-
-L(tail15):
-	and	count, count, #15
-	add	dst, dst, count
-	stp	A_l, A_l, [dst, #-16]	/* Repeat some/all of last store. */
-	RET
+#define dst_w		w8
+#define dstend		x9
+
+	.globl	memset
+	cfi_startproc
+
+#if HAVE_IFUNC && !defined (IS_IN_rtld)
+/* Rather than decode dczid_el0 every time, checking for zva disabled and
+   unpacking the line size, do this once in the indirect function and choose
+   an appropriate entry point which encodes these values as constants.  */
 
-L(tail15tiny):
-	/* Set up to 15 bytes.  Does not assume earlier memory
-	   being set.  */
-	tbz	count, #3, 1f
-	str	A_l, [dst], #8
-1:
-	tbz	count, #2, 1f
-	str	A_lw, [dst], #4
-1:
-	tbz	count, #1, 1f
-	strh	A_lw, [dst], #2
-1:
-	tbz	count, #0, 1f
-	strb	A_lw, [dst]
-1:
+	.type	memset, %gnu_indirect_function
+memset:
+	mrs	x1, dczid_el0
+	adrp	x0, 1f
+	tst	x1, #16				/* test for zva disabled */
+	and	x1, x1, #15
+	add	x0, x0, #:lo12:1f
+	csel	x1, xzr, x1, ne			/* squash index to 0 if so */
+	ldrsw	x2, [x0, x1, lsl #2]
+	add	x0, x0, x2
 	RET
+	.size	memset, .-memset
+
+	.section .rodata
+1:	.long	memset_nozva - 1b		// 0
+	.long	memset_nozva - 1b		// 1
+	.long	memset_nozva - 1b		// 2
+	.long	memset_nozva - 1b		// 3
+	.long	memset_zva_64 - 1b		// 4
+	.long	memset_zva_128 - 1b		// 5
+	.long	memset_zva_256 - 1b		// 6
+	.long	memset_zva_512 - 1b		// 7
+	.long	memset_zva_1024 - 1b	// 8
+	.long	memset_zva_2048 - 1b	// 9
+	.long	memset_zva_4096 - 1b	// 10
+	.long	memset_zva_8192 - 1b	// 11
+	.long	memset_zva_16384 - 1b	// 12
+	.long	memset_zva_32768 - 1b	// 13
+	.long	memset_zva_65536 - 1b	// 14
+	.long	memset_zva_131072 - 1b	// 15
+	.previous
+
+/* The 64 byte zva size is too small, and needs unrolling for efficiency.  */
 
-	/* Critical loop.  Start at a new cache line boundary.  Assuming
-	 * 64 bytes per line, this ensures the entire loop is in one line.  */
 	.p2align 6
-L(not_short):
-	neg	tmp2, dst
-	ands	tmp2, tmp2, #15
-	b.eq	2f
-	/* Bring DST to 128-bit (16-byte) alignment.  We know that there's
-	 * more than that to set, so we simply store 16 bytes and advance by
-	 * the amount required to reach alignment.  */
-	sub	count, count, tmp2
-	stp	A_l, A_l, [dst]
-	add	dst, dst, tmp2
-	/* There may be less than 63 bytes to go now.  */
-	cmp	count, #63
-	b.le	L(tail63)
-2:
-	sub	dst, dst, #16		/* Pre-bias.  */
-	sub	count, count, #64
-1:
-	stp	A_l, A_l, [dst, #16]
-	stp	A_l, A_l, [dst, #32]
-	stp	A_l, A_l, [dst, #48]
-	stp	A_l, A_l, [dst, #64]!
-	subs	count, count, #64
-	b.ge	1b
-	tst	count, #0x3f
-	add	dst, dst, #16
-	b.ne	L(tail63)
+	.type	memset_zva_64, %function
+memset_zva_64:
+	CALL_MCOUNT
+	and	valw, valw, #255
+	cmp	count, #256
+	ccmp	valw, #0, #0, hs	/* hs ? cmp val,0 : !z */
+	b.ne	L(nz_or_small)
+
+	stp	xzr, xzr, [dstin]	/* first 16 aligned 1.  */
+	and	tmp2, dstin, #-16
+	and	dst, dstin, #-64
+
+	stp	xzr, xzr, [tmp2, #16]	/* first 64 aligned 16.  */
+	add	dstend, dstin, count
+	add	dst, dst, #64
+
+	stp	xzr, xzr, [tmp2, #32]
+	sub	count, dstend, dst	/* recompute for misalign */
+	add	tmp1, dst, #64
+
+	stp	xzr, xzr, [tmp2, #48]
+	sub	count, count, #128	/* pre-bias */
+
+	stp	xzr, xzr, [tmp2, #64]
+
+	.p2align 6,,24
+0:	dc	zva, dst
+	subs	count, count, #128
+	dc	zva, tmp1
+	add	dst, dst, #128
+	add	tmp1, tmp1, #128
+	b.hs	0b
+
+	adds	count, count, #128	/* undo pre-bias */
+	b.ne	L(zva_tail)
 	RET
 
-#ifndef DONT_USE_DC
-	/* For zeroing memory, check to see if we can use the ZVA feature to
-	 * zero entire 'cache' lines.  */
-L(zero_mem):
-	mov	A_l, #0
-	cmp	count, #63
-	b.le	L(tail_maybe_tiny)
-	neg	tmp2, dst
-	ands	tmp2, tmp2, #15
-	b.eq	1f
-	sub	count, count, tmp2
-	stp	A_l, A_l, [dst]
-	add	dst, dst, tmp2
-	cmp	count, #63
-	b.le	L(tail63)
-1:
-	/* For zeroing small amounts of memory, it's not worth setting up
-	 * the line-clear code.  */
-	cmp	count, #128
-	b.lt	L(not_short)
-#ifdef MAYBE_VIRT
-	/* For efficiency when virtualized, we cache the ZVA capability.  */
-	adrp	tmp2, L(cache_clear)
-	ldr	zva_len, [tmp2, #:lo12:L(cache_clear)]
-	tbnz	zva_len, #31, L(not_short)
-	cbnz	zva_len, L(zero_by_line)
-	mrs	tmp1, dczid_el0
-	tbz	tmp1, #4, 1f
-	/* ZVA not available.  Remember this for next time.  */
-	mov	zva_len, #~0
-	str	zva_len, [tmp2, #:lo12:L(cache_clear)]
-	b	L(not_short)
-1:
-	mov	tmp3w, #4
-	and	zva_len, tmp1w, #15	/* Safety: other bits reserved.  */
-	lsl	zva_len, tmp3w, zva_len
-	str	zva_len, [tmp2, #:lo12:L(cache_clear)]
+	.size	memset_zva_64, . - memset_zva_64
+
+/* For larger zva sizes, a simple loop ought to suffice.  */
+/* ??? Needs performance testing, when such hardware becomes available.  */
+
+.macro do_zva len
+	.p2align 4
+	.type	memset_zva_\len, %function
+memset_zva_\len:
+	CALL_MCOUNT
+	and	valw, valw, #255
+	cmp	count, #\len
+	ccmp	valw, #0, #0, hs	/* hs ? cmp val,0 : !z */
+	b.ne	L(nz_or_small)
+
+	add	dstend, dstin, count
+	mov	zva_len, #\len
+	mov	zva_mask, #\len-1
+	b	memset_zva_n
+
+	.size	memset_zva_\len, . - memset_zva_\len
+.endm
+
+	do_zva 128	// 5
+	do_zva 256	// 6
+	do_zva 512	// 7
+	do_zva 1024	// 8
+	do_zva 2048	// 9
+	do_zva 4096	// 10
+	do_zva 8192	// 11
+	do_zva 16384	// 12
+	do_zva 32768	// 13
+	do_zva 65536	// 14
+	do_zva 131072	// 15
+
+	.p2align 6
 #else
+/* Without IFUNC, we must load the zva data from the dczid register.  */
+
+	.p2align 6
+	.type	memset, %function
+memset:
+	and	valw, valw, #255
+	cmp	count, #256
+	ccmp	valw, #0, #0, hs	/* hs ? cmp val,0 : !z */
+	b.ne	L(nz_or_small)
+
 	mrs	tmp1, dczid_el0
-	tbnz	tmp1, #4, L(not_short)
-	mov	tmp3w, #4
-	and	zva_len, tmp1w, #15	/* Safety: other bits reserved.  */
-	lsl	zva_len, tmp3w, zva_len
-#endif
-
-L(zero_by_line):
-	/* Compute how far we need to go to become suitably aligned.  We're
-	 * already at quad-word alignment.  */
+	tbnz	tmp1, #4, L(nz_or_small)
+
+	and	tmp1w, tmp1w, #15
+	mov	zva_len, #4
+	add	dstend, dstin, count
+	lsl	zva_len, zva_len, tmp1w
 	cmp	count, zva_len_x
-	b.lt	L(not_short)		/* Not enough to reach alignment.  */
-	sub	zva_bits_x, zva_len_x, #1
-	neg	tmp2, dst
-	ands	tmp2, tmp2, zva_bits_x
-	b.eq	1f			/* Already aligned.  */
-	/* Not aligned, check that there's enough to copy after alignment.  */
-	sub	tmp1, count, tmp2
-	cmp	tmp1, #64
-	ccmp	tmp1, zva_len_x, #8, ge	/* NZCV=0b1000 */
-	b.lt	L(not_short)
-	/* We know that there's at least 64 bytes to zero and that it's safe
-	 * to overrun by 64 bytes.  */
-	mov	count, tmp1
-2:
-	stp	A_l, A_l, [dst]
-	stp	A_l, A_l, [dst, #16]
-	stp	A_l, A_l, [dst, #32]
-	subs	tmp2, tmp2, #64
-	stp	A_l, A_l, [dst, #48]
-	add	dst, dst, #64
-	b.ge	2b
-	/* We've overrun a bit, so adjust dst downwards.  */
-	add	dst, dst, tmp2
-1:
-	sub	count, count, zva_len_x
-3:
-	dc	zva, dst
-	add	dst, dst, zva_len_x
+	sub	zva_mask, zva_len, #1
+	b.lo	L(ge_64)
+
+	/* Fall through into memset_zva_n.  */
+	.size	memset, . - memset
+#endif /* HAVE_IFUNC */
+
+/* Main part of the zva path.  On arrival here, we've already checked for
+   minimum size and that VAL is zero.  Also, we've set up zva_len and mask. */
+
+	.type	memset_zva_n, %function
+memset_zva_n:
+	stp	xzr, xzr, [dstin]	/* first 16 aligned 1.  */
+	neg	tmp1w, dstin_w
+	sub	count, count, zva_len_x	/* pre-bias */
+	mov	dst, dstin
+	ands	tmp1w, tmp1w, zva_mask
+	b.ne	3f
+
+	.p2align 6,,16
+2:	dc	zva, dst
 	subs	count, count, zva_len_x
-	b.ge	3b
-	ands	count, count, zva_bits_x
-	b.ne	L(tail_maybe_long)
+	add	dst, dst, zva_len_x
+	b.hs	2b
+
+	adds	count, count, zva_len_x	/* undo pre-bias */
+	b.ne	L(zva_tail)
 	RET
-#ifdef MAYBE_VIRT
-	.bss
-	.p2align 2
-L(cache_clear):
-	.space 4
-#endif
-#endif /* DONT_USE_DC */
-
-END (__memset)
-weak_alias (__memset, memset)
+
+	.p2align 4
+3:	and	tmp2, dstin, #-16
+	sub	count, count, tmp1	/* account for misalign */
+	add	dst, dstin, tmp1
+
+	.p2align 6,,24
+4:	stp	xzr, xzr, [tmp2, #16]
+	stp	xzr, xzr, [tmp2, #32]
+	subs	tmp1w, tmp1w, #64
+	stp	xzr, xzr, [tmp2, #48]
+	stp	xzr, xzr, [tmp2, #64]!
+	b.hi	4b
+
+	b	2b
+
+	.size	memset_zva_n, . - memset_zva_n
+
+/* The non-zva path.  */
+
+	.p2align 6
+	.type	memset_nozva, %function
+memset_nozva:
+	CALL_MCOUNT
+	and	valw, valw, #255
+L(nz_or_small):
+	orr	valw, valw, valw, lsl #8  /* replicate the byte */
+	cmp	count, #64
+	orr	valw, valw, valw, lsl #16
+	add	dstend, dstin, count	  /* remember end of buffer */
+	orr	val, val, val, lsl #32
+	b.hs	L(ge_64)
+
+	/* Small data -- original count is less than 64 bytes.  */
+L(le_63):
+	cmp	count, #16
+	b.lo	L(le_15)
+	stp	val, val, [dstin]
+	tbz	count, #5, L(le_31)
+	stp	val, val, [dstin, #16]
+	stp	val, val, [dstend, #-32]
+L(le_31):
+	stp	val, val, [dstend, #-16]
+	RET
+	.p2align 6,,16
+L(le_15):
+	tbz	count, #3, L(le_7)
+	str	val, [dstin]
+	str	val, [dstend, #-8]
+	RET
+	.p2align 6,,16
+L(le_7):
+	tbz	count, #2, L(le_3)
+	str	valw, [dstin]
+	str	valw, [dstend, #-4]
+	RET
+	.p2align 6,,20
+L(le_3):
+	tbz	count, #1, L(le_1)
+	strh	valw, [dstend, #-2]
+L(le_1):
+	tbz	count, #0, L(le_0)
+	strb	valw, [dstin]
+L(le_0):
+	RET
+
+	.p2align 6
+L(ge_64):
+	and	dst, dstin, #-16	/* align the pointer / pre-bias.  */
+	stp	val, val, [dstin]	/* first 16 align 1 */
+	sub	count, dstend, dst	/* begin misalign recompute */
+	subs	count, count, #16+64	/* finish recompute + pre-bias */
+	b.ls	L(loop_tail)
+
+	.p2align 6,,24
+L(loop):
+	stp	val, val, [dst, #16]
+	stp	val, val, [dst, #32]
+	subs	count, count, #64
+	stp	val, val, [dst, #48]
+	stp	val, val, [dst, #64]!
+	b.hs	L(loop)
+
+	adds	count, count, #64	/* undo pre-bias */
+	b.ne	L(loop_tail)
+	RET
+
+	/* Tail of the zva loop.  Less than ZVA bytes, but possibly lots
+	   more than 64.  Note that dst is aligned but unbiased.  */
+L(zva_tail):
+	subs	count, count, #64	/* pre-bias */
+	sub	dst, dst, #16		/* pre-bias */
+	b.hi	L(loop)
+
+	/* Tail of the stp loop; less than 64 bytes left.
+	   Note that dst is still aligned and biased by -16.  */
+L(loop_tail):
+	stp	val, val, [dstend, #-64]
+	stp	val, val, [dstend, #-48]
+	stp	val, val, [dstend, #-32]
+	stp	val, val, [dstend, #-16]
+	RET
+
+	.size	memset_nozva, . - memset_nozva
+	cfi_endproc
+
+strong_alias (memset, __memset)
 libc_hidden_builtin_def (memset)
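
The L(le_63)/L(le_15) small-size handling above corresponds roughly to this C
sketch (illustrative, not the patch): replicate the byte into a 64-bit value,
then cover the buffer with a few possibly overlapping stores issued from both
ends, instead of a store-per-byte loop:

#include <stdint.h>
#include <string.h>

static void
set_small (void *dst, int c, size_t n)          /* n < 64 */
{
  uint64_t v = (uint8_t) c * 0x0101010101010101ull;
  char *p = dst, *end = p + n;

  if (n >= 16)
    {
      memcpy (p, &v, 8);  memcpy (p + 8, &v, 8);            /* first 16 */
      if (n >= 32)
        {
          memcpy (p + 16, &v, 8);    memcpy (p + 24, &v, 8);
          memcpy (end - 32, &v, 8);  memcpy (end - 24, &v, 8);
        }
      memcpy (end - 16, &v, 8);  memcpy (end - 8, &v, 8);    /* last 16 */
    }
  else if (n >= 8)
    { memcpy (p, &v, 8);  memcpy (end - 8, &v, 8); }
  else if (n >= 4)
    { memcpy (p, &v, 4);  memcpy (end - 4, &v, 4); }
  else
    {
      if (n >= 2) memcpy (end - 2, &v, 2);
      if (n & 1)  *p = (char) v;
    }
}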

