* [PATCH] aarch64: Improve strcmp unaligned performance
@ 2017-12-01 5:29 Siddhesh Poyarekar
2017-12-01 6:55 ` Andrew Pinski
2017-12-04 16:07 ` [PING][PATCH] " Siddhesh Poyarekar
0 siblings, 2 replies; 6+ messages in thread
From: Siddhesh Poyarekar @ 2017-12-01 5:29 UTC (permalink / raw)
To: libc-alpha
Replace the simple byte-wise compare in the misaligned case with a
dword compare that performs page boundary checks. For simplicity I've
chosen a 4K page boundary so that we don't have to query the actual
page size on the system.
This results in up to 3x improvement in performance in the unaligned
case on falkor and about 2.5x improvement on mustang as measured using
bench-strcmp.
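[A 4K boundary is safe to hard-code because every translation granule AArch64 supports (4K, 16K, 64K) is a multiple of 4K, so stopping short of each 4K boundary also stops short of every real page boundary. A minimal C sketch of the check the patch performs with and/eor/cbz (the function name is illustrative, not from the patch):]

```c
#include <stdint.h>

/* Nonzero when P lies in the last dword (8 bytes) of a 4K page, i.e.
   when bits [11:3] of the address are all ones.  An 8-byte load from
   such an address could cross into the next page, so the byte-wise
   loop must be used instead.  */
static int
in_last_dword_of_4k_page (const void *p)
{
  return ((uintptr_t) p & 0xff8) == 0xff8;
}
```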
* sysdeps/aarch64/strcmp.S (misaligned8): Compare dword at a
time whenever possible.
---
sysdeps/aarch64/strcmp.S | 31 +++++++++++++++++++++++++++++--
1 file changed, 29 insertions(+), 2 deletions(-)
diff --git a/sysdeps/aarch64/strcmp.S b/sysdeps/aarch64/strcmp.S
index e99d662..c260e1d 100644
--- a/sysdeps/aarch64/strcmp.S
+++ b/sysdeps/aarch64/strcmp.S
@@ -72,6 +72,7 @@ L(start_realigned):
cbz syndrome, L(loop_aligned)
/* End of performance-critical section -- one 64B cache line. */
+L(end):
#ifndef __AARCH64EB__
rev syndrome, syndrome
rev data1, data1
@@ -145,12 +146,38 @@ L(mutual_align):
b L(start_realigned)
L(misaligned8):
- /* We can do better than this. */
+ /* Align SRC1 to 8 bytes and then compare 8 bytes at a time, always
+ checking that we don't read beyond the page boundary in
+ SRC2. */
+ tst src1, #7
+ b.eq L(loop_misaligned)
+L(do_misaligned):
ldrb data1w, [src1], #1
ldrb data2w, [src2], #1
cmp data1w, #1
ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */
- b.eq L(misaligned8)
+ b.ne L(done)
+ tst src1, #7
+ b.ne L(misaligned8)
+
+L(loop_misaligned):
+ /* Test if we are within the last dword before the end of a 4K page. If
+ so, jump back to the byte-wise loop to compare a byte at a time. */
+ and tmp1, src2, #0xff8
+ eor tmp1, tmp1, #0xff8
+ cbz tmp1, L(do_misaligned)
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ bic has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */
+ orr syndrome, diff, has_nul
+ cbz syndrome, L(loop_misaligned)
+ b L(end)
+
+L(done):
sub result, data1, data2
RET
END(strcmp)
--
2.7.5
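[For readers less familiar with the zero-detection idiom, the dword step and the cmp/ccmp byte step above can be modelled roughly as the following C sketch; the function names are illustrative, while REP8_01/REP8_7f mirror the repeated-byte constants the glibc source defines.]

```c
#include <stdint.h>

#define REP8_01 0x0101010101010101ULL  /* "zeroones" in the asm */
#define REP8_7f 0x7f7f7f7f7f7f7f7fULL

/* Model of the sub/orr/eor/bic/orr sequence: the result is non-zero
   iff the two dwords differ or DATA1 contains a NUL byte.  */
static uint64_t
syndrome (uint64_t data1, uint64_t data2)
{
  uint64_t has_nul = (data1 - REP8_01) & ~(data1 | REP8_7f);
  uint64_t diff = data1 ^ data2;
  return diff | has_nul;
}

/* Model of the cmp/ccmp pair in the byte-wise loop: continue only
   while the byte from SRC1 is non-NUL and matches the byte from
   SRC2.  */
static int
byte_step_continues (unsigned char b1, unsigned char b2)
{
  return b1 != 0 && b1 == b2;
}
```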
* Re: [PATCH] aarch64: Improve strcmp unaligned performance
2017-12-01 5:29 [PATCH] aarch64: Improve strcmp unaligned performance Siddhesh Poyarekar
@ 2017-12-01 6:55 ` Andrew Pinski
2017-12-07 12:38 ` Siddhesh Poyarekar
2017-12-04 16:07 ` [PING][PATCH] " Siddhesh Poyarekar
1 sibling, 1 reply; 6+ messages in thread
From: Andrew Pinski @ 2017-12-01 6:55 UTC (permalink / raw)
To: Siddhesh Poyarekar; +Cc: GNU C Library
On Thu, Nov 30, 2017 at 9:26 PM, Siddhesh Poyarekar
<siddhesh@sourceware.org> wrote:
> Replace the simple byte-wise compare in the misaligned case with a
> dword compare with page boundary checks in place. For simplicity I've
> chosen a 4K page boundary so that we don't have to query the actual
> page size on the system.
>
> This results in up to 3x improvement in performance in the unaligned
> case on falkor and about 2.5x improvement on mustang as measured using
> bench-strcmp.
On at least Cavium OcteonTX T81 and T83 (and ThunderX1 T88), this
might be slower, since any unaligned load or store has a 9-cycle
penalty. I will double-check this next week. If it does turn out to
be slower, I will propose a strcmp version for OcteonTX and
ThunderX1 on top of this patch.
Thanks,
Andrew Pinski
* [PING][PATCH] aarch64: Improve strcmp unaligned performance
2017-12-01 5:29 [PATCH] aarch64: Improve strcmp unaligned performance Siddhesh Poyarekar
2017-12-01 6:55 ` Andrew Pinski
@ 2017-12-04 16:07 ` Siddhesh Poyarekar
2017-12-12 18:20 ` [PING*2][PATCH] " Siddhesh Poyarekar
1 sibling, 1 reply; 6+ messages in thread
From: Siddhesh Poyarekar @ 2017-12-04 16:07 UTC (permalink / raw)
To: libc-alpha
Ping!
On Friday 01 December 2017 10:56 AM, Siddhesh Poyarekar wrote:
> Replace the simple byte-wise compare in the misaligned case with a
> dword compare with page boundary checks in place. For simplicity I've
> chosen a 4K page boundary so that we don't have to query the actual
> page size on the system.
* Re: [PATCH] aarch64: Improve strcmp unaligned performance
2017-12-01 6:55 ` Andrew Pinski
@ 2017-12-07 12:38 ` Siddhesh Poyarekar
0 siblings, 0 replies; 6+ messages in thread
From: Siddhesh Poyarekar @ 2017-12-07 12:38 UTC (permalink / raw)
To: Andrew Pinski; +Cc: GNU C Library
On Friday 01 December 2017 12:25 PM, Andrew Pinski wrote:
> On at least Cavium OcteonTX T81 and T83 (and ThunderX1 T88), this
> might be slower as any
> unaligned load/stores have a 9 cycle penalty. I will double check
> this next week.
> What this means is I will propose a strcmp version for OcteonTX and
> ThunderX1 above this patch if I find it is slower.
Hi Andrew, were you able to test this?
Siddhesh
* [PING*2][PATCH] aarch64: Improve strcmp unaligned performance
2017-12-04 16:07 ` [PING][PATCH] " Siddhesh Poyarekar
@ 2017-12-12 18:20 ` Siddhesh Poyarekar
2017-12-12 18:27 ` Szabolcs Nagy
0 siblings, 1 reply; 6+ messages in thread
From: Siddhesh Poyarekar @ 2017-12-12 18:20 UTC (permalink / raw)
To: libc-alpha; +Cc: Szabolcs Nagy
Ping!
On Monday 04 December 2017 09:37 PM, Siddhesh Poyarekar wrote:
> Ping!
>
> On Friday 01 December 2017 10:56 AM, Siddhesh Poyarekar wrote:
>> Replace the simple byte-wise compare in the misaligned case with a
>> dword compare with page boundary checks in place. For simplicity I've
>> chosen a 4K page boundary so that we don't have to query the actual
>> page size on the system.
* Re: [PING*2][PATCH] aarch64: Improve strcmp unaligned performance
2017-12-12 18:20 ` [PING*2][PATCH] " Siddhesh Poyarekar
@ 2017-12-12 18:27 ` Szabolcs Nagy
0 siblings, 0 replies; 6+ messages in thread
From: Szabolcs Nagy @ 2017-12-12 18:27 UTC (permalink / raw)
To: siddhesh, libc-alpha; +Cc: nd
On 12/12/17 18:20, Siddhesh Poyarekar wrote:
> Ping!
>
Sorry for the delay; this is OK to commit.
I believe it should give a performance improvement in general.