public inbox for libc-alpha@sourceware.org
* [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755]
@ 2022-01-09 12:29 Noah Goldstein
  2022-01-09 12:29 ` [PATCH v1 2/5] x86: Optimize strcmp-evex.S " Noah Goldstein
                   ` (6 more replies)
  0 siblings, 7 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-09 12:29 UTC (permalink / raw)
  To: libc-alpha

Fixes [BZ# 28755] for wcsncmp by redirecting lengths >= 2^56 to
__wcscmp_avx2. On x86_64 such a length exceeds the entire address
range, so it could not possibly be used to bound `s1` or `s2`.
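
For reference, a minimal C sketch of the new length handling
(illustrative only; the function name and fallback calls are
stand-ins, not the actual glibc entry points):

  #include <stddef.h>
  #include <wchar.h>

  /* Rough model of what the patched AVX2 wcsncmp does with its length
     argument before the vectorized comparison.  */
  static int
  wcsncmp_avx2_sketch (const wchar_t *s1, const wchar_t *s2, size_t n)
  {
    /* A length with any of bits [63:56] set cannot bound real memory
       on x86_64, and n * sizeof (wchar_t) could overflow, so such
       lengths are redirected to the unbounded comparison.  */
    if (n >> 56)
      return wcscmp (s1, s2);

    /* Converting to a byte count is now safe: n < 2^56, so n * 4 fits
       comfortably in 64 bits.  This byte count bounds the vector loop
       in the real code.  */
    size_t byte_bound = n * sizeof (wchar_t);
    (void) byte_bound;

    return wcsncmp (s1, s2, n);  /* stands in for the vectorized path */
  }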

The optimizations are primarily to the loop logic and to how the page
cross logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page that do not actually cross it. The extra cost comes
from retesting the page cross conditions with a non-faulting check and
from improving the logic for entering the loop afterwards. This only
affects particular cases, however, and is generally made up for by more
than 10x improvements on the transition from the page cross -> loop
case.
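
To make the false-positive point concrete, here is a rough C model of
the entry check and the retest (constants match the asm; the helper
names are mine, not glibc functions):

  #include <stdbool.h>
  #include <stdint.h>

  #define PAGE_SIZE 4096
  #define VEC_SIZE  32

  /* Cheap check done on every call.  Because it ORs the two page
     offsets it can claim a "page cross" even when neither pointer is
     within 4x VEC_SIZE of a page end.  */
  static bool
  maybe_page_cross (uintptr_t s1, uintptr_t s2)
  {
    return ((s1 | s2) & (PAGE_SIZE - 1)) > PAGE_SIZE - VEC_SIZE * 4;
  }

  /* Non-faulting retest on the slow path: look at each pointer's own
     page offset before committing to the expensive page cross logic.  */
  static bool
  really_near_page_end (uintptr_t s)
  {
    return (s & (PAGE_SIZE - 1)) > PAGE_SIZE - VEC_SIZE * 4;
  }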

The non-page cross cases are improved most for smaller sizes [0, 128]
and are about even for (128, 4096]. The loop's page cross logic is
also improved, so a more significant speedup is seen there as well.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
Numbers attached in reply.

Numbers are geometric mean of N=20 runs.
Numbers were collected on: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i71165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

The 'score' column is (time current) / (time new). The "greener" the
number, the larger the improvement; the "redder", the larger the
regression.
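
For example, a case where the current implementation takes 150ns and
the new one takes 100ns scores 150 / 100 = 1.5 (a 1.5x speedup); a
score below 1.0 is a regression.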

Some notes on the numbers:

There are three cases of regressions:

1. Small values in the page cross case. The regression is because the
new code spends extra work checking whether the page cross was a false
positive and setting up the transition to the loop case more smoothly.
I don't see any way around this, and the flip side of these regressions
is 500% speedups in either the false-positive case or the
continuation.

2. Cases where the string barely crosses the page. The regression is
because the current logic does a single-byte loop on exit, which is
ultimately a faster check for very small strings. The flip side of
this is 20000% speedups. I think the same reasoning that has us
implement strncmp with vectors also supports replacing the one at a
time byte loop with something that scales better (see the sketch after
these notes).

3. The avx2 case for [128, 512] is within [-5%, +5%]. There are some
regressions here and I am unsure exactly why that is the case. In
general I am less happy with the quality of the avx2 implementation
and believe it still needs some work. I still think it is an
improvement because of the gains in the [0, 128] case, many of the
page cross cases and the [513, inf] cases, but if people think
otherwise it may be best to skip the patch. Note the patch is also for
[BZ# 28755], although a separate fix for that would be simple enough.

Aside from these 3 regressions there are mostly modest improvements,
plus some dramatic improvements where the one at a time byte loops
were eliminated.
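
Regarding point 2, here is a hedged intrinsics sketch of the
vector-plus-mask approach that could replace the one at a time byte
loop (illustrative only: the function name and bound handling are
mine, and the real code must separately guarantee that the 32-byte
loads cannot touch an unmapped page):

  #include <immintrin.h>
  #include <stdbool.h>

  /* Check a whole 32-byte vector, then ignore result bits at or past
     the remaining length n (the asm uses bzhi for the masking).  */
  static bool
  stop_within_bound (const unsigned char *s1, const unsigned char *s2,
                     unsigned n /* 1..32 */)
  {
    __m256i v1   = _mm256_loadu_si256 ((const __m256i *) s1);
    __m256i v2   = _mm256_loadu_si256 ((const __m256i *) s2);
    __m256i zero = _mm256_setzero_si256 ();

    /* 1s where the bytes are equal and s1's byte is not NUL.  */
    __m256i ok = _mm256_andnot_si256 (_mm256_cmpeq_epi8 (v1, zero),
                                      _mm256_cmpeq_epi8 (v1, v2));

    /* Bit i set => position i is a mismatch or end of string.  */
    unsigned stop = ~(unsigned) _mm256_movemask_epi8 (ok);

    /* Drop positions at or beyond the length bound.  */
    if (n < 32)
      stop &= (1u << n) - 1;
    return stop != 0;
  }

If the mask is non-zero, tzcnt of it gives the index of the first
differing or terminating character; if it is zero the strings are
equal up to the bound.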


 sysdeps/x86_64/multiarch/strcmp-avx2.S | 1586 ++++++++++++++----------
 1 file changed, 942 insertions(+), 644 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
index a45f9d2749..28d6a0025a 100644
--- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
+++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
@@ -26,35 +26,57 @@
 
 # define PAGE_SIZE	4096
 
-/* VEC_SIZE = Number of bytes in a ymm register */
+	/* VEC_SIZE = Number of bytes in a ymm register.  */
 # define VEC_SIZE	32
 
-/* Shift for dividing by (VEC_SIZE * 4).  */
-# define DIVIDE_BY_VEC_4_SHIFT	7
-# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
-#  error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
-# endif
+# define VMOVU	vmovdqu
+# define VMOVA	vmovdqa
 
 # ifdef USE_AS_WCSCMP
-/* Compare packed dwords.  */
+	/* Compare packed dwords.  */
 #  define VPCMPEQ	vpcmpeqd
-/* Compare packed dwords and store minimum.  */
+	/* Compare packed dwords and store minimum.  */
 #  define VPMINU	vpminud
-/* 1 dword char == 4 bytes.  */
+	/* 1 dword char == 4 bytes.  */
 #  define SIZE_OF_CHAR	4
 # else
-/* Compare packed bytes.  */
+	/* Compare packed bytes.  */
 #  define VPCMPEQ	vpcmpeqb
-/* Compare packed bytes and store minimum.  */
+	/* Compare packed bytes and store minimum.  */
 #  define VPMINU	vpminub
-/* 1 byte char == 1 byte.  */
+	/* 1 byte char == 1 byte.  */
 #  define SIZE_OF_CHAR	1
 # endif
 
+# ifdef USE_AS_STRNCMP
+#  define LOOP_REG	r9d
+#  define LOOP_REG64	r9
+
+#  define OFFSET_REG8	r9b
+#  define OFFSET_REG	r9d
+#  define OFFSET_REG64	r9
+# else
+#  define LOOP_REG	edx
+#  define LOOP_REG64	rdx
+
+#  define OFFSET_REG8	dl
+#  define OFFSET_REG	edx
+#  define OFFSET_REG64	rdx
+# endif
+
 # ifndef VZEROUPPER
 #  define VZEROUPPER	vzeroupper
 # endif
 
+# if defined USE_AS_STRNCMP
+#  define VEC_OFFSET	0
+# else
+#  define VEC_OFFSET	(-VEC_SIZE)
+# endif
+
+# define xmmZERO	xmm15
+# define ymmZERO	ymm15
+
 # ifndef SECTION
 #  define SECTION(p)	p##.avx
 # endif
@@ -79,773 +101,1049 @@
    the maximum offset is reached before a difference is found, zero is
    returned.  */
 
-	.section SECTION(.text),"ax",@progbits
-ENTRY (STRCMP)
+	.section SECTION(.text), "ax", @progbits
+ENTRY(STRCMP)
 # ifdef USE_AS_STRNCMP
-	/* Check for simple cases (0 or 1) in offset.  */
+#  ifdef __ILP32__
+	/* Clear the upper 32 bits.  */
+	movl	%edx, %rdx
+#  endif
 	cmp	$1, %RDX_LP
-	je	L(char0)
-	jb	L(zero)
+	/* Signed comparison intentional. We use this branch to also
+	   test cases where length >= 2^63. These very large sizes can be
+	   handled with strcmp as there is no way for that length to
+	   actually bound the buffer.  */
+	jle	L(one_or_less)
 #  ifdef USE_AS_WCSCMP
-	/* Convert units: from wide to byte char.  */
-	shl	$2, %RDX_LP
+	movq	%rdx, %rcx
+
+	/* Multiplying length by sizeof(wchar_t) can result in overflow.
+	   Check if that is possible. All cases where overflow are possible
+	   are cases where length is large enough that it can never be a
+	   bound on valid memory so just use wcscmp.  */
+	shrq	$56, %rcx
+	jnz	__wcscmp_avx2
+
+	leaq	(, %rdx, 4), %rdx
 #  endif
-	/* Register %r11 tracks the maximum offset.  */
-	mov	%RDX_LP, %R11_LP
 # endif
+	vpxor	%xmmZERO, %xmmZERO, %xmmZERO
 	movl	%edi, %eax
-	xorl	%edx, %edx
-	/* Make %xmm7 (%ymm7) all zeros in this function.  */
-	vpxor	%xmm7, %xmm7, %xmm7
 	orl	%esi, %eax
-	andl	$(PAGE_SIZE - 1), %eax
-	cmpl	$(PAGE_SIZE - (VEC_SIZE * 4)), %eax
-	jg	L(cross_page)
-	/* Start comparing 4 vectors.  */
-	vmovdqu	(%rdi), %ymm1
-	VPCMPEQ	(%rsi), %ymm1, %ymm0
-	VPMINU	%ymm1, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-	vpmovmskb %ymm0, %ecx
-	testl	%ecx, %ecx
-	je	L(next_3_vectors)
-	tzcntl	%ecx, %edx
+	sall	$20, %eax
+	/* Check if s1 or s2 may cross a page  in next 4x VEC loads.  */
+	cmpl	$((PAGE_SIZE -(VEC_SIZE * 4)) << 20), %eax
+	ja	L(page_cross)
+
+L(no_page_cross):
+	/* Safe to compare 4x vectors.  */
+	VMOVU	(%rdi), %ymm0
+	/* 1s where s1 and s2 equal.  */
+	VPCMPEQ	(%rsi), %ymm0, %ymm1
+	/* 1s at null CHAR.  */
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	/* 1s where s1 and s2 equal AND not null CHAR.  */
+	vpandn	%ymm1, %ymm2, %ymm1
+
+	/* All 1s -> keep going, any 0s -> return.  */
+	vpmovmskb %ymm1, %ecx
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx) is after the maximum
-	   offset (%r11).   */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$VEC_SIZE, %rdx
+	jbe	L(vec_0_test_len)
 # endif
+
+	/* All 1s represents all equals. incl will overflow to zero in
+	   all equals case. Otherwise 1s will carry until position of first
+	   mismatch.  */
+	incl	%ecx
+	jz	L(more_3x_vec)
+
+	.p2align 4,, 4
+L(return_vec_0):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_WCSCMP
+	movl	(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	je	L(return)
-L(wcscmp_return):
+	cmpl	(%rsi, %rcx), %edx
+	je	L(ret0)
 	setl	%al
 	negl	%eax
 	orl	$1, %eax
-L(return):
 # else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 # endif
+L(ret0):
 L(return_vzeroupper):
 	ZERO_UPPER_VEC_REGISTERS_RETURN
 
-	.p2align 4
-L(return_vec_size):
-	tzcntl	%ecx, %edx
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after
-	   the maximum offset (%r11).  */
-	addq	$VEC_SIZE, %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+	.p2align 4,, 8
+L(vec_0_test_len):
+	notl	%ecx
+	bzhil	%edx, %ecx, %eax
+	jnz	L(return_vec_0)
+	/* Align if will cross fetch block.  */
+	.p2align 4,, 2
+L(ret_zero):
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
-# else
+	VZEROUPPER_RETURN
+
+	.p2align 4,, 5
+L(one_or_less):
+	jb	L(ret_zero)
 #  ifdef USE_AS_WCSCMP
+	/* 'nbe' covers the case where length is negative (large
+	   unsigned).  */
+	jnbe	__wcscmp_avx2
+	movl	(%rdi), %edx
 	xorl	%eax, %eax
-	movl	VEC_SIZE(%rdi, %rdx), %ecx
-	cmpl	VEC_SIZE(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	cmpl	(%rsi), %edx
+	je	L(ret1)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 #  else
-	movzbl	VEC_SIZE(%rdi, %rdx), %eax
-	movzbl	VEC_SIZE(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	/* 'nbe' covers the case where length is negative (large
+	   unsigned).  */
+
+	jnbe	__strcmp_avx2
+	movzbl	(%rdi), %eax
+	movzbl	(%rsi), %ecx
+	subl	%ecx, %eax
 #  endif
+L(ret1):
+	ret
 # endif
-	VZEROUPPER_RETURN
 
-	.p2align 4
-L(return_2_vec_size):
-	tzcntl	%ecx, %edx
+	.p2align 4,, 10
+L(return_vec_1):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is
-	   after the maximum offset (%r11).  */
-	addq	$(VEC_SIZE * 2), %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+	/* rdx must be > CHAR_PER_VEC so save to subtract w.o fear of
+	   overflow.  */
+	addq	$-VEC_SIZE, %rdx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero)
+# endif
+# ifdef USE_AS_WCSCMP
+	movl	VEC_SIZE(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	VEC_SIZE(%rsi, %rcx), %edx
+	je	L(ret2)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 # else
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rdi, %rdx), %ecx
-	cmpl	(VEC_SIZE * 2)(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rdi, %rdx), %eax
-	movzbl	(VEC_SIZE * 2)(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	movzbl	VEC_SIZE(%rdi, %rcx), %eax
+	movzbl	VEC_SIZE(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 # endif
+L(ret2):
 	VZEROUPPER_RETURN
 
-	.p2align 4
-L(return_3_vec_size):
-	tzcntl	%ecx, %edx
+	.p2align 4,, 10
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is
-	   after the maximum offset (%r11).  */
-	addq	$(VEC_SIZE * 3), %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+L(return_vec_3):
+	salq	$32, %rcx
+# endif
+
+L(return_vec_2):
+# ifndef USE_AS_STRNCMP
+	tzcntl	%ecx, %ecx
+# else
+	tzcntq	%rcx, %rcx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 2)(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(VEC_SIZE * 2)(%rsi, %rcx), %edx
+	je	L(ret3)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 # else
+	movzbl	(VEC_SIZE * 2)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 2)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+# endif
+L(ret3):
+	VZEROUPPER_RETURN
+
+# ifndef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_vec_3):
+	tzcntl	%ecx, %ecx
 #  ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 3)(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(VEC_SIZE * 3)(%rdi, %rdx), %ecx
-	cmpl	(VEC_SIZE * 3)(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	cmpl	(VEC_SIZE * 3)(%rsi, %rcx), %edx
+	je	L(ret4)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 #  else
-	movzbl	(VEC_SIZE * 3)(%rdi, %rdx), %eax
-	movzbl	(VEC_SIZE * 3)(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	movzbl	(VEC_SIZE * 3)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 3)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 #  endif
-# endif
+L(ret4):
 	VZEROUPPER_RETURN
+# endif
+
+	.p2align 4,, 10
+L(more_3x_vec):
+	/* Safe to compare 4x vectors.  */
+	VMOVU	VEC_SIZE(%rdi), %ymm0
+	VPCMPEQ	VEC_SIZE(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_1)
+
+# ifdef USE_AS_STRNCMP
+	subq	$(VEC_SIZE * 2), %rdx
+	jbe	L(ret_zero)
+# endif
+
+	VMOVU	(VEC_SIZE * 2)(%rdi), %ymm0
+	VPCMPEQ	(VEC_SIZE * 2)(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_2)
+
+	VMOVU	(VEC_SIZE * 3)(%rdi), %ymm0
+	VPCMPEQ	(VEC_SIZE * 3)(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_3)
 
-	.p2align 4
-L(next_3_vectors):
-	vmovdqu	VEC_SIZE(%rdi), %ymm6
-	VPCMPEQ	VEC_SIZE(%rsi), %ymm6, %ymm3
-	VPMINU	%ymm6, %ymm3, %ymm3
-	VPCMPEQ	%ymm7, %ymm3, %ymm3
-	vpmovmskb %ymm3, %ecx
-	testl	%ecx, %ecx
-	jne	L(return_vec_size)
-	vmovdqu	(VEC_SIZE * 2)(%rdi), %ymm5
-	vmovdqu	(VEC_SIZE * 3)(%rdi), %ymm4
-	vmovdqu	(VEC_SIZE * 3)(%rsi), %ymm0
-	VPCMPEQ	(VEC_SIZE * 2)(%rsi), %ymm5, %ymm2
-	VPMINU	%ymm5, %ymm2, %ymm2
-	VPCMPEQ	%ymm4, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm2, %ymm2
-	vpmovmskb %ymm2, %ecx
-	testl	%ecx, %ecx
-	jne	L(return_2_vec_size)
-	VPMINU	%ymm4, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-	vpmovmskb %ymm0, %ecx
-	testl	%ecx, %ecx
-	jne	L(return_3_vec_size)
-L(main_loop_header):
-	leaq	(VEC_SIZE * 4)(%rdi), %rdx
-	movl	$PAGE_SIZE, %ecx
-	/* Align load via RAX.  */
-	andq	$-(VEC_SIZE * 4), %rdx
-	subq	%rdi, %rdx
-	leaq	(%rdi, %rdx), %rax
 # ifdef USE_AS_STRNCMP
-	/* Starting from this point, the maximum offset, or simply the
-	   'offset', DECREASES by the same amount when base pointers are
-	   moved forward.  Return 0 when:
-	     1) On match: offset <= the matched vector index.
-	     2) On mistmach, offset is before the mistmatched index.
+	cmpq	$(VEC_SIZE * 2), %rdx
+	jbe	L(ret_zero)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	/* any non-zero positive value that doesn't inference with 0x1.
 	 */
-	subq	%rdx, %r11
-	jbe	L(zero)
-# endif
-	addq	%rsi, %rdx
-	movq	%rdx, %rsi
-	andl	$(PAGE_SIZE - 1), %esi
-	/* Number of bytes before page crossing.  */
-	subq	%rsi, %rcx
-	/* Number of VEC_SIZE * 4 blocks before page crossing.  */
-	shrq	$DIVIDE_BY_VEC_4_SHIFT, %rcx
-	/* ESI: Number of VEC_SIZE * 4 blocks before page crossing.   */
-	movl	%ecx, %esi
-	jmp	L(loop_start)
+	movl	$2, %r8d
 
+# else
+	xorl	%r8d, %r8d
+# endif
+
+	/* The prepare labels are various entry points from the page
+	   cross logic.  */
+L(prepare_loop):
+
+# ifdef USE_AS_STRNCMP
+	/* Store N + (VEC_SIZE * 4) and place check at the begining of
+	   the loop.  */
+	leaq	(VEC_SIZE * 2)(%rdi, %rdx), %rdx
+# endif
+L(prepare_loop_no_len):
+
+	/* Align s1 and adjust s2 accordingly.  */
+	subq	%rdi, %rsi
+	andq	$-(VEC_SIZE * 4), %rdi
+	addq	%rdi, %rsi
+
+# ifdef USE_AS_STRNCMP
+	subq	%rdi, %rdx
+# endif
+
+L(prepare_loop_aligned):
+	/* eax stores distance from rsi to next page cross. These cases
+	   need to be handled specially as the 4x loop could potentially
+	   read memory past the length of s1 or s2 and across a page
+	   boundary.  */
+	movl	$-(VEC_SIZE * 4), %eax
+	subl	%esi, %eax
+	andl	$(PAGE_SIZE - 1), %eax
+
+	/* Loop 4x comparisons at a time.  */
 	.p2align 4
 L(loop):
+
+	/* End condition for strncmp.  */
 # ifdef USE_AS_STRNCMP
-	/* Base pointers are moved forward by 4 * VEC_SIZE.  Decrease
-	   the maximum offset (%r11) by the same amount.  */
-	subq	$(VEC_SIZE * 4), %r11
-	jbe	L(zero)
-# endif
-	addq	$(VEC_SIZE * 4), %rax
-	addq	$(VEC_SIZE * 4), %rdx
-L(loop_start):
-	testl	%esi, %esi
-	leal	-1(%esi), %esi
-	je	L(loop_cross_page)
-L(back_to_loop):
-	/* Main loop, comparing 4 vectors are a time.  */
-	vmovdqa	(%rax), %ymm0
-	vmovdqa	VEC_SIZE(%rax), %ymm3
-	VPCMPEQ	(%rdx), %ymm0, %ymm4
-	VPCMPEQ	VEC_SIZE(%rdx), %ymm3, %ymm1
-	VPMINU	%ymm0, %ymm4, %ymm4
-	VPMINU	%ymm3, %ymm1, %ymm1
-	vmovdqa	(VEC_SIZE * 2)(%rax), %ymm2
-	VPMINU	%ymm1, %ymm4, %ymm0
-	vmovdqa	(VEC_SIZE * 3)(%rax), %ymm3
-	VPCMPEQ	(VEC_SIZE * 2)(%rdx), %ymm2, %ymm5
-	VPCMPEQ	(VEC_SIZE * 3)(%rdx), %ymm3, %ymm6
-	VPMINU	%ymm2, %ymm5, %ymm5
-	VPMINU	%ymm3, %ymm6, %ymm6
-	VPMINU	%ymm5, %ymm0, %ymm0
-	VPMINU	%ymm6, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-
-	/* Test each mask (32 bits) individually because for VEC_SIZE
-	   == 32 is not possible to OR the four masks and keep all bits
-	   in a 64-bit integer register, differing from SSE2 strcmp
-	   where ORing is possible.  */
-	vpmovmskb %ymm0, %ecx
+	subq	$(VEC_SIZE * 4), %rdx
+	jbe	L(ret_zero)
+# endif
+
+	subq	$-(VEC_SIZE * 4), %rdi
+	subq	$-(VEC_SIZE * 4), %rsi
+
+	/* Check if rsi loads will cross a page boundary.  */
+	addl	$-(VEC_SIZE * 4), %eax
+	jnb	L(page_cross_during_loop)
+
+	/* Loop entry after handling page cross during loop.  */
+L(loop_skip_page_cross_check):
+	VMOVA	(VEC_SIZE * 0)(%rdi), %ymm0
+	VMOVA	(VEC_SIZE * 1)(%rdi), %ymm2
+	VMOVA	(VEC_SIZE * 2)(%rdi), %ymm4
+	VMOVA	(VEC_SIZE * 3)(%rdi), %ymm6
+
+	/* ymm1 all 1s where s1 and s2 equal. All 0s otherwise.  */
+	VPCMPEQ	(VEC_SIZE * 0)(%rsi), %ymm0, %ymm1
+
+	VPCMPEQ	(VEC_SIZE * 1)(%rsi), %ymm2, %ymm3
+	VPCMPEQ	(VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
+	VPCMPEQ	(VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
+
+
+	/* If any mismatches or null CHAR then 0 CHAR, otherwise non-
+	   zero.  */
+	vpand	%ymm0, %ymm1, %ymm1
+
+
+	vpand	%ymm2, %ymm3, %ymm3
+	vpand	%ymm4, %ymm5, %ymm5
+	vpand	%ymm6, %ymm7, %ymm7
+
+	VPMINU	%ymm1, %ymm3, %ymm3
+	VPMINU	%ymm5, %ymm7, %ymm7
+
+	/* Reduce all 0 CHARs for the 4x VEC into ymm7.  */
+	VPMINU	%ymm3, %ymm7, %ymm7
+
+	/* If any 0 CHAR then done.  */
+	VPCMPEQ	%ymm7, %ymmZERO, %ymm7
+	vpmovmskb %ymm7, %LOOP_REG
+	testl	%LOOP_REG, %LOOP_REG
+	jz	L(loop)
+
+	/* Find which VEC has the mismatch of end of string.  */
+	VPCMPEQ	%ymm1, %ymmZERO, %ymm1
+	vpmovmskb %ymm1, %ecx
 	testl	%ecx, %ecx
-	je	L(loop)
-	VPCMPEQ	%ymm7, %ymm4, %ymm0
-	vpmovmskb %ymm0, %edi
-	testl	%edi, %edi
-	je	L(test_vec)
-	tzcntl	%edi, %ecx
+	jnz	L(return_vec_0_end)
+
+
+	VPCMPEQ	%ymm3, %ymmZERO, %ymm3
+	vpmovmskb %ymm3, %ecx
+	testl	%ecx, %ecx
+	jnz	L(return_vec_1_end)
+
+L(return_vec_2_3_end):
 # ifdef USE_AS_STRNCMP
-	cmpq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	subq	$(VEC_SIZE * 2), %rdx
+	jbe	L(ret_zero_end)
+# endif
+
+	VPCMPEQ	%ymm5, %ymmZERO, %ymm5
+	vpmovmskb %ymm5, %ecx
+	testl	%ecx, %ecx
+	jnz	L(return_vec_2_end)
+
+	/* LOOP_REG contains matches for null/mismatch from the loop. If
+	   VEC 0,1,and 2 all have no null and no mismatches then mismatch
+	   must entirely be from VEC 3 which is fully represented by
+	   LOOP_REG.  */
+	tzcntl	%LOOP_REG, %LOOP_REG
+
+# ifdef USE_AS_STRNCMP
+	subl	$-(VEC_SIZE), %LOOP_REG
+	cmpq	%LOOP_REG64, %rdx
+	jbe	L(ret_zero_end)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %ecx
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
+	je	L(ret5)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	movzbl	(VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %eax
+	movzbl	(VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret5):
 	VZEROUPPER_RETURN
 
-	.p2align 4
-L(test_vec):
 # ifdef USE_AS_STRNCMP
-	/* The first vector matched.  Return 0 if the maximum offset
-	   (%r11) <= VEC_SIZE.  */
-	cmpq	$VEC_SIZE, %r11
-	jbe	L(zero)
+	.p2align 4,, 2
+L(ret_zero_end):
+	xorl	%eax, %eax
+	VZEROUPPER_RETURN
 # endif
-	VPCMPEQ	%ymm7, %ymm1, %ymm1
-	vpmovmskb %ymm1, %ecx
-	testl	%ecx, %ecx
-	je	L(test_2_vec)
-	tzcntl	%ecx, %edi
+
+
+	/* The L(return_vec_N_end) differ from L(return_vec_N) in that
+	   they use the value of `r8` to negate the return value. This is
+	   because the page cross logic can swap `rdi` and `rsi`.  */
+	.p2align 4,, 10
 # ifdef USE_AS_STRNCMP
-	addq	$VEC_SIZE, %rdi
-	cmpq	%rdi, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+L(return_vec_1_end):
+	salq	$32, %rcx
+# endif
+L(return_vec_0_end):
+# ifndef USE_AS_STRNCMP
+	tzcntl	%ecx, %ecx
+# else
+	tzcntq	%rcx, %rcx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero_end)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	movl	(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rsi, %rdi), %ecx
-	cmpl	(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rdi), %eax
-	movzbl	(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(%rsi, %rcx), %edx
+	je	L(ret6)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
+# endif
+L(ret6):
+	VZEROUPPER_RETURN
+
+# ifndef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_vec_1_end):
+	tzcntl	%ecx, %ecx
 #  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	movl	VEC_SIZE(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	VEC_SIZE(%rsi, %rdi), %ecx
-	cmpl	VEC_SIZE(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
+	cmpl	VEC_SIZE(%rsi, %rcx), %edx
+	je	L(ret7)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 #  else
-	movzbl	VEC_SIZE(%rax, %rdi), %eax
-	movzbl	VEC_SIZE(%rdx, %rdi), %edx
-	subl	%edx, %eax
+	movzbl	VEC_SIZE(%rdi, %rcx), %eax
+	movzbl	VEC_SIZE(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 #  endif
-# endif
+L(ret7):
 	VZEROUPPER_RETURN
+# endif
 
-	.p2align 4
-L(test_2_vec):
+	.p2align 4,, 10
+L(return_vec_2_end):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_STRNCMP
-	/* The first 2 vectors matched.  Return 0 if the maximum offset
-	   (%r11) <= 2 * VEC_SIZE.  */
-	cmpq	$(VEC_SIZE * 2), %r11
-	jbe	L(zero)
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero_page_cross)
 # endif
-	VPCMPEQ	%ymm7, %ymm5, %ymm5
-	vpmovmskb %ymm5, %ecx
-	testl	%ecx, %ecx
-	je	L(test_3_vec)
-	tzcntl	%ecx, %edi
-# ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 2), %rdi
-	cmpq	%rdi, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+# ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 2)(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rsi, %rdi), %ecx
-	cmpl	(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rdi), %eax
-	movzbl	(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(VEC_SIZE * 2)(%rsi, %rcx), %edx
+	je	L(ret11)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rsi, %rdi), %ecx
-	cmpl	(VEC_SIZE * 2)(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rax, %rdi), %eax
-	movzbl	(VEC_SIZE * 2)(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
+	movzbl	(VEC_SIZE * 2)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 2)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret11):
 	VZEROUPPER_RETURN
 
-	.p2align 4
-L(test_3_vec):
+
+	/* Page cross in rsi in next 4x VEC.  */
+
+	/* TODO: Improve logic here.  */
+	.p2align 4,, 10
+L(page_cross_during_loop):
+	/* eax contains [distance_from_page - (VEC_SIZE * 4)].  */
+
+	/* Optimistically rsi and rdi and both aligned inwhich case we
+	   don't need any logic here.  */
+	cmpl	$-(VEC_SIZE * 4), %eax
+	/* Don't adjust eax before jumping back to loop and we will
+	   never hit page cross case again.  */
+	je	L(loop_skip_page_cross_check)
+
+	/* Check if we can safely load a VEC.  */
+	cmpl	$-(VEC_SIZE * 3), %eax
+	jle	L(less_1x_vec_till_page_cross)
+
+	VMOVA	(%rdi), %ymm0
+	VPCMPEQ	(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_0_end)
+
+	/* if distance >= 2x VEC then eax > -(VEC_SIZE * 2).  */
+	cmpl	$-(VEC_SIZE * 2), %eax
+	jg	L(more_2x_vec_till_page_cross)
+
+	.p2align 4,, 4
+L(less_1x_vec_till_page_cross):
+	subl	$-(VEC_SIZE * 4), %eax
+	/* Guranteed safe to read from rdi - VEC_SIZE here. The only
+	   concerning case is first iteration if incoming s1 was near start
+	   of a page and s2 near end. If s1 was near the start of the page
+	   we already aligned up to nearest VEC_SIZE * 4 so gurnateed safe
+	   to read back -VEC_SIZE. If rdi is truly at the start of a page
+	   here, it means the previous page (rdi - VEC_SIZE) has already
+	   been loaded earlier so must be valid.  */
+	VMOVU	-VEC_SIZE(%rdi, %rax), %ymm0
+	VPCMPEQ	-VEC_SIZE(%rsi, %rax), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+
+	/* Mask of potentially valid bits. The lower bits can be out of
+	   range comparisons (but safe regarding page crosses).  */
+	movl	$-1, %r10d
+	shlxl	%esi, %r10d, %r10d
+	notl	%ecx
+
 # ifdef USE_AS_STRNCMP
-	/* The first 3 vectors matched.  Return 0 if the maximum offset
-	   (%r11) <= 3 * VEC_SIZE.  */
-	cmpq	$(VEC_SIZE * 3), %r11
-	jbe	L(zero)
-# endif
-	VPCMPEQ	%ymm7, %ymm6, %ymm6
-	vpmovmskb %ymm6, %esi
-	tzcntl	%esi, %ecx
+	cmpq	%rax, %rdx
+	jbe	L(return_page_cross_end_check)
+# endif
+	movl	%eax, %OFFSET_REG
+	addl	$(PAGE_SIZE - VEC_SIZE * 4), %eax
+
+	andl	%r10d, %ecx
+	jz	L(loop_skip_page_cross_check)
+
+	.p2align 4,, 3
+L(return_page_cross_end):
+	tzcntl	%ecx, %ecx
+
 # ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 3), %rcx
-	cmpq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %esi
-	cmpl	(%rdx, %rcx), %esi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	leal	-VEC_SIZE(%OFFSET_REG64, %rcx), %ecx
+L(return_page_cross_cmp_mem):
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	addl	%OFFSET_REG, %ecx
+# endif
+# ifdef USE_AS_WCSCMP
+	movl	VEC_OFFSET(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(VEC_SIZE * 3)(%rsi, %rcx), %esi
-	cmpl	(VEC_SIZE * 3)(%rdx, %rcx), %esi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 3)(%rax, %rcx), %eax
-	movzbl	(VEC_SIZE * 3)(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	VEC_OFFSET(%rsi, %rcx), %edx
+	je	L(ret8)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
+# else
+	movzbl	VEC_OFFSET(%rdi, %rcx), %eax
+	movzbl	VEC_OFFSET(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret8):
 	VZEROUPPER_RETURN
 
-	.p2align 4
-L(loop_cross_page):
-	xorl	%r10d, %r10d
-	movq	%rdx, %rcx
-	/* Align load via RDX.  We load the extra ECX bytes which should
-	   be ignored.  */
-	andl	$((VEC_SIZE * 4) - 1), %ecx
-	/* R10 is -RCX.  */
-	subq	%rcx, %r10
-
-	/* This works only if VEC_SIZE * 2 == 64. */
-# if (VEC_SIZE * 2) != 64
-#  error (VEC_SIZE * 2) != 64
-# endif
-
-	/* Check if the first VEC_SIZE * 2 bytes should be ignored.  */
-	cmpl	$(VEC_SIZE * 2), %ecx
-	jge	L(loop_cross_page_2_vec)
-
-	vmovdqu	(%rax, %r10), %ymm2
-	vmovdqu	VEC_SIZE(%rax, %r10), %ymm3
-	VPCMPEQ	(%rdx, %r10), %ymm2, %ymm0
-	VPCMPEQ	VEC_SIZE(%rdx, %r10), %ymm3, %ymm1
-	VPMINU	%ymm2, %ymm0, %ymm0
-	VPMINU	%ymm3, %ymm1, %ymm1
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm1, %ymm1
-
-	vpmovmskb %ymm0, %edi
-	vpmovmskb %ymm1, %esi
-
-	salq	$32, %rsi
-	xorq	%rsi, %rdi
-
-	/* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
-	shrq	%cl, %rdi
-
-	testq	%rdi, %rdi
-	je	L(loop_cross_page_2_vec)
-	tzcntq	%rdi, %rcx
 # ifdef USE_AS_STRNCMP
-	cmpq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	.p2align 4,, 10
+L(return_page_cross_end_check):
+	tzcntl	%ecx, %ecx
+	leal	-VEC_SIZE(%rax, %rcx), %ecx
+	cmpl	%ecx, %edx
+	ja	L(return_page_cross_cmp_mem)
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# endif
 	VZEROUPPER_RETURN
+# endif
 
-	.p2align 4
-L(loop_cross_page_2_vec):
-	/* The first VEC_SIZE * 2 bytes match or are ignored.  */
-	vmovdqu	(VEC_SIZE * 2)(%rax, %r10), %ymm2
-	vmovdqu	(VEC_SIZE * 3)(%rax, %r10), %ymm3
-	VPCMPEQ	(VEC_SIZE * 2)(%rdx, %r10), %ymm2, %ymm5
-	VPMINU	%ymm2, %ymm5, %ymm5
-	VPCMPEQ	(VEC_SIZE * 3)(%rdx, %r10), %ymm3, %ymm6
-	VPCMPEQ	%ymm7, %ymm5, %ymm5
-	VPMINU	%ymm3, %ymm6, %ymm6
-	VPCMPEQ	%ymm7, %ymm6, %ymm6
-
-	vpmovmskb %ymm5, %edi
-	vpmovmskb %ymm6, %esi
-
-	salq	$32, %rsi
-	xorq	%rsi, %rdi
 
-	xorl	%r8d, %r8d
-	/* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
-	subl	$(VEC_SIZE * 2), %ecx
-	jle	1f
-	/* Skip ECX bytes.  */
-	shrq	%cl, %rdi
-	/* R8 has number of bytes skipped.  */
-	movl	%ecx, %r8d
-1:
-	/* Before jumping back to the loop, set ESI to the number of
-	   VEC_SIZE * 4 blocks before page crossing.  */
-	movl	$(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
-
-	testq	%rdi, %rdi
+	.p2align 4,, 10
+L(more_2x_vec_till_page_cross):
+	/* If more 2x vec till cross we will complete a full loop
+	   iteration here.  */
+
+	VMOVU	VEC_SIZE(%rdi), %ymm0
+	VPCMPEQ	VEC_SIZE(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_1_end)
+
 # ifdef USE_AS_STRNCMP
-	/* At this point, if %rdi value is 0, it already tested
-	   VEC_SIZE*4+%r10 byte starting from %rax. This label
-	   checks whether strncmp maximum offset reached or not.  */
-	je	L(string_nbyte_offset_check)
-# else
-	je	L(back_to_loop)
+	cmpq	$(VEC_SIZE * 2), %rdx
+	jbe	L(ret_zero_in_loop_page_cross)
 # endif
-	tzcntq	%rdi, %rcx
-	addq	%r10, %rcx
-	/* Adjust for number of bytes skipped.  */
-	addq	%r8, %rcx
+
+	subl	$-(VEC_SIZE * 4), %eax
+
+	/* Safe to include comparisons from lower bytes.  */
+	VMOVU	-(VEC_SIZE * 2)(%rdi, %rax), %ymm0
+	VPCMPEQ	-(VEC_SIZE * 2)(%rsi, %rax), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_page_cross_0)
+
+	VMOVU	-(VEC_SIZE * 1)(%rdi, %rax), %ymm0
+	VPCMPEQ	-(VEC_SIZE * 1)(%rsi, %rax), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_page_cross_1)
+
 # ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 2), %rcx
-	subq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	/* Must check length here as length might proclude reading next
+	   page.  */
+	cmpq	%rax, %rdx
+	jbe	L(ret_zero_in_loop_page_cross)
+# endif
+
+	/* Finish the loop.  */
+	VMOVA	(VEC_SIZE * 2)(%rdi), %ymm4
+	VMOVA	(VEC_SIZE * 3)(%rdi), %ymm6
+
+	VPCMPEQ	(VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
+	VPCMPEQ	(VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
+	vpand	%ymm4, %ymm5, %ymm5
+	vpand	%ymm6, %ymm7, %ymm7
+	VPMINU	%ymm5, %ymm7, %ymm7
+	VPCMPEQ	%ymm7, %ymmZERO, %ymm7
+	vpmovmskb %ymm7, %LOOP_REG
+	testl	%LOOP_REG, %LOOP_REG
+	jnz	L(return_vec_2_3_end)
+
+	/* Best for code size to include ucond-jmp here. Would be faster
+	   if this case is hot to duplicate the L(return_vec_2_3_end) code
+	   as fall-through and have jump back to loop on mismatch
+	   comparison.  */
+	subq	$-(VEC_SIZE * 4), %rdi
+	subq	$-(VEC_SIZE * 4), %rsi
+	addl	$(PAGE_SIZE - VEC_SIZE * 8), %eax
+# ifdef USE_AS_STRNCMP
+	subq	$(VEC_SIZE * 4), %rdx
+	ja	L(loop_skip_page_cross_check)
+L(ret_zero_in_loop_page_cross):
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	VZEROUPPER_RETURN
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rsi, %rcx), %edi
-	cmpl	(VEC_SIZE * 2)(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rax, %rcx), %eax
-	movzbl	(VEC_SIZE * 2)(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	jmp	L(loop_skip_page_cross_check)
 # endif
-	VZEROUPPER_RETURN
 
+
+	.p2align 4,, 10
+L(return_vec_page_cross_0):
+	addl	$-VEC_SIZE, %eax
+L(return_vec_page_cross_1):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_STRNCMP
-L(string_nbyte_offset_check):
-	leaq	(VEC_SIZE * 4)(%r10), %r10
-	cmpq	%r10, %r11
-	jbe	L(zero)
-	jmp	L(back_to_loop)
+	leal	-VEC_SIZE(%rax, %rcx), %ecx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero_in_loop_page_cross)
+# else
+	addl	%eax, %ecx
 # endif
 
-	.p2align 4
-L(cross_page_loop):
-	/* Check one byte/dword at a time.  */
 # ifdef USE_AS_WCSCMP
-	cmpl	%ecx, %eax
+	movl	VEC_OFFSET(%rdi, %rcx), %edx
+	xorl	%eax, %eax
+	cmpl	VEC_OFFSET(%rsi, %rcx), %edx
+	je	L(ret9)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
+	movzbl	VEC_OFFSET(%rdi, %rcx), %eax
+	movzbl	VEC_OFFSET(%rsi, %rcx), %ecx
 	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
-	jne	L(different)
-	addl	$SIZE_OF_CHAR, %edx
-	cmpl	$(VEC_SIZE * 4), %edx
-	je	L(main_loop_header)
-# ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
+L(ret9):
+	VZEROUPPER_RETURN
+
+
+	.p2align 4,, 10
+L(page_cross):
+# ifndef USE_AS_STRNCMP
+	/* If both are VEC aligned we don't need any special logic here.
+	   Only valid for strcmp where stop condition is guranteed to be
+	   reachable by just reading memory.  */
+	testl	$((VEC_SIZE - 1) << 20), %eax
+	jz	L(no_page_cross)
 # endif
+
+	movl	%edi, %eax
+	movl	%esi, %ecx
+	andl	$(PAGE_SIZE - 1), %eax
+	andl	$(PAGE_SIZE - 1), %ecx
+
+	xorl	%OFFSET_REG, %OFFSET_REG
+
+	/* Check which is closer to page cross, s1 or s2.  */
+	cmpl	%eax, %ecx
+	jg	L(page_cross_s2)
+
+	/* The previous page cross check has false positives. Check for
+	   true positive as page cross logic is very expensive.  */
+	subl	$(PAGE_SIZE - VEC_SIZE * 4), %eax
+	jbe	L(no_page_cross)
+
+	/* Set r8 to not interfere with normal return value (rdi and rsi
+	   did not swap).  */
 # ifdef USE_AS_WCSCMP
-	movl	(%rdi, %rdx), %eax
-	movl	(%rsi, %rdx), %ecx
+	/* any non-zero positive value that doesn't inference with 0x1.
+	 */
+	movl	$2, %r8d
 # else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %ecx
+	xorl	%r8d, %r8d
 # endif
-	/* Check null char.  */
-	testl	%eax, %eax
-	jne	L(cross_page_loop)
-	/* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
-	   comparisons.  */
-	subl	%ecx, %eax
-# ifndef USE_AS_WCSCMP
-L(different):
+
+	/* Check if less than 1x VEC till page cross.  */
+	subl	$(VEC_SIZE * 3), %eax
+	jg	L(less_1x_vec_till_page)
+
+	/* If more than 1x VEC till page cross, loop throuh safely
+	   loadable memory until within 1x VEC of page cross.  */
+
+	.p2align 4,, 10
+L(page_cross_loop):
+
+	VMOVU	(%rdi, %OFFSET_REG64), %ymm0
+	VPCMPEQ	(%rsi, %OFFSET_REG64), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+
+	jnz	L(check_ret_vec_page_cross)
+	addl	$VEC_SIZE, %OFFSET_REG
+# ifdef USE_AS_STRNCMP
+	cmpq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross)
 # endif
-	VZEROUPPER_RETURN
+	addl	$VEC_SIZE, %eax
+	jl	L(page_cross_loop)
+
+	subl	%eax, %OFFSET_REG
+	/* OFFSET_REG has distance to page cross - VEC_SIZE. Guranteed
+	   to not cross page so is safe to load. Since we have already
+	   loaded at least 1 VEC from rsi it is also guranteed to be safe.
+	 */
+
+	VMOVU	(%rdi, %OFFSET_REG64), %ymm0
+	VPCMPEQ	(%rsi, %OFFSET_REG64), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+
+# ifdef USE_AS_STRNCMP
+	leal	VEC_SIZE(%OFFSET_REG64), %eax
+	cmpq	%rax, %rdx
+	jbe	L(check_ret_vec_page_cross2)
+	addq	%rdi, %rdx
+# endif
+	incl	%ecx
+	jz	L(prepare_loop_no_len)
 
+	.p2align 4,, 4
+L(ret_vec_page_cross):
+# ifndef USE_AS_STRNCMP
+L(check_ret_vec_page_cross):
+# endif
+	tzcntl	%ecx, %ecx
+	addl	%OFFSET_REG, %ecx
+L(ret_vec_page_cross_cont):
 # ifdef USE_AS_WCSCMP
-	.p2align 4
-L(different):
-	/* Use movl to avoid modifying EFLAGS.  */
-	movl	$0, %eax
+	movl	(%rdi, %rcx), %edx
+	xorl	%eax, %eax
+	cmpl	(%rsi, %rcx), %edx
+	je	L(ret12)
 	setl	%al
 	negl	%eax
-	orl	$1, %eax
-	VZEROUPPER_RETURN
+	xorl	%r8d, %eax
+# else
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret12):
+	VZEROUPPER_RETURN
 
 # ifdef USE_AS_STRNCMP
-	.p2align 4
-L(zero):
+	.p2align 4,, 10
+L(check_ret_vec_page_cross2):
+	incl	%ecx
+L(check_ret_vec_page_cross):
+	tzcntl	%ecx, %ecx
+	addl	%OFFSET_REG, %ecx
+	cmpq	%rcx, %rdx
+	ja	L(ret_vec_page_cross_cont)
+	.p2align 4,, 2
+L(ret_zero_page_cross):
 	xorl	%eax, %eax
 	VZEROUPPER_RETURN
+# endif
 
-	.p2align 4
-L(char0):
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(%rdi), %ecx
-	cmpl	(%rsi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rsi), %ecx
-	movzbl	(%rdi), %eax
-	subl	%ecx, %eax
-#  endif
-	VZEROUPPER_RETURN
+	.p2align 4,, 4
+L(page_cross_s2):
+	/* Ensure this is a true page cross.  */
+	subl	$(PAGE_SIZE - VEC_SIZE * 4), %ecx
+	jbe	L(no_page_cross)
+
+
+	movl	%ecx, %eax
+	movq	%rdi, %rcx
+	movq	%rsi, %rdi
+	movq	%rcx, %rsi
+
+	/* set r8 to negate return value as rdi and rsi swapped.  */
+# ifdef USE_AS_WCSCMP
+	movl	$-4, %r8d
+# else
+	movl	$-1, %r8d
 # endif
+	xorl	%OFFSET_REG, %OFFSET_REG
 
-	.p2align 4
-L(last_vector):
-	addq	%rdx, %rdi
-	addq	%rdx, %rsi
+	/* Check if more than 1x VEC till page cross.  */
+	subl	$(VEC_SIZE * 3), %eax
+	jle	L(page_cross_loop)
+
+	.p2align 4,, 6
+L(less_1x_vec_till_page):
+	/* Find largest load size we can use.  */
+	cmpl	$16, %eax
+	ja	L(less_16_till_page)
+
+	VMOVU	(%rdi), %xmm0
+	VPCMPEQ	(%rsi), %xmm0, %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	incw	%cx
+	jnz	L(check_ret_vec_page_cross)
+	movl	$16, %OFFSET_REG
 # ifdef USE_AS_STRNCMP
-	subq	%rdx, %r11
+	cmpq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subl	%eax, %OFFSET_REG
+# else
+	/* Explicit check for 16 byte alignment.  */
+	subl	%eax, %OFFSET_REG
+	jz	L(prepare_loop)
 # endif
-	tzcntl	%ecx, %edx
+
+	VMOVU	(%rdi, %OFFSET_REG64), %xmm0
+	VPCMPEQ	(%rsi, %OFFSET_REG64), %xmm0, %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	incw	%cx
+	jnz	L(check_ret_vec_page_cross)
+
 # ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	addl	$16, %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subq	$-(VEC_SIZE * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+# else
+	leaq	(16 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	(16 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
 # endif
-# ifdef USE_AS_WCSCMP
+	jmp	L(prepare_loop_aligned)
+
+# ifdef USE_AS_STRNCMP
+	.p2align 4,, 2
+L(ret_zero_page_cross_slow_case0):
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-# else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	ret
 # endif
-	VZEROUPPER_RETURN
 
-	/* Comparing on page boundary region requires special treatment:
-	   It must done one vector at the time, starting with the wider
-	   ymm vector if possible, if not, with xmm. If fetching 16 bytes
-	   (xmm) still passes the boundary, byte comparison must be done.
-	 */
-	.p2align 4
-L(cross_page):
-	/* Try one ymm vector at a time.  */
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
-	jg	L(cross_page_1_vector)
-L(loop_1_vector):
-	vmovdqu	(%rdi, %rdx), %ymm1
-	VPCMPEQ	(%rsi, %rdx), %ymm1, %ymm0
-	VPMINU	%ymm1, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-	vpmovmskb %ymm0, %ecx
-	testl	%ecx, %ecx
-	jne	L(last_vector)
 
-	addl	$VEC_SIZE, %edx
+	.p2align 4,, 10
+L(less_16_till_page):
+	/* Find largest load size we can use.  */
+	cmpl	$24, %eax
+	ja	L(less_8_till_page)
 
-	addl	$VEC_SIZE, %eax
-# ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
-# endif
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
-	jle	L(loop_1_vector)
-L(cross_page_1_vector):
-	/* Less than 32 bytes to check, try one xmm vector.  */
-	cmpl	$(PAGE_SIZE - 16), %eax
-	jg	L(cross_page_1_xmm)
-	vmovdqu	(%rdi, %rdx), %xmm1
-	VPCMPEQ	(%rsi, %rdx), %xmm1, %xmm0
-	VPMINU	%xmm1, %xmm0, %xmm0
-	VPCMPEQ	%xmm7, %xmm0, %xmm0
-	vpmovmskb %xmm0, %ecx
-	testl	%ecx, %ecx
-	jne	L(last_vector)
+	vmovq	(%rdi), %xmm0
+	vmovq	(%rsi), %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	VPCMPEQ	%xmm1, %xmm0, %xmm1
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	incb	%cl
+	jnz	L(check_ret_vec_page_cross)
 
-	addl	$16, %edx
-# ifndef USE_AS_WCSCMP
-	addl	$16, %eax
+
+# ifdef USE_AS_STRNCMP
+	cmpq	$8, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
 # endif
+	movl	$24, %OFFSET_REG
+	/* Explicit check for 16 byte alignment.  */
+	subl	%eax, %OFFSET_REG
+
+
+
+	vmovq	(%rdi, %OFFSET_REG64), %xmm0
+	vmovq	(%rsi, %OFFSET_REG64), %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	VPCMPEQ	%xmm1, %xmm0, %xmm1
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	incb	%cl
+	jnz	L(check_ret_vec_page_cross)
+
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
-# endif
-
-L(cross_page_1_xmm):
-# ifndef USE_AS_WCSCMP
-	/* Less than 16 bytes to check, try 8 byte vector.  NB: No need
-	   for wcscmp nor wcsncmp since wide char is 4 bytes.   */
-	cmpl	$(PAGE_SIZE - 8), %eax
-	jg	L(cross_page_8bytes)
-	vmovq	(%rdi, %rdx), %xmm1
-	vmovq	(%rsi, %rdx), %xmm0
-	VPCMPEQ	%xmm0, %xmm1, %xmm0
-	VPMINU	%xmm1, %xmm0, %xmm0
-	VPCMPEQ	%xmm7, %xmm0, %xmm0
-	vpmovmskb %xmm0, %ecx
-	/* Only last 8 bits are valid.  */
-	andl	$0xff, %ecx
-	testl	%ecx, %ecx
-	jne	L(last_vector)
+	addl	$8, %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subq	$-(VEC_SIZE * 4), %rdx
 
-	addl	$8, %edx
-	addl	$8, %eax
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+# else
+	leaq	(8 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	(8 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+# endif
+	jmp	L(prepare_loop_aligned)
+
+
+	.p2align 4,, 10
+L(less_8_till_page):
+# ifdef USE_AS_WCSCMP
+	/* If using wchar then this is the only check before we reach
+	   the page boundary.  */
+	movl	(%rdi), %eax
+	movl	(%rsi), %ecx
+	cmpl	%ecx, %eax
+	jnz	L(ret_less_8_wcs)
 #  ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	addq	%rdi, %rdx
+	/* We already checked for len <= 1 so cannot hit that case here.
+	 */
 #  endif
+	testl	%eax, %eax
+	jnz	L(prepare_loop_no_len)
+	ret
 
-L(cross_page_8bytes):
-	/* Less than 8 bytes to check, try 4 byte vector.  */
-	cmpl	$(PAGE_SIZE - 4), %eax
-	jg	L(cross_page_4bytes)
-	vmovd	(%rdi, %rdx), %xmm1
-	vmovd	(%rsi, %rdx), %xmm0
-	VPCMPEQ	%xmm0, %xmm1, %xmm0
-	VPMINU	%xmm1, %xmm0, %xmm0
-	VPCMPEQ	%xmm7, %xmm0, %xmm0
-	vpmovmskb %xmm0, %ecx
-	/* Only last 4 bits are valid.  */
-	andl	$0xf, %ecx
-	testl	%ecx, %ecx
-	jne	L(last_vector)
+	.p2align 4,, 8
+L(ret_less_8_wcs):
+	setl	%OFFSET_REG8
+	negl	%OFFSET_REG
+	movl	%OFFSET_REG, %eax
+	xorl	%r8d, %eax
+	ret
+
+# else
+
+	/* Find largest load size we can use.  */
+	cmpl	$28, %eax
+	ja	L(less_4_till_page)
+
+	vmovd	(%rdi), %xmm0
+	vmovd	(%rsi), %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	VPCMPEQ	%xmm1, %xmm0, %xmm1
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	subl	$0xf, %ecx
+	jnz	L(check_ret_vec_page_cross)
 
-	addl	$4, %edx
 #  ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$4, %rdx
+	jbe	L(ret_zero_page_cross_slow_case1)
 #  endif
+	movl	$28, %OFFSET_REG
+	/* Explicit check for 16 byte alignment.  */
+	subl	%eax, %OFFSET_REG
 
-L(cross_page_4bytes):
-# endif
-	/* Less than 4 bytes to check, try one byte/dword at a time.  */
-# ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
-# endif
-# ifdef USE_AS_WCSCMP
-	movl	(%rdi, %rdx), %eax
-	movl	(%rsi, %rdx), %ecx
-# else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %ecx
-# endif
-	testl	%eax, %eax
-	jne	L(cross_page_loop)
+
+
+	vmovd	(%rdi, %OFFSET_REG64), %xmm0
+	vmovd	(%rsi, %OFFSET_REG64), %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	VPCMPEQ	%xmm1, %xmm0, %xmm1
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	subl	$0xf, %ecx
+	jnz	L(check_ret_vec_page_cross)
+
+#  ifdef USE_AS_STRNCMP
+	addl	$4, %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case1)
+	subq	$-(VEC_SIZE * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+#  else
+	leaq	(4 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	(4 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+#  endif
+	jmp	L(prepare_loop_aligned)
+
+#  ifdef USE_AS_STRNCMP
+	.p2align 4,, 2
+L(ret_zero_page_cross_slow_case1):
+	xorl	%eax, %eax
+	ret
+#  endif
+
+	.p2align 4,, 10
+L(less_4_till_page):
+	subq	%rdi, %rsi
+	/* Extremely slow byte comparison loop.  */
+L(less_4_loop):
+	movzbl	(%rdi), %eax
+	movzbl	(%rsi, %rdi), %ecx
 	subl	%ecx, %eax
-	VZEROUPPER_RETURN
-END (STRCMP)
+	jnz	L(ret_less_4_loop)
+	testl	%ecx, %ecx
+	jz	L(ret_zero_4_loop)
+#  ifdef USE_AS_STRNCMP
+	decq	%rdx
+	jz	L(ret_zero_4_loop)
+#  endif
+	incq	%rdi
+	/* end condition is reach page boundary (rdi is aligned).  */
+	testl	$31, %edi
+	jnz	L(less_4_loop)
+	leaq	-(VEC_SIZE * 4)(%rdi, %rsi), %rsi
+	addq	$-(VEC_SIZE * 4), %rdi
+#  ifdef USE_AS_STRNCMP
+	subq	$-(VEC_SIZE * 4), %rdx
+#  endif
+	jmp	L(prepare_loop_aligned)
+
+L(ret_zero_4_loop):
+	xorl	%eax, %eax
+	ret
+L(ret_less_4_loop):
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
+	ret
+# endif
+END(STRCMP)
 #endif
-- 
2.25.1


* [PATCH v1 2/5] x86: Optimize strcmp-evex.S and fix for [BZ# 28755]
  2022-01-09 12:29 [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755] Noah Goldstein
@ 2022-01-09 12:29 ` Noah Goldstein
  2022-01-09 12:29 ` [PATCH v1 3/5] string: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp] Noah Goldstein
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-09 12:29 UTC (permalink / raw)
  To: libc-alpha

Fixes [BZ# 28755] for wcsncmp by not multiplying the length by
sizeof(wchar_t).
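
(For example, with VEC_SIZE = 32 and 4-byte wchar_t, CHAR_PER_VEC is
8, so the length is compared against and decremented in units of 8
characters per vector rather than first being scaled to a byte count
that could overflow.)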

The optimizations are primarily to the loop logic and to how the page
cross logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page that do not actually cross it. The extra cost comes
from retesting the page cross conditions with a non-faulting check and
from improving the logic for entering the loop afterwards. This only
affects particular cases, however, and is generally made up for by more
than 10x improvements on the transition from the page cross -> loop
case.

The non-page cross cases are nearly universally improved as well.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 sysdeps/x86_64/multiarch/strcmp-evex.S | 1702 +++++++++++++-----------
 1 file changed, 919 insertions(+), 783 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
index 1d971f3889..e5070f3d53 100644
--- a/sysdeps/x86_64/multiarch/strcmp-evex.S
+++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
@@ -26,54 +26,69 @@
 
 # define PAGE_SIZE	4096
 
-/* VEC_SIZE = Number of bytes in a ymm register */
+	/* VEC_SIZE = Number of bytes in a ymm register.  */
 # define VEC_SIZE	32
+# define CHAR_PER_VEC	(VEC_SIZE	/	SIZE_OF_CHAR)
 
-/* Shift for dividing by (VEC_SIZE * 4).  */
-# define DIVIDE_BY_VEC_4_SHIFT	7
-# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
-#  error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
-# endif
-
-# define VMOVU		vmovdqu64
-# define VMOVA		vmovdqa64
+# define VMOVU	vmovdqu64
+# define VMOVA	vmovdqa64
 
 # ifdef USE_AS_WCSCMP
-/* Compare packed dwords.  */
-#  define VPCMP		vpcmpd
+#  define TESTEQ	subl	$0xff,
+	/* Compare packed dwords.  */
+#  define VPCMP	vpcmpd
 #  define VPMINU	vpminud
 #  define VPTESTM	vptestmd
-#  define SHIFT_REG32	r8d
-#  define SHIFT_REG64	r8
-/* 1 dword char == 4 bytes.  */
+	/* 1 dword char == 4 bytes.  */
 #  define SIZE_OF_CHAR	4
 # else
-/* Compare packed bytes.  */
-#  define VPCMP		vpcmpb
+#  define TESTEQ	incl
+	/* Compare packed bytes.  */
+#  define VPCMP	vpcmpb
 #  define VPMINU	vpminub
 #  define VPTESTM	vptestmb
-#  define SHIFT_REG32	ecx
-#  define SHIFT_REG64	rcx
-/* 1 byte char == 1 byte.  */
+	/* 1 byte char == 1 byte.  */
 #  define SIZE_OF_CHAR	1
 # endif
 
+# ifdef USE_AS_STRNCMP
+#  define LOOP_REG	r9d
+#  define LOOP_REG64	r9
+
+#  define OFFSET_REG8	r9b
+#  define OFFSET_REG	r9d
+#  define OFFSET_REG64	r9
+# else
+#  define LOOP_REG	edx
+#  define LOOP_REG64	rdx
+
+#  define OFFSET_REG8	dl
+#  define OFFSET_REG	edx
+#  define OFFSET_REG64	rdx
+# endif
+
+# if defined USE_AS_STRNCMP || defined USE_AS_WCSCMP
+#  define VEC_OFFSET	0
+# else
+#  define VEC_OFFSET	(-VEC_SIZE)
+# endif
+
 # define XMMZERO	xmm16
-# define XMM0		xmm17
-# define XMM1		xmm18
+# define XMM0	xmm17
+# define XMM1	xmm18
 
 # define YMMZERO	ymm16
-# define YMM0		ymm17
-# define YMM1		ymm18
-# define YMM2		ymm19
-# define YMM3		ymm20
-# define YMM4		ymm21
-# define YMM5		ymm22
-# define YMM6		ymm23
-# define YMM7		ymm24
-# define YMM8		ymm25
-# define YMM9		ymm26
-# define YMM10		ymm27
+# define YMM0	ymm17
+# define YMM1	ymm18
+# define YMM2	ymm19
+# define YMM3	ymm20
+# define YMM4	ymm21
+# define YMM5	ymm22
+# define YMM6	ymm23
+# define YMM7	ymm24
+# define YMM8	ymm25
+# define YMM9	ymm26
+# define YMM10	ymm27
 
 /* Warning!
            wcscmp/wcsncmp have to use SIGNED comparison for elements.
@@ -96,975 +111,1096 @@
    the maximum offset is reached before a difference is found, zero is
    returned.  */
 
-	.section .text.evex,"ax",@progbits
-ENTRY (STRCMP)
+	.section .text.evex, "ax", @progbits
+ENTRY(STRCMP)
 # ifdef USE_AS_STRNCMP
-	/* Check for simple cases (0 or 1) in offset.  */
-	cmp	$1, %RDX_LP
-	je	L(char0)
-	jb	L(zero)
-#  ifdef USE_AS_WCSCMP
-	/* Convert units: from wide to byte char.  */
-	shl	$2, %RDX_LP
+#  ifdef __ILP32__
+	/* Clear the upper 32 bits.  */
+	movl	%edx, %rdx
 #  endif
-	/* Register %r11 tracks the maximum offset.  */
-	mov	%RDX_LP, %R11_LP
+	cmp	$1, %RDX_LP
+	/* Signed comparison intentional. We use this branch to also
+	   test cases where length >= 2^63. These very large sizes can be
+	   handled with strcmp as there is no way for that length to
+	   actually bound the buffer.  */
+	jle	L(one_or_less)
 # endif
 	movl	%edi, %eax
-	xorl	%edx, %edx
-	/* Make %XMMZERO (%YMMZERO) all zeros in this function.  */
-	vpxorq	%XMMZERO, %XMMZERO, %XMMZERO
 	orl	%esi, %eax
-	andl	$(PAGE_SIZE - 1), %eax
-	cmpl	$(PAGE_SIZE - (VEC_SIZE * 4)), %eax
-	jg	L(cross_page)
-	/* Start comparing 4 vectors.  */
+	/* Shift out the bits irrelivant to page boundary ([63:12]).  */
+	sall	$20, %eax
+	/* Check if s1 or s2 may cross a page in next 4x VEC loads.  */
+	cmpl	$((PAGE_SIZE -(VEC_SIZE * 4)) << 20), %eax
+	ja	L(page_cross)
+
+L(no_page_cross):
+	/* Safe to compare 4x vectors.  */
 	VMOVU	(%rdi), %YMM0
-
-	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
 	VPTESTM	%YMM0, %YMM0, %k2
-
 	/* Each bit cleared in K1 represents a mismatch or a null CHAR
 	   in YMM0 and 32 bytes at (%rsi).  */
 	VPCMP	$0, (%rsi), %YMM0, %k1{%k2}
-
 	kmovd	%k1, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	je	L(next_3_vectors)
-	tzcntl	%ecx, %edx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
-# endif
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx) is after the maximum
-	   offset (%r11).   */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$CHAR_PER_VEC, %rdx
+	jbe	L(vec_0_test_len)
 # endif
+
+	/* TESTEQ is `incl` for strcmp/strncmp and `subl $0xff` for
+	   wcscmp/wcsncmp.  */
+
+	/* All 1s represents all equals. TESTEQ will overflow to zero in
+	   all equals case. Otherwise 1s will carry until position of first
+	   mismatch.  */
+	TESTEQ	%ecx
+	jz	L(more_3x_vec)
+
+	.p2align 4,, 4
+L(return_vec_0):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_WCSCMP
+	movl	(%rdi, %rcx, SIZE_OF_CHAR), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	je	L(return)
-L(wcscmp_return):
+	cmpl	(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret0)
 	setl	%al
 	negl	%eax
 	orl	$1, %eax
-L(return):
 # else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 # endif
+L(ret0):
 	ret
 
-L(return_vec_size):
-	tzcntl	%ecx, %edx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
-# endif
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after
-	   the maximum offset (%r11).  */
-	addq	$VEC_SIZE, %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+	.p2align 4,, 4
+L(vec_0_test_len):
+	notl	%ecx
+	bzhil	%edx, %ecx, %eax
+	jnz	L(return_vec_0)
+	/* Align if will cross fetch block.  */
+	.p2align 4,, 2
+L(ret_zero):
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
-# else
+	ret
+
+	.p2align 4,, 5
+L(one_or_less):
+	jb	L(ret_zero)
 #  ifdef USE_AS_WCSCMP
+	/* 'nbe' covers the case where length is negative (large
+	   unsigned).  */
+	jnbe	__wcscmp_evex
+	movl	(%rdi), %edx
 	xorl	%eax, %eax
-	movl	VEC_SIZE(%rdi, %rdx), %ecx
-	cmpl	VEC_SIZE(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	cmpl	(%rsi), %edx
+	je	L(ret1)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 #  else
-	movzbl	VEC_SIZE(%rdi, %rdx), %eax
-	movzbl	VEC_SIZE(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	/* 'nbe' covers the case where length is negative (large
+	   unsigned).  */
+	jnbe	__strcmp_evex
+	movzbl	(%rdi), %eax
+	movzbl	(%rsi), %ecx
+	subl	%ecx, %eax
 #  endif
-# endif
+L(ret1):
 	ret
+# endif
 
-L(return_2_vec_size):
-	tzcntl	%ecx, %edx
+	.p2align 4,, 10
+L(return_vec_1):
+	tzcntl	%ecx, %ecx
+# ifdef USE_AS_STRNCMP
+	/* rdx must be > CHAR_PER_VEC so its safe to subtract without
+	   worrying about underflow.  */
+	addq	$-CHAR_PER_VEC, %rdx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero)
+# endif
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
+	movl	VEC_SIZE(%rdi, %rcx, SIZE_OF_CHAR), %edx
+	xorl	%eax, %eax
+	cmpl	VEC_SIZE(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret2)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
+# else
+	movzbl	VEC_SIZE(%rdi, %rcx), %eax
+	movzbl	VEC_SIZE(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 # endif
+L(ret2):
+	ret
+
+	.p2align 4,, 10
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is
-	   after the maximum offset (%r11).  */
-	addq	$(VEC_SIZE * 2), %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+L(return_vec_3):
+#  if CHAR_PER_VEC <= 16
+	sall	$CHAR_PER_VEC, %ecx
 #  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	salq	$CHAR_PER_VEC, %rcx
 #  endif
+# endif
+L(return_vec_2):
+# if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP)
+	tzcntl	%ecx, %ecx
 # else
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rdi, %rdx), %ecx
-	cmpl	(VEC_SIZE * 2)(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rdi, %rdx), %eax
-	movzbl	(VEC_SIZE * 2)(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	tzcntq	%rcx, %rcx
 # endif
-	ret
 
-L(return_3_vec_size):
-	tzcntl	%ecx, %edx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
-# endif
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is
-	   after the maximum offset (%r11).  */
-	addq	$(VEC_SIZE * 3), %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 2)(%rdi, %rcx, SIZE_OF_CHAR), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(VEC_SIZE * 2)(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret3)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 # else
+	movzbl	(VEC_SIZE * 2)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 2)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+# endif
+L(ret3):
+	ret
+
+# ifndef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_vec_3):
+	tzcntl	%ecx, %ecx
 #  ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 3)(%rdi, %rcx, SIZE_OF_CHAR), %edx
 	xorl	%eax, %eax
-	movl	(VEC_SIZE * 3)(%rdi, %rdx), %ecx
-	cmpl	(VEC_SIZE * 3)(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	cmpl	(VEC_SIZE * 3)(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret4)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 #  else
-	movzbl	(VEC_SIZE * 3)(%rdi, %rdx), %eax
-	movzbl	(VEC_SIZE * 3)(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	movzbl	(VEC_SIZE * 3)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 3)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 #  endif
-# endif
+L(ret4):
 	ret
+# endif
 
-	.p2align 4
-L(next_3_vectors):
-	VMOVU	VEC_SIZE(%rdi), %YMM0
-	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
+	/* 32 byte align here ensures the main loop is ideally aligned
+	   for DSB.  */
+	.p2align 5
+L(more_3x_vec):
+	/* Safe to compare 4x vectors.  */
+	VMOVU	(VEC_SIZE)(%rdi), %YMM0
 	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at VEC_SIZE(%rsi).  */
-	VPCMP	$0, VEC_SIZE(%rsi), %YMM0, %k1{%k2}
+	VPCMP	$0, (VEC_SIZE)(%rsi), %YMM0, %k1{%k2}
 	kmovd	%k1, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_1)
+
+# ifdef USE_AS_STRNCMP
+	subq	$(CHAR_PER_VEC * 2), %rdx
+	jbe	L(ret_zero)
 # endif
-	jne	L(return_vec_size)
 
 	VMOVU	(VEC_SIZE * 2)(%rdi), %YMM0
-	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
 	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rsi).  */
 	VPCMP	$0, (VEC_SIZE * 2)(%rsi), %YMM0, %k1{%k2}
 	kmovd	%k1, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	jne	L(return_2_vec_size)
+	TESTEQ	%ecx
+	jnz	L(return_vec_2)
 
 	VMOVU	(VEC_SIZE * 3)(%rdi), %YMM0
-	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
 	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rsi).  */
 	VPCMP	$0, (VEC_SIZE * 3)(%rsi), %YMM0, %k1{%k2}
 	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_3)
+
+# ifdef USE_AS_STRNCMP
+	cmpq	$(CHAR_PER_VEC * 2), %rdx
+	jbe	L(ret_zero)
+# endif
+
+
 # ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
+	/* Any non-zero positive value that doesn't interfere with 0x1.
+	 */
+	movl	$2, %r8d
+
 # else
-	incl	%ecx
+	xorl	%r8d, %r8d
 # endif
-	jne	L(return_3_vec_size)
-L(main_loop_header):
-	leaq	(VEC_SIZE * 4)(%rdi), %rdx
-	movl	$PAGE_SIZE, %ecx
-	/* Align load via RAX.  */
-	andq	$-(VEC_SIZE * 4), %rdx
-	subq	%rdi, %rdx
-	leaq	(%rdi, %rdx), %rax
+
+	/* The prepare labels are various entry points from the page
+	   cross logic.  */
+L(prepare_loop):
+
 # ifdef USE_AS_STRNCMP
-	/* Starting from this point, the maximum offset, or simply the
-	   'offset', DECREASES by the same amount when base pointers are
-	   moved forward.  Return 0 when:
-	     1) On match: offset <= the matched vector index.
-	     2) On mistmach, offset is before the mistmatched index.
-	 */
-	subq	%rdx, %r11
-	jbe	L(zero)
+#  ifdef USE_AS_WCSCMP
+L(prepare_loop_no_len):
+	movl	%edi, %ecx
+	andl	$(VEC_SIZE * 4 - 1), %ecx
+	shrl	$2, %ecx
+	leaq	(CHAR_PER_VEC * 2)(%rdx, %rcx), %rdx
+#  else
+	/* Store N + (VEC_SIZE * 4) and place check at the beginning of
+	   the loop.  */
+	leaq	(VEC_SIZE * 2)(%rdi, %rdx), %rdx
+L(prepare_loop_no_len):
+#  endif
+# else
+L(prepare_loop_no_len):
+# endif
+
+	/* Align s1 and adjust s2 accordingly.  */
+	subq	%rdi, %rsi
+	andq	$-(VEC_SIZE * 4), %rdi
+L(prepare_loop_readj):
+	addq	%rdi, %rsi
+# if (defined USE_AS_STRNCMP) && !(defined USE_AS_WCSCMP)
+	subq	%rdi, %rdx
 # endif
-	addq	%rsi, %rdx
-	movq	%rdx, %rsi
-	andl	$(PAGE_SIZE - 1), %esi
-	/* Number of bytes before page crossing.  */
-	subq	%rsi, %rcx
-	/* Number of VEC_SIZE * 4 blocks before page crossing.  */
-	shrq	$DIVIDE_BY_VEC_4_SHIFT, %rcx
-	/* ESI: Number of VEC_SIZE * 4 blocks before page crossing.   */
-	movl	%ecx, %esi
-	jmp	L(loop_start)
 
+L(prepare_loop_aligned):
+	/* eax stores distance from rsi to next page cross. These cases
+	   need to be handled specially as the 4x loop could potentially
+	   read memory past the length of s1 or s2 and across a page
+	   boundary.  */
+	movl	$-(VEC_SIZE * 4), %eax
+	subl	%esi, %eax
+	andl	$(PAGE_SIZE - 1), %eax
+
+	vpxorq	%YMMZERO, %YMMZERO, %YMMZERO
+
+	/* Loop 4x comparisons at a time.  */
 	.p2align 4
 L(loop):
+
+	/* End condition for strncmp.  */
 # ifdef USE_AS_STRNCMP
-	/* Base pointers are moved forward by 4 * VEC_SIZE.  Decrease
-	   the maximum offset (%r11) by the same amount.  */
-	subq	$(VEC_SIZE * 4), %r11
-	jbe	L(zero)
+	subq	$(CHAR_PER_VEC * 4), %rdx
+	jbe	L(ret_zero)
 # endif
-	addq	$(VEC_SIZE * 4), %rax
-	addq	$(VEC_SIZE * 4), %rdx
-L(loop_start):
-	testl	%esi, %esi
-	leal	-1(%esi), %esi
-	je	L(loop_cross_page)
-L(back_to_loop):
-	/* Main loop, comparing 4 vectors are a time.  */
-	VMOVA	(%rax), %YMM0
-	VMOVA	VEC_SIZE(%rax), %YMM2
-	VMOVA	(VEC_SIZE * 2)(%rax), %YMM4
-	VMOVA	(VEC_SIZE * 3)(%rax), %YMM6
+
+	subq	$-(VEC_SIZE * 4), %rdi
+	subq	$-(VEC_SIZE * 4), %rsi
+
+	/* Check if rsi loads will cross a page boundary.  */
+	addl	$-(VEC_SIZE * 4), %eax
+	jnb	L(page_cross_during_loop)
+
+	/* Loop entry after handling page cross during loop.  */
+L(loop_skip_page_cross_check):
+	VMOVA	(VEC_SIZE * 0)(%rdi), %YMM0
+	VMOVA	(VEC_SIZE * 1)(%rdi), %YMM2
+	VMOVA	(VEC_SIZE * 2)(%rdi), %YMM4
+	VMOVA	(VEC_SIZE * 3)(%rdi), %YMM6
 
 	VPMINU	%YMM0, %YMM2, %YMM8
 	VPMINU	%YMM4, %YMM6, %YMM9
 
-	/* A zero CHAR in YMM8 means that there is a null CHAR.  */
-	VPMINU	%YMM8, %YMM9, %YMM8
+	/* A zero CHAR in YMM9 means that there is a null CHAR.  */
+	VPMINU	%YMM8, %YMM9, %YMM9
 
 	/* Each bit set in K1 represents a non-null CHAR in YMM8.  */
-	VPTESTM	%YMM8, %YMM8, %k1
+	VPTESTM	%YMM9, %YMM9, %k1
 
-	/* (YMM ^ YMM): A non-zero CHAR represents a mismatch.  */
-	vpxorq	(%rdx), %YMM0, %YMM1
-	vpxorq	VEC_SIZE(%rdx), %YMM2, %YMM3
-	vpxorq	(VEC_SIZE * 2)(%rdx), %YMM4, %YMM5
-	vpxorq	(VEC_SIZE * 3)(%rdx), %YMM6, %YMM7
+	vpxorq	(VEC_SIZE * 0)(%rsi), %YMM0, %YMM1
+	vpxorq	(VEC_SIZE * 1)(%rsi), %YMM2, %YMM3
+	vpxorq	(VEC_SIZE * 2)(%rsi), %YMM4, %YMM5
+	/* Ternary logic to xor (VEC_SIZE * 3)(%rsi) with YMM6 while
+	   ORing with YMM1. Result is stored in YMM6.  */
+	vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM1, %YMM6
 
-	vporq	%YMM1, %YMM3, %YMM9
-	vporq	%YMM5, %YMM7, %YMM10
+	/* Or together YMM3, YMM5, and YMM6.  */
+	vpternlogd $0xfe, %YMM3, %YMM5, %YMM6
 
-	/* A non-zero CHAR in YMM9 represents a mismatch.  */
-	vporq	%YMM9, %YMM10, %YMM9
 
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR.  */
-	VPCMP	$0, %YMMZERO, %YMM9, %k0{%k1}
-	kmovd   %k0, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	je	 L(loop)
+	/* A non-zero CHAR in YMM6 represents a mismatch.  */
+	VPCMP	$0, %YMMZERO, %YMM6, %k0{%k1}
+	kmovd	%k0, %LOOP_REG
+
+	TESTEQ	%LOOP_REG
+	jz	L(loop)
 
-	/* Each bit set in K1 represents a non-null CHAR in YMM0.  */
+
+	/* Find which VEC has the mismatch or end of string.  */
 	VPTESTM	%YMM0, %YMM0, %k1
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR
-	   in YMM0 and (%rdx).  */
 	VPCMP	$0, %YMMZERO, %YMM1, %k0{%k1}
 	kmovd	%k0, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	je	L(test_vec)
-	tzcntl	%ecx, %ecx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %ecx
-# endif
-# ifdef USE_AS_STRNCMP
-	cmpq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# endif
-	ret
+	TESTEQ	%ecx
+	jnz	L(return_vec_0_end)
 
-	.p2align 4
-L(test_vec):
-# ifdef USE_AS_STRNCMP
-	/* The first vector matched.  Return 0 if the maximum offset
-	   (%r11) <= VEC_SIZE.  */
-	cmpq	$VEC_SIZE, %r11
-	jbe	L(zero)
-# endif
-	/* Each bit set in K1 represents a non-null CHAR in YMM2.  */
 	VPTESTM	%YMM2, %YMM2, %k1
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR
-	   in YMM2 and VEC_SIZE(%rdx).  */
 	VPCMP	$0, %YMMZERO, %YMM3, %k0{%k1}
 	kmovd	%k0, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	je	L(test_2_vec)
-	tzcntl	%ecx, %edi
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edi
-# endif
-# ifdef USE_AS_STRNCMP
-	addq	$VEC_SIZE, %rdi
-	cmpq	%rdi, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rdi), %ecx
-	cmpl	(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rdi), %eax
-	movzbl	(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
-# else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	VEC_SIZE(%rsi, %rdi), %ecx
-	cmpl	VEC_SIZE(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	VEC_SIZE(%rax, %rdi), %eax
-	movzbl	VEC_SIZE(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
-# endif
-	ret
+	TESTEQ	%ecx
+	jnz	L(return_vec_1_end)
 
-	.p2align 4
-L(test_2_vec):
+
+	/* Handle VEC 2 and 3 without branches.  */
+L(return_vec_2_3_end):
 # ifdef USE_AS_STRNCMP
-	/* The first 2 vectors matched.  Return 0 if the maximum offset
-	   (%r11) <= 2 * VEC_SIZE.  */
-	cmpq	$(VEC_SIZE * 2), %r11
-	jbe	L(zero)
+	subq	$(CHAR_PER_VEC * 2), %rdx
+	jbe	L(ret_zero_end)
 # endif
-	/* Each bit set in K1 represents a non-null CHAR in YMM4.  */
+
 	VPTESTM	%YMM4, %YMM4, %k1
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR
-	   in YMM4 and (VEC_SIZE * 2)(%rdx).  */
 	VPCMP	$0, %YMMZERO, %YMM5, %k0{%k1}
 	kmovd	%k0, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
+	TESTEQ	%ecx
+# if CHAR_PER_VEC <= 16
+	sall	$CHAR_PER_VEC, %LOOP_REG
+	orl	%ecx, %LOOP_REG
+# else
+	salq	$CHAR_PER_VEC, %LOOP_REG64
+	orq	%rcx, %LOOP_REG64
+# endif
+L(return_vec_3_end):
+	/* LOOP_REG contains matches for null/mismatch from the loop. If
+	   VEC 0, 1, and 2 all have no null and no mismatches then the
+	   mismatch must entirely be from VEC 3 which is fully represented
+	   by LOOP_REG.  */
+# if CHAR_PER_VEC <= 16
+	tzcntl	%LOOP_REG, %LOOP_REG
 # else
-	incl	%ecx
+	tzcntq	%LOOP_REG64, %LOOP_REG64
 # endif
-	je	L(test_3_vec)
-	tzcntl	%ecx, %edi
+# ifdef USE_AS_STRNCMP
+	cmpq	%LOOP_REG64, %rdx
+	jbe	L(ret_zero_end)
+# endif
+
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edi
+	movl	(VEC_SIZE * 2)(%rdi, %LOOP_REG64, SIZE_OF_CHAR), %ecx
+	xorl	%eax, %eax
+	cmpl	(VEC_SIZE * 2)(%rsi, %LOOP_REG64, SIZE_OF_CHAR), %ecx
+	je	L(ret5)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
+# else
+	movzbl	(VEC_SIZE * 2)(%rdi, %LOOP_REG64), %eax
+	movzbl	(VEC_SIZE * 2)(%rsi, %LOOP_REG64), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret5):
+	ret
+
 # ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 2), %rdi
-	cmpq	%rdi, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	.p2align 4,, 2
+L(ret_zero_end):
 	xorl	%eax, %eax
-	movl	(%rsi, %rdi), %ecx
-	cmpl	(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
+	ret
+# endif
+
+
+	/* The L(return_vec_N_end) labels differ from L(return_vec_N) in that
+	   they use the value of `r8` to negate the return value. This is
+	   because the page cross logic can swap `rdi` and `rsi`.  */
+	.p2align 4,, 10
+# ifdef USE_AS_STRNCMP
+L(return_vec_1_end):
+#  if CHAR_PER_VEC <= 16
+	sall	$CHAR_PER_VEC, %ecx
 #  else
-	movzbl	(%rax, %rdi), %eax
-	movzbl	(%rdx, %rdi), %edx
-	subl	%edx, %eax
+	salq	$CHAR_PER_VEC, %rcx
 #  endif
+# endif
+L(return_vec_0_end):
+# if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP)
+	tzcntl	%ecx, %ecx
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rsi, %rdi), %ecx
-	cmpl	(VEC_SIZE * 2)(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rax, %rdi), %eax
-	movzbl	(VEC_SIZE * 2)(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
+	tzcntq	%rcx, %rcx
 # endif
-	ret
 
-	.p2align 4
-L(test_3_vec):
 # ifdef USE_AS_STRNCMP
-	/* The first 3 vectors matched.  Return 0 if the maximum offset
-	   (%r11) <= 3 * VEC_SIZE.  */
-	cmpq	$(VEC_SIZE * 3), %r11
-	jbe	L(zero)
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero_end)
 # endif
-	/* Each bit set in K1 represents a non-null CHAR in YMM6.  */
-	VPTESTM	%YMM6, %YMM6, %k1
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR
-	   in YMM6 and (VEC_SIZE * 3)(%rdx).  */
-	VPCMP	$0, %YMMZERO, %YMM7, %k0{%k1}
-	kmovd	%k0, %ecx
+
 # ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
+	movl	(%rdi, %rcx, SIZE_OF_CHAR), %edx
+	xorl	%eax, %eax
+	cmpl	(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret6)
+	setl	%al
+	negl	%eax
+	/* This is the non-zero case for `eax` so just xorl with `r8d`
+	   to flip the sign if `rdi` and `rsi` were swapped.  */
+	xorl	%r8d, %eax
 # else
-	incl	%ecx
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	/* Flip `eax` if `rdi` and `rsi` were swapped in the page cross
+	   logic. Subtract `r8d` after xor for zero case.  */
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret6):
+	ret
+
+# ifndef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_vec_1_end):
 	tzcntl	%ecx, %ecx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %ecx
-# endif
-# ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 3), %rcx
-	cmpq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %esi
-	cmpl	(%rdx, %rcx), %esi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# else
 #  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	movl	VEC_SIZE(%rdi, %rcx, SIZE_OF_CHAR), %edx
 	xorl	%eax, %eax
-	movl	(VEC_SIZE * 3)(%rsi, %rcx), %esi
-	cmpl	(VEC_SIZE * 3)(%rdx, %rcx), %esi
-	jne	L(wcscmp_return)
+	cmpl	VEC_SIZE(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret7)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 #  else
-	movzbl	(VEC_SIZE * 3)(%rax, %rcx), %eax
-	movzbl	(VEC_SIZE * 3)(%rdx, %rcx), %edx
-	subl	%edx, %eax
+	movzbl	VEC_SIZE(%rdi, %rcx), %eax
+	movzbl	VEC_SIZE(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 #  endif
-# endif
+L(ret7):
 	ret
-
-	.p2align 4
-L(loop_cross_page):
-	xorl	%r10d, %r10d
-	movq	%rdx, %rcx
-	/* Align load via RDX.  We load the extra ECX bytes which should
-	   be ignored.  */
-	andl	$((VEC_SIZE * 4) - 1), %ecx
-	/* R10 is -RCX.  */
-	subq	%rcx, %r10
-
-	/* This works only if VEC_SIZE * 2 == 64. */
-# if (VEC_SIZE * 2) != 64
-#  error (VEC_SIZE * 2) != 64
 # endif
 
-	/* Check if the first VEC_SIZE * 2 bytes should be ignored.  */
-	cmpl	$(VEC_SIZE * 2), %ecx
-	jge	L(loop_cross_page_2_vec)
 
-	VMOVU	(%rax, %r10), %YMM2
-	VMOVU	VEC_SIZE(%rax, %r10), %YMM3
+	/* Page cross in rsi in next 4x VEC.  */
 
-	/* Each bit set in K2 represents a non-null CHAR in YMM2.  */
-	VPTESTM	%YMM2, %YMM2, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM2 and 32 bytes at (%rdx, %r10).  */
-	VPCMP	$0, (%rdx, %r10), %YMM2, %k1{%k2}
-	kmovd	%k1, %r9d
-	/* Don't use subl since it is the lower 16/32 bits of RDI
-	   below.  */
-	notl	%r9d
-# ifdef USE_AS_WCSCMP
-	/* Only last 8 bits are valid.  */
-	andl	$0xff, %r9d
-# endif
+	/* TODO: Improve logic here.  */
+	.p2align 4,, 10
+L(page_cross_during_loop):
+	/* eax contains [distance_from_page - (VEC_SIZE * 4)].  */
 
-	/* Each bit set in K4 represents a non-null CHAR in YMM3.  */
-	VPTESTM	%YMM3, %YMM3, %k4
-	/* Each bit cleared in K3 represents a mismatch or a null CHAR
-	   in YMM3 and 32 bytes at VEC_SIZE(%rdx, %r10).  */
-	VPCMP	$0, VEC_SIZE(%rdx, %r10), %YMM3, %k3{%k4}
-	kmovd	%k3, %edi
-    /* Must use notl %edi here as lower bits are for CHAR
-	   comparisons potentially out of range thus can be 0 without
-	   indicating mismatch.  */
-	notl	%edi
-# ifdef USE_AS_WCSCMP
-	/* Don't use subl since it is the upper 8 bits of EDI below.  */
-	andl	$0xff, %edi
-# endif
+	/* Optimistically rsi and rdi are both aligned, in which case we
+	   don't need any logic here.  */
+	cmpl	$-(VEC_SIZE * 4), %eax
+	/* eax is not adjusted before jumping back to the loop, so we
+	   will never hit the page cross case again.  */
+	je	L(loop_skip_page_cross_check)
 
-# ifdef USE_AS_WCSCMP
-	/* NB: Each bit in EDI/R9D represents 4-byte element.  */
-	sall	$8, %edi
-	/* NB: Divide shift count by 4 since each bit in K1 represent 4
-	   bytes.  */
-	movl	%ecx, %SHIFT_REG32
-	sarl	$2, %SHIFT_REG32
-
-	/* Each bit in EDI represents a null CHAR or a mismatch.  */
-	orl	%r9d, %edi
-# else
-	salq	$32, %rdi
+	/* Check if we can safely load a VEC.  */
+	cmpl	$-(VEC_SIZE * 3), %eax
+	jle	L(less_1x_vec_till_page_cross)
 
-	/* Each bit in RDI represents a null CHAR or a mismatch.  */
-	orq	%r9, %rdi
-# endif
+	VMOVA	(%rdi), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, (%rsi), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_0_end)
+
+	/* if distance >= 2x VEC then eax > -(VEC_SIZE * 2).  */
+	cmpl	$-(VEC_SIZE * 2), %eax
+	jg	L(more_2x_vec_till_page_cross)
+
+	.p2align 4,, 4
+L(less_1x_vec_till_page_cross):
+	subl	$-(VEC_SIZE * 4), %eax
+	/* Guaranteed safe to read from rdi - VEC_SIZE here. The only
+	   concerning case is the first iteration if incoming s1 was near
+	   the start of a page and s2 near the end. If s1 was near the start
+	   of the page we already aligned up to the nearest VEC_SIZE * 4 so
+	   it is guaranteed safe to read back -VEC_SIZE. If rdi is truly at
+	   the start of a page here, it means the previous page
+	   (rdi - VEC_SIZE) has already been loaded earlier so must be valid.  */
+	VMOVU	-VEC_SIZE(%rdi, %rax), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, -VEC_SIZE(%rsi, %rax), %YMM0, %k1{%k2}
+
+	/* Mask of potentially valid bits. The lower bits can come from
+	   out-of-range comparisons (but are safe regarding page crosses).  */
 
-	/* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
-	shrxq	%SHIFT_REG64, %rdi, %rdi
-	testq	%rdi, %rdi
-	je	L(loop_cross_page_2_vec)
-	tzcntq	%rdi, %rcx
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %ecx
+	movl	$-1, %r10d
+	movl	%esi, %ecx
+	andl	$(VEC_SIZE - 1), %ecx
+	shrl	$2, %ecx
+	shlxl	%ecx, %r10d, %ecx
+	movzbl	%cl, %r10d
+# else
+	movl	$-1, %ecx
+	shlxl	%esi, %ecx, %r10d
 # endif
+
+	kmovd	%k1, %ecx
+	notl	%ecx
+
+
 # ifdef USE_AS_STRNCMP
-	cmpq	%rcx, %r11
-	jbe	L(zero)
 #  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
+	movl	%eax, %r11d
+	shrl	$2, %r11d
+	cmpq	%r11, %rdx
 #  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
+	cmpq	%rax, %rdx
 #  endif
+	jbe	L(return_page_cross_end_check)
+# endif
+	movl	%eax, %OFFSET_REG
+
+	/* Readjust eax before potentially returning to the loop.  */
+	addl	$(PAGE_SIZE - VEC_SIZE * 4), %eax
+
+	andl	%r10d, %ecx
+	jz	L(loop_skip_page_cross_check)
+
+	.p2align 4,, 3
+L(return_page_cross_end):
+	tzcntl	%ecx, %ecx
+
+# if (defined USE_AS_STRNCMP) || (defined USE_AS_WCSCMP)
+	leal	-VEC_SIZE(%OFFSET_REG64, %rcx, SIZE_OF_CHAR), %ecx
+L(return_page_cross_cmp_mem):
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	addl	%OFFSET_REG, %ecx
+# endif
+# ifdef USE_AS_WCSCMP
+	movl	VEC_OFFSET(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	VEC_OFFSET(%rsi, %rcx), %edx
+	je	L(ret8)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
+# else
+	movzbl	VEC_OFFSET(%rdi, %rcx), %eax
+	movzbl	VEC_OFFSET(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret8):
 	ret
 
-	.p2align 4
-L(loop_cross_page_2_vec):
-	/* The first VEC_SIZE * 2 bytes match or are ignored.  */
-	VMOVU	(VEC_SIZE * 2)(%rax, %r10), %YMM0
-	VMOVU	(VEC_SIZE * 3)(%rax, %r10), %YMM1
+# ifdef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_page_cross_end_check):
+	tzcntl	%ecx, %ecx
+	leal	-VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx
+#  ifdef USE_AS_WCSCMP
+	sall	$2, %edx
+#  endif
+	cmpl	%ecx, %edx
+	ja	L(return_page_cross_cmp_mem)
+	xorl	%eax, %eax
+	ret
+# endif
+
+
+	.p2align 4,, 10
+L(more_2x_vec_till_page_cross):
+	/* If more than 2x VEC till page cross, we will complete a full
+	   loop iteration here.  */
 
+	VMOVA	VEC_SIZE(%rdi), %YMM0
 	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rdx, %r10).  */
-	VPCMP	$0, (VEC_SIZE * 2)(%rdx, %r10), %YMM0, %k1{%k2}
-	kmovd	%k1, %r9d
-	/* Don't use subl since it is the lower 16/32 bits of RDI
-	   below.  */
-	notl	%r9d
-# ifdef USE_AS_WCSCMP
-	/* Only last 8 bits are valid.  */
-	andl	$0xff, %r9d
-# endif
+	VPCMP	$0, VEC_SIZE(%rsi), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_1_end)
 
-	VPTESTM	%YMM1, %YMM1, %k4
-	/* Each bit cleared in K3 represents a mismatch or a null CHAR
-	   in YMM1 and 32 bytes at (VEC_SIZE * 3)(%rdx, %r10).  */
-	VPCMP	$0, (VEC_SIZE * 3)(%rdx, %r10), %YMM1, %k3{%k4}
-	kmovd	%k3, %edi
-	/* Must use notl %edi here as lower bits are for CHAR
-	   comparisons potentially out of range thus can be 0 without
-	   indicating mismatch.  */
-	notl	%edi
-# ifdef USE_AS_WCSCMP
-	/* Don't use subl since it is the upper 8 bits of EDI below.  */
-	andl	$0xff, %edi
+# ifdef USE_AS_STRNCMP
+	cmpq	$(CHAR_PER_VEC * 2), %rdx
+	jbe	L(ret_zero_in_loop_page_cross)
 # endif
 
-# ifdef USE_AS_WCSCMP
-	/* NB: Each bit in EDI/R9D represents 4-byte element.  */
-	sall	$8, %edi
+	subl	$-(VEC_SIZE * 4), %eax
 
-	/* Each bit in EDI represents a null CHAR or a mismatch.  */
-	orl	%r9d, %edi
-# else
-	salq	$32, %rdi
+	/* Safe to include comparisons from lower bytes.  */
+	VMOVU	-(VEC_SIZE * 2)(%rdi, %rax), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, -(VEC_SIZE * 2)(%rsi, %rax), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_page_cross_0)
+
+	VMOVU	-(VEC_SIZE * 1)(%rdi, %rax), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, -(VEC_SIZE * 1)(%rsi, %rax), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_page_cross_1)
 
-	/* Each bit in RDI represents a null CHAR or a mismatch.  */
-	orq	%r9, %rdi
+# ifdef USE_AS_STRNCMP
+	/* Must check length here as length might preclude reading the
+	   next page.  */
+#  ifdef USE_AS_WCSCMP
+	movl	%eax, %r11d
+	shrl	$2, %r11d
+	cmpq	%r11, %rdx
+#  else
+	cmpq	%rax, %rdx
+#  endif
+	jbe	L(ret_zero_in_loop_page_cross)
 # endif
 
-	xorl	%r8d, %r8d
-	/* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
-	subl	$(VEC_SIZE * 2), %ecx
-	jle	1f
-	/* R8 has number of bytes skipped.  */
-	movl	%ecx, %r8d
-# ifdef USE_AS_WCSCMP
-	/* NB: Divide shift count by 4 since each bit in RDI represent 4
-	   bytes.  */
-	sarl	$2, %ecx
-	/* Skip ECX bytes.  */
-	shrl	%cl, %edi
+	/* Finish the loop.  */
+	VMOVA	(VEC_SIZE * 2)(%rdi), %YMM4
+	VMOVA	(VEC_SIZE * 3)(%rdi), %YMM6
+	VPMINU	%YMM4, %YMM6, %YMM9
+	VPTESTM	%YMM9, %YMM9, %k1
+
+	vpxorq	(VEC_SIZE * 2)(%rsi), %YMM4, %YMM5
+	/* YMM6 = YMM5 | ((VEC_SIZE * 3)(%rsi) ^ YMM6).  */
+	vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM5, %YMM6
+
+	VPCMP	$0, %YMMZERO, %YMM6, %k0{%k1}
+	kmovd	%k0, %LOOP_REG
+	TESTEQ	%LOOP_REG
+	jnz	L(return_vec_2_3_end)
+
+	/* Best for code size to include the unconditional jmp here. If
+	   this case is hot it would be faster to duplicate the
+	   L(return_vec_2_3_end) code as the fall-through and jump back
+	   to the loop on a mismatched comparison.  */
+	subq	$-(VEC_SIZE * 4), %rdi
+	subq	$-(VEC_SIZE * 4), %rsi
+	addl	$(PAGE_SIZE - VEC_SIZE * 8), %eax
+# ifdef USE_AS_STRNCMP
+	subq	$(CHAR_PER_VEC * 4), %rdx
+	ja	L(loop_skip_page_cross_check)
+L(ret_zero_in_loop_page_cross):
+	xorl	%eax, %eax
+	ret
 # else
-	/* Skip ECX bytes.  */
-	shrq	%cl, %rdi
+	jmp	L(loop_skip_page_cross_check)
 # endif
-1:
-	/* Before jumping back to the loop, set ESI to the number of
-	   VEC_SIZE * 4 blocks before page crossing.  */
-	movl	$(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
 
-	testq	%rdi, %rdi
-# ifdef USE_AS_STRNCMP
-	/* At this point, if %rdi value is 0, it already tested
-	   VEC_SIZE*4+%r10 byte starting from %rax. This label
-	   checks whether strncmp maximum offset reached or not.  */
-	je	L(string_nbyte_offset_check)
+
+	.p2align 4,, 10
+L(return_vec_page_cross_0):
+	addl	$-VEC_SIZE, %eax
+L(return_vec_page_cross_1):
+	tzcntl	%ecx, %ecx
+# if defined USE_AS_STRNCMP || defined USE_AS_WCSCMP
+	leal	-VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx
+#  ifdef USE_AS_STRNCMP
+#   ifdef USE_AS_WCSCMP
+	/* Must divide ecx instead of multiplying rdx due to overflow.  */
+	movl	%ecx, %eax
+	shrl	$2, %eax
+	cmpq	%rax, %rdx
+#   else
+	cmpq	%rcx, %rdx
+#   endif
+	jbe	L(ret_zero_in_loop_page_cross)
+#  endif
 # else
-	je	L(back_to_loop)
+	addl	%eax, %ecx
 # endif
-	tzcntq	%rdi, %rcx
+
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %ecx
-# endif
-	addq	%r10, %rcx
-	/* Adjust for number of bytes skipped.  */
-	addq	%r8, %rcx
-# ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 2), %rcx
-	subq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	movl	VEC_OFFSET(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	VEC_OFFSET(%rsi, %rcx), %edx
+	je	L(ret9)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rsi, %rcx), %edi
-	cmpl	(VEC_SIZE * 2)(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rax, %rcx), %eax
-	movzbl	(VEC_SIZE * 2)(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	movzbl	VEC_OFFSET(%rdi, %rcx), %eax
+	movzbl	VEC_OFFSET(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret9):
 	ret
 
-# ifdef USE_AS_STRNCMP
-L(string_nbyte_offset_check):
-	leaq	(VEC_SIZE * 4)(%r10), %r10
-	cmpq	%r10, %r11
-	jbe	L(zero)
-	jmp	L(back_to_loop)
+
+	.p2align 4,, 10
+L(page_cross):
+# ifndef USE_AS_STRNCMP
+	/* If both are VEC aligned we don't need any special logic here.
+	   Only valid for strcmp where the stop condition is guaranteed to
+	   be reachable by just reading memory.  */
+	testl	$((VEC_SIZE - 1) << 20), %eax
+	jz	L(no_page_cross)
 # endif
 
-	.p2align 4
-L(cross_page_loop):
-	/* Check one byte/dword at a time.  */
+	movl	%edi, %eax
+	movl	%esi, %ecx
+	andl	$(PAGE_SIZE - 1), %eax
+	andl	$(PAGE_SIZE - 1), %ecx
+
+	xorl	%OFFSET_REG, %OFFSET_REG
+
+	/* Check which is closer to page cross, s1 or s2.  */
+	cmpl	%eax, %ecx
+	jg	L(page_cross_s2)
+
+	/* The previous page cross check has false positives. Check for a
+	   true positive as the page cross logic is very expensive.  */
+	subl	$(PAGE_SIZE - VEC_SIZE * 4), %eax
+	jbe	L(no_page_cross)
+
+
+	/* Set r8 to not interfere with normal return value (rdi and rsi
+	   did not swap).  */
 # ifdef USE_AS_WCSCMP
-	cmpl	%ecx, %eax
+	/* Any non-zero positive value that doesn't interfere with 0x1.
+	 */
+	movl	$2, %r8d
 # else
-	subl	%ecx, %eax
+	xorl	%r8d, %r8d
 # endif
-	jne	L(different)
-	addl	$SIZE_OF_CHAR, %edx
-	cmpl	$(VEC_SIZE * 4), %edx
-	je	L(main_loop_header)
+
+	/* Check if less than 1x VEC till page cross.  */
+	subl	$(VEC_SIZE * 3), %eax
+	jg	L(less_1x_vec_till_page)
+
+
+	/* If more than 1x VEC till page cross, loop through safely
+	   loadable memory until within 1x VEC of page cross.  */
+	.p2align 4,, 8
+L(page_cross_loop):
+	VMOVU	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(check_ret_vec_page_cross)
+	addl	$CHAR_PER_VEC, %OFFSET_REG
 # ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross)
 # endif
+	addl	$VEC_SIZE, %eax
+	jl	L(page_cross_loop)
+
 # ifdef USE_AS_WCSCMP
-	movl	(%rdi, %rdx), %eax
-	movl	(%rsi, %rdx), %ecx
-# else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %ecx
+	shrl	$2, %eax
 # endif
-	/* Check null CHAR.  */
-	testl	%eax, %eax
-	jne	L(cross_page_loop)
-	/* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
-	   comparisons.  */
-	subl	%ecx, %eax
-# ifndef USE_AS_WCSCMP
-L(different):
+
+
+	subl	%eax, %OFFSET_REG
+	/* OFFSET_REG has distance to page cross - VEC_SIZE. Guaranteed
+	   to not cross page so is safe to load. Since we have already
+	   loaded at least 1 VEC from rsi it is also guaranteed to be safe.
+	 */
+	VMOVU	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0, %k1{%k2}
+
+	kmovd	%k1, %ecx
+# ifdef USE_AS_STRNCMP
+	leal	CHAR_PER_VEC(%OFFSET_REG64), %eax
+	cmpq	%rax, %rdx
+	jbe	L(check_ret_vec_page_cross2)
+#  ifdef USE_AS_WCSCMP
+	addq	$-(CHAR_PER_VEC * 2), %rdx
+#  else
+	addq	%rdi, %rdx
+#  endif
 # endif
-	ret
+	TESTEQ	%ecx
+	jz	L(prepare_loop_no_len)
 
+	.p2align 4,, 4
+L(ret_vec_page_cross):
+# ifndef USE_AS_STRNCMP
+L(check_ret_vec_page_cross):
+# endif
+	tzcntl	%ecx, %ecx
+	addl	%OFFSET_REG, %ecx
+L(ret_vec_page_cross_cont):
 # ifdef USE_AS_WCSCMP
-	.p2align 4
-L(different):
-	/* Use movl to avoid modifying EFLAGS.  */
-	movl	$0, %eax
+	movl	(%rdi, %rcx, SIZE_OF_CHAR), %edx
+	xorl	%eax, %eax
+	cmpl	(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret12)
 	setl	%al
 	negl	%eax
-	orl	$1, %eax
-	ret
+	xorl	%r8d, %eax
+# else
+	movzbl	(%rdi, %rcx, SIZE_OF_CHAR), %eax
+	movzbl	(%rsi, %rcx, SIZE_OF_CHAR), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret12):
+	ret
+
 
 # ifdef USE_AS_STRNCMP
-	.p2align 4
-L(zero):
+	.p2align 4,, 10
+L(check_ret_vec_page_cross2):
+	TESTEQ	%ecx
+L(check_ret_vec_page_cross):
+	tzcntl	%ecx, %ecx
+	addl	%OFFSET_REG, %ecx
+	cmpq	%rcx, %rdx
+	ja	L(ret_vec_page_cross_cont)
+	.p2align 4,, 2
+L(ret_zero_page_cross):
 	xorl	%eax, %eax
 	ret
+# endif
 
-	.p2align 4
-L(char0):
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(%rdi), %ecx
-	cmpl	(%rsi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rsi), %ecx
-	movzbl	(%rdi), %eax
-	subl	%ecx, %eax
-#  endif
-	ret
+	.p2align 4,, 4
+L(page_cross_s2):
+	/* Ensure this is a true page cross.  */
+	subl	$(PAGE_SIZE - VEC_SIZE * 4), %ecx
+	jbe	L(no_page_cross)
+
+
+	movl	%ecx, %eax
+	movq	%rdi, %rcx
+	movq	%rsi, %rdi
+	movq	%rcx, %rsi
+
+	/* Set r8 to negate the return value as rdi and rsi are swapped.  */
+# ifdef USE_AS_WCSCMP
+	movl	$-4, %r8d
+# else
+	movl	$-1, %r8d
 # endif
+	xorl	%OFFSET_REG, %OFFSET_REG
 
-	.p2align 4
-L(last_vector):
-	addq	%rdx, %rdi
-	addq	%rdx, %rsi
-# ifdef USE_AS_STRNCMP
-	subq	%rdx, %r11
+	/* Check if more than 1x VEC till page cross.  */
+	subl	$(VEC_SIZE * 3), %eax
+	jle	L(page_cross_loop)
+
+	.p2align 4,, 6
+L(less_1x_vec_till_page):
+# ifdef USE_AS_WCSCMP
+	shrl	$2, %eax
 # endif
-	tzcntl	%ecx, %edx
+	/* Find largest load size we can use.  */
+	cmpl	$(16 / SIZE_OF_CHAR), %eax
+	ja	L(less_16_till_page)
+
+	/* Use 16 byte comparison.  */
+	vmovdqu	(%rdi), %xmm0
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, (%rsi), %xmm0, %k1{%k2}
+	kmovd	%k1, %ecx
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
+	subl	$0xf, %ecx
+# else
+	incw	%cx
 # endif
+	jnz	L(check_ret_vec_page_cross)
+	movl	$(16 / SIZE_OF_CHAR), %OFFSET_REG
 # ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subl	%eax, %OFFSET_REG
+# else
+	/* Explicit check for 16 byte alignment.  */
+	subl	%eax, %OFFSET_REG
+	jz	L(prepare_loop)
 # endif
+	vmovdqu	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0, %k1{%k2}
+	kmovd	%k1, %ecx
 # ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	subl	$0xf, %ecx
 # else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	incw	%cx
 # endif
+	jnz	L(check_ret_vec_page_cross)
+# ifdef USE_AS_STRNCMP
+	addl	$(16 / SIZE_OF_CHAR), %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subq	$-(CHAR_PER_VEC * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+# else
+	leaq	(16 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	(16 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+# endif
+	jmp	L(prepare_loop_aligned)
+
+# ifdef USE_AS_STRNCMP
+	.p2align 4,, 2
+L(ret_zero_page_cross_slow_case0):
+	xorl	%eax, %eax
 	ret
+# endif
 
-	/* Comparing on page boundary region requires special treatment:
-	   It must done one vector at the time, starting with the wider
-	   ymm vector if possible, if not, with xmm. If fetching 16 bytes
-	   (xmm) still passes the boundary, byte comparison must be done.
-	 */
-	.p2align 4
-L(cross_page):
-	/* Try one ymm vector at a time.  */
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
-	jg	L(cross_page_1_vector)
-L(loop_1_vector):
-	VMOVU	(%rdi, %rdx), %YMM0
 
-	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at (%rsi, %rdx).  */
-	VPCMP	$0, (%rsi, %rdx), %YMM0, %k1{%k2}
+	.p2align 4,, 10
+L(less_16_till_page):
+	cmpl	$(24 / SIZE_OF_CHAR), %eax
+	ja	L(less_8_till_page)
+
+	/* Use 8 byte comparison.  */
+	vmovq	(%rdi), %xmm0
+	vmovq	(%rsi), %xmm1
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, %xmm1, %xmm0, %k1{%k2}
 	kmovd	%k1, %ecx
 # ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
+	subl	$0x3, %ecx
 # else
-	incl	%ecx
+	incb	%cl
 # endif
-	jne	L(last_vector)
+	jnz	L(check_ret_vec_page_cross)
 
-	addl	$VEC_SIZE, %edx
 
-	addl	$VEC_SIZE, %eax
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$(8 / SIZE_OF_CHAR), %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
 # endif
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
-	jle	L(loop_1_vector)
-L(cross_page_1_vector):
-	/* Less than 32 bytes to check, try one xmm vector.  */
-	cmpl	$(PAGE_SIZE - 16), %eax
-	jg	L(cross_page_1_xmm)
-	VMOVU	(%rdi, %rdx), %XMM0
+	movl	$(24 / SIZE_OF_CHAR), %OFFSET_REG
+	subl	%eax, %OFFSET_REG
 
-	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in XMM0 and 16 bytes at (%rsi, %rdx).  */
-	VPCMP	$0, (%rsi, %rdx), %XMM0, %k1{%k2}
+	vmovq	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0
+	vmovq	(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm1
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, %xmm1, %xmm0, %k1{%k2}
 	kmovd	%k1, %ecx
 # ifdef USE_AS_WCSCMP
-	subl	$0xf, %ecx
+	subl	$0x3, %ecx
 # else
-	subl	$0xffff, %ecx
+	incb	%cl
 # endif
-	jne	L(last_vector)
+	jnz	L(check_ret_vec_page_cross)
+
 
-	addl	$16, %edx
-# ifndef USE_AS_WCSCMP
-	addl	$16, %eax
-# endif
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	addl	$(8 / SIZE_OF_CHAR), %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subq	$-(CHAR_PER_VEC * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+# else
+	leaq	(8 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	(8 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
 # endif
+	jmp	L(prepare_loop_aligned)
 
-L(cross_page_1_xmm):
-# ifndef USE_AS_WCSCMP
-	/* Less than 16 bytes to check, try 8 byte vector.  NB: No need
-	   for wcscmp nor wcsncmp since wide char is 4 bytes.   */
-	cmpl	$(PAGE_SIZE - 8), %eax
-	jg	L(cross_page_8bytes)
-	vmovq	(%rdi, %rdx), %XMM0
-	vmovq	(%rsi, %rdx), %XMM1
 
-	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in XMM0 and XMM1.  */
-	VPCMP	$0, %XMM1, %XMM0, %k1{%k2}
-	kmovb	%k1, %ecx
+
+
+	.p2align 4,, 10
+L(less_8_till_page):
 # ifdef USE_AS_WCSCMP
-	subl	$0x3, %ecx
+	/* If using wchar then this is the only check before we reach
+	   the page boundary.  */
+	movl	(%rdi), %eax
+	movl	(%rsi), %ecx
+	cmpl	%ecx, %eax
+	jnz	L(ret_less_8_wcs)
+#  ifdef USE_AS_STRNCMP
+	addq	$-(CHAR_PER_VEC * 2), %rdx
+	/* We already checked for len <= 1 so cannot hit that case here.
+	 */
+#  endif
+	testl	%eax, %eax
+	jnz	L(prepare_loop)
+	ret
+
+	.p2align 4,, 8
+L(ret_less_8_wcs):
+	setl	%OFFSET_REG8
+	negl	%OFFSET_REG
+	movl	%OFFSET_REG, %eax
+	xorl	%r8d, %eax
+	ret
+
 # else
-	subl	$0xff, %ecx
-# endif
-	jne	L(last_vector)
+	cmpl	$28, %eax
+	ja	L(less_4_till_page)
+
+	vmovd	(%rdi), %xmm0
+	vmovd	(%rsi), %xmm1
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, %xmm1, %xmm0, %k1{%k2}
+	kmovd	%k1, %ecx
+	subl	$0xf, %ecx
+	jnz	L(check_ret_vec_page_cross)
 
-	addl	$8, %edx
-	addl	$8, %eax
 #  ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$4, %rdx
+	jbe	L(ret_zero_page_cross_slow_case1)
 #  endif
+	movl	$(28 / SIZE_OF_CHAR), %OFFSET_REG
+	subl	%eax, %OFFSET_REG
 
-L(cross_page_8bytes):
-	/* Less than 8 bytes to check, try 4 byte vector.  */
-	cmpl	$(PAGE_SIZE - 4), %eax
-	jg	L(cross_page_4bytes)
-	vmovd	(%rdi, %rdx), %XMM0
-	vmovd	(%rsi, %rdx), %XMM1
-
-	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in XMM0 and XMM1.  */
-	VPCMP	$0, %XMM1, %XMM0, %k1{%k2}
+	vmovd	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0
+	vmovd	(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm1
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, %xmm1, %xmm0, %k1{%k2}
 	kmovd	%k1, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0x1, %ecx
-# else
 	subl	$0xf, %ecx
-# endif
-	jne	L(last_vector)
+	jnz	L(check_ret_vec_page_cross)
+#  ifdef USE_AS_STRNCMP
+	addl	$(4 / SIZE_OF_CHAR), %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case1)
+	subq	$-(CHAR_PER_VEC * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+#  else
+	leaq	(4 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	(4 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+#  endif
+	jmp	L(prepare_loop_aligned)
+
 
-	addl	$4, %edx
 #  ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	.p2align 4,, 2
+L(ret_zero_page_cross_slow_case1):
+	xorl	%eax, %eax
+	ret
 #  endif
 
-L(cross_page_4bytes):
-# endif
-	/* Less than 4 bytes to check, try one byte/dword at a time.  */
-# ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
-# endif
-# ifdef USE_AS_WCSCMP
-	movl	(%rdi, %rdx), %eax
-	movl	(%rsi, %rdx), %ecx
-# else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %ecx
-# endif
-	testl	%eax, %eax
-	jne	L(cross_page_loop)
+	.p2align 4,, 10
+L(less_4_till_page):
+	subq	%rdi, %rsi
+	/* Extremely slow byte comparison loop.  */
+L(less_4_loop):
+	movzbl	(%rdi), %eax
+	movzbl	(%rsi, %rdi), %ecx
 	subl	%ecx, %eax
+	jnz	L(ret_less_4_loop)
+	testl	%ecx, %ecx
+	jz	L(ret_zero_4_loop)
+#  ifdef USE_AS_STRNCMP
+	decq	%rdx
+	jz	L(ret_zero_4_loop)
+#  endif
+	incq	%rdi
+	/* End condition is reaching the page boundary (rdi is aligned).  */
+	testl	$31, %edi
+	jnz	L(less_4_loop)
+	leaq	-(VEC_SIZE * 4)(%rdi, %rsi), %rsi
+	addq	$-(VEC_SIZE * 4), %rdi
+#  ifdef USE_AS_STRNCMP
+	subq	$-(CHAR_PER_VEC * 4), %rdx
+#  endif
+	jmp	L(prepare_loop_aligned)
+
+L(ret_zero_4_loop):
+	xorl	%eax, %eax
+	ret
+L(ret_less_4_loop):
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 	ret
-END (STRCMP)
+# endif
+END(STRCMP)
 #endif
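
Two of the tricks above may be easier to follow with a C model. First,
the vpternlogd immediates: for each bit position the instruction
computes imm8[(dst << 2) | (src1 << 1) | src2], so the immediate is
simply the truth table of the desired 3-input function. A small sketch
(illustrative only, not part of the patch) deriving the two immediates
used in the loop:

#include <stdio.h>

/* Build the vpternlogd immediate for a 3-input boolean function.  */
static unsigned
imm8 (int (*f) (int, int, int))
{
  unsigned v = 0;
  for (int i = 0; i < 8; i++)
    if (f ((i >> 2) & 1, (i >> 1) & 1, i & 1))
      v |= 1u << i;
  return v;
}

/* dst = src1 | (dst ^ src2): xor the memory operand with YMM6 while
   ORing in YMM1.  */
static int or_xor (int dst, int s1, int s2) { return s1 | (dst ^ s2); }
/* dst = dst | src1 | src2: or together YMM3, YMM5 and YMM6.  */
static int or3 (int dst, int s1, int s2) { return dst | s1 | s2; }

int
main (void)
{
  /* Prints 0xde 0xfe, matching the immediates used in the loop.  */
  printf ("0x%x 0x%x\n", imm8 (or_xor), imm8 (or3));
  return 0;
}

Second, the xorl/subl pair with %r8d on the byte-string return paths is
a branchless conditional negate, needed because the page cross logic may
have swapped rdi and rsi. A minimal model (again illustrative only):

#include <stdio.h>

/* r8 is 0 when rdi/rsi were not swapped and -1 when they were.  */
static int
fixup (unsigned char a, unsigned char b, int r8)
{
  int eax = (int) a - (int) b;	/* subl	%ecx, %eax */
  eax ^= r8;			/* xorl	%r8d, %eax */
  eax -= r8;			/* subl	%r8d, %eax */
  return eax;			/* r8 == -1 negates, r8 == 0 is a no-op.  */
}

int
main (void)
{
  /* Prints -2 2: with r8 == -1 the result is exactly -(a - b).  */
  printf ("%d %d\n", fixup ('a', 'c', 0), fixup ('a', 'c', -1));
  return 0;
}
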
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v1 3/5] string: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp].
  2022-01-09 12:29 [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755] Noah Goldstein
  2022-01-09 12:29 ` [PATCH v1 2/5] x86: Optimize strcmp-evex.S " Noah Goldstein
@ 2022-01-09 12:29 ` Noah Goldstein
  2022-01-09 12:29 ` [PATCH v1 4/5] string: Improve coverage in test-strcmp.c and test-strncmp.c Noah Goldstein
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-09 12:29 UTC (permalink / raw)
  To: libc-alpha

These implementations are incorrect: there may be a mismatch in s1/s2
before invalid memory with no null CHAR / length boundary in between,
in which case computing the full string length up front reads past the
mismatch into the invalid memory.
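
A minimal sketch of such a layout (illustrative only; the guard-page
setup here is assumed, not taken from the test harness):

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main (void)
{
  size_t page = (size_t) sysconf (_SC_PAGESIZE);
  char *buf = mmap (NULL, 2 * page, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  mprotect (buf + page, page, PROT_NONE);	/* guard page */
  memset (buf, 'a', page);			/* no null CHAR anywhere */

  char *s1 = buf + page - 8;
  char *s2 = buf + page - 16;
  s2[0] = 'b';					/* mismatch at index 0 */

  /* An implementation that stops at the first mismatch never touches
     the guard page, but stupid_strcmp called strlen (s1) and
     strlen (s2) before comparing, which walks into the guard page and
     faults.  */
  return 0;
}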

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 string/test-strcmp.c  | 35 -----------------------------------
 string/test-strncmp.c | 34 ----------------------------------
 2 files changed, 69 deletions(-)

diff --git a/string/test-strcmp.c b/string/test-strcmp.c
index 3c75076fb8..97d7bf5043 100644
--- a/string/test-strcmp.c
+++ b/string/test-strcmp.c
@@ -34,7 +34,6 @@
 # define STRLEN wcslen
 # define MEMCPY wmemcpy
 # define SIMPLE_STRCMP simple_wcscmp
-# define STUPID_STRCMP stupid_wcscmp
 # define CHAR wchar_t
 # define UCHAR wchar_t
 # define CHARBYTES 4
@@ -64,25 +63,6 @@ simple_wcscmp (const wchar_t *s1, const wchar_t *s2)
   return c1 < c2 ? -1 : 1;
 }
 
-int
-stupid_wcscmp (const wchar_t *s1, const wchar_t *s2)
-{
-  size_t ns1 = wcslen (s1) + 1;
-  size_t ns2 = wcslen (s2) + 1;
-  size_t n = ns1 < ns2 ? ns1 : ns2;
-  int ret = 0;
-
-  wchar_t c1, c2;
-
-  while (n--) {
-    c1 = *s1++;
-    c2 = *s2++;
-    if ((ret = c1 < c2 ? -1 : c1 == c2 ? 0 : 1) != 0)
-      break;
-  }
-  return ret;
-}
-
 #else
 # include <limits.h>
 
@@ -92,7 +72,6 @@ stupid_wcscmp (const wchar_t *s1, const wchar_t *s2)
 # define STRLEN strlen
 # define MEMCPY memcpy
 # define SIMPLE_STRCMP simple_strcmp
-# define STUPID_STRCMP stupid_strcmp
 # define CHAR char
 # define UCHAR unsigned char
 # define CHARBYTES 1
@@ -113,24 +92,10 @@ simple_strcmp (const char *s1, const char *s2)
   return ret;
 }
 
-int
-stupid_strcmp (const char *s1, const char *s2)
-{
-  size_t ns1 = strlen (s1) + 1;
-  size_t ns2 = strlen (s2) + 1;
-  size_t n = ns1 < ns2 ? ns1 : ns2;
-  int ret = 0;
-
-  while (n--)
-    if ((ret = *(unsigned char *) s1++ - *(unsigned char *) s2++) != 0)
-      break;
-  return ret;
-}
 #endif
 
 typedef int (*proto_t) (const CHAR *, const CHAR *);
 
-IMPL (STUPID_STRCMP, 1)
 IMPL (SIMPLE_STRCMP, 1)
 IMPL (STRCMP, 1)
 
diff --git a/string/test-strncmp.c b/string/test-strncmp.c
index e7d5edea39..61a283a0af 100644
--- a/string/test-strncmp.c
+++ b/string/test-strncmp.c
@@ -33,7 +33,6 @@
 # define STRDUP wcsdup
 # define MEMCPY wmemcpy
 # define SIMPLE_STRNCMP simple_wcsncmp
-# define STUPID_STRNCMP stupid_wcsncmp
 # define CHAR wchar_t
 # define UCHAR wchar_t
 # define CHARBYTES 4
@@ -57,25 +56,6 @@ simple_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n)
   return 0;
 }
 
-int
-stupid_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n)
-{
-  wchar_t c1, c2;
-  size_t ns1 = wcsnlen (s1, n) + 1, ns2 = wcsnlen (s2, n) + 1;
-
-  n = ns1 < n ? ns1 : n;
-  n = ns2 < n ? ns2 : n;
-
-  while (n--)
-    {
-      c1 = *s1++;
-      c2 = *s2++;
-      if (c1 != c2)
-	return c1 > c2 ? 1 : -1;
-    }
-  return 0;
-}
-
 #else
 # define L(str) str
 # define STRNCMP strncmp
@@ -83,7 +63,6 @@ stupid_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n)
 # define STRDUP strdup
 # define MEMCPY memcpy
 # define SIMPLE_STRNCMP simple_strncmp
-# define STUPID_STRNCMP stupid_strncmp
 # define CHAR char
 # define UCHAR unsigned char
 # define CHARBYTES 1
@@ -101,23 +80,10 @@ simple_strncmp (const char *s1, const char *s2, size_t n)
   return ret;
 }
 
-int
-stupid_strncmp (const char *s1, const char *s2, size_t n)
-{
-  size_t ns1 = strnlen (s1, n) + 1, ns2 = strnlen (s2, n) + 1;
-  int ret = 0;
-
-  n = ns1 < n ? ns1 : n;
-  n = ns2 < n ? ns2 : n;
-  while (n-- && (ret = *(unsigned char *) s1++ - * (unsigned char *) s2++) == 0);
-  return ret;
-}
-
 #endif
 
 typedef int (*proto_t) (const CHAR *, const CHAR *, size_t);
 
-IMPL (STUPID_STRNCMP, 0)
 IMPL (SIMPLE_STRNCMP, 0)
 IMPL (STRNCMP, 1)
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v1 4/5] string: Improve coverage in test-strcmp.c and test-strncmp.c
  2022-01-09 12:29 [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755] Noah Goldstein
  2022-01-09 12:29 ` [PATCH v1 2/5] x86: Optimize strcmp-evex.S " Noah Goldstein
  2022-01-09 12:29 ` [PATCH v1 3/5] string: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp] Noah Goldstein
@ 2022-01-09 12:29 ` Noah Goldstein
  2022-01-09 12:29 ` [PATCH v1 5/5] benchtests: Add more coverage for strcmp and strncmp benchmarks Noah Goldstein
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-09 12:29 UTC (permalink / raw)
  To: libc-alpha

Add additional test cases for small / medium sizes. Add tests in
strncmp where `n` is near ULONG_MAX to test for overflow bugs in
length handling.
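
One class of bug such lengths can expose is a character-count-to-byte-count
conversion wrapping in 64 bits; a tiny sketch (the length below is
hypothetical, chosen only to make the wrap obvious):

#include <stdio.h>
#include <stddef.h>

int
main (void)
{
  /* Hypothetical wcsncmp length; anything >= 2^62 makes a byte count
     computed as n * sizeof (wchar_t) wrap in 64 bits.  */
  size_t n = (size_t) 1 << 62;
  size_t bytes = n * 4;
  printf ("n = %zu, n * 4 = %zu\n", n, bytes);	/* n * 4 wraps to 0.  */
  return 0;
}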

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 string/test-strcmp.c  |  70 +++++++++++++++++++---
 string/test-strncmp.c | 134 ++++++++++++++++++++++++++++++++++++++----
 2 files changed, 184 insertions(+), 20 deletions(-)

diff --git a/string/test-strcmp.c b/string/test-strcmp.c
index 97d7bf5043..eacbdc8857 100644
--- a/string/test-strcmp.c
+++ b/string/test-strcmp.c
@@ -16,6 +16,9 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
+#define TEST_LEN (4096 * 3)
+#define MIN_PAGE_SIZE (TEST_LEN + 2 * getpagesize ())
+
 #define TEST_MAIN
 #ifdef WIDE
 # define TEST_NAME "wcscmp"
@@ -129,7 +132,7 @@ do_one_test (impl_t *impl,
 
 static void
 do_test (size_t align1, size_t align2, size_t len, int max_char,
-	 int exp_result)
+         int exp_result)
 {
   size_t i;
 
@@ -138,19 +141,22 @@ do_test (size_t align1, size_t align2, size_t len, int max_char,
   if (len == 0)
     return;
 
-  align1 &= 63;
+  align1 &= ~(CHARBYTES - 1);
+  align2 &= ~(CHARBYTES - 1);
+
+  align1 &= getpagesize () - 1;
   if (align1 + (len + 1) * CHARBYTES >= page_size)
     return;
 
-  align2 &= 63;
+  align2 &= getpagesize () - 1;
   if (align2 + (len + 1) * CHARBYTES >= page_size)
     return;
 
   /* Put them close to the end of page.  */
   i = align1 + CHARBYTES * (len + 2);
-  s1 = (CHAR *) (buf1 + ((page_size - i) / 16 * 16) + align1);
+  s1 = (CHAR *)(buf1 + ((page_size - i) / 16 * 16) + align1);
   i = align2 + CHARBYTES * (len + 2);
-  s2 = (CHAR *) (buf2 + ((page_size - i) / 16 * 16)  + align2);
+  s2 = (CHAR *)(buf2 + ((page_size - i) / 16 * 16) + align2);
 
   for (i = 0; i < len; i++)
     s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
@@ -161,9 +167,10 @@ do_test (size_t align1, size_t align2, size_t len, int max_char,
   s2[len - 1] -= exp_result;
 
   FOR_EACH_IMPL (impl, 0)
-    do_one_test (impl, s1, s2, exp_result);
+  do_one_test (impl, s1, s2, exp_result);
 }
 
+
 static void
 do_random_tests (void)
 {
@@ -385,7 +392,7 @@ check3 (void)
 int
 test_main (void)
 {
-  size_t i;
+  size_t i, j;
 
   test_init ();
   check();
@@ -426,6 +433,55 @@ test_main (void)
       do_test (2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, -1);
     }
 
+  for (j = 0; j < 160; ++j)
+    {
+      for (i = 0; i < TEST_LEN;)
+        {
+          do_test (getpagesize () - j - 1, 0, i, 127, 0);
+          do_test (getpagesize () - j - 1, 0, i, 127, 1);
+          do_test (getpagesize () - j - 1, 0, i, 127, -1);
+
+          do_test (getpagesize () - j - 1, j, i, 127, 0);
+          do_test (getpagesize () - j - 1, j, i, 127, 1);
+          do_test (getpagesize () - j - 1, j, i, 127, -1);
+
+          do_test (0, getpagesize () - j - 1, i, 127, 0);
+          do_test (0, getpagesize () - j - 1, i, 127, 1);
+          do_test (0, getpagesize () - j - 1, i, 127, -1);
+
+          do_test (j, getpagesize () - j - 1, i, 127, 0);
+          do_test (j, getpagesize () - j - 1, i, 127, 1);
+          do_test (j, getpagesize () - j - 1, i, 127, -1);
+
+          if (i < 32)
+            {
+              i += 1;
+            }
+          else if (i < 161)
+            {
+              i += 7;
+            }
+          else if (i + 161 < TEST_LEN)
+            {
+              i += 31;
+              i *= 17;
+              i /= 16;
+              if (i + 161 > TEST_LEN)
+                {
+                  i = TEST_LEN - 160;
+                }
+            }
+          else if (i + 32 < TEST_LEN)
+            {
+              i += 7;
+            }
+          else
+            {
+              i += 1;
+            }
+        }
+    }
+
   do_random_tests ();
   return ret;
 }
diff --git a/string/test-strncmp.c b/string/test-strncmp.c
index 61a283a0af..35492f1f68 100644
--- a/string/test-strncmp.c
+++ b/string/test-strncmp.c
@@ -16,6 +16,9 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
+#define TEST_LEN (4096 * 3)
+#define MIN_PAGE_SIZE (TEST_LEN + 2 * getpagesize ())
+
 #define TEST_MAIN
 #ifdef WIDE
 # define TEST_NAME "wcsncmp"
@@ -166,10 +169,10 @@ do_test_limit (size_t align1, size_t align2, size_t len, size_t n, int max_char,
 }
 
 static void
-do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
-	 int exp_result)
+do_test_n (size_t align1, size_t align2, size_t len, size_t n, int n_in_bounds,
+           int max_char, int exp_result)
 {
-  size_t i;
+  size_t i, buf_bound;
   CHAR *s1, *s2;
 
   align1 &= ~(CHARBYTES - 1);
@@ -178,22 +181,28 @@ do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
   if (n == 0)
     return;
 
-  align1 &= 63;
-  if (align1 + (n + 1) * CHARBYTES >= page_size)
+  buf_bound = n_in_bounds ? n : len;
+
+  align1 &= getpagesize () - 1;
+  if (align1 + (buf_bound + 1) * CHARBYTES >= page_size)
     return;
 
-  align2 &= 63;
-  if (align2 + (n + 1) * CHARBYTES >= page_size)
+  align2 &= getpagesize () - 1;
+  if (align2 + (buf_bound + 1) * CHARBYTES >= page_size)
     return;
 
-  s1 = (CHAR *) (buf1 + align1);
-  s2 = (CHAR *) (buf2 + align2);
+  s1 = (CHAR *)(buf1 + align1);
+  s2 = (CHAR *)(buf2 + align2);
 
-  for (i = 0; i < n; i++)
+  if (n_in_bounds)
+    {
+      s1[n] = 24 + exp_result;
+      s2[n] = 23;
+    }
+
+  for (i = 0; i < buf_bound; i++)
     s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
 
-  s1[n] = 24 + exp_result;
-  s2[n] = 23;
   s1[len] = 0;
   s2[len] = 0;
   if (exp_result < 0)
@@ -207,6 +216,13 @@ do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
     do_one_test (impl, s1, s2, n, exp_result);
 }
 
+static void
+do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
+         int exp_result)
+{
+  do_test_n (align1, align2, len, n, 1, max_char, exp_result);
+}
+
 static void
 do_page_test (size_t offset1, size_t offset2, CHAR *s2)
 {
@@ -403,7 +419,7 @@ check3 (void)
 int
 test_main (void)
 {
-  size_t i;
+  size_t i, j;
 
   test_init ();
 
@@ -470,6 +486,98 @@ test_main (void)
       do_test_limit (0, 0, 15 - i, 16 - i, 255, -1);
     }
 
+  for (j = 0; j < 160; ++j)
+    {
+      for (i = 0; i < TEST_LEN;)
+        {
+          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, 0, i, i, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, 0, i, i - 1, 0, 127, 0);
+
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX / 2, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX / 2, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX / 2, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, j, i, i, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, j, i, i - 1, 0, 127, 0);
+
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX / 2, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX / 2, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX / 2, 0, 127, -1);
+
+          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, 0);
+          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, 1);
+          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, -1);
+
+          do_test_n (0, getpagesize () - j - 1, i, i, 0, 127, 0);
+          do_test_n (0, getpagesize () - j - 1, i, i - 1, 0, 127, 0);
+
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 0);
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 1);
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, -1);
+
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX / 2, 0, 127, 0);
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX / 2, 0, 127, 1);
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX / 2, 0, 127, -1);
+
+          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, 0);
+          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, 1);
+          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, -1);
+
+          do_test_n (j, getpagesize () - j - 1, i, i, 0, 127, 0);
+          do_test_n (j, getpagesize () - j - 1, i, i - 1, 0, 127, 0);
+
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 0);
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 1);
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, -1);
+
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX / 2, 0, 127, 0);
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX / 2, 0, 127, 1);
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX / 2, 0, 127, -1);
+          if (i < 32)
+            {
+              i += 1;
+            }
+          else if (i < 161)
+            {
+              i += 7;
+            }
+          else if (i + 161 < TEST_LEN)
+            {
+              i += 31;
+              i *= 17;
+              i /= 16;
+              if (i + 161 > TEST_LEN)
+                {
+                  i = TEST_LEN - 160;
+                }
+            }
+          else if (i + 32 < TEST_LEN)
+            {
+              i += 7;
+            }
+          else
+            {
+              i += 1;
+            }
+        }
+    }
+
   do_random_tests ();
   return ret;
 }
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v1 5/5] benchtests: Add more coverage for strcmp and strncmp benchmarks
  2022-01-09 12:29 [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755] Noah Goldstein
                   ` (2 preceding siblings ...)
  2022-01-09 12:29 ` [PATCH v1 4/5] string: Improve coverage in test-strcmp.c and test-strncmp.c Noah Goldstein
@ 2022-01-09 12:29 ` Noah Goldstein
  2022-01-09 12:35 ` [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755] Noah Goldstein
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-09 12:29 UTC (permalink / raw)
  To: libc-alpha

Add more small and medium sized tests for strcmp and strncmp.

Also add an option to the strcmp benchmark for more direct control of
alignment. Previously alignment was always pushed to the end of the
page. While this is the most difficult case to implement, it is far
from the common case and so shouldn't be the only case benchmarked.
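
For reference, the end-of-page placement (kept behind the new at_end
flag) makes the string finish just before the page boundary; a quick
worked sketch of that arithmetic with assumed example values (page size
4096, align1 3, len 10, 1-byte chars):

#include <stdio.h>

int
main (void)
{
  /* Same placement formula as the benchmark's at_end path.  */
  size_t page_size = 4096, align1 = 3, len = 10, charbytes = 1;
  size_t i = align1 + charbytes * (len + 2);
  size_t off = (page_size - i) / 16 * 16 + align1;
  /* s1 = buf1 + off: its 10 characters occupy offsets 4083-4092 and
     the terminating null lands at 4093, just short of the page end.  */
  printf ("off = %zu, end of string = %zu\n", off, off + len);
  return 0;
}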

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 benchtests/bench-strcmp.c  | 142 ++++++++++++++++++++++++++-----------
 benchtests/bench-strncmp.c | 110 ++++++++++++++++++++--------
 2 files changed, 183 insertions(+), 69 deletions(-)

diff --git a/benchtests/bench-strcmp.c b/benchtests/bench-strcmp.c
index 387e76fcfb..3a60edfb15 100644
--- a/benchtests/bench-strcmp.c
+++ b/benchtests/bench-strcmp.c
@@ -99,8 +99,8 @@ do_one_test (json_ctx_t *json_ctx, impl_t *impl,
 }
 
 static void
-do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, int
-	 max_char, int exp_result)
+do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len,
+         int max_char, int exp_result, int at_end)
 {
   size_t i;
 
@@ -109,19 +109,28 @@ do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, int
   if (len == 0)
     return;
 
-  align1 &= 63;
+  align1 &= ~(CHARBYTES - 1);
+  align2 &= ~(CHARBYTES - 1);
+
+  align1 &= (getpagesize () - 1);
   if (align1 + (len + 1) * CHARBYTES >= page_size)
     return;
 
-  align2 &= 63;
+  align2 &= (getpagesize () - 1);
   if (align2 + (len + 1) * CHARBYTES >= page_size)
     return;
 
   /* Put them close to the end of page.  */
-  i = align1 + CHARBYTES * (len + 2);
-  s1 = (CHAR *) (buf1 + ((page_size - i) / 16 * 16) + align1);
-  i = align2 + CHARBYTES * (len + 2);
-  s2 = (CHAR *) (buf2 + ((page_size - i) / 16 * 16)  + align2);
+  if (at_end)
+    {
+      i = align1 + CHARBYTES * (len + 2);
+      align1 = ((page_size - i) / 16 * 16) + align1;
+      i = align2 + CHARBYTES * (len + 2);
+      align2 = ((page_size - i) / 16 * 16) + align2;
+    }
+
+  s1 = (CHAR *)(buf1 + align1);
+  s2 = (CHAR *)(buf2 + align2);
 
   for (i = 0; i < len; i++)
     s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
@@ -132,9 +141,9 @@ do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, int
   s2[len - 1] -= exp_result;
 
   json_element_object_begin (json_ctx);
-  json_attr_uint (json_ctx, "length", (double) len);
-  json_attr_uint (json_ctx, "align1", (double) align1);
-  json_attr_uint (json_ctx, "align2", (double) align2);
+  json_attr_uint (json_ctx, "length", (double)len);
+  json_attr_uint (json_ctx, "align1", (double)align1);
+  json_attr_uint (json_ctx, "align2", (double)align2);
   json_array_begin (json_ctx, "timings");
 
   FOR_EACH_IMPL (impl, 0)
@@ -202,7 +211,8 @@ int
 test_main (void)
 {
   json_ctx_t json_ctx;
-  size_t i;
+  size_t i, j, k;
+  size_t pg_sz = getpagesize ();
 
   test_init ();
 
@@ -221,36 +231,88 @@ test_main (void)
   json_array_end (&json_ctx);
 
   json_array_begin (&json_ctx, "results");
-
-  for (i = 1; i < 32; ++i)
-    {
-      do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 0);
-      do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 1);
-      do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, -1);
-    }
-
-  for (i = 1; i < 10 + CHARBYTESLOG; ++i)
+  for (k = 0; k < 2; ++k)
     {
-      do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 0);
-      do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 0);
-      do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 1);
-      do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 1);
-      do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, -1);
-      do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, -1);
-      do_test (&json_ctx, 0, CHARBYTES * i, 2 << i, MIDCHAR, 1);
-      do_test (&json_ctx, CHARBYTES * i, CHARBYTES * (i + 1), 2 << i, LARGECHAR, 1);
+      for (i = 1; i < 32; ++i)
+        {
+          do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 0, k);
+          do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 1, k);
+          do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, -1, k);
+        }
+
+      for (i = 1; i <= 8192;)
+        {
+          /* No page crosses.  */
+          do_test (&json_ctx, 0, 0, i, MIDCHAR, 0, k);
+          do_test (&json_ctx, i * CHARBYTES, 0, i, MIDCHAR, 0, k);
+          do_test (&json_ctx, 0, i * CHARBYTES, i, MIDCHAR, 0, k);
+
+          /* False page crosses.  */
+          do_test (&json_ctx, pg_sz / 2, pg_sz / 2 - CHARBYTES, i, MIDCHAR, 0,
+                   k);
+          do_test (&json_ctx, pg_sz / 2 - CHARBYTES, pg_sz / 2, i, MIDCHAR, 0,
+                   k);
+
+          do_test (&json_ctx, pg_sz - (i * CHARBYTES), 0, i, MIDCHAR, 0, k);
+          do_test (&json_ctx, 0, pg_sz - (i * CHARBYTES), i, MIDCHAR, 0, k);
+
+          /* Real page cross.  */
+          for (j = 16; j < 128; j += 16)
+            {
+              do_test (&json_ctx, pg_sz - j, 0, i, MIDCHAR, 0, k);
+              do_test (&json_ctx, 0, pg_sz - j, i, MIDCHAR, 0, k);
+
+              do_test (&json_ctx, pg_sz - j, pg_sz - j / 2, i, MIDCHAR, 0, k);
+              do_test (&json_ctx, pg_sz - j / 2, pg_sz - j, i, MIDCHAR, 0, k);
+            }
+
+          if (i < 32)
+            {
+              ++i;
+            }
+          else if (i < 160)
+            {
+              i += 8;
+            }
+          else if (i < 512)
+            {
+              i += 32;
+            }
+          else
+            {
+              i *= 2;
+            }
+        }
+
+      for (i = 1; i < 10 + CHARBYTESLOG; ++i)
+        {
+          do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 0, k);
+          do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 0, k);
+          do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 1, k);
+          do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 1, k);
+          do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, -1, k);
+          do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, -1, k);
+          do_test (&json_ctx, 0, CHARBYTES * i, 2 << i, MIDCHAR, 1, k);
+          do_test (&json_ctx, CHARBYTES * i, CHARBYTES * (i + 1), 2 << i,
+                   LARGECHAR, 1, k);
+        }
+
+      for (i = 1; i < 8; ++i)
+        {
+          do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i,
+                   MIDCHAR, 0, k);
+          do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i,
+                   LARGECHAR, 0, k);
+          do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i,
+                   MIDCHAR, 1, k);
+          do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i,
+                   LARGECHAR, 1, k);
+          do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i,
+                   MIDCHAR, -1, k);
+          do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i,
+                   LARGECHAR, -1, k);
+        }
     }
-
-  for (i = 1; i < 8; ++i)
-    {
-      do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, MIDCHAR, 0);
-      do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, 0);
-      do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, MIDCHAR, 1);
-      do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, 1);
-      do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, MIDCHAR, -1);
-      do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, -1);
-    }
-
   do_test_page_boundary (&json_ctx);
 
   json_array_end (&json_ctx);
diff --git a/benchtests/bench-strncmp.c b/benchtests/bench-strncmp.c
index b7a01fde64..6673a53521 100644
--- a/benchtests/bench-strncmp.c
+++ b/benchtests/bench-strncmp.c
@@ -150,43 +150,43 @@ do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, size_t
   if (n == 0)
     return;
 
-  align1 &= 63;
+  align1 &= getpagesize () - 1;
   if (align1 + (n + 1) * CHARBYTES >= page_size)
     return;
 
-  align2 &= 7;
+  align2 &= getpagesize () - 1;
   if (align2 + (n + 1) * CHARBYTES >= page_size)
     return;
 
   json_element_object_begin (json_ctx);
-  json_attr_uint (json_ctx, "strlen", (double) len);
-  json_attr_uint (json_ctx, "len", (double) n);
-  json_attr_uint (json_ctx, "align1", (double) align1);
-  json_attr_uint (json_ctx, "align2", (double) align2);
+  json_attr_uint (json_ctx, "strlen", (double)len);
+  json_attr_uint (json_ctx, "len", (double)n);
+  json_attr_uint (json_ctx, "align1", (double)align1);
+  json_attr_uint (json_ctx, "align2", (double)align2);
   json_array_begin (json_ctx, "timings");
 
   FOR_EACH_IMPL (impl, 0)
-    {
-      alloc_bufs ();
-      s1 = (CHAR *) (buf1 + align1);
-      s2 = (CHAR *) (buf2 + align2);
-
-      for (i = 0; i < n; i++)
-	s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
-
-      s1[n] = 24 + exp_result;
-      s2[n] = 23;
-      s1[len] = 0;
-      s2[len] = 0;
-      if (exp_result < 0)
-	s2[len] = 32;
-      else if (exp_result > 0)
-	s1[len] = 64;
-      if (len >= n)
-	s2[n - 1] -= exp_result;
+  {
+    alloc_bufs ();
+    s1 = (CHAR *)(buf1 + align1);
+    s2 = (CHAR *)(buf2 + align2);
+
+    for (i = 0; i < n; i++)
+      s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
+
+    s1[n] = 24 + exp_result;
+    s2[n] = 23;
+    s1[len] = 0;
+    s2[len] = 0;
+    if (exp_result < 0)
+      s2[len] = 32;
+    else if (exp_result > 0)
+      s1[len] = 64;
+    if (len >= n)
+      s2[n - 1] -= exp_result;
 
-      do_one_test (json_ctx, impl, s1, s2, n, exp_result);
-    }
+    do_one_test (json_ctx, impl, s1, s2, n, exp_result);
+  }
 
   json_array_end (json_ctx);
   json_element_object_end (json_ctx);
@@ -319,7 +319,8 @@ int
 test_main (void)
 {
   json_ctx_t json_ctx;
-  size_t i;
+  size_t i, j, len;
+  size_t pg_sz = getpagesize ();
 
   test_init ();
 
@@ -334,12 +335,12 @@ test_main (void)
 
   json_array_begin (&json_ctx, "ifuncs");
   FOR_EACH_IMPL (impl, 0)
-    json_element_string (&json_ctx, impl->name);
+  json_element_string (&json_ctx, impl->name);
   json_array_end (&json_ctx);
 
   json_array_begin (&json_ctx, "results");
 
-  for (i =0; i < 16; ++i)
+  for (i = 0; i < 16; ++i)
     {
       do_test (&json_ctx, 0, 0, 8, i, 127, 0);
       do_test (&json_ctx, 0, 0, 8, i, 127, -1);
@@ -361,6 +362,57 @@ test_main (void)
       do_test (&json_ctx, i, 3 * i, 8, i, 255, -1);
     }
 
+  for (len = 0; len <= 128; len += 64)
+    {
+      for (i = 1; i <= 8192;)
+        {
+          /* No page crosses.  */
+          do_test (&json_ctx, 0, 0, i, i + len, 127, 0);
+          do_test (&json_ctx, i * CHARBYTES, 0, i, i + len, 127, 0);
+          do_test (&json_ctx, 0, i * CHARBYTES, i, i + len, 127, 0);
+
+          /* False page crosses.  */
+          do_test (&json_ctx, pg_sz / 2, pg_sz / 2 - CHARBYTES, i, i + len,
+                   127, 0);
+          do_test (&json_ctx, pg_sz / 2 - CHARBYTES, pg_sz / 2, i, i + len,
+                   127, 0);
+
+          do_test (&json_ctx, pg_sz - (i * CHARBYTES), 0, i, i + len, 127,
+                   0);
+          do_test (&json_ctx, 0, pg_sz - (i * CHARBYTES), i, i + len, 127,
+                   0);
+
+          /* Real page cross.  */
+          for (j = 16; j < 128; j += 16)
+            {
+              do_test (&json_ctx, pg_sz - j, 0, i, i + len, 127, 0);
+              do_test (&json_ctx, 0, pg_sz - j, i, i + len, 127, 0);
+
+              do_test (&json_ctx, pg_sz - j, pg_sz - j / 2, i, i + len,
+                       127, 0);
+              do_test (&json_ctx, pg_sz - j / 2, pg_sz - j, i, i + len,
+                       127, 0);
+            }
+
+          if (i < 32)
+            {
+              ++i;
+            }
+          else if (i < 160)
+            {
+              i += 8;
+            }
+          else if (i < 256)
+            {
+              i += 32;
+            }
+          else
+            {
+              i *= 2;
+            }
+        }
+    }
+
   for (i = 1; i < 8; ++i)
     {
       do_test (&json_ctx, 0, 0, 8 << i, 16 << i, 127, 0);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755]
  2022-01-09 12:29 [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755] Noah Goldstein
                   ` (3 preceding siblings ...)
  2022-01-09 12:29 ` [PATCH v1 5/5] benchtests: Add more coverage for strcmp and strncmp benchmarks Noah Goldstein
@ 2022-01-09 12:35 ` Noah Goldstein
  2022-01-09 14:07   ` H.J. Lu
  2022-01-10  0:27 ` [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S " Noah Goldstein
  2022-01-10 21:35 ` [PATCH v3 " Noah Goldstein
  6 siblings, 1 reply; 59+ messages in thread
From: Noah Goldstein @ 2022-01-09 12:35 UTC (permalink / raw)
  To: GNU C Library

[-- Attachment #1: Type: text/plain, Size: 59095 bytes --]

On Sun, Jan 9, 2022 at 6:30 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
> __wcscmp_avx2. For x86_64 this covers the entire address range so any
> length larger could not possibly be used to bound `s1` or `s2`.
>
> Optimization are primarily to the loop logic and how the page cross
> logic interacts with the loop.
>
> The page cross logic is at times more expensive for short strings near
> the end of a page but not crossing the page. This is done to retest
> the page cross conditions with a non-faulty check and to improve the
> logic for entering the loop afterwards. This is only particular cases,
> however, and is general made up for by more than 10x improvements on
> the transition from the page cross -> loop case.
>
> The non-page cross cases are improved most for smaller sizes [0, 128]
> and go about even for (128, 4096]. The loop page cross logic is
> improved so some more significant speedup is seen there as well.
>
> test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
>
> Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> ---
> Numbers attached in reply.
>
> Numbers are geometric mean of N=20 runs.
> Numbers where collected on: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i71165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html
>
> The 'score' column is (time current) / (time new). The "greener" the
> number the the larger the improvement. The "redder" the larger the
> regression.
>
> Some notes on the numbers:
>
> There are three cases of regressions:
>
> 1. Small values at the page cross case. The regression is because the
> new logic spends extra logic checking if the page cross was a false
> positive and setting up the logic to transition to the loop case more
> smoothly. I don't see any way around this and on the flip side of the
> regressions is 500% speedups in either the false positive case or
> contuation.
>
> 2. Cases where the string barely crosses the page. The regression is
> because the current logic does a single byte loop on exit which is
> ultimately a faster check for very small strings. The flip side of
> this is 20000% speedups. I think the logic that has us implement
> strncmp with vectors also supports replacing the one at a time byte
> loop for something that scales better.
>
> 3. The avx2 case for [128, 512] is within [-5%, +5%]. There are some
> regressions here. I am unsure what exacting why this is the case. In
> general I am less happy with the quality of the avx2 implementation
> and believe it still needs some work. I still think its an improvement
> because of the gains in the [0, 128] case, many of the page cross
> cases and the [513, inf] cases but if people think otherwise it may be
> best to skip the patch. Note the patch is also for [BZ# 28755]
> although a seperate fix for that will be simple enough.
>
> Aside from the 3 regressions there are mostly modest improvements then
> some dramatic improvements where the one at a time byte loops where
> eliminated.
>
>
>  sysdeps/x86_64/multiarch/strcmp-avx2.S | 1586 ++++++++++++++----------
>  1 file changed, 942 insertions(+), 644 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> index a45f9d2749..28d6a0025a 100644
> --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> @@ -26,35 +26,57 @@
>
>  # define PAGE_SIZE     4096
>
> -/* VEC_SIZE = Number of bytes in a ymm register */
> +       /* VEC_SIZE = Number of bytes in a ymm register.  */
>  # define VEC_SIZE      32
>
> -/* Shift for dividing by (VEC_SIZE * 4).  */
> -# define DIVIDE_BY_VEC_4_SHIFT 7
> -# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
> -#  error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
> -# endif
> +# define VMOVU vmovdqu
> +# define VMOVA vmovdqa
>
>  # ifdef USE_AS_WCSCMP
> -/* Compare packed dwords.  */
> +       /* Compare packed dwords.  */
>  #  define VPCMPEQ      vpcmpeqd
> -/* Compare packed dwords and store minimum.  */
> +       /* Compare packed dwords and store minimum.  */
>  #  define VPMINU       vpminud
> -/* 1 dword char == 4 bytes.  */
> +       /* 1 dword char == 4 bytes.  */
>  #  define SIZE_OF_CHAR 4
>  # else
> -/* Compare packed bytes.  */
> +       /* Compare packed bytes.  */
>  #  define VPCMPEQ      vpcmpeqb
> -/* Compare packed bytes and store minimum.  */
> +       /* Compare packed bytes and store minimum.  */
>  #  define VPMINU       vpminub
> -/* 1 byte char == 1 byte.  */
> +       /* 1 byte char == 1 byte.  */
>  #  define SIZE_OF_CHAR 1
>  # endif
>
> +# ifdef USE_AS_STRNCMP
> +#  define LOOP_REG     r9d
> +#  define LOOP_REG64   r9
> +
> +#  define OFFSET_REG8  r9b
> +#  define OFFSET_REG   r9d
> +#  define OFFSET_REG64 r9
> +# else
> +#  define LOOP_REG     edx
> +#  define LOOP_REG64   rdx
> +
> +#  define OFFSET_REG8  dl
> +#  define OFFSET_REG   edx
> +#  define OFFSET_REG64 rdx
> +# endif
> +
>  # ifndef VZEROUPPER
>  #  define VZEROUPPER   vzeroupper
>  # endif
>
> +# if defined USE_AS_STRNCMP
> +#  define VEC_OFFSET   0
> +# else
> +#  define VEC_OFFSET   (-VEC_SIZE)
> +# endif
> +
> +# define xmmZERO       xmm15
> +# define ymmZERO       ymm15
> +
>  # ifndef SECTION
>  #  define SECTION(p)   p##.avx
>  # endif
> @@ -79,773 +101,1049 @@
>     the maximum offset is reached before a difference is found, zero is
>     returned.  */
>
> -       .section SECTION(.text),"ax",@progbits
> -ENTRY (STRCMP)
> +       .section SECTION(.text), "ax", @progbits
> +ENTRY(STRCMP)
>  # ifdef USE_AS_STRNCMP
> -       /* Check for simple cases (0 or 1) in offset.  */
> +#  ifdef __ILP32__
> +       /* Clear the upper 32 bits.  */
> +       movl    %edx, %rdx
> +#  endif
>         cmp     $1, %RDX_LP
> -       je      L(char0)
> -       jb      L(zero)
> +       /* Signed comparison intentional. We use this branch to also
> +          test cases where length >= 2^63. These very large sizes can be
> +          handled with strcmp as there is no way for that length to
> +          actually bound the buffer.  */
> +       jle     L(one_or_less)
>  #  ifdef USE_AS_WCSCMP
> -       /* Convert units: from wide to byte char.  */
> -       shl     $2, %RDX_LP
> +       movq    %rdx, %rcx
> +
> +       /* Multiplying length by sizeof(wchar_t) can result in overflow.
> +          Check if that is possible. All cases where overflow is possible
> +          are cases where length is large enough that it can never be a
> +          bound on valid memory so just use wcscmp.  */
> +       shrq    $56, %rcx
> +       jnz     __wcscmp_avx2
> +
> +       leaq    (, %rdx, 4), %rdx
>  #  endif
> -       /* Register %r11 tracks the maximum offset.  */
> -       mov     %RDX_LP, %R11_LP
>  # endif
> +       vpxor   %xmmZERO, %xmmZERO, %xmmZERO
>         movl    %edi, %eax
> -       xorl    %edx, %edx
> -       /* Make %xmm7 (%ymm7) all zeros in this function.  */
> -       vpxor   %xmm7, %xmm7, %xmm7
>         orl     %esi, %eax
> -       andl    $(PAGE_SIZE - 1), %eax
> -       cmpl    $(PAGE_SIZE - (VEC_SIZE * 4)), %eax
> -       jg      L(cross_page)
> -       /* Start comparing 4 vectors.  */
> -       vmovdqu (%rdi), %ymm1
> -       VPCMPEQ (%rsi), %ymm1, %ymm0
> -       VPMINU  %ymm1, %ymm0, %ymm0
> -       VPCMPEQ %ymm7, %ymm0, %ymm0
> -       vpmovmskb %ymm0, %ecx
> -       testl   %ecx, %ecx
> -       je      L(next_3_vectors)
> -       tzcntl  %ecx, %edx
> +       sall    $20, %eax
> +       /* Check if s1 or s2 may cross a page  in next 4x VEC loads.  */
> +       cmpl    $((PAGE_SIZE -(VEC_SIZE * 4)) << 20), %eax
> +       ja      L(page_cross)
> +
> +L(no_page_cross):
> +       /* Safe to compare 4x vectors.  */
> +       VMOVU   (%rdi), %ymm0
> +       /* 1s where s1 and s2 equal.  */
> +       VPCMPEQ (%rsi), %ymm0, %ymm1
> +       /* 1s at null CHAR.  */
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       /* 1s where s1 and s2 equal AND not null CHAR.  */
> +       vpandn  %ymm1, %ymm2, %ymm1
> +
> +       /* All 1s -> keep going, any 0s -> return.  */
> +       vpmovmskb %ymm1, %ecx
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the mismatched index (%rdx) is after the maximum
> -          offset (%r11).   */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       cmpq    $VEC_SIZE, %rdx
> +       jbe     L(vec_0_test_len)
>  # endif
> +
> +       /* All 1s represent all equals. incl will overflow to zero in
> +          the all-equals case. Otherwise 1s will carry until the position
> +          of the first mismatch.  */
> +       incl    %ecx
> +       jz      L(more_3x_vec)
> +
> +       .p2align 4,, 4
> +L(return_vec_0):
> +       tzcntl  %ecx, %ecx
>  # ifdef USE_AS_WCSCMP
> +       movl    (%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       je      L(return)
> -L(wcscmp_return):
> +       cmpl    (%rsi, %rcx), %edx
> +       je      L(ret0)
>         setl    %al
>         negl    %eax
>         orl     $1, %eax
> -L(return):
>  # else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> +       movzbl  (%rdi, %rcx), %eax
> +       movzbl  (%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
>  # endif
> +L(ret0):
>  L(return_vzeroupper):
>         ZERO_UPPER_VEC_REGISTERS_RETURN
>
> -       .p2align 4
> -L(return_vec_size):
> -       tzcntl  %ecx, %edx
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after
> -          the maximum offset (%r11).  */
> -       addq    $VEC_SIZE, %rdx
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -#  ifdef USE_AS_WCSCMP
> +       .p2align 4,, 8
> +L(vec_0_test_len):
> +       notl    %ecx
> +       bzhil   %edx, %ecx, %eax
> +       jnz     L(return_vec_0)
> +       /* Align if will cross fetch block.  */
> +       .p2align 4,, 2
> +L(ret_zero):
>         xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> -#  endif
> -# else
> +       VZEROUPPER_RETURN
> +
> +       .p2align 4,, 5
> +L(one_or_less):
> +       jb      L(ret_zero)
>  #  ifdef USE_AS_WCSCMP
> +       /* 'nbe' covers the case where length is negative (large
> +          unsigned).  */
> +       jnbe    __wcscmp_avx2
> +       movl    (%rdi), %edx
>         xorl    %eax, %eax
> -       movl    VEC_SIZE(%rdi, %rdx), %ecx
> -       cmpl    VEC_SIZE(%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> +       cmpl    (%rsi), %edx
> +       je      L(ret1)
> +       setl    %al
> +       negl    %eax
> +       orl     $1, %eax
>  #  else
> -       movzbl  VEC_SIZE(%rdi, %rdx), %eax
> -       movzbl  VEC_SIZE(%rsi, %rdx), %edx
> -       subl    %edx, %eax
> +       /* 'nbe' covers the case where length is negative (large
> +          unsigned).  */
> +
> +       jnbe    __strcmp_avx2
> +       movzbl  (%rdi), %eax
> +       movzbl  (%rsi), %ecx
> +       subl    %ecx, %eax
>  #  endif
> +L(ret1):
> +       ret
>  # endif
> -       VZEROUPPER_RETURN
>
> -       .p2align 4
> -L(return_2_vec_size):
> -       tzcntl  %ecx, %edx
> +       .p2align 4,, 10
> +L(return_vec_1):
> +       tzcntl  %ecx, %ecx
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is
> -          after the maximum offset (%r11).  */
> -       addq    $(VEC_SIZE * 2), %rdx
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -#  ifdef USE_AS_WCSCMP
> +       /* rdx must be > CHAR_PER_VEC so it is safe to subtract without
> +          fear of overflow.  */
> +       addq    $-VEC_SIZE, %rdx
> +       cmpq    %rcx, %rdx
> +       jbe     L(ret_zero)
> +# endif
> +# ifdef USE_AS_WCSCMP
> +       movl    VEC_SIZE(%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    VEC_SIZE(%rsi, %rcx), %edx
> +       je      L(ret2)
> +       setl    %al
> +       negl    %eax
> +       orl     $1, %eax
>  # else
> -#  ifdef USE_AS_WCSCMP
> -       xorl    %eax, %eax
> -       movl    (VEC_SIZE * 2)(%rdi, %rdx), %ecx
> -       cmpl    (VEC_SIZE * 2)(%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (VEC_SIZE * 2)(%rdi, %rdx), %eax
> -       movzbl  (VEC_SIZE * 2)(%rsi, %rdx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       movzbl  VEC_SIZE(%rdi, %rcx), %eax
> +       movzbl  VEC_SIZE(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
>  # endif
> +L(ret2):
>         VZEROUPPER_RETURN
>
> -       .p2align 4
> -L(return_3_vec_size):
> -       tzcntl  %ecx, %edx
> +       .p2align 4,, 10
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is
> -          after the maximum offset (%r11).  */
> -       addq    $(VEC_SIZE * 3), %rdx
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -#  ifdef USE_AS_WCSCMP
> +L(return_vec_3):
> +       salq    $32, %rcx
> +# endif
> +
> +L(return_vec_2):
> +# ifndef USE_AS_STRNCMP
> +       tzcntl  %ecx, %ecx
> +# else
> +       tzcntq  %rcx, %rcx
> +       cmpq    %rcx, %rdx
> +       jbe     L(ret_zero)
> +# endif
> +
> +# ifdef USE_AS_WCSCMP
> +       movl    (VEC_SIZE * 2)(%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    (VEC_SIZE * 2)(%rsi, %rcx), %edx
> +       je      L(ret3)
> +       setl    %al
> +       negl    %eax
> +       orl     $1, %eax
>  # else
> +       movzbl  (VEC_SIZE * 2)(%rdi, %rcx), %eax
> +       movzbl  (VEC_SIZE * 2)(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +# endif
> +L(ret3):
> +       VZEROUPPER_RETURN
> +
> +# ifndef USE_AS_STRNCMP
> +       .p2align 4,, 10
> +L(return_vec_3):
> +       tzcntl  %ecx, %ecx
>  #  ifdef USE_AS_WCSCMP
> +       movl    (VEC_SIZE * 3)(%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (VEC_SIZE * 3)(%rdi, %rdx), %ecx
> -       cmpl    (VEC_SIZE * 3)(%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> +       cmpl    (VEC_SIZE * 3)(%rsi, %rcx), %edx
> +       je      L(ret4)
> +       setl    %al
> +       negl    %eax
> +       orl     $1, %eax
>  #  else
> -       movzbl  (VEC_SIZE * 3)(%rdi, %rdx), %eax
> -       movzbl  (VEC_SIZE * 3)(%rsi, %rdx), %edx
> -       subl    %edx, %eax
> +       movzbl  (VEC_SIZE * 3)(%rdi, %rcx), %eax
> +       movzbl  (VEC_SIZE * 3)(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
>  #  endif
> -# endif
> +L(ret4):
>         VZEROUPPER_RETURN
> +# endif
> +
> +       .p2align 4,, 10
> +L(more_3x_vec):
> +       /* Safe to compare 4x vectors.  */
> +       VMOVU   VEC_SIZE(%rdi), %ymm0
> +       VPCMPEQ VEC_SIZE(%rsi), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +       jnz     L(return_vec_1)
> +
> +# ifdef USE_AS_STRNCMP
> +       subq    $(VEC_SIZE * 2), %rdx
> +       jbe     L(ret_zero)
> +# endif
> +
> +       VMOVU   (VEC_SIZE * 2)(%rdi), %ymm0
> +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +       jnz     L(return_vec_2)
> +
> +       VMOVU   (VEC_SIZE * 3)(%rdi), %ymm0
> +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +       jnz     L(return_vec_3)
>
> -       .p2align 4
> -L(next_3_vectors):
> -       vmovdqu VEC_SIZE(%rdi), %ymm6
> -       VPCMPEQ VEC_SIZE(%rsi), %ymm6, %ymm3
> -       VPMINU  %ymm6, %ymm3, %ymm3
> -       VPCMPEQ %ymm7, %ymm3, %ymm3
> -       vpmovmskb %ymm3, %ecx
> -       testl   %ecx, %ecx
> -       jne     L(return_vec_size)
> -       vmovdqu (VEC_SIZE * 2)(%rdi), %ymm5
> -       vmovdqu (VEC_SIZE * 3)(%rdi), %ymm4
> -       vmovdqu (VEC_SIZE * 3)(%rsi), %ymm0
> -       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm5, %ymm2
> -       VPMINU  %ymm5, %ymm2, %ymm2
> -       VPCMPEQ %ymm4, %ymm0, %ymm0
> -       VPCMPEQ %ymm7, %ymm2, %ymm2
> -       vpmovmskb %ymm2, %ecx
> -       testl   %ecx, %ecx
> -       jne     L(return_2_vec_size)
> -       VPMINU  %ymm4, %ymm0, %ymm0
> -       VPCMPEQ %ymm7, %ymm0, %ymm0
> -       vpmovmskb %ymm0, %ecx
> -       testl   %ecx, %ecx
> -       jne     L(return_3_vec_size)
> -L(main_loop_header):
> -       leaq    (VEC_SIZE * 4)(%rdi), %rdx
> -       movl    $PAGE_SIZE, %ecx
> -       /* Align load via RAX.  */
> -       andq    $-(VEC_SIZE * 4), %rdx
> -       subq    %rdi, %rdx
> -       leaq    (%rdi, %rdx), %rax
>  # ifdef USE_AS_STRNCMP
> -       /* Starting from this point, the maximum offset, or simply the
> -          'offset', DECREASES by the same amount when base pointers are
> -          moved forward.  Return 0 when:
> -            1) On match: offset <= the matched vector index.
> -            2) On mistmach, offset is before the mistmatched index.
> +       cmpq    $(VEC_SIZE * 2), %rdx
> +       jbe     L(ret_zero)
> +# endif
> +
> +# ifdef USE_AS_WCSCMP
> +       /* Any non-zero positive value that doesn't interfere with 0x1.
>          */
> -       subq    %rdx, %r11
> -       jbe     L(zero)
> -# endif
> -       addq    %rsi, %rdx
> -       movq    %rdx, %rsi
> -       andl    $(PAGE_SIZE - 1), %esi
> -       /* Number of bytes before page crossing.  */
> -       subq    %rsi, %rcx
> -       /* Number of VEC_SIZE * 4 blocks before page crossing.  */
> -       shrq    $DIVIDE_BY_VEC_4_SHIFT, %rcx
> -       /* ESI: Number of VEC_SIZE * 4 blocks before page crossing.   */
> -       movl    %ecx, %esi
> -       jmp     L(loop_start)
> +       movl    $2, %r8d
>
> +# else
> +       xorl    %r8d, %r8d
> +# endif
> +
> +       /* The prepare labels are various entry points from the page
> +          cross logic.  */
> +L(prepare_loop):
> +
> +# ifdef USE_AS_STRNCMP
> +       /* Store N + (VEC_SIZE * 4) and place check at the beginning of
> +          the loop.  */
> +       leaq    (VEC_SIZE * 2)(%rdi, %rdx), %rdx
> +# endif
> +L(prepare_loop_no_len):
> +
> +       /* Align s1 and adjust s2 accordingly.  */
> +       subq    %rdi, %rsi
> +       andq    $-(VEC_SIZE * 4), %rdi
> +       addq    %rdi, %rsi
> +
> +# ifdef USE_AS_STRNCMP
> +       subq    %rdi, %rdx
> +# endif
> +
> +L(prepare_loop_aligned):
> +       /* eax stores distance from rsi to next page cross. These cases
> +          need to be handled specially as the 4x loop could potentially
> +          read memory past the length of s1 or s2 and across a page
> +          boundary.  */
> +       movl    $-(VEC_SIZE * 4), %eax
> +       subl    %esi, %eax
> +       andl    $(PAGE_SIZE - 1), %eax
> +
> +       /* Loop 4x comparisons at a time.  */
>         .p2align 4
>  L(loop):
> +
> +       /* End condition for strncmp.  */
>  # ifdef USE_AS_STRNCMP
> -       /* Base pointers are moved forward by 4 * VEC_SIZE.  Decrease
> -          the maximum offset (%r11) by the same amount.  */
> -       subq    $(VEC_SIZE * 4), %r11
> -       jbe     L(zero)
> -# endif
> -       addq    $(VEC_SIZE * 4), %rax
> -       addq    $(VEC_SIZE * 4), %rdx
> -L(loop_start):
> -       testl   %esi, %esi
> -       leal    -1(%esi), %esi
> -       je      L(loop_cross_page)
> -L(back_to_loop):
> -       /* Main loop, comparing 4 vectors are a time.  */
> -       vmovdqa (%rax), %ymm0
> -       vmovdqa VEC_SIZE(%rax), %ymm3
> -       VPCMPEQ (%rdx), %ymm0, %ymm4
> -       VPCMPEQ VEC_SIZE(%rdx), %ymm3, %ymm1
> -       VPMINU  %ymm0, %ymm4, %ymm4
> -       VPMINU  %ymm3, %ymm1, %ymm1
> -       vmovdqa (VEC_SIZE * 2)(%rax), %ymm2
> -       VPMINU  %ymm1, %ymm4, %ymm0
> -       vmovdqa (VEC_SIZE * 3)(%rax), %ymm3
> -       VPCMPEQ (VEC_SIZE * 2)(%rdx), %ymm2, %ymm5
> -       VPCMPEQ (VEC_SIZE * 3)(%rdx), %ymm3, %ymm6
> -       VPMINU  %ymm2, %ymm5, %ymm5
> -       VPMINU  %ymm3, %ymm6, %ymm6
> -       VPMINU  %ymm5, %ymm0, %ymm0
> -       VPMINU  %ymm6, %ymm0, %ymm0
> -       VPCMPEQ %ymm7, %ymm0, %ymm0
> -
> -       /* Test each mask (32 bits) individually because for VEC_SIZE
> -          == 32 is not possible to OR the four masks and keep all bits
> -          in a 64-bit integer register, differing from SSE2 strcmp
> -          where ORing is possible.  */
> -       vpmovmskb %ymm0, %ecx
> +       subq    $(VEC_SIZE * 4), %rdx
> +       jbe     L(ret_zero)
> +# endif
> +
> +       subq    $-(VEC_SIZE * 4), %rdi
> +       subq    $-(VEC_SIZE * 4), %rsi
> +
> +       /* Check if rsi loads will cross a page boundary.  */
> +       addl    $-(VEC_SIZE * 4), %eax
> +       jnb     L(page_cross_during_loop)
> +
> +       /* Loop entry after handling page cross during loop.  */
> +L(loop_skip_page_cross_check):
> +       VMOVA   (VEC_SIZE * 0)(%rdi), %ymm0
> +       VMOVA   (VEC_SIZE * 1)(%rdi), %ymm2
> +       VMOVA   (VEC_SIZE * 2)(%rdi), %ymm4
> +       VMOVA   (VEC_SIZE * 3)(%rdi), %ymm6
> +
> +       /* ymm1 all 1s where s1 and s2 equal. All 0s otherwise.  */
> +       VPCMPEQ (VEC_SIZE * 0)(%rsi), %ymm0, %ymm1
> +
> +       VPCMPEQ (VEC_SIZE * 1)(%rsi), %ymm2, %ymm3
> +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
> +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
> +
> +
> +       /* If any mismatches or null CHAR then 0 CHAR, otherwise non-
> +          zero.  */
> +       vpand   %ymm0, %ymm1, %ymm1
> +
> +
> +       vpand   %ymm2, %ymm3, %ymm3
> +       vpand   %ymm4, %ymm5, %ymm5
> +       vpand   %ymm6, %ymm7, %ymm7
> +
> +       VPMINU  %ymm1, %ymm3, %ymm3
> +       VPMINU  %ymm5, %ymm7, %ymm7
> +
> +       /* Reduce all 0 CHARs for the 4x VEC into ymm7.  */
> +       VPMINU  %ymm3, %ymm7, %ymm7
> +
> +       /* If any 0 CHAR then done.  */
> +       VPCMPEQ %ymm7, %ymmZERO, %ymm7
> +       vpmovmskb %ymm7, %LOOP_REG
> +       testl   %LOOP_REG, %LOOP_REG
> +       jz      L(loop)
> +
> +       /* Find which VEC has the mismatch or end of string.  */
> +       VPCMPEQ %ymm1, %ymmZERO, %ymm1
> +       vpmovmskb %ymm1, %ecx
>         testl   %ecx, %ecx
> -       je      L(loop)
> -       VPCMPEQ %ymm7, %ymm4, %ymm0
> -       vpmovmskb %ymm0, %edi
> -       testl   %edi, %edi
> -       je      L(test_vec)
> -       tzcntl  %edi, %ecx
> +       jnz     L(return_vec_0_end)
> +
> +
> +       VPCMPEQ %ymm3, %ymmZERO, %ymm3
> +       vpmovmskb %ymm3, %ecx
> +       testl   %ecx, %ecx
> +       jnz     L(return_vec_1_end)
> +
> +L(return_vec_2_3_end):
>  # ifdef USE_AS_STRNCMP
> -       cmpq    %rcx, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +       subq    $(VEC_SIZE * 2), %rdx
> +       jbe     L(ret_zero_end)
> +# endif
> +
> +       VPCMPEQ %ymm5, %ymmZERO, %ymm5
> +       vpmovmskb %ymm5, %ecx
> +       testl   %ecx, %ecx
> +       jnz     L(return_vec_2_end)
> +
> +       /* LOOP_REG contains matches for null/mismatch from the loop. If
> +          VEC 0, 1, and 2 all have no null and no mismatches then mismatch
> +          must entirely be from VEC 3 which is fully represented by
> +          LOOP_REG.  */
> +       tzcntl  %LOOP_REG, %LOOP_REG
> +
> +# ifdef USE_AS_STRNCMP
> +       subl    $-(VEC_SIZE), %LOOP_REG
> +       cmpq    %LOOP_REG64, %rdx
> +       jbe     L(ret_zero_end)
> +# endif
> +
> +# ifdef USE_AS_WCSCMP
> +       movl    (VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %ecx
>         xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    (VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
> +       je      L(ret5)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
>  # else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       movzbl  (VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %eax
> +       movzbl  (VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> +L(ret5):
>         VZEROUPPER_RETURN
>
> -       .p2align 4
> -L(test_vec):
>  # ifdef USE_AS_STRNCMP
> -       /* The first vector matched.  Return 0 if the maximum offset
> -          (%r11) <= VEC_SIZE.  */
> -       cmpq    $VEC_SIZE, %r11
> -       jbe     L(zero)
> +       .p2align 4,, 2
> +L(ret_zero_end):
> +       xorl    %eax, %eax
> +       VZEROUPPER_RETURN
>  # endif
> -       VPCMPEQ %ymm7, %ymm1, %ymm1
> -       vpmovmskb %ymm1, %ecx
> -       testl   %ecx, %ecx
> -       je      L(test_2_vec)
> -       tzcntl  %ecx, %edi
> +
> +
> +       /* The L(return_vec_N_end) differ from L(return_vec_N) in that
> +          they use the value of `r8` to negate the return value. This is
> +          because the page cross logic can swap `rdi` and `rsi`.  */
> +       .p2align 4,, 10
>  # ifdef USE_AS_STRNCMP
> -       addq    $VEC_SIZE, %rdi
> -       cmpq    %rdi, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +L(return_vec_1_end):
> +       salq    $32, %rcx
> +# endif
> +L(return_vec_0_end):
> +# ifndef USE_AS_STRNCMP
> +       tzcntl  %ecx, %ecx
> +# else
> +       tzcntq  %rcx, %rcx
> +       cmpq    %rcx, %rdx
> +       jbe     L(ret_zero_end)
> +# endif
> +
> +# ifdef USE_AS_WCSCMP
> +       movl    (%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (%rsi, %rdi), %ecx
> -       cmpl    (%rdx, %rdi), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rdi), %eax
> -       movzbl  (%rdx, %rdi), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    (%rsi, %rcx), %edx
> +       je      L(ret6)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
>  # else
> +       movzbl  (%rdi, %rcx), %eax
> +       movzbl  (%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
> +# endif
> +L(ret6):
> +       VZEROUPPER_RETURN
> +
> +# ifndef USE_AS_STRNCMP
> +       .p2align 4,, 10
> +L(return_vec_1_end):
> +       tzcntl  %ecx, %ecx
>  #  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +       movl    VEC_SIZE(%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    VEC_SIZE(%rsi, %rdi), %ecx
> -       cmpl    VEC_SIZE(%rdx, %rdi), %ecx
> -       jne     L(wcscmp_return)
> +       cmpl    VEC_SIZE(%rsi, %rcx), %edx
> +       je      L(ret7)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
>  #  else
> -       movzbl  VEC_SIZE(%rax, %rdi), %eax
> -       movzbl  VEC_SIZE(%rdx, %rdi), %edx
> -       subl    %edx, %eax
> +       movzbl  VEC_SIZE(%rdi, %rcx), %eax
> +       movzbl  VEC_SIZE(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  #  endif
> -# endif
> +L(ret7):
>         VZEROUPPER_RETURN
> +# endif
>
> -       .p2align 4
> -L(test_2_vec):
> +       .p2align 4,, 10
> +L(return_vec_2_end):
> +       tzcntl  %ecx, %ecx
>  # ifdef USE_AS_STRNCMP
> -       /* The first 2 vectors matched.  Return 0 if the maximum offset
> -          (%r11) <= 2 * VEC_SIZE.  */
> -       cmpq    $(VEC_SIZE * 2), %r11
> -       jbe     L(zero)
> +       cmpq    %rcx, %rdx
> +       jbe     L(ret_zero_page_cross)
>  # endif
> -       VPCMPEQ %ymm7, %ymm5, %ymm5
> -       vpmovmskb %ymm5, %ecx
> -       testl   %ecx, %ecx
> -       je      L(test_3_vec)
> -       tzcntl  %ecx, %edi
> -# ifdef USE_AS_STRNCMP
> -       addq    $(VEC_SIZE * 2), %rdi
> -       cmpq    %rdi, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +# ifdef USE_AS_WCSCMP
> +       movl    (VEC_SIZE * 2)(%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (%rsi, %rdi), %ecx
> -       cmpl    (%rdx, %rdi), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rdi), %eax
> -       movzbl  (%rdx, %rdi), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    (VEC_SIZE * 2)(%rsi, %rcx), %edx
> +       je      L(ret11)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
>  # else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (VEC_SIZE * 2)(%rsi, %rdi), %ecx
> -       cmpl    (VEC_SIZE * 2)(%rdx, %rdi), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (VEC_SIZE * 2)(%rax, %rdi), %eax
> -       movzbl  (VEC_SIZE * 2)(%rdx, %rdi), %edx
> -       subl    %edx, %eax
> -#  endif
> +       movzbl  (VEC_SIZE * 2)(%rdi, %rcx), %eax
> +       movzbl  (VEC_SIZE * 2)(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> +L(ret11):
>         VZEROUPPER_RETURN
>
> -       .p2align 4
> -L(test_3_vec):
> +
> +       /* Page cross in rsi in next 4x VEC.  */
> +
> +       /* TODO: Improve logic here.  */
> +       .p2align 4,, 10
> +L(page_cross_during_loop):
> +       /* eax contains [distance_from_page - (VEC_SIZE * 4)].  */
> +
> +       /* Optimistically rsi and rdi are both aligned, in which case we
> +          don't need any logic here.  */
> +       cmpl    $-(VEC_SIZE * 4), %eax
> +       /* Don't adjust eax before jumping back to the loop; this way we
> +          will never hit the page cross case again.  */
> +       je      L(loop_skip_page_cross_check)
> +
> +       /* Check if we can safely load a VEC.  */
> +       cmpl    $-(VEC_SIZE * 3), %eax
> +       jle     L(less_1x_vec_till_page_cross)
> +
> +       VMOVA   (%rdi), %ymm0
> +       VPCMPEQ (%rsi), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +       jnz     L(return_vec_0_end)
> +
> +       /* if distance >= 2x VEC then eax > -(VEC_SIZE * 2).  */
> +       cmpl    $-(VEC_SIZE * 2), %eax
> +       jg      L(more_2x_vec_till_page_cross)
> +
> +       .p2align 4,, 4
> +L(less_1x_vec_till_page_cross):
> +       subl    $-(VEC_SIZE * 4), %eax
> +       /* Guaranteed safe to read from rdi - VEC_SIZE here. The only
> +          concerning case is first iteration if incoming s1 was near start
> +          of a page and s2 near end. If s1 was near the start of the page
> +          we already aligned up to nearest VEC_SIZE * 4 so guaranteed safe
> +          to read back -VEC_SIZE. If rdi is truly at the start of a page
> +          here, it means the previous page (rdi - VEC_SIZE) has already
> +          been loaded earlier so must be valid.  */
> +       VMOVU   -VEC_SIZE(%rdi, %rax), %ymm0
> +       VPCMPEQ -VEC_SIZE(%rsi, %rax), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +
> +       /* Mask of potentially valid bits. The lower bits can be out of
> +          range comparisons (but safe regarding page crosses).  */
> +       movl    $-1, %r10d
> +       shlxl   %esi, %r10d, %r10d
> +       notl    %ecx
> +
>  # ifdef USE_AS_STRNCMP
> -       /* The first 3 vectors matched.  Return 0 if the maximum offset
> -          (%r11) <= 3 * VEC_SIZE.  */
> -       cmpq    $(VEC_SIZE * 3), %r11
> -       jbe     L(zero)
> -# endif
> -       VPCMPEQ %ymm7, %ymm6, %ymm6
> -       vpmovmskb %ymm6, %esi
> -       tzcntl  %esi, %ecx
> +       cmpq    %rax, %rdx
> +       jbe     L(return_page_cross_end_check)
> +# endif
> +       movl    %eax, %OFFSET_REG
> +       addl    $(PAGE_SIZE - VEC_SIZE * 4), %eax
> +
> +       andl    %r10d, %ecx
> +       jz      L(loop_skip_page_cross_check)
> +
> +       .p2align 4,, 3
> +L(return_page_cross_end):
> +       tzcntl  %ecx, %ecx
> +
>  # ifdef USE_AS_STRNCMP
> -       addq    $(VEC_SIZE * 3), %rcx
> -       cmpq    %rcx, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %esi
> -       cmpl    (%rdx, %rcx), %esi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       leal    -VEC_SIZE(%OFFSET_REG64, %rcx), %ecx
> +L(return_page_cross_cmp_mem):
>  # else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +       addl    %OFFSET_REG, %ecx
> +# endif
> +# ifdef USE_AS_WCSCMP
> +       movl    VEC_OFFSET(%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (VEC_SIZE * 3)(%rsi, %rcx), %esi
> -       cmpl    (VEC_SIZE * 3)(%rdx, %rcx), %esi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (VEC_SIZE * 3)(%rax, %rcx), %eax
> -       movzbl  (VEC_SIZE * 3)(%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    VEC_OFFSET(%rsi, %rcx), %edx
> +       je      L(ret8)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
> +# else
> +       movzbl  VEC_OFFSET(%rdi, %rcx), %eax
> +       movzbl  VEC_OFFSET(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> +L(ret8):
>         VZEROUPPER_RETURN
>
> -       .p2align 4
> -L(loop_cross_page):
> -       xorl    %r10d, %r10d
> -       movq    %rdx, %rcx
> -       /* Align load via RDX.  We load the extra ECX bytes which should
> -          be ignored.  */
> -       andl    $((VEC_SIZE * 4) - 1), %ecx
> -       /* R10 is -RCX.  */
> -       subq    %rcx, %r10
> -
> -       /* This works only if VEC_SIZE * 2 == 64. */
> -# if (VEC_SIZE * 2) != 64
> -#  error (VEC_SIZE * 2) != 64
> -# endif
> -
> -       /* Check if the first VEC_SIZE * 2 bytes should be ignored.  */
> -       cmpl    $(VEC_SIZE * 2), %ecx
> -       jge     L(loop_cross_page_2_vec)
> -
> -       vmovdqu (%rax, %r10), %ymm2
> -       vmovdqu VEC_SIZE(%rax, %r10), %ymm3
> -       VPCMPEQ (%rdx, %r10), %ymm2, %ymm0
> -       VPCMPEQ VEC_SIZE(%rdx, %r10), %ymm3, %ymm1
> -       VPMINU  %ymm2, %ymm0, %ymm0
> -       VPMINU  %ymm3, %ymm1, %ymm1
> -       VPCMPEQ %ymm7, %ymm0, %ymm0
> -       VPCMPEQ %ymm7, %ymm1, %ymm1
> -
> -       vpmovmskb %ymm0, %edi
> -       vpmovmskb %ymm1, %esi
> -
> -       salq    $32, %rsi
> -       xorq    %rsi, %rdi
> -
> -       /* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
> -       shrq    %cl, %rdi
> -
> -       testq   %rdi, %rdi
> -       je      L(loop_cross_page_2_vec)
> -       tzcntq  %rdi, %rcx
>  # ifdef USE_AS_STRNCMP
> -       cmpq    %rcx, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +       .p2align 4,, 10
> +L(return_page_cross_end_check):
> +       tzcntl  %ecx, %ecx
> +       leal    -VEC_SIZE(%rax, %rcx), %ecx
> +       cmpl    %ecx, %edx
> +       ja      L(return_page_cross_cmp_mem)
>         xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> -# else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> -# endif
>         VZEROUPPER_RETURN
> +# endif
>
> -       .p2align 4
> -L(loop_cross_page_2_vec):
> -       /* The first VEC_SIZE * 2 bytes match or are ignored.  */
> -       vmovdqu (VEC_SIZE * 2)(%rax, %r10), %ymm2
> -       vmovdqu (VEC_SIZE * 3)(%rax, %r10), %ymm3
> -       VPCMPEQ (VEC_SIZE * 2)(%rdx, %r10), %ymm2, %ymm5
> -       VPMINU  %ymm2, %ymm5, %ymm5
> -       VPCMPEQ (VEC_SIZE * 3)(%rdx, %r10), %ymm3, %ymm6
> -       VPCMPEQ %ymm7, %ymm5, %ymm5
> -       VPMINU  %ymm3, %ymm6, %ymm6
> -       VPCMPEQ %ymm7, %ymm6, %ymm6
> -
> -       vpmovmskb %ymm5, %edi
> -       vpmovmskb %ymm6, %esi
> -
> -       salq    $32, %rsi
> -       xorq    %rsi, %rdi
>
> -       xorl    %r8d, %r8d
> -       /* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
> -       subl    $(VEC_SIZE * 2), %ecx
> -       jle     1f
> -       /* Skip ECX bytes.  */
> -       shrq    %cl, %rdi
> -       /* R8 has number of bytes skipped.  */
> -       movl    %ecx, %r8d
> -1:
> -       /* Before jumping back to the loop, set ESI to the number of
> -          VEC_SIZE * 4 blocks before page crossing.  */
> -       movl    $(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
> -
> -       testq   %rdi, %rdi
> +       .p2align 4,, 10
> +L(more_2x_vec_till_page_cross):
> +       /* If more than 2x VEC till page cross we will complete a full
> +          loop iteration here.  */
> +
> +       VMOVU   VEC_SIZE(%rdi), %ymm0
> +       VPCMPEQ VEC_SIZE(%rsi), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +       jnz     L(return_vec_1_end)
> +
>  # ifdef USE_AS_STRNCMP
> -       /* At this point, if %rdi value is 0, it already tested
> -          VEC_SIZE*4+%r10 byte starting from %rax. This label
> -          checks whether strncmp maximum offset reached or not.  */
> -       je      L(string_nbyte_offset_check)
> -# else
> -       je      L(back_to_loop)
> +       cmpq    $(VEC_SIZE * 2), %rdx
> +       jbe     L(ret_zero_in_loop_page_cross)
>  # endif
> -       tzcntq  %rdi, %rcx
> -       addq    %r10, %rcx
> -       /* Adjust for number of bytes skipped.  */
> -       addq    %r8, %rcx
> +
> +       subl    $-(VEC_SIZE * 4), %eax
> +
> +       /* Safe to include comparisons from lower bytes.  */
> +       VMOVU   -(VEC_SIZE * 2)(%rdi, %rax), %ymm0
> +       VPCMPEQ -(VEC_SIZE * 2)(%rsi, %rax), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +       jnz     L(return_vec_page_cross_0)
> +
> +       VMOVU   -(VEC_SIZE * 1)(%rdi, %rax), %ymm0
> +       VPCMPEQ -(VEC_SIZE * 1)(%rsi, %rax), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +       jnz     L(return_vec_page_cross_1)
> +
>  # ifdef USE_AS_STRNCMP
> -       addq    $(VEC_SIZE * 2), %rcx
> -       subq    %rcx, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +       /* Must check length here as length might preclude reading next
> +          page.  */
> +       cmpq    %rax, %rdx
> +       jbe     L(ret_zero_in_loop_page_cross)
> +# endif
> +
> +       /* Finish the loop.  */
> +       VMOVA   (VEC_SIZE * 2)(%rdi), %ymm4
> +       VMOVA   (VEC_SIZE * 3)(%rdi), %ymm6
> +
> +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
> +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
> +       vpand   %ymm4, %ymm5, %ymm5
> +       vpand   %ymm6, %ymm7, %ymm7
> +       VPMINU  %ymm5, %ymm7, %ymm7
> +       VPCMPEQ %ymm7, %ymmZERO, %ymm7
> +       vpmovmskb %ymm7, %LOOP_REG
> +       testl   %LOOP_REG, %LOOP_REG
> +       jnz     L(return_vec_2_3_end)
> +
> +       /* Best for code size to include an unconditional jmp here. If this
> +          case is hot it would be faster to duplicate the
> +          L(return_vec_2_3_end) code as fall-through and jump back to the
> +          loop on mismatch comparison.  */
> +       subq    $-(VEC_SIZE * 4), %rdi
> +       subq    $-(VEC_SIZE * 4), %rsi
> +       addl    $(PAGE_SIZE - VEC_SIZE * 8), %eax
> +# ifdef USE_AS_STRNCMP
> +       subq    $(VEC_SIZE * 4), %rdx
> +       ja      L(loop_skip_page_cross_check)
> +L(ret_zero_in_loop_page_cross):
>         xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       VZEROUPPER_RETURN
>  # else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (VEC_SIZE * 2)(%rsi, %rcx), %edi
> -       cmpl    (VEC_SIZE * 2)(%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (VEC_SIZE * 2)(%rax, %rcx), %eax
> -       movzbl  (VEC_SIZE * 2)(%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       jmp     L(loop_skip_page_cross_check)
>  # endif
> -       VZEROUPPER_RETURN
>
> +
> +       .p2align 4,, 10
> +L(return_vec_page_cross_0):
> +       addl    $-VEC_SIZE, %eax
> +L(return_vec_page_cross_1):
> +       tzcntl  %ecx, %ecx
>  # ifdef USE_AS_STRNCMP
> -L(string_nbyte_offset_check):
> -       leaq    (VEC_SIZE * 4)(%r10), %r10
> -       cmpq    %r10, %r11
> -       jbe     L(zero)
> -       jmp     L(back_to_loop)
> +       leal    -VEC_SIZE(%rax, %rcx), %ecx
> +       cmpq    %rcx, %rdx
> +       jbe     L(ret_zero_in_loop_page_cross)
> +# else
> +       addl    %eax, %ecx
>  # endif
>
> -       .p2align 4
> -L(cross_page_loop):
> -       /* Check one byte/dword at a time.  */
>  # ifdef USE_AS_WCSCMP
> -       cmpl    %ecx, %eax
> +       movl    VEC_OFFSET(%rdi, %rcx), %edx
> +       xorl    %eax, %eax
> +       cmpl    VEC_OFFSET(%rsi, %rcx), %edx
> +       je      L(ret9)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
>  # else
> +       movzbl  VEC_OFFSET(%rdi, %rcx), %eax
> +       movzbl  VEC_OFFSET(%rsi, %rcx), %ecx
>         subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> -       jne     L(different)
> -       addl    $SIZE_OF_CHAR, %edx
> -       cmpl    $(VEC_SIZE * 4), %edx
> -       je      L(main_loop_header)
> -# ifdef USE_AS_STRNCMP
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +L(ret9):
> +       VZEROUPPER_RETURN
> +
> +
> +       .p2align 4,, 10
> +L(page_cross):
> +# ifndef USE_AS_STRNCMP
> +       /* If both are VEC aligned we don't need any special logic here.
> +          Only valid for strcmp where stop condition is guaranteed to be
> +          reachable by just reading memory.  */
> +       testl   $((VEC_SIZE - 1) << 20), %eax
> +       jz      L(no_page_cross)
>  # endif
> +
> +       movl    %edi, %eax
> +       movl    %esi, %ecx
> +       andl    $(PAGE_SIZE - 1), %eax
> +       andl    $(PAGE_SIZE - 1), %ecx
> +
> +       xorl    %OFFSET_REG, %OFFSET_REG
> +
> +       /* Check which is closer to page cross, s1 or s2.  */
> +       cmpl    %eax, %ecx
> +       jg      L(page_cross_s2)
> +
> +       /* The previous page cross check has false positives. Check for
> +          true positive as page cross logic is very expensive.  */
> +       subl    $(PAGE_SIZE - VEC_SIZE * 4), %eax
> +       jbe     L(no_page_cross)
> +
> +       /* Set r8 to not interfere with normal return value (rdi and rsi
> +          did not swap).  */
>  # ifdef USE_AS_WCSCMP
> -       movl    (%rdi, %rdx), %eax
> -       movl    (%rsi, %rdx), %ecx
> +       /* Any non-zero positive value that doesn't interfere with 0x1.
> +        */
> +       movl    $2, %r8d
>  # else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %ecx
> +       xorl    %r8d, %r8d
>  # endif
> -       /* Check null char.  */
> -       testl   %eax, %eax
> -       jne     L(cross_page_loop)
> -       /* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
> -          comparisons.  */
> -       subl    %ecx, %eax
> -# ifndef USE_AS_WCSCMP
> -L(different):
> +
> +       /* Check if less than 1x VEC till page cross.  */
> +       subl    $(VEC_SIZE * 3), %eax
> +       jg      L(less_1x_vec_till_page)
> +
> +       /* If more than 1x VEC till page cross, loop through safely
> +          loadable memory until within 1x VEC of page cross.  */
> +
> +       .p2align 4,, 10
> +L(page_cross_loop):
> +
> +       VMOVU   (%rdi, %OFFSET_REG64), %ymm0
> +       VPCMPEQ (%rsi, %OFFSET_REG64), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +
> +       jnz     L(check_ret_vec_page_cross)
> +       addl    $VEC_SIZE, %OFFSET_REG
> +# ifdef USE_AS_STRNCMP
> +       cmpq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross)
>  # endif
> -       VZEROUPPER_RETURN
> +       addl    $VEC_SIZE, %eax
> +       jl      L(page_cross_loop)
> +
> +       subl    %eax, %OFFSET_REG
> +       /* OFFSET_REG has distance to page cross - VEC_SIZE. Guaranteed
> +          to not cross page so is safe to load. Since we have already
> +          loaded at least 1 VEC from rsi it is also guaranteed to be safe.
> +        */
> +
> +       VMOVU   (%rdi, %OFFSET_REG64), %ymm0
> +       VPCMPEQ (%rsi, %OFFSET_REG64), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +
> +# ifdef USE_AS_STRNCMP
> +       leal    VEC_SIZE(%OFFSET_REG64), %eax
> +       cmpq    %rax, %rdx
> +       jbe     L(check_ret_vec_page_cross2)
> +       addq    %rdi, %rdx
> +# endif
> +       incl    %ecx
> +       jz      L(prepare_loop_no_len)
>
> +       .p2align 4,, 4
> +L(ret_vec_page_cross):
> +# ifndef USE_AS_STRNCMP
> +L(check_ret_vec_page_cross):
> +# endif
> +       tzcntl  %ecx, %ecx
> +       addl    %OFFSET_REG, %ecx
> +L(ret_vec_page_cross_cont):
>  # ifdef USE_AS_WCSCMP
> -       .p2align 4
> -L(different):
> -       /* Use movl to avoid modifying EFLAGS.  */
> -       movl    $0, %eax
> +       movl    (%rdi, %rcx), %edx
> +       xorl    %eax, %eax
> +       cmpl    (%rsi, %rcx), %edx
> +       je      L(ret12)
>         setl    %al
>         negl    %eax
> -       orl     $1, %eax
> -       VZEROUPPER_RETURN
> +       xorl    %r8d, %eax
> +# else
> +       movzbl  (%rdi, %rcx), %eax
> +       movzbl  (%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> +L(ret12):
> +       VZEROUPPER_RETURN
>
>  # ifdef USE_AS_STRNCMP
> -       .p2align 4
> -L(zero):
> +       .p2align 4,, 10
> +L(check_ret_vec_page_cross2):
> +       incl    %ecx
> +L(check_ret_vec_page_cross):
> +       tzcntl  %ecx, %ecx
> +       addl    %OFFSET_REG, %ecx
> +       cmpq    %rcx, %rdx
> +       ja      L(ret_vec_page_cross_cont)
> +       .p2align 4,, 2
> +L(ret_zero_page_cross):
>         xorl    %eax, %eax
>         VZEROUPPER_RETURN
> +# endif
>
> -       .p2align 4
> -L(char0):
> -#  ifdef USE_AS_WCSCMP
> -       xorl    %eax, %eax
> -       movl    (%rdi), %ecx
> -       cmpl    (%rsi), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rsi), %ecx
> -       movzbl  (%rdi), %eax
> -       subl    %ecx, %eax
> -#  endif
> -       VZEROUPPER_RETURN
> +       .p2align 4,, 4
> +L(page_cross_s2):
> +       /* Ensure this is a true page cross.  */
> +       subl    $(PAGE_SIZE - VEC_SIZE * 4), %ecx
> +       jbe     L(no_page_cross)
> +
> +
> +       movl    %ecx, %eax
> +       movq    %rdi, %rcx
> +       movq    %rsi, %rdi
> +       movq    %rcx, %rsi
> +
> +       /* set r8 to negate return value as rdi and rsi swapped.  */
> +# ifdef USE_AS_WCSCMP
> +       movl    $-4, %r8d
> +# else
> +       movl    $-1, %r8d
>  # endif
> +       xorl    %OFFSET_REG, %OFFSET_REG
>
> -       .p2align 4
> -L(last_vector):
> -       addq    %rdx, %rdi
> -       addq    %rdx, %rsi
> +       /* Check if more than 1x VEC till page cross.  */
> +       subl    $(VEC_SIZE * 3), %eax
> +       jle     L(page_cross_loop)
> +
> +       .p2align 4,, 6
> +L(less_1x_vec_till_page):
> +       /* Find largest load size we can use.  */
> +       cmpl    $16, %eax
> +       ja      L(less_16_till_page)
> +
> +       VMOVU   (%rdi), %xmm0
> +       VPCMPEQ (%rsi), %xmm0, %xmm1
> +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> +       vpandn  %xmm1, %xmm2, %xmm1
> +       vpmovmskb %ymm1, %ecx
> +       incw    %cx
> +       jnz     L(check_ret_vec_page_cross)
> +       movl    $16, %OFFSET_REG
>  # ifdef USE_AS_STRNCMP
> -       subq    %rdx, %r11
> +       cmpq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case0)
> +       subl    %eax, %OFFSET_REG
> +# else
> +       /* Explicit check for 16 byte alignment.  */
> +       subl    %eax, %OFFSET_REG
> +       jz      L(prepare_loop)
>  # endif
> -       tzcntl  %ecx, %edx
> +
> +       VMOVU   (%rdi, %OFFSET_REG64), %xmm0
> +       VPCMPEQ (%rsi, %OFFSET_REG64), %xmm0, %xmm1
> +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> +       vpandn  %xmm1, %xmm2, %xmm1
> +       vpmovmskb %ymm1, %ecx
> +       incw    %cx
> +       jnz     L(check_ret_vec_page_cross)
> +
>  # ifdef USE_AS_STRNCMP
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       addl    $16, %OFFSET_REG
> +       subq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case0)
> +       subq    $-(VEC_SIZE * 4), %rdx
> +
> +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> +# else
> +       leaq    (16 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> +       leaq    (16 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
>  # endif
> -# ifdef USE_AS_WCSCMP
> +       jmp     L(prepare_loop_aligned)
> +
> +# ifdef USE_AS_STRNCMP
> +       .p2align 4,, 2
> +L(ret_zero_page_cross_slow_case0):
>         xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> -# else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> +       ret
>  # endif
> -       VZEROUPPER_RETURN
>
> -       /* Comparing on page boundary region requires special treatment:
> -          It must done one vector at the time, starting with the wider
> -          ymm vector if possible, if not, with xmm. If fetching 16 bytes
> -          (xmm) still passes the boundary, byte comparison must be done.
> -        */
> -       .p2align 4
> -L(cross_page):
> -       /* Try one ymm vector at a time.  */
> -       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> -       jg      L(cross_page_1_vector)
> -L(loop_1_vector):
> -       vmovdqu (%rdi, %rdx), %ymm1
> -       VPCMPEQ (%rsi, %rdx), %ymm1, %ymm0
> -       VPMINU  %ymm1, %ymm0, %ymm0
> -       VPCMPEQ %ymm7, %ymm0, %ymm0
> -       vpmovmskb %ymm0, %ecx
> -       testl   %ecx, %ecx
> -       jne     L(last_vector)
>
> -       addl    $VEC_SIZE, %edx
> +       .p2align 4,, 10
> +L(less_16_till_page):
> +       /* Find largest load size we can use.  */
> +       cmpl    $24, %eax
> +       ja      L(less_8_till_page)
>
> -       addl    $VEC_SIZE, %eax
> -# ifdef USE_AS_STRNCMP
> -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> -          (%r11).  */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -# endif
> -       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> -       jle     L(loop_1_vector)
> -L(cross_page_1_vector):
> -       /* Less than 32 bytes to check, try one xmm vector.  */
> -       cmpl    $(PAGE_SIZE - 16), %eax
> -       jg      L(cross_page_1_xmm)
> -       vmovdqu (%rdi, %rdx), %xmm1
> -       VPCMPEQ (%rsi, %rdx), %xmm1, %xmm0
> -       VPMINU  %xmm1, %xmm0, %xmm0
> -       VPCMPEQ %xmm7, %xmm0, %xmm0
> -       vpmovmskb %xmm0, %ecx
> -       testl   %ecx, %ecx
> -       jne     L(last_vector)
> +       vmovq   (%rdi), %xmm0
> +       vmovq   (%rsi), %xmm1
> +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> +       VPCMPEQ %xmm1, %xmm0, %xmm1
> +       vpandn  %xmm1, %xmm2, %xmm1
> +       vpmovmskb %ymm1, %ecx
> +       incb    %cl
> +       jnz     L(check_ret_vec_page_cross)
>
> -       addl    $16, %edx
> -# ifndef USE_AS_WCSCMP
> -       addl    $16, %eax
> +
> +# ifdef USE_AS_STRNCMP
> +       cmpq    $8, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case0)
>  # endif
> +       movl    $24, %OFFSET_REG
> +       /* Explicit check for 16 byte alignment.  */
> +       subl    %eax, %OFFSET_REG
> +
> +
> +
> +       vmovq   (%rdi, %OFFSET_REG64), %xmm0
> +       vmovq   (%rsi, %OFFSET_REG64), %xmm1
> +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> +       VPCMPEQ %xmm1, %xmm0, %xmm1
> +       vpandn  %xmm1, %xmm2, %xmm1
> +       vpmovmskb %ymm1, %ecx
> +       incb    %cl
> +       jnz     L(check_ret_vec_page_cross)
> +
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> -          (%r11).  */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -# endif
> -
> -L(cross_page_1_xmm):
> -# ifndef USE_AS_WCSCMP
> -       /* Less than 16 bytes to check, try 8 byte vector.  NB: No need
> -          for wcscmp nor wcsncmp since wide char is 4 bytes.   */
> -       cmpl    $(PAGE_SIZE - 8), %eax
> -       jg      L(cross_page_8bytes)
> -       vmovq   (%rdi, %rdx), %xmm1
> -       vmovq   (%rsi, %rdx), %xmm0
> -       VPCMPEQ %xmm0, %xmm1, %xmm0
> -       VPMINU  %xmm1, %xmm0, %xmm0
> -       VPCMPEQ %xmm7, %xmm0, %xmm0
> -       vpmovmskb %xmm0, %ecx
> -       /* Only last 8 bits are valid.  */
> -       andl    $0xff, %ecx
> -       testl   %ecx, %ecx
> -       jne     L(last_vector)
> +       addl    $8, %OFFSET_REG
> +       subq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case0)
> +       subq    $-(VEC_SIZE * 4), %rdx
>
> -       addl    $8, %edx
> -       addl    $8, %eax
> +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> +# else
> +       leaq    (8 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> +       leaq    (8 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> +# endif
> +       jmp     L(prepare_loop_aligned)
> +
> +
> +       .p2align 4,, 10
> +L(less_8_till_page):
> +# ifdef USE_AS_WCSCMP
> +       /* If using wchar then this is the only check before we reach
> +          the page boundary.  */
> +       movl    (%rdi), %eax
> +       movl    (%rsi), %ecx
> +       cmpl    %ecx, %eax
> +       jnz     L(ret_less_8_wcs)
>  #  ifdef USE_AS_STRNCMP
> -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> -          (%r11).  */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       addq    %rdi, %rdx
> +       /* We already checked for len <= 1 so cannot hit that case here.
> +        */
>  #  endif
> +       testl   %eax, %eax
> +       jnz     L(prepare_loop_no_len)
> +       ret
>
> -L(cross_page_8bytes):
> -       /* Less than 8 bytes to check, try 4 byte vector.  */
> -       cmpl    $(PAGE_SIZE - 4), %eax
> -       jg      L(cross_page_4bytes)
> -       vmovd   (%rdi, %rdx), %xmm1
> -       vmovd   (%rsi, %rdx), %xmm0
> -       VPCMPEQ %xmm0, %xmm1, %xmm0
> -       VPMINU  %xmm1, %xmm0, %xmm0
> -       VPCMPEQ %xmm7, %xmm0, %xmm0
> -       vpmovmskb %xmm0, %ecx
> -       /* Only last 4 bits are valid.  */
> -       andl    $0xf, %ecx
> -       testl   %ecx, %ecx
> -       jne     L(last_vector)
> +       .p2align 4,, 8
> +L(ret_less_8_wcs):
> +       setl    %OFFSET_REG8
> +       negl    %OFFSET_REG
> +       movl    %OFFSET_REG, %eax
> +       xorl    %r8d, %eax
> +       ret
> +
> +# else
> +
> +       /* Find largest load size we can use.  */
> +       cmpl    $28, %eax
> +       ja      L(less_4_till_page)
> +
> +       vmovd   (%rdi), %xmm0
> +       vmovd   (%rsi), %xmm1
> +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> +       VPCMPEQ %xmm1, %xmm0, %xmm1
> +       vpandn  %xmm1, %xmm2, %xmm1
> +       vpmovmskb %ymm1, %ecx
> +       subl    $0xf, %ecx
> +       jnz     L(check_ret_vec_page_cross)
>
> -       addl    $4, %edx
>  #  ifdef USE_AS_STRNCMP
> -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> -          (%r11).  */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       cmpq    $4, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case1)
>  #  endif
> +       movl    $28, %OFFSET_REG
> +       /* Explicit check for 16 byte alignment.  */
> +       subl    %eax, %OFFSET_REG
>
> -L(cross_page_4bytes):
> -# endif
> -       /* Less than 4 bytes to check, try one byte/dword at a time.  */
> -# ifdef USE_AS_STRNCMP
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -# endif
> -# ifdef USE_AS_WCSCMP
> -       movl    (%rdi, %rdx), %eax
> -       movl    (%rsi, %rdx), %ecx
> -# else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %ecx
> -# endif
> -       testl   %eax, %eax
> -       jne     L(cross_page_loop)
> +
> +
> +       vmovd   (%rdi, %OFFSET_REG64), %xmm0
> +       vmovd   (%rsi, %OFFSET_REG64), %xmm1
> +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> +       VPCMPEQ %xmm1, %xmm0, %xmm1
> +       vpandn  %xmm1, %xmm2, %xmm1
> +       vpmovmskb %ymm1, %ecx
> +       subl    $0xf, %ecx
> +       jnz     L(check_ret_vec_page_cross)
> +
> +#  ifdef USE_AS_STRNCMP
> +       addl    $4, %OFFSET_REG
> +       subq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case1)
> +       subq    $-(VEC_SIZE * 4), %rdx
> +
> +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> +#  else
> +       leaq    (4 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> +       leaq    (4 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> +#  endif
> +       jmp     L(prepare_loop_aligned)
> +
> +#  ifdef USE_AS_STRNCMP
> +       .p2align 4,, 2
> +L(ret_zero_page_cross_slow_case1):
> +       xorl    %eax, %eax
> +       ret
> +#  endif
> +
> +       .p2align 4,, 10
> +L(less_4_till_page):
> +       subq    %rdi, %rsi
> +       /* Extremely slow byte comparison loop.  */
> +L(less_4_loop):
> +       movzbl  (%rdi), %eax
> +       movzbl  (%rsi, %rdi), %ecx
>         subl    %ecx, %eax
> -       VZEROUPPER_RETURN
> -END (STRCMP)
> +       jnz     L(ret_less_4_loop)
> +       testl   %ecx, %ecx
> +       jz      L(ret_zero_4_loop)
> +#  ifdef USE_AS_STRNCMP
> +       decq    %rdx
> +       jz      L(ret_zero_4_loop)
> +#  endif
> +       incq    %rdi
> +       /* End condition is reaching the page boundary (rdi is aligned).  */
> +       testl   $31, %edi
> +       jnz     L(less_4_loop)
> +       leaq    -(VEC_SIZE * 4)(%rdi, %rsi), %rsi
> +       addq    $-(VEC_SIZE * 4), %rdi
> +#  ifdef USE_AS_STRNCMP
> +       subq    $-(VEC_SIZE * 4), %rdx
> +#  endif
> +       jmp     L(prepare_loop_aligned)
> +
> +L(ret_zero_4_loop):
> +       xorl    %eax, %eax
> +       ret
> +L(ret_less_4_loop):
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
> +       ret
> +# endif
> +END(STRCMP)
>  #endif
> --
> 2.25.1
>

[-- Attachment #2: strcmp-evex.pdf --]
[-- Type: application/pdf, Size: 572606 bytes --]

[-- Attachment #3: strcmp-avx2.pdf --]
[-- Type: application/pdf, Size: 567440 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755]
  2022-01-09 12:35 ` [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755] Noah Goldstein
@ 2022-01-09 14:07   ` H.J. Lu
  2022-01-10  0:29     ` Noah Goldstein
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2022-01-09 14:07 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Sun, Jan 9, 2022 at 4:35 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Sun, Jan 9, 2022 at 6:30 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
> > __wcscmp_avx2. For x86_64 this covers the entire address range so any
> > length larger could not possibly be used to bound `s1` or `s2`.

Please first submit a separate single patch to fix wcsncmp_avx2 and
wcsncmp_evex for BZ# 28755

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
  2022-01-09 12:29 [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755] Noah Goldstein
                   ` (4 preceding siblings ...)
  2022-01-09 12:35 ` [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755] Noah Goldstein
@ 2022-01-10  0:27 ` Noah Goldstein
  2022-01-10  0:27   ` [PATCH v2 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S " Noah Goldstein
                     ` (6 more replies)
  2022-01-10 21:35 ` [PATCH v3 " Noah Goldstein
  6 siblings, 7 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10  0:27 UTC (permalink / raw)
  To: libc-alpha

Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_avx2. For x86_64 this covers the entire address range so any
length larger could not possibly be used to bound `s1` or `s2`.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
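
For reference, the check the patch adds boils down to the following (a
minimal C sketch, not the actual assembly; wcsncmp_sketch is a made-up
wrapper name used only for illustration):

#include <stddef.h>
#include <wchar.h>

/* n counts wide chars, so the implementation internally needs
   n * sizeof (wchar_t) bytes.  Any n with bits set in its top 8 bits
   (n >= 2^56 wide chars, i.e. >= 2^58 bytes) is larger than the x86_64
   virtual address space, so it can never bound a valid buffer and the
   comparison may be treated as unbounded.  */
int
wcsncmp_sketch (const wchar_t *s1, const wchar_t *s2, size_t n)
{
  if (n >> 56)
    return wcscmp (s1, s2);	/* Length cannot bound valid memory.  */
  return wcsncmp (s1, s2, n);	/* Normal length-bounded comparison.  */
}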

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 sysdeps/x86_64/multiarch/strcmp-avx2.S | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
index a45f9d2749..9c73b5899d 100644
--- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
+++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
@@ -87,6 +87,16 @@ ENTRY (STRCMP)
 	je	L(char0)
 	jb	L(zero)
 #  ifdef USE_AS_WCSCMP
+#  ifndef __ILP32__
+	movq	%rdx, %rcx
+	/* Check if length could overflow when multiplied by
+	   sizeof(wchar_t). Checking top 8 bits will cover all potential
+	   overflow cases as well as redirect cases where it is impossible for
+	   the length to bound a valid memory region. In these cases just use
+	   'wcscmp'.  */
+	shrq	$56, %rcx
+	jnz	__wcscmp_avx2
+#  endif
 	/* Convert units: from wide to byte char.  */
 	shl	$2, %RDX_LP
 #  endif
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S [BZ# 28755]
  2022-01-10  0:27 ` [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S " Noah Goldstein
@ 2022-01-10  0:27   ` Noah Goldstein
  2022-01-10  0:35     ` H.J. Lu
  2022-01-10  0:27   ` [PATCH v2 3/7] string/test-str*cmp: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp] Noah Goldstein
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10  0:27 UTC (permalink / raw)
  To: libc-alpha

Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_evex. For x86_64 this covers the entire address range so any
length larger could not possibly be used to bound `s1` or `s2`.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 sysdeps/x86_64/multiarch/strcmp-evex.S | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
index 1d971f3889..0cd939d5af 100644
--- a/sysdeps/x86_64/multiarch/strcmp-evex.S
+++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
@@ -104,6 +104,16 @@ ENTRY (STRCMP)
 	je	L(char0)
 	jb	L(zero)
 #  ifdef USE_AS_WCSCMP
+#  ifndef __ILP32__
+	movq	%rdx, %rcx
+	/* Check if length could overflow when multiplied by
+	   sizeof(wchar_t). Checking top 8 bits will cover all potential
+	   overflow cases as well as redirect cases where it is impossible for
+	   the length to bound a valid memory region. In these cases just use
+	   'wcscmp'.  */
+	shrq	$56, %rcx
+	jnz	__wcscmp_evex
+#  endif
 	/* Convert units: from wide to byte char.  */
 	shl	$2, %RDX_LP
 #  endif
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 3/7] string/test-str*cmp: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp].
  2022-01-10  0:27 ` [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S " Noah Goldstein
  2022-01-10  0:27   ` [PATCH v2 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S " Noah Goldstein
@ 2022-01-10  0:27   ` Noah Goldstein
  2022-01-10  0:37     ` H.J. Lu
  2022-01-10  0:27   ` [PATCH v2 4/7] string: Improve coverage in test-strcmp.c and test-strncmp.c Noah Goldstein
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10  0:27 UTC (permalink / raw)
  To: libc-alpha

These implementations are incorrect: there may be a mismatch in s1/s2
before the first byte of invalid memory even though no null CHAR / length
boundary has been reached, so computing the string lengths up front can
read past the mismatch into invalid memory.
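
A standalone way to see the failure mode (a hypothetical demonstration
using mmap/mprotect, not code from the test suite): put a single
mismatching, non-null byte at the very end of a mapping so the next byte
is inaccessible.  strncmp must stop at the mismatch, but a length-first
implementation walks past it.

#define _DEFAULT_SOURCE		/* For MAP_ANONYMOUS.  */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main (void)
{
  size_t page = (size_t) sysconf (_SC_PAGESIZE);
  /* Two pages; the second becomes an inaccessible guard page.  */
  char *buf = mmap (NULL, 2 * page, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (buf == MAP_FAILED)
    return 1;
  mprotect (buf + page, page, PROT_NONE);

  char *s1 = buf + page - 1;	/* Last mapped byte, no null terminator.  */
  *s1 = 'a';
  const char *s2 = "b";

  /* Valid call: the mismatch at index 0 bounds the comparison, so no
     byte past the end of the mapping is ever needed.  */
  printf ("%d\n", strncmp (s1, s2, (size_t) -1) < 0);

  /* A stupid_strncmp-style implementation would first do
     strnlen (s1, n), read the unmapped byte at s1[1] and crash.  */
  return 0;
}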

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 string/test-strcmp.c  | 35 -----------------------------------
 string/test-strncmp.c | 34 ----------------------------------
 2 files changed, 69 deletions(-)

diff --git a/string/test-strcmp.c b/string/test-strcmp.c
index 3c75076fb8..97d7bf5043 100644
--- a/string/test-strcmp.c
+++ b/string/test-strcmp.c
@@ -34,7 +34,6 @@
 # define STRLEN wcslen
 # define MEMCPY wmemcpy
 # define SIMPLE_STRCMP simple_wcscmp
-# define STUPID_STRCMP stupid_wcscmp
 # define CHAR wchar_t
 # define UCHAR wchar_t
 # define CHARBYTES 4
@@ -64,25 +63,6 @@ simple_wcscmp (const wchar_t *s1, const wchar_t *s2)
   return c1 < c2 ? -1 : 1;
 }
 
-int
-stupid_wcscmp (const wchar_t *s1, const wchar_t *s2)
-{
-  size_t ns1 = wcslen (s1) + 1;
-  size_t ns2 = wcslen (s2) + 1;
-  size_t n = ns1 < ns2 ? ns1 : ns2;
-  int ret = 0;
-
-  wchar_t c1, c2;
-
-  while (n--) {
-    c1 = *s1++;
-    c2 = *s2++;
-    if ((ret = c1 < c2 ? -1 : c1 == c2 ? 0 : 1) != 0)
-      break;
-  }
-  return ret;
-}
-
 #else
 # include <limits.h>
 
@@ -92,7 +72,6 @@ stupid_wcscmp (const wchar_t *s1, const wchar_t *s2)
 # define STRLEN strlen
 # define MEMCPY memcpy
 # define SIMPLE_STRCMP simple_strcmp
-# define STUPID_STRCMP stupid_strcmp
 # define CHAR char
 # define UCHAR unsigned char
 # define CHARBYTES 1
@@ -113,24 +92,10 @@ simple_strcmp (const char *s1, const char *s2)
   return ret;
 }
 
-int
-stupid_strcmp (const char *s1, const char *s2)
-{
-  size_t ns1 = strlen (s1) + 1;
-  size_t ns2 = strlen (s2) + 1;
-  size_t n = ns1 < ns2 ? ns1 : ns2;
-  int ret = 0;
-
-  while (n--)
-    if ((ret = *(unsigned char *) s1++ - *(unsigned char *) s2++) != 0)
-      break;
-  return ret;
-}
 #endif
 
 typedef int (*proto_t) (const CHAR *, const CHAR *);
 
-IMPL (STUPID_STRCMP, 1)
 IMPL (SIMPLE_STRCMP, 1)
 IMPL (STRCMP, 1)
 
diff --git a/string/test-strncmp.c b/string/test-strncmp.c
index e7d5edea39..61a283a0af 100644
--- a/string/test-strncmp.c
+++ b/string/test-strncmp.c
@@ -33,7 +33,6 @@
 # define STRDUP wcsdup
 # define MEMCPY wmemcpy
 # define SIMPLE_STRNCMP simple_wcsncmp
-# define STUPID_STRNCMP stupid_wcsncmp
 # define CHAR wchar_t
 # define UCHAR wchar_t
 # define CHARBYTES 4
@@ -57,25 +56,6 @@ simple_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n)
   return 0;
 }
 
-int
-stupid_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n)
-{
-  wchar_t c1, c2;
-  size_t ns1 = wcsnlen (s1, n) + 1, ns2 = wcsnlen (s2, n) + 1;
-
-  n = ns1 < n ? ns1 : n;
-  n = ns2 < n ? ns2 : n;
-
-  while (n--)
-    {
-      c1 = *s1++;
-      c2 = *s2++;
-      if (c1 != c2)
-	return c1 > c2 ? 1 : -1;
-    }
-  return 0;
-}
-
 #else
 # define L(str) str
 # define STRNCMP strncmp
@@ -83,7 +63,6 @@ stupid_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n)
 # define STRDUP strdup
 # define MEMCPY memcpy
 # define SIMPLE_STRNCMP simple_strncmp
-# define STUPID_STRNCMP stupid_strncmp
 # define CHAR char
 # define UCHAR unsigned char
 # define CHARBYTES 1
@@ -101,23 +80,10 @@ simple_strncmp (const char *s1, const char *s2, size_t n)
   return ret;
 }
 
-int
-stupid_strncmp (const char *s1, const char *s2, size_t n)
-{
-  size_t ns1 = strnlen (s1, n) + 1, ns2 = strnlen (s2, n) + 1;
-  int ret = 0;
-
-  n = ns1 < n ? ns1 : n;
-  n = ns2 < n ? ns2 : n;
-  while (n-- && (ret = *(unsigned char *) s1++ - * (unsigned char *) s2++) == 0);
-  return ret;
-}
-
 #endif
 
 typedef int (*proto_t) (const CHAR *, const CHAR *, size_t);
 
-IMPL (STUPID_STRNCMP, 0)
 IMPL (SIMPLE_STRNCMP, 0)
 IMPL (STRNCMP, 1)
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 4/7] string: Improve coverage in test-strcmp.c and test-strncmp.c
  2022-01-10  0:27 ` [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S " Noah Goldstein
  2022-01-10  0:27   ` [PATCH v2 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S " Noah Goldstein
  2022-01-10  0:27   ` [PATCH v2 3/7] string/test-str*cmp: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp] Noah Goldstein
@ 2022-01-10  0:27   ` Noah Goldstein
  2022-01-10  0:38     ` H.J. Lu
  2022-01-10  0:27   ` [PATCH v2 5/7] x86: Optimize strcmp-avx2.S Noah Goldstein
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10  0:27 UTC (permalink / raw)
  To: libc-alpha

Add additional test cases for small / medium sizes.

Add tests in test-strncmp.c where `n` is near ULONG_MAX or LONG_MIN to
test for overflow bugs in length handling.
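
The overflow class these cases target looks roughly like this (a minimal
sketch with hypothetical buffers, not code taken from the harness):

#include <assert.h>
#include <wchar.h>

int
main (void)
{
  /* Both strings terminate well inside valid memory, so any n, however
     large, is a legal bound and the result must come from the 'X'/'Y'
     mismatch.  An implementation that converts n to bytes with
     n * sizeof (wchar_t) in a 64-bit register wraps for n >= 2^62 and
     may wrongly treat the bound as tiny (BZ# 28755).  */
  wchar_t s1[] = L"abcX";
  wchar_t s2[] = L"abcY";

  assert (wcsncmp (s1, s2, (size_t) -1) < 0);	     /* n == ULONG_MAX.  */
  assert (wcsncmp (s1, s2, (size_t) 1 << 63) < 0);   /* n == (size_t) LONG_MIN.  */
  return 0;
}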

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 string/test-strcmp.c  |  70 ++++++++++--
 string/test-strncmp.c | 248 +++++++++++++++++++++++++++++++++++++++---
 2 files changed, 298 insertions(+), 20 deletions(-)

diff --git a/string/test-strcmp.c b/string/test-strcmp.c
index 97d7bf5043..eacbdc8857 100644
--- a/string/test-strcmp.c
+++ b/string/test-strcmp.c
@@ -16,6 +16,9 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
+#define TEST_LEN (4096 * 3)
+#define MIN_PAGE_SIZE (TEST_LEN + 2 * getpagesize ())
+
 #define TEST_MAIN
 #ifdef WIDE
 # define TEST_NAME "wcscmp"
@@ -129,7 +132,7 @@ do_one_test (impl_t *impl,
 
 static void
 do_test (size_t align1, size_t align2, size_t len, int max_char,
-	 int exp_result)
+         int exp_result)
 {
   size_t i;
 
@@ -138,19 +141,22 @@ do_test (size_t align1, size_t align2, size_t len, int max_char,
   if (len == 0)
     return;
 
-  align1 &= 63;
+  align1 &= ~(CHARBYTES - 1);
+  align2 &= ~(CHARBYTES - 1);
+
+  align1 &= getpagesize () - 1;
   if (align1 + (len + 1) * CHARBYTES >= page_size)
     return;
 
-  align2 &= 63;
+  align2 &= getpagesize () - 1;
   if (align2 + (len + 1) * CHARBYTES >= page_size)
     return;
 
   /* Put them close to the end of page.  */
   i = align1 + CHARBYTES * (len + 2);
-  s1 = (CHAR *) (buf1 + ((page_size - i) / 16 * 16) + align1);
+  s1 = (CHAR *)(buf1 + ((page_size - i) / 16 * 16) + align1);
   i = align2 + CHARBYTES * (len + 2);
-  s2 = (CHAR *) (buf2 + ((page_size - i) / 16 * 16)  + align2);
+  s2 = (CHAR *)(buf2 + ((page_size - i) / 16 * 16) + align2);
 
   for (i = 0; i < len; i++)
     s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
@@ -161,9 +167,10 @@ do_test (size_t align1, size_t align2, size_t len, int max_char,
   s2[len - 1] -= exp_result;
 
   FOR_EACH_IMPL (impl, 0)
-    do_one_test (impl, s1, s2, exp_result);
+  do_one_test (impl, s1, s2, exp_result);
 }
 
+
 static void
 do_random_tests (void)
 {
@@ -385,7 +392,7 @@ check3 (void)
 int
 test_main (void)
 {
-  size_t i;
+  size_t i, j;
 
   test_init ();
   check();
@@ -426,6 +433,55 @@ test_main (void)
       do_test (2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, -1);
     }
 
+  for (j = 0; j < 160; ++j)
+    {
+      for (i = 0; i < TEST_LEN;)
+        {
+          do_test (getpagesize () - j - 1, 0, i, 127, 0);
+          do_test (getpagesize () - j - 1, 0, i, 127, 1);
+          do_test (getpagesize () - j - 1, 0, i, 127, -1);
+
+          do_test (getpagesize () - j - 1, j, i, 127, 0);
+          do_test (getpagesize () - j - 1, j, i, 127, 1);
+          do_test (getpagesize () - j - 1, j, i, 127, -1);
+
+          do_test (0, getpagesize () - j - 1, i, 127, 0);
+          do_test (0, getpagesize () - j - 1, i, 127, 1);
+          do_test (0, getpagesize () - j - 1, i, 127, -1);
+
+          do_test (j, getpagesize () - j - 1, i, 127, 0);
+          do_test (j, getpagesize () - j - 1, i, 127, 1);
+          do_test (j, getpagesize () - j - 1, i, 127, -1);
+
+          if (i < 32)
+            {
+              i += 1;
+            }
+          else if (i < 161)
+            {
+              i += 7;
+            }
+          else if (i + 161 < TEST_LEN)
+            {
+              i += 31;
+              i *= 17;
+              i /= 16;
+              if (i + 161 > TEST_LEN)
+                {
+                  i = TEST_LEN - 160;
+                }
+            }
+          else if (i + 32 < TEST_LEN)
+            {
+              i += 7;
+            }
+          else
+            {
+              i += 1;
+            }
+        }
+    }
+
   do_random_tests ();
   return ret;
 }
diff --git a/string/test-strncmp.c b/string/test-strncmp.c
index 61a283a0af..4fa6106eb4 100644
--- a/string/test-strncmp.c
+++ b/string/test-strncmp.c
@@ -16,6 +16,9 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
+#define TEST_LEN (4096 * 3)
+#define MIN_PAGE_SIZE (TEST_LEN + 2 * getpagesize ())
+
 #define TEST_MAIN
 #ifdef WIDE
 # define TEST_NAME "wcsncmp"
@@ -166,10 +169,10 @@ do_test_limit (size_t align1, size_t align2, size_t len, size_t n, int max_char,
 }
 
 static void
-do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
-	 int exp_result)
+do_test_n (size_t align1, size_t align2, size_t len, size_t n, int n_in_bounds,
+           int max_char, int exp_result)
 {
-  size_t i;
+  size_t i, buf_bound;
   CHAR *s1, *s2;
 
   align1 &= ~(CHARBYTES - 1);
@@ -178,22 +181,28 @@ do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
   if (n == 0)
     return;
 
-  align1 &= 63;
-  if (align1 + (n + 1) * CHARBYTES >= page_size)
+  buf_bound = n_in_bounds ? n : len;
+
+  align1 &= getpagesize () - 1;
+  if (align1 + (buf_bound + 1) * CHARBYTES >= page_size)
     return;
 
-  align2 &= 63;
-  if (align2 + (n + 1) * CHARBYTES >= page_size)
+  align2 &= getpagesize () - 1;
+  if (align2 + (buf_bound + 1) * CHARBYTES >= page_size)
     return;
 
-  s1 = (CHAR *) (buf1 + align1);
-  s2 = (CHAR *) (buf2 + align2);
+  s1 = (CHAR *)(buf1 + align1);
+  s2 = (CHAR *)(buf2 + align2);
 
-  for (i = 0; i < n; i++)
+  if (n_in_bounds)
+    {
+      s1[n] = 24 + exp_result;
+      s2[n] = 23;
+    }
+
+  for (i = 0; i < buf_bound; i++)
     s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
 
-  s1[n] = 24 + exp_result;
-  s2[n] = 23;
   s1[len] = 0;
   s2[len] = 0;
   if (exp_result < 0)
@@ -207,6 +216,13 @@ do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
     do_one_test (impl, s1, s2, n, exp_result);
 }
 
+static void
+do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
+         int exp_result)
+{
+  do_test_n (align1, align2, len, n, 1, max_char, exp_result);
+}
+
 static void
 do_page_test (size_t offset1, size_t offset2, CHAR *s2)
 {
@@ -400,10 +416,123 @@ check3 (void)
 	}
 }
 
+static void
+check_overflow (void)
+{
+  size_t i, j, of_mask, of_idx;
+  const size_t of_masks[]
+      = { ULONG_MAX, LONG_MIN, ULONG_MAX - (ULONG_MAX >> 2),
+          ((size_t)LONG_MAX) >> 1 };
+
+  for (of_idx = 0; of_idx < sizeof (of_masks) / sizeof (of_masks[0]); ++of_idx)
+    {
+      of_mask = of_masks[of_idx];
+      for (j = 0; j < 160; ++j)
+        {
+          for (i = 1; i <= 161; i += (32 / sizeof (CHAR)))
+            {
+              do_test_n (j, 0, i, of_mask, 0, 127, 0);
+              do_test_n (j, 0, i, of_mask, 0, 127, 1);
+              do_test_n (j, 0, i, of_mask, 0, 127, -1);
+
+              do_test_n (j, 0, i, of_mask - j / 2, 0, 127, 0);
+              do_test_n (j, 0, i, of_mask - j * 2, 0, 127, 1);
+              do_test_n (j, 0, i, of_mask - j, 0, 127, -1);
+
+              do_test_n (j / 2, j, i, of_mask, 0, 127, 0);
+              do_test_n (j / 2, j, i, of_mask, 0, 127, 1);
+              do_test_n (j / 2, j, i, of_mask, 0, 127, -1);
+
+              do_test_n (j / 2, j, i, of_mask - j, 0, 127, 0);
+              do_test_n (j / 2, j, i, of_mask - j / 2, 0, 127, 1);
+              do_test_n (j / 2, j, i, of_mask - j * 2, 0, 127, -1);
+
+              do_test_n (0, j, i, of_mask - j * 2, 0, 127, 0);
+              do_test_n (0, j, i, of_mask - j, 0, 127, 1);
+              do_test_n (0, j, i, of_mask - j / 2, 0, 127, -1);
+
+              do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, 0);
+              do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, 1);
+              do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, -1);
+
+              do_test_n (getpagesize () - j - 1, 0, i, of_mask - j / 2, 0, 127,
+                         0);
+              do_test_n (getpagesize () - j - 1, 0, i, of_mask - j * 2, 0, 127,
+                         1);
+              do_test_n (getpagesize () - j - 1, 0, i, of_mask - j, 0, 127,
+                         -1);
+
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
+                         of_mask, 0, 127, 0);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
+                         of_mask, 0, 127, 1);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
+                         of_mask, 0, 127, -1);
+
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
+                         of_mask - j, 0, 127, 0);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
+                         of_mask - j / 2, 0, 127, 1);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
+                         of_mask - j * 2, 0, 127, -1);
+            }
+
+          for (i = 1; i < TEST_LEN; i += i)
+            {
+              do_test_n (j, 0, i - 1, of_mask, 0, 127, 0);
+              do_test_n (j, 0, i - 1, of_mask, 0, 127, 1);
+              do_test_n (j, 0, i - 1, of_mask, 0, 127, -1);
+
+              do_test_n (j, 0, i - 1, of_mask - j / 2, 0, 127, 0);
+              do_test_n (j, 0, i - 1, of_mask - j * 2, 0, 127, 1);
+              do_test_n (j, 0, i - 1, of_mask - j, 0, 127, -1);
+
+              do_test_n (j / 2, j, i - 1, of_mask, 0, 127, 0);
+              do_test_n (j / 2, j, i - 1, of_mask, 0, 127, 1);
+              do_test_n (j / 2, j, i - 1, of_mask, 0, 127, -1);
+
+              do_test_n (j / 2, j, i - 1, of_mask - j, 0, 127, 0);
+              do_test_n (j / 2, j, i - 1, of_mask - j / 2, 0, 127, 1);
+              do_test_n (j / 2, j, i - 1, of_mask - j * 2, 0, 127, -1);
+
+              do_test_n (0, j, i - 1, of_mask - j * 2, 0, 127, 0);
+              do_test_n (0, j, i - 1, of_mask - j, 0, 127, 1);
+              do_test_n (0, j, i - 1, of_mask - j / 2, 0, 127, -1);
+
+              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127, 0);
+              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127, 1);
+              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127,
+                         -1);
+
+              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j / 2, 0,
+                         127, 0);
+              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j * 2, 0,
+                         127, 1);
+              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j, 0, 127,
+                         -1);
+
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
+                         i - 1, of_mask, 0, 127, 0);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
+                         i - 1, of_mask, 0, 127, 1);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
+                         i - 1, of_mask, 0, 127, -1);
+
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
+                         i - 1, of_mask - j, 0, 127, 0);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
+                         i - 1, of_mask - j / 2, 0, 127, 1);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
+                         i - 1, of_mask - j * 2, 0, 127, -1);
+            }
+        }
+    }
+}
+
 int
 test_main (void)
 {
-  size_t i;
+  size_t i, j;
 
   test_init ();
 
@@ -470,6 +599,99 @@ test_main (void)
       do_test_limit (0, 0, 15 - i, 16 - i, 255, -1);
     }
 
+  for (j = 0; j < 160; ++j)
+    {
+      for (i = 0; i < TEST_LEN;)
+        {
+          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, 0, i, i, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, 0, i, i - 1, 0, 127, 0);
+
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, j, i, i, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, j, i, i - 1, 0, 127, 0);
+
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, -1);
+
+          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, 0);
+          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, 1);
+          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, -1);
+
+          do_test_n (0, getpagesize () - j - 1, i, i, 0, 127, 0);
+          do_test_n (0, getpagesize () - j - 1, i, i - 1, 0, 127, 0);
+
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 0);
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 1);
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, -1);
+
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 0);
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 1);
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, -1);
+
+          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, 0);
+          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, 1);
+          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, -1);
+
+          do_test_n (j, getpagesize () - j - 1, i, i, 0, 127, 0);
+          do_test_n (j, getpagesize () - j - 1, i, i - 1, 0, 127, 0);
+
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 0);
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 1);
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, -1);
+
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 0);
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 1);
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, -1);
+          if (i < 32)
+            {
+              i += 1;
+            }
+          else if (i < 161)
+            {
+              i += 7;
+            }
+          else if (i + 161 < TEST_LEN)
+            {
+              i += 31;
+              i *= 17;
+              i /= 16;
+              if (i + 161 > TEST_LEN)
+                {
+                  i = TEST_LEN - 160;
+                }
+            }
+          else if (i + 32 < TEST_LEN)
+            {
+              i += 7;
+            }
+          else
+            {
+              i += 1;
+            }
+        }
+    }
+
+  check_overflow ();
   do_random_tests ();
   return ret;
 }
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 5/7] x86: Optimize strcmp-avx2.S
  2022-01-10  0:27 ` [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S " Noah Goldstein
                     ` (2 preceding siblings ...)
  2022-01-10  0:27   ` [PATCH v2 4/7] string: Improve coverage in test-strcmp.c and test-strncmp.c Noah Goldstein
@ 2022-01-10  0:27   ` Noah Goldstein
  2022-01-10  0:41     ` H.J. Lu
  2022-01-10  0:27   ` [PATCH v2 6/7] x86: Optimize strcmp-evex.S Noah Goldstein
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10  0:27 UTC (permalink / raw)
  To: libc-alpha

Optimizations are primarily to the loop logic and how the page cross
logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards. This only affects particular
cases, however, and is generally made up for by more than 10x improvements
on the transition from the page cross -> loop case.

The non-page cross cases are improved most for smaller sizes [0, 128]
and go about even for (128, 4096]. The loop page cross logic is
improved so some more significant speedup is seen there as well.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
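
The cheap page-cross filter the new entry code relies on can be written
in C roughly as follows (a sketch assuming 4 KiB pages and 32-byte ymm
vectors; may_cross_page is a made-up name).  It only looks at the low 12
bits (the offset within the page) of both pointers, so one unsigned
compare covers s1 and s2 at once; it can report false positives because
the offsets are OR'd together, which is why the page-cross path re-checks
with a non-faulting test before doing the expensive work.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define VEC_SIZE  32

static bool
may_cross_page (const char *s1, const char *s2)
{
  /* Shift the page-offset bits to the top of a 32-bit register (the asm
     uses sall $20) and compare against the last offset from which a full
     4x VEC_SIZE unaligned load is still page-safe.  */
  uint32_t off = ((uint32_t) (uintptr_t) s1 | (uint32_t) (uintptr_t) s2) << 20;
  return off > ((uint32_t) (PAGE_SIZE - VEC_SIZE * 4) << 20);
}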

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 sysdeps/x86_64/multiarch/strcmp-avx2.S | 1590 ++++++++++++++----------
 1 file changed, 939 insertions(+), 651 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
index 9c73b5899d..28d6a0025a 100644
--- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
+++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
@@ -26,35 +26,57 @@
 
 # define PAGE_SIZE	4096
 
-/* VEC_SIZE = Number of bytes in a ymm register */
+	/* VEC_SIZE = Number of bytes in a ymm register.  */
 # define VEC_SIZE	32
 
-/* Shift for dividing by (VEC_SIZE * 4).  */
-# define DIVIDE_BY_VEC_4_SHIFT	7
-# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
-#  error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
-# endif
+# define VMOVU	vmovdqu
+# define VMOVA	vmovdqa
 
 # ifdef USE_AS_WCSCMP
-/* Compare packed dwords.  */
+	/* Compare packed dwords.  */
 #  define VPCMPEQ	vpcmpeqd
-/* Compare packed dwords and store minimum.  */
+	/* Compare packed dwords and store minimum.  */
 #  define VPMINU	vpminud
-/* 1 dword char == 4 bytes.  */
+	/* 1 dword char == 4 bytes.  */
 #  define SIZE_OF_CHAR	4
 # else
-/* Compare packed bytes.  */
+	/* Compare packed bytes.  */
 #  define VPCMPEQ	vpcmpeqb
-/* Compare packed bytes and store minimum.  */
+	/* Compare packed bytes and store minimum.  */
 #  define VPMINU	vpminub
-/* 1 byte char == 1 byte.  */
+	/* 1 byte char == 1 byte.  */
 #  define SIZE_OF_CHAR	1
 # endif
 
+# ifdef USE_AS_STRNCMP
+#  define LOOP_REG	r9d
+#  define LOOP_REG64	r9
+
+#  define OFFSET_REG8	r9b
+#  define OFFSET_REG	r9d
+#  define OFFSET_REG64	r9
+# else
+#  define LOOP_REG	edx
+#  define LOOP_REG64	rdx
+
+#  define OFFSET_REG8	dl
+#  define OFFSET_REG	edx
+#  define OFFSET_REG64	rdx
+# endif
+
 # ifndef VZEROUPPER
 #  define VZEROUPPER	vzeroupper
 # endif
 
+# if defined USE_AS_STRNCMP
+#  define VEC_OFFSET	0
+# else
+#  define VEC_OFFSET	(-VEC_SIZE)
+# endif
+
+# define xmmZERO	xmm15
+# define ymmZERO	ymm15
+
 # ifndef SECTION
 #  define SECTION(p)	p##.avx
 # endif
@@ -79,783 +101,1049 @@
    the maximum offset is reached before a difference is found, zero is
    returned.  */
 
-	.section SECTION(.text),"ax",@progbits
-ENTRY (STRCMP)
+	.section SECTION(.text), "ax", @progbits
+ENTRY(STRCMP)
 # ifdef USE_AS_STRNCMP
-	/* Check for simple cases (0 or 1) in offset.  */
+#  ifdef __ILP32__
+	/* Clear the upper 32 bits.  */
+	movl	%edx, %rdx
+#  endif
 	cmp	$1, %RDX_LP
-	je	L(char0)
-	jb	L(zero)
+	/* Signed comparison intentional. We use this branch to also
+	   test cases where length >= 2^63. These very large sizes can be
+	   handled with strcmp as there is no way for that length to
+	   actually bound the buffer.  */
+	jle	L(one_or_less)
 #  ifdef USE_AS_WCSCMP
-#  ifndef __ILP32__
 	movq	%rdx, %rcx
-	/* Check if length could overflow when multiplied by
-	   sizeof(wchar_t). Checking top 8 bits will cover all potential
-	   overflow cases as well as redirect cases where its impossible to
-	   length to bound a valid memory region. In these cases just use
-	   'wcscmp'.  */
+
+	/* Multiplying length by sizeof(wchar_t) can result in overflow.
+	   Check if that is possible. All cases where overflow is possible
+	   are cases where length is large enough that it can never be a
+	   bound on valid memory so just use wcscmp.  */
 	shrq	$56, %rcx
 	jnz	__wcscmp_avx2
+
+	leaq	(, %rdx, 4), %rdx
 #  endif
-	/* Convert units: from wide to byte char.  */
-	shl	$2, %RDX_LP
-#  endif
-	/* Register %r11 tracks the maximum offset.  */
-	mov	%RDX_LP, %R11_LP
 # endif
+	vpxor	%xmmZERO, %xmmZERO, %xmmZERO
 	movl	%edi, %eax
-	xorl	%edx, %edx
-	/* Make %xmm7 (%ymm7) all zeros in this function.  */
-	vpxor	%xmm7, %xmm7, %xmm7
 	orl	%esi, %eax
-	andl	$(PAGE_SIZE - 1), %eax
-	cmpl	$(PAGE_SIZE - (VEC_SIZE * 4)), %eax
-	jg	L(cross_page)
-	/* Start comparing 4 vectors.  */
-	vmovdqu	(%rdi), %ymm1
-	VPCMPEQ	(%rsi), %ymm1, %ymm0
-	VPMINU	%ymm1, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-	vpmovmskb %ymm0, %ecx
-	testl	%ecx, %ecx
-	je	L(next_3_vectors)
-	tzcntl	%ecx, %edx
+	sall	$20, %eax
+	/* Check if s1 or s2 may cross a page  in next 4x VEC loads.  */
+	cmpl	$((PAGE_SIZE -(VEC_SIZE * 4)) << 20), %eax
+	ja	L(page_cross)
+
+L(no_page_cross):
+	/* Safe to compare 4x vectors.  */
+	VMOVU	(%rdi), %ymm0
+	/* 1s where s1 and s2 equal.  */
+	VPCMPEQ	(%rsi), %ymm0, %ymm1
+	/* 1s at null CHAR.  */
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	/* 1s where s1 and s2 equal AND not null CHAR.  */
+	vpandn	%ymm1, %ymm2, %ymm1
+
+	/* All 1s -> keep going, any 0s -> return.  */
+	vpmovmskb %ymm1, %ecx
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx) is after the maximum
-	   offset (%r11).   */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$VEC_SIZE, %rdx
+	jbe	L(vec_0_test_len)
 # endif
+
+	/* All 1s represents all equals. incl will overflow to zero in
+	   all equals case. Otherwise 1s will carry until position of first
+	   mismatch.  */
+	incl	%ecx
+	jz	L(more_3x_vec)
+
+	.p2align 4,, 4
+L(return_vec_0):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_WCSCMP
+	movl	(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	je	L(return)
-L(wcscmp_return):
+	cmpl	(%rsi, %rcx), %edx
+	je	L(ret0)
 	setl	%al
 	negl	%eax
 	orl	$1, %eax
-L(return):
 # else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 # endif
+L(ret0):
 L(return_vzeroupper):
 	ZERO_UPPER_VEC_REGISTERS_RETURN
 
-	.p2align 4
-L(return_vec_size):
-	tzcntl	%ecx, %edx
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after
-	   the maximum offset (%r11).  */
-	addq	$VEC_SIZE, %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+	.p2align 4,, 8
+L(vec_0_test_len):
+	notl	%ecx
+	bzhil	%edx, %ecx, %eax
+	jnz	L(return_vec_0)
+	/* Align if will cross fetch block.  */
+	.p2align 4,, 2
+L(ret_zero):
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
-# else
+	VZEROUPPER_RETURN
+
+	.p2align 4,, 5
+L(one_or_less):
+	jb	L(ret_zero)
 #  ifdef USE_AS_WCSCMP
+	/* 'nbe' covers the case where length is negative (large
+	   unsigned).  */
+	jnbe	__wcscmp_avx2
+	movl	(%rdi), %edx
 	xorl	%eax, %eax
-	movl	VEC_SIZE(%rdi, %rdx), %ecx
-	cmpl	VEC_SIZE(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	cmpl	(%rsi), %edx
+	je	L(ret1)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 #  else
-	movzbl	VEC_SIZE(%rdi, %rdx), %eax
-	movzbl	VEC_SIZE(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	/* 'nbe' covers the case where length is negative (large
+	   unsigned).  */
+
+	jnbe	__strcmp_avx2
+	movzbl	(%rdi), %eax
+	movzbl	(%rsi), %ecx
+	subl	%ecx, %eax
 #  endif
+L(ret1):
+	ret
 # endif
-	VZEROUPPER_RETURN
 
-	.p2align 4
-L(return_2_vec_size):
-	tzcntl	%ecx, %edx
+	.p2align 4,, 10
+L(return_vec_1):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is
-	   after the maximum offset (%r11).  */
-	addq	$(VEC_SIZE * 2), %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+	/* rdx must be > CHAR_PER_VEC so it is safe to subtract without
+	   fear of overflow.  */
+	addq	$-VEC_SIZE, %rdx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero)
+# endif
+# ifdef USE_AS_WCSCMP
+	movl	VEC_SIZE(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	VEC_SIZE(%rsi, %rcx), %edx
+	je	L(ret2)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 # else
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rdi, %rdx), %ecx
-	cmpl	(VEC_SIZE * 2)(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rdi, %rdx), %eax
-	movzbl	(VEC_SIZE * 2)(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	movzbl	VEC_SIZE(%rdi, %rcx), %eax
+	movzbl	VEC_SIZE(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 # endif
+L(ret2):
 	VZEROUPPER_RETURN
 
-	.p2align 4
-L(return_3_vec_size):
-	tzcntl	%ecx, %edx
+	.p2align 4,, 10
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is
-	   after the maximum offset (%r11).  */
-	addq	$(VEC_SIZE * 3), %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+L(return_vec_3):
+	salq	$32, %rcx
+# endif
+
+L(return_vec_2):
+# ifndef USE_AS_STRNCMP
+	tzcntl	%ecx, %ecx
+# else
+	tzcntq	%rcx, %rcx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 2)(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(VEC_SIZE * 2)(%rsi, %rcx), %edx
+	je	L(ret3)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 # else
+	movzbl	(VEC_SIZE * 2)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 2)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+# endif
+L(ret3):
+	VZEROUPPER_RETURN
+
+# ifndef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_vec_3):
+	tzcntl	%ecx, %ecx
 #  ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 3)(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(VEC_SIZE * 3)(%rdi, %rdx), %ecx
-	cmpl	(VEC_SIZE * 3)(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	cmpl	(VEC_SIZE * 3)(%rsi, %rcx), %edx
+	je	L(ret4)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 #  else
-	movzbl	(VEC_SIZE * 3)(%rdi, %rdx), %eax
-	movzbl	(VEC_SIZE * 3)(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	movzbl	(VEC_SIZE * 3)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 3)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 #  endif
-# endif
+L(ret4):
 	VZEROUPPER_RETURN
+# endif
+
+	.p2align 4,, 10
+L(more_3x_vec):
+	/* Safe to compare 4x vectors.  */
+	VMOVU	VEC_SIZE(%rdi), %ymm0
+	VPCMPEQ	VEC_SIZE(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_1)
+
+# ifdef USE_AS_STRNCMP
+	subq	$(VEC_SIZE * 2), %rdx
+	jbe	L(ret_zero)
+# endif
+
+	VMOVU	(VEC_SIZE * 2)(%rdi), %ymm0
+	VPCMPEQ	(VEC_SIZE * 2)(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_2)
+
+	VMOVU	(VEC_SIZE * 3)(%rdi), %ymm0
+	VPCMPEQ	(VEC_SIZE * 3)(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_3)
 
-	.p2align 4
-L(next_3_vectors):
-	vmovdqu	VEC_SIZE(%rdi), %ymm6
-	VPCMPEQ	VEC_SIZE(%rsi), %ymm6, %ymm3
-	VPMINU	%ymm6, %ymm3, %ymm3
-	VPCMPEQ	%ymm7, %ymm3, %ymm3
-	vpmovmskb %ymm3, %ecx
-	testl	%ecx, %ecx
-	jne	L(return_vec_size)
-	vmovdqu	(VEC_SIZE * 2)(%rdi), %ymm5
-	vmovdqu	(VEC_SIZE * 3)(%rdi), %ymm4
-	vmovdqu	(VEC_SIZE * 3)(%rsi), %ymm0
-	VPCMPEQ	(VEC_SIZE * 2)(%rsi), %ymm5, %ymm2
-	VPMINU	%ymm5, %ymm2, %ymm2
-	VPCMPEQ	%ymm4, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm2, %ymm2
-	vpmovmskb %ymm2, %ecx
-	testl	%ecx, %ecx
-	jne	L(return_2_vec_size)
-	VPMINU	%ymm4, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-	vpmovmskb %ymm0, %ecx
-	testl	%ecx, %ecx
-	jne	L(return_3_vec_size)
-L(main_loop_header):
-	leaq	(VEC_SIZE * 4)(%rdi), %rdx
-	movl	$PAGE_SIZE, %ecx
-	/* Align load via RAX.  */
-	andq	$-(VEC_SIZE * 4), %rdx
-	subq	%rdi, %rdx
-	leaq	(%rdi, %rdx), %rax
 # ifdef USE_AS_STRNCMP
-	/* Starting from this point, the maximum offset, or simply the
-	   'offset', DECREASES by the same amount when base pointers are
-	   moved forward.  Return 0 when:
-	     1) On match: offset <= the matched vector index.
-	     2) On mistmach, offset is before the mistmatched index.
+	cmpq	$(VEC_SIZE * 2), %rdx
+	jbe	L(ret_zero)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	/* Any non-zero positive value that doesn't interfere with 0x1.
 	 */
-	subq	%rdx, %r11
-	jbe	L(zero)
-# endif
-	addq	%rsi, %rdx
-	movq	%rdx, %rsi
-	andl	$(PAGE_SIZE - 1), %esi
-	/* Number of bytes before page crossing.  */
-	subq	%rsi, %rcx
-	/* Number of VEC_SIZE * 4 blocks before page crossing.  */
-	shrq	$DIVIDE_BY_VEC_4_SHIFT, %rcx
-	/* ESI: Number of VEC_SIZE * 4 blocks before page crossing.   */
-	movl	%ecx, %esi
-	jmp	L(loop_start)
+	movl	$2, %r8d
 
+# else
+	xorl	%r8d, %r8d
+# endif
+
+	/* The prepare labels are various entry points from the page
+	   cross logic.  */
+L(prepare_loop):
+
+# ifdef USE_AS_STRNCMP
+	/* Store N + (VEC_SIZE * 4) and place check at the beginning of
+	   the loop.  */
+	leaq	(VEC_SIZE * 2)(%rdi, %rdx), %rdx
+# endif
+L(prepare_loop_no_len):
+
+	/* Align s1 and adjust s2 accordingly.  */
+	subq	%rdi, %rsi
+	andq	$-(VEC_SIZE * 4), %rdi
+	addq	%rdi, %rsi
+
+# ifdef USE_AS_STRNCMP
+	subq	%rdi, %rdx
+# endif
+
+L(prepare_loop_aligned):
+	/* eax stores distance from rsi to next page cross. These cases
+	   need to be handled specially as the 4x loop could potentially
+	   read memory past the length of s1 or s2 and across a page
+	   boundary.  */
+	movl	$-(VEC_SIZE * 4), %eax
+	subl	%esi, %eax
+	andl	$(PAGE_SIZE - 1), %eax
+
+	/* Loop 4x comparisons at a time.  */
 	.p2align 4
 L(loop):
+
+	/* End condition for strncmp.  */
 # ifdef USE_AS_STRNCMP
-	/* Base pointers are moved forward by 4 * VEC_SIZE.  Decrease
-	   the maximum offset (%r11) by the same amount.  */
-	subq	$(VEC_SIZE * 4), %r11
-	jbe	L(zero)
-# endif
-	addq	$(VEC_SIZE * 4), %rax
-	addq	$(VEC_SIZE * 4), %rdx
-L(loop_start):
-	testl	%esi, %esi
-	leal	-1(%esi), %esi
-	je	L(loop_cross_page)
-L(back_to_loop):
-	/* Main loop, comparing 4 vectors are a time.  */
-	vmovdqa	(%rax), %ymm0
-	vmovdqa	VEC_SIZE(%rax), %ymm3
-	VPCMPEQ	(%rdx), %ymm0, %ymm4
-	VPCMPEQ	VEC_SIZE(%rdx), %ymm3, %ymm1
-	VPMINU	%ymm0, %ymm4, %ymm4
-	VPMINU	%ymm3, %ymm1, %ymm1
-	vmovdqa	(VEC_SIZE * 2)(%rax), %ymm2
-	VPMINU	%ymm1, %ymm4, %ymm0
-	vmovdqa	(VEC_SIZE * 3)(%rax), %ymm3
-	VPCMPEQ	(VEC_SIZE * 2)(%rdx), %ymm2, %ymm5
-	VPCMPEQ	(VEC_SIZE * 3)(%rdx), %ymm3, %ymm6
-	VPMINU	%ymm2, %ymm5, %ymm5
-	VPMINU	%ymm3, %ymm6, %ymm6
-	VPMINU	%ymm5, %ymm0, %ymm0
-	VPMINU	%ymm6, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-
-	/* Test each mask (32 bits) individually because for VEC_SIZE
-	   == 32 is not possible to OR the four masks and keep all bits
-	   in a 64-bit integer register, differing from SSE2 strcmp
-	   where ORing is possible.  */
-	vpmovmskb %ymm0, %ecx
+	subq	$(VEC_SIZE * 4), %rdx
+	jbe	L(ret_zero)
+# endif
+
+	subq	$-(VEC_SIZE * 4), %rdi
+	subq	$-(VEC_SIZE * 4), %rsi
+
+	/* Check if rsi loads will cross a page boundary.  */
+	addl	$-(VEC_SIZE * 4), %eax
+	jnb	L(page_cross_during_loop)
+
+	/* Loop entry after handling page cross during loop.  */
+L(loop_skip_page_cross_check):
+	VMOVA	(VEC_SIZE * 0)(%rdi), %ymm0
+	VMOVA	(VEC_SIZE * 1)(%rdi), %ymm2
+	VMOVA	(VEC_SIZE * 2)(%rdi), %ymm4
+	VMOVA	(VEC_SIZE * 3)(%rdi), %ymm6
+
+	/* ymm1 all 1s where s1 and s2 equal. All 0s otherwise.  */
+	VPCMPEQ	(VEC_SIZE * 0)(%rsi), %ymm0, %ymm1
+
+	VPCMPEQ	(VEC_SIZE * 1)(%rsi), %ymm2, %ymm3
+	VPCMPEQ	(VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
+	VPCMPEQ	(VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
+
+
+	/* If any mismatches or null CHAR then 0 CHAR, otherwise non-
+	   zero.  */
+	vpand	%ymm0, %ymm1, %ymm1
+
+
+	vpand	%ymm2, %ymm3, %ymm3
+	vpand	%ymm4, %ymm5, %ymm5
+	vpand	%ymm6, %ymm7, %ymm7
+
+	VPMINU	%ymm1, %ymm3, %ymm3
+	VPMINU	%ymm5, %ymm7, %ymm7
+
+	/* Reduce all 0 CHARs for the 4x VEC into ymm7.  */
+	VPMINU	%ymm3, %ymm7, %ymm7
+
+	/* If any 0 CHAR then done.  */
+	VPCMPEQ	%ymm7, %ymmZERO, %ymm7
+	vpmovmskb %ymm7, %LOOP_REG
+	testl	%LOOP_REG, %LOOP_REG
+	jz	L(loop)
+
+	/* Find which VEC has the mismatch or end of string.  */
+	VPCMPEQ	%ymm1, %ymmZERO, %ymm1
+	vpmovmskb %ymm1, %ecx
 	testl	%ecx, %ecx
-	je	L(loop)
-	VPCMPEQ	%ymm7, %ymm4, %ymm0
-	vpmovmskb %ymm0, %edi
-	testl	%edi, %edi
-	je	L(test_vec)
-	tzcntl	%edi, %ecx
+	jnz	L(return_vec_0_end)
+
+
+	VPCMPEQ	%ymm3, %ymmZERO, %ymm3
+	vpmovmskb %ymm3, %ecx
+	testl	%ecx, %ecx
+	jnz	L(return_vec_1_end)
+
+L(return_vec_2_3_end):
 # ifdef USE_AS_STRNCMP
-	cmpq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	subq	$(VEC_SIZE * 2), %rdx
+	jbe	L(ret_zero_end)
+# endif
+
+	VPCMPEQ	%ymm5, %ymmZERO, %ymm5
+	vpmovmskb %ymm5, %ecx
+	testl	%ecx, %ecx
+	jnz	L(return_vec_2_end)
+
+	/* LOOP_REG contains matches for null/mismatch from the loop. If
+	   VEC 0, 1, and 2 all have no null and no mismatches then the
+	   mismatch must entirely be from VEC 3 which is fully represented
+	   by LOOP_REG.  */
+	tzcntl	%LOOP_REG, %LOOP_REG
+
+# ifdef USE_AS_STRNCMP
+	subl	$-(VEC_SIZE), %LOOP_REG
+	cmpq	%LOOP_REG64, %rdx
+	jbe	L(ret_zero_end)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %ecx
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
+	je	L(ret5)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	movzbl	(VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %eax
+	movzbl	(VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret5):
 	VZEROUPPER_RETURN
 
-	.p2align 4
-L(test_vec):
 # ifdef USE_AS_STRNCMP
-	/* The first vector matched.  Return 0 if the maximum offset
-	   (%r11) <= VEC_SIZE.  */
-	cmpq	$VEC_SIZE, %r11
-	jbe	L(zero)
+	.p2align 4,, 2
+L(ret_zero_end):
+	xorl	%eax, %eax
+	VZEROUPPER_RETURN
 # endif
-	VPCMPEQ	%ymm7, %ymm1, %ymm1
-	vpmovmskb %ymm1, %ecx
-	testl	%ecx, %ecx
-	je	L(test_2_vec)
-	tzcntl	%ecx, %edi
+
+
+	/* The L(return_vec_N_end) labels differ from L(return_vec_N) in
+	   that they use the value of `r8` to negate the return value.
+	   This is because the page cross logic can swap `rdi` and `rsi`.  */
+	.p2align 4,, 10
 # ifdef USE_AS_STRNCMP
-	addq	$VEC_SIZE, %rdi
-	cmpq	%rdi, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+L(return_vec_1_end):
+	salq	$32, %rcx
+# endif
+L(return_vec_0_end):
+# ifndef USE_AS_STRNCMP
+	tzcntl	%ecx, %ecx
+# else
+	tzcntq	%rcx, %rcx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero_end)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	movl	(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rsi, %rdi), %ecx
-	cmpl	(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rdi), %eax
-	movzbl	(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(%rsi, %rcx), %edx
+	je	L(ret6)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
+# endif
+L(ret6):
+	VZEROUPPER_RETURN
+
+# ifndef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_vec_1_end):
+	tzcntl	%ecx, %ecx
 #  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	movl	VEC_SIZE(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	VEC_SIZE(%rsi, %rdi), %ecx
-	cmpl	VEC_SIZE(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
+	cmpl	VEC_SIZE(%rsi, %rcx), %edx
+	je	L(ret7)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 #  else
-	movzbl	VEC_SIZE(%rax, %rdi), %eax
-	movzbl	VEC_SIZE(%rdx, %rdi), %edx
-	subl	%edx, %eax
+	movzbl	VEC_SIZE(%rdi, %rcx), %eax
+	movzbl	VEC_SIZE(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 #  endif
-# endif
+L(ret7):
 	VZEROUPPER_RETURN
+# endif
 
-	.p2align 4
-L(test_2_vec):
+	.p2align 4,, 10
+L(return_vec_2_end):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_STRNCMP
-	/* The first 2 vectors matched.  Return 0 if the maximum offset
-	   (%r11) <= 2 * VEC_SIZE.  */
-	cmpq	$(VEC_SIZE * 2), %r11
-	jbe	L(zero)
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero_page_cross)
 # endif
-	VPCMPEQ	%ymm7, %ymm5, %ymm5
-	vpmovmskb %ymm5, %ecx
-	testl	%ecx, %ecx
-	je	L(test_3_vec)
-	tzcntl	%ecx, %edi
-# ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 2), %rdi
-	cmpq	%rdi, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+# ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 2)(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rsi, %rdi), %ecx
-	cmpl	(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rdi), %eax
-	movzbl	(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(VEC_SIZE * 2)(%rsi, %rcx), %edx
+	je	L(ret11)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rsi, %rdi), %ecx
-	cmpl	(VEC_SIZE * 2)(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rax, %rdi), %eax
-	movzbl	(VEC_SIZE * 2)(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
+	movzbl	(VEC_SIZE * 2)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 2)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret11):
 	VZEROUPPER_RETURN
 
-	.p2align 4
-L(test_3_vec):
+
+	/* Page cross in rsi in next 4x VEC.  */
+
+	/* TODO: Improve logic here.  */
+	.p2align 4,, 10
+L(page_cross_during_loop):
+	/* eax contains [distance_from_page - (VEC_SIZE * 4)].  */
+
+	/* Optimistically rsi and rdi are both aligned, in which case we
+	   don't need any logic here.  */
+	cmpl	$-(VEC_SIZE * 4), %eax
+	/* Don't adjust eax before jumping back to the loop; we will
+	   never hit the page cross case again.  */
+	je	L(loop_skip_page_cross_check)
+
+	/* Check if we can safely load a VEC.  */
+	cmpl	$-(VEC_SIZE * 3), %eax
+	jle	L(less_1x_vec_till_page_cross)
+
+	VMOVA	(%rdi), %ymm0
+	VPCMPEQ	(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_0_end)
+
+	/* if distance >= 2x VEC then eax > -(VEC_SIZE * 2).  */
+	cmpl	$-(VEC_SIZE * 2), %eax
+	jg	L(more_2x_vec_till_page_cross)
+
+	.p2align 4,, 4
+L(less_1x_vec_till_page_cross):
+	subl	$-(VEC_SIZE * 4), %eax
+	/* Guaranteed safe to read from rdi - VEC_SIZE here. The only
+	   concerning case is first iteration if incoming s1 was near start
+	   of a page and s2 near end. If s1 was near the start of the page
+	   we already aligned up to nearest VEC_SIZE * 4 so guaranteed safe
+	   to read back -VEC_SIZE. If rdi is truly at the start of a page
+	   here, it means the previous page (rdi - VEC_SIZE) has already
+	   been loaded earlier so must be valid.  */
+	VMOVU	-VEC_SIZE(%rdi, %rax), %ymm0
+	VPCMPEQ	-VEC_SIZE(%rsi, %rax), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+
+	/* Mask of potentially valid bits. The lower bits can come from
+	   out-of-range comparisons (but are safe regarding page crosses).  */
+	movl	$-1, %r10d
+	shlxl	%esi, %r10d, %r10d
+	notl	%ecx
+
 # ifdef USE_AS_STRNCMP
-	/* The first 3 vectors matched.  Return 0 if the maximum offset
-	   (%r11) <= 3 * VEC_SIZE.  */
-	cmpq	$(VEC_SIZE * 3), %r11
-	jbe	L(zero)
-# endif
-	VPCMPEQ	%ymm7, %ymm6, %ymm6
-	vpmovmskb %ymm6, %esi
-	tzcntl	%esi, %ecx
+	cmpq	%rax, %rdx
+	jbe	L(return_page_cross_end_check)
+# endif
+	movl	%eax, %OFFSET_REG
+	addl	$(PAGE_SIZE - VEC_SIZE * 4), %eax
+
+	andl	%r10d, %ecx
+	jz	L(loop_skip_page_cross_check)
+
+	.p2align 4,, 3
+L(return_page_cross_end):
+	tzcntl	%ecx, %ecx
+
 # ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 3), %rcx
-	cmpq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %esi
-	cmpl	(%rdx, %rcx), %esi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	leal	-VEC_SIZE(%OFFSET_REG64, %rcx), %ecx
+L(return_page_cross_cmp_mem):
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	addl	%OFFSET_REG, %ecx
+# endif
+# ifdef USE_AS_WCSCMP
+	movl	VEC_OFFSET(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(VEC_SIZE * 3)(%rsi, %rcx), %esi
-	cmpl	(VEC_SIZE * 3)(%rdx, %rcx), %esi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 3)(%rax, %rcx), %eax
-	movzbl	(VEC_SIZE * 3)(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	VEC_OFFSET(%rsi, %rcx), %edx
+	je	L(ret8)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
+# else
+	movzbl	VEC_OFFSET(%rdi, %rcx), %eax
+	movzbl	VEC_OFFSET(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret8):
 	VZEROUPPER_RETURN
 
-	.p2align 4
-L(loop_cross_page):
-	xorl	%r10d, %r10d
-	movq	%rdx, %rcx
-	/* Align load via RDX.  We load the extra ECX bytes which should
-	   be ignored.  */
-	andl	$((VEC_SIZE * 4) - 1), %ecx
-	/* R10 is -RCX.  */
-	subq	%rcx, %r10
-
-	/* This works only if VEC_SIZE * 2 == 64. */
-# if (VEC_SIZE * 2) != 64
-#  error (VEC_SIZE * 2) != 64
-# endif
-
-	/* Check if the first VEC_SIZE * 2 bytes should be ignored.  */
-	cmpl	$(VEC_SIZE * 2), %ecx
-	jge	L(loop_cross_page_2_vec)
-
-	vmovdqu	(%rax, %r10), %ymm2
-	vmovdqu	VEC_SIZE(%rax, %r10), %ymm3
-	VPCMPEQ	(%rdx, %r10), %ymm2, %ymm0
-	VPCMPEQ	VEC_SIZE(%rdx, %r10), %ymm3, %ymm1
-	VPMINU	%ymm2, %ymm0, %ymm0
-	VPMINU	%ymm3, %ymm1, %ymm1
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm1, %ymm1
-
-	vpmovmskb %ymm0, %edi
-	vpmovmskb %ymm1, %esi
-
-	salq	$32, %rsi
-	xorq	%rsi, %rdi
-
-	/* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
-	shrq	%cl, %rdi
-
-	testq	%rdi, %rdi
-	je	L(loop_cross_page_2_vec)
-	tzcntq	%rdi, %rcx
 # ifdef USE_AS_STRNCMP
-	cmpq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	.p2align 4,, 10
+L(return_page_cross_end_check):
+	tzcntl	%ecx, %ecx
+	leal	-VEC_SIZE(%rax, %rcx), %ecx
+	cmpl	%ecx, %edx
+	ja	L(return_page_cross_cmp_mem)
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# endif
 	VZEROUPPER_RETURN
+# endif
 
-	.p2align 4
-L(loop_cross_page_2_vec):
-	/* The first VEC_SIZE * 2 bytes match or are ignored.  */
-	vmovdqu	(VEC_SIZE * 2)(%rax, %r10), %ymm2
-	vmovdqu	(VEC_SIZE * 3)(%rax, %r10), %ymm3
-	VPCMPEQ	(VEC_SIZE * 2)(%rdx, %r10), %ymm2, %ymm5
-	VPMINU	%ymm2, %ymm5, %ymm5
-	VPCMPEQ	(VEC_SIZE * 3)(%rdx, %r10), %ymm3, %ymm6
-	VPCMPEQ	%ymm7, %ymm5, %ymm5
-	VPMINU	%ymm3, %ymm6, %ymm6
-	VPCMPEQ	%ymm7, %ymm6, %ymm6
-
-	vpmovmskb %ymm5, %edi
-	vpmovmskb %ymm6, %esi
-
-	salq	$32, %rsi
-	xorq	%rsi, %rdi
 
-	xorl	%r8d, %r8d
-	/* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
-	subl	$(VEC_SIZE * 2), %ecx
-	jle	1f
-	/* Skip ECX bytes.  */
-	shrq	%cl, %rdi
-	/* R8 has number of bytes skipped.  */
-	movl	%ecx, %r8d
-1:
-	/* Before jumping back to the loop, set ESI to the number of
-	   VEC_SIZE * 4 blocks before page crossing.  */
-	movl	$(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
-
-	testq	%rdi, %rdi
+	.p2align 4,, 10
+L(more_2x_vec_till_page_cross):
+	/* If there is more than 2x VEC until the page cross we will
+	   complete a full loop iteration here.  */
+
+	VMOVU	VEC_SIZE(%rdi), %ymm0
+	VPCMPEQ	VEC_SIZE(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_1_end)
+
 # ifdef USE_AS_STRNCMP
-	/* At this point, if %rdi value is 0, it already tested
-	   VEC_SIZE*4+%r10 byte starting from %rax. This label
-	   checks whether strncmp maximum offset reached or not.  */
-	je	L(string_nbyte_offset_check)
-# else
-	je	L(back_to_loop)
+	cmpq	$(VEC_SIZE * 2), %rdx
+	jbe	L(ret_zero_in_loop_page_cross)
 # endif
-	tzcntq	%rdi, %rcx
-	addq	%r10, %rcx
-	/* Adjust for number of bytes skipped.  */
-	addq	%r8, %rcx
+
+	subl	$-(VEC_SIZE * 4), %eax
+
+	/* Safe to include comparisons from lower bytes.  */
+	VMOVU	-(VEC_SIZE * 2)(%rdi, %rax), %ymm0
+	VPCMPEQ	-(VEC_SIZE * 2)(%rsi, %rax), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_page_cross_0)
+
+	VMOVU	-(VEC_SIZE * 1)(%rdi, %rax), %ymm0
+	VPCMPEQ	-(VEC_SIZE * 1)(%rsi, %rax), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_page_cross_1)
+
 # ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 2), %rcx
-	subq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	/* Must check length here as length might preclude reading the
+	   next page.  */
+	cmpq	%rax, %rdx
+	jbe	L(ret_zero_in_loop_page_cross)
+# endif
+
+	/* Finish the loop.  */
+	VMOVA	(VEC_SIZE * 2)(%rdi), %ymm4
+	VMOVA	(VEC_SIZE * 3)(%rdi), %ymm6
+
+	VPCMPEQ	(VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
+	VPCMPEQ	(VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
+	vpand	%ymm4, %ymm5, %ymm5
+	vpand	%ymm6, %ymm7, %ymm7
+	VPMINU	%ymm5, %ymm7, %ymm7
+	VPCMPEQ	%ymm7, %ymmZERO, %ymm7
+	vpmovmskb %ymm7, %LOOP_REG
+	testl	%LOOP_REG, %LOOP_REG
+	jnz	L(return_vec_2_3_end)
+
+	/* Best for code size to include the unconditional jmp here. It
+	   would be faster, if this case is hot, to duplicate the
+	   L(return_vec_2_3_end) code as the fall-through and jump back
+	   to the loop on the mismatch comparison.  */
+	subq	$-(VEC_SIZE * 4), %rdi
+	subq	$-(VEC_SIZE * 4), %rsi
+	addl	$(PAGE_SIZE - VEC_SIZE * 8), %eax
+# ifdef USE_AS_STRNCMP
+	subq	$(VEC_SIZE * 4), %rdx
+	ja	L(loop_skip_page_cross_check)
+L(ret_zero_in_loop_page_cross):
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	VZEROUPPER_RETURN
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rsi, %rcx), %edi
-	cmpl	(VEC_SIZE * 2)(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rax, %rcx), %eax
-	movzbl	(VEC_SIZE * 2)(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	jmp	L(loop_skip_page_cross_check)
 # endif
-	VZEROUPPER_RETURN
 
+
+	.p2align 4,, 10
+L(return_vec_page_cross_0):
+	addl	$-VEC_SIZE, %eax
+L(return_vec_page_cross_1):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_STRNCMP
-L(string_nbyte_offset_check):
-	leaq	(VEC_SIZE * 4)(%r10), %r10
-	cmpq	%r10, %r11
-	jbe	L(zero)
-	jmp	L(back_to_loop)
+	leal	-VEC_SIZE(%rax, %rcx), %ecx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero_in_loop_page_cross)
+# else
+	addl	%eax, %ecx
 # endif
 
-	.p2align 4
-L(cross_page_loop):
-	/* Check one byte/dword at a time.  */
 # ifdef USE_AS_WCSCMP
-	cmpl	%ecx, %eax
+	movl	VEC_OFFSET(%rdi, %rcx), %edx
+	xorl	%eax, %eax
+	cmpl	VEC_OFFSET(%rsi, %rcx), %edx
+	je	L(ret9)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
+	movzbl	VEC_OFFSET(%rdi, %rcx), %eax
+	movzbl	VEC_OFFSET(%rsi, %rcx), %ecx
 	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
-	jne	L(different)
-	addl	$SIZE_OF_CHAR, %edx
-	cmpl	$(VEC_SIZE * 4), %edx
-	je	L(main_loop_header)
-# ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
+L(ret9):
+	VZEROUPPER_RETURN
+
+
+	.p2align 4,, 10
+L(page_cross):
+# ifndef USE_AS_STRNCMP
+	/* If both are VEC aligned we don't need any special logic here.
+	   Only valid for strcmp where stop condition is guaranteed to be
+	   reachable by just reading memory.  */
+	testl	$((VEC_SIZE - 1) << 20), %eax
+	jz	L(no_page_cross)
 # endif
+
+	movl	%edi, %eax
+	movl	%esi, %ecx
+	andl	$(PAGE_SIZE - 1), %eax
+	andl	$(PAGE_SIZE - 1), %ecx
+
+	xorl	%OFFSET_REG, %OFFSET_REG
+
+	/* Check which is closer to page cross, s1 or s2.  */
+	cmpl	%eax, %ecx
+	jg	L(page_cross_s2)
+
+	/* The previous page cross check has false positives. Check for
+	   true positive as page cross logic is very expensive.  */
+	subl	$(PAGE_SIZE - VEC_SIZE * 4), %eax
+	jbe	L(no_page_cross)
+
+	/* Set r8 to not interfere with normal return value (rdi and rsi
+	   did not swap).  */
 # ifdef USE_AS_WCSCMP
-	movl	(%rdi, %rdx), %eax
-	movl	(%rsi, %rdx), %ecx
+	/* Any non-zero positive value that doesn't interfere with 0x1.
+	 */
+	movl	$2, %r8d
 # else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %ecx
+	xorl	%r8d, %r8d
 # endif
-	/* Check null char.  */
-	testl	%eax, %eax
-	jne	L(cross_page_loop)
-	/* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
-	   comparisons.  */
-	subl	%ecx, %eax
-# ifndef USE_AS_WCSCMP
-L(different):
+
+	/* Check if less than 1x VEC till page cross.  */
+	subl	$(VEC_SIZE * 3), %eax
+	jg	L(less_1x_vec_till_page)
+
+	/* If more than 1x VEC till page cross, loop through safely
+	   loadable memory until within 1x VEC of page cross.  */
+
+	.p2align 4,, 10
+L(page_cross_loop):
+
+	VMOVU	(%rdi, %OFFSET_REG64), %ymm0
+	VPCMPEQ	(%rsi, %OFFSET_REG64), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+
+	jnz	L(check_ret_vec_page_cross)
+	addl	$VEC_SIZE, %OFFSET_REG
+# ifdef USE_AS_STRNCMP
+	cmpq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross)
 # endif
-	VZEROUPPER_RETURN
+	addl	$VEC_SIZE, %eax
+	jl	L(page_cross_loop)
+
+	subl	%eax, %OFFSET_REG
+	/* OFFSET_REG has distance to page cross - VEC_SIZE. Guaranteed
+	   to not cross page so is safe to load. Since we have already
+	   loaded at least 1 VEC from rsi it is also guaranteed to be safe.
+	 */
+
+	VMOVU	(%rdi, %OFFSET_REG64), %ymm0
+	VPCMPEQ	(%rsi, %OFFSET_REG64), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+
+# ifdef USE_AS_STRNCMP
+	leal	VEC_SIZE(%OFFSET_REG64), %eax
+	cmpq	%rax, %rdx
+	jbe	L(check_ret_vec_page_cross2)
+	addq	%rdi, %rdx
+# endif
+	incl	%ecx
+	jz	L(prepare_loop_no_len)
 
+	.p2align 4,, 4
+L(ret_vec_page_cross):
+# ifndef USE_AS_STRNCMP
+L(check_ret_vec_page_cross):
+# endif
+	tzcntl	%ecx, %ecx
+	addl	%OFFSET_REG, %ecx
+L(ret_vec_page_cross_cont):
 # ifdef USE_AS_WCSCMP
-	.p2align 4
-L(different):
-	/* Use movl to avoid modifying EFLAGS.  */
-	movl	$0, %eax
+	movl	(%rdi, %rcx), %edx
+	xorl	%eax, %eax
+	cmpl	(%rsi, %rcx), %edx
+	je	L(ret12)
 	setl	%al
 	negl	%eax
-	orl	$1, %eax
-	VZEROUPPER_RETURN
+	xorl	%r8d, %eax
+# else
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret12):
+	VZEROUPPER_RETURN
 
 # ifdef USE_AS_STRNCMP
-	.p2align 4
-L(zero):
+	.p2align 4,, 10
+L(check_ret_vec_page_cross2):
+	incl	%ecx
+L(check_ret_vec_page_cross):
+	tzcntl	%ecx, %ecx
+	addl	%OFFSET_REG, %ecx
+	cmpq	%rcx, %rdx
+	ja	L(ret_vec_page_cross_cont)
+	.p2align 4,, 2
+L(ret_zero_page_cross):
 	xorl	%eax, %eax
 	VZEROUPPER_RETURN
+# endif
 
-	.p2align 4
-L(char0):
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(%rdi), %ecx
-	cmpl	(%rsi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rsi), %ecx
-	movzbl	(%rdi), %eax
-	subl	%ecx, %eax
-#  endif
-	VZEROUPPER_RETURN
+	.p2align 4,, 4
+L(page_cross_s2):
+	/* Ensure this is a true page cross.  */
+	subl	$(PAGE_SIZE - VEC_SIZE * 4), %ecx
+	jbe	L(no_page_cross)
+
+
+	movl	%ecx, %eax
+	movq	%rdi, %rcx
+	movq	%rsi, %rdi
+	movq	%rcx, %rsi
+
+	/* Set r8 to negate the return value as rdi and rsi were swapped.  */
+# ifdef USE_AS_WCSCMP
+	movl	$-4, %r8d
+# else
+	movl	$-1, %r8d
 # endif
+	xorl	%OFFSET_REG, %OFFSET_REG
 
-	.p2align 4
-L(last_vector):
-	addq	%rdx, %rdi
-	addq	%rdx, %rsi
+	/* Check if more than 1x VEC till page cross.  */
+	subl	$(VEC_SIZE * 3), %eax
+	jle	L(page_cross_loop)
+
+	.p2align 4,, 6
+L(less_1x_vec_till_page):
+	/* Find largest load size we can use.  */
+	cmpl	$16, %eax
+	ja	L(less_16_till_page)
+
+	VMOVU	(%rdi), %xmm0
+	VPCMPEQ	(%rsi), %xmm0, %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	incw	%cx
+	jnz	L(check_ret_vec_page_cross)
+	movl	$16, %OFFSET_REG
 # ifdef USE_AS_STRNCMP
-	subq	%rdx, %r11
+	cmpq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subl	%eax, %OFFSET_REG
+# else
+	/* Explicit check for 16 byte alignment.  */
+	subl	%eax, %OFFSET_REG
+	jz	L(prepare_loop)
 # endif
-	tzcntl	%ecx, %edx
+
+	VMOVU	(%rdi, %OFFSET_REG64), %xmm0
+	VPCMPEQ	(%rsi, %OFFSET_REG64), %xmm0, %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	incw	%cx
+	jnz	L(check_ret_vec_page_cross)
+
 # ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	addl	$16, %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subq	$-(VEC_SIZE * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+# else
+	leaq	(16 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	(16 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
 # endif
-# ifdef USE_AS_WCSCMP
+	jmp	L(prepare_loop_aligned)
+
+# ifdef USE_AS_STRNCMP
+	.p2align 4,, 2
+L(ret_zero_page_cross_slow_case0):
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-# else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	ret
 # endif
-	VZEROUPPER_RETURN
 
-	/* Comparing on page boundary region requires special treatment:
-	   It must done one vector at the time, starting with the wider
-	   ymm vector if possible, if not, with xmm. If fetching 16 bytes
-	   (xmm) still passes the boundary, byte comparison must be done.
-	 */
-	.p2align 4
-L(cross_page):
-	/* Try one ymm vector at a time.  */
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
-	jg	L(cross_page_1_vector)
-L(loop_1_vector):
-	vmovdqu	(%rdi, %rdx), %ymm1
-	VPCMPEQ	(%rsi, %rdx), %ymm1, %ymm0
-	VPMINU	%ymm1, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-	vpmovmskb %ymm0, %ecx
-	testl	%ecx, %ecx
-	jne	L(last_vector)
 
-	addl	$VEC_SIZE, %edx
+	.p2align 4,, 10
+L(less_16_till_page):
+	/* Find largest load size we can use.  */
+	cmpl	$24, %eax
+	ja	L(less_8_till_page)
 
-	addl	$VEC_SIZE, %eax
-# ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
-# endif
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
-	jle	L(loop_1_vector)
-L(cross_page_1_vector):
-	/* Less than 32 bytes to check, try one xmm vector.  */
-	cmpl	$(PAGE_SIZE - 16), %eax
-	jg	L(cross_page_1_xmm)
-	vmovdqu	(%rdi, %rdx), %xmm1
-	VPCMPEQ	(%rsi, %rdx), %xmm1, %xmm0
-	VPMINU	%xmm1, %xmm0, %xmm0
-	VPCMPEQ	%xmm7, %xmm0, %xmm0
-	vpmovmskb %xmm0, %ecx
-	testl	%ecx, %ecx
-	jne	L(last_vector)
+	vmovq	(%rdi), %xmm0
+	vmovq	(%rsi), %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	VPCMPEQ	%xmm1, %xmm0, %xmm1
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	incb	%cl
+	jnz	L(check_ret_vec_page_cross)
 
-	addl	$16, %edx
-# ifndef USE_AS_WCSCMP
-	addl	$16, %eax
+
+# ifdef USE_AS_STRNCMP
+	cmpq	$8, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
 # endif
+	movl	$24, %OFFSET_REG
+	/* Explicit check for 16 byte alignment.  */
+	subl	%eax, %OFFSET_REG
+
+
+
+	vmovq	(%rdi, %OFFSET_REG64), %xmm0
+	vmovq	(%rsi, %OFFSET_REG64), %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	VPCMPEQ	%xmm1, %xmm0, %xmm1
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	incb	%cl
+	jnz	L(check_ret_vec_page_cross)
+
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
-# endif
-
-L(cross_page_1_xmm):
-# ifndef USE_AS_WCSCMP
-	/* Less than 16 bytes to check, try 8 byte vector.  NB: No need
-	   for wcscmp nor wcsncmp since wide char is 4 bytes.   */
-	cmpl	$(PAGE_SIZE - 8), %eax
-	jg	L(cross_page_8bytes)
-	vmovq	(%rdi, %rdx), %xmm1
-	vmovq	(%rsi, %rdx), %xmm0
-	VPCMPEQ	%xmm0, %xmm1, %xmm0
-	VPMINU	%xmm1, %xmm0, %xmm0
-	VPCMPEQ	%xmm7, %xmm0, %xmm0
-	vpmovmskb %xmm0, %ecx
-	/* Only last 8 bits are valid.  */
-	andl	$0xff, %ecx
-	testl	%ecx, %ecx
-	jne	L(last_vector)
+	addl	$8, %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subq	$-(VEC_SIZE * 4), %rdx
 
-	addl	$8, %edx
-	addl	$8, %eax
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+# else
+	leaq	(8 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	(8 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+# endif
+	jmp	L(prepare_loop_aligned)
+
+
+	.p2align 4,, 10
+L(less_8_till_page):
+# ifdef USE_AS_WCSCMP
+	/* If using wchar then this is the only check before we reach
+	   the page boundary.  */
+	movl	(%rdi), %eax
+	movl	(%rsi), %ecx
+	cmpl	%ecx, %eax
+	jnz	L(ret_less_8_wcs)
 #  ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	addq	%rdi, %rdx
+	/* We already checked for len <= 1 so cannot hit that case here.
+	 */
 #  endif
+	testl	%eax, %eax
+	jnz	L(prepare_loop_no_len)
+	ret
 
-L(cross_page_8bytes):
-	/* Less than 8 bytes to check, try 4 byte vector.  */
-	cmpl	$(PAGE_SIZE - 4), %eax
-	jg	L(cross_page_4bytes)
-	vmovd	(%rdi, %rdx), %xmm1
-	vmovd	(%rsi, %rdx), %xmm0
-	VPCMPEQ	%xmm0, %xmm1, %xmm0
-	VPMINU	%xmm1, %xmm0, %xmm0
-	VPCMPEQ	%xmm7, %xmm0, %xmm0
-	vpmovmskb %xmm0, %ecx
-	/* Only last 4 bits are valid.  */
-	andl	$0xf, %ecx
-	testl	%ecx, %ecx
-	jne	L(last_vector)
+	.p2align 4,, 8
+L(ret_less_8_wcs):
+	setl	%OFFSET_REG8
+	negl	%OFFSET_REG
+	movl	%OFFSET_REG, %eax
+	xorl	%r8d, %eax
+	ret
+
+# else
+
+	/* Find largest load size we can use.  */
+	cmpl	$28, %eax
+	ja	L(less_4_till_page)
+
+	vmovd	(%rdi), %xmm0
+	vmovd	(%rsi), %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	VPCMPEQ	%xmm1, %xmm0, %xmm1
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	subl	$0xf, %ecx
+	jnz	L(check_ret_vec_page_cross)
 
-	addl	$4, %edx
 #  ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$4, %rdx
+	jbe	L(ret_zero_page_cross_slow_case1)
 #  endif
+	movl	$28, %OFFSET_REG
+	/* Explicit check for 16 byte alignment.  */
+	subl	%eax, %OFFSET_REG
 
-L(cross_page_4bytes):
-# endif
-	/* Less than 4 bytes to check, try one byte/dword at a time.  */
-# ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
-# endif
-# ifdef USE_AS_WCSCMP
-	movl	(%rdi, %rdx), %eax
-	movl	(%rsi, %rdx), %ecx
-# else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %ecx
-# endif
-	testl	%eax, %eax
-	jne	L(cross_page_loop)
+
+
+	vmovd	(%rdi, %OFFSET_REG64), %xmm0
+	vmovd	(%rsi, %OFFSET_REG64), %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	VPCMPEQ	%xmm1, %xmm0, %xmm1
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	subl	$0xf, %ecx
+	jnz	L(check_ret_vec_page_cross)
+
+#  ifdef USE_AS_STRNCMP
+	addl	$4, %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case1)
+	subq	$-(VEC_SIZE * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+#  else
+	leaq	(4 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	(4 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+#  endif
+	jmp	L(prepare_loop_aligned)
+
+#  ifdef USE_AS_STRNCMP
+	.p2align 4,, 2
+L(ret_zero_page_cross_slow_case1):
+	xorl	%eax, %eax
+	ret
+#  endif
+
+	.p2align 4,, 10
+L(less_4_till_page):
+	subq	%rdi, %rsi
+	/* Extremely slow byte comparison loop.  */
+L(less_4_loop):
+	movzbl	(%rdi), %eax
+	movzbl	(%rsi, %rdi), %ecx
 	subl	%ecx, %eax
-	VZEROUPPER_RETURN
-END (STRCMP)
+	jnz	L(ret_less_4_loop)
+	testl	%ecx, %ecx
+	jz	L(ret_zero_4_loop)
+#  ifdef USE_AS_STRNCMP
+	decq	%rdx
+	jz	L(ret_zero_4_loop)
+#  endif
+	incq	%rdi
+	/* End condition is reaching the page boundary (rdi is aligned).  */
+	testl	$31, %edi
+	jnz	L(less_4_loop)
+	leaq	-(VEC_SIZE * 4)(%rdi, %rsi), %rsi
+	addq	$-(VEC_SIZE * 4), %rdi
+#  ifdef USE_AS_STRNCMP
+	subq	$-(VEC_SIZE * 4), %rdx
+#  endif
+	jmp	L(prepare_loop_aligned)
+
+L(ret_zero_4_loop):
+	xorl	%eax, %eax
+	ret
+L(ret_less_4_loop):
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
+	ret
+# endif
+END(STRCMP)
 #endif
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 6/7] x86: Optimize strcmp-evex.S
  2022-01-10  0:27 ` [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S " Noah Goldstein
                     ` (3 preceding siblings ...)
  2022-01-10  0:27   ` [PATCH v2 5/7] x86: Optimize strcmp-avx2.S Noah Goldstein
@ 2022-01-10  0:27   ` Noah Goldstein
  2022-01-10  0:41     ` H.J. Lu
  2022-01-10  0:27   ` [PATCH v2 7/7] benchtests: Add more coverage for strcmp and strncmp benchmarks Noah Goldstein
  2022-01-10  0:34   ` [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755] H.J. Lu
  6 siblings, 1 reply; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10  0:27 UTC (permalink / raw)
  To: libc-alpha

Optimizations are primarily to the loop logic and how the page cross
logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards. This only affects particular
cases, however, and is generally made up for by more than 10x
improvements on the transition from the page cross -> loop case.
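
To illustrate the kind of entry check this refers to, here is a minimal
C sketch (the name may_cross_page and its shape are illustrative only,
not part of the patch); it mirrors the `orl`/`sall $20`/`cmpl` sequence
used before the 4x VEC loads:

    #include <stdint.h>

    #define PAGE_SIZE 4096
    #define VEC_SIZE  32

    /* Conservative check: OR the two pointers and test the page offset.
       It can report false positives (the OR can exceed the limit even
       when neither pointer does), which is why the page cross path
       re-tests the condition before doing the expensive work.  */
    static int
    may_cross_page (const char *s1, const char *s2)
    {
      uintptr_t off = ((uintptr_t) s1 | (uintptr_t) s2) & (PAGE_SIZE - 1);
      return off > PAGE_SIZE - VEC_SIZE * 4;
    }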

The non-page cross cases are also nearly universally improved.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 sysdeps/x86_64/multiarch/strcmp-evex.S | 1712 +++++++++++++-----------
 1 file changed, 919 insertions(+), 793 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
index 0cd939d5af..e5070f3d53 100644
--- a/sysdeps/x86_64/multiarch/strcmp-evex.S
+++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
@@ -26,54 +26,69 @@
 
 # define PAGE_SIZE	4096
 
-/* VEC_SIZE = Number of bytes in a ymm register */
+	/* VEC_SIZE = Number of bytes in a ymm register.  */
 # define VEC_SIZE	32
+# define CHAR_PER_VEC	(VEC_SIZE	/	SIZE_OF_CHAR)
 
-/* Shift for dividing by (VEC_SIZE * 4).  */
-# define DIVIDE_BY_VEC_4_SHIFT	7
-# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
-#  error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
-# endif
-
-# define VMOVU		vmovdqu64
-# define VMOVA		vmovdqa64
+# define VMOVU	vmovdqu64
+# define VMOVA	vmovdqa64
 
 # ifdef USE_AS_WCSCMP
-/* Compare packed dwords.  */
-#  define VPCMP		vpcmpd
+#  define TESTEQ	subl	$0xff,
+	/* Compare packed dwords.  */
+#  define VPCMP	vpcmpd
 #  define VPMINU	vpminud
 #  define VPTESTM	vptestmd
-#  define SHIFT_REG32	r8d
-#  define SHIFT_REG64	r8
-/* 1 dword char == 4 bytes.  */
+	/* 1 dword char == 4 bytes.  */
 #  define SIZE_OF_CHAR	4
 # else
-/* Compare packed bytes.  */
-#  define VPCMP		vpcmpb
+#  define TESTEQ	incl
+	/* Compare packed bytes.  */
+#  define VPCMP	vpcmpb
 #  define VPMINU	vpminub
 #  define VPTESTM	vptestmb
-#  define SHIFT_REG32	ecx
-#  define SHIFT_REG64	rcx
-/* 1 byte char == 1 byte.  */
+	/* 1 byte char == 1 byte.  */
 #  define SIZE_OF_CHAR	1
 # endif
 
+# ifdef USE_AS_STRNCMP
+#  define LOOP_REG	r9d
+#  define LOOP_REG64	r9
+
+#  define OFFSET_REG8	r9b
+#  define OFFSET_REG	r9d
+#  define OFFSET_REG64	r9
+# else
+#  define LOOP_REG	edx
+#  define LOOP_REG64	rdx
+
+#  define OFFSET_REG8	dl
+#  define OFFSET_REG	edx
+#  define OFFSET_REG64	rdx
+# endif
+
+# if defined USE_AS_STRNCMP || defined USE_AS_WCSCMP
+#  define VEC_OFFSET	0
+# else
+#  define VEC_OFFSET	(-VEC_SIZE)
+# endif
+
 # define XMMZERO	xmm16
-# define XMM0		xmm17
-# define XMM1		xmm18
+# define XMM0	xmm17
+# define XMM1	xmm18
 
 # define YMMZERO	ymm16
-# define YMM0		ymm17
-# define YMM1		ymm18
-# define YMM2		ymm19
-# define YMM3		ymm20
-# define YMM4		ymm21
-# define YMM5		ymm22
-# define YMM6		ymm23
-# define YMM7		ymm24
-# define YMM8		ymm25
-# define YMM9		ymm26
-# define YMM10		ymm27
+# define YMM0	ymm17
+# define YMM1	ymm18
+# define YMM2	ymm19
+# define YMM3	ymm20
+# define YMM4	ymm21
+# define YMM5	ymm22
+# define YMM6	ymm23
+# define YMM7	ymm24
+# define YMM8	ymm25
+# define YMM9	ymm26
+# define YMM10	ymm27
 
 /* Warning!
            wcscmp/wcsncmp have to use SIGNED comparison for elements.
@@ -96,985 +111,1096 @@
    the maximum offset is reached before a difference is found, zero is
    returned.  */
 
-	.section .text.evex,"ax",@progbits
-ENTRY (STRCMP)
+	.section .text.evex, "ax", @progbits
+ENTRY(STRCMP)
 # ifdef USE_AS_STRNCMP
-	/* Check for simple cases (0 or 1) in offset.  */
-	cmp	$1, %RDX_LP
-	je	L(char0)
-	jb	L(zero)
-#  ifdef USE_AS_WCSCMP
-#  ifndef __ILP32__
-	movq	%rdx, %rcx
-	/* Check if length could overflow when multiplied by
-	   sizeof(wchar_t). Checking top 8 bits will cover all potential
-	   overflow cases as well as redirect cases where its impossible to
-	   length to bound a valid memory region. In these cases just use
-	   'wcscmp'.  */
-	shrq	$56, %rcx
-	jnz	__wcscmp_evex
-#  endif
-	/* Convert units: from wide to byte char.  */
-	shl	$2, %RDX_LP
+#  ifdef __ILP32__
+	/* Clear the upper 32 bits.  */
+	movl	%edx, %edx
 #  endif
-	/* Register %r11 tracks the maximum offset.  */
-	mov	%RDX_LP, %R11_LP
+	cmp	$1, %RDX_LP
+	/* Signed comparison intentional. We use this branch to also
+	   test cases where length >= 2^63. These very large sizes can be
+	   handled with strcmp as there is no way for that length to
+	   actually bound the buffer.  */
+	jle	L(one_or_less)
 # endif
 	movl	%edi, %eax
-	xorl	%edx, %edx
-	/* Make %XMMZERO (%YMMZERO) all zeros in this function.  */
-	vpxorq	%XMMZERO, %XMMZERO, %XMMZERO
 	orl	%esi, %eax
-	andl	$(PAGE_SIZE - 1), %eax
-	cmpl	$(PAGE_SIZE - (VEC_SIZE * 4)), %eax
-	jg	L(cross_page)
-	/* Start comparing 4 vectors.  */
+	/* Shift out the bits irrelevant to page boundary ([63:12]).  */
+	sall	$20, %eax
+	/* Check if s1 or s2 may cross a page in next 4x VEC loads.  */
+	cmpl	$((PAGE_SIZE -(VEC_SIZE * 4)) << 20), %eax
+	ja	L(page_cross)
+
+L(no_page_cross):
+	/* Safe to compare 4x vectors.  */
 	VMOVU	(%rdi), %YMM0
-
-	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
 	VPTESTM	%YMM0, %YMM0, %k2
-
 	/* Each bit cleared in K1 represents a mismatch or a null CHAR
 	   in YMM0 and 32 bytes at (%rsi).  */
 	VPCMP	$0, (%rsi), %YMM0, %k1{%k2}
-
 	kmovd	%k1, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	je	L(next_3_vectors)
-	tzcntl	%ecx, %edx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
-# endif
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx) is after the maximum
-	   offset (%r11).   */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$CHAR_PER_VEC, %rdx
+	jbe	L(vec_0_test_len)
 # endif
+
+	/* TESTEQ is `incl` for strcmp/strncmp and `subl $0xff` for
+	   wcscmp/wcsncmp.  */
+
+	/* All 1s represents all equals. TESTEQ will overflow to zero in
+	   the all-equals case. Otherwise the 1s will carry until the
+	   position of the first mismatch.  */
+	TESTEQ	%ecx
+	jz	L(more_3x_vec)
+
+	.p2align 4,, 4
+L(return_vec_0):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_WCSCMP
+	movl	(%rdi, %rcx, SIZE_OF_CHAR), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	je	L(return)
-L(wcscmp_return):
+	cmpl	(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret0)
 	setl	%al
 	negl	%eax
 	orl	$1, %eax
-L(return):
 # else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 # endif
+L(ret0):
 	ret
 
-L(return_vec_size):
-	tzcntl	%ecx, %edx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
-# endif
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after
-	   the maximum offset (%r11).  */
-	addq	$VEC_SIZE, %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+	.p2align 4,, 4
+L(vec_0_test_len):
+	notl	%ecx
+	bzhil	%edx, %ecx, %eax
+	jnz	L(return_vec_0)
+	/* Align if will cross fetch block.  */
+	.p2align 4,, 2
+L(ret_zero):
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
-# else
+	ret
+
+	.p2align 4,, 5
+L(one_or_less):
+	jb	L(ret_zero)
 #  ifdef USE_AS_WCSCMP
+	/* 'nbe' covers the case where length is negative (large
+	   unsigned).  */
+	jnbe	__wcscmp_evex
+	movl	(%rdi), %edx
 	xorl	%eax, %eax
-	movl	VEC_SIZE(%rdi, %rdx), %ecx
-	cmpl	VEC_SIZE(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	cmpl	(%rsi), %edx
+	je	L(ret1)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 #  else
-	movzbl	VEC_SIZE(%rdi, %rdx), %eax
-	movzbl	VEC_SIZE(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	/* 'nbe' covers the case where length is negative (large
+	   unsigned).  */
+	jnbe	__strcmp_evex
+	movzbl	(%rdi), %eax
+	movzbl	(%rsi), %ecx
+	subl	%ecx, %eax
 #  endif
-# endif
+L(ret1):
 	ret
+# endif
 
-L(return_2_vec_size):
-	tzcntl	%ecx, %edx
+	.p2align 4,, 10
+L(return_vec_1):
+	tzcntl	%ecx, %ecx
+# ifdef USE_AS_STRNCMP
+	/* rdx must be > CHAR_PER_VEC so it's safe to subtract without
+	   worrying about underflow.  */
+	addq	$-CHAR_PER_VEC, %rdx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero)
+# endif
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
+	movl	VEC_SIZE(%rdi, %rcx, SIZE_OF_CHAR), %edx
+	xorl	%eax, %eax
+	cmpl	VEC_SIZE(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret2)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
+# else
+	movzbl	VEC_SIZE(%rdi, %rcx), %eax
+	movzbl	VEC_SIZE(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 # endif
+L(ret2):
+	ret
+
+	.p2align 4,, 10
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is
-	   after the maximum offset (%r11).  */
-	addq	$(VEC_SIZE * 2), %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+L(return_vec_3):
+#  if CHAR_PER_VEC <= 16
+	sall	$CHAR_PER_VEC, %ecx
 #  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	salq	$CHAR_PER_VEC, %rcx
 #  endif
+# endif
+L(return_vec_2):
+# if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP)
+	tzcntl	%ecx, %ecx
 # else
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rdi, %rdx), %ecx
-	cmpl	(VEC_SIZE * 2)(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rdi, %rdx), %eax
-	movzbl	(VEC_SIZE * 2)(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	tzcntq	%rcx, %rcx
 # endif
-	ret
 
-L(return_3_vec_size):
-	tzcntl	%ecx, %edx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
-# endif
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is
-	   after the maximum offset (%r11).  */
-	addq	$(VEC_SIZE * 3), %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 2)(%rdi, %rcx, SIZE_OF_CHAR), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(VEC_SIZE * 2)(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret3)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 # else
+	movzbl	(VEC_SIZE * 2)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 2)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+# endif
+L(ret3):
+	ret
+
+# ifndef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_vec_3):
+	tzcntl	%ecx, %ecx
 #  ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 3)(%rdi, %rcx, SIZE_OF_CHAR), %edx
 	xorl	%eax, %eax
-	movl	(VEC_SIZE * 3)(%rdi, %rdx), %ecx
-	cmpl	(VEC_SIZE * 3)(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	cmpl	(VEC_SIZE * 3)(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret4)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 #  else
-	movzbl	(VEC_SIZE * 3)(%rdi, %rdx), %eax
-	movzbl	(VEC_SIZE * 3)(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	movzbl	(VEC_SIZE * 3)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 3)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 #  endif
-# endif
+L(ret4):
 	ret
+# endif
 
-	.p2align 4
-L(next_3_vectors):
-	VMOVU	VEC_SIZE(%rdi), %YMM0
-	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
+	/* 32 byte align here ensures the main loop is ideally aligned
+	   for DSB.  */
+	.p2align 5
+L(more_3x_vec):
+	/* Safe to compare 4x vectors.  */
+	VMOVU	(VEC_SIZE)(%rdi), %YMM0
 	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at VEC_SIZE(%rsi).  */
-	VPCMP	$0, VEC_SIZE(%rsi), %YMM0, %k1{%k2}
+	VPCMP	$0, (VEC_SIZE)(%rsi), %YMM0, %k1{%k2}
 	kmovd	%k1, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_1)
+
+# ifdef USE_AS_STRNCMP
+	subq	$(CHAR_PER_VEC * 2), %rdx
+	jbe	L(ret_zero)
 # endif
-	jne	L(return_vec_size)
 
 	VMOVU	(VEC_SIZE * 2)(%rdi), %YMM0
-	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
 	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rsi).  */
 	VPCMP	$0, (VEC_SIZE * 2)(%rsi), %YMM0, %k1{%k2}
 	kmovd	%k1, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	jne	L(return_2_vec_size)
+	TESTEQ	%ecx
+	jnz	L(return_vec_2)
 
 	VMOVU	(VEC_SIZE * 3)(%rdi), %YMM0
-	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
 	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rsi).  */
 	VPCMP	$0, (VEC_SIZE * 3)(%rsi), %YMM0, %k1{%k2}
 	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_3)
+
+# ifdef USE_AS_STRNCMP
+	cmpq	$(CHAR_PER_VEC * 2), %rdx
+	jbe	L(ret_zero)
+# endif
+
+
 # ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
+	/* Any non-zero positive value that doesn't interfere with 0x1.
+	 */
+	movl	$2, %r8d
+
 # else
-	incl	%ecx
+	xorl	%r8d, %r8d
 # endif
-	jne	L(return_3_vec_size)
-L(main_loop_header):
-	leaq	(VEC_SIZE * 4)(%rdi), %rdx
-	movl	$PAGE_SIZE, %ecx
-	/* Align load via RAX.  */
-	andq	$-(VEC_SIZE * 4), %rdx
-	subq	%rdi, %rdx
-	leaq	(%rdi, %rdx), %rax
+
+	/* The prepare labels are various entry points from the page
+	   cross logic.  */
+L(prepare_loop):
+
 # ifdef USE_AS_STRNCMP
-	/* Starting from this point, the maximum offset, or simply the
-	   'offset', DECREASES by the same amount when base pointers are
-	   moved forward.  Return 0 when:
-	     1) On match: offset <= the matched vector index.
-	     2) On mistmach, offset is before the mistmatched index.
-	 */
-	subq	%rdx, %r11
-	jbe	L(zero)
+#  ifdef USE_AS_WCSCMP
+L(prepare_loop_no_len):
+	movl	%edi, %ecx
+	andl	$(VEC_SIZE * 4 - 1), %ecx
+	shrl	$2, %ecx
+	leaq	(CHAR_PER_VEC * 2)(%rdx, %rcx), %rdx
+#  else
+	/* Store N + (VEC_SIZE * 4) and place check at the beginning of
+	   the loop.  */
+	leaq	(VEC_SIZE * 2)(%rdi, %rdx), %rdx
+L(prepare_loop_no_len):
+#  endif
+# else
+L(prepare_loop_no_len):
 # endif
-	addq	%rsi, %rdx
-	movq	%rdx, %rsi
-	andl	$(PAGE_SIZE - 1), %esi
-	/* Number of bytes before page crossing.  */
-	subq	%rsi, %rcx
-	/* Number of VEC_SIZE * 4 blocks before page crossing.  */
-	shrq	$DIVIDE_BY_VEC_4_SHIFT, %rcx
-	/* ESI: Number of VEC_SIZE * 4 blocks before page crossing.   */
-	movl	%ecx, %esi
-	jmp	L(loop_start)
 
+	/* Align s1 and adjust s2 accordingly.  */
+	subq	%rdi, %rsi
+	andq	$-(VEC_SIZE * 4), %rdi
+L(prepare_loop_readj):
+	addq	%rdi, %rsi
+# if (defined USE_AS_STRNCMP) && !(defined USE_AS_WCSCMP)
+	subq	%rdi, %rdx
+# endif
+
+L(prepare_loop_aligned):
+	/* eax stores distance from rsi to next page cross. These cases
+	   need to be handled specially as the 4x loop could potentially
+	   read memory past the length of s1 or s2 and across a page
+	   boundary.  */
+	movl	$-(VEC_SIZE * 4), %eax
+	subl	%esi, %eax
+	andl	$(PAGE_SIZE - 1), %eax
+
+	vpxorq	%YMMZERO, %YMMZERO, %YMMZERO
+
+	/* Loop 4x comparisons at a time.  */
 	.p2align 4
 L(loop):
+
+	/* End condition for strncmp.  */
 # ifdef USE_AS_STRNCMP
-	/* Base pointers are moved forward by 4 * VEC_SIZE.  Decrease
-	   the maximum offset (%r11) by the same amount.  */
-	subq	$(VEC_SIZE * 4), %r11
-	jbe	L(zero)
+	subq	$(CHAR_PER_VEC * 4), %rdx
+	jbe	L(ret_zero)
 # endif
-	addq	$(VEC_SIZE * 4), %rax
-	addq	$(VEC_SIZE * 4), %rdx
-L(loop_start):
-	testl	%esi, %esi
-	leal	-1(%esi), %esi
-	je	L(loop_cross_page)
-L(back_to_loop):
-	/* Main loop, comparing 4 vectors are a time.  */
-	VMOVA	(%rax), %YMM0
-	VMOVA	VEC_SIZE(%rax), %YMM2
-	VMOVA	(VEC_SIZE * 2)(%rax), %YMM4
-	VMOVA	(VEC_SIZE * 3)(%rax), %YMM6
+
+	subq	$-(VEC_SIZE * 4), %rdi
+	subq	$-(VEC_SIZE * 4), %rsi
+
+	/* Check if rsi loads will cross a page boundary.  */
+	addl	$-(VEC_SIZE * 4), %eax
+	jnb	L(page_cross_during_loop)
+
+	/* Loop entry after handling page cross during loop.  */
+L(loop_skip_page_cross_check):
+	VMOVA	(VEC_SIZE * 0)(%rdi), %YMM0
+	VMOVA	(VEC_SIZE * 1)(%rdi), %YMM2
+	VMOVA	(VEC_SIZE * 2)(%rdi), %YMM4
+	VMOVA	(VEC_SIZE * 3)(%rdi), %YMM6
 
 	VPMINU	%YMM0, %YMM2, %YMM8
 	VPMINU	%YMM4, %YMM6, %YMM9
 
-	/* A zero CHAR in YMM8 means that there is a null CHAR.  */
-	VPMINU	%YMM8, %YMM9, %YMM8
+	/* A zero CHAR in YMM9 means that there is a null CHAR.  */
+	VPMINU	%YMM8, %YMM9, %YMM9
 
 	/* Each bit set in K1 represents a non-null CHAR in YMM8.  */
-	VPTESTM	%YMM8, %YMM8, %k1
+	VPTESTM	%YMM9, %YMM9, %k1
 
-	/* (YMM ^ YMM): A non-zero CHAR represents a mismatch.  */
-	vpxorq	(%rdx), %YMM0, %YMM1
-	vpxorq	VEC_SIZE(%rdx), %YMM2, %YMM3
-	vpxorq	(VEC_SIZE * 2)(%rdx), %YMM4, %YMM5
-	vpxorq	(VEC_SIZE * 3)(%rdx), %YMM6, %YMM7
+	vpxorq	(VEC_SIZE * 0)(%rsi), %YMM0, %YMM1
+	vpxorq	(VEC_SIZE * 1)(%rsi), %YMM2, %YMM3
+	vpxorq	(VEC_SIZE * 2)(%rsi), %YMM4, %YMM5
+	/* Ternary logic to xor (VEC_SIZE * 3)(%rsi) with YMM6 while
+	   oring with YMM1. Result is stored in YMM6.  */
+	vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM1, %YMM6
 
-	vporq	%YMM1, %YMM3, %YMM9
-	vporq	%YMM5, %YMM7, %YMM10
+	/* Or together YMM3, YMM5, and YMM6.  */
+	vpternlogd $0xfe, %YMM3, %YMM5, %YMM6
 
-	/* A non-zero CHAR in YMM9 represents a mismatch.  */
-	vporq	%YMM9, %YMM10, %YMM9
 
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR.  */
-	VPCMP	$0, %YMMZERO, %YMM9, %k0{%k1}
-	kmovd   %k0, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	je	 L(loop)
+	/* A non-zero CHAR in YMM6 represents a mismatch.  */
+	VPCMP	$0, %YMMZERO, %YMM6, %k0{%k1}
+	kmovd	%k0, %LOOP_REG
 
-	/* Each bit set in K1 represents a non-null CHAR in YMM0.  */
+	TESTEQ	%LOOP_REG
+	jz	L(loop)
+
+
+	/* Find which VEC has the mismatch or end of string.  */
 	VPTESTM	%YMM0, %YMM0, %k1
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR
-	   in YMM0 and (%rdx).  */
 	VPCMP	$0, %YMMZERO, %YMM1, %k0{%k1}
 	kmovd	%k0, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	je	L(test_vec)
-	tzcntl	%ecx, %ecx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %ecx
-# endif
-# ifdef USE_AS_STRNCMP
-	cmpq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# endif
-	ret
+	TESTEQ	%ecx
+	jnz	L(return_vec_0_end)
 
-	.p2align 4
-L(test_vec):
-# ifdef USE_AS_STRNCMP
-	/* The first vector matched.  Return 0 if the maximum offset
-	   (%r11) <= VEC_SIZE.  */
-	cmpq	$VEC_SIZE, %r11
-	jbe	L(zero)
-# endif
-	/* Each bit set in K1 represents a non-null CHAR in YMM2.  */
 	VPTESTM	%YMM2, %YMM2, %k1
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR
-	   in YMM2 and VEC_SIZE(%rdx).  */
 	VPCMP	$0, %YMMZERO, %YMM3, %k0{%k1}
 	kmovd	%k0, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	je	L(test_2_vec)
-	tzcntl	%ecx, %edi
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edi
-# endif
-# ifdef USE_AS_STRNCMP
-	addq	$VEC_SIZE, %rdi
-	cmpq	%rdi, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rdi), %ecx
-	cmpl	(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rdi), %eax
-	movzbl	(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
-# else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	VEC_SIZE(%rsi, %rdi), %ecx
-	cmpl	VEC_SIZE(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	VEC_SIZE(%rax, %rdi), %eax
-	movzbl	VEC_SIZE(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
-# endif
-	ret
+	TESTEQ	%ecx
+	jnz	L(return_vec_1_end)
 
-	.p2align 4
-L(test_2_vec):
+
+	/* Handle VEC 2 and 3 without branches.  */
+L(return_vec_2_3_end):
 # ifdef USE_AS_STRNCMP
-	/* The first 2 vectors matched.  Return 0 if the maximum offset
-	   (%r11) <= 2 * VEC_SIZE.  */
-	cmpq	$(VEC_SIZE * 2), %r11
-	jbe	L(zero)
+	subq	$(CHAR_PER_VEC * 2), %rdx
+	jbe	L(ret_zero_end)
 # endif
-	/* Each bit set in K1 represents a non-null CHAR in YMM4.  */
+
 	VPTESTM	%YMM4, %YMM4, %k1
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR
-	   in YMM4 and (VEC_SIZE * 2)(%rdx).  */
 	VPCMP	$0, %YMMZERO, %YMM5, %k0{%k1}
 	kmovd	%k0, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
+	TESTEQ	%ecx
+# if CHAR_PER_VEC <= 16
+	sall	$CHAR_PER_VEC, %LOOP_REG
+	orl	%ecx, %LOOP_REG
 # else
-	incl	%ecx
+	salq	$CHAR_PER_VEC, %LOOP_REG64
+	orq	%rcx, %LOOP_REG64
+# endif
+L(return_vec_3_end):
+	/* LOOP_REG contains matches for null/mismatch from the loop. If
+	   VEC 0, 1, and 2 all have no null and no mismatches then the
+	   mismatch must entirely be from VEC 3 which is fully represented
+	   by LOOP_REG.  */
+# if CHAR_PER_VEC <= 16
+	tzcntl	%LOOP_REG, %LOOP_REG
+# else
+	tzcntq	%LOOP_REG64, %LOOP_REG64
+# endif
+# ifdef USE_AS_STRNCMP
+	cmpq	%LOOP_REG64, %rdx
+	jbe	L(ret_zero_end)
 # endif
-	je	L(test_3_vec)
-	tzcntl	%ecx, %edi
+
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edi
+	movl	(VEC_SIZE * 2)(%rdi, %LOOP_REG64, SIZE_OF_CHAR), %ecx
+	xorl	%eax, %eax
+	cmpl	(VEC_SIZE * 2)(%rsi, %LOOP_REG64, SIZE_OF_CHAR), %ecx
+	je	L(ret5)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
+# else
+	movzbl	(VEC_SIZE * 2)(%rdi, %LOOP_REG64), %eax
+	movzbl	(VEC_SIZE * 2)(%rsi, %LOOP_REG64), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret5):
+	ret
+
 # ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 2), %rdi
-	cmpq	%rdi, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	.p2align 4,, 2
+L(ret_zero_end):
 	xorl	%eax, %eax
-	movl	(%rsi, %rdi), %ecx
-	cmpl	(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
+	ret
+# endif
+
+
+	/* The L(return_vec_N_end) labels differ from L(return_vec_N) in
+	   that they use the value of `r8` to negate the return value.
+	   This is because the page cross logic can swap `rdi` and `rsi`.  */
+	.p2align 4,, 10
+# ifdef USE_AS_STRNCMP
+L(return_vec_1_end):
+#  if CHAR_PER_VEC <= 16
+	sall	$CHAR_PER_VEC, %ecx
 #  else
-	movzbl	(%rax, %rdi), %eax
-	movzbl	(%rdx, %rdi), %edx
-	subl	%edx, %eax
+	salq	$CHAR_PER_VEC, %rcx
 #  endif
+# endif
+L(return_vec_0_end):
+# if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP)
+	tzcntl	%ecx, %ecx
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rsi, %rdi), %ecx
-	cmpl	(VEC_SIZE * 2)(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rax, %rdi), %eax
-	movzbl	(VEC_SIZE * 2)(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
+	tzcntq	%rcx, %rcx
 # endif
-	ret
 
-	.p2align 4
-L(test_3_vec):
 # ifdef USE_AS_STRNCMP
-	/* The first 3 vectors matched.  Return 0 if the maximum offset
-	   (%r11) <= 3 * VEC_SIZE.  */
-	cmpq	$(VEC_SIZE * 3), %r11
-	jbe	L(zero)
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero_end)
 # endif
-	/* Each bit set in K1 represents a non-null CHAR in YMM6.  */
-	VPTESTM	%YMM6, %YMM6, %k1
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR
-	   in YMM6 and (VEC_SIZE * 3)(%rdx).  */
-	VPCMP	$0, %YMMZERO, %YMM7, %k0{%k1}
-	kmovd	%k0, %ecx
+
 # ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
+	movl	(%rdi, %rcx, SIZE_OF_CHAR), %edx
+	xorl	%eax, %eax
+	cmpl	(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret6)
+	setl	%al
+	negl	%eax
+	/* This is the non-zero case for `eax` so just xorl with `r8d`
+	   to flip the sign if `rdi` and `rsi` were swapped.  */
+	xorl	%r8d, %eax
 # else
-	incl	%ecx
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	/* Flip `eax` if `rdi` and `rsi` were swapped in the page cross
+	   logic. Subtract `r8d` after the xor for the zero case.  */
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret6):
+	ret
+
+# ifndef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_vec_1_end):
 	tzcntl	%ecx, %ecx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %ecx
-# endif
-# ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 3), %rcx
-	cmpq	%rcx, %r11
-	jbe	L(zero)
 #  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	movl	VEC_SIZE(%rdi, %rcx, SIZE_OF_CHAR), %edx
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %esi
-	cmpl	(%rdx, %rcx), %esi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 3)(%rsi, %rcx), %esi
-	cmpl	(VEC_SIZE * 3)(%rdx, %rcx), %esi
-	jne	L(wcscmp_return)
+	cmpl	VEC_SIZE(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret7)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 #  else
-	movzbl	(VEC_SIZE * 3)(%rax, %rcx), %eax
-	movzbl	(VEC_SIZE * 3)(%rdx, %rcx), %edx
-	subl	%edx, %eax
+	movzbl	VEC_SIZE(%rdi, %rcx), %eax
+	movzbl	VEC_SIZE(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 #  endif
-# endif
+L(ret7):
 	ret
-
-	.p2align 4
-L(loop_cross_page):
-	xorl	%r10d, %r10d
-	movq	%rdx, %rcx
-	/* Align load via RDX.  We load the extra ECX bytes which should
-	   be ignored.  */
-	andl	$((VEC_SIZE * 4) - 1), %ecx
-	/* R10 is -RCX.  */
-	subq	%rcx, %r10
-
-	/* This works only if VEC_SIZE * 2 == 64. */
-# if (VEC_SIZE * 2) != 64
-#  error (VEC_SIZE * 2) != 64
 # endif
 
-	/* Check if the first VEC_SIZE * 2 bytes should be ignored.  */
-	cmpl	$(VEC_SIZE * 2), %ecx
-	jge	L(loop_cross_page_2_vec)
 
-	VMOVU	(%rax, %r10), %YMM2
-	VMOVU	VEC_SIZE(%rax, %r10), %YMM3
+	/* Page cross in rsi in next 4x VEC.  */
 
-	/* Each bit set in K2 represents a non-null CHAR in YMM2.  */
-	VPTESTM	%YMM2, %YMM2, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM2 and 32 bytes at (%rdx, %r10).  */
-	VPCMP	$0, (%rdx, %r10), %YMM2, %k1{%k2}
-	kmovd	%k1, %r9d
-	/* Don't use subl since it is the lower 16/32 bits of RDI
-	   below.  */
-	notl	%r9d
-# ifdef USE_AS_WCSCMP
-	/* Only last 8 bits are valid.  */
-	andl	$0xff, %r9d
-# endif
+	/* TODO: Improve logic here.  */
+	.p2align 4,, 10
+L(page_cross_during_loop):
+	/* eax contains [distance_from_page - (VEC_SIZE * 4)].  */
 
-	/* Each bit set in K4 represents a non-null CHAR in YMM3.  */
-	VPTESTM	%YMM3, %YMM3, %k4
-	/* Each bit cleared in K3 represents a mismatch or a null CHAR
-	   in YMM3 and 32 bytes at VEC_SIZE(%rdx, %r10).  */
-	VPCMP	$0, VEC_SIZE(%rdx, %r10), %YMM3, %k3{%k4}
-	kmovd	%k3, %edi
-    /* Must use notl %edi here as lower bits are for CHAR
-	   comparisons potentially out of range thus can be 0 without
-	   indicating mismatch.  */
-	notl	%edi
-# ifdef USE_AS_WCSCMP
-	/* Don't use subl since it is the upper 8 bits of EDI below.  */
-	andl	$0xff, %edi
-# endif
+	/* Optimistically rsi and rdi are both aligned, in which case we
+	   don't need any logic here.  */
+	cmpl	$-(VEC_SIZE * 4), %eax
+	/* Don't adjust eax before jumping back to the loop; we will
+	   never hit the page cross case again.  */
+	je	L(loop_skip_page_cross_check)
 
-# ifdef USE_AS_WCSCMP
-	/* NB: Each bit in EDI/R9D represents 4-byte element.  */
-	sall	$8, %edi
-	/* NB: Divide shift count by 4 since each bit in K1 represent 4
-	   bytes.  */
-	movl	%ecx, %SHIFT_REG32
-	sarl	$2, %SHIFT_REG32
-
-	/* Each bit in EDI represents a null CHAR or a mismatch.  */
-	orl	%r9d, %edi
-# else
-	salq	$32, %rdi
+	/* Check if we can safely load a VEC.  */
+	cmpl	$-(VEC_SIZE * 3), %eax
+	jle	L(less_1x_vec_till_page_cross)
 
-	/* Each bit in RDI represents a null CHAR or a mismatch.  */
-	orq	%r9, %rdi
-# endif
+	VMOVA	(%rdi), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, (%rsi), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_0_end)
+
+	/* if distance >= 2x VEC then eax > -(VEC_SIZE * 2).  */
+	cmpl	$-(VEC_SIZE * 2), %eax
+	jg	L(more_2x_vec_till_page_cross)
+
+	.p2align 4,, 4
+L(less_1x_vec_till_page_cross):
+	subl	$-(VEC_SIZE * 4), %eax
+	/* Guaranteed safe to read from rdi - VEC_SIZE here. The only
+	   concerning case is first iteration if incoming s1 was near start
+	   of a page and s2 near end. If s1 was near the start of the page
+	   we already aligned up to nearest VEC_SIZE * 4 so guaranteed safe
+	   to read back -VEC_SIZE. If rdi is truly at the start of a page
+	   here, it means the previous page (rdi - VEC_SIZE) has already
+	   been loaded earlier so must be valid.  */
+	VMOVU	-VEC_SIZE(%rdi, %rax), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, -VEC_SIZE(%rsi, %rax), %YMM0, %k1{%k2}
+
+	/* Mask of potentially valid bits. The lower bits can be from
+	   out-of-range comparisons (but are safe regarding page crosses).  */
 
-	/* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
-	shrxq	%SHIFT_REG64, %rdi, %rdi
-	testq	%rdi, %rdi
-	je	L(loop_cross_page_2_vec)
-	tzcntq	%rdi, %rcx
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %ecx
+	movl	$-1, %r10d
+	movl	%esi, %ecx
+	andl	$(VEC_SIZE - 1), %ecx
+	shrl	$2, %ecx
+	shlxl	%ecx, %r10d, %ecx
+	movzbl	%cl, %r10d
+# else
+	movl	$-1, %ecx
+	shlxl	%esi, %ecx, %r10d
 # endif
+
+	kmovd	%k1, %ecx
+	notl	%ecx
+
+
 # ifdef USE_AS_STRNCMP
-	cmpq	%rcx, %r11
-	jbe	L(zero)
 #  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
+	movl	%eax, %r11d
+	shrl	$2, %r11d
+	cmpq	%r11, %rdx
 #  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
+	cmpq	%rax, %rdx
 #  endif
+	jbe	L(return_page_cross_end_check)
+# endif
+	movl	%eax, %OFFSET_REG
+
+	/* Readjust eax before potentially returning to the loop.  */
+	addl	$(PAGE_SIZE - VEC_SIZE * 4), %eax
+
+	andl	%r10d, %ecx
+	jz	L(loop_skip_page_cross_check)
+
+	.p2align 4,, 3
+L(return_page_cross_end):
+	tzcntl	%ecx, %ecx
+
+# if (defined USE_AS_STRNCMP) || (defined USE_AS_WCSCMP)
+	leal	-VEC_SIZE(%OFFSET_REG64, %rcx, SIZE_OF_CHAR), %ecx
+L(return_page_cross_cmp_mem):
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	addl	%OFFSET_REG, %ecx
+# endif
+# ifdef USE_AS_WCSCMP
+	movl	VEC_OFFSET(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	VEC_OFFSET(%rsi, %rcx), %edx
+	je	L(ret8)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
+# else
+	movzbl	VEC_OFFSET(%rdi, %rcx), %eax
+	movzbl	VEC_OFFSET(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret8):
 	ret
 
-	.p2align 4
-L(loop_cross_page_2_vec):
-	/* The first VEC_SIZE * 2 bytes match or are ignored.  */
-	VMOVU	(VEC_SIZE * 2)(%rax, %r10), %YMM0
-	VMOVU	(VEC_SIZE * 3)(%rax, %r10), %YMM1
+# ifdef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_page_cross_end_check):
+	tzcntl	%ecx, %ecx
+	leal	-VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx
+#  ifdef USE_AS_WCSCMP
+	sall	$2, %edx
+#  endif
+	cmpl	%ecx, %edx
+	ja	L(return_page_cross_cmp_mem)
+	xorl	%eax, %eax
+	ret
+# endif
+
 
+	.p2align 4,, 10
+L(more_2x_vec_till_page_cross):
+	/* If more than 2x VEC till the page cross we will complete a
+	   full loop iteration here.  */
+
+	VMOVA	VEC_SIZE(%rdi), %YMM0
 	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rdx, %r10).  */
-	VPCMP	$0, (VEC_SIZE * 2)(%rdx, %r10), %YMM0, %k1{%k2}
-	kmovd	%k1, %r9d
-	/* Don't use subl since it is the lower 16/32 bits of RDI
-	   below.  */
-	notl	%r9d
-# ifdef USE_AS_WCSCMP
-	/* Only last 8 bits are valid.  */
-	andl	$0xff, %r9d
-# endif
+	VPCMP	$0, VEC_SIZE(%rsi), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_1_end)
 
-	VPTESTM	%YMM1, %YMM1, %k4
-	/* Each bit cleared in K3 represents a mismatch or a null CHAR
-	   in YMM1 and 32 bytes at (VEC_SIZE * 3)(%rdx, %r10).  */
-	VPCMP	$0, (VEC_SIZE * 3)(%rdx, %r10), %YMM1, %k3{%k4}
-	kmovd	%k3, %edi
-	/* Must use notl %edi here as lower bits are for CHAR
-	   comparisons potentially out of range thus can be 0 without
-	   indicating mismatch.  */
-	notl	%edi
-# ifdef USE_AS_WCSCMP
-	/* Don't use subl since it is the upper 8 bits of EDI below.  */
-	andl	$0xff, %edi
+# ifdef USE_AS_STRNCMP
+	cmpq	$(CHAR_PER_VEC * 2), %rdx
+	jbe	L(ret_zero_in_loop_page_cross)
 # endif
 
-# ifdef USE_AS_WCSCMP
-	/* NB: Each bit in EDI/R9D represents 4-byte element.  */
-	sall	$8, %edi
+	subl	$-(VEC_SIZE * 4), %eax
 
-	/* Each bit in EDI represents a null CHAR or a mismatch.  */
-	orl	%r9d, %edi
-# else
-	salq	$32, %rdi
+	/* Safe to include comparisons from lower bytes.  */
+	VMOVU	-(VEC_SIZE * 2)(%rdi, %rax), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, -(VEC_SIZE * 2)(%rsi, %rax), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_page_cross_0)
+
+	VMOVU	-(VEC_SIZE * 1)(%rdi, %rax), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, -(VEC_SIZE * 1)(%rsi, %rax), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_page_cross_1)
 
-	/* Each bit in RDI represents a null CHAR or a mismatch.  */
-	orq	%r9, %rdi
+# ifdef USE_AS_STRNCMP
+	/* Must check length here as length might preclude reading the
+	   next page.  */
+#  ifdef USE_AS_WCSCMP
+	movl	%eax, %r11d
+	shrl	$2, %r11d
+	cmpq	%r11, %rdx
+#  else
+	cmpq	%rax, %rdx
+#  endif
+	jbe	L(ret_zero_in_loop_page_cross)
 # endif
 
-	xorl	%r8d, %r8d
-	/* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
-	subl	$(VEC_SIZE * 2), %ecx
-	jle	1f
-	/* R8 has number of bytes skipped.  */
-	movl	%ecx, %r8d
-# ifdef USE_AS_WCSCMP
-	/* NB: Divide shift count by 4 since each bit in RDI represent 4
-	   bytes.  */
-	sarl	$2, %ecx
-	/* Skip ECX bytes.  */
-	shrl	%cl, %edi
+	/* Finish the loop.  */
+	VMOVA	(VEC_SIZE * 2)(%rdi), %YMM4
+	VMOVA	(VEC_SIZE * 3)(%rdi), %YMM6
+	VPMINU	%YMM4, %YMM6, %YMM9
+	VPTESTM	%YMM9, %YMM9, %k1
+
+	vpxorq	(VEC_SIZE * 2)(%rsi), %YMM4, %YMM5
+	/* YMM6 = YMM5 | ((VEC_SIZE * 3)(%rsi) ^ YMM6).  */
+	vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM5, %YMM6
+
+	VPCMP	$0, %YMMZERO, %YMM6, %k0{%k1}
+	kmovd	%k0, %LOOP_REG
+	TESTEQ	%LOOP_REG
+	jnz	L(return_vec_2_3_end)
+
+	/* Best for code size to use the unconditional jmp here. If this
+	   case is hot it would be faster to duplicate the
+	   L(return_vec_2_3_end) code as the fall-through and jump back to
+	   the loop on a mismatch comparison.  */
+	subq	$-(VEC_SIZE * 4), %rdi
+	subq	$-(VEC_SIZE * 4), %rsi
+	addl	$(PAGE_SIZE - VEC_SIZE * 8), %eax
+# ifdef USE_AS_STRNCMP
+	subq	$(CHAR_PER_VEC * 4), %rdx
+	ja	L(loop_skip_page_cross_check)
+L(ret_zero_in_loop_page_cross):
+	xorl	%eax, %eax
+	ret
 # else
-	/* Skip ECX bytes.  */
-	shrq	%cl, %rdi
+	jmp	L(loop_skip_page_cross_check)
 # endif
-1:
-	/* Before jumping back to the loop, set ESI to the number of
-	   VEC_SIZE * 4 blocks before page crossing.  */
-	movl	$(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
 
-	testq	%rdi, %rdi
-# ifdef USE_AS_STRNCMP
-	/* At this point, if %rdi value is 0, it already tested
-	   VEC_SIZE*4+%r10 byte starting from %rax. This label
-	   checks whether strncmp maximum offset reached or not.  */
-	je	L(string_nbyte_offset_check)
+
+	.p2align 4,, 10
+L(return_vec_page_cross_0):
+	addl	$-VEC_SIZE, %eax
+L(return_vec_page_cross_1):
+	tzcntl	%ecx, %ecx
+# if defined USE_AS_STRNCMP || defined USE_AS_WCSCMP
+	leal	-VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx
+#  ifdef USE_AS_STRNCMP
+#   ifdef USE_AS_WCSCMP
+	/* Must divide ecx instead of multiplying rdx due to overflow.  */
+	movl	%ecx, %eax
+	shrl	$2, %eax
+	cmpq	%rax, %rdx
+#   else
+	cmpq	%rcx, %rdx
+#   endif
+	jbe	L(ret_zero_in_loop_page_cross)
+#  endif
 # else
-	je	L(back_to_loop)
+	addl	%eax, %ecx
 # endif
-	tzcntq	%rdi, %rcx
+
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %ecx
-# endif
-	addq	%r10, %rcx
-	/* Adjust for number of bytes skipped.  */
-	addq	%r8, %rcx
-# ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 2), %rcx
-	subq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	movl	VEC_OFFSET(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	VEC_OFFSET(%rsi, %rcx), %edx
+	je	L(ret9)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rsi, %rcx), %edi
-	cmpl	(VEC_SIZE * 2)(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rax, %rcx), %eax
-	movzbl	(VEC_SIZE * 2)(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	movzbl	VEC_OFFSET(%rdi, %rcx), %eax
+	movzbl	VEC_OFFSET(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret9):
 	ret
 
-# ifdef USE_AS_STRNCMP
-L(string_nbyte_offset_check):
-	leaq	(VEC_SIZE * 4)(%r10), %r10
-	cmpq	%r10, %r11
-	jbe	L(zero)
-	jmp	L(back_to_loop)
+
+	.p2align 4,, 10
+L(page_cross):
+# ifndef USE_AS_STRNCMP
+	/* If both are VEC aligned we don't need any special logic here.
+	   Only valid for strcmp where the stop condition is guaranteed to
+	   be reachable by just reading memory.  */
+	testl	$((VEC_SIZE - 1) << 20), %eax
+	jz	L(no_page_cross)
 # endif
 
-	.p2align 4
-L(cross_page_loop):
-	/* Check one byte/dword at a time.  */
+	movl	%edi, %eax
+	movl	%esi, %ecx
+	andl	$(PAGE_SIZE - 1), %eax
+	andl	$(PAGE_SIZE - 1), %ecx
+
+	xorl	%OFFSET_REG, %OFFSET_REG
+
+	/* Check which is closer to page cross, s1 or s2.  */
+	cmpl	%eax, %ecx
+	jg	L(page_cross_s2)
+
+	/* The previous page cross check has false positives. Check for
+	   true positive as page cross logic is very expensive.  */
+	subl	$(PAGE_SIZE - VEC_SIZE * 4), %eax
+	jbe	L(no_page_cross)
+
+
+	/* Set r8 to not interfere with normal return value (rdi and rsi
+	   did not swap).  */
 # ifdef USE_AS_WCSCMP
-	cmpl	%ecx, %eax
+	/* Any non-zero positive value that doesn't interfere with 0x1.
+	 */
+	movl	$2, %r8d
 # else
-	subl	%ecx, %eax
+	xorl	%r8d, %r8d
 # endif
-	jne	L(different)
-	addl	$SIZE_OF_CHAR, %edx
-	cmpl	$(VEC_SIZE * 4), %edx
-	je	L(main_loop_header)
+
+	/* Check if less than 1x VEC till page cross.  */
+	subl	$(VEC_SIZE * 3), %eax
+	jg	L(less_1x_vec_till_page)
+
+
+	/* If more than 1x VEC till page cross, loop through safely
+	   loadable memory until within 1x VEC of page cross.  */
+	.p2align 4,, 8
+L(page_cross_loop):
+	VMOVU	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(check_ret_vec_page_cross)
+	addl	$CHAR_PER_VEC, %OFFSET_REG
 # ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross)
 # endif
+	addl	$VEC_SIZE, %eax
+	jl	L(page_cross_loop)
+
 # ifdef USE_AS_WCSCMP
-	movl	(%rdi, %rdx), %eax
-	movl	(%rsi, %rdx), %ecx
-# else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %ecx
+	shrl	$2, %eax
 # endif
-	/* Check null CHAR.  */
-	testl	%eax, %eax
-	jne	L(cross_page_loop)
-	/* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
-	   comparisons.  */
-	subl	%ecx, %eax
-# ifndef USE_AS_WCSCMP
-L(different):
+
+
+	subl	%eax, %OFFSET_REG
+	/* OFFSET_REG has distance to page cross - VEC_SIZE. Guaranteed
+	   to not cross page so is safe to load. Since we have already
+	   loaded at least 1 VEC from rsi it is also guaranteed to be safe.
+	 */
+	VMOVU	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0, %k1{%k2}
+
+	kmovd	%k1, %ecx
+# ifdef USE_AS_STRNCMP
+	leal	CHAR_PER_VEC(%OFFSET_REG64), %eax
+	cmpq	%rax, %rdx
+	jbe	L(check_ret_vec_page_cross2)
+#  ifdef USE_AS_WCSCMP
+	addq	$-(CHAR_PER_VEC * 2), %rdx
+#  else
+	addq	%rdi, %rdx
+#  endif
 # endif
-	ret
+	TESTEQ	%ecx
+	jz	L(prepare_loop_no_len)
 
+	.p2align 4,, 4
+L(ret_vec_page_cross):
+# ifndef USE_AS_STRNCMP
+L(check_ret_vec_page_cross):
+# endif
+	tzcntl	%ecx, %ecx
+	addl	%OFFSET_REG, %ecx
+L(ret_vec_page_cross_cont):
 # ifdef USE_AS_WCSCMP
-	.p2align 4
-L(different):
-	/* Use movl to avoid modifying EFLAGS.  */
-	movl	$0, %eax
+	movl	(%rdi, %rcx, SIZE_OF_CHAR), %edx
+	xorl	%eax, %eax
+	cmpl	(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret12)
 	setl	%al
 	negl	%eax
-	orl	$1, %eax
-	ret
+	xorl	%r8d, %eax
+# else
+	movzbl	(%rdi, %rcx, SIZE_OF_CHAR), %eax
+	movzbl	(%rsi, %rcx, SIZE_OF_CHAR), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret12):
+	ret
+
 
 # ifdef USE_AS_STRNCMP
-	.p2align 4
-L(zero):
+	.p2align 4,, 10
+L(check_ret_vec_page_cross2):
+	TESTEQ	%ecx
+L(check_ret_vec_page_cross):
+	tzcntl	%ecx, %ecx
+	addl	%OFFSET_REG, %ecx
+	cmpq	%rcx, %rdx
+	ja	L(ret_vec_page_cross_cont)
+	.p2align 4,, 2
+L(ret_zero_page_cross):
 	xorl	%eax, %eax
 	ret
+# endif
 
-	.p2align 4
-L(char0):
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(%rdi), %ecx
-	cmpl	(%rsi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rsi), %ecx
-	movzbl	(%rdi), %eax
-	subl	%ecx, %eax
-#  endif
-	ret
+	.p2align 4,, 4
+L(page_cross_s2):
+	/* Ensure this is a true page cross.  */
+	subl	$(PAGE_SIZE - VEC_SIZE * 4), %ecx
+	jbe	L(no_page_cross)
+
+
+	movl	%ecx, %eax
+	movq	%rdi, %rcx
+	movq	%rsi, %rdi
+	movq	%rcx, %rsi
+
+	/* Set r8 to negate the return value as rdi and rsi were swapped.  */
+# ifdef USE_AS_WCSCMP
+	movl	$-4, %r8d
+# else
+	movl	$-1, %r8d
 # endif
+	xorl	%OFFSET_REG, %OFFSET_REG
 
-	.p2align 4
-L(last_vector):
-	addq	%rdx, %rdi
-	addq	%rdx, %rsi
-# ifdef USE_AS_STRNCMP
-	subq	%rdx, %r11
+	/* Check if more than 1x VEC till page cross.  */
+	subl	$(VEC_SIZE * 3), %eax
+	jle	L(page_cross_loop)
+
+	.p2align 4,, 6
+L(less_1x_vec_till_page):
+# ifdef USE_AS_WCSCMP
+	shrl	$2, %eax
 # endif
-	tzcntl	%ecx, %edx
+	/* Find largest load size we can use.  */
+	cmpl	$(16 / SIZE_OF_CHAR), %eax
+	ja	L(less_16_till_page)
+
+	/* Use 16 byte comparison.  */
+	vmovdqu	(%rdi), %xmm0
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, (%rsi), %xmm0, %k1{%k2}
+	kmovd	%k1, %ecx
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
+	subl	$0xf, %ecx
+# else
+	incw	%cx
 # endif
+	jnz	L(check_ret_vec_page_cross)
+	movl	$(16 / SIZE_OF_CHAR), %OFFSET_REG
 # ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subl	%eax, %OFFSET_REG
+# else
+	/* Explicit check for 16 byte alignment.  */
+	subl	%eax, %OFFSET_REG
+	jz	L(prepare_loop)
 # endif
+	vmovdqu	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0, %k1{%k2}
+	kmovd	%k1, %ecx
 # ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	subl	$0xf, %ecx
 # else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	incw	%cx
 # endif
+	jnz	L(check_ret_vec_page_cross)
+# ifdef USE_AS_STRNCMP
+	addl	$(16 / SIZE_OF_CHAR), %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subq	$-(CHAR_PER_VEC * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+# else
+	leaq	(16 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	(16 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+# endif
+	jmp	L(prepare_loop_aligned)
+
+# ifdef USE_AS_STRNCMP
+	.p2align 4,, 2
+L(ret_zero_page_cross_slow_case0):
+	xorl	%eax, %eax
 	ret
+# endif
 
-	/* Comparing on page boundary region requires special treatment:
-	   It must done one vector at the time, starting with the wider
-	   ymm vector if possible, if not, with xmm. If fetching 16 bytes
-	   (xmm) still passes the boundary, byte comparison must be done.
-	 */
-	.p2align 4
-L(cross_page):
-	/* Try one ymm vector at a time.  */
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
-	jg	L(cross_page_1_vector)
-L(loop_1_vector):
-	VMOVU	(%rdi, %rdx), %YMM0
 
-	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at (%rsi, %rdx).  */
-	VPCMP	$0, (%rsi, %rdx), %YMM0, %k1{%k2}
+	.p2align 4,, 10
+L(less_16_till_page):
+	cmpl	$(24 / SIZE_OF_CHAR), %eax
+	ja	L(less_8_till_page)
+
+	/* Use 8 byte comparison.  */
+	vmovq	(%rdi), %xmm0
+	vmovq	(%rsi), %xmm1
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, %xmm1, %xmm0, %k1{%k2}
 	kmovd	%k1, %ecx
 # ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
+	subl	$0x3, %ecx
 # else
-	incl	%ecx
+	incb	%cl
 # endif
-	jne	L(last_vector)
+	jnz	L(check_ret_vec_page_cross)
 
-	addl	$VEC_SIZE, %edx
 
-	addl	$VEC_SIZE, %eax
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$(8 / SIZE_OF_CHAR), %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
 # endif
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
-	jle	L(loop_1_vector)
-L(cross_page_1_vector):
-	/* Less than 32 bytes to check, try one xmm vector.  */
-	cmpl	$(PAGE_SIZE - 16), %eax
-	jg	L(cross_page_1_xmm)
-	VMOVU	(%rdi, %rdx), %XMM0
+	movl	$(24 / SIZE_OF_CHAR), %OFFSET_REG
+	subl	%eax, %OFFSET_REG
 
-	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in XMM0 and 16 bytes at (%rsi, %rdx).  */
-	VPCMP	$0, (%rsi, %rdx), %XMM0, %k1{%k2}
+	vmovq	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0
+	vmovq	(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm1
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, %xmm1, %xmm0, %k1{%k2}
 	kmovd	%k1, %ecx
 # ifdef USE_AS_WCSCMP
-	subl	$0xf, %ecx
+	subl	$0x3, %ecx
 # else
-	subl	$0xffff, %ecx
+	incb	%cl
 # endif
-	jne	L(last_vector)
+	jnz	L(check_ret_vec_page_cross)
+
 
-	addl	$16, %edx
-# ifndef USE_AS_WCSCMP
-	addl	$16, %eax
-# endif
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	addl	$(8 / SIZE_OF_CHAR), %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subq	$-(CHAR_PER_VEC * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+# else
+	leaq	(8 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	(8 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
 # endif
+	jmp	L(prepare_loop_aligned)
 
-L(cross_page_1_xmm):
-# ifndef USE_AS_WCSCMP
-	/* Less than 16 bytes to check, try 8 byte vector.  NB: No need
-	   for wcscmp nor wcsncmp since wide char is 4 bytes.   */
-	cmpl	$(PAGE_SIZE - 8), %eax
-	jg	L(cross_page_8bytes)
-	vmovq	(%rdi, %rdx), %XMM0
-	vmovq	(%rsi, %rdx), %XMM1
 
-	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in XMM0 and XMM1.  */
-	VPCMP	$0, %XMM1, %XMM0, %k1{%k2}
-	kmovb	%k1, %ecx
+
+
+	.p2align 4,, 10
+L(less_8_till_page):
 # ifdef USE_AS_WCSCMP
-	subl	$0x3, %ecx
+	/* If using wchar then this is the only check before we reach
+	   the page boundary.  */
+	movl	(%rdi), %eax
+	movl	(%rsi), %ecx
+	cmpl	%ecx, %eax
+	jnz	L(ret_less_8_wcs)
+#  ifdef USE_AS_STRNCMP
+	addq	$-(CHAR_PER_VEC * 2), %rdx
+	/* We already checked for len <= 1 so cannot hit that case here.
+	 */
+#  endif
+	testl	%eax, %eax
+	jnz	L(prepare_loop)
+	ret
+
+	.p2align 4,, 8
+L(ret_less_8_wcs):
+	setl	%OFFSET_REG8
+	negl	%OFFSET_REG
+	movl	%OFFSET_REG, %eax
+	xorl	%r8d, %eax
+	ret
+
 # else
-	subl	$0xff, %ecx
-# endif
-	jne	L(last_vector)
+	cmpl	$28, %eax
+	ja	L(less_4_till_page)
+
+	vmovd	(%rdi), %xmm0
+	vmovd	(%rsi), %xmm1
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, %xmm1, %xmm0, %k1{%k2}
+	kmovd	%k1, %ecx
+	subl	$0xf, %ecx
+	jnz	L(check_ret_vec_page_cross)
 
-	addl	$8, %edx
-	addl	$8, %eax
 #  ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$4, %rdx
+	jbe	L(ret_zero_page_cross_slow_case1)
 #  endif
+	movl	$(28 / SIZE_OF_CHAR), %OFFSET_REG
+	subl	%eax, %OFFSET_REG
 
-L(cross_page_8bytes):
-	/* Less than 8 bytes to check, try 4 byte vector.  */
-	cmpl	$(PAGE_SIZE - 4), %eax
-	jg	L(cross_page_4bytes)
-	vmovd	(%rdi, %rdx), %XMM0
-	vmovd	(%rsi, %rdx), %XMM1
-
-	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in XMM0 and XMM1.  */
-	VPCMP	$0, %XMM1, %XMM0, %k1{%k2}
+	vmovd	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0
+	vmovd	(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm1
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, %xmm1, %xmm0, %k1{%k2}
 	kmovd	%k1, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0x1, %ecx
-# else
 	subl	$0xf, %ecx
-# endif
-	jne	L(last_vector)
+	jnz	L(check_ret_vec_page_cross)
+#  ifdef USE_AS_STRNCMP
+	addl	$(4 / SIZE_OF_CHAR), %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case1)
+	subq	$-(CHAR_PER_VEC * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+#  else
+	leaq	(4 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	(4 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+#  endif
+	jmp	L(prepare_loop_aligned)
+
 
-	addl	$4, %edx
 #  ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	.p2align 4,, 2
+L(ret_zero_page_cross_slow_case1):
+	xorl	%eax, %eax
+	ret
 #  endif
 
-L(cross_page_4bytes):
-# endif
-	/* Less than 4 bytes to check, try one byte/dword at a time.  */
-# ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
-# endif
-# ifdef USE_AS_WCSCMP
-	movl	(%rdi, %rdx), %eax
-	movl	(%rsi, %rdx), %ecx
-# else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %ecx
-# endif
-	testl	%eax, %eax
-	jne	L(cross_page_loop)
+	.p2align 4,, 10
+L(less_4_till_page):
+	subq	%rdi, %rsi
+	/* Extremely slow byte comparison loop.  */
+L(less_4_loop):
+	movzbl	(%rdi), %eax
+	movzbl	(%rsi, %rdi), %ecx
 	subl	%ecx, %eax
+	jnz	L(ret_less_4_loop)
+	testl	%ecx, %ecx
+	jz	L(ret_zero_4_loop)
+#  ifdef USE_AS_STRNCMP
+	decq	%rdx
+	jz	L(ret_zero_4_loop)
+#  endif
+	incq	%rdi
+	/* End condition is reaching the page boundary (rdi is aligned).  */
+	testl	$31, %edi
+	jnz	L(less_4_loop)
+	leaq	-(VEC_SIZE * 4)(%rdi, %rsi), %rsi
+	addq	$-(VEC_SIZE * 4), %rdi
+#  ifdef USE_AS_STRNCMP
+	subq	$-(CHAR_PER_VEC * 4), %rdx
+#  endif
+	jmp	L(prepare_loop_aligned)
+
+L(ret_zero_4_loop):
+	xorl	%eax, %eax
+	ret
+L(ret_less_4_loop):
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 	ret
-END (STRCMP)
+# endif
+END(STRCMP)
 #endif
-- 
2.25.1


* [PATCH v2 7/7] benchtests: Add more coverage for strcmp and strncmp benchmarks
  2022-01-10  0:27 ` [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S " Noah Goldstein
                     ` (4 preceding siblings ...)
  2022-01-10  0:27   ` [PATCH v2 6/7] x86: Optimize strcmp-evex.S Noah Goldstein
@ 2022-01-10  0:27   ` Noah Goldstein
  2022-01-10  0:34   ` [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755] H.J. Lu
  6 siblings, 0 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10  0:27 UTC (permalink / raw)
  To: libc-alpha

Add more small and medium sized tests for strcmp and strncmp.

As well, for strcmp add an option for more direct control of
alignment. Previously alignment was always pushed to the end of the
page. While this is the most difficult case to implement, it is far
from the common case and so shouldn't be the only benchmark.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
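Not part of the patch itself, just a rough sketch of the placement
logic the new at_end argument of do_test chooses between.
place_string is an invented helper name; buf, page_size and charbytes
stand in for the benchtest buffer, page size and the CHARBYTES macro.

#include <stddef.h>

/* Sketch: where a test string ends up for a given alignment.  */
static unsigned char *
place_string (unsigned char *buf, size_t page_size, size_t align,
	      size_t len, size_t charbytes, int at_end)
{
  if (at_end)
    {
      /* Old behaviour: push the string close to the end of the page.  */
      size_t i = align + charbytes * (len + 2);
      align = ((page_size - i) / 16 * 16) + align;
    }
  /* Otherwise the string sits at the requested alignment directly.  */
  return buf + align;
}

With at_end != 0 the strings keep the old end-of-page placement; with
at_end == 0 the requested alignment is used directly, which is the
common case the added sizes are meant to measure.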
 benchtests/bench-strcmp.c  | 142 ++++++++++++++++++++++++++-----------
 benchtests/bench-strncmp.c | 110 ++++++++++++++++++++--------
 2 files changed, 183 insertions(+), 69 deletions(-)

diff --git a/benchtests/bench-strcmp.c b/benchtests/bench-strcmp.c
index 387e76fcfb..3a60edfb15 100644
--- a/benchtests/bench-strcmp.c
+++ b/benchtests/bench-strcmp.c
@@ -99,8 +99,8 @@ do_one_test (json_ctx_t *json_ctx, impl_t *impl,
 }
 
 static void
-do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, int
-	 max_char, int exp_result)
+do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len,
+         int max_char, int exp_result, int at_end)
 {
   size_t i;
 
@@ -109,19 +109,28 @@ do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, int
   if (len == 0)
     return;
 
-  align1 &= 63;
+  align1 &= ~(CHARBYTES - 1);
+  align2 &= ~(CHARBYTES - 1);
+
+  align1 &= (getpagesize () - 1);
   if (align1 + (len + 1) * CHARBYTES >= page_size)
     return;
 
-  align2 &= 63;
+  align2 &= (getpagesize () - 1);
   if (align2 + (len + 1) * CHARBYTES >= page_size)
     return;
 
   /* Put them close to the end of page.  */
-  i = align1 + CHARBYTES * (len + 2);
-  s1 = (CHAR *) (buf1 + ((page_size - i) / 16 * 16) + align1);
-  i = align2 + CHARBYTES * (len + 2);
-  s2 = (CHAR *) (buf2 + ((page_size - i) / 16 * 16)  + align2);
+  if (at_end)
+    {
+      i = align1 + CHARBYTES * (len + 2);
+      align1 = ((page_size - i) / 16 * 16) + align1;
+      i = align2 + CHARBYTES * (len + 2);
+      align2 = ((page_size - i) / 16 * 16) + align2;
+    }
+
+  s1 = (CHAR *)(buf1 + align1);
+  s2 = (CHAR *)(buf2 + align2);
 
   for (i = 0; i < len; i++)
     s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
@@ -132,9 +141,9 @@ do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, int
   s2[len - 1] -= exp_result;
 
   json_element_object_begin (json_ctx);
-  json_attr_uint (json_ctx, "length", (double) len);
-  json_attr_uint (json_ctx, "align1", (double) align1);
-  json_attr_uint (json_ctx, "align2", (double) align2);
+  json_attr_uint (json_ctx, "length", (double)len);
+  json_attr_uint (json_ctx, "align1", (double)align1);
+  json_attr_uint (json_ctx, "align2", (double)align2);
   json_array_begin (json_ctx, "timings");
 
   FOR_EACH_IMPL (impl, 0)
@@ -202,7 +211,8 @@ int
 test_main (void)
 {
   json_ctx_t json_ctx;
-  size_t i;
+  size_t i, j, k;
+  size_t pg_sz = getpagesize ();
 
   test_init ();
 
@@ -221,36 +231,88 @@ test_main (void)
   json_array_end (&json_ctx);
 
   json_array_begin (&json_ctx, "results");
-
-  for (i = 1; i < 32; ++i)
-    {
-      do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 0);
-      do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 1);
-      do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, -1);
-    }
-
-  for (i = 1; i < 10 + CHARBYTESLOG; ++i)
+  for (k = 0; k < 2; ++k)
     {
-      do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 0);
-      do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 0);
-      do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 1);
-      do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 1);
-      do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, -1);
-      do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, -1);
-      do_test (&json_ctx, 0, CHARBYTES * i, 2 << i, MIDCHAR, 1);
-      do_test (&json_ctx, CHARBYTES * i, CHARBYTES * (i + 1), 2 << i, LARGECHAR, 1);
+      for (i = 1; i < 32; ++i)
+        {
+          do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 0, k);
+          do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 1, k);
+          do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, -1, k);
+        }
+
+      for (i = 1; i <= 8192;)
+        {
+          /* No page crosses.  */
+          do_test (&json_ctx, 0, 0, i, MIDCHAR, 0, k);
+          do_test (&json_ctx, i * CHARBYTES, 0, i, MIDCHAR, 0, k);
+          do_test (&json_ctx, 0, i * CHARBYTES, i, MIDCHAR, 0, k);
+
+          /* False page crosses.  */
+          do_test (&json_ctx, pg_sz / 2, pg_sz / 2 - CHARBYTES, i, MIDCHAR, 0,
+                   k);
+          do_test (&json_ctx, pg_sz / 2 - CHARBYTES, pg_sz / 2, i, MIDCHAR, 0,
+                   k);
+
+          do_test (&json_ctx, pg_sz - (i * CHARBYTES), 0, i, MIDCHAR, 0, k);
+          do_test (&json_ctx, 0, pg_sz - (i * CHARBYTES), i, MIDCHAR, 0, k);
+
+          /* Real page cross.  */
+          for (j = 16; j < 128; j += 16)
+            {
+              do_test (&json_ctx, pg_sz - j, 0, i, MIDCHAR, 0, k);
+              do_test (&json_ctx, 0, pg_sz - j, i, MIDCHAR, 0, k);
+
+              do_test (&json_ctx, pg_sz - j, pg_sz - j / 2, i, MIDCHAR, 0, k);
+              do_test (&json_ctx, pg_sz - j / 2, pg_sz - j, i, MIDCHAR, 0, k);
+            }
+
+          if (i < 32)
+            {
+              ++i;
+            }
+          else if (i < 160)
+            {
+              i += 8;
+            }
+          else if (i < 512)
+            {
+              i += 32;
+            }
+          else
+            {
+              i *= 2;
+            }
+        }
+
+      for (i = 1; i < 10 + CHARBYTESLOG; ++i)
+        {
+          do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 0, k);
+          do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 0, k);
+          do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 1, k);
+          do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 1, k);
+          do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, -1, k);
+          do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, -1, k);
+          do_test (&json_ctx, 0, CHARBYTES * i, 2 << i, MIDCHAR, 1, k);
+          do_test (&json_ctx, CHARBYTES * i, CHARBYTES * (i + 1), 2 << i,
+                   LARGECHAR, 1, k);
+        }
+
+      for (i = 1; i < 8; ++i)
+        {
+          do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i,
+                   MIDCHAR, 0, k);
+          do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i,
+                   LARGECHAR, 0, k);
+          do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i,
+                   MIDCHAR, 1, k);
+          do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i,
+                   LARGECHAR, 1, k);
+          do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i,
+                   MIDCHAR, -1, k);
+          do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i,
+                   LARGECHAR, -1, k);
+        }
     }
-
-  for (i = 1; i < 8; ++i)
-    {
-      do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, MIDCHAR, 0);
-      do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, 0);
-      do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, MIDCHAR, 1);
-      do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, 1);
-      do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, MIDCHAR, -1);
-      do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, -1);
-    }
-
   do_test_page_boundary (&json_ctx);
 
   json_array_end (&json_ctx);
diff --git a/benchtests/bench-strncmp.c b/benchtests/bench-strncmp.c
index b7a01fde64..6673a53521 100644
--- a/benchtests/bench-strncmp.c
+++ b/benchtests/bench-strncmp.c
@@ -150,43 +150,43 @@ do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, size_t
   if (n == 0)
     return;
 
-  align1 &= 63;
+  align1 &= getpagesize () - 1;
   if (align1 + (n + 1) * CHARBYTES >= page_size)
     return;
 
-  align2 &= 7;
+  align2 &= getpagesize () - 1;
   if (align2 + (n + 1) * CHARBYTES >= page_size)
     return;
 
   json_element_object_begin (json_ctx);
-  json_attr_uint (json_ctx, "strlen", (double) len);
-  json_attr_uint (json_ctx, "len", (double) n);
-  json_attr_uint (json_ctx, "align1", (double) align1);
-  json_attr_uint (json_ctx, "align2", (double) align2);
+  json_attr_uint (json_ctx, "strlen", (double)len);
+  json_attr_uint (json_ctx, "len", (double)n);
+  json_attr_uint (json_ctx, "align1", (double)align1);
+  json_attr_uint (json_ctx, "align2", (double)align2);
   json_array_begin (json_ctx, "timings");
 
   FOR_EACH_IMPL (impl, 0)
-    {
-      alloc_bufs ();
-      s1 = (CHAR *) (buf1 + align1);
-      s2 = (CHAR *) (buf2 + align2);
-
-      for (i = 0; i < n; i++)
-	s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
-
-      s1[n] = 24 + exp_result;
-      s2[n] = 23;
-      s1[len] = 0;
-      s2[len] = 0;
-      if (exp_result < 0)
-	s2[len] = 32;
-      else if (exp_result > 0)
-	s1[len] = 64;
-      if (len >= n)
-	s2[n - 1] -= exp_result;
+  {
+    alloc_bufs ();
+    s1 = (CHAR *)(buf1 + align1);
+    s2 = (CHAR *)(buf2 + align2);
+
+    for (i = 0; i < n; i++)
+      s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
+
+    s1[n] = 24 + exp_result;
+    s2[n] = 23;
+    s1[len] = 0;
+    s2[len] = 0;
+    if (exp_result < 0)
+      s2[len] = 32;
+    else if (exp_result > 0)
+      s1[len] = 64;
+    if (len >= n)
+      s2[n - 1] -= exp_result;
 
-      do_one_test (json_ctx, impl, s1, s2, n, exp_result);
-    }
+    do_one_test (json_ctx, impl, s1, s2, n, exp_result);
+  }
 
   json_array_end (json_ctx);
   json_element_object_end (json_ctx);
@@ -319,7 +319,8 @@ int
 test_main (void)
 {
   json_ctx_t json_ctx;
-  size_t i;
+  size_t i, j, len;
+  size_t pg_sz = getpagesize ();
 
   test_init ();
 
@@ -334,12 +335,12 @@ test_main (void)
 
   json_array_begin (&json_ctx, "ifuncs");
   FOR_EACH_IMPL (impl, 0)
-    json_element_string (&json_ctx, impl->name);
+  json_element_string (&json_ctx, impl->name);
   json_array_end (&json_ctx);
 
   json_array_begin (&json_ctx, "results");
 
-  for (i =0; i < 16; ++i)
+  for (i = 0; i < 16; ++i)
     {
       do_test (&json_ctx, 0, 0, 8, i, 127, 0);
       do_test (&json_ctx, 0, 0, 8, i, 127, -1);
@@ -361,6 +362,57 @@ test_main (void)
       do_test (&json_ctx, i, 3 * i, 8, i, 255, -1);
     }
 
+  for (len = 0; len <= 128; len += 64)
+    {
+      for (i = 1; i <= 8192;)
+        {
+          /* No page crosses.  */
+          do_test (&json_ctx, 0, 0, i, i + len, 127, 0);
+          do_test (&json_ctx, i * CHARBYTES, 0, i, i + len, 127, 0);
+          do_test (&json_ctx, 0, i * CHARBYTES, i, i + len, 127, 0);
+
+          /* False page crosses.  */
+          do_test (&json_ctx, pg_sz / 2, pg_sz / 2 - CHARBYTES, i, i + len,
+                   127, 0);
+          do_test (&json_ctx, pg_sz / 2 - CHARBYTES, pg_sz / 2, i, i + len,
+                   127, 0);
+
+          do_test (&json_ctx, pg_sz - (i * CHARBYTES), 0, i, i + len, 127,
+                   0);
+          do_test (&json_ctx, 0, pg_sz - (i * CHARBYTES), i, i + len, 127,
+                   0);
+
+          /* Real page cross.  */
+          for (j = 16; j < 128; j += 16)
+            {
+              do_test (&json_ctx, pg_sz - j, 0, i, i + len, 127, 0);
+              do_test (&json_ctx, 0, pg_sz - j, i, i + len, 127, 0);
+
+              do_test (&json_ctx, pg_sz - j, pg_sz - j / 2, i, i + len,
+                       127, 0);
+              do_test (&json_ctx, pg_sz - j / 2, pg_sz - j, i, i + len,
+                       127, 0);
+            }
+
+          if (i < 32)
+            {
+              ++i;
+            }
+          else if (i < 160)
+            {
+              i += 8;
+            }
+          else if (i < 256)
+            {
+              i += 32;
+            }
+          else
+            {
+              i *= 2;
+            }
+        }
+    }
+
   for (i = 1; i < 8; ++i)
     {
       do_test (&json_ctx, 0, 0, 8 << i, 16 << i, 127, 0);
-- 
2.25.1


* Re: [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755]
  2022-01-09 14:07   ` H.J. Lu
@ 2022-01-10  0:29     ` Noah Goldstein
  0 siblings, 0 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10  0:29 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell

On Sun, Jan 9, 2022 at 8:08 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Sun, Jan 9, 2022 at 4:35 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Sun, Jan 9, 2022 at 6:30 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
> > > __wcscmp_avx2. For x86_64 this covers the entire address range so any
> > > length larger could not possibly be used to bound `s1` or `s2`.
>
> Please first submit a separate single patch to fix wcsncmp_avx2 and
> wcsncmp_evex for BZ# 28755

Done.

>
> Thanks.
>
> --
> H.J.

* Re: [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
  2022-01-10  0:27 ` [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S " Noah Goldstein
                     ` (5 preceding siblings ...)
  2022-01-10  0:27   ` [PATCH v2 7/7] benchtests: Add more coverage for strcmp and strncmp benchmarks Noah Goldstein
@ 2022-01-10  0:34   ` H.J. Lu
  6 siblings, 0 replies; 59+ messages in thread
From: H.J. Lu @ 2022-01-10  0:34 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library

On Sun, Jan 9, 2022 at 4:28 PM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
> __wcscmp_avx2. For x86_64 this covers the entire address range so any
> length larger could not possibly be used to bound `s1` or `s2`.
>
> test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
>
> Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> ---
>  sysdeps/x86_64/multiarch/strcmp-avx2.S | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> index a45f9d2749..9c73b5899d 100644
> --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> @@ -87,6 +87,16 @@ ENTRY (STRCMP)
>         je      L(char0)
>         jb      L(zero)
>  #  ifdef USE_AS_WCSCMP
> +#  ifndef __ILP32__
> +       movq    %rdx, %rcx
> +       /* Check if length could overflow when multiplied by
> +          sizeof(wchar_t). Checking top 8 bits will cover all potential
> +          overflow cases as well as redirect cases where its impossible to
> +          length to bound a valid memory region. In these cases just use
> +          'wcscmp'.  */
> +       shrq    $56, %rcx
> +       jnz     __wcscmp_avx2
> +#  endif
>         /* Convert units: from wide to byte char.  */
>         shl     $2, %RDX_LP
>  #  endif
> --
> 2.25.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.

* Re: [PATCH v2 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S [BZ# 28755]
  2022-01-10  0:27   ` [PATCH v2 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S " Noah Goldstein
@ 2022-01-10  0:35     ` H.J. Lu
  0 siblings, 0 replies; 59+ messages in thread
From: H.J. Lu @ 2022-01-10  0:35 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library

On Sun, Jan 9, 2022 at 4:28 PM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
> __wcscmp_evex. For x86_64 this covers the entire address range so any
> length larger could not possibly be used to bound `s1` or `s2`.
>
> test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
>
> Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> ---
>  sysdeps/x86_64/multiarch/strcmp-evex.S | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
> index 1d971f3889..0cd939d5af 100644
> --- a/sysdeps/x86_64/multiarch/strcmp-evex.S
> +++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
> @@ -104,6 +104,16 @@ ENTRY (STRCMP)
>         je      L(char0)
>         jb      L(zero)
>  #  ifdef USE_AS_WCSCMP
> +#  ifndef __ILP32__
> +       movq    %rdx, %rcx
> +       /* Check if length could overflow when multiplied by
> +          sizeof(wchar_t). Checking top 8 bits will cover all potential
> +          overflow cases as well as redirect cases where its impossible to
> +          length to bound a valid memory region. In these cases just use
> +          'wcscmp'.  */
> +       shrq    $56, %rcx
> +       jnz     __wcscmp_evex
> +#  endif
>         /* Convert units: from wide to byte char.  */
>         shl     $2, %RDX_LP
>  #  endif
> --
> 2.25.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.

* Re: [PATCH v2 3/7] string/test-str*cmp: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp].
  2022-01-10  0:27   ` [PATCH v2 3/7] string/test-str*cmp: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp] Noah Goldstein
@ 2022-01-10  0:37     ` H.J. Lu
  0 siblings, 0 replies; 59+ messages in thread
From: H.J. Lu @ 2022-01-10  0:37 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library

On Sun, Jan 9, 2022 at 4:29 PM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> These implementations are incorrect. There may be a mismatch in s1/s2
> before invalid memory but no null CHAR / length boundary.
>
> Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> ---
>  string/test-strcmp.c  | 35 -----------------------------------
>  string/test-strncmp.c | 34 ----------------------------------
>  2 files changed, 69 deletions(-)
>
> diff --git a/string/test-strcmp.c b/string/test-strcmp.c
> index 3c75076fb8..97d7bf5043 100644
> --- a/string/test-strcmp.c
> +++ b/string/test-strcmp.c
> @@ -34,7 +34,6 @@
>  # define STRLEN wcslen
>  # define MEMCPY wmemcpy
>  # define SIMPLE_STRCMP simple_wcscmp
> -# define STUPID_STRCMP stupid_wcscmp
>  # define CHAR wchar_t
>  # define UCHAR wchar_t
>  # define CHARBYTES 4
> @@ -64,25 +63,6 @@ simple_wcscmp (const wchar_t *s1, const wchar_t *s2)
>    return c1 < c2 ? -1 : 1;
>  }
>
> -int
> -stupid_wcscmp (const wchar_t *s1, const wchar_t *s2)
> -{
> -  size_t ns1 = wcslen (s1) + 1;
> -  size_t ns2 = wcslen (s2) + 1;
> -  size_t n = ns1 < ns2 ? ns1 : ns2;
> -  int ret = 0;
> -
> -  wchar_t c1, c2;
> -
> -  while (n--) {
> -    c1 = *s1++;
> -    c2 = *s2++;
> -    if ((ret = c1 < c2 ? -1 : c1 == c2 ? 0 : 1) != 0)
> -      break;
> -  }
> -  return ret;
> -}
> -
>  #else
>  # include <limits.h>
>
> @@ -92,7 +72,6 @@ stupid_wcscmp (const wchar_t *s1, const wchar_t *s2)
>  # define STRLEN strlen
>  # define MEMCPY memcpy
>  # define SIMPLE_STRCMP simple_strcmp
> -# define STUPID_STRCMP stupid_strcmp
>  # define CHAR char
>  # define UCHAR unsigned char
>  # define CHARBYTES 1
> @@ -113,24 +92,10 @@ simple_strcmp (const char *s1, const char *s2)
>    return ret;
>  }
>
> -int
> -stupid_strcmp (const char *s1, const char *s2)
> -{
> -  size_t ns1 = strlen (s1) + 1;
> -  size_t ns2 = strlen (s2) + 1;
> -  size_t n = ns1 < ns2 ? ns1 : ns2;
> -  int ret = 0;
> -
> -  while (n--)
> -    if ((ret = *(unsigned char *) s1++ - *(unsigned char *) s2++) != 0)
> -      break;
> -  return ret;
> -}
>  #endif
>
>  typedef int (*proto_t) (const CHAR *, const CHAR *);
>
> -IMPL (STUPID_STRCMP, 1)
>  IMPL (SIMPLE_STRCMP, 1)
>  IMPL (STRCMP, 1)
>
> diff --git a/string/test-strncmp.c b/string/test-strncmp.c
> index e7d5edea39..61a283a0af 100644
> --- a/string/test-strncmp.c
> +++ b/string/test-strncmp.c
> @@ -33,7 +33,6 @@
>  # define STRDUP wcsdup
>  # define MEMCPY wmemcpy
>  # define SIMPLE_STRNCMP simple_wcsncmp
> -# define STUPID_STRNCMP stupid_wcsncmp
>  # define CHAR wchar_t
>  # define UCHAR wchar_t
>  # define CHARBYTES 4
> @@ -57,25 +56,6 @@ simple_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n)
>    return 0;
>  }
>
> -int
> -stupid_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n)
> -{
> -  wchar_t c1, c2;
> -  size_t ns1 = wcsnlen (s1, n) + 1, ns2 = wcsnlen (s2, n) + 1;
> -
> -  n = ns1 < n ? ns1 : n;
> -  n = ns2 < n ? ns2 : n;
> -
> -  while (n--)
> -    {
> -      c1 = *s1++;
> -      c2 = *s2++;
> -      if (c1 != c2)
> -       return c1 > c2 ? 1 : -1;
> -    }
> -  return 0;
> -}
> -
>  #else
>  # define L(str) str
>  # define STRNCMP strncmp
> @@ -83,7 +63,6 @@ stupid_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n)
>  # define STRDUP strdup
>  # define MEMCPY memcpy
>  # define SIMPLE_STRNCMP simple_strncmp
> -# define STUPID_STRNCMP stupid_strncmp
>  # define CHAR char
>  # define UCHAR unsigned char
>  # define CHARBYTES 1
> @@ -101,23 +80,10 @@ simple_strncmp (const char *s1, const char *s2, size_t n)
>    return ret;
>  }
>
> -int
> -stupid_strncmp (const char *s1, const char *s2, size_t n)
> -{
> -  size_t ns1 = strnlen (s1, n) + 1, ns2 = strnlen (s2, n) + 1;
> -  int ret = 0;
> -
> -  n = ns1 < n ? ns1 : n;
> -  n = ns2 < n ? ns2 : n;
> -  while (n-- && (ret = *(unsigned char *) s1++ - * (unsigned char *) s2++) == 0);
> -  return ret;
> -}
> -
>  #endif
>
>  typedef int (*proto_t) (const CHAR *, const CHAR *, size_t);
>
> -IMPL (STUPID_STRNCMP, 0)
>  IMPL (SIMPLE_STRNCMP, 0)
>  IMPL (STRNCMP, 1)
>
> --
> 2.25.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.

* Re: [PATCH v2 4/7] string: Improve coverage in test-strcmp.c and test-strncmp.c
  2022-01-10  0:27   ` [PATCH v2 4/7] string: Improve coverage in test-strcmp.c and test-strncmp.c Noah Goldstein
@ 2022-01-10  0:38     ` H.J. Lu
  2022-01-10  2:51       ` Noah Goldstein
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2022-01-10  0:38 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library

On Sun, Jan 9, 2022 at 4:30 PM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Add additional test cases for small / medium sizes.
>
> Add tests in test-strncmp.c where `n` is near ULONG_MAX or LONG_MIN to
> test for overflow bugs in length handling.

How long do the new tests run?

> Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> ---
>  string/test-strcmp.c  |  70 ++++++++++--
>  string/test-strncmp.c | 248 +++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 298 insertions(+), 20 deletions(-)
>
> diff --git a/string/test-strcmp.c b/string/test-strcmp.c
> index 97d7bf5043..eacbdc8857 100644
> --- a/string/test-strcmp.c
> +++ b/string/test-strcmp.c
> @@ -16,6 +16,9 @@
>     License along with the GNU C Library; if not, see
>     <https://www.gnu.org/licenses/>.  */
>
> +#define TEST_LEN (4096 * 3)
> +#define MIN_PAGE_SIZE (TEST_LEN + 2 * getpagesize ())
> +
>  #define TEST_MAIN
>  #ifdef WIDE
>  # define TEST_NAME "wcscmp"
> @@ -129,7 +132,7 @@ do_one_test (impl_t *impl,
>
>  static void
>  do_test (size_t align1, size_t align2, size_t len, int max_char,
> -        int exp_result)
> +         int exp_result)
>  {
>    size_t i;
>
> @@ -138,19 +141,22 @@ do_test (size_t align1, size_t align2, size_t len, int max_char,
>    if (len == 0)
>      return;
>
> -  align1 &= 63;
> +  align1 &= ~(CHARBYTES - 1);
> +  align2 &= ~(CHARBYTES - 1);
> +
> +  align1 &= getpagesize () - 1;
>    if (align1 + (len + 1) * CHARBYTES >= page_size)
>      return;
>
> -  align2 &= 63;
> +  align2 &= getpagesize () - 1;
>    if (align2 + (len + 1) * CHARBYTES >= page_size)
>      return;
>
>    /* Put them close to the end of page.  */
>    i = align1 + CHARBYTES * (len + 2);
> -  s1 = (CHAR *) (buf1 + ((page_size - i) / 16 * 16) + align1);
> +  s1 = (CHAR *)(buf1 + ((page_size - i) / 16 * 16) + align1);
>    i = align2 + CHARBYTES * (len + 2);
> -  s2 = (CHAR *) (buf2 + ((page_size - i) / 16 * 16)  + align2);
> +  s2 = (CHAR *)(buf2 + ((page_size - i) / 16 * 16) + align2);
>
>    for (i = 0; i < len; i++)
>      s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
> @@ -161,9 +167,10 @@ do_test (size_t align1, size_t align2, size_t len, int max_char,
>    s2[len - 1] -= exp_result;
>
>    FOR_EACH_IMPL (impl, 0)
> -    do_one_test (impl, s1, s2, exp_result);
> +  do_one_test (impl, s1, s2, exp_result);
>  }
>
> +
>  static void
>  do_random_tests (void)
>  {
> @@ -385,7 +392,7 @@ check3 (void)
>  int
>  test_main (void)
>  {
> -  size_t i;
> +  size_t i, j;
>
>    test_init ();
>    check();
> @@ -426,6 +433,55 @@ test_main (void)
>        do_test (2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, -1);
>      }
>
> +  for (j = 0; j < 160; ++j)
> +    {
> +      for (i = 0; i < TEST_LEN;)
> +        {
> +          do_test (getpagesize () - j - 1, 0, i, 127, 0);
> +          do_test (getpagesize () - j - 1, 0, i, 127, 1);
> +          do_test (getpagesize () - j - 1, 0, i, 127, -1);
> +
> +          do_test (getpagesize () - j - 1, j, i, 127, 0);
> +          do_test (getpagesize () - j - 1, j, i, 127, 1);
> +          do_test (getpagesize () - j - 1, j, i, 127, -1);
> +
> +          do_test (0, getpagesize () - j - 1, i, 127, 0);
> +          do_test (0, getpagesize () - j - 1, i, 127, 1);
> +          do_test (0, getpagesize () - j - 1, i, 127, -1);
> +
> +          do_test (j, getpagesize () - j - 1, i, 127, 0);
> +          do_test (j, getpagesize () - j - 1, i, 127, 1);
> +          do_test (j, getpagesize () - j - 1, i, 127, -1);
> +
> +          if (i < 32)
> +            {
> +              i += 1;
> +            }
> +          else if (i < 161)
> +            {
> +              i += 7;
> +            }
> +          else if (i + 161 < TEST_LEN)
> +            {
> +              i += 31;
> +              i *= 17;
> +              i /= 16;
> +              if (i + 161 > TEST_LEN)
> +                {
> +                  i = TEST_LEN - 160;
> +                }
> +            }
> +          else if (i + 32 < TEST_LEN)
> +            {
> +              i += 7;
> +            }
> +          else
> +            {
> +              i += 1;
> +            }
> +        }
> +    }
> +
>    do_random_tests ();
>    return ret;
>  }
> diff --git a/string/test-strncmp.c b/string/test-strncmp.c
> index 61a283a0af..4fa6106eb4 100644
> --- a/string/test-strncmp.c
> +++ b/string/test-strncmp.c
> @@ -16,6 +16,9 @@
>     License along with the GNU C Library; if not, see
>     <https://www.gnu.org/licenses/>.  */
>
> +#define TEST_LEN (4096 * 3)
> +#define MIN_PAGE_SIZE (TEST_LEN + 2 * getpagesize ())
> +
>  #define TEST_MAIN
>  #ifdef WIDE
>  # define TEST_NAME "wcsncmp"
> @@ -166,10 +169,10 @@ do_test_limit (size_t align1, size_t align2, size_t len, size_t n, int max_char,
>  }
>
>  static void
> -do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
> -        int exp_result)
> +do_test_n (size_t align1, size_t align2, size_t len, size_t n, int n_in_bounds,
> +           int max_char, int exp_result)
>  {
> -  size_t i;
> +  size_t i, buf_bound;
>    CHAR *s1, *s2;
>
>    align1 &= ~(CHARBYTES - 1);
> @@ -178,22 +181,28 @@ do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
>    if (n == 0)
>      return;
>
> -  align1 &= 63;
> -  if (align1 + (n + 1) * CHARBYTES >= page_size)
> +  buf_bound = n_in_bounds ? n : len;
> +
> +  align1 &= getpagesize () - 1;
> +  if (align1 + (buf_bound + 1) * CHARBYTES >= page_size)
>      return;
>
> -  align2 &= 63;
> -  if (align2 + (n + 1) * CHARBYTES >= page_size)
> +  align2 &= getpagesize () - 1;
> +  if (align2 + (buf_bound + 1) * CHARBYTES >= page_size)
>      return;
>
> -  s1 = (CHAR *) (buf1 + align1);
> -  s2 = (CHAR *) (buf2 + align2);
> +  s1 = (CHAR *)(buf1 + align1);
> +  s2 = (CHAR *)(buf2 + align2);
>
> -  for (i = 0; i < n; i++)
> +  if (n_in_bounds)
> +    {
> +      s1[n] = 24 + exp_result;
> +      s2[n] = 23;
> +    }
> +
> +  for (i = 0; i < buf_bound; i++)
>      s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
>
> -  s1[n] = 24 + exp_result;
> -  s2[n] = 23;
>    s1[len] = 0;
>    s2[len] = 0;
>    if (exp_result < 0)
> @@ -207,6 +216,13 @@ do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
>      do_one_test (impl, s1, s2, n, exp_result);
>  }
>
> +static void
> +do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
> +         int exp_result)
> +{
> +  do_test_n (align1, align2, len, n, 1, max_char, exp_result);
> +}
> +
>  static void
>  do_page_test (size_t offset1, size_t offset2, CHAR *s2)
>  {
> @@ -400,10 +416,123 @@ check3 (void)
>         }
>  }
>
> +static void
> +check_overflow (void)
> +{
> +  size_t i, j, of_mask, of_idx;
> +  const size_t of_masks[]
> +      = { ULONG_MAX, LONG_MIN, ULONG_MAX - (ULONG_MAX >> 2),
> +          ((size_t)LONG_MAX) >> 1 };
> +
> +  for (of_idx = 0; of_idx < sizeof (of_masks) / sizeof (of_masks[0]); ++of_idx)
> +    {
> +      of_mask = of_masks[of_idx];
> +      for (j = 0; j < 160; ++j)
> +        {
> +          for (i = 1; i <= 161; i += (32 / sizeof (CHAR)))
> +            {
> +              do_test_n (j, 0, i, of_mask, 0, 127, 0);
> +              do_test_n (j, 0, i, of_mask, 0, 127, 1);
> +              do_test_n (j, 0, i, of_mask, 0, 127, -1);
> +
> +              do_test_n (j, 0, i, of_mask - j / 2, 0, 127, 0);
> +              do_test_n (j, 0, i, of_mask - j * 2, 0, 127, 1);
> +              do_test_n (j, 0, i, of_mask - j, 0, 127, -1);
> +
> +              do_test_n (j / 2, j, i, of_mask, 0, 127, 0);
> +              do_test_n (j / 2, j, i, of_mask, 0, 127, 1);
> +              do_test_n (j / 2, j, i, of_mask, 0, 127, -1);
> +
> +              do_test_n (j / 2, j, i, of_mask - j, 0, 127, 0);
> +              do_test_n (j / 2, j, i, of_mask - j / 2, 0, 127, 1);
> +              do_test_n (j / 2, j, i, of_mask - j * 2, 0, 127, -1);
> +
> +              do_test_n (0, j, i, of_mask - j * 2, 0, 127, 0);
> +              do_test_n (0, j, i, of_mask - j, 0, 127, 1);
> +              do_test_n (0, j, i, of_mask - j / 2, 0, 127, -1);
> +
> +              do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, 0);
> +              do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, 1);
> +              do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, -1);
> +
> +              do_test_n (getpagesize () - j - 1, 0, i, of_mask - j / 2, 0, 127,
> +                         0);
> +              do_test_n (getpagesize () - j - 1, 0, i, of_mask - j * 2, 0, 127,
> +                         1);
> +              do_test_n (getpagesize () - j - 1, 0, i, of_mask - j, 0, 127,
> +                         -1);
> +
> +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
> +                         of_mask, 0, 127, 0);
> +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
> +                         of_mask, 0, 127, 1);
> +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
> +                         of_mask, 0, 127, -1);
> +
> +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
> +                         of_mask - j, 0, 127, 0);
> +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
> +                         of_mask - j / 2, 0, 127, 1);
> +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
> +                         of_mask - j * 2, 0, 127, -1);
> +            }
> +
> +          for (i = 1; i < TEST_LEN; i += i)
> +            {
> +              do_test_n (j, 0, i - 1, of_mask, 0, 127, 0);
> +              do_test_n (j, 0, i - 1, of_mask, 0, 127, 1);
> +              do_test_n (j, 0, i - 1, of_mask, 0, 127, -1);
> +
> +              do_test_n (j, 0, i - 1, of_mask - j / 2, 0, 127, 0);
> +              do_test_n (j, 0, i - 1, of_mask - j * 2, 0, 127, 1);
> +              do_test_n (j, 0, i - 1, of_mask - j, 0, 127, -1);
> +
> +              do_test_n (j / 2, j, i - 1, of_mask, 0, 127, 0);
> +              do_test_n (j / 2, j, i - 1, of_mask, 0, 127, 1);
> +              do_test_n (j / 2, j, i - 1, of_mask, 0, 127, -1);
> +
> +              do_test_n (j / 2, j, i - 1, of_mask - j, 0, 127, 0);
> +              do_test_n (j / 2, j, i - 1, of_mask - j / 2, 0, 127, 1);
> +              do_test_n (j / 2, j, i - 1, of_mask - j * 2, 0, 127, -1);
> +
> +              do_test_n (0, j, i - 1, of_mask - j * 2, 0, 127, 0);
> +              do_test_n (0, j, i - 1, of_mask - j, 0, 127, 1);
> +              do_test_n (0, j, i - 1, of_mask - j / 2, 0, 127, -1);
> +
> +              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127, 0);
> +              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127, 1);
> +              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127,
> +                         -1);
> +
> +              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j / 2, 0,
> +                         127, 0);
> +              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j * 2, 0,
> +                         127, 1);
> +              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j, 0, 127,
> +                         -1);
> +
> +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
> +                         i - 1, of_mask, 0, 127, 0);
> +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
> +                         i - 1, of_mask, 0, 127, 1);
> +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
> +                         i - 1, of_mask, 0, 127, -1);
> +
> +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
> +                         i - 1, of_mask - j, 0, 127, 0);
> +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
> +                         i - 1, of_mask - j / 2, 0, 127, 1);
> +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
> +                         i - 1, of_mask - j * 2, 0, 127, -1);
> +            }
> +        }
> +    }
> +}
> +
>  int
>  test_main (void)
>  {
> -  size_t i;
> +  size_t i, j;
>
>    test_init ();
>
> @@ -470,6 +599,99 @@ test_main (void)
>        do_test_limit (0, 0, 15 - i, 16 - i, 255, -1);
>      }
>
> +  for (j = 0; j < 160; ++j)
> +    {
> +      for (i = 0; i < TEST_LEN;)
> +        {
> +          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, 0);
> +          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, 1);
> +          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, -1);
> +
> +          do_test_n (getpagesize () - j - 1, 0, i, i, 0, 127, 0);
> +          do_test_n (getpagesize () - j - 1, 0, i, i - 1, 0, 127, 0);
> +
> +          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, 0);
> +          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, 1);
> +          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, -1);
> +
> +          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, 0);
> +          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, 1);
> +          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, -1);
> +
> +          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, 0);
> +          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, 1);
> +          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, -1);
> +
> +          do_test_n (getpagesize () - j - 1, j, i, i, 0, 127, 0);
> +          do_test_n (getpagesize () - j - 1, j, i, i - 1, 0, 127, 0);
> +
> +          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, 0);
> +          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, 1);
> +          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, -1);
> +
> +          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, 0);
> +          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, 1);
> +          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, -1);
> +
> +          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, 0);
> +          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, 1);
> +          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, -1);
> +
> +          do_test_n (0, getpagesize () - j - 1, i, i, 0, 127, 0);
> +          do_test_n (0, getpagesize () - j - 1, i, i - 1, 0, 127, 0);
> +
> +          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 0);
> +          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 1);
> +          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, -1);
> +
> +          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 0);
> +          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 1);
> +          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, -1);
> +
> +          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, 0);
> +          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, 1);
> +          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, -1);
> +
> +          do_test_n (j, getpagesize () - j - 1, i, i, 0, 127, 0);
> +          do_test_n (j, getpagesize () - j - 1, i, i - 1, 0, 127, 0);
> +
> +          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 0);
> +          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 1);
> +          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, -1);
> +
> +          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 0);
> +          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 1);
> +          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, -1);
> +          if (i < 32)
> +            {
> +              i += 1;
> +            }
> +          else if (i < 161)
> +            {
> +              i += 7;
> +            }
> +          else if (i + 161 < TEST_LEN)
> +            {
> +              i += 31;
> +              i *= 17;
> +              i /= 16;
> +              if (i + 161 > TEST_LEN)
> +                {
> +                  i = TEST_LEN - 160;
> +                }
> +            }
> +          else if (i + 32 < TEST_LEN)
> +            {
> +              i += 7;
> +            }
> +          else
> +            {
> +              i += 1;
> +            }
> +        }
> +    }
> +
> +  check_overflow ();
>    do_random_tests ();
>    return ret;
>  }
> --
> 2.25.1
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 5/7] x86: Optimize strcmp-avx2.S
  2022-01-10  0:27   ` [PATCH v2 5/7] x86: Optimize strcmp-avx2.S Noah Goldstein
@ 2022-01-10  0:41     ` H.J. Lu
  2022-01-10  1:06       ` Noah Goldstein
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2022-01-10  0:41 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library

On Sun, Jan 9, 2022 at 4:31 PM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Optimizations are primarily to the loop logic and how the page cross
> logic interacts with the loop.
>
> The page cross logic is at times more expensive for short strings near
> the end of a page but not crossing the page. This is done to retest
> the page cross conditions with a non-faulty check and to improve the
> logic for entering the loop afterwards. This only affects particular cases,
> however, and is generally made up for by more than 10x improvements on
> the transition from the page cross -> loop case.
>
> The non-page cross cases are improved most for smaller sizes [0, 128]
> and go about even for (128, 4096]. The loop page cross logic is
> improved so some more significant speedup is seen there as well.
>
> test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
>
> Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> ---
>  sysdeps/x86_64/multiarch/strcmp-avx2.S | 1590 ++++++++++++++----------
>  1 file changed, 939 insertions(+), 651 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> index 9c73b5899d..28d6a0025a 100644
> --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> @@ -26,35 +26,57 @@
>
>  # define PAGE_SIZE     4096
>
> -/* VEC_SIZE = Number of bytes in a ymm register */
> +       /* VEC_SIZE = Number of bytes in a ymm register.  */
>  # define VEC_SIZE      32
>
> -/* Shift for dividing by (VEC_SIZE * 4).  */
> -# define DIVIDE_BY_VEC_4_SHIFT 7
> -# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
> -#  error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
> -# endif
> +# define VMOVU vmovdqu
> +# define VMOVA vmovdqa
>
>  # ifdef USE_AS_WCSCMP
> -/* Compare packed dwords.  */
> +       /* Compare packed dwords.  */
>  #  define VPCMPEQ      vpcmpeqd
> -/* Compare packed dwords and store minimum.  */
> +       /* Compare packed dwords and store minimum.  */
>  #  define VPMINU       vpminud
> -/* 1 dword char == 4 bytes.  */
> +       /* 1 dword char == 4 bytes.  */
>  #  define SIZE_OF_CHAR 4
>  # else
> -/* Compare packed bytes.  */
> +       /* Compare packed bytes.  */
>  #  define VPCMPEQ      vpcmpeqb
> -/* Compare packed bytes and store minimum.  */
> +       /* Compare packed bytes and store minimum.  */
>  #  define VPMINU       vpminub
> -/* 1 byte char == 1 byte.  */
> +       /* 1 byte char == 1 byte.  */
>  #  define SIZE_OF_CHAR 1
>  # endif
>
> +# ifdef USE_AS_STRNCMP
> +#  define LOOP_REG     r9d
> +#  define LOOP_REG64   r9
> +
> +#  define OFFSET_REG8  r9b
> +#  define OFFSET_REG   r9d
> +#  define OFFSET_REG64 r9
> +# else
> +#  define LOOP_REG     edx
> +#  define LOOP_REG64   rdx
> +
> +#  define OFFSET_REG8  dl
> +#  define OFFSET_REG   edx
> +#  define OFFSET_REG64 rdx
> +# endif
> +
>  # ifndef VZEROUPPER
>  #  define VZEROUPPER   vzeroupper
>  # endif
>
> +# if defined USE_AS_STRNCMP
> +#  define VEC_OFFSET   0
> +# else
> +#  define VEC_OFFSET   (-VEC_SIZE)
> +# endif
> +
> +# define xmmZERO       xmm15
> +# define ymmZERO       ymm15
> +
>  # ifndef SECTION
>  #  define SECTION(p)   p##.avx
>  # endif
> @@ -79,783 +101,1049 @@
>     the maximum offset is reached before a difference is found, zero is
>     returned.  */
>
> -       .section SECTION(.text),"ax",@progbits
> -ENTRY (STRCMP)
> +       .section SECTION(.text), "ax", @progbits
> +ENTRY(STRCMP)
>  # ifdef USE_AS_STRNCMP
> -       /* Check for simple cases (0 or 1) in offset.  */
> +#  ifdef __ILP32__
> +       /* Clear the upper 32 bits.  */
> +       movl    %edx, %edx
> +#  endif
>         cmp     $1, %RDX_LP
> -       je      L(char0)
> -       jb      L(zero)
> +       /* Signed comparison intentional. We use this branch to also
> +          test cases where length >= 2^63. These very large sizes can be
> +          handled with strcmp as there is no way for that length to
> +          actually bound the buffer.  */
> +       jle     L(one_or_less)
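
In scalar terms the length dispatch above reads roughly as follows (an illustrative sketch; the function name is made up, `n` stands for the strncmp length in RDX, and the fallback calls stand in for the jumps to the unbounded routines):

    #include <stdint.h>
    #include <string.h>

    int
    strncmp_prologue (const char *s1, const char *s2, uint64_t n)
    {
      /* Signed n <= 1 catches n == 0, n == 1, and every n >= 2^63.  */
      if ((int64_t) n <= 1)
        {
          if (n == 0)
            return 0;                   /* L(ret_zero)  */
          if (n != 1)                   /* n >= 2^63: the 'nbe' branch  */
            return strcmp (s1, s2);     /* such an n cannot bound real memory  */
          return ((unsigned char) s1[0]) - ((unsigned char) s2[0]);
        }
      /* Otherwise continue with the vectorized comparison.  */
      return strncmp (s1, s2, n);
    }
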
>  #  ifdef USE_AS_WCSCMP
> -#  ifndef __ILP32__
>         movq    %rdx, %rcx
> -       /* Check if length could overflow when multiplied by
> -          sizeof(wchar_t). Checking top 8 bits will cover all potential
> -          overflow cases as well as redirect cases where its impossible to
> -          length to bound a valid memory region. In these cases just use
> -          'wcscmp'.  */
> +
> +       /* Multiplying length by sizeof(wchar_t) can result in overflow.
> +          Check if that is possible. All cases where overflow is possible
> +          are cases where length is large enough that it can never be a
> +          bound on valid memory so just use wcscmp.  */
>         shrq    $56, %rcx
>         jnz     __wcscmp_avx2
> +
> +       leaq    (, %rdx, 4), %rdx
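
The wide-character prologue amounts to something like this sketch (names are illustrative; `n` is the wcsncmp length in characters):

    #include <stdint.h>
    #include <wchar.h>

    int
    wcsncmp_prologue (const wchar_t *s1, const wchar_t *s2, uint64_t n)
    {
      /* If any of the top 8 bits of n are set, n * sizeof (wchar_t) could
         overflow, and such an n can never bound valid memory anyway, so
         fall back to the unbounded comparison (shrq $56 / jnz).  */
      if (n >> 56)
        return wcscmp (s1, s2);
      /* Otherwise n * 4 fits in 64 bits and the bounded vector code uses
         the byte count directly (leaq (, %rdx, 4), %rdx).  */
      return wcsncmp (s1, s2, n);
    }
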
>  #  endif
> -       /* Convert units: from wide to byte char.  */
> -       shl     $2, %RDX_LP
> -#  endif
> -       /* Register %r11 tracks the maximum offset.  */
> -       mov     %RDX_LP, %R11_LP
>  # endif
> +       vpxor   %xmmZERO, %xmmZERO, %xmmZERO
>         movl    %edi, %eax
> -       xorl    %edx, %edx
> -       /* Make %xmm7 (%ymm7) all zeros in this function.  */
> -       vpxor   %xmm7, %xmm7, %xmm7
>         orl     %esi, %eax
> -       andl    $(PAGE_SIZE - 1), %eax
> -       cmpl    $(PAGE_SIZE - (VEC_SIZE * 4)), %eax
> -       jg      L(cross_page)
> -       /* Start comparing 4 vectors.  */
> -       vmovdqu (%rdi), %ymm1
> -       VPCMPEQ (%rsi), %ymm1, %ymm0
> -       VPMINU  %ymm1, %ymm0, %ymm0
> -       VPCMPEQ %ymm7, %ymm0, %ymm0
> -       vpmovmskb %ymm0, %ecx
> -       testl   %ecx, %ecx
> -       je      L(next_3_vectors)
> -       tzcntl  %ecx, %edx
> +       sall    $20, %eax
> +       /* Check if s1 or s2 may cross a page in the next 4x VEC loads.  */
> +       cmpl    $((PAGE_SIZE -(VEC_SIZE * 4)) << 20), %eax
> +       ja      L(page_cross)
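
A rough C equivalent of that combined check (PAGE_SIZE and VEC_SIZE as defined above, function name illustrative; the `sall $20` plus shifted-constant compare is just a branch-free way of masking down to the page offset):

    #include <stdint.h>

    #define PAGE_SIZE 4096
    #define VEC_SIZE 32

    /* True if s1 or s2 is within 4 * VEC_SIZE bytes of the end of its
       page, i.e. a 4x VEC load might fault.  OR-ing the pointers
       over-approximates (false positives are possible), which is why
       L(page_cross) re-tests the two offsets individually.  */
    static int
    may_cross_page (const void *s1, const void *s2)
    {
      uintptr_t off = ((uintptr_t) s1 | (uintptr_t) s2) & (PAGE_SIZE - 1);
      return off > PAGE_SIZE - VEC_SIZE * 4;
    }
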
> +
> +L(no_page_cross):
> +       /* Safe to compare 4x vectors.  */
> +       VMOVU   (%rdi), %ymm0
> +       /* 1s where s1 and s2 equal.  */
> +       VPCMPEQ (%rsi), %ymm0, %ymm1
> +       /* 1s at null CHAR.  */
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       /* 1s where s1 and s2 equal AND not null CHAR.  */
> +       vpandn  %ymm1, %ymm2, %ymm1
> +
> +       /* All 1s -> keep going, any 0s -> return.  */
> +       vpmovmskb %ymm1, %ecx
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the mismatched index (%rdx) is after the maximum
> -          offset (%r11).   */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       cmpq    $VEC_SIZE, %rdx
> +       jbe     L(vec_0_test_len)
>  # endif
> +
> +       /* All 1s means every byte matched, so incl will overflow the
> +          mask to zero in that case. Otherwise the carry stops at the
> +          position of the first mismatch.  */
> +       incl    %ecx
> +       jz      L(more_3x_vec)
> +
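
The incl/tzcnt idiom is easier to see in scalar form (a sketch with an illustrative name; `mask` is the vpmovmskb result with bit i set iff byte i matched and was not the null terminator):

    #include <stdint.h>

    /* Returns -1 when all 32 bytes matched and none was the terminator
       (keep scanning); otherwise the index of the first byte that
       differed or ended the string.  */
    static int
    first_diff (uint32_t mask)
    {
      mask += 1;                    /* an all-ones mask wraps to 0  */
      if (mask == 0)
        return -1;                  /* jz L(more_3x_vec)  */
      /* Adding 1 cleared the trailing run of ones, so the lowest set bit
         is exactly the first position that mismatched or hit the null.  */
      return __builtin_ctz (mask);  /* tzcntl in L(return_vec_0)  */
    }
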
> +       .p2align 4,, 4
> +L(return_vec_0):
> +       tzcntl  %ecx, %ecx
>  # ifdef USE_AS_WCSCMP
> +       movl    (%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       je      L(return)
> -L(wcscmp_return):
> +       cmpl    (%rsi, %rcx), %edx
> +       je      L(ret0)
>         setl    %al
>         negl    %eax
>         orl     $1, %eax
> -L(return):
>  # else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> +       movzbl  (%rdi, %rcx), %eax
> +       movzbl  (%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
>  # endif
> +L(ret0):
>  L(return_vzeroupper):
>         ZERO_UPPER_VEC_REGISTERS_RETURN
>
> -       .p2align 4
> -L(return_vec_size):
> -       tzcntl  %ecx, %edx
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after
> -          the maximum offset (%r11).  */
> -       addq    $VEC_SIZE, %rdx
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -#  ifdef USE_AS_WCSCMP
> +       .p2align 4,, 8
> +L(vec_0_test_len):
> +       notl    %ecx
> +       bzhil   %edx, %ecx, %eax
> +       jnz     L(return_vec_0)
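
A scalar sketch of this short-length check (only reached when the length is at most VEC_SIZE; `eq_mask` and `n` are illustrative names for the match mask and the length):

    #include <stdint.h>

    /* Nonzero iff a mismatch or null terminator occurs within the first
       n bytes; bzhil does the masking and sets ZF in a single step.  */
    static uint32_t
    diff_within_len (uint32_t eq_mask, unsigned int n)
    {
      uint32_t bad = ~eq_mask;                           /* notl %ecx  */
      return bad & (uint32_t) (((uint64_t) 1 << n) - 1); /* bzhil %edx  */
    }

A zero result falls through to L(ret_zero): the limit was reached without finding a difference.
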
> +       /* Align if it will cross a fetch block.  */
> +       .p2align 4,, 2
> +L(ret_zero):
>         xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> -#  endif
> -# else
> +       VZEROUPPER_RETURN
> +
> +       .p2align 4,, 5
> +L(one_or_less):
> +       jb      L(ret_zero)
>  #  ifdef USE_AS_WCSCMP
> +       /* 'nbe' covers the case where length is negative (large
> +          unsigned).  */
> +       jnbe    __wcscmp_avx2
> +       movl    (%rdi), %edx
>         xorl    %eax, %eax
> -       movl    VEC_SIZE(%rdi, %rdx), %ecx
> -       cmpl    VEC_SIZE(%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> +       cmpl    (%rsi), %edx
> +       je      L(ret1)
> +       setl    %al
> +       negl    %eax
> +       orl     $1, %eax
>  #  else
> -       movzbl  VEC_SIZE(%rdi, %rdx), %eax
> -       movzbl  VEC_SIZE(%rsi, %rdx), %edx
> -       subl    %edx, %eax
> +       /* 'nbe' covers the case where length is negative (large
> +          unsigned).  */
> +
> +       jnbe    __strcmp_avx2
> +       movzbl  (%rdi), %eax
> +       movzbl  (%rsi), %ecx
> +       subl    %ecx, %eax
>  #  endif
> +L(ret1):
> +       ret
>  # endif
> -       VZEROUPPER_RETURN
>
> -       .p2align 4
> -L(return_2_vec_size):
> -       tzcntl  %ecx, %edx
> +       .p2align 4,, 10
> +L(return_vec_1):
> +       tzcntl  %ecx, %ecx
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is
> -          after the maximum offset (%r11).  */
> -       addq    $(VEC_SIZE * 2), %rdx
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -#  ifdef USE_AS_WCSCMP
> +       /* rdx must be > CHAR_PER_VEC so it is safe to subtract without
> +          fear of overflow.  */
> +       addq    $-VEC_SIZE, %rdx
> +       cmpq    %rcx, %rdx
> +       jbe     L(ret_zero)
> +# endif
> +# ifdef USE_AS_WCSCMP
> +       movl    VEC_SIZE(%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    VEC_SIZE(%rsi, %rcx), %edx
> +       je      L(ret2)
> +       setl    %al
> +       negl    %eax
> +       orl     $1, %eax
>  # else
> -#  ifdef USE_AS_WCSCMP
> -       xorl    %eax, %eax
> -       movl    (VEC_SIZE * 2)(%rdi, %rdx), %ecx
> -       cmpl    (VEC_SIZE * 2)(%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (VEC_SIZE * 2)(%rdi, %rdx), %eax
> -       movzbl  (VEC_SIZE * 2)(%rsi, %rdx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       movzbl  VEC_SIZE(%rdi, %rcx), %eax
> +       movzbl  VEC_SIZE(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
>  # endif
> +L(ret2):
>         VZEROUPPER_RETURN
>
> -       .p2align 4
> -L(return_3_vec_size):
> -       tzcntl  %ecx, %edx
> +       .p2align 4,, 10
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is
> -          after the maximum offset (%r11).  */
> -       addq    $(VEC_SIZE * 3), %rdx
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -#  ifdef USE_AS_WCSCMP
> +L(return_vec_3):
> +       salq    $32, %rcx
> +# endif
> +
> +L(return_vec_2):
> +# ifndef USE_AS_STRNCMP
> +       tzcntl  %ecx, %ecx
> +# else
> +       tzcntq  %rcx, %rcx
> +       cmpq    %rcx, %rdx
> +       jbe     L(ret_zero)
> +# endif
> +
> +# ifdef USE_AS_WCSCMP
> +       movl    (VEC_SIZE * 2)(%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    (VEC_SIZE * 2)(%rsi, %rcx), %edx
> +       je      L(ret3)
> +       setl    %al
> +       negl    %eax
> +       orl     $1, %eax
>  # else
> +       movzbl  (VEC_SIZE * 2)(%rdi, %rcx), %eax
> +       movzbl  (VEC_SIZE * 2)(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +# endif
> +L(ret3):
> +       VZEROUPPER_RETURN
> +
> +# ifndef USE_AS_STRNCMP
> +       .p2align 4,, 10
> +L(return_vec_3):
> +       tzcntl  %ecx, %ecx
>  #  ifdef USE_AS_WCSCMP
> +       movl    (VEC_SIZE * 3)(%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (VEC_SIZE * 3)(%rdi, %rdx), %ecx
> -       cmpl    (VEC_SIZE * 3)(%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> +       cmpl    (VEC_SIZE * 3)(%rsi, %rcx), %edx
> +       je      L(ret4)
> +       setl    %al
> +       negl    %eax
> +       orl     $1, %eax
>  #  else
> -       movzbl  (VEC_SIZE * 3)(%rdi, %rdx), %eax
> -       movzbl  (VEC_SIZE * 3)(%rsi, %rdx), %edx
> -       subl    %edx, %eax
> +       movzbl  (VEC_SIZE * 3)(%rdi, %rcx), %eax
> +       movzbl  (VEC_SIZE * 3)(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
>  #  endif
> -# endif
> +L(ret4):
>         VZEROUPPER_RETURN
> +# endif
> +
> +       .p2align 4,, 10
> +L(more_3x_vec):
> +       /* Safe to compare 4x vectors.  */
> +       VMOVU   VEC_SIZE(%rdi), %ymm0
> +       VPCMPEQ VEC_SIZE(%rsi), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +       jnz     L(return_vec_1)
> +
> +# ifdef USE_AS_STRNCMP
> +       subq    $(VEC_SIZE * 2), %rdx
> +       jbe     L(ret_zero)
> +# endif
> +
> +       VMOVU   (VEC_SIZE * 2)(%rdi), %ymm0
> +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +       jnz     L(return_vec_2)
> +
> +       VMOVU   (VEC_SIZE * 3)(%rdi), %ymm0
> +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +       jnz     L(return_vec_3)
>
> -       .p2align 4
> -L(next_3_vectors):
> -       vmovdqu VEC_SIZE(%rdi), %ymm6
> -       VPCMPEQ VEC_SIZE(%rsi), %ymm6, %ymm3
> -       VPMINU  %ymm6, %ymm3, %ymm3
> -       VPCMPEQ %ymm7, %ymm3, %ymm3
> -       vpmovmskb %ymm3, %ecx
> -       testl   %ecx, %ecx
> -       jne     L(return_vec_size)
> -       vmovdqu (VEC_SIZE * 2)(%rdi), %ymm5
> -       vmovdqu (VEC_SIZE * 3)(%rdi), %ymm4
> -       vmovdqu (VEC_SIZE * 3)(%rsi), %ymm0
> -       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm5, %ymm2
> -       VPMINU  %ymm5, %ymm2, %ymm2
> -       VPCMPEQ %ymm4, %ymm0, %ymm0
> -       VPCMPEQ %ymm7, %ymm2, %ymm2
> -       vpmovmskb %ymm2, %ecx
> -       testl   %ecx, %ecx
> -       jne     L(return_2_vec_size)
> -       VPMINU  %ymm4, %ymm0, %ymm0
> -       VPCMPEQ %ymm7, %ymm0, %ymm0
> -       vpmovmskb %ymm0, %ecx
> -       testl   %ecx, %ecx
> -       jne     L(return_3_vec_size)
> -L(main_loop_header):
> -       leaq    (VEC_SIZE * 4)(%rdi), %rdx
> -       movl    $PAGE_SIZE, %ecx
> -       /* Align load via RAX.  */
> -       andq    $-(VEC_SIZE * 4), %rdx
> -       subq    %rdi, %rdx
> -       leaq    (%rdi, %rdx), %rax
>  # ifdef USE_AS_STRNCMP
> -       /* Starting from this point, the maximum offset, or simply the
> -          'offset', DECREASES by the same amount when base pointers are
> -          moved forward.  Return 0 when:
> -            1) On match: offset <= the matched vector index.
> -            2) On mistmach, offset is before the mistmatched index.
> +       cmpq    $(VEC_SIZE * 2), %rdx
> +       jbe     L(ret_zero)
> +# endif
> +
> +# ifdef USE_AS_WCSCMP
> +       /* Any non-zero positive value that doesn't interfere with 0x1.
>          */
> -       subq    %rdx, %r11
> -       jbe     L(zero)
> -# endif
> -       addq    %rsi, %rdx
> -       movq    %rdx, %rsi
> -       andl    $(PAGE_SIZE - 1), %esi
> -       /* Number of bytes before page crossing.  */
> -       subq    %rsi, %rcx
> -       /* Number of VEC_SIZE * 4 blocks before page crossing.  */
> -       shrq    $DIVIDE_BY_VEC_4_SHIFT, %rcx
> -       /* ESI: Number of VEC_SIZE * 4 blocks before page crossing.   */
> -       movl    %ecx, %esi
> -       jmp     L(loop_start)
> +       movl    $2, %r8d
>
> +# else
> +       xorl    %r8d, %r8d
> +# endif
> +
> +       /* The prepare labels are various entry points from the page
> +          cross logic.  */
> +L(prepare_loop):
> +
> +# ifdef USE_AS_STRNCMP
> +       /* Store N + (VEC_SIZE * 4) and place the check at the beginning of
> +          the loop.  */
> +       leaq    (VEC_SIZE * 2)(%rdi, %rdx), %rdx
> +# endif
> +L(prepare_loop_no_len):
> +
> +       /* Align s1 and adjust s2 accordingly.  */
> +       subq    %rdi, %rsi
> +       andq    $-(VEC_SIZE * 4), %rdi
> +       addq    %rdi, %rsi
> +
> +# ifdef USE_AS_STRNCMP
> +       subq    %rdi, %rdx
> +# endif
> +
> +L(prepare_loop_aligned):
> +       /* eax stores distance from rsi to next page cross. These cases
> +          need to be handled specially as the 4x loop could potentially
> +          read memory past the length of s1 or s2 and across a page
> +          boundary.  */
> +       movl    $-(VEC_SIZE * 4), %eax
> +       subl    %esi, %eax
> +       andl    $(PAGE_SIZE - 1), %eax
> +
> +       /* Loop 4x comparisons at a time.  */
>         .p2align 4
>  L(loop):
> +
> +       /* End condition for strncmp.  */
>  # ifdef USE_AS_STRNCMP
> -       /* Base pointers are moved forward by 4 * VEC_SIZE.  Decrease
> -          the maximum offset (%r11) by the same amount.  */
> -       subq    $(VEC_SIZE * 4), %r11
> -       jbe     L(zero)
> -# endif
> -       addq    $(VEC_SIZE * 4), %rax
> -       addq    $(VEC_SIZE * 4), %rdx
> -L(loop_start):
> -       testl   %esi, %esi
> -       leal    -1(%esi), %esi
> -       je      L(loop_cross_page)
> -L(back_to_loop):
> -       /* Main loop, comparing 4 vectors are a time.  */
> -       vmovdqa (%rax), %ymm0
> -       vmovdqa VEC_SIZE(%rax), %ymm3
> -       VPCMPEQ (%rdx), %ymm0, %ymm4
> -       VPCMPEQ VEC_SIZE(%rdx), %ymm3, %ymm1
> -       VPMINU  %ymm0, %ymm4, %ymm4
> -       VPMINU  %ymm3, %ymm1, %ymm1
> -       vmovdqa (VEC_SIZE * 2)(%rax), %ymm2
> -       VPMINU  %ymm1, %ymm4, %ymm0
> -       vmovdqa (VEC_SIZE * 3)(%rax), %ymm3
> -       VPCMPEQ (VEC_SIZE * 2)(%rdx), %ymm2, %ymm5
> -       VPCMPEQ (VEC_SIZE * 3)(%rdx), %ymm3, %ymm6
> -       VPMINU  %ymm2, %ymm5, %ymm5
> -       VPMINU  %ymm3, %ymm6, %ymm6
> -       VPMINU  %ymm5, %ymm0, %ymm0
> -       VPMINU  %ymm6, %ymm0, %ymm0
> -       VPCMPEQ %ymm7, %ymm0, %ymm0
> -
> -       /* Test each mask (32 bits) individually because for VEC_SIZE
> -          == 32 is not possible to OR the four masks and keep all bits
> -          in a 64-bit integer register, differing from SSE2 strcmp
> -          where ORing is possible.  */
> -       vpmovmskb %ymm0, %ecx
> +       subq    $(VEC_SIZE * 4), %rdx
> +       jbe     L(ret_zero)
> +# endif
> +
> +       subq    $-(VEC_SIZE * 4), %rdi
> +       subq    $-(VEC_SIZE * 4), %rsi
> +
> +       /* Check if rsi loads will cross a page boundary.  */
> +       addl    $-(VEC_SIZE * 4), %eax
> +       jnb     L(page_cross_during_loop)
> +
> +       /* Loop entry after handling page cross during loop.  */
> +L(loop_skip_page_cross_check):
> +       VMOVA   (VEC_SIZE * 0)(%rdi), %ymm0
> +       VMOVA   (VEC_SIZE * 1)(%rdi), %ymm2
> +       VMOVA   (VEC_SIZE * 2)(%rdi), %ymm4
> +       VMOVA   (VEC_SIZE * 3)(%rdi), %ymm6
> +
> +       /* ymm1 all 1s where s1 and s2 equal. All 0s otherwise.  */
> +       VPCMPEQ (VEC_SIZE * 0)(%rsi), %ymm0, %ymm1
> +
> +       VPCMPEQ (VEC_SIZE * 1)(%rsi), %ymm2, %ymm3
> +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
> +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
> +
> +
> +       /* If there is any mismatch or null CHAR the result is a 0 CHAR,
> +          otherwise non-zero.  */
> +       vpand   %ymm0, %ymm1, %ymm1
> +
> +
> +       vpand   %ymm2, %ymm3, %ymm3
> +       vpand   %ymm4, %ymm5, %ymm5
> +       vpand   %ymm6, %ymm7, %ymm7
> +
> +       VPMINU  %ymm1, %ymm3, %ymm3
> +       VPMINU  %ymm5, %ymm7, %ymm7
> +
> +       /* Reduce all 0 CHARs for the 4x VEC into ymm7.  */
> +       VPMINU  %ymm3, %ymm7, %ymm7
> +
> +       /* If any 0 CHAR then done.  */
> +       VPCMPEQ %ymm7, %ymmZERO, %ymm7
> +       vpmovmskb %ymm7, %LOOP_REG
> +       testl   %LOOP_REG, %LOOP_REG
> +       jz      L(loop)
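
Per byte position, the vpand / VPMINU reduction in the loop body is roughly equivalent to this scalar check (illustrative sketch over one 4x VEC block; names are made up):

    #include <stddef.h>

    /* Nonzero if a 4x VEC block contains a mismatch or a null
       terminator, i.e. the loop must stop and locate it.  */
    static int
    block_has_diff_or_null (const unsigned char *s1, const unsigned char *s2,
                            size_t block_len)
    {
      for (size_t i = 0; i < block_len; i++)
        {
          /* vpand: keep the s1 byte where the strings match, else 0.  */
          unsigned char merged = (s1[i] == s2[i]) ? s1[i] : 0;
          /* VPMINU folds the four vectors; any zero survives the minimum
             and the final VPCMPEQ-with-zero / vpmovmskb flags it.  */
          if (merged == 0)
            return 1;
        }
      return 0;
    }
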
> +
> +       /* Find which VEC has the mismatch or end of string.  */
> +       VPCMPEQ %ymm1, %ymmZERO, %ymm1
> +       vpmovmskb %ymm1, %ecx
>         testl   %ecx, %ecx
> -       je      L(loop)
> -       VPCMPEQ %ymm7, %ymm4, %ymm0
> -       vpmovmskb %ymm0, %edi
> -       testl   %edi, %edi
> -       je      L(test_vec)
> -       tzcntl  %edi, %ecx
> +       jnz     L(return_vec_0_end)
> +
> +
> +       VPCMPEQ %ymm3, %ymmZERO, %ymm3
> +       vpmovmskb %ymm3, %ecx
> +       testl   %ecx, %ecx
> +       jnz     L(return_vec_1_end)
> +
> +L(return_vec_2_3_end):
>  # ifdef USE_AS_STRNCMP
> -       cmpq    %rcx, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +       subq    $(VEC_SIZE * 2), %rdx
> +       jbe     L(ret_zero_end)
> +# endif
> +
> +       VPCMPEQ %ymm5, %ymmZERO, %ymm5
> +       vpmovmskb %ymm5, %ecx
> +       testl   %ecx, %ecx
> +       jnz     L(return_vec_2_end)
> +
> +       /* LOOP_REG contains matches for null/mismatch from the loop. If
> +          VEC 0,1,and 2 all have no null and no mismatches then mismatch
> +          must entirely be from VEC 3 which is fully represented by
> +          LOOP_REG.  */
> +       tzcntl  %LOOP_REG, %LOOP_REG
> +
> +# ifdef USE_AS_STRNCMP
> +       subl    $-(VEC_SIZE), %LOOP_REG
> +       cmpq    %LOOP_REG64, %rdx
> +       jbe     L(ret_zero_end)
> +# endif
> +
> +# ifdef USE_AS_WCSCMP
> +       movl    (VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %ecx
>         xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    (VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
> +       je      L(ret5)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
>  # else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       movzbl  (VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %eax
> +       movzbl  (VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> +L(ret5):
>         VZEROUPPER_RETURN
>
> -       .p2align 4
> -L(test_vec):
>  # ifdef USE_AS_STRNCMP
> -       /* The first vector matched.  Return 0 if the maximum offset
> -          (%r11) <= VEC_SIZE.  */
> -       cmpq    $VEC_SIZE, %r11
> -       jbe     L(zero)
> +       .p2align 4,, 2
> +L(ret_zero_end):
> +       xorl    %eax, %eax
> +       VZEROUPPER_RETURN
>  # endif
> -       VPCMPEQ %ymm7, %ymm1, %ymm1
> -       vpmovmskb %ymm1, %ecx
> -       testl   %ecx, %ecx
> -       je      L(test_2_vec)
> -       tzcntl  %ecx, %edi
> +
> +
> +       /* The L(return_vec_N_end) labels differ from L(return_vec_N) in that
> +          they use the value of `r8` to negate the return value. This is
> +          because the page cross logic can swap `rdi` and `rsi`.  */
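
In C the byte-string case of that conditional negation looks roughly like this sketch (function name illustrative; r8 is 0 when the pointers kept their roles and -1 when the page cross logic swapped them):

    #include <stdint.h>

    /* diff is c1 - c2; r8 is 0 (pointers kept) or -1 (pointers swapped).  */
    static int32_t
    apply_swap_sign (int32_t diff, int32_t r8)
    {
      /* (x ^ 0) - 0 == x, while (x ^ -1) - (-1) == ~x + 1 == -x.  */
      return (diff ^ r8) - r8;
    }

For wcscmp the same slot is fed 2 or -4 instead, chosen so that xor-ing the 0 / -1 result of setl/negl flips or keeps its sign without ever producing zero.
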
> +       .p2align 4,, 10
>  # ifdef USE_AS_STRNCMP
> -       addq    $VEC_SIZE, %rdi
> -       cmpq    %rdi, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +L(return_vec_1_end):
> +       salq    $32, %rcx
> +# endif
> +L(return_vec_0_end):
> +# ifndef USE_AS_STRNCMP
> +       tzcntl  %ecx, %ecx
> +# else
> +       tzcntq  %rcx, %rcx
> +       cmpq    %rcx, %rdx
> +       jbe     L(ret_zero_end)
> +# endif
> +
> +# ifdef USE_AS_WCSCMP
> +       movl    (%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (%rsi, %rdi), %ecx
> -       cmpl    (%rdx, %rdi), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rdi), %eax
> -       movzbl  (%rdx, %rdi), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    (%rsi, %rcx), %edx
> +       je      L(ret6)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
>  # else
> +       movzbl  (%rdi, %rcx), %eax
> +       movzbl  (%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
> +# endif
> +L(ret6):
> +       VZEROUPPER_RETURN
> +
> +# ifndef USE_AS_STRNCMP
> +       .p2align 4,, 10
> +L(return_vec_1_end):
> +       tzcntl  %ecx, %ecx
>  #  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +       movl    VEC_SIZE(%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    VEC_SIZE(%rsi, %rdi), %ecx
> -       cmpl    VEC_SIZE(%rdx, %rdi), %ecx
> -       jne     L(wcscmp_return)
> +       cmpl    VEC_SIZE(%rsi, %rcx), %edx
> +       je      L(ret7)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
>  #  else
> -       movzbl  VEC_SIZE(%rax, %rdi), %eax
> -       movzbl  VEC_SIZE(%rdx, %rdi), %edx
> -       subl    %edx, %eax
> +       movzbl  VEC_SIZE(%rdi, %rcx), %eax
> +       movzbl  VEC_SIZE(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  #  endif
> -# endif
> +L(ret7):
>         VZEROUPPER_RETURN
> +# endif
>
> -       .p2align 4
> -L(test_2_vec):
> +       .p2align 4,, 10
> +L(return_vec_2_end):
> +       tzcntl  %ecx, %ecx
>  # ifdef USE_AS_STRNCMP
> -       /* The first 2 vectors matched.  Return 0 if the maximum offset
> -          (%r11) <= 2 * VEC_SIZE.  */
> -       cmpq    $(VEC_SIZE * 2), %r11
> -       jbe     L(zero)
> +       cmpq    %rcx, %rdx
> +       jbe     L(ret_zero_page_cross)
>  # endif
> -       VPCMPEQ %ymm7, %ymm5, %ymm5
> -       vpmovmskb %ymm5, %ecx
> -       testl   %ecx, %ecx
> -       je      L(test_3_vec)
> -       tzcntl  %ecx, %edi
> -# ifdef USE_AS_STRNCMP
> -       addq    $(VEC_SIZE * 2), %rdi
> -       cmpq    %rdi, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +# ifdef USE_AS_WCSCMP
> +       movl    (VEC_SIZE * 2)(%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (%rsi, %rdi), %ecx
> -       cmpl    (%rdx, %rdi), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rdi), %eax
> -       movzbl  (%rdx, %rdi), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    (VEC_SIZE * 2)(%rsi, %rcx), %edx
> +       je      L(ret11)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
>  # else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (VEC_SIZE * 2)(%rsi, %rdi), %ecx
> -       cmpl    (VEC_SIZE * 2)(%rdx, %rdi), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (VEC_SIZE * 2)(%rax, %rdi), %eax
> -       movzbl  (VEC_SIZE * 2)(%rdx, %rdi), %edx
> -       subl    %edx, %eax
> -#  endif
> +       movzbl  (VEC_SIZE * 2)(%rdi, %rcx), %eax
> +       movzbl  (VEC_SIZE * 2)(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> +L(ret11):
>         VZEROUPPER_RETURN
>
> -       .p2align 4
> -L(test_3_vec):
> +
> +       /* Page cross in rsi in next 4x VEC.  */
> +
> +       /* TODO: Improve logic here.  */
> +       .p2align 4,, 10
> +L(page_cross_during_loop):
> +       /* eax contains [distance_from_page - (VEC_SIZE * 4)].  */
> +
> +       /* Optimistically rsi and rdi are both aligned, in which case we
> +          don't need any logic here.  */
> +       cmpl    $-(VEC_SIZE * 4), %eax
> +       /* Don't adjust eax before jumping back to the loop; we will
> +          never hit the page cross case again.  */
> +       je      L(loop_skip_page_cross_check)
> +
> +       /* Check if we can safely load a VEC.  */
> +       cmpl    $-(VEC_SIZE * 3), %eax
> +       jle     L(less_1x_vec_till_page_cross)
> +
> +       VMOVA   (%rdi), %ymm0
> +       VPCMPEQ (%rsi), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +       jnz     L(return_vec_0_end)
> +
> +       /* if distance >= 2x VEC then eax > -(VEC_SIZE * 2).  */
> +       cmpl    $-(VEC_SIZE * 2), %eax
> +       jg      L(more_2x_vec_till_page_cross)
> +
> +       .p2align 4,, 4
> +L(less_1x_vec_till_page_cross):
> +       subl    $-(VEC_SIZE * 4), %eax
> +       /* Guaranteed safe to read from rdi - VEC_SIZE here. The only
> +          concerning case is first iteration if incoming s1 was near start
> +          of a page and s2 near end. If s1 was near the start of the page
> +          we already aligned up to nearest VEC_SIZE * 4 so guaranteed safe
> +          to read back -VEC_SIZE. If rdi is truly at the start of a page
> +          here, it means the previous page (rdi - VEC_SIZE) has already
> +          been loaded earlier so must be valid.  */
> +       VMOVU   -VEC_SIZE(%rdi, %rax), %ymm0
> +       VPCMPEQ -VEC_SIZE(%rsi, %rax), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +
> +       /* Mask of potentially valid bits. The lower bits can be out of
> +          range comparisons (but safe regarding page crosses).  */
> +       movl    $-1, %r10d
> +       shlxl   %esi, %r10d, %r10d
> +       notl    %ecx
> +
>  # ifdef USE_AS_STRNCMP
> -       /* The first 3 vectors matched.  Return 0 if the maximum offset
> -          (%r11) <= 3 * VEC_SIZE.  */
> -       cmpq    $(VEC_SIZE * 3), %r11
> -       jbe     L(zero)
> -# endif
> -       VPCMPEQ %ymm7, %ymm6, %ymm6
> -       vpmovmskb %ymm6, %esi
> -       tzcntl  %esi, %ecx
> +       cmpq    %rax, %rdx
> +       jbe     L(return_page_cross_end_check)
> +# endif
> +       movl    %eax, %OFFSET_REG
> +       addl    $(PAGE_SIZE - VEC_SIZE * 4), %eax
> +
> +       andl    %r10d, %ecx
> +       jz      L(loop_skip_page_cross_check)
> +
> +       .p2align 4,, 3
> +L(return_page_cross_end):
> +       tzcntl  %ecx, %ecx
> +
>  # ifdef USE_AS_STRNCMP
> -       addq    $(VEC_SIZE * 3), %rcx
> -       cmpq    %rcx, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %esi
> -       cmpl    (%rdx, %rcx), %esi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       leal    -VEC_SIZE(%OFFSET_REG64, %rcx), %ecx
> +L(return_page_cross_cmp_mem):
>  # else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +       addl    %OFFSET_REG, %ecx
> +# endif
> +# ifdef USE_AS_WCSCMP
> +       movl    VEC_OFFSET(%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (VEC_SIZE * 3)(%rsi, %rcx), %esi
> -       cmpl    (VEC_SIZE * 3)(%rdx, %rcx), %esi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (VEC_SIZE * 3)(%rax, %rcx), %eax
> -       movzbl  (VEC_SIZE * 3)(%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    VEC_OFFSET(%rsi, %rcx), %edx
> +       je      L(ret8)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
> +# else
> +       movzbl  VEC_OFFSET(%rdi, %rcx), %eax
> +       movzbl  VEC_OFFSET(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> +L(ret8):
>         VZEROUPPER_RETURN
>
> -       .p2align 4
> -L(loop_cross_page):
> -       xorl    %r10d, %r10d
> -       movq    %rdx, %rcx
> -       /* Align load via RDX.  We load the extra ECX bytes which should
> -          be ignored.  */
> -       andl    $((VEC_SIZE * 4) - 1), %ecx
> -       /* R10 is -RCX.  */
> -       subq    %rcx, %r10
> -
> -       /* This works only if VEC_SIZE * 2 == 64. */
> -# if (VEC_SIZE * 2) != 64
> -#  error (VEC_SIZE * 2) != 64
> -# endif
> -
> -       /* Check if the first VEC_SIZE * 2 bytes should be ignored.  */
> -       cmpl    $(VEC_SIZE * 2), %ecx
> -       jge     L(loop_cross_page_2_vec)
> -
> -       vmovdqu (%rax, %r10), %ymm2
> -       vmovdqu VEC_SIZE(%rax, %r10), %ymm3
> -       VPCMPEQ (%rdx, %r10), %ymm2, %ymm0
> -       VPCMPEQ VEC_SIZE(%rdx, %r10), %ymm3, %ymm1
> -       VPMINU  %ymm2, %ymm0, %ymm0
> -       VPMINU  %ymm3, %ymm1, %ymm1
> -       VPCMPEQ %ymm7, %ymm0, %ymm0
> -       VPCMPEQ %ymm7, %ymm1, %ymm1
> -
> -       vpmovmskb %ymm0, %edi
> -       vpmovmskb %ymm1, %esi
> -
> -       salq    $32, %rsi
> -       xorq    %rsi, %rdi
> -
> -       /* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
> -       shrq    %cl, %rdi
> -
> -       testq   %rdi, %rdi
> -       je      L(loop_cross_page_2_vec)
> -       tzcntq  %rdi, %rcx
>  # ifdef USE_AS_STRNCMP
> -       cmpq    %rcx, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +       .p2align 4,, 10
> +L(return_page_cross_end_check):
> +       tzcntl  %ecx, %ecx
> +       leal    -VEC_SIZE(%rax, %rcx), %ecx
> +       cmpl    %ecx, %edx
> +       ja      L(return_page_cross_cmp_mem)
>         xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> -# else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> -# endif
>         VZEROUPPER_RETURN
> +# endif
>
> -       .p2align 4
> -L(loop_cross_page_2_vec):
> -       /* The first VEC_SIZE * 2 bytes match or are ignored.  */
> -       vmovdqu (VEC_SIZE * 2)(%rax, %r10), %ymm2
> -       vmovdqu (VEC_SIZE * 3)(%rax, %r10), %ymm3
> -       VPCMPEQ (VEC_SIZE * 2)(%rdx, %r10), %ymm2, %ymm5
> -       VPMINU  %ymm2, %ymm5, %ymm5
> -       VPCMPEQ (VEC_SIZE * 3)(%rdx, %r10), %ymm3, %ymm6
> -       VPCMPEQ %ymm7, %ymm5, %ymm5
> -       VPMINU  %ymm3, %ymm6, %ymm6
> -       VPCMPEQ %ymm7, %ymm6, %ymm6
> -
> -       vpmovmskb %ymm5, %edi
> -       vpmovmskb %ymm6, %esi
> -
> -       salq    $32, %rsi
> -       xorq    %rsi, %rdi
>
> -       xorl    %r8d, %r8d
> -       /* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
> -       subl    $(VEC_SIZE * 2), %ecx
> -       jle     1f
> -       /* Skip ECX bytes.  */
> -       shrq    %cl, %rdi
> -       /* R8 has number of bytes skipped.  */
> -       movl    %ecx, %r8d
> -1:
> -       /* Before jumping back to the loop, set ESI to the number of
> -          VEC_SIZE * 4 blocks before page crossing.  */
> -       movl    $(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
> -
> -       testq   %rdi, %rdi
> +       .p2align 4,, 10
> +L(more_2x_vec_till_page_cross):
> +       /* If more 2x vec till cross we will complete a full loop
> +          iteration here.  */
> +
> +       VMOVU   VEC_SIZE(%rdi), %ymm0
> +       VPCMPEQ VEC_SIZE(%rsi), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +       jnz     L(return_vec_1_end)
> +
>  # ifdef USE_AS_STRNCMP
> -       /* At this point, if %rdi value is 0, it already tested
> -          VEC_SIZE*4+%r10 byte starting from %rax. This label
> -          checks whether strncmp maximum offset reached or not.  */
> -       je      L(string_nbyte_offset_check)
> -# else
> -       je      L(back_to_loop)
> +       cmpq    $(VEC_SIZE * 2), %rdx
> +       jbe     L(ret_zero_in_loop_page_cross)
>  # endif
> -       tzcntq  %rdi, %rcx
> -       addq    %r10, %rcx
> -       /* Adjust for number of bytes skipped.  */
> -       addq    %r8, %rcx
> +
> +       subl    $-(VEC_SIZE * 4), %eax
> +
> +       /* Safe to include comparisons from lower bytes.  */
> +       VMOVU   -(VEC_SIZE * 2)(%rdi, %rax), %ymm0
> +       VPCMPEQ -(VEC_SIZE * 2)(%rsi, %rax), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +       jnz     L(return_vec_page_cross_0)
> +
> +       VMOVU   -(VEC_SIZE * 1)(%rdi, %rax), %ymm0
> +       VPCMPEQ -(VEC_SIZE * 1)(%rsi, %rax), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +       jnz     L(return_vec_page_cross_1)
> +
>  # ifdef USE_AS_STRNCMP
> -       addq    $(VEC_SIZE * 2), %rcx
> -       subq    %rcx, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +       /* Must check the length here as it might preclude reading the next
> +          page.  */
> +       cmpq    %rax, %rdx
> +       jbe     L(ret_zero_in_loop_page_cross)
> +# endif
> +
> +       /* Finish the loop.  */
> +       VMOVA   (VEC_SIZE * 2)(%rdi), %ymm4
> +       VMOVA   (VEC_SIZE * 3)(%rdi), %ymm6
> +
> +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
> +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
> +       vpand   %ymm4, %ymm5, %ymm5
> +       vpand   %ymm6, %ymm7, %ymm7
> +       VPMINU  %ymm5, %ymm7, %ymm7
> +       VPCMPEQ %ymm7, %ymmZERO, %ymm7
> +       vpmovmskb %ymm7, %LOOP_REG
> +       testl   %LOOP_REG, %LOOP_REG
> +       jnz     L(return_vec_2_3_end)
> +
> +       /* Best for code size to include the unconditional jmp here. If
> +          this case is hot it would be faster to duplicate the
> +          L(return_vec_2_3_end) code as the fall-through and jump back to
> +          the loop on the mismatch comparison.  */
> +       subq    $-(VEC_SIZE * 4), %rdi
> +       subq    $-(VEC_SIZE * 4), %rsi
> +       addl    $(PAGE_SIZE - VEC_SIZE * 8), %eax
> +# ifdef USE_AS_STRNCMP
> +       subq    $(VEC_SIZE * 4), %rdx
> +       ja      L(loop_skip_page_cross_check)
> +L(ret_zero_in_loop_page_cross):
>         xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       VZEROUPPER_RETURN
>  # else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (VEC_SIZE * 2)(%rsi, %rcx), %edi
> -       cmpl    (VEC_SIZE * 2)(%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (VEC_SIZE * 2)(%rax, %rcx), %eax
> -       movzbl  (VEC_SIZE * 2)(%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       jmp     L(loop_skip_page_cross_check)
>  # endif
> -       VZEROUPPER_RETURN
>
> +
> +       .p2align 4,, 10
> +L(return_vec_page_cross_0):
> +       addl    $-VEC_SIZE, %eax
> +L(return_vec_page_cross_1):
> +       tzcntl  %ecx, %ecx
>  # ifdef USE_AS_STRNCMP
> -L(string_nbyte_offset_check):
> -       leaq    (VEC_SIZE * 4)(%r10), %r10
> -       cmpq    %r10, %r11
> -       jbe     L(zero)
> -       jmp     L(back_to_loop)
> +       leal    -VEC_SIZE(%rax, %rcx), %ecx
> +       cmpq    %rcx, %rdx
> +       jbe     L(ret_zero_in_loop_page_cross)
> +# else
> +       addl    %eax, %ecx
>  # endif
>
> -       .p2align 4
> -L(cross_page_loop):
> -       /* Check one byte/dword at a time.  */
>  # ifdef USE_AS_WCSCMP
> -       cmpl    %ecx, %eax
> +       movl    VEC_OFFSET(%rdi, %rcx), %edx
> +       xorl    %eax, %eax
> +       cmpl    VEC_OFFSET(%rsi, %rcx), %edx
> +       je      L(ret9)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
>  # else
> +       movzbl  VEC_OFFSET(%rdi, %rcx), %eax
> +       movzbl  VEC_OFFSET(%rsi, %rcx), %ecx
>         subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> -       jne     L(different)
> -       addl    $SIZE_OF_CHAR, %edx
> -       cmpl    $(VEC_SIZE * 4), %edx
> -       je      L(main_loop_header)
> -# ifdef USE_AS_STRNCMP
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +L(ret9):
> +       VZEROUPPER_RETURN
> +
> +
> +       .p2align 4,, 10
> +L(page_cross):
> +# ifndef USE_AS_STRNCMP
> +       /* If both are VEC aligned we don't need any special logic here.
> +          Only valid for strcmp where the stop condition is guaranteed to be
> +          reachable by just reading memory.  */
> +       testl   $((VEC_SIZE - 1) << 20), %eax
> +       jz      L(no_page_cross)
>  # endif
> +
> +       movl    %edi, %eax
> +       movl    %esi, %ecx
> +       andl    $(PAGE_SIZE - 1), %eax
> +       andl    $(PAGE_SIZE - 1), %ecx
> +
> +       xorl    %OFFSET_REG, %OFFSET_REG
> +
> +       /* Check which is closer to page cross, s1 or s2.  */
> +       cmpl    %eax, %ecx
> +       jg      L(page_cross_s2)
> +
> +       /* The previous page cross check has false positives. Check for
> +          true positive as page cross logic is very expensive.  */
> +       subl    $(PAGE_SIZE - VEC_SIZE * 4), %eax
> +       jbe     L(no_page_cross)
> +
> +       /* Set r8 to not interfere with normal return value (rdi and rsi
> +          did not swap).  */
>  # ifdef USE_AS_WCSCMP
> -       movl    (%rdi, %rdx), %eax
> -       movl    (%rsi, %rdx), %ecx
> +       /* Any non-zero positive value that doesn't interfere with 0x1.
> +        */
> +       movl    $2, %r8d
>  # else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %ecx
> +       xorl    %r8d, %r8d
>  # endif
> -       /* Check null char.  */
> -       testl   %eax, %eax
> -       jne     L(cross_page_loop)
> -       /* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
> -          comparisons.  */
> -       subl    %ecx, %eax
> -# ifndef USE_AS_WCSCMP
> -L(different):
> +
> +       /* Check if less than 1x VEC till page cross.  */
> +       subl    $(VEC_SIZE * 3), %eax
> +       jg      L(less_1x_vec_till_page)
> +
> +       /* If more than 1x VEC till page cross, loop through safely
> +          loadable memory until within 1x VEC of page cross.  */
> +
> +       .p2align 4,, 10
> +L(page_cross_loop):
> +
> +       VMOVU   (%rdi, %OFFSET_REG64), %ymm0
> +       VPCMPEQ (%rsi, %OFFSET_REG64), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +       incl    %ecx
> +
> +       jnz     L(check_ret_vec_page_cross)
> +       addl    $VEC_SIZE, %OFFSET_REG
> +# ifdef USE_AS_STRNCMP
> +       cmpq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross)
>  # endif
> -       VZEROUPPER_RETURN
> +       addl    $VEC_SIZE, %eax
> +       jl      L(page_cross_loop)
> +
> +       subl    %eax, %OFFSET_REG
> +       /* OFFSET_REG has distance to page cross - VEC_SIZE. Guaranteed
> +          to not cross the page so it is safe to load. Since we have already
> +          loaded at least 1 VEC from rsi it is also guaranteed to be safe.
> +        */
> +
> +       VMOVU   (%rdi, %OFFSET_REG64), %ymm0
> +       VPCMPEQ (%rsi, %OFFSET_REG64), %ymm0, %ymm1
> +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> +       vpandn  %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %ecx
> +
> +# ifdef USE_AS_STRNCMP
> +       leal    VEC_SIZE(%OFFSET_REG64), %eax
> +       cmpq    %rax, %rdx
> +       jbe     L(check_ret_vec_page_cross2)
> +       addq    %rdi, %rdx
> +# endif
> +       incl    %ecx
> +       jz      L(prepare_loop_no_len)
>
> +       .p2align 4,, 4
> +L(ret_vec_page_cross):
> +# ifndef USE_AS_STRNCMP
> +L(check_ret_vec_page_cross):
> +# endif
> +       tzcntl  %ecx, %ecx
> +       addl    %OFFSET_REG, %ecx
> +L(ret_vec_page_cross_cont):
>  # ifdef USE_AS_WCSCMP
> -       .p2align 4
> -L(different):
> -       /* Use movl to avoid modifying EFLAGS.  */
> -       movl    $0, %eax
> +       movl    (%rdi, %rcx), %edx
> +       xorl    %eax, %eax
> +       cmpl    (%rsi, %rcx), %edx
> +       je      L(ret12)
>         setl    %al
>         negl    %eax
> -       orl     $1, %eax
> -       VZEROUPPER_RETURN
> +       xorl    %r8d, %eax
> +# else
> +       movzbl  (%rdi, %rcx), %eax
> +       movzbl  (%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> +L(ret12):
> +       VZEROUPPER_RETURN
>
>  # ifdef USE_AS_STRNCMP
> -       .p2align 4
> -L(zero):
> +       .p2align 4,, 10
> +L(check_ret_vec_page_cross2):
> +       incl    %ecx
> +L(check_ret_vec_page_cross):
> +       tzcntl  %ecx, %ecx
> +       addl    %OFFSET_REG, %ecx
> +       cmpq    %rcx, %rdx
> +       ja      L(ret_vec_page_cross_cont)
> +       .p2align 4,, 2
> +L(ret_zero_page_cross):
>         xorl    %eax, %eax
>         VZEROUPPER_RETURN
> +# endif
>
> -       .p2align 4
> -L(char0):
> -#  ifdef USE_AS_WCSCMP
> -       xorl    %eax, %eax
> -       movl    (%rdi), %ecx
> -       cmpl    (%rsi), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rsi), %ecx
> -       movzbl  (%rdi), %eax
> -       subl    %ecx, %eax
> -#  endif
> -       VZEROUPPER_RETURN
> +       .p2align 4,, 4
> +L(page_cross_s2):
> +       /* Ensure this is a true page cross.  */
> +       subl    $(PAGE_SIZE - VEC_SIZE * 4), %ecx
> +       jbe     L(no_page_cross)
> +
> +
> +       movl    %ecx, %eax
> +       movq    %rdi, %rcx
> +       movq    %rsi, %rdi
> +       movq    %rcx, %rsi
> +
> +       /* Set r8 to negate the return value as rdi and rsi were swapped.  */
> +# ifdef USE_AS_WCSCMP
> +       movl    $-4, %r8d
> +# else
> +       movl    $-1, %r8d
>  # endif
> +       xorl    %OFFSET_REG, %OFFSET_REG
>
> -       .p2align 4
> -L(last_vector):
> -       addq    %rdx, %rdi
> -       addq    %rdx, %rsi
> +       /* Check if more than 1x VEC till page cross.  */
> +       subl    $(VEC_SIZE * 3), %eax
> +       jle     L(page_cross_loop)
> +
> +       .p2align 4,, 6
> +L(less_1x_vec_till_page):
> +       /* Find largest load size we can use.  */
> +       cmpl    $16, %eax
> +       ja      L(less_16_till_page)
> +
> +       VMOVU   (%rdi), %xmm0
> +       VPCMPEQ (%rsi), %xmm0, %xmm1
> +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> +       vpandn  %xmm1, %xmm2, %xmm1
> +       vpmovmskb %ymm1, %ecx
> +       incw    %cx
> +       jnz     L(check_ret_vec_page_cross)
> +       movl    $16, %OFFSET_REG
>  # ifdef USE_AS_STRNCMP
> -       subq    %rdx, %r11
> +       cmpq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case0)
> +       subl    %eax, %OFFSET_REG
> +# else
> +       /* Explicit check for 16 byte alignment.  */
> +       subl    %eax, %OFFSET_REG
> +       jz      L(prepare_loop)
>  # endif
> -       tzcntl  %ecx, %edx
> +
> +       VMOVU   (%rdi, %OFFSET_REG64), %xmm0
> +       VPCMPEQ (%rsi, %OFFSET_REG64), %xmm0, %xmm1
> +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> +       vpandn  %xmm1, %xmm2, %xmm1
> +       vpmovmskb %ymm1, %ecx
> +       incw    %cx
> +       jnz     L(check_ret_vec_page_cross)
> +
>  # ifdef USE_AS_STRNCMP
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       addl    $16, %OFFSET_REG
> +       subq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case0)
> +       subq    $-(VEC_SIZE * 4), %rdx
> +
> +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> +# else
> +       leaq    (16 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> +       leaq    (16 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
>  # endif
> -# ifdef USE_AS_WCSCMP
> +       jmp     L(prepare_loop_aligned)
> +
> +# ifdef USE_AS_STRNCMP
> +       .p2align 4,, 2
> +L(ret_zero_page_cross_slow_case0):
>         xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> -# else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> +       ret
>  # endif
> -       VZEROUPPER_RETURN
>
> -       /* Comparing on page boundary region requires special treatment:
> -          It must done one vector at the time, starting with the wider
> -          ymm vector if possible, if not, with xmm. If fetching 16 bytes
> -          (xmm) still passes the boundary, byte comparison must be done.
> -        */
> -       .p2align 4
> -L(cross_page):
> -       /* Try one ymm vector at a time.  */
> -       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> -       jg      L(cross_page_1_vector)
> -L(loop_1_vector):
> -       vmovdqu (%rdi, %rdx), %ymm1
> -       VPCMPEQ (%rsi, %rdx), %ymm1, %ymm0
> -       VPMINU  %ymm1, %ymm0, %ymm0
> -       VPCMPEQ %ymm7, %ymm0, %ymm0
> -       vpmovmskb %ymm0, %ecx
> -       testl   %ecx, %ecx
> -       jne     L(last_vector)
>
> -       addl    $VEC_SIZE, %edx
> +       .p2align 4,, 10
> +L(less_16_till_page):
> +       /* Find largest load size we can use.  */
> +       cmpl    $24, %eax
> +       ja      L(less_8_till_page)
>
> -       addl    $VEC_SIZE, %eax
> -# ifdef USE_AS_STRNCMP
> -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> -          (%r11).  */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -# endif
> -       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> -       jle     L(loop_1_vector)
> -L(cross_page_1_vector):
> -       /* Less than 32 bytes to check, try one xmm vector.  */
> -       cmpl    $(PAGE_SIZE - 16), %eax
> -       jg      L(cross_page_1_xmm)
> -       vmovdqu (%rdi, %rdx), %xmm1
> -       VPCMPEQ (%rsi, %rdx), %xmm1, %xmm0
> -       VPMINU  %xmm1, %xmm0, %xmm0
> -       VPCMPEQ %xmm7, %xmm0, %xmm0
> -       vpmovmskb %xmm0, %ecx
> -       testl   %ecx, %ecx
> -       jne     L(last_vector)
> +       vmovq   (%rdi), %xmm0
> +       vmovq   (%rsi), %xmm1
> +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> +       VPCMPEQ %xmm1, %xmm0, %xmm1
> +       vpandn  %xmm1, %xmm2, %xmm1
> +       vpmovmskb %ymm1, %ecx
> +       incb    %cl
> +       jnz     L(check_ret_vec_page_cross)
>
> -       addl    $16, %edx
> -# ifndef USE_AS_WCSCMP
> -       addl    $16, %eax
> +
> +# ifdef USE_AS_STRNCMP
> +       cmpq    $8, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case0)
>  # endif
> +       movl    $24, %OFFSET_REG
> +       /* Explicit check for 16 byte alignment.  */
> +       subl    %eax, %OFFSET_REG
> +
> +
> +
> +       vmovq   (%rdi, %OFFSET_REG64), %xmm0
> +       vmovq   (%rsi, %OFFSET_REG64), %xmm1
> +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> +       VPCMPEQ %xmm1, %xmm0, %xmm1
> +       vpandn  %xmm1, %xmm2, %xmm1
> +       vpmovmskb %ymm1, %ecx
> +       incb    %cl
> +       jnz     L(check_ret_vec_page_cross)
> +
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> -          (%r11).  */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -# endif
> -
> -L(cross_page_1_xmm):
> -# ifndef USE_AS_WCSCMP
> -       /* Less than 16 bytes to check, try 8 byte vector.  NB: No need
> -          for wcscmp nor wcsncmp since wide char is 4 bytes.   */
> -       cmpl    $(PAGE_SIZE - 8), %eax
> -       jg      L(cross_page_8bytes)
> -       vmovq   (%rdi, %rdx), %xmm1
> -       vmovq   (%rsi, %rdx), %xmm0
> -       VPCMPEQ %xmm0, %xmm1, %xmm0
> -       VPMINU  %xmm1, %xmm0, %xmm0
> -       VPCMPEQ %xmm7, %xmm0, %xmm0
> -       vpmovmskb %xmm0, %ecx
> -       /* Only last 8 bits are valid.  */
> -       andl    $0xff, %ecx
> -       testl   %ecx, %ecx
> -       jne     L(last_vector)
> +       addl    $8, %OFFSET_REG
> +       subq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case0)
> +       subq    $-(VEC_SIZE * 4), %rdx
>
> -       addl    $8, %edx
> -       addl    $8, %eax
> +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> +# else
> +       leaq    (8 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> +       leaq    (8 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> +# endif
> +       jmp     L(prepare_loop_aligned)
> +
> +
> +       .p2align 4,, 10
> +L(less_8_till_page):
> +# ifdef USE_AS_WCSCMP
> +       /* If using wchar then this is the only check before we reach
> +          the page boundary.  */
> +       movl    (%rdi), %eax
> +       movl    (%rsi), %ecx
> +       cmpl    %ecx, %eax
> +       jnz     L(ret_less_8_wcs)
>  #  ifdef USE_AS_STRNCMP
> -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> -          (%r11).  */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       addq    %rdi, %rdx
> +       /* We already checked for len <= 1 so cannot hit that case here.
> +        */
>  #  endif
> +       testl   %eax, %eax
> +       jnz     L(prepare_loop_no_len)
> +       ret
>
> -L(cross_page_8bytes):
> -       /* Less than 8 bytes to check, try 4 byte vector.  */
> -       cmpl    $(PAGE_SIZE - 4), %eax
> -       jg      L(cross_page_4bytes)
> -       vmovd   (%rdi, %rdx), %xmm1
> -       vmovd   (%rsi, %rdx), %xmm0
> -       VPCMPEQ %xmm0, %xmm1, %xmm0
> -       VPMINU  %xmm1, %xmm0, %xmm0
> -       VPCMPEQ %xmm7, %xmm0, %xmm0
> -       vpmovmskb %xmm0, %ecx
> -       /* Only last 4 bits are valid.  */
> -       andl    $0xf, %ecx
> -       testl   %ecx, %ecx
> -       jne     L(last_vector)
> +       .p2align 4,, 8
> +L(ret_less_8_wcs):
> +       setl    %OFFSET_REG8
> +       negl    %OFFSET_REG
> +       movl    %OFFSET_REG, %eax
> +       xorl    %r8d, %eax
> +       ret
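
(Aside: the setl/negl/xorl sequence is a branch-free way to produce a result with the right sign; %r8d encodes whether the page-cross setup swapped the two input pointers. Sign-wise it is equivalent to this C sketch, where the `swapped` flag is illustrative -- the real code folds it into %r8d.)

  /* Sketch only: signed wide-char result for a known mismatch.  */
  static int wcs_ret (int c1, int c2, int swapped)
  {
    int ret = (c1 < c2) ? -1 : 1;   /* signed compare, as wcscmp requires */
    return swapped ? -ret : ret;    /* %r8d plays the role of `swapped` */
  }
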
> +
> +# else
> +
> +       /* Find largest load size we can use.  */
> +       cmpl    $28, %eax
> +       ja      L(less_4_till_page)
> +
> +       vmovd   (%rdi), %xmm0
> +       vmovd   (%rsi), %xmm1
> +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> +       VPCMPEQ %xmm1, %xmm0, %xmm1
> +       vpandn  %xmm1, %xmm2, %xmm1
> +       vpmovmskb %ymm1, %ecx
> +       subl    $0xf, %ecx
> +       jnz     L(check_ret_vec_page_cross)
>
> -       addl    $4, %edx
>  #  ifdef USE_AS_STRNCMP
> -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> -          (%r11).  */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       cmpq    $4, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case1)
>  #  endif
> +       movl    $28, %OFFSET_REG
> +       /* Explicit check for 16 byte alignment.  */
> +       subl    %eax, %OFFSET_REG
>
> -L(cross_page_4bytes):
> -# endif
> -       /* Less than 4 bytes to check, try one byte/dword at a time.  */
> -# ifdef USE_AS_STRNCMP
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -# endif
> -# ifdef USE_AS_WCSCMP
> -       movl    (%rdi, %rdx), %eax
> -       movl    (%rsi, %rdx), %ecx
> -# else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %ecx
> -# endif
> -       testl   %eax, %eax
> -       jne     L(cross_page_loop)
> +
> +
> +       vmovd   (%rdi, %OFFSET_REG64), %xmm0
> +       vmovd   (%rsi, %OFFSET_REG64), %xmm1
> +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> +       VPCMPEQ %xmm1, %xmm0, %xmm1
> +       vpandn  %xmm1, %xmm2, %xmm1
> +       vpmovmskb %ymm1, %ecx
> +       subl    $0xf, %ecx
> +       jnz     L(check_ret_vec_page_cross)
> +
> +#  ifdef USE_AS_STRNCMP
> +       addl    $4, %OFFSET_REG
> +       subq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case1)
> +       subq    $-(VEC_SIZE * 4), %rdx
> +
> +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> +#  else
> +       leaq    (4 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> +       leaq    (4 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> +#  endif
> +       jmp     L(prepare_loop_aligned)
> +
> +#  ifdef USE_AS_STRNCMP
> +       .p2align 4,, 2
> +L(ret_zero_page_cross_slow_case1):
> +       xorl    %eax, %eax
> +       ret
> +#  endif
> +
> +       .p2align 4,, 10
> +L(less_4_till_page):
> +       subq    %rdi, %rsi
> +       /* Extremely slow byte comparison loop.  */
> +L(less_4_loop):
> +       movzbl  (%rdi), %eax
> +       movzbl  (%rsi, %rdi), %ecx
>         subl    %ecx, %eax
> -       VZEROUPPER_RETURN
> -END (STRCMP)
> +       jnz     L(ret_less_4_loop)
> +       testl   %ecx, %ecx
> +       jz      L(ret_zero_4_loop)
> +#  ifdef USE_AS_STRNCMP
> +       decq    %rdx
> +       jz      L(ret_zero_4_loop)
> +#  endif
> +       incq    %rdi
> +       /* End condition is reaching the page boundary (rdi is aligned).  */
> +       testl   $31, %edi
> +       jnz     L(less_4_loop)
> +       leaq    -(VEC_SIZE * 4)(%rdi, %rsi), %rsi
> +       addq    $-(VEC_SIZE * 4), %rdi
> +#  ifdef USE_AS_STRNCMP
> +       subq    $-(VEC_SIZE * 4), %rdx
> +#  endif
> +       jmp     L(prepare_loop_aligned)
> +
> +L(ret_zero_4_loop):
> +       xorl    %eax, %eax
> +       ret
> +L(ret_less_4_loop):
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
> +       ret
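
(Aside: the "extremely slow byte comparison loop" boils down to the sketch below; strncmp's length decrement and the %r8d sign fix-up are omitted, and `done` signals whether a result was produced or the 4x VEC loop should be re-entered. Names are illustrative.)

  /* Sketch only: compare byte-by-byte until s1 reaches a 32-byte
     aligned address (the page boundary in this path).  */
  #include <stddef.h>
  #include <stdint.h>

  static int cmp_bytes_till_aligned (const unsigned char *s1,
                                     const unsigned char *s2, int *done)
  {
    ptrdiff_t delta = s2 - s1;        /* `subq %rdi, %rsi` above */
    *done = 1;
    do
      {
        unsigned char c1 = *s1, c2 = s1[delta];
        if (c1 != c2)
          return (int) c1 - (int) c2;
        if (c1 == 0)
          return 0;
        s1++;
      }
    while (((uintptr_t) s1 & 31) != 0);
    *done = 0;                        /* re-enter the 4x VEC loop */
    return 0;
  }
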
> +# endif
> +END(STRCMP)
>  #endif
> --
> 2.25.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 6/7] x86: Optimize strcmp-evex.S
  2022-01-10  0:27   ` [PATCH v2 6/7] x86: Optimize strcmp-evex.S Noah Goldstein
@ 2022-01-10  0:41     ` H.J. Lu
  0 siblings, 0 replies; 59+ messages in thread
From: H.J. Lu @ 2022-01-10  0:41 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library

On Sun, Jan 9, 2022 at 4:32 PM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Optimizations are primarily to the loop logic and how the page cross
> logic interacts with the loop.
>
> The page cross logic is at times more expensive for short strings near
> the end of a page but not crossing the page. This is done to retest
> the page cross conditions with a non-faulting check and to improve the
> logic for entering the loop afterwards. This only affects particular
> cases, however, and is generally made up for by more than 10x
> improvements on the transition from the page cross -> loop case.
>
> The non-page cross cases are nearly universally improved as well.
>
> test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
>
> Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> ---
>  sysdeps/x86_64/multiarch/strcmp-evex.S | 1712 +++++++++++++-----------
>  1 file changed, 919 insertions(+), 793 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
> index 0cd939d5af..e5070f3d53 100644
> --- a/sysdeps/x86_64/multiarch/strcmp-evex.S
> +++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
> @@ -26,54 +26,69 @@
>
>  # define PAGE_SIZE     4096
>
> -/* VEC_SIZE = Number of bytes in a ymm register */
> +       /* VEC_SIZE = Number of bytes in a ymm register.  */
>  # define VEC_SIZE      32
> +# define CHAR_PER_VEC  (VEC_SIZE       /       SIZE_OF_CHAR)
>
> -/* Shift for dividing by (VEC_SIZE * 4).  */
> -# define DIVIDE_BY_VEC_4_SHIFT 7
> -# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
> -#  error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
> -# endif
> -
> -# define VMOVU         vmovdqu64
> -# define VMOVA         vmovdqa64
> +# define VMOVU vmovdqu64
> +# define VMOVA vmovdqa64
>
>  # ifdef USE_AS_WCSCMP
> -/* Compare packed dwords.  */
> -#  define VPCMP                vpcmpd
> +#  define TESTEQ       subl    $0xff,
> +       /* Compare packed dwords.  */
> +#  define VPCMP        vpcmpd
>  #  define VPMINU       vpminud
>  #  define VPTESTM      vptestmd
> -#  define SHIFT_REG32  r8d
> -#  define SHIFT_REG64  r8
> -/* 1 dword char == 4 bytes.  */
> +       /* 1 dword char == 4 bytes.  */
>  #  define SIZE_OF_CHAR 4
>  # else
> -/* Compare packed bytes.  */
> -#  define VPCMP                vpcmpb
> +#  define TESTEQ       incl
> +       /* Compare packed bytes.  */
> +#  define VPCMP        vpcmpb
>  #  define VPMINU       vpminub
>  #  define VPTESTM      vptestmb
> -#  define SHIFT_REG32  ecx
> -#  define SHIFT_REG64  rcx
> -/* 1 byte char == 1 byte.  */
> +       /* 1 byte char == 1 byte.  */
>  #  define SIZE_OF_CHAR 1
>  # endif
>
> +# ifdef USE_AS_STRNCMP
> +#  define LOOP_REG     r9d
> +#  define LOOP_REG64   r9
> +
> +#  define OFFSET_REG8  r9b
> +#  define OFFSET_REG   r9d
> +#  define OFFSET_REG64 r9
> +# else
> +#  define LOOP_REG     edx
> +#  define LOOP_REG64   rdx
> +
> +#  define OFFSET_REG8  dl
> +#  define OFFSET_REG   edx
> +#  define OFFSET_REG64 rdx
> +# endif
> +
> +# if defined USE_AS_STRNCMP || defined USE_AS_WCSCMP
> +#  define VEC_OFFSET   0
> +# else
> +#  define VEC_OFFSET   (-VEC_SIZE)
> +# endif
> +
>  # define XMMZERO       xmm16
> -# define XMM0          xmm17
> -# define XMM1          xmm18
> +# define XMM0  xmm17
> +# define XMM1  xmm18
>
>  # define YMMZERO       ymm16
> -# define YMM0          ymm17
> -# define YMM1          ymm18
> -# define YMM2          ymm19
> -# define YMM3          ymm20
> -# define YMM4          ymm21
> -# define YMM5          ymm22
> -# define YMM6          ymm23
> -# define YMM7          ymm24
> -# define YMM8          ymm25
> -# define YMM9          ymm26
> -# define YMM10         ymm27
> +# define YMM0  ymm17
> +# define YMM1  ymm18
> +# define YMM2  ymm19
> +# define YMM3  ymm20
> +# define YMM4  ymm21
> +# define YMM5  ymm22
> +# define YMM6  ymm23
> +# define YMM7  ymm24
> +# define YMM8  ymm25
> +# define YMM9  ymm26
> +# define YMM10 ymm27
>
>  /* Warning!
>             wcscmp/wcsncmp have to use SIGNED comparison for elements.
> @@ -96,985 +111,1096 @@
>     the maximum offset is reached before a difference is found, zero is
>     returned.  */
>
> -       .section .text.evex,"ax",@progbits
> -ENTRY (STRCMP)
> +       .section .text.evex, "ax", @progbits
> +ENTRY(STRCMP)
>  # ifdef USE_AS_STRNCMP
> -       /* Check for simple cases (0 or 1) in offset.  */
> -       cmp     $1, %RDX_LP
> -       je      L(char0)
> -       jb      L(zero)
> -#  ifdef USE_AS_WCSCMP
> -#  ifndef __ILP32__
> -       movq    %rdx, %rcx
> -       /* Check if length could overflow when multiplied by
> -          sizeof(wchar_t). Checking top 8 bits will cover all potential
> -          overflow cases as well as redirect cases where its impossible to
> -          length to bound a valid memory region. In these cases just use
> -          'wcscmp'.  */
> -       shrq    $56, %rcx
> -       jnz     __wcscmp_evex
> -#  endif
> -       /* Convert units: from wide to byte char.  */
> -       shl     $2, %RDX_LP
> +#  ifdef __ILP32__
> +       /* Clear the upper 32 bits.  */
> +       movl    %edx, %rdx
>  #  endif
> -       /* Register %r11 tracks the maximum offset.  */
> -       mov     %RDX_LP, %R11_LP
> +       cmp     $1, %RDX_LP
> +       /* Signed comparison intentional. We use this branch to also
> +          test cases where length >= 2^63. These very large sizes can be
> +          handled with strcmp as there is no way for that length to
> +          actually bound the buffer.  */
> +       jle     L(one_or_less)
>  # endif
>         movl    %edi, %eax
> -       xorl    %edx, %edx
> -       /* Make %XMMZERO (%YMMZERO) all zeros in this function.  */
> -       vpxorq  %XMMZERO, %XMMZERO, %XMMZERO
>         orl     %esi, %eax
> -       andl    $(PAGE_SIZE - 1), %eax
> -       cmpl    $(PAGE_SIZE - (VEC_SIZE * 4)), %eax
> -       jg      L(cross_page)
> -       /* Start comparing 4 vectors.  */
> +       /* Shift out the bits irrelevant to page boundary ([63:12]).  */
> +       sall    $20, %eax
> +       /* Check if s1 or s2 may cross a page in next 4x VEC loads.  */
> +       cmpl    $((PAGE_SIZE -(VEC_SIZE * 4)) << 20), %eax
> +       ja      L(page_cross)
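
(Aside: the shifted-OR check above is the same as comparing a combined page offset against the last safe starting point. Roughly, in C -- PAGE_SIZE and VEC_SIZE as defined in this file, the function name is illustrative.)

  /* Sketch only: conservative test for "either s1 or s2 might cross a
     page within the next 4 vector loads".  */
  #include <stdint.h>

  #define PAGE_SIZE 4096
  #define VEC_SIZE  32

  static int may_cross_page (const char *s1, const char *s2)
  {
    unsigned off = ((unsigned) (uintptr_t) s1 | (unsigned) (uintptr_t) s2)
                   & (PAGE_SIZE - 1);
    /* ORing the offsets can only over-estimate, so this may report a
       false positive but never a false negative.  */
    return off > PAGE_SIZE - VEC_SIZE * 4;
  }
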
> +
> +L(no_page_cross):
> +       /* Safe to compare 4x vectors.  */
>         VMOVU   (%rdi), %YMM0
> -
> -       /* Each bit set in K2 represents a non-null CHAR in YMM0.  */
>         VPTESTM %YMM0, %YMM0, %k2
> -
>         /* Each bit cleared in K1 represents a mismatch or a null CHAR
>            in YMM0 and 32 bytes at (%rsi).  */
>         VPCMP   $0, (%rsi), %YMM0, %k1{%k2}
> -
>         kmovd   %k1, %ecx
> -# ifdef USE_AS_WCSCMP
> -       subl    $0xff, %ecx
> -# else
> -       incl    %ecx
> -# endif
> -       je      L(next_3_vectors)
> -       tzcntl  %ecx, %edx
> -# ifdef USE_AS_WCSCMP
> -       /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> -       sall    $2, %edx
> -# endif
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the mismatched index (%rdx) is after the maximum
> -          offset (%r11).   */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       cmpq    $CHAR_PER_VEC, %rdx
> +       jbe     L(vec_0_test_len)
>  # endif
> +
> +       /* TESTEQ is `incl` for strcmp/strncmp and `subl $0xff` for
> +          wcscmp/wcsncmp.  */
> +
> +       /* All 1s means all lanes compared equal. TESTEQ overflows to
> +          zero in the all-equals case. Otherwise 1s will carry until the
> +          position of the first mismatch.  */
> +       TESTEQ  %ecx
> +       jz      L(more_3x_vec)
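
(Aside: in C terms TESTEQ amounts to the following; a sketch, with `mask` standing in for the kmovd result.)

  /* Sketch only: the all-equal tests behind TESTEQ.  */
  #include <stdint.h>

  /* strcmp/strncmp: 32 one-bit lanes, all set == 0xffffffff.  */
  static int all_equal_bytes (uint32_t mask)
  {
    return (uint32_t) (mask + 1) == 0;      /* incl %ecx */
  }

  /* wcscmp/wcsncmp: only 8 lanes are meaningful, all set == 0xff.  */
  static int all_equal_wchars (uint32_t mask)
  {
    return mask - 0xff == 0;                /* subl $0xff, %ecx */
  }
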
> +
> +       .p2align 4,, 4
> +L(return_vec_0):
> +       tzcntl  %ecx, %ecx
>  # ifdef USE_AS_WCSCMP
> +       movl    (%rdi, %rcx, SIZE_OF_CHAR), %edx
>         xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       je      L(return)
> -L(wcscmp_return):
> +       cmpl    (%rsi, %rcx, SIZE_OF_CHAR), %edx
> +       je      L(ret0)
>         setl    %al
>         negl    %eax
>         orl     $1, %eax
> -L(return):
>  # else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> +       movzbl  (%rdi, %rcx), %eax
> +       movzbl  (%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
>  # endif
> +L(ret0):
>         ret
>
> -L(return_vec_size):
> -       tzcntl  %ecx, %edx
> -# ifdef USE_AS_WCSCMP
> -       /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> -       sall    $2, %edx
> -# endif
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after
> -          the maximum offset (%r11).  */
> -       addq    $VEC_SIZE, %rdx
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -#  ifdef USE_AS_WCSCMP
> +       .p2align 4,, 4
> +L(vec_0_test_len):
> +       notl    %ecx
> +       bzhil   %edx, %ecx, %eax
> +       jnz     L(return_vec_0)
> +       /* Align if will cross fetch block.  */
> +       .p2align 4,, 2
> +L(ret_zero):
>         xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> -#  endif
> -# else
> +       ret
> +
> +       .p2align 4,, 5
> +L(one_or_less):
> +       jb      L(ret_zero)
>  #  ifdef USE_AS_WCSCMP
> +       /* 'nbe' covers the case where length is negative (large
> +          unsigned).  */
> +       jnbe    __wcscmp_evex
> +       movl    (%rdi), %edx
>         xorl    %eax, %eax
> -       movl    VEC_SIZE(%rdi, %rdx), %ecx
> -       cmpl    VEC_SIZE(%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> +       cmpl    (%rsi), %edx
> +       je      L(ret1)
> +       setl    %al
> +       negl    %eax
> +       orl     $1, %eax
>  #  else
> -       movzbl  VEC_SIZE(%rdi, %rdx), %eax
> -       movzbl  VEC_SIZE(%rsi, %rdx), %edx
> -       subl    %edx, %eax
> +       /* 'nbe' covers the case where length is negative (large
> +          unsigned).  */
> +       jnbe    __strcmp_evex
> +       movzbl  (%rdi), %eax
> +       movzbl  (%rsi), %ecx
> +       subl    %ecx, %eax
>  #  endif
> -# endif
> +L(ret1):
>         ret
> +# endif
>
> -L(return_2_vec_size):
> -       tzcntl  %ecx, %edx
> +       .p2align 4,, 10
> +L(return_vec_1):
> +       tzcntl  %ecx, %ecx
> +# ifdef USE_AS_STRNCMP
> +       /* rdx must be > CHAR_PER_VEC so it's safe to subtract without
> +          worrying about underflow.  */
> +       addq    $-CHAR_PER_VEC, %rdx
> +       cmpq    %rcx, %rdx
> +       jbe     L(ret_zero)
> +# endif
>  # ifdef USE_AS_WCSCMP
> -       /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> -       sall    $2, %edx
> +       movl    VEC_SIZE(%rdi, %rcx, SIZE_OF_CHAR), %edx
> +       xorl    %eax, %eax
> +       cmpl    VEC_SIZE(%rsi, %rcx, SIZE_OF_CHAR), %edx
> +       je      L(ret2)
> +       setl    %al
> +       negl    %eax
> +       orl     $1, %eax
> +# else
> +       movzbl  VEC_SIZE(%rdi, %rcx), %eax
> +       movzbl  VEC_SIZE(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
>  # endif
> +L(ret2):
> +       ret
> +
> +       .p2align 4,, 10
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is
> -          after the maximum offset (%r11).  */
> -       addq    $(VEC_SIZE * 2), %rdx
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> +L(return_vec_3):
> +#  if CHAR_PER_VEC <= 16
> +       sall    $CHAR_PER_VEC, %ecx
>  #  else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> +       salq    $CHAR_PER_VEC, %rcx
>  #  endif
> +# endif
> +L(return_vec_2):
> +# if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP)
> +       tzcntl  %ecx, %ecx
>  # else
> -#  ifdef USE_AS_WCSCMP
> -       xorl    %eax, %eax
> -       movl    (VEC_SIZE * 2)(%rdi, %rdx), %ecx
> -       cmpl    (VEC_SIZE * 2)(%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (VEC_SIZE * 2)(%rdi, %rdx), %eax
> -       movzbl  (VEC_SIZE * 2)(%rsi, %rdx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       tzcntq  %rcx, %rcx
>  # endif
> -       ret
>
> -L(return_3_vec_size):
> -       tzcntl  %ecx, %edx
> -# ifdef USE_AS_WCSCMP
> -       /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> -       sall    $2, %edx
> -# endif
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is
> -          after the maximum offset (%r11).  */
> -       addq    $(VEC_SIZE * 3), %rdx
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -#  ifdef USE_AS_WCSCMP
> +       cmpq    %rcx, %rdx
> +       jbe     L(ret_zero)
> +# endif
> +
> +# ifdef USE_AS_WCSCMP
> +       movl    (VEC_SIZE * 2)(%rdi, %rcx, SIZE_OF_CHAR), %edx
>         xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    (VEC_SIZE * 2)(%rsi, %rcx, SIZE_OF_CHAR), %edx
> +       je      L(ret3)
> +       setl    %al
> +       negl    %eax
> +       orl     $1, %eax
>  # else
> +       movzbl  (VEC_SIZE * 2)(%rdi, %rcx), %eax
> +       movzbl  (VEC_SIZE * 2)(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +# endif
> +L(ret3):
> +       ret
> +
> +# ifndef USE_AS_STRNCMP
> +       .p2align 4,, 10
> +L(return_vec_3):
> +       tzcntl  %ecx, %ecx
>  #  ifdef USE_AS_WCSCMP
> +       movl    (VEC_SIZE * 3)(%rdi, %rcx, SIZE_OF_CHAR), %edx
>         xorl    %eax, %eax
> -       movl    (VEC_SIZE * 3)(%rdi, %rdx), %ecx
> -       cmpl    (VEC_SIZE * 3)(%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> +       cmpl    (VEC_SIZE * 3)(%rsi, %rcx, SIZE_OF_CHAR), %edx
> +       je      L(ret4)
> +       setl    %al
> +       negl    %eax
> +       orl     $1, %eax
>  #  else
> -       movzbl  (VEC_SIZE * 3)(%rdi, %rdx), %eax
> -       movzbl  (VEC_SIZE * 3)(%rsi, %rdx), %edx
> -       subl    %edx, %eax
> +       movzbl  (VEC_SIZE * 3)(%rdi, %rcx), %eax
> +       movzbl  (VEC_SIZE * 3)(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
>  #  endif
> -# endif
> +L(ret4):
>         ret
> +# endif
>
> -       .p2align 4
> -L(next_3_vectors):
> -       VMOVU   VEC_SIZE(%rdi), %YMM0
> -       /* Each bit set in K2 represents a non-null CHAR in YMM0.  */
> +       /* 32 byte align here ensures the main loop is ideally aligned
> +          for DSB.  */
> +       .p2align 5
> +L(more_3x_vec):
> +       /* Safe to compare 4x vectors.  */
> +       VMOVU   (VEC_SIZE)(%rdi), %YMM0
>         VPTESTM %YMM0, %YMM0, %k2
> -       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> -          in YMM0 and 32 bytes at VEC_SIZE(%rsi).  */
> -       VPCMP   $0, VEC_SIZE(%rsi), %YMM0, %k1{%k2}
> +       VPCMP   $0, (VEC_SIZE)(%rsi), %YMM0, %k1{%k2}
>         kmovd   %k1, %ecx
> -# ifdef USE_AS_WCSCMP
> -       subl    $0xff, %ecx
> -# else
> -       incl    %ecx
> +       TESTEQ  %ecx
> +       jnz     L(return_vec_1)
> +
> +# ifdef USE_AS_STRNCMP
> +       subq    $(CHAR_PER_VEC * 2), %rdx
> +       jbe     L(ret_zero)
>  # endif
> -       jne     L(return_vec_size)
>
>         VMOVU   (VEC_SIZE * 2)(%rdi), %YMM0
> -       /* Each bit set in K2 represents a non-null CHAR in YMM0.  */
>         VPTESTM %YMM0, %YMM0, %k2
> -       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> -          in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rsi).  */
>         VPCMP   $0, (VEC_SIZE * 2)(%rsi), %YMM0, %k1{%k2}
>         kmovd   %k1, %ecx
> -# ifdef USE_AS_WCSCMP
> -       subl    $0xff, %ecx
> -# else
> -       incl    %ecx
> -# endif
> -       jne     L(return_2_vec_size)
> +       TESTEQ  %ecx
> +       jnz     L(return_vec_2)
>
>         VMOVU   (VEC_SIZE * 3)(%rdi), %YMM0
> -       /* Each bit set in K2 represents a non-null CHAR in YMM0.  */
>         VPTESTM %YMM0, %YMM0, %k2
> -       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> -          in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rsi).  */
>         VPCMP   $0, (VEC_SIZE * 3)(%rsi), %YMM0, %k1{%k2}
>         kmovd   %k1, %ecx
> +       TESTEQ  %ecx
> +       jnz     L(return_vec_3)
> +
> +# ifdef USE_AS_STRNCMP
> +       cmpq    $(CHAR_PER_VEC * 2), %rdx
> +       jbe     L(ret_zero)
> +# endif
> +
> +
>  # ifdef USE_AS_WCSCMP
> -       subl    $0xff, %ecx
> +       /* Any non-zero positive value that doesn't interfere with 0x1.
> +        */
> +       movl    $2, %r8d
> +
>  # else
> -       incl    %ecx
> +       xorl    %r8d, %r8d
>  # endif
> -       jne     L(return_3_vec_size)
> -L(main_loop_header):
> -       leaq    (VEC_SIZE * 4)(%rdi), %rdx
> -       movl    $PAGE_SIZE, %ecx
> -       /* Align load via RAX.  */
> -       andq    $-(VEC_SIZE * 4), %rdx
> -       subq    %rdi, %rdx
> -       leaq    (%rdi, %rdx), %rax
> +
> +       /* The prepare labels are various entry points from the page
> +          cross logic.  */
> +L(prepare_loop):
> +
>  # ifdef USE_AS_STRNCMP
> -       /* Starting from this point, the maximum offset, or simply the
> -          'offset', DECREASES by the same amount when base pointers are
> -          moved forward.  Return 0 when:
> -            1) On match: offset <= the matched vector index.
> -            2) On mistmach, offset is before the mistmatched index.
> -        */
> -       subq    %rdx, %r11
> -       jbe     L(zero)
> +#  ifdef USE_AS_WCSCMP
> +L(prepare_loop_no_len):
> +       movl    %edi, %ecx
> +       andl    $(VEC_SIZE * 4 - 1), %ecx
> +       shrl    $2, %ecx
> +       leaq    (CHAR_PER_VEC * 2)(%rdx, %rcx), %rdx
> +#  else
> +       /* Store N + (VEC_SIZE * 4) and place check at the beginning of
> +          the loop.  */
> +       leaq    (VEC_SIZE * 2)(%rdi, %rdx), %rdx
> +L(prepare_loop_no_len):
> +#  endif
> +# else
> +L(prepare_loop_no_len):
>  # endif
> -       addq    %rsi, %rdx
> -       movq    %rdx, %rsi
> -       andl    $(PAGE_SIZE - 1), %esi
> -       /* Number of bytes before page crossing.  */
> -       subq    %rsi, %rcx
> -       /* Number of VEC_SIZE * 4 blocks before page crossing.  */
> -       shrq    $DIVIDE_BY_VEC_4_SHIFT, %rcx
> -       /* ESI: Number of VEC_SIZE * 4 blocks before page crossing.   */
> -       movl    %ecx, %esi
> -       jmp     L(loop_start)
>
> +       /* Align s1 and adjust s2 accordingly.  */
> +       subq    %rdi, %rsi
> +       andq    $-(VEC_SIZE * 4), %rdi
> +L(prepare_loop_readj):
> +       addq    %rdi, %rsi
> +# if (defined USE_AS_STRNCMP) && !(defined USE_AS_WCSCMP)
> +       subq    %rdi, %rdx
> +# endif
> +
> +L(prepare_loop_aligned):
> +       /* eax stores distance from rsi to next page cross. These cases
> +          need to be handled specially as the 4x loop could potentially
> +          read memory past the length of s1 or s2 and across a page
> +          boundary.  */
> +       movl    $-(VEC_SIZE * 4), %eax
> +       subl    %esi, %eax
> +       andl    $(PAGE_SIZE - 1), %eax
> +
> +       vpxorq  %YMMZERO, %YMMZERO, %YMMZERO
> +
> +       /* Loop 4x comparisons at a time.  */
>         .p2align 4
>  L(loop):
> +
> +       /* End condition for strncmp.  */
>  # ifdef USE_AS_STRNCMP
> -       /* Base pointers are moved forward by 4 * VEC_SIZE.  Decrease
> -          the maximum offset (%r11) by the same amount.  */
> -       subq    $(VEC_SIZE * 4), %r11
> -       jbe     L(zero)
> +       subq    $(CHAR_PER_VEC * 4), %rdx
> +       jbe     L(ret_zero)
>  # endif
> -       addq    $(VEC_SIZE * 4), %rax
> -       addq    $(VEC_SIZE * 4), %rdx
> -L(loop_start):
> -       testl   %esi, %esi
> -       leal    -1(%esi), %esi
> -       je      L(loop_cross_page)
> -L(back_to_loop):
> -       /* Main loop, comparing 4 vectors are a time.  */
> -       VMOVA   (%rax), %YMM0
> -       VMOVA   VEC_SIZE(%rax), %YMM2
> -       VMOVA   (VEC_SIZE * 2)(%rax), %YMM4
> -       VMOVA   (VEC_SIZE * 3)(%rax), %YMM6
> +
> +       subq    $-(VEC_SIZE * 4), %rdi
> +       subq    $-(VEC_SIZE * 4), %rsi
> +
> +       /* Check if rsi loads will cross a page boundary.  */
> +       addl    $-(VEC_SIZE * 4), %eax
> +       jnb     L(page_cross_during_loop)
> +
> +       /* Loop entry after handling page cross during loop.  */
> +L(loop_skip_page_cross_check):
> +       VMOVA   (VEC_SIZE * 0)(%rdi), %YMM0
> +       VMOVA   (VEC_SIZE * 1)(%rdi), %YMM2
> +       VMOVA   (VEC_SIZE * 2)(%rdi), %YMM4
> +       VMOVA   (VEC_SIZE * 3)(%rdi), %YMM6
>
>         VPMINU  %YMM0, %YMM2, %YMM8
>         VPMINU  %YMM4, %YMM6, %YMM9
>
> -       /* A zero CHAR in YMM8 means that there is a null CHAR.  */
> -       VPMINU  %YMM8, %YMM9, %YMM8
> +       /* A zero CHAR in YMM9 means that there is a null CHAR.  */
> +       VPMINU  %YMM8, %YMM9, %YMM9
>
>         /* Each bit set in K1 represents a non-null CHAR in YMM8.  */
> -       VPTESTM %YMM8, %YMM8, %k1
> +       VPTESTM %YMM9, %YMM9, %k1
>
> -       /* (YMM ^ YMM): A non-zero CHAR represents a mismatch.  */
> -       vpxorq  (%rdx), %YMM0, %YMM1
> -       vpxorq  VEC_SIZE(%rdx), %YMM2, %YMM3
> -       vpxorq  (VEC_SIZE * 2)(%rdx), %YMM4, %YMM5
> -       vpxorq  (VEC_SIZE * 3)(%rdx), %YMM6, %YMM7
> +       vpxorq  (VEC_SIZE * 0)(%rsi), %YMM0, %YMM1
> +       vpxorq  (VEC_SIZE * 1)(%rsi), %YMM2, %YMM3
> +       vpxorq  (VEC_SIZE * 2)(%rsi), %YMM4, %YMM5
> +       /* Ternary logic to xor (VEC_SIZE * 3)(%rsi) with YMM6 while
> +          oring with YMM1. Result is stored in YMM6.  */
> +       vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM1, %YMM6
>
> -       vporq   %YMM1, %YMM3, %YMM9
> -       vporq   %YMM5, %YMM7, %YMM10
> +       /* Or together YMM3, YMM5, and YMM6.  */
> +       vpternlogd $0xfe, %YMM3, %YMM5, %YMM6
>
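
(Aside: the vpternlogd immediates are 8-entry truth tables indexed by the (dst, src1, src2) bits: 0xde computes src1 | (dst ^ src2) and 0xfe computes dst | src1 | src2, which is how the three xor/or steps collapse into two instructions. A small sketch that reconstructs the constants, illustrative only.)

  /* Sketch only: derive the vpternlogd immediates used above.  */
  static unsigned char ternlog_imm_de (void)
  {
    unsigned char imm = 0;
    for (int a = 0; a < 2; a++)       /* a = destination (YMM6) bit */
      for (int b = 0; b < 2; b++)     /* b = first source (YMM1/YMM5) bit */
        for (int c = 0; c < 2; c++)   /* c = memory operand bit */
          imm |= (unsigned char) ((b | (a ^ c)) << ((a << 2) | (b << 1) | c));
    return imm;                       /* == 0xde */
  }

  static unsigned char ternlog_imm_fe (void)
  {
    unsigned char imm = 0;
    for (int a = 0; a < 2; a++)
      for (int b = 0; b < 2; b++)
        for (int c = 0; c < 2; c++)
          imm |= (unsigned char) ((a | b | c) << ((a << 2) | (b << 1) | c));
    return imm;                       /* == 0xfe */
  }
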
> -       /* A non-zero CHAR in YMM9 represents a mismatch.  */
> -       vporq   %YMM9, %YMM10, %YMM9
>
> -       /* Each bit cleared in K0 represents a mismatch or a null CHAR.  */
> -       VPCMP   $0, %YMMZERO, %YMM9, %k0{%k1}
> -       kmovd   %k0, %ecx
> -# ifdef USE_AS_WCSCMP
> -       subl    $0xff, %ecx
> -# else
> -       incl    %ecx
> -# endif
> -       je       L(loop)
> +       /* A non-zero CHAR in YMM6 represents a mismatch.  */
> +       VPCMP   $0, %YMMZERO, %YMM6, %k0{%k1}
> +       kmovd   %k0, %LOOP_REG
>
> -       /* Each bit set in K1 represents a non-null CHAR in YMM0.  */
> +       TESTEQ  %LOOP_REG
> +       jz      L(loop)
> +
> +
> +       /* Find which VEC has the mismatch of end of string.  */
>         VPTESTM %YMM0, %YMM0, %k1
> -       /* Each bit cleared in K0 represents a mismatch or a null CHAR
> -          in YMM0 and (%rdx).  */
>         VPCMP   $0, %YMMZERO, %YMM1, %k0{%k1}
>         kmovd   %k0, %ecx
> -# ifdef USE_AS_WCSCMP
> -       subl    $0xff, %ecx
> -# else
> -       incl    %ecx
> -# endif
> -       je      L(test_vec)
> -       tzcntl  %ecx, %ecx
> -# ifdef USE_AS_WCSCMP
> -       /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> -       sall    $2, %ecx
> -# endif
> -# ifdef USE_AS_STRNCMP
> -       cmpq    %rcx, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> -# else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> -# endif
> -       ret
> +       TESTEQ  %ecx
> +       jnz     L(return_vec_0_end)
>
> -       .p2align 4
> -L(test_vec):
> -# ifdef USE_AS_STRNCMP
> -       /* The first vector matched.  Return 0 if the maximum offset
> -          (%r11) <= VEC_SIZE.  */
> -       cmpq    $VEC_SIZE, %r11
> -       jbe     L(zero)
> -# endif
> -       /* Each bit set in K1 represents a non-null CHAR in YMM2.  */
>         VPTESTM %YMM2, %YMM2, %k1
> -       /* Each bit cleared in K0 represents a mismatch or a null CHAR
> -          in YMM2 and VEC_SIZE(%rdx).  */
>         VPCMP   $0, %YMMZERO, %YMM3, %k0{%k1}
>         kmovd   %k0, %ecx
> -# ifdef USE_AS_WCSCMP
> -       subl    $0xff, %ecx
> -# else
> -       incl    %ecx
> -# endif
> -       je      L(test_2_vec)
> -       tzcntl  %ecx, %edi
> -# ifdef USE_AS_WCSCMP
> -       /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> -       sall    $2, %edi
> -# endif
> -# ifdef USE_AS_STRNCMP
> -       addq    $VEC_SIZE, %rdi
> -       cmpq    %rdi, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (%rsi, %rdi), %ecx
> -       cmpl    (%rdx, %rdi), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rdi), %eax
> -       movzbl  (%rdx, %rdi), %edx
> -       subl    %edx, %eax
> -#  endif
> -# else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    VEC_SIZE(%rsi, %rdi), %ecx
> -       cmpl    VEC_SIZE(%rdx, %rdi), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  VEC_SIZE(%rax, %rdi), %eax
> -       movzbl  VEC_SIZE(%rdx, %rdi), %edx
> -       subl    %edx, %eax
> -#  endif
> -# endif
> -       ret
> +       TESTEQ  %ecx
> +       jnz     L(return_vec_1_end)
>
> -       .p2align 4
> -L(test_2_vec):
> +
> +       /* Handle VEC 2 and 3 without branches.  */
> +L(return_vec_2_3_end):
>  # ifdef USE_AS_STRNCMP
> -       /* The first 2 vectors matched.  Return 0 if the maximum offset
> -          (%r11) <= 2 * VEC_SIZE.  */
> -       cmpq    $(VEC_SIZE * 2), %r11
> -       jbe     L(zero)
> +       subq    $(CHAR_PER_VEC * 2), %rdx
> +       jbe     L(ret_zero_end)
>  # endif
> -       /* Each bit set in K1 represents a non-null CHAR in YMM4.  */
> +
>         VPTESTM %YMM4, %YMM4, %k1
> -       /* Each bit cleared in K0 represents a mismatch or a null CHAR
> -          in YMM4 and (VEC_SIZE * 2)(%rdx).  */
>         VPCMP   $0, %YMMZERO, %YMM5, %k0{%k1}
>         kmovd   %k0, %ecx
> -# ifdef USE_AS_WCSCMP
> -       subl    $0xff, %ecx
> +       TESTEQ  %ecx
> +# if CHAR_PER_VEC <= 16
> +       sall    $CHAR_PER_VEC, %LOOP_REG
> +       orl     %ecx, %LOOP_REG
>  # else
> -       incl    %ecx
> +       salq    $CHAR_PER_VEC, %LOOP_REG64
> +       orq     %rcx, %LOOP_REG64
> +# endif
> +L(return_vec_3_end):
> +       /* LOOP_REG contains matches for null/mismatch from the loop. If
> +          VEC 0, 1, and 2 all have no null and no mismatches then the
> +          mismatch must be entirely from VEC 3 which is fully represented
> +          by LOOP_REG.  */
> +# if CHAR_PER_VEC <= 16
> +       tzcntl  %LOOP_REG, %LOOP_REG
> +# else
> +       tzcntq  %LOOP_REG64, %LOOP_REG64
> +# endif
> +# ifdef USE_AS_STRNCMP
> +       cmpq    %LOOP_REG64, %rdx
> +       jbe     L(ret_zero_end)
>  # endif
> -       je      L(test_3_vec)
> -       tzcntl  %ecx, %edi
> +
>  # ifdef USE_AS_WCSCMP
> -       /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> -       sall    $2, %edi
> +       movl    (VEC_SIZE * 2)(%rdi, %LOOP_REG64, SIZE_OF_CHAR), %ecx
> +       xorl    %eax, %eax
> +       cmpl    (VEC_SIZE * 2)(%rsi, %LOOP_REG64, SIZE_OF_CHAR), %ecx
> +       je      L(ret5)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
> +# else
> +       movzbl  (VEC_SIZE * 2)(%rdi, %LOOP_REG64), %eax
> +       movzbl  (VEC_SIZE * 2)(%rsi, %LOOP_REG64), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> +L(ret5):
> +       ret
> +
>  # ifdef USE_AS_STRNCMP
> -       addq    $(VEC_SIZE * 2), %rdi
> -       cmpq    %rdi, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +       .p2align 4,, 2
> +L(ret_zero_end):
>         xorl    %eax, %eax
> -       movl    (%rsi, %rdi), %ecx
> -       cmpl    (%rdx, %rdi), %ecx
> -       jne     L(wcscmp_return)
> +       ret
> +# endif
> +
> +
> +       /* The L(return_vec_N_end) labels differ from L(return_vec_N) in
> +          that they use the value of `r8` to negate the return value.
> +          This is because the page cross logic can swap `rdi` and `rsi`.  */
> +       .p2align 4,, 10
> +# ifdef USE_AS_STRNCMP
> +L(return_vec_1_end):
> +#  if CHAR_PER_VEC <= 16
> +       sall    $CHAR_PER_VEC, %ecx
>  #  else
> -       movzbl  (%rax, %rdi), %eax
> -       movzbl  (%rdx, %rdi), %edx
> -       subl    %edx, %eax
> +       salq    $CHAR_PER_VEC, %rcx
>  #  endif
> +# endif
> +L(return_vec_0_end):
> +# if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP)
> +       tzcntl  %ecx, %ecx
>  # else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (VEC_SIZE * 2)(%rsi, %rdi), %ecx
> -       cmpl    (VEC_SIZE * 2)(%rdx, %rdi), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (VEC_SIZE * 2)(%rax, %rdi), %eax
> -       movzbl  (VEC_SIZE * 2)(%rdx, %rdi), %edx
> -       subl    %edx, %eax
> -#  endif
> +       tzcntq  %rcx, %rcx
>  # endif
> -       ret
>
> -       .p2align 4
> -L(test_3_vec):
>  # ifdef USE_AS_STRNCMP
> -       /* The first 3 vectors matched.  Return 0 if the maximum offset
> -          (%r11) <= 3 * VEC_SIZE.  */
> -       cmpq    $(VEC_SIZE * 3), %r11
> -       jbe     L(zero)
> +       cmpq    %rcx, %rdx
> +       jbe     L(ret_zero_end)
>  # endif
> -       /* Each bit set in K1 represents a non-null CHAR in YMM6.  */
> -       VPTESTM %YMM6, %YMM6, %k1
> -       /* Each bit cleared in K0 represents a mismatch or a null CHAR
> -          in YMM6 and (VEC_SIZE * 3)(%rdx).  */
> -       VPCMP   $0, %YMMZERO, %YMM7, %k0{%k1}
> -       kmovd   %k0, %ecx
> +
>  # ifdef USE_AS_WCSCMP
> -       subl    $0xff, %ecx
> +       movl    (%rdi, %rcx, SIZE_OF_CHAR), %edx
> +       xorl    %eax, %eax
> +       cmpl    (%rsi, %rcx, SIZE_OF_CHAR), %edx
> +       je      L(ret6)
> +       setl    %al
> +       negl    %eax
> +       /* This is the non-zero case for `eax` so just xorl with `r8d`
> +          to flip it if `rdi` and `rsi` were swapped.  */
> +       xorl    %r8d, %eax
>  # else
> -       incl    %ecx
> +       movzbl  (%rdi, %rcx), %eax
> +       movzbl  (%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +       /* Flip `eax` if `rdi` and `rsi` were swapped in the page cross
> +          logic. Subtract `r8d` after the xor for the zero case.  */
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> +L(ret6):
> +       ret
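
(Aside: the xorl/subl pair above is a branch-free conditional negate: with %r8d == 0 it is a no-op, with %r8d == -1 it negates the difference, so the sign stays correct when the page-cross path swapped the pointers. A sketch:)

  /* Sketch only: branchless conditional negation of the byte difference
     when rdi/rsi may have been swapped (r8 is 0 or -1 here).  */
  #include <stdint.h>

  static int32_t fixup_sign (int32_t diff, int32_t r8)
  {
    return (diff ^ r8) - r8;   /* r8 == -1: -diff;  r8 == 0: diff */
  }
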
> +
> +# ifndef USE_AS_STRNCMP
> +       .p2align 4,, 10
> +L(return_vec_1_end):
>         tzcntl  %ecx, %ecx
> -# ifdef USE_AS_WCSCMP
> -       /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> -       sall    $2, %ecx
> -# endif
> -# ifdef USE_AS_STRNCMP
> -       addq    $(VEC_SIZE * 3), %rcx
> -       cmpq    %rcx, %r11
> -       jbe     L(zero)
>  #  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +       movl    VEC_SIZE(%rdi, %rcx, SIZE_OF_CHAR), %edx
>         xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %esi
> -       cmpl    (%rdx, %rcx), %esi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> -# else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (VEC_SIZE * 3)(%rsi, %rcx), %esi
> -       cmpl    (VEC_SIZE * 3)(%rdx, %rcx), %esi
> -       jne     L(wcscmp_return)
> +       cmpl    VEC_SIZE(%rsi, %rcx, SIZE_OF_CHAR), %edx
> +       je      L(ret7)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
>  #  else
> -       movzbl  (VEC_SIZE * 3)(%rax, %rcx), %eax
> -       movzbl  (VEC_SIZE * 3)(%rdx, %rcx), %edx
> -       subl    %edx, %eax
> +       movzbl  VEC_SIZE(%rdi, %rcx), %eax
> +       movzbl  VEC_SIZE(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  #  endif
> -# endif
> +L(ret7):
>         ret
> -
> -       .p2align 4
> -L(loop_cross_page):
> -       xorl    %r10d, %r10d
> -       movq    %rdx, %rcx
> -       /* Align load via RDX.  We load the extra ECX bytes which should
> -          be ignored.  */
> -       andl    $((VEC_SIZE * 4) - 1), %ecx
> -       /* R10 is -RCX.  */
> -       subq    %rcx, %r10
> -
> -       /* This works only if VEC_SIZE * 2 == 64. */
> -# if (VEC_SIZE * 2) != 64
> -#  error (VEC_SIZE * 2) != 64
>  # endif
>
> -       /* Check if the first VEC_SIZE * 2 bytes should be ignored.  */
> -       cmpl    $(VEC_SIZE * 2), %ecx
> -       jge     L(loop_cross_page_2_vec)
>
> -       VMOVU   (%rax, %r10), %YMM2
> -       VMOVU   VEC_SIZE(%rax, %r10), %YMM3
> +       /* Page cross in rsi in next 4x VEC.  */
>
> -       /* Each bit set in K2 represents a non-null CHAR in YMM2.  */
> -       VPTESTM %YMM2, %YMM2, %k2
> -       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> -          in YMM2 and 32 bytes at (%rdx, %r10).  */
> -       VPCMP   $0, (%rdx, %r10), %YMM2, %k1{%k2}
> -       kmovd   %k1, %r9d
> -       /* Don't use subl since it is the lower 16/32 bits of RDI
> -          below.  */
> -       notl    %r9d
> -# ifdef USE_AS_WCSCMP
> -       /* Only last 8 bits are valid.  */
> -       andl    $0xff, %r9d
> -# endif
> +       /* TODO: Improve logic here.  */
> +       .p2align 4,, 10
> +L(page_cross_during_loop):
> +       /* eax contains [distance_from_page - (VEC_SIZE * 4)].  */
>
> -       /* Each bit set in K4 represents a non-null CHAR in YMM3.  */
> -       VPTESTM %YMM3, %YMM3, %k4
> -       /* Each bit cleared in K3 represents a mismatch or a null CHAR
> -          in YMM3 and 32 bytes at VEC_SIZE(%rdx, %r10).  */
> -       VPCMP   $0, VEC_SIZE(%rdx, %r10), %YMM3, %k3{%k4}
> -       kmovd   %k3, %edi
> -    /* Must use notl %edi here as lower bits are for CHAR
> -          comparisons potentially out of range thus can be 0 without
> -          indicating mismatch.  */
> -       notl    %edi
> -# ifdef USE_AS_WCSCMP
> -       /* Don't use subl since it is the upper 8 bits of EDI below.  */
> -       andl    $0xff, %edi
> -# endif
> +       /* Optimistically rsi and rdi are both aligned, in which case we
> +          don't need any logic here.  */
> +       cmpl    $-(VEC_SIZE * 4), %eax
> +       /* eax is not adjusted before jumping back to the loop, so we
> +          will never hit the page cross case again.  */
> +       je      L(loop_skip_page_cross_check)
>
> -# ifdef USE_AS_WCSCMP
> -       /* NB: Each bit in EDI/R9D represents 4-byte element.  */
> -       sall    $8, %edi
> -       /* NB: Divide shift count by 4 since each bit in K1 represent 4
> -          bytes.  */
> -       movl    %ecx, %SHIFT_REG32
> -       sarl    $2, %SHIFT_REG32
> -
> -       /* Each bit in EDI represents a null CHAR or a mismatch.  */
> -       orl     %r9d, %edi
> -# else
> -       salq    $32, %rdi
> +       /* Check if we can safely load a VEC.  */
> +       cmpl    $-(VEC_SIZE * 3), %eax
> +       jle     L(less_1x_vec_till_page_cross)
>
> -       /* Each bit in RDI represents a null CHAR or a mismatch.  */
> -       orq     %r9, %rdi
> -# endif
> +       VMOVA   (%rdi), %YMM0
> +       VPTESTM %YMM0, %YMM0, %k2
> +       VPCMP   $0, (%rsi), %YMM0, %k1{%k2}
> +       kmovd   %k1, %ecx
> +       TESTEQ  %ecx
> +       jnz     L(return_vec_0_end)
> +
> +       /* if distance >= 2x VEC then eax > -(VEC_SIZE * 2).  */
> +       cmpl    $-(VEC_SIZE * 2), %eax
> +       jg      L(more_2x_vec_till_page_cross)
> +
> +       .p2align 4,, 4
> +L(less_1x_vec_till_page_cross):
> +       subl    $-(VEC_SIZE * 4), %eax
> +       /* Guaranteed safe to read from rdi - VEC_SIZE here. The only
> +          concerning case is first iteration if incoming s1 was near start
> +          of a page and s2 near end. If s1 was near the start of the page
> +          we already aligned up to nearest VEC_SIZE * 4 so guaranteed safe
> +          to read back -VEC_SIZE. If rdi is truly at the start of a page
> +          here, it means the previous page (rdi - VEC_SIZE) has already
> +          been loaded earlier so must be valid.  */
> +       VMOVU   -VEC_SIZE(%rdi, %rax), %YMM0
> +       VPTESTM %YMM0, %YMM0, %k2
> +       VPCMP   $0, -VEC_SIZE(%rsi, %rax), %YMM0, %k1{%k2}
> +
> +       /* Mask of potentially valid bits. The lower bits can come from
> +          out-of-range comparisons (but are safe regarding page crosses).  */
>
> -       /* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
> -       shrxq   %SHIFT_REG64, %rdi, %rdi
> -       testq   %rdi, %rdi
> -       je      L(loop_cross_page_2_vec)
> -       tzcntq  %rdi, %rcx
>  # ifdef USE_AS_WCSCMP
> -       /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> -       sall    $2, %ecx
> +       movl    $-1, %r10d
> +       movl    %esi, %ecx
> +       andl    $(VEC_SIZE - 1), %ecx
> +       shrl    $2, %ecx
> +       shlxl   %ecx, %r10d, %ecx
> +       movzbl  %cl, %r10d
> +# else
> +       movl    $-1, %ecx
> +       shlxl   %esi, %ecx, %r10d
>  # endif
> +
> +       kmovd   %k1, %ecx
> +       notl    %ecx
> +
> +
>  # ifdef USE_AS_STRNCMP
> -       cmpq    %rcx, %r11
> -       jbe     L(zero)
>  #  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> +       movl    %eax, %r11d
> +       shrl    $2, %r11d
> +       cmpq    %r11, %rdx
>  #  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> +       cmpq    %rax, %rdx
>  #  endif
> +       jbe     L(return_page_cross_end_check)
> +# endif
> +       movl    %eax, %OFFSET_REG
> +
> +       /* Readjust eax before potentially returning to the loop.  */
> +       addl    $(PAGE_SIZE - VEC_SIZE * 4), %eax
> +
> +       andl    %r10d, %ecx
> +       jz      L(loop_skip_page_cross_check)
> +
> +       .p2align 4,, 3
> +L(return_page_cross_end):
> +       tzcntl  %ecx, %ecx
> +
> +# if (defined USE_AS_STRNCMP) || (defined USE_AS_WCSCMP)
> +       leal    -VEC_SIZE(%OFFSET_REG64, %rcx, SIZE_OF_CHAR), %ecx
> +L(return_page_cross_cmp_mem):
>  # else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +       addl    %OFFSET_REG, %ecx
> +# endif
> +# ifdef USE_AS_WCSCMP
> +       movl    VEC_OFFSET(%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    VEC_OFFSET(%rsi, %rcx), %edx
> +       je      L(ret8)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
> +# else
> +       movzbl  VEC_OFFSET(%rdi, %rcx), %eax
> +       movzbl  VEC_OFFSET(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> +L(ret8):
>         ret
>
> -       .p2align 4
> -L(loop_cross_page_2_vec):
> -       /* The first VEC_SIZE * 2 bytes match or are ignored.  */
> -       VMOVU   (VEC_SIZE * 2)(%rax, %r10), %YMM0
> -       VMOVU   (VEC_SIZE * 3)(%rax, %r10), %YMM1
> +# ifdef USE_AS_STRNCMP
> +       .p2align 4,, 10
> +L(return_page_cross_end_check):
> +       tzcntl  %ecx, %ecx
> +       leal    -VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx
> +#  ifdef USE_AS_WCSCMP
> +       sall    $2, %edx
> +#  endif
> +       cmpl    %ecx, %edx
> +       ja      L(return_page_cross_cmp_mem)
> +       xorl    %eax, %eax
> +       ret
> +# endif
> +
>
> +       .p2align 4,, 10
> +L(more_2x_vec_till_page_cross):
> +       /* If more than 2x VEC till the page cross we will complete a
> +          full loop iteration here.  */
> +
> +       VMOVA   VEC_SIZE(%rdi), %YMM0
>         VPTESTM %YMM0, %YMM0, %k2
> -       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> -          in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rdx, %r10).  */
> -       VPCMP   $0, (VEC_SIZE * 2)(%rdx, %r10), %YMM0, %k1{%k2}
> -       kmovd   %k1, %r9d
> -       /* Don't use subl since it is the lower 16/32 bits of RDI
> -          below.  */
> -       notl    %r9d
> -# ifdef USE_AS_WCSCMP
> -       /* Only last 8 bits are valid.  */
> -       andl    $0xff, %r9d
> -# endif
> +       VPCMP   $0, VEC_SIZE(%rsi), %YMM0, %k1{%k2}
> +       kmovd   %k1, %ecx
> +       TESTEQ  %ecx
> +       jnz     L(return_vec_1_end)
>
> -       VPTESTM %YMM1, %YMM1, %k4
> -       /* Each bit cleared in K3 represents a mismatch or a null CHAR
> -          in YMM1 and 32 bytes at (VEC_SIZE * 3)(%rdx, %r10).  */
> -       VPCMP   $0, (VEC_SIZE * 3)(%rdx, %r10), %YMM1, %k3{%k4}
> -       kmovd   %k3, %edi
> -       /* Must use notl %edi here as lower bits are for CHAR
> -          comparisons potentially out of range thus can be 0 without
> -          indicating mismatch.  */
> -       notl    %edi
> -# ifdef USE_AS_WCSCMP
> -       /* Don't use subl since it is the upper 8 bits of EDI below.  */
> -       andl    $0xff, %edi
> +# ifdef USE_AS_STRNCMP
> +       cmpq    $(CHAR_PER_VEC * 2), %rdx
> +       jbe     L(ret_zero_in_loop_page_cross)
>  # endif
>
> -# ifdef USE_AS_WCSCMP
> -       /* NB: Each bit in EDI/R9D represents 4-byte element.  */
> -       sall    $8, %edi
> +       subl    $-(VEC_SIZE * 4), %eax
>
> -       /* Each bit in EDI represents a null CHAR or a mismatch.  */
> -       orl     %r9d, %edi
> -# else
> -       salq    $32, %rdi
> +       /* Safe to include comparisons from lower bytes.  */
> +       VMOVU   -(VEC_SIZE * 2)(%rdi, %rax), %YMM0
> +       VPTESTM %YMM0, %YMM0, %k2
> +       VPCMP   $0, -(VEC_SIZE * 2)(%rsi, %rax), %YMM0, %k1{%k2}
> +       kmovd   %k1, %ecx
> +       TESTEQ  %ecx
> +       jnz     L(return_vec_page_cross_0)
> +
> +       VMOVU   -(VEC_SIZE * 1)(%rdi, %rax), %YMM0
> +       VPTESTM %YMM0, %YMM0, %k2
> +       VPCMP   $0, -(VEC_SIZE * 1)(%rsi, %rax), %YMM0, %k1{%k2}
> +       kmovd   %k1, %ecx
> +       TESTEQ  %ecx
> +       jnz     L(return_vec_page_cross_1)
>
> -       /* Each bit in RDI represents a null CHAR or a mismatch.  */
> -       orq     %r9, %rdi
> +# ifdef USE_AS_STRNCMP
> +       /* Must check length here as length might preclude reading next
> +          page.  */
> +#  ifdef USE_AS_WCSCMP
> +       movl    %eax, %r11d
> +       shrl    $2, %r11d
> +       cmpq    %r11, %rdx
> +#  else
> +       cmpq    %rax, %rdx
> +#  endif
> +       jbe     L(ret_zero_in_loop_page_cross)
>  # endif
>
> -       xorl    %r8d, %r8d
> -       /* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
> -       subl    $(VEC_SIZE * 2), %ecx
> -       jle     1f
> -       /* R8 has number of bytes skipped.  */
> -       movl    %ecx, %r8d
> -# ifdef USE_AS_WCSCMP
> -       /* NB: Divide shift count by 4 since each bit in RDI represent 4
> -          bytes.  */
> -       sarl    $2, %ecx
> -       /* Skip ECX bytes.  */
> -       shrl    %cl, %edi
> +       /* Finish the loop.  */
> +       VMOVA   (VEC_SIZE * 2)(%rdi), %YMM4
> +       VMOVA   (VEC_SIZE * 3)(%rdi), %YMM6
> +       VPMINU  %YMM4, %YMM6, %YMM9
> +       VPTESTM %YMM9, %YMM9, %k1
> +
> +       vpxorq  (VEC_SIZE * 2)(%rsi), %YMM4, %YMM5
> +       /* YMM6 = YMM5 | ((VEC_SIZE * 3)(%rsi) ^ YMM6).  */
> +       vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM5, %YMM6
> +
> +       VPCMP   $0, %YMMZERO, %YMM6, %k0{%k1}
> +       kmovd   %k0, %LOOP_REG
> +       TESTEQ  %LOOP_REG
> +       jnz     L(return_vec_2_3_end)
> +
> +       /* Best for code size to use an unconditional jmp here. If this
> +          case is hot it would be faster to duplicate the
> +          L(return_vec_2_3_end) code as the fall-through and jump back
> +          to the loop on the mismatch comparison.  */
> +       subq    $-(VEC_SIZE * 4), %rdi
> +       subq    $-(VEC_SIZE * 4), %rsi
> +       addl    $(PAGE_SIZE - VEC_SIZE * 8), %eax
> +# ifdef USE_AS_STRNCMP
> +       subq    $(CHAR_PER_VEC * 4), %rdx
> +       ja      L(loop_skip_page_cross_check)
> +L(ret_zero_in_loop_page_cross):
> +       xorl    %eax, %eax
> +       ret
>  # else
> -       /* Skip ECX bytes.  */
> -       shrq    %cl, %rdi
> +       jmp     L(loop_skip_page_cross_check)
>  # endif
> -1:
> -       /* Before jumping back to the loop, set ESI to the number of
> -          VEC_SIZE * 4 blocks before page crossing.  */
> -       movl    $(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
>
> -       testq   %rdi, %rdi
> -# ifdef USE_AS_STRNCMP
> -       /* At this point, if %rdi value is 0, it already tested
> -          VEC_SIZE*4+%r10 byte starting from %rax. This label
> -          checks whether strncmp maximum offset reached or not.  */
> -       je      L(string_nbyte_offset_check)
> +
> +       .p2align 4,, 10
> +L(return_vec_page_cross_0):
> +       addl    $-VEC_SIZE, %eax
> +L(return_vec_page_cross_1):
> +       tzcntl  %ecx, %ecx
> +# if defined USE_AS_STRNCMP || defined USE_AS_WCSCMP
> +       leal    -VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx
> +#  ifdef USE_AS_STRNCMP
> +#   ifdef USE_AS_WCSCMP
> +       /* Must divide ecx instead of multiplying rdx due to overflow.  */
> +       movl    %ecx, %eax
> +       shrl    $2, %eax
> +       cmpq    %rax, %rdx
> +#   else
> +       cmpq    %rcx, %rdx
> +#   endif
> +       jbe     L(ret_zero_in_loop_page_cross)
> +#  endif
>  # else
> -       je      L(back_to_loop)
> +       addl    %eax, %ecx
>  # endif
> -       tzcntq  %rdi, %rcx
> +
>  # ifdef USE_AS_WCSCMP
> -       /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> -       sall    $2, %ecx
> -# endif
> -       addq    %r10, %rcx
> -       /* Adjust for number of bytes skipped.  */
> -       addq    %r8, %rcx
> -# ifdef USE_AS_STRNCMP
> -       addq    $(VEC_SIZE * 2), %rcx
> -       subq    %rcx, %r11
> -       jbe     L(zero)
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> +       movl    VEC_OFFSET(%rdi, %rcx), %edx
>         xorl    %eax, %eax
> -       movl    (%rsi, %rcx), %edi
> -       cmpl    (%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rax, %rcx), %eax
> -       movzbl  (%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       cmpl    VEC_OFFSET(%rsi, %rcx), %edx
> +       je      L(ret9)
> +       setl    %al
> +       negl    %eax
> +       xorl    %r8d, %eax
>  # else
> -#  ifdef USE_AS_WCSCMP
> -       movq    %rax, %rsi
> -       xorl    %eax, %eax
> -       movl    (VEC_SIZE * 2)(%rsi, %rcx), %edi
> -       cmpl    (VEC_SIZE * 2)(%rdx, %rcx), %edi
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (VEC_SIZE * 2)(%rax, %rcx), %eax
> -       movzbl  (VEC_SIZE * 2)(%rdx, %rcx), %edx
> -       subl    %edx, %eax
> -#  endif
> +       movzbl  VEC_OFFSET(%rdi, %rcx), %eax
> +       movzbl  VEC_OFFSET(%rsi, %rcx), %ecx
> +       subl    %ecx, %eax
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> +L(ret9):
>         ret
>
> -# ifdef USE_AS_STRNCMP
> -L(string_nbyte_offset_check):
> -       leaq    (VEC_SIZE * 4)(%r10), %r10
> -       cmpq    %r10, %r11
> -       jbe     L(zero)
> -       jmp     L(back_to_loop)
> +
> +       .p2align 4,, 10
> +L(page_cross):
> +# ifndef USE_AS_STRNCMP
> +       /* If both are VEC aligned we don't need any special logic here.
> +          Only valid for strcmp where stop condition is guaranteed to be
> +          reachable by just reading memory.  */
> +       testl   $((VEC_SIZE - 1) << 20), %eax
> +       jz      L(no_page_cross)
>  # endif
>
> -       .p2align 4
> -L(cross_page_loop):
> -       /* Check one byte/dword at a time.  */
> +       movl    %edi, %eax
> +       movl    %esi, %ecx
> +       andl    $(PAGE_SIZE - 1), %eax
> +       andl    $(PAGE_SIZE - 1), %ecx
> +
> +       xorl    %OFFSET_REG, %OFFSET_REG
> +
> +       /* Check which is closer to page cross, s1 or s2.  */
> +       cmpl    %eax, %ecx
> +       jg      L(page_cross_s2)
> +
> +       /* The previous page cross check has false positives. Check for
> +          a true positive as the page cross logic is very expensive.  */
> +       subl    $(PAGE_SIZE - VEC_SIZE * 4), %eax
> +       jbe     L(no_page_cross)
> +
> +
> +       /* Set r8 to not interfere with normal return value (rdi and rsi
> +          did not swap).  */
>  # ifdef USE_AS_WCSCMP
> -       cmpl    %ecx, %eax
> +       /* Any non-zero positive value that doesn't interfere with 0x1.
> +        */
> +       movl    $2, %r8d
>  # else
> -       subl    %ecx, %eax
> +       xorl    %r8d, %r8d
>  # endif
> -       jne     L(different)
> -       addl    $SIZE_OF_CHAR, %edx
> -       cmpl    $(VEC_SIZE * 4), %edx
> -       je      L(main_loop_header)
> +
> +       /* Check if less than 1x VEC till page cross.  */
> +       subl    $(VEC_SIZE * 3), %eax
> +       jg      L(less_1x_vec_till_page)
> +
> +
> +       /* If more than 1x VEC till page cross, loop through safely
> +          loadable memory until within 1x VEC of page cross.  */
> +       .p2align 4,, 8
> +L(page_cross_loop):
> +       VMOVU   (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0
> +       VPTESTM %YMM0, %YMM0, %k2
> +       VPCMP   $0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0, %k1{%k2}
> +       kmovd   %k1, %ecx
> +       TESTEQ  %ecx
> +       jnz     L(check_ret_vec_page_cross)
> +       addl    $CHAR_PER_VEC, %OFFSET_REG
>  # ifdef USE_AS_STRNCMP
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       cmpq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross)
>  # endif
> +       addl    $VEC_SIZE, %eax
> +       jl      L(page_cross_loop)
> +
>  # ifdef USE_AS_WCSCMP
> -       movl    (%rdi, %rdx), %eax
> -       movl    (%rsi, %rdx), %ecx
> -# else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %ecx
> +       shrl    $2, %eax
>  # endif
> -       /* Check null CHAR.  */
> -       testl   %eax, %eax
> -       jne     L(cross_page_loop)
> -       /* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
> -          comparisons.  */
> -       subl    %ecx, %eax
> -# ifndef USE_AS_WCSCMP
> -L(different):
> +
> +
> +       subl    %eax, %OFFSET_REG
> +       /* OFFSET_REG has distance to page cross - VEC_SIZE. Guaranteed
> +          to not cross page so it is safe to load. Since we have already
> +          loaded at least 1 VEC from rsi it is also guaranteed to be safe.
> +        */
> +       VMOVU   (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0
> +       VPTESTM %YMM0, %YMM0, %k2
> +       VPCMP   $0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0, %k1{%k2}
> +
> +       kmovd   %k1, %ecx
> +# ifdef USE_AS_STRNCMP
> +       leal    CHAR_PER_VEC(%OFFSET_REG64), %eax
> +       cmpq    %rax, %rdx
> +       jbe     L(check_ret_vec_page_cross2)
> +#  ifdef USE_AS_WCSCMP
> +       addq    $-(CHAR_PER_VEC * 2), %rdx
> +#  else
> +       addq    %rdi, %rdx
> +#  endif
>  # endif
> -       ret
> +       TESTEQ  %ecx
> +       jz      L(prepare_loop_no_len)
>
> +       .p2align 4,, 4
> +L(ret_vec_page_cross):
> +# ifndef USE_AS_STRNCMP
> +L(check_ret_vec_page_cross):
> +# endif
> +       tzcntl  %ecx, %ecx
> +       addl    %OFFSET_REG, %ecx
> +L(ret_vec_page_cross_cont):
>  # ifdef USE_AS_WCSCMP
> -       .p2align 4
> -L(different):
> -       /* Use movl to avoid modifying EFLAGS.  */
> -       movl    $0, %eax
> +       movl    (%rdi, %rcx, SIZE_OF_CHAR), %edx
> +       xorl    %eax, %eax
> +       cmpl    (%rsi, %rcx, SIZE_OF_CHAR), %edx
> +       je      L(ret12)
>         setl    %al
>         negl    %eax
> -       orl     $1, %eax
> -       ret
> +       xorl    %r8d, %eax
> +# else
> +       movzbl  (%rdi, %rcx, SIZE_OF_CHAR), %eax
> +       movzbl  (%rsi, %rcx, SIZE_OF_CHAR), %ecx
> +       subl    %ecx, %eax
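> +       /* With r8 == 0 the xor/sub pair below is a no-op; with r8 == -1
> +          it computes -eax (two's complement negate), undoing the rdi/rsi
> +          swap done by the page cross logic.  */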
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>  # endif
> +L(ret12):
> +       ret
> +
>
>  # ifdef USE_AS_STRNCMP
> -       .p2align 4
> -L(zero):
> +       .p2align 4,, 10
> +L(check_ret_vec_page_cross2):
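> +       /* TESTEQ is used here for its side effect of converting ecx from
> +          an equality mask into a mismatch mask for the shared tzcnt
> +          below, not for its flags.  */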
> +       TESTEQ  %ecx
> +L(check_ret_vec_page_cross):
> +       tzcntl  %ecx, %ecx
> +       addl    %OFFSET_REG, %ecx
> +       cmpq    %rcx, %rdx
> +       ja      L(ret_vec_page_cross_cont)
> +       .p2align 4,, 2
> +L(ret_zero_page_cross):
>         xorl    %eax, %eax
>         ret
> +# endif
>
> -       .p2align 4
> -L(char0):
> -#  ifdef USE_AS_WCSCMP
> -       xorl    %eax, %eax
> -       movl    (%rdi), %ecx
> -       cmpl    (%rsi), %ecx
> -       jne     L(wcscmp_return)
> -#  else
> -       movzbl  (%rsi), %ecx
> -       movzbl  (%rdi), %eax
> -       subl    %ecx, %eax
> -#  endif
> -       ret
> +       .p2align 4,, 4
> +L(page_cross_s2):
> +       /* Ensure this is a true page cross.  */
> +       subl    $(PAGE_SIZE - VEC_SIZE * 4), %ecx
> +       jbe     L(no_page_cross)
> +
> +
> +       movl    %ecx, %eax
> +       movq    %rdi, %rcx
> +       movq    %rsi, %rdi
> +       movq    %rcx, %rsi
> +
> +       /* Set r8 to negate the return value as rdi and rsi swapped.  */
> +# ifdef USE_AS_WCSCMP
> +       movl    $-4, %r8d
> +# else
> +       movl    $-1, %r8d
>  # endif
> +       xorl    %OFFSET_REG, %OFFSET_REG
>
> -       .p2align 4
> -L(last_vector):
> -       addq    %rdx, %rdi
> -       addq    %rdx, %rsi
> -# ifdef USE_AS_STRNCMP
> -       subq    %rdx, %r11
> +       /* Check if more than 1x VEC till page cross.  */
> +       subl    $(VEC_SIZE * 3), %eax
> +       jle     L(page_cross_loop)
> +
> +       .p2align 4,, 6
> +L(less_1x_vec_till_page):
> +# ifdef USE_AS_WCSCMP
> +       shrl    $2, %eax
>  # endif
> -       tzcntl  %ecx, %edx
> +       /* Find largest load size we can use.  */
> +       cmpl    $(16 / SIZE_OF_CHAR), %eax
> +       ja      L(less_16_till_page)
> +
> +       /* Use 16 byte comparison.  */
> +       vmovdqu (%rdi), %xmm0
> +       VPTESTM %xmm0, %xmm0, %k2
> +       VPCMP   $0, (%rsi), %xmm0, %k1{%k2}
> +       kmovd   %k1, %ecx
>  # ifdef USE_AS_WCSCMP
> -       /* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
> -       sall    $2, %edx
> +       subl    $0xf, %ecx
> +# else
> +       incw    %cx
>  # endif
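> +       /* If every CHAR compared equal and non-null the mask is all 1s
> +          (0xffff for bytes, 0xf for wchars) and the incw/subl above turns
> +          it into zero, so the jnz below is not taken.  */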
> +       jnz     L(check_ret_vec_page_cross)
> +       movl    $(16 / SIZE_OF_CHAR), %OFFSET_REG
>  # ifdef USE_AS_STRNCMP
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       cmpq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case0)
> +       subl    %eax, %OFFSET_REG
> +# else
> +       /* Explicit check for 16 byte alignment.  */
> +       subl    %eax, %OFFSET_REG
> +       jz      L(prepare_loop)
>  # endif
> +       vmovdqu (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0
> +       VPTESTM %xmm0, %xmm0, %k2
> +       VPCMP   $0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0, %k1{%k2}
> +       kmovd   %k1, %ecx
>  # ifdef USE_AS_WCSCMP
> -       xorl    %eax, %eax
> -       movl    (%rdi, %rdx), %ecx
> -       cmpl    (%rsi, %rdx), %ecx
> -       jne     L(wcscmp_return)
> +       subl    $0xf, %ecx
>  # else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %edx
> -       subl    %edx, %eax
> +       incw    %cx
>  # endif
> +       jnz     L(check_ret_vec_page_cross)
> +# ifdef USE_AS_STRNCMP
> +       addl    $(16 / SIZE_OF_CHAR), %OFFSET_REG
> +       subq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case0)
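> +       /* Bias rdi/rsi back by 4x VEC and credit rdx with 4x VEC worth of
> +          CHARs so the loop entry's own pointer advance and length
> +          decrement land on the current position and remaining length.  */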
> +       subq    $-(CHAR_PER_VEC * 4), %rdx
> +
> +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
> +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
> +# else
> +       leaq    (16 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
> +       leaq    (16 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
> +# endif
> +       jmp     L(prepare_loop_aligned)
> +
> +# ifdef USE_AS_STRNCMP
> +       .p2align 4,, 2
> +L(ret_zero_page_cross_slow_case0):
> +       xorl    %eax, %eax
>         ret
> +# endif
>
> -       /* Comparing on page boundary region requires special treatment:
> -          It must done one vector at the time, starting with the wider
> -          ymm vector if possible, if not, with xmm. If fetching 16 bytes
> -          (xmm) still passes the boundary, byte comparison must be done.
> -        */
> -       .p2align 4
> -L(cross_page):
> -       /* Try one ymm vector at a time.  */
> -       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> -       jg      L(cross_page_1_vector)
> -L(loop_1_vector):
> -       VMOVU   (%rdi, %rdx), %YMM0
>
> -       VPTESTM %YMM0, %YMM0, %k2
> -       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> -          in YMM0 and 32 bytes at (%rsi, %rdx).  */
> -       VPCMP   $0, (%rsi, %rdx), %YMM0, %k1{%k2}
> +       .p2align 4,, 10
> +L(less_16_till_page):
> +       cmpl    $(24 / SIZE_OF_CHAR), %eax
> +       ja      L(less_8_till_page)
> +
> +       /* Use 8 byte comparison.  */
> +       vmovq   (%rdi), %xmm0
> +       vmovq   (%rsi), %xmm1
> +       VPTESTM %xmm0, %xmm0, %k2
> +       VPCMP   $0, %xmm1, %xmm0, %k1{%k2}
>         kmovd   %k1, %ecx
>  # ifdef USE_AS_WCSCMP
> -       subl    $0xff, %ecx
> +       subl    $0x3, %ecx
>  # else
> -       incl    %ecx
> +       incb    %cl
>  # endif
> -       jne     L(last_vector)
> +       jnz     L(check_ret_vec_page_cross)
>
> -       addl    $VEC_SIZE, %edx
>
> -       addl    $VEC_SIZE, %eax
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> -          (%r11).  */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       cmpq    $(8 / SIZE_OF_CHAR), %rdx
> +       jbe     L(ret_zero_page_cross_slow_case0)
>  # endif
> -       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> -       jle     L(loop_1_vector)
> -L(cross_page_1_vector):
> -       /* Less than 32 bytes to check, try one xmm vector.  */
> -       cmpl    $(PAGE_SIZE - 16), %eax
> -       jg      L(cross_page_1_xmm)
> -       VMOVU   (%rdi, %rdx), %XMM0
> +       movl    $(24 / SIZE_OF_CHAR), %OFFSET_REG
> +       subl    %eax, %OFFSET_REG
>
> -       VPTESTM %YMM0, %YMM0, %k2
> -       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> -          in XMM0 and 16 bytes at (%rsi, %rdx).  */
> -       VPCMP   $0, (%rsi, %rdx), %XMM0, %k1{%k2}
> +       vmovq   (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0
> +       vmovq   (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm1
> +       VPTESTM %xmm0, %xmm0, %k2
> +       VPCMP   $0, %xmm1, %xmm0, %k1{%k2}
>         kmovd   %k1, %ecx
>  # ifdef USE_AS_WCSCMP
> -       subl    $0xf, %ecx
> +       subl    $0x3, %ecx
>  # else
> -       subl    $0xffff, %ecx
> +       incb    %cl
>  # endif
> -       jne     L(last_vector)
> +       jnz     L(check_ret_vec_page_cross)
> +
>
> -       addl    $16, %edx
> -# ifndef USE_AS_WCSCMP
> -       addl    $16, %eax
> -# endif
>  # ifdef USE_AS_STRNCMP
> -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> -          (%r11).  */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       addl    $(8 / SIZE_OF_CHAR), %OFFSET_REG
> +       subq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case0)
> +       subq    $-(CHAR_PER_VEC * 4), %rdx
> +
> +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
> +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
> +# else
> +       leaq    (8 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
> +       leaq    (8 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
>  # endif
> +       jmp     L(prepare_loop_aligned)
>
> -L(cross_page_1_xmm):
> -# ifndef USE_AS_WCSCMP
> -       /* Less than 16 bytes to check, try 8 byte vector.  NB: No need
> -          for wcscmp nor wcsncmp since wide char is 4 bytes.   */
> -       cmpl    $(PAGE_SIZE - 8), %eax
> -       jg      L(cross_page_8bytes)
> -       vmovq   (%rdi, %rdx), %XMM0
> -       vmovq   (%rsi, %rdx), %XMM1
>
> -       VPTESTM %YMM0, %YMM0, %k2
> -       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> -          in XMM0 and XMM1.  */
> -       VPCMP   $0, %XMM1, %XMM0, %k1{%k2}
> -       kmovb   %k1, %ecx
> +
> +
> +       .p2align 4,, 10
> +L(less_8_till_page):
>  # ifdef USE_AS_WCSCMP
> -       subl    $0x3, %ecx
> +       /* If using wchar then this is the only check before we reach
> +          the page boundary.  */
> +       movl    (%rdi), %eax
> +       movl    (%rsi), %ecx
> +       cmpl    %ecx, %eax
> +       jnz     L(ret_less_8_wcs)
> +#  ifdef USE_AS_STRNCMP
> +       addq    $-(CHAR_PER_VEC * 2), %rdx
> +       /* We already checked for len <= 1 so cannot hit that case here.
> +        */
> +#  endif
> +       testl   %eax, %eax
> +       jnz     L(prepare_loop)
> +       ret
> +
> +       .p2align 4,, 8
> +L(ret_less_8_wcs):
> +       setl    %OFFSET_REG8
> +       negl    %OFFSET_REG
> +       movl    %OFFSET_REG, %eax
> +       xorl    %r8d, %eax
> +       ret
> +
>  # else
> -       subl    $0xff, %ecx
> -# endif
> -       jne     L(last_vector)
> +       cmpl    $28, %eax
> +       ja      L(less_4_till_page)
> +
> +       vmovd   (%rdi), %xmm0
> +       vmovd   (%rsi), %xmm1
> +       VPTESTM %xmm0, %xmm0, %k2
> +       VPCMP   $0, %xmm1, %xmm0, %k1{%k2}
> +       kmovd   %k1, %ecx
> +       subl    $0xf, %ecx
> +       jnz     L(check_ret_vec_page_cross)
>
> -       addl    $8, %edx
> -       addl    $8, %eax
>  #  ifdef USE_AS_STRNCMP
> -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> -          (%r11).  */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       cmpq    $4, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case1)
>  #  endif
> +       movl    $(28 / SIZE_OF_CHAR), %OFFSET_REG
> +       subl    %eax, %OFFSET_REG
>
> -L(cross_page_8bytes):
> -       /* Less than 8 bytes to check, try 4 byte vector.  */
> -       cmpl    $(PAGE_SIZE - 4), %eax
> -       jg      L(cross_page_4bytes)
> -       vmovd   (%rdi, %rdx), %XMM0
> -       vmovd   (%rsi, %rdx), %XMM1
> -
> -       VPTESTM %YMM0, %YMM0, %k2
> -       /* Each bit cleared in K1 represents a mismatch or a null CHAR
> -          in XMM0 and XMM1.  */
> -       VPCMP   $0, %XMM1, %XMM0, %k1{%k2}
> +       vmovd   (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0
> +       vmovd   (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm1
> +       VPTESTM %xmm0, %xmm0, %k2
> +       VPCMP   $0, %xmm1, %xmm0, %k1{%k2}
>         kmovd   %k1, %ecx
> -# ifdef USE_AS_WCSCMP
> -       subl    $0x1, %ecx
> -# else
>         subl    $0xf, %ecx
> -# endif
> -       jne     L(last_vector)
> +       jnz     L(check_ret_vec_page_cross)
> +#  ifdef USE_AS_STRNCMP
> +       addl    $(4 / SIZE_OF_CHAR), %OFFSET_REG
> +       subq    %OFFSET_REG64, %rdx
> +       jbe     L(ret_zero_page_cross_slow_case1)
> +       subq    $-(CHAR_PER_VEC * 4), %rdx
> +
> +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
> +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
> +#  else
> +       leaq    (4 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
> +       leaq    (4 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
> +#  endif
> +       jmp     L(prepare_loop_aligned)
> +
>
> -       addl    $4, %edx
>  #  ifdef USE_AS_STRNCMP
> -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> -          (%r11).  */
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> +       .p2align 4,, 2
> +L(ret_zero_page_cross_slow_case1):
> +       xorl    %eax, %eax
> +       ret
>  #  endif
>
> -L(cross_page_4bytes):
> -# endif
> -       /* Less than 4 bytes to check, try one byte/dword at a time.  */
> -# ifdef USE_AS_STRNCMP
> -       cmpq    %r11, %rdx
> -       jae     L(zero)
> -# endif
> -# ifdef USE_AS_WCSCMP
> -       movl    (%rdi, %rdx), %eax
> -       movl    (%rsi, %rdx), %ecx
> -# else
> -       movzbl  (%rdi, %rdx), %eax
> -       movzbl  (%rsi, %rdx), %ecx
> -# endif
> -       testl   %eax, %eax
> -       jne     L(cross_page_loop)
> +       .p2align 4,, 10
> +L(less_4_till_page):
> +       subq    %rdi, %rsi
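> +       /* rsi now holds s2 - s1 so incrementing rdi alone walks both
> +          strings in the loop below.  */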
> +       /* Extremely slow byte comparison loop.  */
> +L(less_4_loop):
> +       movzbl  (%rdi), %eax
> +       movzbl  (%rsi, %rdi), %ecx
>         subl    %ecx, %eax
> +       jnz     L(ret_less_4_loop)
> +       testl   %ecx, %ecx
> +       jz      L(ret_zero_4_loop)
> +#  ifdef USE_AS_STRNCMP
> +       decq    %rdx
> +       jz      L(ret_zero_4_loop)
> +#  endif
> +       incq    %rdi
> +       /* End condition is reaching the page boundary (rdi is aligned).  */
> +       testl   $31, %edi
> +       jnz     L(less_4_loop)
> +       leaq    -(VEC_SIZE * 4)(%rdi, %rsi), %rsi
> +       addq    $-(VEC_SIZE * 4), %rdi
> +#  ifdef USE_AS_STRNCMP
> +       subq    $-(CHAR_PER_VEC * 4), %rdx
> +#  endif
> +       jmp     L(prepare_loop_aligned)
> +
> +L(ret_zero_4_loop):
> +       xorl    %eax, %eax
> +       ret
> +L(ret_less_4_loop):
> +       xorl    %r8d, %eax
> +       subl    %r8d, %eax
>         ret
> -END (STRCMP)
> +# endif
> +END(STRCMP)
>  #endif
> --
> 2.25.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 5/7] x86: Optimize strcmp-avx2.S
  2022-01-10  0:41     ` H.J. Lu
@ 2022-01-10  1:06       ` Noah Goldstein
  2022-01-10  1:58         ` H.J. Lu
  0 siblings, 1 reply; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10  1:06 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library

On Sun, Jan 9, 2022 at 6:41 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Sun, Jan 9, 2022 at 4:31 PM Noah Goldstein via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> > Optimizations are primarily to the loop logic and how the page cross
> > logic interacts with the loop.
> >
> > The page cross logic is at times more expensive for short strings near
> > the end of a page but not crossing the page. This is done to retest
> > the page cross conditions with a non-faulty check and to improve the
> > logic for entering the loop afterwards. This only affects particular cases,
> > however, and is generally made up for by more than 10x improvements on
> > the transition from the page cross -> loop case.
> >
> > The non-page cross cases are improved most for smaller sizes [0, 128]
> > and go about even for (128, 4096]. The loop page cross logic is
> > improved so some more significant speedup is seen there as well.
> >
> > test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
> >
> > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > ---
> >  sysdeps/x86_64/multiarch/strcmp-avx2.S | 1590 ++++++++++++++----------
> >  1 file changed, 939 insertions(+), 651 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > index 9c73b5899d..28d6a0025a 100644
> > --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > @@ -26,35 +26,57 @@
> >
> >  # define PAGE_SIZE     4096
> >
> > -/* VEC_SIZE = Number of bytes in a ymm register */
> > +       /* VEC_SIZE = Number of bytes in a ymm register.  */
> >  # define VEC_SIZE      32
> >
> > -/* Shift for dividing by (VEC_SIZE * 4).  */
> > -# define DIVIDE_BY_VEC_4_SHIFT 7
> > -# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
> > -#  error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
> > -# endif
> > +# define VMOVU vmovdqu
> > +# define VMOVA vmovdqa
> >
> >  # ifdef USE_AS_WCSCMP
> > -/* Compare packed dwords.  */
> > +       /* Compare packed dwords.  */
> >  #  define VPCMPEQ      vpcmpeqd
> > -/* Compare packed dwords and store minimum.  */
> > +       /* Compare packed dwords and store minimum.  */
> >  #  define VPMINU       vpminud
> > -/* 1 dword char == 4 bytes.  */
> > +       /* 1 dword char == 4 bytes.  */
> >  #  define SIZE_OF_CHAR 4
> >  # else
> > -/* Compare packed bytes.  */
> > +       /* Compare packed bytes.  */
> >  #  define VPCMPEQ      vpcmpeqb
> > -/* Compare packed bytes and store minimum.  */
> > +       /* Compare packed bytes and store minimum.  */
> >  #  define VPMINU       vpminub
> > -/* 1 byte char == 1 byte.  */
> > +       /* 1 byte char == 1 byte.  */
> >  #  define SIZE_OF_CHAR 1
> >  # endif
> >
> > +# ifdef USE_AS_STRNCMP
> > +#  define LOOP_REG     r9d
> > +#  define LOOP_REG64   r9
> > +
> > +#  define OFFSET_REG8  r9b
> > +#  define OFFSET_REG   r9d
> > +#  define OFFSET_REG64 r9
> > +# else
> > +#  define LOOP_REG     edx
> > +#  define LOOP_REG64   rdx
> > +
> > +#  define OFFSET_REG8  dl
> > +#  define OFFSET_REG   edx
> > +#  define OFFSET_REG64 rdx
> > +# endif
> > +
> >  # ifndef VZEROUPPER
> >  #  define VZEROUPPER   vzeroupper
> >  # endif
> >
> > +# if defined USE_AS_STRNCMP
> > +#  define VEC_OFFSET   0
> > +# else
> > +#  define VEC_OFFSET   (-VEC_SIZE)
> > +# endif
> > +
> > +# define xmmZERO       xmm15
> > +# define ymmZERO       ymm15
> > +
> >  # ifndef SECTION
> >  #  define SECTION(p)   p##.avx
> >  # endif
> > @@ -79,783 +101,1049 @@
> >     the maximum offset is reached before a difference is found, zero is
> >     returned.  */
> >
> > -       .section SECTION(.text),"ax",@progbits
> > -ENTRY (STRCMP)
> > +       .section SECTION(.text), "ax", @progbits
> > +ENTRY(STRCMP)
> >  # ifdef USE_AS_STRNCMP
> > -       /* Check for simple cases (0 or 1) in offset.  */
> > +#  ifdef __ILP32__
> > +       /* Clear the upper 32 bits.  */
> > +       movl    %edx, %rdx
> > +#  endif
> >         cmp     $1, %RDX_LP
> > -       je      L(char0)
> > -       jb      L(zero)
> > +       /* Signed comparison intentional. We use this branch to also
> > +          test cases where length >= 2^63. These very large sizes can be
> > +          handled with strcmp as there is no way for that length to
> > +          actually bound the buffer.  */
> > +       jle     L(one_or_less)
> >  #  ifdef USE_AS_WCSCMP
> > -#  ifndef __ILP32__
> >         movq    %rdx, %rcx
> > -       /* Check if length could overflow when multiplied by
> > -          sizeof(wchar_t). Checking top 8 bits will cover all potential
> > -          overflow cases as well as redirect cases where its impossible to
> > -          length to bound a valid memory region. In these cases just use
> > -          'wcscmp'.  */
> > +
> > +       /* Multiplying length by sizeof(wchar_t) can result in overflow.
> > +          Check if that is possible. All cases where overflow is possible
> > +          are cases where length is large enough that it can never be a
> > +          bound on valid memory so just use wcscmp.  */
> >         shrq    $56, %rcx
> >         jnz     __wcscmp_avx2
> > +
> > +       leaq    (, %rdx, 4), %rdx
> >  #  endif
> > -       /* Convert units: from wide to byte char.  */
> > -       shl     $2, %RDX_LP
> > -#  endif
> > -       /* Register %r11 tracks the maximum offset.  */
> > -       mov     %RDX_LP, %R11_LP
> >  # endif
> > +       vpxor   %xmmZERO, %xmmZERO, %xmmZERO
> >         movl    %edi, %eax
> > -       xorl    %edx, %edx
> > -       /* Make %xmm7 (%ymm7) all zeros in this function.  */
> > -       vpxor   %xmm7, %xmm7, %xmm7
> >         orl     %esi, %eax
> > -       andl    $(PAGE_SIZE - 1), %eax
> > -       cmpl    $(PAGE_SIZE - (VEC_SIZE * 4)), %eax
> > -       jg      L(cross_page)
> > -       /* Start comparing 4 vectors.  */
> > -       vmovdqu (%rdi), %ymm1
> > -       VPCMPEQ (%rsi), %ymm1, %ymm0
> > -       VPMINU  %ymm1, %ymm0, %ymm0
> > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > -       vpmovmskb %ymm0, %ecx
> > -       testl   %ecx, %ecx
> > -       je      L(next_3_vectors)
> > -       tzcntl  %ecx, %edx
> > +       sall    $20, %eax
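> > +       /* The shift keeps only the low 12 bits (the page offsets) of
> > +          edi | esi, so the unsigned compare below is equivalent to:
> > +            if ((((uintptr_t) s1 | (uintptr_t) s2) & (PAGE_SIZE - 1))
> > +                > PAGE_SIZE - VEC_SIZE * 4)
> > +              goto page_cross;  */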
> > +       /* Check if s1 or s2 may cross a page  in next 4x VEC loads.  */
> > +       cmpl    $((PAGE_SIZE -(VEC_SIZE * 4)) << 20), %eax
> > +       ja      L(page_cross)
> > +
> > +L(no_page_cross):
> > +       /* Safe to compare 4x vectors.  */
> > +       VMOVU   (%rdi), %ymm0
> > +       /* 1s where s1 and s2 equal.  */
> > +       VPCMPEQ (%rsi), %ymm0, %ymm1
> > +       /* 1s at null CHAR.  */
> > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > +       /* 1s where s1 and s2 equal AND not null CHAR.  */
> > +       vpandn  %ymm1, %ymm2, %ymm1
> > +
> > +       /* All 1s -> keep going, any 0s -> return.  */
> > +       vpmovmskb %ymm1, %ecx
> >  # ifdef USE_AS_STRNCMP
> > -       /* Return 0 if the mismatched index (%rdx) is after the maximum
> > -          offset (%r11).   */
> > -       cmpq    %r11, %rdx
> > -       jae     L(zero)
> > +       cmpq    $VEC_SIZE, %rdx
> > +       jbe     L(vec_0_test_len)
> >  # endif
> > +
> > +       /* All 1s means every CHAR compared equal and non-null. incl
> > +          overflows to zero in that case. Otherwise the carry stops at the
> > +          first 0 bit, marking the position of the first mismatch.  */
> > +       incl    %ecx
> > +       jz      L(more_3x_vec)
> > +
> > +       .p2align 4,, 4
> > +L(return_vec_0):
> > +       tzcntl  %ecx, %ecx
> >  # ifdef USE_AS_WCSCMP
> > +       movl    (%rdi, %rcx), %edx
> >         xorl    %eax, %eax
> > -       movl    (%rdi, %rdx), %ecx
> > -       cmpl    (%rsi, %rdx), %ecx
> > -       je      L(return)
> > -L(wcscmp_return):
> > +       cmpl    (%rsi, %rcx), %edx
> > +       je      L(ret0)
> >         setl    %al
> >         negl    %eax
> >         orl     $1, %eax
> > -L(return):
> >  # else
> > -       movzbl  (%rdi, %rdx), %eax
> > -       movzbl  (%rsi, %rdx), %edx
> > -       subl    %edx, %eax
> > +       movzbl  (%rdi, %rcx), %eax
> > +       movzbl  (%rsi, %rcx), %ecx
> > +       subl    %ecx, %eax
> >  # endif
> > +L(ret0):
> >  L(return_vzeroupper):
> >         ZERO_UPPER_VEC_REGISTERS_RETURN
> >
> > -       .p2align 4
> > -L(return_vec_size):
> > -       tzcntl  %ecx, %edx
> >  # ifdef USE_AS_STRNCMP
> > -       /* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after
> > -          the maximum offset (%r11).  */
> > -       addq    $VEC_SIZE, %rdx
> > -       cmpq    %r11, %rdx
> > -       jae     L(zero)
> > -#  ifdef USE_AS_WCSCMP
> > +       .p2align 4,, 8
> > +L(vec_0_test_len):
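> > +       /* notl converts the "equal and non-null" mask into a mismatch
> > +          mask; bzhi then clears every bit at a position >= the length in
> > +          rdx so only in-bounds mismatches/nulls are left.  */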
> > +       notl    %ecx
> > +       bzhil   %edx, %ecx, %eax
> > +       jnz     L(return_vec_0)
> > +       /* Align if will cross fetch block.  */
> > +       .p2align 4,, 2
> > +L(ret_zero):
> >         xorl    %eax, %eax
> > -       movl    (%rdi, %rdx), %ecx
> > -       cmpl    (%rsi, %rdx), %ecx
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (%rdi, %rdx), %eax
> > -       movzbl  (%rsi, %rdx), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > -# else
> > +       VZEROUPPER_RETURN
> > +
> > +       .p2align 4,, 5
> > +L(one_or_less):
> > +       jb      L(ret_zero)
> >  #  ifdef USE_AS_WCSCMP
> > +       /* 'nbe' covers the case where length is negative (large
> > +          unsigned).  */
> > +       jnbe    __wcscmp_avx2
> > +       movl    (%rdi), %edx
> >         xorl    %eax, %eax
> > -       movl    VEC_SIZE(%rdi, %rdx), %ecx
> > -       cmpl    VEC_SIZE(%rsi, %rdx), %ecx
> > -       jne     L(wcscmp_return)
> > +       cmpl    (%rsi), %edx
> > +       je      L(ret1)
> > +       setl    %al
> > +       negl    %eax
> > +       orl     $1, %eax
> >  #  else
> > -       movzbl  VEC_SIZE(%rdi, %rdx), %eax
> > -       movzbl  VEC_SIZE(%rsi, %rdx), %edx
> > -       subl    %edx, %eax
> > +       /* 'nbe' covers the case where length is negative (large
> > +          unsigned).  */
> > +
> > +       jnbe    __strcmp_avx2
> > +       movzbl  (%rdi), %eax
> > +       movzbl  (%rsi), %ecx
> > +       subl    %ecx, %eax
> >  #  endif
> > +L(ret1):
> > +       ret
> >  # endif
> > -       VZEROUPPER_RETURN
> >
> > -       .p2align 4
> > -L(return_2_vec_size):
> > -       tzcntl  %ecx, %edx
> > +       .p2align 4,, 10
> > +L(return_vec_1):
> > +       tzcntl  %ecx, %ecx
> >  # ifdef USE_AS_STRNCMP
> > -       /* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is
> > -          after the maximum offset (%r11).  */
> > -       addq    $(VEC_SIZE * 2), %rdx
> > -       cmpq    %r11, %rdx
> > -       jae     L(zero)
> > -#  ifdef USE_AS_WCSCMP
> > +       /* rdx must be > CHAR_PER_VEC so it is safe to subtract without
> > +          fear of overflow.  */
> > +       addq    $-VEC_SIZE, %rdx
> > +       cmpq    %rcx, %rdx
> > +       jbe     L(ret_zero)
> > +# endif
> > +# ifdef USE_AS_WCSCMP
> > +       movl    VEC_SIZE(%rdi, %rcx), %edx
> >         xorl    %eax, %eax
> > -       movl    (%rdi, %rdx), %ecx
> > -       cmpl    (%rsi, %rdx), %ecx
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (%rdi, %rdx), %eax
> > -       movzbl  (%rsi, %rdx), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > +       cmpl    VEC_SIZE(%rsi, %rcx), %edx
> > +       je      L(ret2)
> > +       setl    %al
> > +       negl    %eax
> > +       orl     $1, %eax
> >  # else
> > -#  ifdef USE_AS_WCSCMP
> > -       xorl    %eax, %eax
> > -       movl    (VEC_SIZE * 2)(%rdi, %rdx), %ecx
> > -       cmpl    (VEC_SIZE * 2)(%rsi, %rdx), %ecx
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (VEC_SIZE * 2)(%rdi, %rdx), %eax
> > -       movzbl  (VEC_SIZE * 2)(%rsi, %rdx), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > +       movzbl  VEC_SIZE(%rdi, %rcx), %eax
> > +       movzbl  VEC_SIZE(%rsi, %rcx), %ecx
> > +       subl    %ecx, %eax
> >  # endif
> > +L(ret2):
> >         VZEROUPPER_RETURN
> >
> > -       .p2align 4
> > -L(return_3_vec_size):
> > -       tzcntl  %ecx, %edx
> > +       .p2align 4,, 10
> >  # ifdef USE_AS_STRNCMP
> > -       /* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is
> > -          after the maximum offset (%r11).  */
> > -       addq    $(VEC_SIZE * 3), %rdx
> > -       cmpq    %r11, %rdx
> > -       jae     L(zero)
> > -#  ifdef USE_AS_WCSCMP
> > +L(return_vec_3):
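> > +       /* Shift the VEC 3 mask up by 32 (VEC_SIZE bits) so the shared
> > +          tzcntq below yields an offset relative to VEC 2, letting VEC 2
> > +          and VEC 3 share one return path.  */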
> > +       salq    $32, %rcx
> > +# endif
> > +
> > +L(return_vec_2):
> > +# ifndef USE_AS_STRNCMP
> > +       tzcntl  %ecx, %ecx
> > +# else
> > +       tzcntq  %rcx, %rcx
> > +       cmpq    %rcx, %rdx
> > +       jbe     L(ret_zero)
> > +# endif
> > +
> > +# ifdef USE_AS_WCSCMP
> > +       movl    (VEC_SIZE * 2)(%rdi, %rcx), %edx
> >         xorl    %eax, %eax
> > -       movl    (%rdi, %rdx), %ecx
> > -       cmpl    (%rsi, %rdx), %ecx
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (%rdi, %rdx), %eax
> > -       movzbl  (%rsi, %rdx), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > +       cmpl    (VEC_SIZE * 2)(%rsi, %rcx), %edx
> > +       je      L(ret3)
> > +       setl    %al
> > +       negl    %eax
> > +       orl     $1, %eax
> >  # else
> > +       movzbl  (VEC_SIZE * 2)(%rdi, %rcx), %eax
> > +       movzbl  (VEC_SIZE * 2)(%rsi, %rcx), %ecx
> > +       subl    %ecx, %eax
> > +# endif
> > +L(ret3):
> > +       VZEROUPPER_RETURN
> > +
> > +# ifndef USE_AS_STRNCMP
> > +       .p2align 4,, 10
> > +L(return_vec_3):
> > +       tzcntl  %ecx, %ecx
> >  #  ifdef USE_AS_WCSCMP
> > +       movl    (VEC_SIZE * 3)(%rdi, %rcx), %edx
> >         xorl    %eax, %eax
> > -       movl    (VEC_SIZE * 3)(%rdi, %rdx), %ecx
> > -       cmpl    (VEC_SIZE * 3)(%rsi, %rdx), %ecx
> > -       jne     L(wcscmp_return)
> > +       cmpl    (VEC_SIZE * 3)(%rsi, %rcx), %edx
> > +       je      L(ret4)
> > +       setl    %al
> > +       negl    %eax
> > +       orl     $1, %eax
> >  #  else
> > -       movzbl  (VEC_SIZE * 3)(%rdi, %rdx), %eax
> > -       movzbl  (VEC_SIZE * 3)(%rsi, %rdx), %edx
> > -       subl    %edx, %eax
> > +       movzbl  (VEC_SIZE * 3)(%rdi, %rcx), %eax
> > +       movzbl  (VEC_SIZE * 3)(%rsi, %rcx), %ecx
> > +       subl    %ecx, %eax
> >  #  endif
> > -# endif
> > +L(ret4):
> >         VZEROUPPER_RETURN
> > +# endif
> > +
> > +       .p2align 4,, 10
> > +L(more_3x_vec):
> > +       /* Safe to compare 4x vectors.  */
> > +       VMOVU   VEC_SIZE(%rdi), %ymm0
> > +       VPCMPEQ VEC_SIZE(%rsi), %ymm0, %ymm1
> > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > +       vpandn  %ymm1, %ymm2, %ymm1
> > +       vpmovmskb %ymm1, %ecx
> > +       incl    %ecx
> > +       jnz     L(return_vec_1)
> > +
> > +# ifdef USE_AS_STRNCMP
> > +       subq    $(VEC_SIZE * 2), %rdx
> > +       jbe     L(ret_zero)
> > +# endif
> > +
> > +       VMOVU   (VEC_SIZE * 2)(%rdi), %ymm0
> > +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm0, %ymm1
> > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > +       vpandn  %ymm1, %ymm2, %ymm1
> > +       vpmovmskb %ymm1, %ecx
> > +       incl    %ecx
> > +       jnz     L(return_vec_2)
> > +
> > +       VMOVU   (VEC_SIZE * 3)(%rdi), %ymm0
> > +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm0, %ymm1
> > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > +       vpandn  %ymm1, %ymm2, %ymm1
> > +       vpmovmskb %ymm1, %ecx
> > +       incl    %ecx
> > +       jnz     L(return_vec_3)
> >
> > -       .p2align 4
> > -L(next_3_vectors):
> > -       vmovdqu VEC_SIZE(%rdi), %ymm6
> > -       VPCMPEQ VEC_SIZE(%rsi), %ymm6, %ymm3
> > -       VPMINU  %ymm6, %ymm3, %ymm3
> > -       VPCMPEQ %ymm7, %ymm3, %ymm3
> > -       vpmovmskb %ymm3, %ecx
> > -       testl   %ecx, %ecx
> > -       jne     L(return_vec_size)
> > -       vmovdqu (VEC_SIZE * 2)(%rdi), %ymm5
> > -       vmovdqu (VEC_SIZE * 3)(%rdi), %ymm4
> > -       vmovdqu (VEC_SIZE * 3)(%rsi), %ymm0
> > -       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm5, %ymm2
> > -       VPMINU  %ymm5, %ymm2, %ymm2
> > -       VPCMPEQ %ymm4, %ymm0, %ymm0
> > -       VPCMPEQ %ymm7, %ymm2, %ymm2
> > -       vpmovmskb %ymm2, %ecx
> > -       testl   %ecx, %ecx
> > -       jne     L(return_2_vec_size)
> > -       VPMINU  %ymm4, %ymm0, %ymm0
> > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > -       vpmovmskb %ymm0, %ecx
> > -       testl   %ecx, %ecx
> > -       jne     L(return_3_vec_size)
> > -L(main_loop_header):
> > -       leaq    (VEC_SIZE * 4)(%rdi), %rdx
> > -       movl    $PAGE_SIZE, %ecx
> > -       /* Align load via RAX.  */
> > -       andq    $-(VEC_SIZE * 4), %rdx
> > -       subq    %rdi, %rdx
> > -       leaq    (%rdi, %rdx), %rax
> >  # ifdef USE_AS_STRNCMP
> > -       /* Starting from this point, the maximum offset, or simply the
> > -          'offset', DECREASES by the same amount when base pointers are
> > -          moved forward.  Return 0 when:
> > -            1) On match: offset <= the matched vector index.
> > -            2) On mistmach, offset is before the mistmatched index.
> > +       cmpq    $(VEC_SIZE * 2), %rdx
> > +       jbe     L(ret_zero)
> > +# endif
> > +
> > +# ifdef USE_AS_WCSCMP
> > +       /* Any non-zero positive value that doesn't interfere with 0x1.
> >          */
> > -       subq    %rdx, %r11
> > -       jbe     L(zero)
> > -# endif
> > -       addq    %rsi, %rdx
> > -       movq    %rdx, %rsi
> > -       andl    $(PAGE_SIZE - 1), %esi
> > -       /* Number of bytes before page crossing.  */
> > -       subq    %rsi, %rcx
> > -       /* Number of VEC_SIZE * 4 blocks before page crossing.  */
> > -       shrq    $DIVIDE_BY_VEC_4_SHIFT, %rcx
> > -       /* ESI: Number of VEC_SIZE * 4 blocks before page crossing.   */
> > -       movl    %ecx, %esi
> > -       jmp     L(loop_start)
> > +       movl    $2, %r8d
> >
> > +# else
> > +       xorl    %r8d, %r8d
> > +# endif
> > +
> > +       /* The prepare labels are various entry points from the page
> > +          cross logic.  */
> > +L(prepare_loop):
> > +
> > +# ifdef USE_AS_STRNCMP
> > +       /* Store N + (VEC_SIZE * 4) and place check at the beginning of
> > +          the loop.  */
> > +       leaq    (VEC_SIZE * 2)(%rdi, %rdx), %rdx
> > +# endif
> > +L(prepare_loop_no_len):
> > +
> > +       /* Align s1 and adjust s2 accordingly.  */
> > +       subq    %rdi, %rsi
> > +       andq    $-(VEC_SIZE * 4), %rdi
> > +       addq    %rdi, %rsi
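> > +       /* rsi - rdi is unchanged, so s2 keeps its original misalignment
> > +          relative to the newly aligned s1.  */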
> > +
> > +# ifdef USE_AS_STRNCMP
> > +       subq    %rdi, %rdx
> > +# endif
> > +
> > +L(prepare_loop_aligned):
> > +       /* eax stores distance from rsi to next page cross. These cases
> > +          need to be handled specially as the 4x loop could potentially
> > +          read memory past the length of s1 or s2 and across a page
> > +          boundary.  */
> > +       movl    $-(VEC_SIZE * 4), %eax
> > +       subl    %esi, %eax
> > +       andl    $(PAGE_SIZE - 1), %eax
> > +
> > +       /* Loop 4x comparisons at a time.  */
> >         .p2align 4
> >  L(loop):
> > +
> > +       /* End condition for strncmp.  */
> >  # ifdef USE_AS_STRNCMP
> > -       /* Base pointers are moved forward by 4 * VEC_SIZE.  Decrease
> > -          the maximum offset (%r11) by the same amount.  */
> > -       subq    $(VEC_SIZE * 4), %r11
> > -       jbe     L(zero)
> > -# endif
> > -       addq    $(VEC_SIZE * 4), %rax
> > -       addq    $(VEC_SIZE * 4), %rdx
> > -L(loop_start):
> > -       testl   %esi, %esi
> > -       leal    -1(%esi), %esi
> > -       je      L(loop_cross_page)
> > -L(back_to_loop):
> > -       /* Main loop, comparing 4 vectors are a time.  */
> > -       vmovdqa (%rax), %ymm0
> > -       vmovdqa VEC_SIZE(%rax), %ymm3
> > -       VPCMPEQ (%rdx), %ymm0, %ymm4
> > -       VPCMPEQ VEC_SIZE(%rdx), %ymm3, %ymm1
> > -       VPMINU  %ymm0, %ymm4, %ymm4
> > -       VPMINU  %ymm3, %ymm1, %ymm1
> > -       vmovdqa (VEC_SIZE * 2)(%rax), %ymm2
> > -       VPMINU  %ymm1, %ymm4, %ymm0
> > -       vmovdqa (VEC_SIZE * 3)(%rax), %ymm3
> > -       VPCMPEQ (VEC_SIZE * 2)(%rdx), %ymm2, %ymm5
> > -       VPCMPEQ (VEC_SIZE * 3)(%rdx), %ymm3, %ymm6
> > -       VPMINU  %ymm2, %ymm5, %ymm5
> > -       VPMINU  %ymm3, %ymm6, %ymm6
> > -       VPMINU  %ymm5, %ymm0, %ymm0
> > -       VPMINU  %ymm6, %ymm0, %ymm0
> > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > -
> > -       /* Test each mask (32 bits) individually because for VEC_SIZE
> > -          == 32 is not possible to OR the four masks and keep all bits
> > -          in a 64-bit integer register, differing from SSE2 strcmp
> > -          where ORing is possible.  */
> > -       vpmovmskb %ymm0, %ecx
> > +       subq    $(VEC_SIZE * 4), %rdx
> > +       jbe     L(ret_zero)
> > +# endif
> > +
> > +       subq    $-(VEC_SIZE * 4), %rdi
> > +       subq    $-(VEC_SIZE * 4), %rsi
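> > +       /* subq of -(VEC_SIZE * 4) rather than addq of (VEC_SIZE * 4):
> > +          -128 fits in a sign-extended imm8 while +128 would need a 4
> > +          byte immediate.  */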
> > +
> > +       /* Check if rsi loads will cross a page boundary.  */
> > +       addl    $-(VEC_SIZE * 4), %eax
> > +       jnb     L(page_cross_during_loop)
> > +
> > +       /* Loop entry after handling page cross during loop.  */
> > +L(loop_skip_page_cross_check):
> > +       VMOVA   (VEC_SIZE * 0)(%rdi), %ymm0
> > +       VMOVA   (VEC_SIZE * 1)(%rdi), %ymm2
> > +       VMOVA   (VEC_SIZE * 2)(%rdi), %ymm4
> > +       VMOVA   (VEC_SIZE * 3)(%rdi), %ymm6
> > +
> > +       /* ymm1 all 1s where s1 and s2 equal. All 0s otherwise.  */
> > +       VPCMPEQ (VEC_SIZE * 0)(%rsi), %ymm0, %ymm1
> > +
> > +       VPCMPEQ (VEC_SIZE * 1)(%rsi), %ymm2, %ymm3
> > +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
> > +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
> > +
> > +
> > +       /* Each result CHAR is 0 where there is a mismatch or a null CHAR
> > +          in s1, otherwise it is non-zero.  */
> > +       vpand   %ymm0, %ymm1, %ymm1
> > +
> > +
> > +       vpand   %ymm2, %ymm3, %ymm3
> > +       vpand   %ymm4, %ymm5, %ymm5
> > +       vpand   %ymm6, %ymm7, %ymm7
> > +
> > +       VPMINU  %ymm1, %ymm3, %ymm3
> > +       VPMINU  %ymm5, %ymm7, %ymm7
> > +
> > +       /* Reduce all 0 CHARs for the 4x VEC into ymm7.  */
> > +       VPMINU  %ymm3, %ymm7, %ymm7
> > +
> > +       /* If any 0 CHAR then done.  */
> > +       VPCMPEQ %ymm7, %ymmZERO, %ymm7
> > +       vpmovmskb %ymm7, %LOOP_REG
> > +       testl   %LOOP_REG, %LOOP_REG
> > +       jz      L(loop)
> > +
> > +       /* Find which VEC has the mismatch or end of string.  */
> > +       VPCMPEQ %ymm1, %ymmZERO, %ymm1
> > +       vpmovmskb %ymm1, %ecx
> >         testl   %ecx, %ecx
> > -       je      L(loop)
> > -       VPCMPEQ %ymm7, %ymm4, %ymm0
> > -       vpmovmskb %ymm0, %edi
> > -       testl   %edi, %edi
> > -       je      L(test_vec)
> > -       tzcntl  %edi, %ecx
> > +       jnz     L(return_vec_0_end)
> > +
> > +
> > +       VPCMPEQ %ymm3, %ymmZERO, %ymm3
> > +       vpmovmskb %ymm3, %ecx
> > +       testl   %ecx, %ecx
> > +       jnz     L(return_vec_1_end)
> > +
> > +L(return_vec_2_3_end):
> >  # ifdef USE_AS_STRNCMP
> > -       cmpq    %rcx, %r11
> > -       jbe     L(zero)
> > -#  ifdef USE_AS_WCSCMP
> > -       movq    %rax, %rsi
> > +       subq    $(VEC_SIZE * 2), %rdx
> > +       jbe     L(ret_zero_end)
> > +# endif
> > +
> > +       VPCMPEQ %ymm5, %ymmZERO, %ymm5
> > +       vpmovmskb %ymm5, %ecx
> > +       testl   %ecx, %ecx
> > +       jnz     L(return_vec_2_end)
> > +
> > +       /* LOOP_REG contains matches for null/mismatch from the loop. If
> > +          VEC 0, 1, and 2 all have no null and no mismatches then the mismatch
> > +          must entirely be from VEC 3 which is fully represented by
> > +          LOOP_REG.  */
> > +       tzcntl  %LOOP_REG, %LOOP_REG
> > +
> > +# ifdef USE_AS_STRNCMP
> > +       subl    $-(VEC_SIZE), %LOOP_REG
> > +       cmpq    %LOOP_REG64, %rdx
> > +       jbe     L(ret_zero_end)
> > +# endif
> > +
> > +# ifdef USE_AS_WCSCMP
> > +       movl    (VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %ecx
> >         xorl    %eax, %eax
> > -       movl    (%rsi, %rcx), %edi
> > -       cmpl    (%rdx, %rcx), %edi
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (%rax, %rcx), %eax
> > -       movzbl  (%rdx, %rcx), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > +       cmpl    (VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
> > +       je      L(ret5)
> > +       setl    %al
> > +       negl    %eax
> > +       xorl    %r8d, %eax
> >  # else
> > -#  ifdef USE_AS_WCSCMP
> > -       movq    %rax, %rsi
> > -       xorl    %eax, %eax
> > -       movl    (%rsi, %rcx), %edi
> > -       cmpl    (%rdx, %rcx), %edi
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (%rax, %rcx), %eax
> > -       movzbl  (%rdx, %rcx), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > +       movzbl  (VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %eax
> > +       movzbl  (VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
> > +       subl    %ecx, %eax
> > +       xorl    %r8d, %eax
> > +       subl    %r8d, %eax
> >  # endif
> > +L(ret5):
> >         VZEROUPPER_RETURN
> >
> > -       .p2align 4
> > -L(test_vec):
> >  # ifdef USE_AS_STRNCMP
> > -       /* The first vector matched.  Return 0 if the maximum offset
> > -          (%r11) <= VEC_SIZE.  */
> > -       cmpq    $VEC_SIZE, %r11
> > -       jbe     L(zero)
> > +       .p2align 4,, 2
> > +L(ret_zero_end):
> > +       xorl    %eax, %eax
> > +       VZEROUPPER_RETURN
> >  # endif
> > -       VPCMPEQ %ymm7, %ymm1, %ymm1
> > -       vpmovmskb %ymm1, %ecx
> > -       testl   %ecx, %ecx
> > -       je      L(test_2_vec)
> > -       tzcntl  %ecx, %edi
> > +
> > +
> > +       /* The L(return_vec_N_end) differ from L(return_vec_N) in that
> > +          they use the value of `r8` to negate the return value. This is
> > +          because the page cross logic can swap `rdi` and `rsi`.  */
> > +       .p2align 4,, 10
> >  # ifdef USE_AS_STRNCMP
> > -       addq    $VEC_SIZE, %rdi
> > -       cmpq    %rdi, %r11
> > -       jbe     L(zero)
> > -#  ifdef USE_AS_WCSCMP
> > -       movq    %rax, %rsi
> > +L(return_vec_1_end):
> > +       salq    $32, %rcx
> > +# endif
> > +L(return_vec_0_end):
> > +# ifndef USE_AS_STRNCMP
> > +       tzcntl  %ecx, %ecx
> > +# else
> > +       tzcntq  %rcx, %rcx
> > +       cmpq    %rcx, %rdx
> > +       jbe     L(ret_zero_end)
> > +# endif
> > +
> > +# ifdef USE_AS_WCSCMP
> > +       movl    (%rdi, %rcx), %edx
> >         xorl    %eax, %eax
> > -       movl    (%rsi, %rdi), %ecx
> > -       cmpl    (%rdx, %rdi), %ecx
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (%rax, %rdi), %eax
> > -       movzbl  (%rdx, %rdi), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > +       cmpl    (%rsi, %rcx), %edx
> > +       je      L(ret6)
> > +       setl    %al
> > +       negl    %eax
> > +       xorl    %r8d, %eax
> >  # else
> > +       movzbl  (%rdi, %rcx), %eax
> > +       movzbl  (%rsi, %rcx), %ecx
> > +       subl    %ecx, %eax
> > +       xorl    %r8d, %eax
> > +       subl    %r8d, %eax
> > +# endif
> > +L(ret6):
> > +       VZEROUPPER_RETURN
> > +
> > +# ifndef USE_AS_STRNCMP
> > +       .p2align 4,, 10
> > +L(return_vec_1_end):
> > +       tzcntl  %ecx, %ecx
> >  #  ifdef USE_AS_WCSCMP
> > -       movq    %rax, %rsi
> > +       movl    VEC_SIZE(%rdi, %rcx), %edx
> >         xorl    %eax, %eax
> > -       movl    VEC_SIZE(%rsi, %rdi), %ecx
> > -       cmpl    VEC_SIZE(%rdx, %rdi), %ecx
> > -       jne     L(wcscmp_return)
> > +       cmpl    VEC_SIZE(%rsi, %rcx), %edx
> > +       je      L(ret7)
> > +       setl    %al
> > +       negl    %eax
> > +       xorl    %r8d, %eax
> >  #  else
> > -       movzbl  VEC_SIZE(%rax, %rdi), %eax
> > -       movzbl  VEC_SIZE(%rdx, %rdi), %edx
> > -       subl    %edx, %eax
> > +       movzbl  VEC_SIZE(%rdi, %rcx), %eax
> > +       movzbl  VEC_SIZE(%rsi, %rcx), %ecx
> > +       subl    %ecx, %eax
> > +       xorl    %r8d, %eax
> > +       subl    %r8d, %eax
> >  #  endif
> > -# endif
> > +L(ret7):
> >         VZEROUPPER_RETURN
> > +# endif
> >
> > -       .p2align 4
> > -L(test_2_vec):
> > +       .p2align 4,, 10
> > +L(return_vec_2_end):
> > +       tzcntl  %ecx, %ecx
> >  # ifdef USE_AS_STRNCMP
> > -       /* The first 2 vectors matched.  Return 0 if the maximum offset
> > -          (%r11) <= 2 * VEC_SIZE.  */
> > -       cmpq    $(VEC_SIZE * 2), %r11
> > -       jbe     L(zero)
> > +       cmpq    %rcx, %rdx
> > +       jbe     L(ret_zero_page_cross)
> >  # endif
> > -       VPCMPEQ %ymm7, %ymm5, %ymm5
> > -       vpmovmskb %ymm5, %ecx
> > -       testl   %ecx, %ecx
> > -       je      L(test_3_vec)
> > -       tzcntl  %ecx, %edi
> > -# ifdef USE_AS_STRNCMP
> > -       addq    $(VEC_SIZE * 2), %rdi
> > -       cmpq    %rdi, %r11
> > -       jbe     L(zero)
> > -#  ifdef USE_AS_WCSCMP
> > -       movq    %rax, %rsi
> > +# ifdef USE_AS_WCSCMP
> > +       movl    (VEC_SIZE * 2)(%rdi, %rcx), %edx
> >         xorl    %eax, %eax
> > -       movl    (%rsi, %rdi), %ecx
> > -       cmpl    (%rdx, %rdi), %ecx
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (%rax, %rdi), %eax
> > -       movzbl  (%rdx, %rdi), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > +       cmpl    (VEC_SIZE * 2)(%rsi, %rcx), %edx
> > +       je      L(ret11)
> > +       setl    %al
> > +       negl    %eax
> > +       xorl    %r8d, %eax
> >  # else
> > -#  ifdef USE_AS_WCSCMP
> > -       movq    %rax, %rsi
> > -       xorl    %eax, %eax
> > -       movl    (VEC_SIZE * 2)(%rsi, %rdi), %ecx
> > -       cmpl    (VEC_SIZE * 2)(%rdx, %rdi), %ecx
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (VEC_SIZE * 2)(%rax, %rdi), %eax
> > -       movzbl  (VEC_SIZE * 2)(%rdx, %rdi), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > +       movzbl  (VEC_SIZE * 2)(%rdi, %rcx), %eax
> > +       movzbl  (VEC_SIZE * 2)(%rsi, %rcx), %ecx
> > +       subl    %ecx, %eax
> > +       xorl    %r8d, %eax
> > +       subl    %r8d, %eax
> >  # endif
> > +L(ret11):
> >         VZEROUPPER_RETURN
> >
> > -       .p2align 4
> > -L(test_3_vec):
> > +
> > +       /* Page cross in rsi in next 4x VEC.  */
> > +
> > +       /* TODO: Improve logic here.  */
> > +       .p2align 4,, 10
> > +L(page_cross_during_loop):
> > +       /* eax contains [distance_from_page - (VEC_SIZE * 4)].  */
> > +
> > +       /* Optimistically rsi and rdi are both aligned, in which case we
> > +          don't need any logic here.  */
> > +       cmpl    $-(VEC_SIZE * 4), %eax
> > +       /* Don't adjust eax before jumping back to the loop; this way we
> > +          will never hit the page cross case again.  */
> > +       je      L(loop_skip_page_cross_check)
> > +
> > +       /* Check if we can safely load a VEC.  */
> > +       cmpl    $-(VEC_SIZE * 3), %eax
> > +       jle     L(less_1x_vec_till_page_cross)
> > +
> > +       VMOVA   (%rdi), %ymm0
> > +       VPCMPEQ (%rsi), %ymm0, %ymm1
> > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > +       vpandn  %ymm1, %ymm2, %ymm1
> > +       vpmovmskb %ymm1, %ecx
> > +       incl    %ecx
> > +       jnz     L(return_vec_0_end)
> > +
> > +       /* if distance >= 2x VEC then eax > -(VEC_SIZE * 2).  */
> > +       cmpl    $-(VEC_SIZE * 2), %eax
> > +       jg      L(more_2x_vec_till_page_cross)
> > +
> > +       .p2align 4,, 4
> > +L(less_1x_vec_till_page_cross):
> > +       subl    $-(VEC_SIZE * 4), %eax
> > +       /* Guaranteed safe to read from rdi - VEC_SIZE here. The only
> > +          concerning case is first iteration if incoming s1 was near start
> > +          of a page and s2 near end. If s1 was near the start of the page
> > +          we already aligned up to nearest VEC_SIZE * 4 so guaranteed safe
> > +          to read back -VEC_SIZE. If rdi is truly at the start of a page
> > +          here, it means the previous page (rdi - VEC_SIZE) has already
> > +          been loaded earlier so must be valid.  */
> > +       VMOVU   -VEC_SIZE(%rdi, %rax), %ymm0
> > +       VPCMPEQ -VEC_SIZE(%rsi, %rax), %ymm0, %ymm1
> > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > +       vpandn  %ymm1, %ymm2, %ymm1
> > +       vpmovmskb %ymm1, %ecx
> > +
> > +       /* Mask of potentially valid bits. The lower bits can come from
> > +          out of range comparisons (but are safe regarding page crosses).  */
> > +       movl    $-1, %r10d
> > +       shlxl   %esi, %r10d, %r10d
> > +       notl    %ecx
> > +
> >  # ifdef USE_AS_STRNCMP
> > -       /* The first 3 vectors matched.  Return 0 if the maximum offset
> > -          (%r11) <= 3 * VEC_SIZE.  */
> > -       cmpq    $(VEC_SIZE * 3), %r11
> > -       jbe     L(zero)
> > -# endif
> > -       VPCMPEQ %ymm7, %ymm6, %ymm6
> > -       vpmovmskb %ymm6, %esi
> > -       tzcntl  %esi, %ecx
> > +       cmpq    %rax, %rdx
> > +       jbe     L(return_page_cross_end_check)
> > +# endif
> > +       movl    %eax, %OFFSET_REG
> > +       addl    $(PAGE_SIZE - VEC_SIZE * 4), %eax
> > +
> > +       andl    %r10d, %ecx
> > +       jz      L(loop_skip_page_cross_check)
> > +
> > +       .p2align 4,, 3
> > +L(return_page_cross_end):
> > +       tzcntl  %ecx, %ecx
> > +
> >  # ifdef USE_AS_STRNCMP
> > -       addq    $(VEC_SIZE * 3), %rcx
> > -       cmpq    %rcx, %r11
> > -       jbe     L(zero)
> > -#  ifdef USE_AS_WCSCMP
> > -       movq    %rax, %rsi
> > -       xorl    %eax, %eax
> > -       movl    (%rsi, %rcx), %esi
> > -       cmpl    (%rdx, %rcx), %esi
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (%rax, %rcx), %eax
> > -       movzbl  (%rdx, %rcx), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > +       leal    -VEC_SIZE(%OFFSET_REG64, %rcx), %ecx
> > +L(return_page_cross_cmp_mem):
> >  # else
> > -#  ifdef USE_AS_WCSCMP
> > -       movq    %rax, %rsi
> > +       addl    %OFFSET_REG, %ecx
> > +# endif
> > +# ifdef USE_AS_WCSCMP
> > +       movl    VEC_OFFSET(%rdi, %rcx), %edx
> >         xorl    %eax, %eax
> > -       movl    (VEC_SIZE * 3)(%rsi, %rcx), %esi
> > -       cmpl    (VEC_SIZE * 3)(%rdx, %rcx), %esi
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (VEC_SIZE * 3)(%rax, %rcx), %eax
> > -       movzbl  (VEC_SIZE * 3)(%rdx, %rcx), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > +       cmpl    VEC_OFFSET(%rsi, %rcx), %edx
> > +       je      L(ret8)
> > +       setl    %al
> > +       negl    %eax
> > +       xorl    %r8d, %eax
> > +# else
> > +       movzbl  VEC_OFFSET(%rdi, %rcx), %eax
> > +       movzbl  VEC_OFFSET(%rsi, %rcx), %ecx
> > +       subl    %ecx, %eax
> > +       xorl    %r8d, %eax
> > +       subl    %r8d, %eax
> >  # endif
> > +L(ret8):
> >         VZEROUPPER_RETURN
> >
> > -       .p2align 4
> > -L(loop_cross_page):
> > -       xorl    %r10d, %r10d
> > -       movq    %rdx, %rcx
> > -       /* Align load via RDX.  We load the extra ECX bytes which should
> > -          be ignored.  */
> > -       andl    $((VEC_SIZE * 4) - 1), %ecx
> > -       /* R10 is -RCX.  */
> > -       subq    %rcx, %r10
> > -
> > -       /* This works only if VEC_SIZE * 2 == 64. */
> > -# if (VEC_SIZE * 2) != 64
> > -#  error (VEC_SIZE * 2) != 64
> > -# endif
> > -
> > -       /* Check if the first VEC_SIZE * 2 bytes should be ignored.  */
> > -       cmpl    $(VEC_SIZE * 2), %ecx
> > -       jge     L(loop_cross_page_2_vec)
> > -
> > -       vmovdqu (%rax, %r10), %ymm2
> > -       vmovdqu VEC_SIZE(%rax, %r10), %ymm3
> > -       VPCMPEQ (%rdx, %r10), %ymm2, %ymm0
> > -       VPCMPEQ VEC_SIZE(%rdx, %r10), %ymm3, %ymm1
> > -       VPMINU  %ymm2, %ymm0, %ymm0
> > -       VPMINU  %ymm3, %ymm1, %ymm1
> > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > -       VPCMPEQ %ymm7, %ymm1, %ymm1
> > -
> > -       vpmovmskb %ymm0, %edi
> > -       vpmovmskb %ymm1, %esi
> > -
> > -       salq    $32, %rsi
> > -       xorq    %rsi, %rdi
> > -
> > -       /* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
> > -       shrq    %cl, %rdi
> > -
> > -       testq   %rdi, %rdi
> > -       je      L(loop_cross_page_2_vec)
> > -       tzcntq  %rdi, %rcx
> >  # ifdef USE_AS_STRNCMP
> > -       cmpq    %rcx, %r11
> > -       jbe     L(zero)
> > -#  ifdef USE_AS_WCSCMP
> > -       movq    %rax, %rsi
> > +       .p2align 4,, 10
> > +L(return_page_cross_end_check):
> > +       tzcntl  %ecx, %ecx
> > +       leal    -VEC_SIZE(%rax, %rcx), %ecx
> > +       cmpl    %ecx, %edx
> > +       ja      L(return_page_cross_cmp_mem)
> >         xorl    %eax, %eax
> > -       movl    (%rsi, %rcx), %edi
> > -       cmpl    (%rdx, %rcx), %edi
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (%rax, %rcx), %eax
> > -       movzbl  (%rdx, %rcx), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > -# else
> > -#  ifdef USE_AS_WCSCMP
> > -       movq    %rax, %rsi
> > -       xorl    %eax, %eax
> > -       movl    (%rsi, %rcx), %edi
> > -       cmpl    (%rdx, %rcx), %edi
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (%rax, %rcx), %eax
> > -       movzbl  (%rdx, %rcx), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > -# endif
> >         VZEROUPPER_RETURN
> > +# endif
> >
> > -       .p2align 4
> > -L(loop_cross_page_2_vec):
> > -       /* The first VEC_SIZE * 2 bytes match or are ignored.  */
> > -       vmovdqu (VEC_SIZE * 2)(%rax, %r10), %ymm2
> > -       vmovdqu (VEC_SIZE * 3)(%rax, %r10), %ymm3
> > -       VPCMPEQ (VEC_SIZE * 2)(%rdx, %r10), %ymm2, %ymm5
> > -       VPMINU  %ymm2, %ymm5, %ymm5
> > -       VPCMPEQ (VEC_SIZE * 3)(%rdx, %r10), %ymm3, %ymm6
> > -       VPCMPEQ %ymm7, %ymm5, %ymm5
> > -       VPMINU  %ymm3, %ymm6, %ymm6
> > -       VPCMPEQ %ymm7, %ymm6, %ymm6
> > -
> > -       vpmovmskb %ymm5, %edi
> > -       vpmovmskb %ymm6, %esi
> > -
> > -       salq    $32, %rsi
> > -       xorq    %rsi, %rdi
> >
> > -       xorl    %r8d, %r8d
> > -       /* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
> > -       subl    $(VEC_SIZE * 2), %ecx
> > -       jle     1f
> > -       /* Skip ECX bytes.  */
> > -       shrq    %cl, %rdi
> > -       /* R8 has number of bytes skipped.  */
> > -       movl    %ecx, %r8d
> > -1:
> > -       /* Before jumping back to the loop, set ESI to the number of
> > -          VEC_SIZE * 4 blocks before page crossing.  */
> > -       movl    $(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
> > -
> > -       testq   %rdi, %rdi
> > +       .p2align 4,, 10
> > +L(more_2x_vec_till_page_cross):
> > +       /* If more than 2x VEC till page cross we will complete a full
> > +          loop iteration here.  */
> > +
> > +       VMOVU   VEC_SIZE(%rdi), %ymm0
> > +       VPCMPEQ VEC_SIZE(%rsi), %ymm0, %ymm1
> > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > +       vpandn  %ymm1, %ymm2, %ymm1
> > +       vpmovmskb %ymm1, %ecx
> > +       incl    %ecx
> > +       jnz     L(return_vec_1_end)
> > +
> >  # ifdef USE_AS_STRNCMP
> > -       /* At this point, if %rdi value is 0, it already tested
> > -          VEC_SIZE*4+%r10 byte starting from %rax. This label
> > -          checks whether strncmp maximum offset reached or not.  */
> > -       je      L(string_nbyte_offset_check)
> > -# else
> > -       je      L(back_to_loop)
> > +       cmpq    $(VEC_SIZE * 2), %rdx
> > +       jbe     L(ret_zero_in_loop_page_cross)
> >  # endif
> > -       tzcntq  %rdi, %rcx
> > -       addq    %r10, %rcx
> > -       /* Adjust for number of bytes skipped.  */
> > -       addq    %r8, %rcx
> > +
> > +       subl    $-(VEC_SIZE * 4), %eax
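> > +       /* eax is now the byte distance from rsi to the next page
> > +          boundary, so the two overlapping loads below end exactly at
> > +          that boundary.  */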
> > +
> > +       /* Safe to include comparisons from lower bytes.  */
> > +       VMOVU   -(VEC_SIZE * 2)(%rdi, %rax), %ymm0
> > +       VPCMPEQ -(VEC_SIZE * 2)(%rsi, %rax), %ymm0, %ymm1
> > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > +       vpandn  %ymm1, %ymm2, %ymm1
> > +       vpmovmskb %ymm1, %ecx
> > +       incl    %ecx
> > +       jnz     L(return_vec_page_cross_0)
> > +
> > +       VMOVU   -(VEC_SIZE * 1)(%rdi, %rax), %ymm0
> > +       VPCMPEQ -(VEC_SIZE * 1)(%rsi, %rax), %ymm0, %ymm1
> > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > +       vpandn  %ymm1, %ymm2, %ymm1
> > +       vpmovmskb %ymm1, %ecx
> > +       incl    %ecx
> > +       jnz     L(return_vec_page_cross_1)
> > +
> >  # ifdef USE_AS_STRNCMP
> > -       addq    $(VEC_SIZE * 2), %rcx
> > -       subq    %rcx, %r11
> > -       jbe     L(zero)
> > -#  ifdef USE_AS_WCSCMP
> > -       movq    %rax, %rsi
> > +       /* Must check length here as length might preclude reading next
> > +          page.  */
> > +       cmpq    %rax, %rdx
> > +       jbe     L(ret_zero_in_loop_page_cross)
> > +# endif
> > +
> > +       /* Finish the loop.  */
> > +       VMOVA   (VEC_SIZE * 2)(%rdi), %ymm4
> > +       VMOVA   (VEC_SIZE * 3)(%rdi), %ymm6
> > +
> > +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
> > +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
> > +       vpand   %ymm4, %ymm5, %ymm5
> > +       vpand   %ymm6, %ymm7, %ymm7
> > +       VPMINU  %ymm5, %ymm7, %ymm7
> > +       VPCMPEQ %ymm7, %ymmZERO, %ymm7
> > +       vpmovmskb %ymm7, %LOOP_REG
> > +       testl   %LOOP_REG, %LOOP_REG
> > +       jnz     L(return_vec_2_3_end)
> > +
> > +       /* Best for code size to include an unconditional jmp here. Would
> > +          be faster, if this case is hot, to duplicate the
> > +          L(return_vec_2_3_end) code as fall-through and have a jump back
> > +          to the loop on mismatch comparison.  */
> > +       subq    $-(VEC_SIZE * 4), %rdi
> > +       subq    $-(VEC_SIZE * 4), %rsi
> > +       addl    $(PAGE_SIZE - VEC_SIZE * 8), %eax
> > +# ifdef USE_AS_STRNCMP
> > +       subq    $(VEC_SIZE * 4), %rdx
> > +       ja      L(loop_skip_page_cross_check)
> > +L(ret_zero_in_loop_page_cross):
> >         xorl    %eax, %eax
> > -       movl    (%rsi, %rcx), %edi
> > -       cmpl    (%rdx, %rcx), %edi
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (%rax, %rcx), %eax
> > -       movzbl  (%rdx, %rcx), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > +       VZEROUPPER_RETURN
> >  # else
> > -#  ifdef USE_AS_WCSCMP
> > -       movq    %rax, %rsi
> > -       xorl    %eax, %eax
> > -       movl    (VEC_SIZE * 2)(%rsi, %rcx), %edi
> > -       cmpl    (VEC_SIZE * 2)(%rdx, %rcx), %edi
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (VEC_SIZE * 2)(%rax, %rcx), %eax
> > -       movzbl  (VEC_SIZE * 2)(%rdx, %rcx), %edx
> > -       subl    %edx, %eax
> > -#  endif
> > +       jmp     L(loop_skip_page_cross_check)
> >  # endif
> > -       VZEROUPPER_RETURN
> >
> > +
> > +       .p2align 4,, 10
> > +L(return_vec_page_cross_0):
> > +       addl    $-VEC_SIZE, %eax
> > +L(return_vec_page_cross_1):
> > +       tzcntl  %ecx, %ecx
> >  # ifdef USE_AS_STRNCMP
> > -L(string_nbyte_offset_check):
> > -       leaq    (VEC_SIZE * 4)(%r10), %r10
> > -       cmpq    %r10, %r11
> > -       jbe     L(zero)
> > -       jmp     L(back_to_loop)
> > +       leal    -VEC_SIZE(%rax, %rcx), %ecx
> > +       cmpq    %rcx, %rdx
> > +       jbe     L(ret_zero_in_loop_page_cross)
> > +# else
> > +       addl    %eax, %ecx
> >  # endif
> >
> > -       .p2align 4
> > -L(cross_page_loop):
> > -       /* Check one byte/dword at a time.  */
> >  # ifdef USE_AS_WCSCMP
> > -       cmpl    %ecx, %eax
> > +       movl    VEC_OFFSET(%rdi, %rcx), %edx
> > +       xorl    %eax, %eax
> > +       cmpl    VEC_OFFSET(%rsi, %rcx), %edx
> > +       je      L(ret9)
> > +       setl    %al
> > +       negl    %eax
> > +       xorl    %r8d, %eax
> >  # else
> > +       movzbl  VEC_OFFSET(%rdi, %rcx), %eax
> > +       movzbl  VEC_OFFSET(%rsi, %rcx), %ecx
> >         subl    %ecx, %eax
> > +       xorl    %r8d, %eax
> > +       subl    %r8d, %eax
> >  # endif
> > -       jne     L(different)
> > -       addl    $SIZE_OF_CHAR, %edx
> > -       cmpl    $(VEC_SIZE * 4), %edx
> > -       je      L(main_loop_header)
> > -# ifdef USE_AS_STRNCMP
> > -       cmpq    %r11, %rdx
> > -       jae     L(zero)
> > +L(ret9):
> > +       VZEROUPPER_RETURN
> > +
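The xorl/subl pair in the byte path above is a branch-free conditional
negation: %r8d is 0 when rdi/rsi are in their original order and -1 when
the page cross logic swapped them.  A rough C model of the byte case
(a, b, i and r8 are illustrative names, not the registers in the patch):

    int ret = (int) (unsigned char) a[i] - (int) (unsigned char) b[i];
    /* ret if r8 == 0, -ret if r8 == -1 (since (x ^ -1) - (-1) == -x).  */
    ret = (ret ^ r8) - r8;
    return ret;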
> > +
> > +       .p2align 4,, 10
> > +L(page_cross):
> > +# ifndef USE_AS_STRNCMP
> > +       /* If both are VEC aligned we don't need any special logic here.
> > +          Only valid for strcmp where the stop condition is guaranteed to be
> > +          reachable by just reading memory.  */
> > +       testl   $((VEC_SIZE - 1) << 20), %eax
> > +       jz      L(no_page_cross)
> >  # endif
> > +
> > +       movl    %edi, %eax
> > +       movl    %esi, %ecx
> > +       andl    $(PAGE_SIZE - 1), %eax
> > +       andl    $(PAGE_SIZE - 1), %ecx
> > +
> > +       xorl    %OFFSET_REG, %OFFSET_REG
> > +
> > +       /* Check which is closer to page cross, s1 or s2.  */
> > +       cmpl    %eax, %ecx
> > +       jg      L(page_cross_s2)
> > +
> > +       /* The previous page cross check has false positives. Check for
> > +          true positive as page cross logic is very expensive.  */
> > +       subl    $(PAGE_SIZE - VEC_SIZE * 4), %eax
> > +       jbe     L(no_page_cross)
> > +
> > +       /* Set r8 to not interfere with normal return value (rdi and rsi
> > +          did not swap).  */
> >  # ifdef USE_AS_WCSCMP
> > -       movl    (%rdi, %rdx), %eax
> > -       movl    (%rsi, %rdx), %ecx
> > +       /* Any non-zero positive value that doesn't interfere with 0x1.
> > +        */
> > +       movl    $2, %r8d
> >  # else
> > -       movzbl  (%rdi, %rdx), %eax
> > -       movzbl  (%rsi, %rdx), %ecx
> > +       xorl    %r8d, %r8d
> >  # endif
> > -       /* Check null char.  */
> > -       testl   %eax, %eax
> > -       jne     L(cross_page_loop)
> > -       /* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
> > -          comparisons.  */
> > -       subl    %ecx, %eax
> > -# ifndef USE_AS_WCSCMP
> > -L(different):
> > +
> > +       /* Check if less than 1x VEC till page cross.  */
> > +       subl    $(VEC_SIZE * 3), %eax
> > +       jg      L(less_1x_vec_till_page)
> > +
> > +       /* If more than 1x VEC till page cross, loop through safely
> > +          loadable memory until within 1x VEC of page cross.  */
> > +
> > +       .p2align 4,, 10
> > +L(page_cross_loop):
> > +
> > +       VMOVU   (%rdi, %OFFSET_REG64), %ymm0
> > +       VPCMPEQ (%rsi, %OFFSET_REG64), %ymm0, %ymm1
> > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > +       vpandn  %ymm1, %ymm2, %ymm1
> > +       vpmovmskb %ymm1, %ecx
> > +       incl    %ecx
> > +
> > +       jnz     L(check_ret_vec_page_cross)
> > +       addl    $VEC_SIZE, %OFFSET_REG
> > +# ifdef USE_AS_STRNCMP
> > +       cmpq    %OFFSET_REG64, %rdx
> > +       jbe     L(ret_zero_page_cross)
> >  # endif
> > -       VZEROUPPER_RETURN
> > +       addl    $VEC_SIZE, %eax
> > +       jl      L(page_cross_loop)
> > +
> > +       subl    %eax, %OFFSET_REG
> > +       /* OFFSET_REG has distance to page cross - VEC_SIZE. Guaranteed
> > +          to not cross page so is safe to load. Since we have already
> > +          loaded at least 1 VEC from rsi it is also guaranteed to be safe.
> > +        */
> > +
> > +       VMOVU   (%rdi, %OFFSET_REG64), %ymm0
> > +       VPCMPEQ (%rsi, %OFFSET_REG64), %ymm0, %ymm1
> > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > +       vpandn  %ymm1, %ymm2, %ymm1
> > +       vpmovmskb %ymm1, %ecx
> > +
> > +# ifdef USE_AS_STRNCMP
> > +       leal    VEC_SIZE(%OFFSET_REG64), %eax
> > +       cmpq    %rax, %rdx
> > +       jbe     L(check_ret_vec_page_cross2)
> > +       addq    %rdi, %rdx
> > +# endif
> > +       incl    %ecx
> > +       jz      L(prepare_loop_no_len)
> >
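In other words, the scan above works roughly as follows (sketch only;
bytes_to_page_end, vec_cmp_mask and report_mismatch are hypothetical
helpers standing in for the VMOVU/VPCMPEQ/vpandn/vpmovmskb sequence, and
the strncmp length checks are omitted):

    size_t off = 0;
    while (bytes_to_page_end (s_near_end) - off > VEC_SIZE)
      {
        if (vec_cmp_mask (s1 + off, s2 + off) != 0)
          return report_mismatch (off);
        off += VEC_SIZE;
      }
    /* Last safe vector: overlap so the load ends exactly at the page
       boundary.  */
    off = bytes_to_page_end (s_near_end) - VEC_SIZE;
    if (vec_cmp_mask (s1 + off, s2 + off) != 0)
      return report_mismatch (off);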
> > +       .p2align 4,, 4
> > +L(ret_vec_page_cross):
> > +# ifndef USE_AS_STRNCMP
> > +L(check_ret_vec_page_cross):
> > +# endif
> > +       tzcntl  %ecx, %ecx
> > +       addl    %OFFSET_REG, %ecx
> > +L(ret_vec_page_cross_cont):
> >  # ifdef USE_AS_WCSCMP
> > -       .p2align 4
> > -L(different):
> > -       /* Use movl to avoid modifying EFLAGS.  */
> > -       movl    $0, %eax
> > +       movl    (%rdi, %rcx), %edx
> > +       xorl    %eax, %eax
> > +       cmpl    (%rsi, %rcx), %edx
> > +       je      L(ret12)
> >         setl    %al
> >         negl    %eax
> > -       orl     $1, %eax
> > -       VZEROUPPER_RETURN
> > +       xorl    %r8d, %eax
> > +# else
> > +       movzbl  (%rdi, %rcx), %eax
> > +       movzbl  (%rsi, %rcx), %ecx
> > +       subl    %ecx, %eax
> > +       xorl    %r8d, %eax
> > +       subl    %r8d, %eax
> >  # endif
> > +L(ret12):
> > +       VZEROUPPER_RETURN
> >
> >  # ifdef USE_AS_STRNCMP
> > -       .p2align 4
> > -L(zero):
> > +       .p2align 4,, 10
> > +L(check_ret_vec_page_cross2):
> > +       incl    %ecx
> > +L(check_ret_vec_page_cross):
> > +       tzcntl  %ecx, %ecx
> > +       addl    %OFFSET_REG, %ecx
> > +       cmpq    %rcx, %rdx
> > +       ja      L(ret_vec_page_cross_cont)
> > +       .p2align 4,, 2
> > +L(ret_zero_page_cross):
> >         xorl    %eax, %eax
> >         VZEROUPPER_RETURN
> > +# endif
> >
> > -       .p2align 4
> > -L(char0):
> > -#  ifdef USE_AS_WCSCMP
> > -       xorl    %eax, %eax
> > -       movl    (%rdi), %ecx
> > -       cmpl    (%rsi), %ecx
> > -       jne     L(wcscmp_return)
> > -#  else
> > -       movzbl  (%rsi), %ecx
> > -       movzbl  (%rdi), %eax
> > -       subl    %ecx, %eax
> > -#  endif
> > -       VZEROUPPER_RETURN
> > +       .p2align 4,, 4
> > +L(page_cross_s2):
> > +       /* Ensure this is a true page cross.  */
> > +       subl    $(PAGE_SIZE - VEC_SIZE * 4), %ecx
> > +       jbe     L(no_page_cross)
> > +
> > +
> > +       movl    %ecx, %eax
> > +       movq    %rdi, %rcx
> > +       movq    %rsi, %rdi
> > +       movq    %rcx, %rsi
> > +
> > +       /* Set r8 to negate the return value as rdi and rsi are swapped.  */
> > +# ifdef USE_AS_WCSCMP
> > +       movl    $-4, %r8d
> > +# else
> > +       movl    $-1, %r8d
> >  # endif
> > +       xorl    %OFFSET_REG, %OFFSET_REG
> >
> > -       .p2align 4
> > -L(last_vector):
> > -       addq    %rdx, %rdi
> > -       addq    %rdx, %rsi
> > +       /* Check if more than 1x VEC till page cross.  */
> > +       subl    $(VEC_SIZE * 3), %eax
> > +       jle     L(page_cross_loop)
> > +
> > +       .p2align 4,, 6
> > +L(less_1x_vec_till_page):
> > +       /* Find largest load size we can use.  */
> > +       cmpl    $16, %eax
> > +       ja      L(less_16_till_page)
> > +
> > +       VMOVU   (%rdi), %xmm0
> > +       VPCMPEQ (%rsi), %xmm0, %xmm1
> > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > +       vpandn  %xmm1, %xmm2, %xmm1
> > +       vpmovmskb %ymm1, %ecx
> > +       incw    %cx
> > +       jnz     L(check_ret_vec_page_cross)
> > +       movl    $16, %OFFSET_REG
> >  # ifdef USE_AS_STRNCMP
> > -       subq    %rdx, %r11
> > +       cmpq    %OFFSET_REG64, %rdx
> > +       jbe     L(ret_zero_page_cross_slow_case0)
> > +       subl    %eax, %OFFSET_REG
> > +# else
> > +       /* Explicit check for 16 byte alignment.  */
> > +       subl    %eax, %OFFSET_REG
> > +       jz      L(prepare_loop)
> >  # endif
> > -       tzcntl  %ecx, %edx
> > +
> > +       VMOVU   (%rdi, %OFFSET_REG64), %xmm0
> > +       VPCMPEQ (%rsi, %OFFSET_REG64), %xmm0, %xmm1
> > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > +       vpandn  %xmm1, %xmm2, %xmm1
> > +       vpmovmskb %ymm1, %ecx
> > +       incw    %cx
> > +       jnz     L(check_ret_vec_page_cross)
> > +
> >  # ifdef USE_AS_STRNCMP
> > -       cmpq    %r11, %rdx
> > -       jae     L(zero)
> > +       addl    $16, %OFFSET_REG
> > +       subq    %OFFSET_REG64, %rdx
> > +       jbe     L(ret_zero_page_cross_slow_case0)
> > +       subq    $-(VEC_SIZE * 4), %rdx
> > +
> > +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > +# else
> > +       leaq    (16 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > +       leaq    (16 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> >  # endif
> > -# ifdef USE_AS_WCSCMP
> > +       jmp     L(prepare_loop_aligned)
> > +
> > +# ifdef USE_AS_STRNCMP
> > +       .p2align 4,, 2
> > +L(ret_zero_page_cross_slow_case0):
> >         xorl    %eax, %eax
> > -       movl    (%rdi, %rdx), %ecx
> > -       cmpl    (%rsi, %rdx), %ecx
> > -       jne     L(wcscmp_return)
> > -# else
> > -       movzbl  (%rdi, %rdx), %eax
> > -       movzbl  (%rsi, %rdx), %edx
> > -       subl    %edx, %eax
> > +       ret
> >  # endif
> > -       VZEROUPPER_RETURN
> >
> > -       /* Comparing on page boundary region requires special treatment:
> > -          It must done one vector at the time, starting with the wider
> > -          ymm vector if possible, if not, with xmm. If fetching 16 bytes
> > -          (xmm) still passes the boundary, byte comparison must be done.
> > -        */
> > -       .p2align 4
> > -L(cross_page):
> > -       /* Try one ymm vector at a time.  */
> > -       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> > -       jg      L(cross_page_1_vector)
> > -L(loop_1_vector):
> > -       vmovdqu (%rdi, %rdx), %ymm1
> > -       VPCMPEQ (%rsi, %rdx), %ymm1, %ymm0
> > -       VPMINU  %ymm1, %ymm0, %ymm0
> > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > -       vpmovmskb %ymm0, %ecx
> > -       testl   %ecx, %ecx
> > -       jne     L(last_vector)
> >
> > -       addl    $VEC_SIZE, %edx
> > +       .p2align 4,, 10
> > +L(less_16_till_page):
> > +       /* Find largest load size we can use.  */
> > +       cmpl    $24, %eax
> > +       ja      L(less_8_till_page)
> >
> > -       addl    $VEC_SIZE, %eax
> > -# ifdef USE_AS_STRNCMP
> > -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> > -          (%r11).  */
> > -       cmpq    %r11, %rdx
> > -       jae     L(zero)
> > -# endif
> > -       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> > -       jle     L(loop_1_vector)
> > -L(cross_page_1_vector):
> > -       /* Less than 32 bytes to check, try one xmm vector.  */
> > -       cmpl    $(PAGE_SIZE - 16), %eax
> > -       jg      L(cross_page_1_xmm)
> > -       vmovdqu (%rdi, %rdx), %xmm1
> > -       VPCMPEQ (%rsi, %rdx), %xmm1, %xmm0
> > -       VPMINU  %xmm1, %xmm0, %xmm0
> > -       VPCMPEQ %xmm7, %xmm0, %xmm0
> > -       vpmovmskb %xmm0, %ecx
> > -       testl   %ecx, %ecx
> > -       jne     L(last_vector)
> > +       vmovq   (%rdi), %xmm0
> > +       vmovq   (%rsi), %xmm1
> > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > +       VPCMPEQ %xmm1, %xmm0, %xmm1
> > +       vpandn  %xmm1, %xmm2, %xmm1
> > +       vpmovmskb %ymm1, %ecx
> > +       incb    %cl
> > +       jnz     L(check_ret_vec_page_cross)
> >
> > -       addl    $16, %edx
> > -# ifndef USE_AS_WCSCMP
> > -       addl    $16, %eax
> > +
> > +# ifdef USE_AS_STRNCMP
> > +       cmpq    $8, %rdx
> > +       jbe     L(ret_zero_page_cross_slow_case0)
> >  # endif
> > +       movl    $24, %OFFSET_REG
> > +       /* Explicit check for 16 byte alignment.  */
> > +       subl    %eax, %OFFSET_REG
> > +
> > +
> > +
> > +       vmovq   (%rdi, %OFFSET_REG64), %xmm0
> > +       vmovq   (%rsi, %OFFSET_REG64), %xmm1
> > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > +       VPCMPEQ %xmm1, %xmm0, %xmm1
> > +       vpandn  %xmm1, %xmm2, %xmm1
> > +       vpmovmskb %ymm1, %ecx
> > +       incb    %cl
> > +       jnz     L(check_ret_vec_page_cross)
> > +
> >  # ifdef USE_AS_STRNCMP
> > -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> > -          (%r11).  */
> > -       cmpq    %r11, %rdx
> > -       jae     L(zero)
> > -# endif
> > -
> > -L(cross_page_1_xmm):
> > -# ifndef USE_AS_WCSCMP
> > -       /* Less than 16 bytes to check, try 8 byte vector.  NB: No need
> > -          for wcscmp nor wcsncmp since wide char is 4 bytes.   */
> > -       cmpl    $(PAGE_SIZE - 8), %eax
> > -       jg      L(cross_page_8bytes)
> > -       vmovq   (%rdi, %rdx), %xmm1
> > -       vmovq   (%rsi, %rdx), %xmm0
> > -       VPCMPEQ %xmm0, %xmm1, %xmm0
> > -       VPMINU  %xmm1, %xmm0, %xmm0
> > -       VPCMPEQ %xmm7, %xmm0, %xmm0
> > -       vpmovmskb %xmm0, %ecx
> > -       /* Only last 8 bits are valid.  */
> > -       andl    $0xff, %ecx
> > -       testl   %ecx, %ecx
> > -       jne     L(last_vector)
> > +       addl    $8, %OFFSET_REG
> > +       subq    %OFFSET_REG64, %rdx
> > +       jbe     L(ret_zero_page_cross_slow_case0)
> > +       subq    $-(VEC_SIZE * 4), %rdx
> >
> > -       addl    $8, %edx
> > -       addl    $8, %eax
> > +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > +# else
> > +       leaq    (8 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > +       leaq    (8 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > +# endif
> > +       jmp     L(prepare_loop_aligned)
> > +
> > +
> > +       .p2align 4,, 10
> > +L(less_8_till_page):
> > +# ifdef USE_AS_WCSCMP
> > +       /* If using wchar then this is the only check before we reach
> > +          the page boundary.  */
> > +       movl    (%rdi), %eax
> > +       movl    (%rsi), %ecx
> > +       cmpl    %ecx, %eax
> > +       jnz     L(ret_less_8_wcs)
> >  #  ifdef USE_AS_STRNCMP
> > -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> > -          (%r11).  */
> > -       cmpq    %r11, %rdx
> > -       jae     L(zero)
> > +       addq    %rdi, %rdx
> > +       /* We already checked for len <= 1 so cannot hit that case here.
> > +        */
> >  #  endif
> > +       testl   %eax, %eax
> > +       jnz     L(prepare_loop_no_len)
> > +       ret
> >
> > -L(cross_page_8bytes):
> > -       /* Less than 8 bytes to check, try 4 byte vector.  */
> > -       cmpl    $(PAGE_SIZE - 4), %eax
> > -       jg      L(cross_page_4bytes)
> > -       vmovd   (%rdi, %rdx), %xmm1
> > -       vmovd   (%rsi, %rdx), %xmm0
> > -       VPCMPEQ %xmm0, %xmm1, %xmm0
> > -       VPMINU  %xmm1, %xmm0, %xmm0
> > -       VPCMPEQ %xmm7, %xmm0, %xmm0
> > -       vpmovmskb %xmm0, %ecx
> > -       /* Only last 4 bits are valid.  */
> > -       andl    $0xf, %ecx
> > -       testl   %ecx, %ecx
> > -       jne     L(last_vector)
> > +       .p2align 4,, 8
> > +L(ret_less_8_wcs):
> > +       setl    %OFFSET_REG8
> > +       negl    %OFFSET_REG
> > +       movl    %OFFSET_REG, %eax
> > +       xorl    %r8d, %eax
> > +       ret
> > +
> > +# else
> > +
> > +       /* Find largest load size we can use.  */
> > +       cmpl    $28, %eax
> > +       ja      L(less_4_till_page)
> > +
> > +       vmovd   (%rdi), %xmm0
> > +       vmovd   (%rsi), %xmm1
> > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > +       VPCMPEQ %xmm1, %xmm0, %xmm1
> > +       vpandn  %xmm1, %xmm2, %xmm1
> > +       vpmovmskb %ymm1, %ecx
> > +       subl    $0xf, %ecx
> > +       jnz     L(check_ret_vec_page_cross)
> >
> > -       addl    $4, %edx
> >  #  ifdef USE_AS_STRNCMP
> > -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> > -          (%r11).  */
> > -       cmpq    %r11, %rdx
> > -       jae     L(zero)
> > +       cmpq    $4, %rdx
> > +       jbe     L(ret_zero_page_cross_slow_case1)
> >  #  endif
> > +       movl    $28, %OFFSET_REG
> > +       /* Explicit check for 16 byte alignment.  */
> > +       subl    %eax, %OFFSET_REG
> >
> > -L(cross_page_4bytes):
> > -# endif
> > -       /* Less than 4 bytes to check, try one byte/dword at a time.  */
> > -# ifdef USE_AS_STRNCMP
> > -       cmpq    %r11, %rdx
> > -       jae     L(zero)
> > -# endif
> > -# ifdef USE_AS_WCSCMP
> > -       movl    (%rdi, %rdx), %eax
> > -       movl    (%rsi, %rdx), %ecx
> > -# else
> > -       movzbl  (%rdi, %rdx), %eax
> > -       movzbl  (%rsi, %rdx), %ecx
> > -# endif
> > -       testl   %eax, %eax
> > -       jne     L(cross_page_loop)
> > +
> > +
> > +       vmovd   (%rdi, %OFFSET_REG64), %xmm0
> > +       vmovd   (%rsi, %OFFSET_REG64), %xmm1
> > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > +       VPCMPEQ %xmm1, %xmm0, %xmm1
> > +       vpandn  %xmm1, %xmm2, %xmm1
> > +       vpmovmskb %ymm1, %ecx
> > +       subl    $0xf, %ecx
> > +       jnz     L(check_ret_vec_page_cross)
> > +
> > +#  ifdef USE_AS_STRNCMP
> > +       addl    $4, %OFFSET_REG
> > +       subq    %OFFSET_REG64, %rdx
> > +       jbe     L(ret_zero_page_cross_slow_case1)
> > +       subq    $-(VEC_SIZE * 4), %rdx
> > +
> > +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > +#  else
> > +       leaq    (4 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > +       leaq    (4 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > +#  endif
> > +       jmp     L(prepare_loop_aligned)
> > +
> > +#  ifdef USE_AS_STRNCMP
> > +       .p2align 4,, 2
> > +L(ret_zero_page_cross_slow_case1):
> > +       xorl    %eax, %eax
> > +       ret
> > +#  endif
> > +
> > +       .p2align 4,, 10
> > +L(less_4_till_page):
> > +       subq    %rdi, %rsi
> > +       /* Extremely slow byte comparison loop.  */
> > +L(less_4_loop):
> > +       movzbl  (%rdi), %eax
> > +       movzbl  (%rsi, %rdi), %ecx
> >         subl    %ecx, %eax
> > -       VZEROUPPER_RETURN
> > -END (STRCMP)
> > +       jnz     L(ret_less_4_loop)
> > +       testl   %ecx, %ecx
> > +       jz      L(ret_zero_4_loop)
> > +#  ifdef USE_AS_STRNCMP
> > +       decq    %rdx
> > +       jz      L(ret_zero_4_loop)
> > +#  endif
> > +       incq    %rdi
> > +       /* End condition is reaching the page boundary (rdi is aligned).  */
> > +       testl   $31, %edi
> > +       jnz     L(less_4_loop)
> > +       leaq    -(VEC_SIZE * 4)(%rdi, %rsi), %rsi
> > +       addq    $-(VEC_SIZE * 4), %rdi
> > +#  ifdef USE_AS_STRNCMP
> > +       subq    $-(VEC_SIZE * 4), %rdx
> > +#  endif
> > +       jmp     L(prepare_loop_aligned)
> > +
> > +L(ret_zero_4_loop):
> > +       xorl    %eax, %eax
> > +       ret
> > +L(ret_less_4_loop):
> > +       xorl    %r8d, %eax
> > +       subl    %r8d, %eax
> > +       ret
> > +# endif
> > +END(STRCMP)
> >  #endif
> > --
> > 2.25.1
> >
>
> LGTM.

Should I wait until 2.36 release to push the optimized versions?

There are a lot of edge cases with these functions and the last time we
tried to improve them in:

commit c46e9afb2df5fc9e39ff4d13777e4b4c26e04e55
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Oct 29 12:40:20 2021 -0700

    x86-64: Improve EVEX strcmp with masked load


We missed a case:
https://bugzilla.redhat.com/show_bug.cgi?id=2026399#c19
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 5/7] x86: Optimize strcmp-avx2.S
  2022-01-10  1:06       ` Noah Goldstein
@ 2022-01-10  1:58         ` H.J. Lu
  2022-01-10  2:54           ` Noah Goldstein
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2022-01-10  1:58 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library

On Sun, Jan 9, 2022 at 5:06 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Sun, Jan 9, 2022 at 6:41 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Sun, Jan 9, 2022 at 4:31 PM Noah Goldstein via Libc-alpha
> > <libc-alpha@sourceware.org> wrote:
> > >
> > > Optimizations are primarily to the loop logic and how the page cross
> > > logic interacts with the loop.
> > >
> > > The page cross logic is at times more expensive for short strings near
> > > the end of a page but not crossing the page. This is done to retest
> > > the page cross conditions with a non-faulty check and to improve the
> > > logic for entering the loop afterwards. This only affects particular
> > > cases, however, and is generally made up for by more than 10x improvements on
> > > the transition from the page cross -> loop case.
> > >
> > > The non-page cross cases are improved most for smaller sizes [0, 128]
> > > and go about even for (128, 4096]. The loop page cross logic is
> > > improved so some more significant speedup is seen there as well.
> > >
> > > test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
> > >
> > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > > ---
> > >  sysdeps/x86_64/multiarch/strcmp-avx2.S | 1590 ++++++++++++++----------
> > >  1 file changed, 939 insertions(+), 651 deletions(-)
> > >
> > > diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > index 9c73b5899d..28d6a0025a 100644
> > > --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > @@ -26,35 +26,57 @@
> > >
> > >  # define PAGE_SIZE     4096
> > >
> > > -/* VEC_SIZE = Number of bytes in a ymm register */
> > > +       /* VEC_SIZE = Number of bytes in a ymm register.  */
> > >  # define VEC_SIZE      32
> > >
> > > -/* Shift for dividing by (VEC_SIZE * 4).  */
> > > -# define DIVIDE_BY_VEC_4_SHIFT 7
> > > -# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
> > > -#  error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
> > > -# endif
> > > +# define VMOVU vmovdqu
> > > +# define VMOVA vmovdqa
> > >
> > >  # ifdef USE_AS_WCSCMP
> > > -/* Compare packed dwords.  */
> > > +       /* Compare packed dwords.  */
> > >  #  define VPCMPEQ      vpcmpeqd
> > > -/* Compare packed dwords and store minimum.  */
> > > +       /* Compare packed dwords and store minimum.  */
> > >  #  define VPMINU       vpminud
> > > -/* 1 dword char == 4 bytes.  */
> > > +       /* 1 dword char == 4 bytes.  */
> > >  #  define SIZE_OF_CHAR 4
> > >  # else
> > > -/* Compare packed bytes.  */
> > > +       /* Compare packed bytes.  */
> > >  #  define VPCMPEQ      vpcmpeqb
> > > -/* Compare packed bytes and store minimum.  */
> > > +       /* Compare packed bytes and store minimum.  */
> > >  #  define VPMINU       vpminub
> > > -/* 1 byte char == 1 byte.  */
> > > +       /* 1 byte char == 1 byte.  */
> > >  #  define SIZE_OF_CHAR 1
> > >  # endif
> > >
> > > +# ifdef USE_AS_STRNCMP
> > > +#  define LOOP_REG     r9d
> > > +#  define LOOP_REG64   r9
> > > +
> > > +#  define OFFSET_REG8  r9b
> > > +#  define OFFSET_REG   r9d
> > > +#  define OFFSET_REG64 r9
> > > +# else
> > > +#  define LOOP_REG     edx
> > > +#  define LOOP_REG64   rdx
> > > +
> > > +#  define OFFSET_REG8  dl
> > > +#  define OFFSET_REG   edx
> > > +#  define OFFSET_REG64 rdx
> > > +# endif
> > > +
> > >  # ifndef VZEROUPPER
> > >  #  define VZEROUPPER   vzeroupper
> > >  # endif
> > >
> > > +# if defined USE_AS_STRNCMP
> > > +#  define VEC_OFFSET   0
> > > +# else
> > > +#  define VEC_OFFSET   (-VEC_SIZE)
> > > +# endif
> > > +
> > > +# define xmmZERO       xmm15
> > > +# define ymmZERO       ymm15
> > > +
> > >  # ifndef SECTION
> > >  #  define SECTION(p)   p##.avx
> > >  # endif
> > > @@ -79,783 +101,1049 @@
> > >     the maximum offset is reached before a difference is found, zero is
> > >     returned.  */
> > >
> > > -       .section SECTION(.text),"ax",@progbits
> > > -ENTRY (STRCMP)
> > > +       .section SECTION(.text), "ax", @progbits
> > > +ENTRY(STRCMP)
> > >  # ifdef USE_AS_STRNCMP
> > > -       /* Check for simple cases (0 or 1) in offset.  */
> > > +#  ifdef __ILP32__
> > > +       /* Clear the upper 32 bits.  */
> > > +       movl    %edx, %edx
> > > +#  endif
> > >         cmp     $1, %RDX_LP
> > > -       je      L(char0)
> > > -       jb      L(zero)
> > > +       /* Signed comparison intentional. We use this branch to also
> > > +          test cases where length >= 2^63. These very large sizes can be
> > > +          handled with strcmp as there is no way for that length to
> > > +          actually bound the buffer.  */
> > > +       jle     L(one_or_less)
> > >  #  ifdef USE_AS_WCSCMP
> > > -#  ifndef __ILP32__
> > >         movq    %rdx, %rcx
> > > -       /* Check if length could overflow when multiplied by
> > > -          sizeof(wchar_t). Checking top 8 bits will cover all potential
> > > -          overflow cases as well as redirect cases where its impossible to
> > > -          length to bound a valid memory region. In these cases just use
> > > -          'wcscmp'.  */
> > > +
> > > +       /* Multiplying length by sizeof(wchar_t) can result in overflow.
> > > +          Check if that is possible. All cases where overflow are possible
> > > +          are cases where length is large enough that it can never be a
> > > +          bound on valid memory so just use wcscmp.  */
> > >         shrq    $56, %rcx
> > >         jnz     __wcscmp_avx2
> > > +
> > > +       leaq    (, %rdx, 4), %rdx
> > >  #  endif
> > > -       /* Convert units: from wide to byte char.  */
> > > -       shl     $2, %RDX_LP
> > > -#  endif
> > > -       /* Register %r11 tracks the maximum offset.  */
> > > -       mov     %RDX_LP, %R11_LP
> > >  # endif
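In rough C terms the strncmp/wcsncmp prologue above amounts to the
following (strcmp_like, wcscmp_like and one_char_cmp are illustrative
names, not glibc internals):

    if ((int64_t) n <= 1)        /* covers n == 0, n == 1 and n >= 2^63 */
      {
        if (n == 0)
          return 0;
        if (n > 1)               /* n >= 2^63 cannot bound real memory */
          return strcmp_like (s1, s2);
        return one_char_cmp (s1, s2);
      }
    if (wide)
      {
        if (n >> 56)             /* n * 4 could overflow: fall back */
          return wcscmp_like (s1, s2);
        n *= sizeof (wchar_t);   /* length in bytes from here on */
      }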
> > > +       vpxor   %xmmZERO, %xmmZERO, %xmmZERO
> > >         movl    %edi, %eax
> > > -       xorl    %edx, %edx
> > > -       /* Make %xmm7 (%ymm7) all zeros in this function.  */
> > > -       vpxor   %xmm7, %xmm7, %xmm7
> > >         orl     %esi, %eax
> > > -       andl    $(PAGE_SIZE - 1), %eax
> > > -       cmpl    $(PAGE_SIZE - (VEC_SIZE * 4)), %eax
> > > -       jg      L(cross_page)
> > > -       /* Start comparing 4 vectors.  */
> > > -       vmovdqu (%rdi), %ymm1
> > > -       VPCMPEQ (%rsi), %ymm1, %ymm0
> > > -       VPMINU  %ymm1, %ymm0, %ymm0
> > > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > > -       vpmovmskb %ymm0, %ecx
> > > -       testl   %ecx, %ecx
> > > -       je      L(next_3_vectors)
> > > -       tzcntl  %ecx, %edx
> > > +       sall    $20, %eax
> > > +       /* Check if s1 or s2 may cross a page in the next 4x VEC loads.  */
> > > +       cmpl    $((PAGE_SIZE -(VEC_SIZE * 4)) << 20), %eax
> > > +       ja      L(page_cross)
> > > +
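The movl/orl/sall/cmpl sequence above is, roughly, the following C check
(assuming PAGE_SIZE 4096 and VEC_SIZE 32; ORing the two page offsets can
over-report, which is why L(page_cross) re-tests for a true positive):

    unsigned int off = ((unsigned int) (uintptr_t) s1
                        | (unsigned int) (uintptr_t) s2) & (PAGE_SIZE - 1);
    if (off > PAGE_SIZE - VEC_SIZE * 4)
      goto page_cross;    /* one of the next 4x VEC loads might fault */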
> > > +L(no_page_cross):
> > > +       /* Safe to compare 4x vectors.  */
> > > +       VMOVU   (%rdi), %ymm0
> > > +       /* 1s where s1 and s2 equal.  */
> > > +       VPCMPEQ (%rsi), %ymm0, %ymm1
> > > +       /* 1s at null CHAR.  */
> > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > +       /* 1s where s1 and s2 equal AND not null CHAR.  */
> > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > +
> > > +       /* All 1s -> keep going, any 0s -> return.  */
> > > +       vpmovmskb %ymm1, %ecx
> > >  # ifdef USE_AS_STRNCMP
> > > -       /* Return 0 if the mismatched index (%rdx) is after the maximum
> > > -          offset (%r11).   */
> > > -       cmpq    %r11, %rdx
> > > -       jae     L(zero)
> > > +       cmpq    $VEC_SIZE, %rdx
> > > +       jbe     L(vec_0_test_len)
> > >  # endif
> > > +
> > > +       /* All 1s represents all equal. incl will overflow to zero in the
> > > +          all-equals case. Otherwise the 1s will carry until the position
> > > +          of the first mismatch.  */
> > > +       incl    %ecx
> > > +       jz      L(more_3x_vec)
> > > +
> > > +       .p2align 4,, 4
> > > +L(return_vec_0):
> > > +       tzcntl  %ecx, %ecx
> > >  # ifdef USE_AS_WCSCMP
> > > +       movl    (%rdi, %rcx), %edx
> > >         xorl    %eax, %eax
> > > -       movl    (%rdi, %rdx), %ecx
> > > -       cmpl    (%rsi, %rdx), %ecx
> > > -       je      L(return)
> > > -L(wcscmp_return):
> > > +       cmpl    (%rsi, %rcx), %edx
> > > +       je      L(ret0)
> > >         setl    %al
> > >         negl    %eax
> > >         orl     $1, %eax
> > > -L(return):
> > >  # else
> > > -       movzbl  (%rdi, %rdx), %eax
> > > -       movzbl  (%rsi, %rdx), %edx
> > > -       subl    %edx, %eax
> > > +       movzbl  (%rdi, %rcx), %eax
> > > +       movzbl  (%rsi, %rcx), %ecx
> > > +       subl    %ecx, %eax
> > >  # endif
> > > +L(ret0):
> > >  L(return_vzeroupper):
> > >         ZERO_UPPER_VEC_REGISTERS_RETURN
> > >
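A sketch of the mask trick used above (equal_and_nonnull_bits stands in
for the vpmovmskb result; the helper names are illustrative):

    uint32_t mask = equal_and_nonnull_bits;   /* 1 bit per byte */
    if (mask + 1 == 0)
      continue_with_more_vectors ();          /* all 32 bytes fine */
    else
      {
        /* The +1 carries through the low run of 1s and lands on the
           first 0 bit, so ctz gives the first mismatch/NUL index.  */
        unsigned int idx = __builtin_ctz (mask + 1);
        return compare_chars_at (s1, s2, idx);
      }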
> > > -       .p2align 4
> > > -L(return_vec_size):
> > > -       tzcntl  %ecx, %edx
> > >  # ifdef USE_AS_STRNCMP
> > > -       /* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after
> > > -          the maximum offset (%r11).  */
> > > -       addq    $VEC_SIZE, %rdx
> > > -       cmpq    %r11, %rdx
> > > -       jae     L(zero)
> > > -#  ifdef USE_AS_WCSCMP
> > > +       .p2align 4,, 8
> > > +L(vec_0_test_len):
> > > +       notl    %ecx
> > > +       bzhil   %edx, %ecx, %eax
> > > +       jnz     L(return_vec_0)
> > > +       /* Align if will cross fetch block.  */
> > > +       .p2align 4,, 2
> > > +L(ret_zero):
> > >         xorl    %eax, %eax
> > > -       movl    (%rdi, %rdx), %ecx
> > > -       cmpl    (%rsi, %rdx), %ecx
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (%rdi, %rdx), %eax
> > > -       movzbl  (%rsi, %rdx), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > -# else
> > > +       VZEROUPPER_RETURN
> > > +
> > > +       .p2align 4,, 5
> > > +L(one_or_less):
> > > +       jb      L(ret_zero)
> > >  #  ifdef USE_AS_WCSCMP
> > > +       /* 'nbe' covers the case where length is negative (large
> > > +          unsigned).  */
> > > +       jnbe    __wcscmp_avx2
> > > +       movl    (%rdi), %edx
> > >         xorl    %eax, %eax
> > > -       movl    VEC_SIZE(%rdi, %rdx), %ecx
> > > -       cmpl    VEC_SIZE(%rsi, %rdx), %ecx
> > > -       jne     L(wcscmp_return)
> > > +       cmpl    (%rsi), %edx
> > > +       je      L(ret1)
> > > +       setl    %al
> > > +       negl    %eax
> > > +       orl     $1, %eax
> > >  #  else
> > > -       movzbl  VEC_SIZE(%rdi, %rdx), %eax
> > > -       movzbl  VEC_SIZE(%rsi, %rdx), %edx
> > > -       subl    %edx, %eax
> > > +       /* 'nbe' covers the case where length is negative (large
> > > +          unsigned).  */
> > > +
> > > +       jnbe    __strcmp_avx2
> > > +       movzbl  (%rdi), %eax
> > > +       movzbl  (%rsi), %ecx
> > > +       subl    %ecx, %eax
> > >  #  endif
> > > +L(ret1):
> > > +       ret
> > >  # endif
> > > -       VZEROUPPER_RETURN
> > >
> > > -       .p2align 4
> > > -L(return_2_vec_size):
> > > -       tzcntl  %ecx, %edx
> > > +       .p2align 4,, 10
> > > +L(return_vec_1):
> > > +       tzcntl  %ecx, %ecx
> > >  # ifdef USE_AS_STRNCMP
> > > -       /* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is
> > > -          after the maximum offset (%r11).  */
> > > -       addq    $(VEC_SIZE * 2), %rdx
> > > -       cmpq    %r11, %rdx
> > > -       jae     L(zero)
> > > -#  ifdef USE_AS_WCSCMP
> > > +       /* rdx must be > CHAR_PER_VEC so it is safe to subtract without
> > > +          fear of overflow.  */
> > > +       addq    $-VEC_SIZE, %rdx
> > > +       cmpq    %rcx, %rdx
> > > +       jbe     L(ret_zero)
> > > +# endif
> > > +# ifdef USE_AS_WCSCMP
> > > +       movl    VEC_SIZE(%rdi, %rcx), %edx
> > >         xorl    %eax, %eax
> > > -       movl    (%rdi, %rdx), %ecx
> > > -       cmpl    (%rsi, %rdx), %ecx
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (%rdi, %rdx), %eax
> > > -       movzbl  (%rsi, %rdx), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > +       cmpl    VEC_SIZE(%rsi, %rcx), %edx
> > > +       je      L(ret2)
> > > +       setl    %al
> > > +       negl    %eax
> > > +       orl     $1, %eax
> > >  # else
> > > -#  ifdef USE_AS_WCSCMP
> > > -       xorl    %eax, %eax
> > > -       movl    (VEC_SIZE * 2)(%rdi, %rdx), %ecx
> > > -       cmpl    (VEC_SIZE * 2)(%rsi, %rdx), %ecx
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (VEC_SIZE * 2)(%rdi, %rdx), %eax
> > > -       movzbl  (VEC_SIZE * 2)(%rsi, %rdx), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > +       movzbl  VEC_SIZE(%rdi, %rcx), %eax
> > > +       movzbl  VEC_SIZE(%rsi, %rcx), %ecx
> > > +       subl    %ecx, %eax
> > >  # endif
> > > +L(ret2):
> > >         VZEROUPPER_RETURN
> > >
> > > -       .p2align 4
> > > -L(return_3_vec_size):
> > > -       tzcntl  %ecx, %edx
> > > +       .p2align 4,, 10
> > >  # ifdef USE_AS_STRNCMP
> > > -       /* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is
> > > -          after the maximum offset (%r11).  */
> > > -       addq    $(VEC_SIZE * 3), %rdx
> > > -       cmpq    %r11, %rdx
> > > -       jae     L(zero)
> > > -#  ifdef USE_AS_WCSCMP
> > > +L(return_vec_3):
> > > +       salq    $32, %rcx
> > > +# endif
> > > +
> > > +L(return_vec_2):
> > > +# ifndef USE_AS_STRNCMP
> > > +       tzcntl  %ecx, %ecx
> > > +# else
> > > +       tzcntq  %rcx, %rcx
> > > +       cmpq    %rcx, %rdx
> > > +       jbe     L(ret_zero)
> > > +# endif
> > > +
> > > +# ifdef USE_AS_WCSCMP
> > > +       movl    (VEC_SIZE * 2)(%rdi, %rcx), %edx
> > >         xorl    %eax, %eax
> > > -       movl    (%rdi, %rdx), %ecx
> > > -       cmpl    (%rsi, %rdx), %ecx
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (%rdi, %rdx), %eax
> > > -       movzbl  (%rsi, %rdx), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > +       cmpl    (VEC_SIZE * 2)(%rsi, %rcx), %edx
> > > +       je      L(ret3)
> > > +       setl    %al
> > > +       negl    %eax
> > > +       orl     $1, %eax
> > >  # else
> > > +       movzbl  (VEC_SIZE * 2)(%rdi, %rcx), %eax
> > > +       movzbl  (VEC_SIZE * 2)(%rsi, %rcx), %ecx
> > > +       subl    %ecx, %eax
> > > +# endif
> > > +L(ret3):
> > > +       VZEROUPPER_RETURN
> > > +
> > > +# ifndef USE_AS_STRNCMP
> > > +       .p2align 4,, 10
> > > +L(return_vec_3):
> > > +       tzcntl  %ecx, %ecx
> > >  #  ifdef USE_AS_WCSCMP
> > > +       movl    (VEC_SIZE * 3)(%rdi, %rcx), %edx
> > >         xorl    %eax, %eax
> > > -       movl    (VEC_SIZE * 3)(%rdi, %rdx), %ecx
> > > -       cmpl    (VEC_SIZE * 3)(%rsi, %rdx), %ecx
> > > -       jne     L(wcscmp_return)
> > > +       cmpl    (VEC_SIZE * 3)(%rsi, %rcx), %edx
> > > +       je      L(ret4)
> > > +       setl    %al
> > > +       negl    %eax
> > > +       orl     $1, %eax
> > >  #  else
> > > -       movzbl  (VEC_SIZE * 3)(%rdi, %rdx), %eax
> > > -       movzbl  (VEC_SIZE * 3)(%rsi, %rdx), %edx
> > > -       subl    %edx, %eax
> > > +       movzbl  (VEC_SIZE * 3)(%rdi, %rcx), %eax
> > > +       movzbl  (VEC_SIZE * 3)(%rsi, %rcx), %ecx
> > > +       subl    %ecx, %eax
> > >  #  endif
> > > -# endif
> > > +L(ret4):
> > >         VZEROUPPER_RETURN
> > > +# endif
> > > +
> > > +       .p2align 4,, 10
> > > +L(more_3x_vec):
> > > +       /* Safe to compare 4x vectors.  */
> > > +       VMOVU   VEC_SIZE(%rdi), %ymm0
> > > +       VPCMPEQ VEC_SIZE(%rsi), %ymm0, %ymm1
> > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +       incl    %ecx
> > > +       jnz     L(return_vec_1)
> > > +
> > > +# ifdef USE_AS_STRNCMP
> > > +       subq    $(VEC_SIZE * 2), %rdx
> > > +       jbe     L(ret_zero)
> > > +# endif
> > > +
> > > +       VMOVU   (VEC_SIZE * 2)(%rdi), %ymm0
> > > +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm0, %ymm1
> > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +       incl    %ecx
> > > +       jnz     L(return_vec_2)
> > > +
> > > +       VMOVU   (VEC_SIZE * 3)(%rdi), %ymm0
> > > +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm0, %ymm1
> > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +       incl    %ecx
> > > +       jnz     L(return_vec_3)
> > >
> > > -       .p2align 4
> > > -L(next_3_vectors):
> > > -       vmovdqu VEC_SIZE(%rdi), %ymm6
> > > -       VPCMPEQ VEC_SIZE(%rsi), %ymm6, %ymm3
> > > -       VPMINU  %ymm6, %ymm3, %ymm3
> > > -       VPCMPEQ %ymm7, %ymm3, %ymm3
> > > -       vpmovmskb %ymm3, %ecx
> > > -       testl   %ecx, %ecx
> > > -       jne     L(return_vec_size)
> > > -       vmovdqu (VEC_SIZE * 2)(%rdi), %ymm5
> > > -       vmovdqu (VEC_SIZE * 3)(%rdi), %ymm4
> > > -       vmovdqu (VEC_SIZE * 3)(%rsi), %ymm0
> > > -       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm5, %ymm2
> > > -       VPMINU  %ymm5, %ymm2, %ymm2
> > > -       VPCMPEQ %ymm4, %ymm0, %ymm0
> > > -       VPCMPEQ %ymm7, %ymm2, %ymm2
> > > -       vpmovmskb %ymm2, %ecx
> > > -       testl   %ecx, %ecx
> > > -       jne     L(return_2_vec_size)
> > > -       VPMINU  %ymm4, %ymm0, %ymm0
> > > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > > -       vpmovmskb %ymm0, %ecx
> > > -       testl   %ecx, %ecx
> > > -       jne     L(return_3_vec_size)
> > > -L(main_loop_header):
> > > -       leaq    (VEC_SIZE * 4)(%rdi), %rdx
> > > -       movl    $PAGE_SIZE, %ecx
> > > -       /* Align load via RAX.  */
> > > -       andq    $-(VEC_SIZE * 4), %rdx
> > > -       subq    %rdi, %rdx
> > > -       leaq    (%rdi, %rdx), %rax
> > >  # ifdef USE_AS_STRNCMP
> > > -       /* Starting from this point, the maximum offset, or simply the
> > > -          'offset', DECREASES by the same amount when base pointers are
> > > -          moved forward.  Return 0 when:
> > > -            1) On match: offset <= the matched vector index.
> > > -            2) On mistmach, offset is before the mistmatched index.
> > > +       cmpq    $(VEC_SIZE * 2), %rdx
> > > +       jbe     L(ret_zero)
> > > +# endif
> > > +
> > > +# ifdef USE_AS_WCSCMP
> > > +       /* Any non-zero positive value that doesn't interfere with 0x1.
> > >          */
> > > -       subq    %rdx, %r11
> > > -       jbe     L(zero)
> > > -# endif
> > > -       addq    %rsi, %rdx
> > > -       movq    %rdx, %rsi
> > > -       andl    $(PAGE_SIZE - 1), %esi
> > > -       /* Number of bytes before page crossing.  */
> > > -       subq    %rsi, %rcx
> > > -       /* Number of VEC_SIZE * 4 blocks before page crossing.  */
> > > -       shrq    $DIVIDE_BY_VEC_4_SHIFT, %rcx
> > > -       /* ESI: Number of VEC_SIZE * 4 blocks before page crossing.   */
> > > -       movl    %ecx, %esi
> > > -       jmp     L(loop_start)
> > > +       movl    $2, %r8d
> > >
> > > +# else
> > > +       xorl    %r8d, %r8d
> > > +# endif
> > > +
> > > +       /* The prepare labels are various entry points from the page
> > > +          cross logic.  */
> > > +L(prepare_loop):
> > > +
> > > +# ifdef USE_AS_STRNCMP
> > > +       /* Store N + (VEC_SIZE * 4) and place check at the beginning of
> > > +          the loop.  */
> > > +       leaq    (VEC_SIZE * 2)(%rdi, %rdx), %rdx
> > > +# endif
> > > +L(prepare_loop_no_len):
> > > +
> > > +       /* Align s1 and adjust s2 accordingly.  */
> > > +       subq    %rdi, %rsi
> > > +       andq    $-(VEC_SIZE * 4), %rdi
> > > +       addq    %rdi, %rsi
> > > +
> > > +# ifdef USE_AS_STRNCMP
> > > +       subq    %rdi, %rdx
> > > +# endif
> > > +
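The three instructions that align s1 amount to the following (sketch;
the strncmp length in rdx is rebased against the new rdi in the same
way):

    uintptr_t delta = (uintptr_t) s1 & (VEC_SIZE * 4 - 1);
    s1 -= delta;                /* s1 is now VEC_SIZE * 4 aligned */
    s2 -= delta;                /* keep s2 - s1 unchanged */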
> > > +L(prepare_loop_aligned):
> > > +       /* eax stores distance from rsi to next page cross. These cases
> > > +          need to be handled specially as the 4x loop could potentially
> > > +          read memory past the length of s1 or s2 and across a page
> > > +          boundary.  */
> > > +       movl    $-(VEC_SIZE * 4), %eax
> > > +       subl    %esi, %eax
> > > +       andl    $(PAGE_SIZE - 1), %eax
> > > +
> > > +       /* Loop 4x comparisons at a time.  */
> > >         .p2align 4
> > >  L(loop):
> > > +
> > > +       /* End condition for strncmp.  */
> > >  # ifdef USE_AS_STRNCMP
> > > -       /* Base pointers are moved forward by 4 * VEC_SIZE.  Decrease
> > > -          the maximum offset (%r11) by the same amount.  */
> > > -       subq    $(VEC_SIZE * 4), %r11
> > > -       jbe     L(zero)
> > > -# endif
> > > -       addq    $(VEC_SIZE * 4), %rax
> > > -       addq    $(VEC_SIZE * 4), %rdx
> > > -L(loop_start):
> > > -       testl   %esi, %esi
> > > -       leal    -1(%esi), %esi
> > > -       je      L(loop_cross_page)
> > > -L(back_to_loop):
> > > -       /* Main loop, comparing 4 vectors are a time.  */
> > > -       vmovdqa (%rax), %ymm0
> > > -       vmovdqa VEC_SIZE(%rax), %ymm3
> > > -       VPCMPEQ (%rdx), %ymm0, %ymm4
> > > -       VPCMPEQ VEC_SIZE(%rdx), %ymm3, %ymm1
> > > -       VPMINU  %ymm0, %ymm4, %ymm4
> > > -       VPMINU  %ymm3, %ymm1, %ymm1
> > > -       vmovdqa (VEC_SIZE * 2)(%rax), %ymm2
> > > -       VPMINU  %ymm1, %ymm4, %ymm0
> > > -       vmovdqa (VEC_SIZE * 3)(%rax), %ymm3
> > > -       VPCMPEQ (VEC_SIZE * 2)(%rdx), %ymm2, %ymm5
> > > -       VPCMPEQ (VEC_SIZE * 3)(%rdx), %ymm3, %ymm6
> > > -       VPMINU  %ymm2, %ymm5, %ymm5
> > > -       VPMINU  %ymm3, %ymm6, %ymm6
> > > -       VPMINU  %ymm5, %ymm0, %ymm0
> > > -       VPMINU  %ymm6, %ymm0, %ymm0
> > > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > > -
> > > -       /* Test each mask (32 bits) individually because for VEC_SIZE
> > > -          == 32 is not possible to OR the four masks and keep all bits
> > > -          in a 64-bit integer register, differing from SSE2 strcmp
> > > -          where ORing is possible.  */
> > > -       vpmovmskb %ymm0, %ecx
> > > +       subq    $(VEC_SIZE * 4), %rdx
> > > +       jbe     L(ret_zero)
> > > +# endif
> > > +
> > > +       subq    $-(VEC_SIZE * 4), %rdi
> > > +       subq    $-(VEC_SIZE * 4), %rsi
> > > +
> > > +       /* Check if rsi loads will cross a page boundary.  */
> > > +       addl    $-(VEC_SIZE * 4), %eax
> > > +       jnb     L(page_cross_during_loop)
> > > +
> > > +       /* Loop entry after handling page cross during loop.  */
> > > +L(loop_skip_page_cross_check):
> > > +       VMOVA   (VEC_SIZE * 0)(%rdi), %ymm0
> > > +       VMOVA   (VEC_SIZE * 1)(%rdi), %ymm2
> > > +       VMOVA   (VEC_SIZE * 2)(%rdi), %ymm4
> > > +       VMOVA   (VEC_SIZE * 3)(%rdi), %ymm6
> > > +
> > > +       /* ymm1 all 1s where s1 and s2 equal. All 0s otherwise.  */
> > > +       VPCMPEQ (VEC_SIZE * 0)(%rsi), %ymm0, %ymm1
> > > +
> > > +       VPCMPEQ (VEC_SIZE * 1)(%rsi), %ymm2, %ymm3
> > > +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
> > > +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
> > > +
> > > +
> > > +       /* If any mismatches or null CHAR then 0 CHAR, otherwise non-
> > > +          zero.  */
> > > +       vpand   %ymm0, %ymm1, %ymm1
> > > +
> > > +
> > > +       vpand   %ymm2, %ymm3, %ymm3
> > > +       vpand   %ymm4, %ymm5, %ymm5
> > > +       vpand   %ymm6, %ymm7, %ymm7
> > > +
> > > +       VPMINU  %ymm1, %ymm3, %ymm3
> > > +       VPMINU  %ymm5, %ymm7, %ymm7
> > > +
> > > +       /* Reduce all 0 CHARs for the 4x VEC into ymm7.  */
> > > +       VPMINU  %ymm3, %ymm7, %ymm7
> > > +
> > > +       /* If any 0 CHAR then done.  */
> > > +       VPCMPEQ %ymm7, %ymmZERO, %ymm7
> > > +       vpmovmskb %ymm7, %LOOP_REG
> > > +       testl   %LOOP_REG, %LOOP_REG
> > > +       jz      L(loop)
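For readers more comfortable with intrinsics, one iteration of the loop
above can be modelled roughly as follows (the real code folds the
reductions pairwise instead of through a single accumulator; this is a
sketch, not the glibc build):

    #include <immintrin.h>

    /* Returns non-zero if any of the 4 vectors had a mismatch or NUL;
       *outp then has one bit per byte of the final reduction.  */
    static inline int
    loop_iter (const __m256i *p1, const __m256i *p2, unsigned int *outp)
    {
      __m256i acc = _mm256_set1_epi8 (-1);          /* all bytes non-zero */
      for (int i = 0; i < 4; i++)
        {
          __m256i a = _mm256_load_si256 (p1 + i);   /* s1 is 4x VEC aligned */
          __m256i b = _mm256_loadu_si256 (p2 + i);
          __m256i eq = _mm256_cmpeq_epi8 (a, b);
          /* Byte becomes 0 on mismatch or NUL, stays non-zero otherwise.  */
          acc = _mm256_min_epu8 (acc, _mm256_and_si256 (a, eq));
        }
      *outp = (unsigned int) _mm256_movemask_epi8
                (_mm256_cmpeq_epi8 (acc, _mm256_setzero_si256 ()));
      return *outp != 0;
    }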
> > > +
> > > +       /* Find which VEC has the mismatch or end of string.  */
> > > +       VPCMPEQ %ymm1, %ymmZERO, %ymm1
> > > +       vpmovmskb %ymm1, %ecx
> > >         testl   %ecx, %ecx
> > > -       je      L(loop)
> > > -       VPCMPEQ %ymm7, %ymm4, %ymm0
> > > -       vpmovmskb %ymm0, %edi
> > > -       testl   %edi, %edi
> > > -       je      L(test_vec)
> > > -       tzcntl  %edi, %ecx
> > > +       jnz     L(return_vec_0_end)
> > > +
> > > +
> > > +       VPCMPEQ %ymm3, %ymmZERO, %ymm3
> > > +       vpmovmskb %ymm3, %ecx
> > > +       testl   %ecx, %ecx
> > > +       jnz     L(return_vec_1_end)
> > > +
> > > +L(return_vec_2_3_end):
> > >  # ifdef USE_AS_STRNCMP
> > > -       cmpq    %rcx, %r11
> > > -       jbe     L(zero)
> > > -#  ifdef USE_AS_WCSCMP
> > > -       movq    %rax, %rsi
> > > +       subq    $(VEC_SIZE * 2), %rdx
> > > +       jbe     L(ret_zero_end)
> > > +# endif
> > > +
> > > +       VPCMPEQ %ymm5, %ymmZERO, %ymm5
> > > +       vpmovmskb %ymm5, %ecx
> > > +       testl   %ecx, %ecx
> > > +       jnz     L(return_vec_2_end)
> > > +
> > > +       /* LOOP_REG contains matches for null/mismatch from the loop. If
> > > +          VEC 0, 1, and 2 all have no null and no mismatches then mismatch
> > > +          must entirely be from VEC 3 which is fully represented by
> > > +          LOOP_REG.  */
> > > +       tzcntl  %LOOP_REG, %LOOP_REG
> > > +
> > > +# ifdef USE_AS_STRNCMP
> > > +       subl    $-(VEC_SIZE), %LOOP_REG
> > > +       cmpq    %LOOP_REG64, %rdx
> > > +       jbe     L(ret_zero_end)
> > > +# endif
> > > +
> > > +# ifdef USE_AS_WCSCMP
> > > +       movl    (VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %ecx
> > >         xorl    %eax, %eax
> > > -       movl    (%rsi, %rcx), %edi
> > > -       cmpl    (%rdx, %rcx), %edi
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (%rax, %rcx), %eax
> > > -       movzbl  (%rdx, %rcx), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > +       cmpl    (VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
> > > +       je      L(ret5)
> > > +       setl    %al
> > > +       negl    %eax
> > > +       xorl    %r8d, %eax
> > >  # else
> > > -#  ifdef USE_AS_WCSCMP
> > > -       movq    %rax, %rsi
> > > -       xorl    %eax, %eax
> > > -       movl    (%rsi, %rcx), %edi
> > > -       cmpl    (%rdx, %rcx), %edi
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (%rax, %rcx), %eax
> > > -       movzbl  (%rdx, %rcx), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > +       movzbl  (VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %eax
> > > +       movzbl  (VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
> > > +       subl    %ecx, %eax
> > > +       xorl    %r8d, %eax
> > > +       subl    %r8d, %eax
> > >  # endif
> > > +L(ret5):
> > >         VZEROUPPER_RETURN
> > >
> > > -       .p2align 4
> > > -L(test_vec):
> > >  # ifdef USE_AS_STRNCMP
> > > -       /* The first vector matched.  Return 0 if the maximum offset
> > > -          (%r11) <= VEC_SIZE.  */
> > > -       cmpq    $VEC_SIZE, %r11
> > > -       jbe     L(zero)
> > > +       .p2align 4,, 2
> > > +L(ret_zero_end):
> > > +       xorl    %eax, %eax
> > > +       VZEROUPPER_RETURN
> > >  # endif
> > > -       VPCMPEQ %ymm7, %ymm1, %ymm1
> > > -       vpmovmskb %ymm1, %ecx
> > > -       testl   %ecx, %ecx
> > > -       je      L(test_2_vec)
> > > -       tzcntl  %ecx, %edi
> > > +
> > > +
> > > +       /* The L(return_vec_N_end) labels differ from L(return_vec_N) in that
> > > +          they use the value of `r8` to negate the return value. This is
> > > +          because the page cross logic can swap `rdi` and `rsi`.  */
> > > +       .p2align 4,, 10
> > >  # ifdef USE_AS_STRNCMP
> > > -       addq    $VEC_SIZE, %rdi
> > > -       cmpq    %rdi, %r11
> > > -       jbe     L(zero)
> > > -#  ifdef USE_AS_WCSCMP
> > > -       movq    %rax, %rsi
> > > +L(return_vec_1_end):
> > > +       salq    $32, %rcx
> > > +# endif
> > > +L(return_vec_0_end):
> > > +# ifndef USE_AS_STRNCMP
> > > +       tzcntl  %ecx, %ecx
> > > +# else
> > > +       tzcntq  %rcx, %rcx
> > > +       cmpq    %rcx, %rdx
> > > +       jbe     L(ret_zero_end)
> > > +# endif
> > > +
> > > +# ifdef USE_AS_WCSCMP
> > > +       movl    (%rdi, %rcx), %edx
> > >         xorl    %eax, %eax
> > > -       movl    (%rsi, %rdi), %ecx
> > > -       cmpl    (%rdx, %rdi), %ecx
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (%rax, %rdi), %eax
> > > -       movzbl  (%rdx, %rdi), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > +       cmpl    (%rsi, %rcx), %edx
> > > +       je      L(ret6)
> > > +       setl    %al
> > > +       negl    %eax
> > > +       xorl    %r8d, %eax
> > >  # else
> > > +       movzbl  (%rdi, %rcx), %eax
> > > +       movzbl  (%rsi, %rcx), %ecx
> > > +       subl    %ecx, %eax
> > > +       xorl    %r8d, %eax
> > > +       subl    %r8d, %eax
> > > +# endif
> > > +L(ret6):
> > > +       VZEROUPPER_RETURN
> > > +
> > > +# ifndef USE_AS_STRNCMP
> > > +       .p2align 4,, 10
> > > +L(return_vec_1_end):
> > > +       tzcntl  %ecx, %ecx
> > >  #  ifdef USE_AS_WCSCMP
> > > -       movq    %rax, %rsi
> > > +       movl    VEC_SIZE(%rdi, %rcx), %edx
> > >         xorl    %eax, %eax
> > > -       movl    VEC_SIZE(%rsi, %rdi), %ecx
> > > -       cmpl    VEC_SIZE(%rdx, %rdi), %ecx
> > > -       jne     L(wcscmp_return)
> > > +       cmpl    VEC_SIZE(%rsi, %rcx), %edx
> > > +       je      L(ret7)
> > > +       setl    %al
> > > +       negl    %eax
> > > +       xorl    %r8d, %eax
> > >  #  else
> > > -       movzbl  VEC_SIZE(%rax, %rdi), %eax
> > > -       movzbl  VEC_SIZE(%rdx, %rdi), %edx
> > > -       subl    %edx, %eax
> > > +       movzbl  VEC_SIZE(%rdi, %rcx), %eax
> > > +       movzbl  VEC_SIZE(%rsi, %rcx), %ecx
> > > +       subl    %ecx, %eax
> > > +       xorl    %r8d, %eax
> > > +       subl    %r8d, %eax
> > >  #  endif
> > > -# endif
> > > +L(ret7):
> > >         VZEROUPPER_RETURN
> > > +# endif
> > >
> > > -       .p2align 4
> > > -L(test_2_vec):
> > > +       .p2align 4,, 10
> > > +L(return_vec_2_end):
> > > +       tzcntl  %ecx, %ecx
> > >  # ifdef USE_AS_STRNCMP
> > > -       /* The first 2 vectors matched.  Return 0 if the maximum offset
> > > -          (%r11) <= 2 * VEC_SIZE.  */
> > > -       cmpq    $(VEC_SIZE * 2), %r11
> > > -       jbe     L(zero)
> > > +       cmpq    %rcx, %rdx
> > > +       jbe     L(ret_zero_page_cross)
> > >  # endif
> > > -       VPCMPEQ %ymm7, %ymm5, %ymm5
> > > -       vpmovmskb %ymm5, %ecx
> > > -       testl   %ecx, %ecx
> > > -       je      L(test_3_vec)
> > > -       tzcntl  %ecx, %edi
> > > -# ifdef USE_AS_STRNCMP
> > > -       addq    $(VEC_SIZE * 2), %rdi
> > > -       cmpq    %rdi, %r11
> > > -       jbe     L(zero)
> > > -#  ifdef USE_AS_WCSCMP
> > > -       movq    %rax, %rsi
> > > +# ifdef USE_AS_WCSCMP
> > > +       movl    (VEC_SIZE * 2)(%rdi, %rcx), %edx
> > >         xorl    %eax, %eax
> > > -       movl    (%rsi, %rdi), %ecx
> > > -       cmpl    (%rdx, %rdi), %ecx
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (%rax, %rdi), %eax
> > > -       movzbl  (%rdx, %rdi), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > +       cmpl    (VEC_SIZE * 2)(%rsi, %rcx), %edx
> > > +       je      L(ret11)
> > > +       setl    %al
> > > +       negl    %eax
> > > +       xorl    %r8d, %eax
> > >  # else
> > > -#  ifdef USE_AS_WCSCMP
> > > -       movq    %rax, %rsi
> > > -       xorl    %eax, %eax
> > > -       movl    (VEC_SIZE * 2)(%rsi, %rdi), %ecx
> > > -       cmpl    (VEC_SIZE * 2)(%rdx, %rdi), %ecx
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (VEC_SIZE * 2)(%rax, %rdi), %eax
> > > -       movzbl  (VEC_SIZE * 2)(%rdx, %rdi), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > +       movzbl  (VEC_SIZE * 2)(%rdi, %rcx), %eax
> > > +       movzbl  (VEC_SIZE * 2)(%rsi, %rcx), %ecx
> > > +       subl    %ecx, %eax
> > > +       xorl    %r8d, %eax
> > > +       subl    %r8d, %eax
> > >  # endif
> > > +L(ret11):
> > >         VZEROUPPER_RETURN
> > >
> > > -       .p2align 4
> > > -L(test_3_vec):
> > > +
> > > +       /* Page cross in rsi in next 4x VEC.  */
> > > +
> > > +       /* TODO: Improve logic here.  */
> > > +       .p2align 4,, 10
> > > +L(page_cross_during_loop):
> > > +       /* eax contains [distance_from_page - (VEC_SIZE * 4)].  */
> > > +
> > > +       /* Optimistically rsi and rdi are both aligned, in which case we
> > > +          don't need any logic here.  */
> > > +       cmpl    $-(VEC_SIZE * 4), %eax
> > > +       /* Don't adjust eax before jumping back to the loop; we will
> > > +          never hit the page cross case again.  */
> > > +       je      L(loop_skip_page_cross_check)
> > > +
> > > +       /* Check if we can safely load a VEC.  */
> > > +       cmpl    $-(VEC_SIZE * 3), %eax
> > > +       jle     L(less_1x_vec_till_page_cross)
> > > +
> > > +       VMOVA   (%rdi), %ymm0
> > > +       VPCMPEQ (%rsi), %ymm0, %ymm1
> > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +       incl    %ecx
> > > +       jnz     L(return_vec_0_end)
> > > +
> > > +       /* if distance >= 2x VEC then eax > -(VEC_SIZE * 2).  */
> > > +       cmpl    $-(VEC_SIZE * 2), %eax
> > > +       jg      L(more_2x_vec_till_page_cross)
> > > +
> > > +       .p2align 4,, 4
> > > +L(less_1x_vec_till_page_cross):
> > > +       subl    $-(VEC_SIZE * 4), %eax
> > > +       /* Guaranteed safe to read from rdi - VEC_SIZE here. The only
> > > +          concerning case is first iteration if incoming s1 was near start
> > > +          of a page and s2 near end. If s1 was near the start of the page
> > > +          we already aligned up to nearest VEC_SIZE * 4 so guaranteed safe
> > > +          to read back -VEC_SIZE. If rdi is truly at the start of a page
> > > +          here, it means the previous page (rdi - VEC_SIZE) has already
> > > +          been loaded earlier so must be valid.  */
> > > +       VMOVU   -VEC_SIZE(%rdi, %rax), %ymm0
> > > +       VPCMPEQ -VEC_SIZE(%rsi, %rax), %ymm0, %ymm1
> > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +
> > > +       /* Mask of potentially valid bits. The lower bits can be out of
> > > +          range comparisons (but safe regarding page crosses).  */
> > > +       movl    $-1, %r10d
> > > +       shlxl   %esi, %r10d, %r10d
> > > +       notl    %ecx
> > > +
> > >  # ifdef USE_AS_STRNCMP
> > > -       /* The first 3 vectors matched.  Return 0 if the maximum offset
> > > -          (%r11) <= 3 * VEC_SIZE.  */
> > > -       cmpq    $(VEC_SIZE * 3), %r11
> > > -       jbe     L(zero)
> > > -# endif
> > > -       VPCMPEQ %ymm7, %ymm6, %ymm6
> > > -       vpmovmskb %ymm6, %esi
> > > -       tzcntl  %esi, %ecx
> > > +       cmpq    %rax, %rdx
> > > +       jbe     L(return_page_cross_end_check)
> > > +# endif
> > > +       movl    %eax, %OFFSET_REG
> > > +       addl    $(PAGE_SIZE - VEC_SIZE * 4), %eax
> > > +
> > > +       andl    %r10d, %ecx
> > > +       jz      L(loop_skip_page_cross_check)
> > > +
> > > +       .p2align 4,, 3
> > > +L(return_page_cross_end):
> > > +       tzcntl  %ecx, %ecx
> > > +
> > >  # ifdef USE_AS_STRNCMP
> > > -       addq    $(VEC_SIZE * 3), %rcx
> > > -       cmpq    %rcx, %r11
> > > -       jbe     L(zero)
> > > -#  ifdef USE_AS_WCSCMP
> > > -       movq    %rax, %rsi
> > > -       xorl    %eax, %eax
> > > -       movl    (%rsi, %rcx), %esi
> > > -       cmpl    (%rdx, %rcx), %esi
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (%rax, %rcx), %eax
> > > -       movzbl  (%rdx, %rcx), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > +       leal    -VEC_SIZE(%OFFSET_REG64, %rcx), %ecx
> > > +L(return_page_cross_cmp_mem):
> > >  # else
> > > -#  ifdef USE_AS_WCSCMP
> > > -       movq    %rax, %rsi
> > > +       addl    %OFFSET_REG, %ecx
> > > +# endif
> > > +# ifdef USE_AS_WCSCMP
> > > +       movl    VEC_OFFSET(%rdi, %rcx), %edx
> > >         xorl    %eax, %eax
> > > -       movl    (VEC_SIZE * 3)(%rsi, %rcx), %esi
> > > -       cmpl    (VEC_SIZE * 3)(%rdx, %rcx), %esi
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (VEC_SIZE * 3)(%rax, %rcx), %eax
> > > -       movzbl  (VEC_SIZE * 3)(%rdx, %rcx), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > +       cmpl    VEC_OFFSET(%rsi, %rcx), %edx
> > > +       je      L(ret8)
> > > +       setl    %al
> > > +       negl    %eax
> > > +       xorl    %r8d, %eax
> > > +# else
> > > +       movzbl  VEC_OFFSET(%rdi, %rcx), %eax
> > > +       movzbl  VEC_OFFSET(%rsi, %rcx), %ecx
> > > +       subl    %ecx, %eax
> > > +       xorl    %r8d, %eax
> > > +       subl    %r8d, %eax
> > >  # endif
> > > +L(ret8):
> > >         VZEROUPPER_RETURN
> > >
> > > -       .p2align 4
> > > -L(loop_cross_page):
> > > -       xorl    %r10d, %r10d
> > > -       movq    %rdx, %rcx
> > > -       /* Align load via RDX.  We load the extra ECX bytes which should
> > > -          be ignored.  */
> > > -       andl    $((VEC_SIZE * 4) - 1), %ecx
> > > -       /* R10 is -RCX.  */
> > > -       subq    %rcx, %r10
> > > -
> > > -       /* This works only if VEC_SIZE * 2 == 64. */
> > > -# if (VEC_SIZE * 2) != 64
> > > -#  error (VEC_SIZE * 2) != 64
> > > -# endif
> > > -
> > > -       /* Check if the first VEC_SIZE * 2 bytes should be ignored.  */
> > > -       cmpl    $(VEC_SIZE * 2), %ecx
> > > -       jge     L(loop_cross_page_2_vec)
> > > -
> > > -       vmovdqu (%rax, %r10), %ymm2
> > > -       vmovdqu VEC_SIZE(%rax, %r10), %ymm3
> > > -       VPCMPEQ (%rdx, %r10), %ymm2, %ymm0
> > > -       VPCMPEQ VEC_SIZE(%rdx, %r10), %ymm3, %ymm1
> > > -       VPMINU  %ymm2, %ymm0, %ymm0
> > > -       VPMINU  %ymm3, %ymm1, %ymm1
> > > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > > -       VPCMPEQ %ymm7, %ymm1, %ymm1
> > > -
> > > -       vpmovmskb %ymm0, %edi
> > > -       vpmovmskb %ymm1, %esi
> > > -
> > > -       salq    $32, %rsi
> > > -       xorq    %rsi, %rdi
> > > -
> > > -       /* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
> > > -       shrq    %cl, %rdi
> > > -
> > > -       testq   %rdi, %rdi
> > > -       je      L(loop_cross_page_2_vec)
> > > -       tzcntq  %rdi, %rcx
> > >  # ifdef USE_AS_STRNCMP
> > > -       cmpq    %rcx, %r11
> > > -       jbe     L(zero)
> > > -#  ifdef USE_AS_WCSCMP
> > > -       movq    %rax, %rsi
> > > +       .p2align 4,, 10
> > > +L(return_page_cross_end_check):
> > > +       tzcntl  %ecx, %ecx
> > > +       leal    -VEC_SIZE(%rax, %rcx), %ecx
> > > +       cmpl    %ecx, %edx
> > > +       ja      L(return_page_cross_cmp_mem)
> > >         xorl    %eax, %eax
> > > -       movl    (%rsi, %rcx), %edi
> > > -       cmpl    (%rdx, %rcx), %edi
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (%rax, %rcx), %eax
> > > -       movzbl  (%rdx, %rcx), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > -# else
> > > -#  ifdef USE_AS_WCSCMP
> > > -       movq    %rax, %rsi
> > > -       xorl    %eax, %eax
> > > -       movl    (%rsi, %rcx), %edi
> > > -       cmpl    (%rdx, %rcx), %edi
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (%rax, %rcx), %eax
> > > -       movzbl  (%rdx, %rcx), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > -# endif
> > >         VZEROUPPER_RETURN
> > > +# endif
> > >
> > > -       .p2align 4
> > > -L(loop_cross_page_2_vec):
> > > -       /* The first VEC_SIZE * 2 bytes match or are ignored.  */
> > > -       vmovdqu (VEC_SIZE * 2)(%rax, %r10), %ymm2
> > > -       vmovdqu (VEC_SIZE * 3)(%rax, %r10), %ymm3
> > > -       VPCMPEQ (VEC_SIZE * 2)(%rdx, %r10), %ymm2, %ymm5
> > > -       VPMINU  %ymm2, %ymm5, %ymm5
> > > -       VPCMPEQ (VEC_SIZE * 3)(%rdx, %r10), %ymm3, %ymm6
> > > -       VPCMPEQ %ymm7, %ymm5, %ymm5
> > > -       VPMINU  %ymm3, %ymm6, %ymm6
> > > -       VPCMPEQ %ymm7, %ymm6, %ymm6
> > > -
> > > -       vpmovmskb %ymm5, %edi
> > > -       vpmovmskb %ymm6, %esi
> > > -
> > > -       salq    $32, %rsi
> > > -       xorq    %rsi, %rdi
> > >
> > > -       xorl    %r8d, %r8d
> > > -       /* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
> > > -       subl    $(VEC_SIZE * 2), %ecx
> > > -       jle     1f
> > > -       /* Skip ECX bytes.  */
> > > -       shrq    %cl, %rdi
> > > -       /* R8 has number of bytes skipped.  */
> > > -       movl    %ecx, %r8d
> > > -1:
> > > -       /* Before jumping back to the loop, set ESI to the number of
> > > -          VEC_SIZE * 4 blocks before page crossing.  */
> > > -       movl    $(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
> > > -
> > > -       testq   %rdi, %rdi
> > > +       .p2align 4,, 10
> > > +L(more_2x_vec_till_page_cross):
> > > +       /* If there are more than 2x VEC until the page cross, we will
> > > +          complete a full loop iteration here.  */
> > > +
> > > +       VMOVU   VEC_SIZE(%rdi), %ymm0
> > > +       VPCMPEQ VEC_SIZE(%rsi), %ymm0, %ymm1
> > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +       incl    %ecx
> > > +       jnz     L(return_vec_1_end)
> > > +
> > >  # ifdef USE_AS_STRNCMP
> > > -       /* At this point, if %rdi value is 0, it already tested
> > > -          VEC_SIZE*4+%r10 byte starting from %rax. This label
> > > -          checks whether strncmp maximum offset reached or not.  */
> > > -       je      L(string_nbyte_offset_check)
> > > -# else
> > > -       je      L(back_to_loop)
> > > +       cmpq    $(VEC_SIZE * 2), %rdx
> > > +       jbe     L(ret_zero_in_loop_page_cross)
> > >  # endif
> > > -       tzcntq  %rdi, %rcx
> > > -       addq    %r10, %rcx
> > > -       /* Adjust for number of bytes skipped.  */
> > > -       addq    %r8, %rcx
> > > +
> > > +       subl    $-(VEC_SIZE * 4), %eax
> > > +
> > > +       /* Safe to include comparisons from lower bytes.  */
> > > +       VMOVU   -(VEC_SIZE * 2)(%rdi, %rax), %ymm0
> > > +       VPCMPEQ -(VEC_SIZE * 2)(%rsi, %rax), %ymm0, %ymm1
> > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +       incl    %ecx
> > > +       jnz     L(return_vec_page_cross_0)
> > > +
> > > +       VMOVU   -(VEC_SIZE * 1)(%rdi, %rax), %ymm0
> > > +       VPCMPEQ -(VEC_SIZE * 1)(%rsi, %rax), %ymm0, %ymm1
> > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +       incl    %ecx
> > > +       jnz     L(return_vec_page_cross_1)
> > > +
> > >  # ifdef USE_AS_STRNCMP
> > > -       addq    $(VEC_SIZE * 2), %rcx
> > > -       subq    %rcx, %r11
> > > -       jbe     L(zero)
> > > -#  ifdef USE_AS_WCSCMP
> > > -       movq    %rax, %rsi
> > > +       /* Must check length here as length might preclude reading the
> > > +          next page.  */
> > > +       cmpq    %rax, %rdx
> > > +       jbe     L(ret_zero_in_loop_page_cross)
> > > +# endif
> > > +
> > > +       /* Finish the loop.  */
> > > +       VMOVA   (VEC_SIZE * 2)(%rdi), %ymm4
> > > +       VMOVA   (VEC_SIZE * 3)(%rdi), %ymm6
> > > +
> > > +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
> > > +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
> > > +       vpand   %ymm4, %ymm5, %ymm5
> > > +       vpand   %ymm6, %ymm7, %ymm7
> > > +       VPMINU  %ymm5, %ymm7, %ymm7
> > > +       VPCMPEQ %ymm7, %ymmZERO, %ymm7
> > > +       vpmovmskb %ymm7, %LOOP_REG
> > > +       testl   %LOOP_REG, %LOOP_REG
> > > +       jnz     L(return_vec_2_3_end)
> > > +
> > > +       /* Including an unconditional jmp here is best for code size. If
> > > +          this case is hot it would be faster to duplicate the
> > > +          L(return_vec_2_3_end) code as the fall-through and jump back to
> > > +          the loop on a mismatched comparison.  */
> > > +       subq    $-(VEC_SIZE * 4), %rdi
> > > +       subq    $-(VEC_SIZE * 4), %rsi
> > > +       addl    $(PAGE_SIZE - VEC_SIZE * 8), %eax
> > > +# ifdef USE_AS_STRNCMP
> > > +       subq    $(VEC_SIZE * 4), %rdx
> > > +       ja      L(loop_skip_page_cross_check)
> > > +L(ret_zero_in_loop_page_cross):
> > >         xorl    %eax, %eax
> > > -       movl    (%rsi, %rcx), %edi
> > > -       cmpl    (%rdx, %rcx), %edi
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (%rax, %rcx), %eax
> > > -       movzbl  (%rdx, %rcx), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > +       VZEROUPPER_RETURN
> > >  # else
> > > -#  ifdef USE_AS_WCSCMP
> > > -       movq    %rax, %rsi
> > > -       xorl    %eax, %eax
> > > -       movl    (VEC_SIZE * 2)(%rsi, %rcx), %edi
> > > -       cmpl    (VEC_SIZE * 2)(%rdx, %rcx), %edi
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (VEC_SIZE * 2)(%rax, %rcx), %eax
> > > -       movzbl  (VEC_SIZE * 2)(%rdx, %rcx), %edx
> > > -       subl    %edx, %eax
> > > -#  endif
> > > +       jmp     L(loop_skip_page_cross_check)
> > >  # endif
> > > -       VZEROUPPER_RETURN
> > >
> > > +
> > > +       .p2align 4,, 10
> > > +L(return_vec_page_cross_0):
> > > +       addl    $-VEC_SIZE, %eax
> > > +L(return_vec_page_cross_1):
> > > +       tzcntl  %ecx, %ecx
> > >  # ifdef USE_AS_STRNCMP
> > > -L(string_nbyte_offset_check):
> > > -       leaq    (VEC_SIZE * 4)(%r10), %r10
> > > -       cmpq    %r10, %r11
> > > -       jbe     L(zero)
> > > -       jmp     L(back_to_loop)
> > > +       leal    -VEC_SIZE(%rax, %rcx), %ecx
> > > +       cmpq    %rcx, %rdx
> > > +       jbe     L(ret_zero_in_loop_page_cross)
> > > +# else
> > > +       addl    %eax, %ecx
> > >  # endif
> > >
> > > -       .p2align 4
> > > -L(cross_page_loop):
> > > -       /* Check one byte/dword at a time.  */
> > >  # ifdef USE_AS_WCSCMP
> > > -       cmpl    %ecx, %eax
> > > +       movl    VEC_OFFSET(%rdi, %rcx), %edx
> > > +       xorl    %eax, %eax
> > > +       cmpl    VEC_OFFSET(%rsi, %rcx), %edx
> > > +       je      L(ret9)
> > > +       setl    %al
> > > +       negl    %eax
> > > +       xorl    %r8d, %eax
> > >  # else
> > > +       movzbl  VEC_OFFSET(%rdi, %rcx), %eax
> > > +       movzbl  VEC_OFFSET(%rsi, %rcx), %ecx
> > >         subl    %ecx, %eax
> > > +       xorl    %r8d, %eax
> > > +       subl    %r8d, %eax
> > >  # endif
> > > -       jne     L(different)
> > > -       addl    $SIZE_OF_CHAR, %edx
> > > -       cmpl    $(VEC_SIZE * 4), %edx
> > > -       je      L(main_loop_header)
> > > -# ifdef USE_AS_STRNCMP
> > > -       cmpq    %r11, %rdx
> > > -       jae     L(zero)
> > > +L(ret9):
> > > +       VZEROUPPER_RETURN
> > > +
> > > +
> > > +       .p2align 4,, 10
> > > +L(page_cross):
> > > +# ifndef USE_AS_STRNCMP
> > > +       /* If both are VEC aligned we don't need any special logic here.
> > > +          Only valid for strcmp where the stop condition is guaranteed
> > > +          to be reachable by just reading memory.  */
> > > +       testl   $((VEC_SIZE - 1) << 20), %eax
> > > +       jz      L(no_page_cross)
> > >  # endif
> > > +
> > > +       movl    %edi, %eax
> > > +       movl    %esi, %ecx
> > > +       andl    $(PAGE_SIZE - 1), %eax
> > > +       andl    $(PAGE_SIZE - 1), %ecx
> > > +
> > > +       xorl    %OFFSET_REG, %OFFSET_REG
> > > +
> > > +       /* Check which is closer to page cross, s1 or s2.  */
> > > +       cmpl    %eax, %ecx
> > > +       jg      L(page_cross_s2)
> > > +
> > > +       /* The previous page cross check has false positives. Check for
> > > +          true positive as page cross logic is very expensive.  */
> > > +       subl    $(PAGE_SIZE - VEC_SIZE * 4), %eax
> > > +       jbe     L(no_page_cross)
> > > +
> > > +       /* Set r8 to not interfere with normal return value (rdi and rsi
> > > +          did not swap).  */
> > >  # ifdef USE_AS_WCSCMP
> > > -       movl    (%rdi, %rdx), %eax
> > > -       movl    (%rsi, %rdx), %ecx
> > > +       /* Any non-zero positive value that doesn't interfere with 0x1.
> > > +        */
> > > +       movl    $2, %r8d
> > >  # else
> > > -       movzbl  (%rdi, %rdx), %eax
> > > -       movzbl  (%rsi, %rdx), %ecx
> > > +       xorl    %r8d, %r8d
> > >  # endif
> > > -       /* Check null char.  */
> > > -       testl   %eax, %eax
> > > -       jne     L(cross_page_loop)
> > > -       /* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
> > > -          comparisons.  */
> > > -       subl    %ecx, %eax
> > > -# ifndef USE_AS_WCSCMP
> > > -L(different):
> > > +
> > > +       /* Check if less than 1x VEC till page cross.  */
> > > +       subl    $(VEC_SIZE * 3), %eax
> > > +       jg      L(less_1x_vec_till_page)
> > > +
> > > +       /* If more than 1x VEC till page cross, loop through safely
> > > +          loadable memory until within 1x VEC of page cross.  */
> > > +
> > > +       .p2align 4,, 10
> > > +L(page_cross_loop):
> > > +
> > > +       VMOVU   (%rdi, %OFFSET_REG64), %ymm0
> > > +       VPCMPEQ (%rsi, %OFFSET_REG64), %ymm0, %ymm1
> > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +       incl    %ecx
> > > +
> > > +       jnz     L(check_ret_vec_page_cross)
> > > +       addl    $VEC_SIZE, %OFFSET_REG
> > > +# ifdef USE_AS_STRNCMP
> > > +       cmpq    %OFFSET_REG64, %rdx
> > > +       jbe     L(ret_zero_page_cross)
> > >  # endif
> > > -       VZEROUPPER_RETURN
> > > +       addl    $VEC_SIZE, %eax
> > > +       jl      L(page_cross_loop)
> > > +
> > > +       subl    %eax, %OFFSET_REG
> > > +       /* OFFSET_REG has distance to page cross - VEC_SIZE. Guaranteed
> > > +          to not cross page so is safe to load. Since we have already
> > > +          loaded at least 1 VEC from rsi it is also guaranteed to be safe.
> > > +        */
> > > +
> > > +       VMOVU   (%rdi, %OFFSET_REG64), %ymm0
> > > +       VPCMPEQ (%rsi, %OFFSET_REG64), %ymm0, %ymm1
> > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +
> > > +# ifdef USE_AS_STRNCMP
> > > +       leal    VEC_SIZE(%OFFSET_REG64), %eax
> > > +       cmpq    %rax, %rdx
> > > +       jbe     L(check_ret_vec_page_cross2)
> > > +       addq    %rdi, %rdx
> > > +# endif
> > > +       incl    %ecx
> > > +       jz      L(prepare_loop_no_len)
> > >
> > > +       .p2align 4,, 4
> > > +L(ret_vec_page_cross):
> > > +# ifndef USE_AS_STRNCMP
> > > +L(check_ret_vec_page_cross):
> > > +# endif
> > > +       tzcntl  %ecx, %ecx
> > > +       addl    %OFFSET_REG, %ecx
> > > +L(ret_vec_page_cross_cont):
> > >  # ifdef USE_AS_WCSCMP
> > > -       .p2align 4
> > > -L(different):
> > > -       /* Use movl to avoid modifying EFLAGS.  */
> > > -       movl    $0, %eax
> > > +       movl    (%rdi, %rcx), %edx
> > > +       xorl    %eax, %eax
> > > +       cmpl    (%rsi, %rcx), %edx
> > > +       je      L(ret12)
> > >         setl    %al
> > >         negl    %eax
> > > -       orl     $1, %eax
> > > -       VZEROUPPER_RETURN
> > > +       xorl    %r8d, %eax
> > > +# else
> > > +       movzbl  (%rdi, %rcx), %eax
> > > +       movzbl  (%rsi, %rcx), %ecx
> > > +       subl    %ecx, %eax
> > > +       xorl    %r8d, %eax
> > > +       subl    %r8d, %eax
> > >  # endif
> > > +L(ret12):
> > > +       VZEROUPPER_RETURN
> > >
> > >  # ifdef USE_AS_STRNCMP
> > > -       .p2align 4
> > > -L(zero):
> > > +       .p2align 4,, 10
> > > +L(check_ret_vec_page_cross2):
> > > +       incl    %ecx
> > > +L(check_ret_vec_page_cross):
> > > +       tzcntl  %ecx, %ecx
> > > +       addl    %OFFSET_REG, %ecx
> > > +       cmpq    %rcx, %rdx
> > > +       ja      L(ret_vec_page_cross_cont)
> > > +       .p2align 4,, 2
> > > +L(ret_zero_page_cross):
> > >         xorl    %eax, %eax
> > >         VZEROUPPER_RETURN
> > > +# endif
> > >
> > > -       .p2align 4
> > > -L(char0):
> > > -#  ifdef USE_AS_WCSCMP
> > > -       xorl    %eax, %eax
> > > -       movl    (%rdi), %ecx
> > > -       cmpl    (%rsi), %ecx
> > > -       jne     L(wcscmp_return)
> > > -#  else
> > > -       movzbl  (%rsi), %ecx
> > > -       movzbl  (%rdi), %eax
> > > -       subl    %ecx, %eax
> > > -#  endif
> > > -       VZEROUPPER_RETURN
> > > +       .p2align 4,, 4
> > > +L(page_cross_s2):
> > > +       /* Ensure this is a true page cross.  */
> > > +       subl    $(PAGE_SIZE - VEC_SIZE * 4), %ecx
> > > +       jbe     L(no_page_cross)
> > > +
> > > +
> > > +       movl    %ecx, %eax
> > > +       movq    %rdi, %rcx
> > > +       movq    %rsi, %rdi
> > > +       movq    %rcx, %rsi
> > > +
> > > +       /* Set r8 to negate the return value as rdi and rsi swapped.  */
> > > +# ifdef USE_AS_WCSCMP
> > > +       movl    $-4, %r8d
> > > +# else
> > > +       movl    $-1, %r8d
> > >  # endif
> > > +       xorl    %OFFSET_REG, %OFFSET_REG
> > >
> > > -       .p2align 4
> > > -L(last_vector):
> > > -       addq    %rdx, %rdi
> > > -       addq    %rdx, %rsi
> > > +       /* Check if more than 1x VEC till page cross.  */
> > > +       subl    $(VEC_SIZE * 3), %eax
> > > +       jle     L(page_cross_loop)
> > > +
> > > +       .p2align 4,, 6
> > > +L(less_1x_vec_till_page):
> > > +       /* Find largest load size we can use.  */
> > > +       cmpl    $16, %eax
> > > +       ja      L(less_16_till_page)
> > > +
> > > +       VMOVU   (%rdi), %xmm0
> > > +       VPCMPEQ (%rsi), %xmm0, %xmm1
> > > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > > +       vpandn  %xmm1, %xmm2, %xmm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +       incw    %cx
> > > +       jnz     L(check_ret_vec_page_cross)
> > > +       movl    $16, %OFFSET_REG
> > >  # ifdef USE_AS_STRNCMP
> > > -       subq    %rdx, %r11
> > > +       cmpq    %OFFSET_REG64, %rdx
> > > +       jbe     L(ret_zero_page_cross_slow_case0)
> > > +       subl    %eax, %OFFSET_REG
> > > +# else
> > > +       /* Explicit check for 16 byte alignment.  */
> > > +       subl    %eax, %OFFSET_REG
> > > +       jz      L(prepare_loop)
> > >  # endif
> > > -       tzcntl  %ecx, %edx
> > > +
> > > +       VMOVU   (%rdi, %OFFSET_REG64), %xmm0
> > > +       VPCMPEQ (%rsi, %OFFSET_REG64), %xmm0, %xmm1
> > > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > > +       vpandn  %xmm1, %xmm2, %xmm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +       incw    %cx
> > > +       jnz     L(check_ret_vec_page_cross)
> > > +
> > >  # ifdef USE_AS_STRNCMP
> > > -       cmpq    %r11, %rdx
> > > -       jae     L(zero)
> > > +       addl    $16, %OFFSET_REG
> > > +       subq    %OFFSET_REG64, %rdx
> > > +       jbe     L(ret_zero_page_cross_slow_case0)
> > > +       subq    $-(VEC_SIZE * 4), %rdx
> > > +
> > > +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > > +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > > +# else
> > > +       leaq    (16 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > > +       leaq    (16 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > >  # endif
> > > -# ifdef USE_AS_WCSCMP
> > > +       jmp     L(prepare_loop_aligned)
> > > +
> > > +# ifdef USE_AS_STRNCMP
> > > +       .p2align 4,, 2
> > > +L(ret_zero_page_cross_slow_case0):
> > >         xorl    %eax, %eax
> > > -       movl    (%rdi, %rdx), %ecx
> > > -       cmpl    (%rsi, %rdx), %ecx
> > > -       jne     L(wcscmp_return)
> > > -# else
> > > -       movzbl  (%rdi, %rdx), %eax
> > > -       movzbl  (%rsi, %rdx), %edx
> > > -       subl    %edx, %eax
> > > +       ret
> > >  # endif
> > > -       VZEROUPPER_RETURN
> > >
> > > -       /* Comparing on page boundary region requires special treatment:
> > > -          It must done one vector at the time, starting with the wider
> > > -          ymm vector if possible, if not, with xmm. If fetching 16 bytes
> > > -          (xmm) still passes the boundary, byte comparison must be done.
> > > -        */
> > > -       .p2align 4
> > > -L(cross_page):
> > > -       /* Try one ymm vector at a time.  */
> > > -       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> > > -       jg      L(cross_page_1_vector)
> > > -L(loop_1_vector):
> > > -       vmovdqu (%rdi, %rdx), %ymm1
> > > -       VPCMPEQ (%rsi, %rdx), %ymm1, %ymm0
> > > -       VPMINU  %ymm1, %ymm0, %ymm0
> > > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > > -       vpmovmskb %ymm0, %ecx
> > > -       testl   %ecx, %ecx
> > > -       jne     L(last_vector)
> > >
> > > -       addl    $VEC_SIZE, %edx
> > > +       .p2align 4,, 10
> > > +L(less_16_till_page):
> > > +       /* Find largest load size we can use.  */
> > > +       cmpl    $24, %eax
> > > +       ja      L(less_8_till_page)
> > >
> > > -       addl    $VEC_SIZE, %eax
> > > -# ifdef USE_AS_STRNCMP
> > > -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> > > -          (%r11).  */
> > > -       cmpq    %r11, %rdx
> > > -       jae     L(zero)
> > > -# endif
> > > -       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> > > -       jle     L(loop_1_vector)
> > > -L(cross_page_1_vector):
> > > -       /* Less than 32 bytes to check, try one xmm vector.  */
> > > -       cmpl    $(PAGE_SIZE - 16), %eax
> > > -       jg      L(cross_page_1_xmm)
> > > -       vmovdqu (%rdi, %rdx), %xmm1
> > > -       VPCMPEQ (%rsi, %rdx), %xmm1, %xmm0
> > > -       VPMINU  %xmm1, %xmm0, %xmm0
> > > -       VPCMPEQ %xmm7, %xmm0, %xmm0
> > > -       vpmovmskb %xmm0, %ecx
> > > -       testl   %ecx, %ecx
> > > -       jne     L(last_vector)
> > > +       vmovq   (%rdi), %xmm0
> > > +       vmovq   (%rsi), %xmm1
> > > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > > +       VPCMPEQ %xmm1, %xmm0, %xmm1
> > > +       vpandn  %xmm1, %xmm2, %xmm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +       incb    %cl
> > > +       jnz     L(check_ret_vec_page_cross)
> > >
> > > -       addl    $16, %edx
> > > -# ifndef USE_AS_WCSCMP
> > > -       addl    $16, %eax
> > > +
> > > +# ifdef USE_AS_STRNCMP
> > > +       cmpq    $8, %rdx
> > > +       jbe     L(ret_zero_page_cross_slow_case0)
> > >  # endif
> > > +       movl    $24, %OFFSET_REG
> > > +       /* Explicit check for 16 byte alignment.  */
> > > +       subl    %eax, %OFFSET_REG
> > > +
> > > +
> > > +
> > > +       vmovq   (%rdi, %OFFSET_REG64), %xmm0
> > > +       vmovq   (%rsi, %OFFSET_REG64), %xmm1
> > > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > > +       VPCMPEQ %xmm1, %xmm0, %xmm1
> > > +       vpandn  %xmm1, %xmm2, %xmm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +       incb    %cl
> > > +       jnz     L(check_ret_vec_page_cross)
> > > +
> > >  # ifdef USE_AS_STRNCMP
> > > -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> > > -          (%r11).  */
> > > -       cmpq    %r11, %rdx
> > > -       jae     L(zero)
> > > -# endif
> > > -
> > > -L(cross_page_1_xmm):
> > > -# ifndef USE_AS_WCSCMP
> > > -       /* Less than 16 bytes to check, try 8 byte vector.  NB: No need
> > > -          for wcscmp nor wcsncmp since wide char is 4 bytes.   */
> > > -       cmpl    $(PAGE_SIZE - 8), %eax
> > > -       jg      L(cross_page_8bytes)
> > > -       vmovq   (%rdi, %rdx), %xmm1
> > > -       vmovq   (%rsi, %rdx), %xmm0
> > > -       VPCMPEQ %xmm0, %xmm1, %xmm0
> > > -       VPMINU  %xmm1, %xmm0, %xmm0
> > > -       VPCMPEQ %xmm7, %xmm0, %xmm0
> > > -       vpmovmskb %xmm0, %ecx
> > > -       /* Only last 8 bits are valid.  */
> > > -       andl    $0xff, %ecx
> > > -       testl   %ecx, %ecx
> > > -       jne     L(last_vector)
> > > +       addl    $8, %OFFSET_REG
> > > +       subq    %OFFSET_REG64, %rdx
> > > +       jbe     L(ret_zero_page_cross_slow_case0)
> > > +       subq    $-(VEC_SIZE * 4), %rdx
> > >
> > > -       addl    $8, %edx
> > > -       addl    $8, %eax
> > > +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > > +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > > +# else
> > > +       leaq    (8 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > > +       leaq    (8 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > > +# endif
> > > +       jmp     L(prepare_loop_aligned)
> > > +
> > > +
> > > +       .p2align 4,, 10
> > > +L(less_8_till_page):
> > > +# ifdef USE_AS_WCSCMP
> > > +       /* If using wchar then this is the only check before we reach
> > > +          the page boundary.  */
> > > +       movl    (%rdi), %eax
> > > +       movl    (%rsi), %ecx
> > > +       cmpl    %ecx, %eax
> > > +       jnz     L(ret_less_8_wcs)
> > >  #  ifdef USE_AS_STRNCMP
> > > -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> > > -          (%r11).  */
> > > -       cmpq    %r11, %rdx
> > > -       jae     L(zero)
> > > +       addq    %rdi, %rdx
> > > +       /* We already checked for len <= 1 so cannot hit that case here.
> > > +        */
> > >  #  endif
> > > +       testl   %eax, %eax
> > > +       jnz     L(prepare_loop_no_len)
> > > +       ret
> > >
> > > -L(cross_page_8bytes):
> > > -       /* Less than 8 bytes to check, try 4 byte vector.  */
> > > -       cmpl    $(PAGE_SIZE - 4), %eax
> > > -       jg      L(cross_page_4bytes)
> > > -       vmovd   (%rdi, %rdx), %xmm1
> > > -       vmovd   (%rsi, %rdx), %xmm0
> > > -       VPCMPEQ %xmm0, %xmm1, %xmm0
> > > -       VPMINU  %xmm1, %xmm0, %xmm0
> > > -       VPCMPEQ %xmm7, %xmm0, %xmm0
> > > -       vpmovmskb %xmm0, %ecx
> > > -       /* Only last 4 bits are valid.  */
> > > -       andl    $0xf, %ecx
> > > -       testl   %ecx, %ecx
> > > -       jne     L(last_vector)
> > > +       .p2align 4,, 8
> > > +L(ret_less_8_wcs):
> > > +       setl    %OFFSET_REG8
> > > +       negl    %OFFSET_REG
> > > +       movl    %OFFSET_REG, %eax
> > > +       xorl    %r8d, %eax
> > > +       ret
> > > +
> > > +# else
> > > +
> > > +       /* Find largest load size we can use.  */
> > > +       cmpl    $28, %eax
> > > +       ja      L(less_4_till_page)
> > > +
> > > +       vmovd   (%rdi), %xmm0
> > > +       vmovd   (%rsi), %xmm1
> > > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > > +       VPCMPEQ %xmm1, %xmm0, %xmm1
> > > +       vpandn  %xmm1, %xmm2, %xmm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +       subl    $0xf, %ecx
> > > +       jnz     L(check_ret_vec_page_cross)
> > >
> > > -       addl    $4, %edx
> > >  #  ifdef USE_AS_STRNCMP
> > > -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> > > -          (%r11).  */
> > > -       cmpq    %r11, %rdx
> > > -       jae     L(zero)
> > > +       cmpq    $4, %rdx
> > > +       jbe     L(ret_zero_page_cross_slow_case1)
> > >  #  endif
> > > +       movl    $28, %OFFSET_REG
> > > +       /* Explicit check for 16 byte alignment.  */
> > > +       subl    %eax, %OFFSET_REG
> > >
> > > -L(cross_page_4bytes):
> > > -# endif
> > > -       /* Less than 4 bytes to check, try one byte/dword at a time.  */
> > > -# ifdef USE_AS_STRNCMP
> > > -       cmpq    %r11, %rdx
> > > -       jae     L(zero)
> > > -# endif
> > > -# ifdef USE_AS_WCSCMP
> > > -       movl    (%rdi, %rdx), %eax
> > > -       movl    (%rsi, %rdx), %ecx
> > > -# else
> > > -       movzbl  (%rdi, %rdx), %eax
> > > -       movzbl  (%rsi, %rdx), %ecx
> > > -# endif
> > > -       testl   %eax, %eax
> > > -       jne     L(cross_page_loop)
> > > +
> > > +
> > > +       vmovd   (%rdi, %OFFSET_REG64), %xmm0
> > > +       vmovd   (%rsi, %OFFSET_REG64), %xmm1
> > > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > > +       VPCMPEQ %xmm1, %xmm0, %xmm1
> > > +       vpandn  %xmm1, %xmm2, %xmm1
> > > +       vpmovmskb %ymm1, %ecx
> > > +       subl    $0xf, %ecx
> > > +       jnz     L(check_ret_vec_page_cross)
> > > +
> > > +#  ifdef USE_AS_STRNCMP
> > > +       addl    $4, %OFFSET_REG
> > > +       subq    %OFFSET_REG64, %rdx
> > > +       jbe     L(ret_zero_page_cross_slow_case1)
> > > +       subq    $-(VEC_SIZE * 4), %rdx
> > > +
> > > +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > > +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > > +#  else
> > > +       leaq    (4 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > > +       leaq    (4 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > > +#  endif
> > > +       jmp     L(prepare_loop_aligned)
> > > +
> > > +#  ifdef USE_AS_STRNCMP
> > > +       .p2align 4,, 2
> > > +L(ret_zero_page_cross_slow_case1):
> > > +       xorl    %eax, %eax
> > > +       ret
> > > +#  endif
> > > +
> > > +       .p2align 4,, 10
> > > +L(less_4_till_page):
> > > +       subq    %rdi, %rsi
> > > +       /* Extremely slow byte comparison loop.  */
> > > +L(less_4_loop):
> > > +       movzbl  (%rdi), %eax
> > > +       movzbl  (%rsi, %rdi), %ecx
> > >         subl    %ecx, %eax
> > > -       VZEROUPPER_RETURN
> > > -END (STRCMP)
> > > +       jnz     L(ret_less_4_loop)
> > > +       testl   %ecx, %ecx
> > > +       jz      L(ret_zero_4_loop)
> > > +#  ifdef USE_AS_STRNCMP
> > > +       decq    %rdx
> > > +       jz      L(ret_zero_4_loop)
> > > +#  endif
> > > +       incq    %rdi
> > > +       /* End condition is reaching the page boundary (rdi is aligned).  */
> > > +       testl   $31, %edi
> > > +       jnz     L(less_4_loop)
> > > +       leaq    -(VEC_SIZE * 4)(%rdi, %rsi), %rsi
> > > +       addq    $-(VEC_SIZE * 4), %rdi
> > > +#  ifdef USE_AS_STRNCMP
> > > +       subq    $-(VEC_SIZE * 4), %rdx
> > > +#  endif
> > > +       jmp     L(prepare_loop_aligned)
> > > +
> > > +L(ret_zero_4_loop):
> > > +       xorl    %eax, %eax
> > > +       ret
> > > +L(ret_less_4_loop):
> > > +       xorl    %r8d, %eax
> > > +       subl    %r8d, %eax
> > > +       ret
> > > +# endif
> > > +END(STRCMP)
> > >  #endif
> > > --
> > > 2.25.1
> > >
> >
> > LGTM.
>
> Should I wait until the 2.36 release to push the optimized versions?

Yes, please.

> There are a lot of edge cases with these functions and last time we
> tried to improve them in:
>
> commit c46e9afb2df5fc9e39ff4d13777e4b4c26e04e55
> Author: H.J. Lu <hjl.tools@gmail.com>
> Date:   Fri Oct 29 12:40:20 2021 -0700
>
>     x86-64: Improve EVEX strcmp with masked load
>
>
> We missed a case:
> https://bugzilla.redhat.com/show_bug.cgi?id=2026399#c19

Is it a correctness bug?

> >
> > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
> >
> > Thanks.
> >
> > --
> > H.J.

Thanks.

--
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread
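The rewritten compare blocks in the patch above find the first mismatch or
null with the same VPCMPEQ/vpandn/vpmovmskb/incl sequence throughout. A
minimal C sketch of that mask-and-increment idea follows; the AVX2
intrinsics are real, but the helper name and the assumption that 32 bytes
of both buffers are readable are illustrative only, not glibc code:

#include <immintrin.h>
#include <stdint.h>

/* Sketch only: return the byte index of the first mismatch or null within
   one 32-byte block, or 32 if all 32 bytes are equal and non-null.
   Assumes 32 bytes of both buffers are readable.  */
static inline unsigned int
first_diff_or_null (const char *s1, const char *s2)
{
  __m256i v1 = _mm256_loadu_si256 ((const __m256i *) s1);
  __m256i v2 = _mm256_loadu_si256 ((const __m256i *) s2);
  __m256i eq = _mm256_cmpeq_epi8 (v1, v2);                   /* VPCMPEQ  */
  __m256i nul = _mm256_cmpeq_epi8 (v1, _mm256_setzero_si256 ());
  __m256i ok = _mm256_andnot_si256 (nul, eq);                /* vpandn   */
  uint32_t mask = _mm256_movemask_epi8 (ok);                 /* vpmovmskb */
  /* mask + 1 overflows to 0 only if every byte matched and none was null
     (the incl/jz pairs in the patch).  Otherwise the carry stops at the
     first 0 bit, so tzcnt gives the index of the first mismatch/null.  */
  if (mask + 1 == 0)
    return 32;
  return __builtin_ctz (mask + 1);
}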

* Re: [PATCH v2 4/7] string: Improve coverage in test-strcmp.c and test-strncmp.c
  2022-01-10  0:38     ` H.J. Lu
@ 2022-01-10  2:51       ` Noah Goldstein
  0 siblings, 0 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10  2:51 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library

On Sun, Jan 9, 2022 at 6:39 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Sun, Jan 9, 2022 at 4:30 PM Noah Goldstein via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> > Add additional test cases for small / medium sizes.
> >
> > Add tests in test-strncmp.c where `n` is near ULONG_MAX or LONG_MIN to
> > test for overflow bugs in length handling.
>
> How long do the new tests run?

~5sec for strcmp
~15sec for strncmp
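
The overflow cases added here boil down to values of `n` whose high bits
are set, so that converting the wide-character count to a byte count
wraps. A rough C sketch of the failure mode being guarded against; the
function name is illustrative and not part of the patch:

#include <stddef.h>
#include <wchar.h>

/* Sketch only: with the high bit of n set, a buggy wcsncmp that computes
   the byte bound as n * sizeof (wchar_t) wraps that bound to 0 on 64-bit
   targets and may report equality without comparing anything.  Since no
   such n can bound real memory, the expected result is the same as
   wcscmp, i.e. negative here.  */
int
overflow_example (void)
{
  wchar_t a[] = L"abc";
  wchar_t b[] = L"abd";
  size_t n = (size_t) 1 << 63;   /* comparable to LONG_MIN in of_masks */
  return wcsncmp (a, b, n);
}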
>
> > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > ---
> >  string/test-strcmp.c  |  70 ++++++++++--
> >  string/test-strncmp.c | 248 +++++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 298 insertions(+), 20 deletions(-)
> >
> > diff --git a/string/test-strcmp.c b/string/test-strcmp.c
> > index 97d7bf5043..eacbdc8857 100644
> > --- a/string/test-strcmp.c
> > +++ b/string/test-strcmp.c
> > @@ -16,6 +16,9 @@
> >     License along with the GNU C Library; if not, see
> >     <https://www.gnu.org/licenses/>.  */
> >
> > +#define TEST_LEN (4096 * 3)
> > +#define MIN_PAGE_SIZE (TEST_LEN + 2 * getpagesize ())
> > +
> >  #define TEST_MAIN
> >  #ifdef WIDE
> >  # define TEST_NAME "wcscmp"
> > @@ -129,7 +132,7 @@ do_one_test (impl_t *impl,
> >
> >  static void
> >  do_test (size_t align1, size_t align2, size_t len, int max_char,
> > -        int exp_result)
> > +         int exp_result)
> >  {
> >    size_t i;
> >
> > @@ -138,19 +141,22 @@ do_test (size_t align1, size_t align2, size_t len, int max_char,
> >    if (len == 0)
> >      return;
> >
> > -  align1 &= 63;
> > +  align1 &= ~(CHARBYTES - 1);
> > +  align2 &= ~(CHARBYTES - 1);
> > +
> > +  align1 &= getpagesize () - 1;
> >    if (align1 + (len + 1) * CHARBYTES >= page_size)
> >      return;
> >
> > -  align2 &= 63;
> > +  align2 &= getpagesize () - 1;
> >    if (align2 + (len + 1) * CHARBYTES >= page_size)
> >      return;
> >
> >    /* Put them close to the end of page.  */
> >    i = align1 + CHARBYTES * (len + 2);
> > -  s1 = (CHAR *) (buf1 + ((page_size - i) / 16 * 16) + align1);
> > +  s1 = (CHAR *)(buf1 + ((page_size - i) / 16 * 16) + align1);
> >    i = align2 + CHARBYTES * (len + 2);
> > -  s2 = (CHAR *) (buf2 + ((page_size - i) / 16 * 16)  + align2);
> > +  s2 = (CHAR *)(buf2 + ((page_size - i) / 16 * 16) + align2);
> >
> >    for (i = 0; i < len; i++)
> >      s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
> > @@ -161,9 +167,10 @@ do_test (size_t align1, size_t align2, size_t len, int max_char,
> >    s2[len - 1] -= exp_result;
> >
> >    FOR_EACH_IMPL (impl, 0)
> > -    do_one_test (impl, s1, s2, exp_result);
> > +  do_one_test (impl, s1, s2, exp_result);
> >  }
> >
> > +
> >  static void
> >  do_random_tests (void)
> >  {
> > @@ -385,7 +392,7 @@ check3 (void)
> >  int
> >  test_main (void)
> >  {
> > -  size_t i;
> > +  size_t i, j;
> >
> >    test_init ();
> >    check();
> > @@ -426,6 +433,55 @@ test_main (void)
> >        do_test (2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, -1);
> >      }
> >
> > +  for (j = 0; j < 160; ++j)
> > +    {
> > +      for (i = 0; i < TEST_LEN;)
> > +        {
> > +          do_test (getpagesize () - j - 1, 0, i, 127, 0);
> > +          do_test (getpagesize () - j - 1, 0, i, 127, 1);
> > +          do_test (getpagesize () - j - 1, 0, i, 127, -1);
> > +
> > +          do_test (getpagesize () - j - 1, j, i, 127, 0);
> > +          do_test (getpagesize () - j - 1, j, i, 127, 1);
> > +          do_test (getpagesize () - j - 1, j, i, 127, -1);
> > +
> > +          do_test (0, getpagesize () - j - 1, i, 127, 0);
> > +          do_test (0, getpagesize () - j - 1, i, 127, 1);
> > +          do_test (0, getpagesize () - j - 1, i, 127, -1);
> > +
> > +          do_test (j, getpagesize () - j - 1, i, 127, 0);
> > +          do_test (j, getpagesize () - j - 1, i, 127, 1);
> > +          do_test (j, getpagesize () - j - 1, i, 127, -1);
> > +
> > +          if (i < 32)
> > +            {
> > +              i += 1;
> > +            }
> > +          else if (i < 161)
> > +            {
> > +              i += 7;
> > +            }
> > +          else if (i + 161 < TEST_LEN)
> > +            {
> > +              i += 31;
> > +              i *= 17;
> > +              i /= 16;
> > +              if (i + 161 > TEST_LEN)
> > +                {
> > +                  i = TEST_LEN - 160;
> > +                }
> > +            }
> > +          else if (i + 32 < TEST_LEN)
> > +            {
> > +              i += 7;
> > +            }
> > +          else
> > +            {
> > +              i += 1;
> > +            }
> > +        }
> > +    }
> > +
> >    do_random_tests ();
> >    return ret;
> >  }
> > diff --git a/string/test-strncmp.c b/string/test-strncmp.c
> > index 61a283a0af..4fa6106eb4 100644
> > --- a/string/test-strncmp.c
> > +++ b/string/test-strncmp.c
> > @@ -16,6 +16,9 @@
> >     License along with the GNU C Library; if not, see
> >     <https://www.gnu.org/licenses/>.  */
> >
> > +#define TEST_LEN (4096 * 3)
> > +#define MIN_PAGE_SIZE (TEST_LEN + 2 * getpagesize ())
> > +
> >  #define TEST_MAIN
> >  #ifdef WIDE
> >  # define TEST_NAME "wcsncmp"
> > @@ -166,10 +169,10 @@ do_test_limit (size_t align1, size_t align2, size_t len, size_t n, int max_char,
> >  }
> >
> >  static void
> > -do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
> > -        int exp_result)
> > +do_test_n (size_t align1, size_t align2, size_t len, size_t n, int n_in_bounds,
> > +           int max_char, int exp_result)
> >  {
> > -  size_t i;
> > +  size_t i, buf_bound;
> >    CHAR *s1, *s2;
> >
> >    align1 &= ~(CHARBYTES - 1);
> > @@ -178,22 +181,28 @@ do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
> >    if (n == 0)
> >      return;
> >
> > -  align1 &= 63;
> > -  if (align1 + (n + 1) * CHARBYTES >= page_size)
> > +  buf_bound = n_in_bounds ? n : len;
> > +
> > +  align1 &= getpagesize () - 1;
> > +  if (align1 + (buf_bound + 1) * CHARBYTES >= page_size)
> >      return;
> >
> > -  align2 &= 63;
> > -  if (align2 + (n + 1) * CHARBYTES >= page_size)
> > +  align2 &= getpagesize () - 1;
> > +  if (align2 + (buf_bound + 1) * CHARBYTES >= page_size)
> >      return;
> >
> > -  s1 = (CHAR *) (buf1 + align1);
> > -  s2 = (CHAR *) (buf2 + align2);
> > +  s1 = (CHAR *)(buf1 + align1);
> > +  s2 = (CHAR *)(buf2 + align2);
> >
> > -  for (i = 0; i < n; i++)
> > +  if (n_in_bounds)
> > +    {
> > +      s1[n] = 24 + exp_result;
> > +      s2[n] = 23;
> > +    }
> > +
> > +  for (i = 0; i < buf_bound; i++)
> >      s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
> >
> > -  s1[n] = 24 + exp_result;
> > -  s2[n] = 23;
> >    s1[len] = 0;
> >    s2[len] = 0;
> >    if (exp_result < 0)
> > @@ -207,6 +216,13 @@ do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
> >      do_one_test (impl, s1, s2, n, exp_result);
> >  }
> >
> > +static void
> > +do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
> > +         int exp_result)
> > +{
> > +  do_test_n (align1, align2, len, n, 1, max_char, exp_result);
> > +}
> > +
> >  static void
> >  do_page_test (size_t offset1, size_t offset2, CHAR *s2)
> >  {
> > @@ -400,10 +416,123 @@ check3 (void)
> >         }
> >  }
> >
> > +static void
> > +check_overflow (void)
> > +{
> > +  size_t i, j, of_mask, of_idx;
> > +  const size_t of_masks[]
> > +      = { ULONG_MAX, LONG_MIN, ULONG_MAX - (ULONG_MAX >> 2),
> > +          ((size_t)LONG_MAX) >> 1 };
> > +
> > +  for (of_idx = 0; of_idx < sizeof (of_masks) / sizeof (of_masks[0]); ++of_idx)
> > +    {
> > +      of_mask = of_masks[of_idx];
> > +      for (j = 0; j < 160; ++j)
> > +        {
> > +          for (i = 1; i <= 161; i += (32 / sizeof (CHAR)))
> > +            {
> > +              do_test_n (j, 0, i, of_mask, 0, 127, 0);
> > +              do_test_n (j, 0, i, of_mask, 0, 127, 1);
> > +              do_test_n (j, 0, i, of_mask, 0, 127, -1);
> > +
> > +              do_test_n (j, 0, i, of_mask - j / 2, 0, 127, 0);
> > +              do_test_n (j, 0, i, of_mask - j * 2, 0, 127, 1);
> > +              do_test_n (j, 0, i, of_mask - j, 0, 127, -1);
> > +
> > +              do_test_n (j / 2, j, i, of_mask, 0, 127, 0);
> > +              do_test_n (j / 2, j, i, of_mask, 0, 127, 1);
> > +              do_test_n (j / 2, j, i, of_mask, 0, 127, -1);
> > +
> > +              do_test_n (j / 2, j, i, of_mask - j, 0, 127, 0);
> > +              do_test_n (j / 2, j, i, of_mask - j / 2, 0, 127, 1);
> > +              do_test_n (j / 2, j, i, of_mask - j * 2, 0, 127, -1);
> > +
> > +              do_test_n (0, j, i, of_mask - j * 2, 0, 127, 0);
> > +              do_test_n (0, j, i, of_mask - j, 0, 127, 1);
> > +              do_test_n (0, j, i, of_mask - j / 2, 0, 127, -1);
> > +
> > +              do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, 0);
> > +              do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, 1);
> > +              do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, -1);
> > +
> > +              do_test_n (getpagesize () - j - 1, 0, i, of_mask - j / 2, 0, 127,
> > +                         0);
> > +              do_test_n (getpagesize () - j - 1, 0, i, of_mask - j * 2, 0, 127,
> > +                         1);
> > +              do_test_n (getpagesize () - j - 1, 0, i, of_mask - j, 0, 127,
> > +                         -1);
> > +
> > +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
> > +                         of_mask, 0, 127, 0);
> > +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
> > +                         of_mask, 0, 127, 1);
> > +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
> > +                         of_mask, 0, 127, -1);
> > +
> > +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
> > +                         of_mask - j, 0, 127, 0);
> > +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
> > +                         of_mask - j / 2, 0, 127, 1);
> > +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
> > +                         of_mask - j * 2, 0, 127, -1);
> > +            }
> > +
> > +          for (i = 1; i < TEST_LEN; i += i)
> > +            {
> > +              do_test_n (j, 0, i - 1, of_mask, 0, 127, 0);
> > +              do_test_n (j, 0, i - 1, of_mask, 0, 127, 1);
> > +              do_test_n (j, 0, i - 1, of_mask, 0, 127, -1);
> > +
> > +              do_test_n (j, 0, i - 1, of_mask - j / 2, 0, 127, 0);
> > +              do_test_n (j, 0, i - 1, of_mask - j * 2, 0, 127, 1);
> > +              do_test_n (j, 0, i - 1, of_mask - j, 0, 127, -1);
> > +
> > +              do_test_n (j / 2, j, i - 1, of_mask, 0, 127, 0);
> > +              do_test_n (j / 2, j, i - 1, of_mask, 0, 127, 1);
> > +              do_test_n (j / 2, j, i - 1, of_mask, 0, 127, -1);
> > +
> > +              do_test_n (j / 2, j, i - 1, of_mask - j, 0, 127, 0);
> > +              do_test_n (j / 2, j, i - 1, of_mask - j / 2, 0, 127, 1);
> > +              do_test_n (j / 2, j, i - 1, of_mask - j * 2, 0, 127, -1);
> > +
> > +              do_test_n (0, j, i - 1, of_mask - j * 2, 0, 127, 0);
> > +              do_test_n (0, j, i - 1, of_mask - j, 0, 127, 1);
> > +              do_test_n (0, j, i - 1, of_mask - j / 2, 0, 127, -1);
> > +
> > +              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127, 0);
> > +              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127, 1);
> > +              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127,
> > +                         -1);
> > +
> > +              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j / 2, 0,
> > +                         127, 0);
> > +              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j * 2, 0,
> > +                         127, 1);
> > +              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j, 0, 127,
> > +                         -1);
> > +
> > +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
> > +                         i - 1, of_mask, 0, 127, 0);
> > +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
> > +                         i - 1, of_mask, 0, 127, 1);
> > +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
> > +                         i - 1, of_mask, 0, 127, -1);
> > +
> > +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
> > +                         i - 1, of_mask - j, 0, 127, 0);
> > +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
> > +                         i - 1, of_mask - j / 2, 0, 127, 1);
> > +              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
> > +                         i - 1, of_mask - j * 2, 0, 127, -1);
> > +            }
> > +        }
> > +    }
> > +}
> > +
> >  int
> >  test_main (void)
> >  {
> > -  size_t i;
> > +  size_t i, j;
> >
> >    test_init ();
> >
> > @@ -470,6 +599,99 @@ test_main (void)
> >        do_test_limit (0, 0, 15 - i, 16 - i, 255, -1);
> >      }
> >
> > +  for (j = 0; j < 160; ++j)
> > +    {
> > +      for (i = 0; i < TEST_LEN;)
> > +        {
> > +          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, 0);
> > +          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, 1);
> > +          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, -1);
> > +
> > +          do_test_n (getpagesize () - j - 1, 0, i, i, 0, 127, 0);
> > +          do_test_n (getpagesize () - j - 1, 0, i, i - 1, 0, 127, 0);
> > +
> > +          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, 0);
> > +          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, 1);
> > +          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, -1);
> > +
> > +          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, 0);
> > +          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, 1);
> > +          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, -1);
> > +
> > +          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, 0);
> > +          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, 1);
> > +          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, -1);
> > +
> > +          do_test_n (getpagesize () - j - 1, j, i, i, 0, 127, 0);
> > +          do_test_n (getpagesize () - j - 1, j, i, i - 1, 0, 127, 0);
> > +
> > +          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, 0);
> > +          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, 1);
> > +          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, -1);
> > +
> > +          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, 0);
> > +          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, 1);
> > +          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, -1);
> > +
> > +          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, 0);
> > +          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, 1);
> > +          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, -1);
> > +
> > +          do_test_n (0, getpagesize () - j - 1, i, i, 0, 127, 0);
> > +          do_test_n (0, getpagesize () - j - 1, i, i - 1, 0, 127, 0);
> > +
> > +          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 0);
> > +          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 1);
> > +          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, -1);
> > +
> > +          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 0);
> > +          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 1);
> > +          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, -1);
> > +
> > +          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, 0);
> > +          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, 1);
> > +          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, -1);
> > +
> > +          do_test_n (j, getpagesize () - j - 1, i, i, 0, 127, 0);
> > +          do_test_n (j, getpagesize () - j - 1, i, i - 1, 0, 127, 0);
> > +
> > +          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 0);
> > +          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 1);
> > +          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, -1);
> > +
> > +          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 0);
> > +          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 1);
> > +          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, -1);
> > +          if (i < 32)
> > +            {
> > +              i += 1;
> > +            }
> > +          else if (i < 161)
> > +            {
> > +              i += 7;
> > +            }
> > +          else if (i + 161 < TEST_LEN)
> > +            {
> > +              i += 31;
> > +              i *= 17;
> > +              i /= 16;
> > +              if (i + 161 > TEST_LEN)
> > +                {
> > +                  i = TEST_LEN - 160;
> > +                }
> > +            }
> > +          else if (i + 32 < TEST_LEN)
> > +            {
> > +              i += 7;
> > +            }
> > +          else
> > +            {
> > +              i += 1;
> > +            }
> > +        }
> > +    }
> > +
> > +  check_overflow ();
> >    do_random_tests ();
> >    return ret;
> >  }
> > --
> > 2.25.1
> >
>
>
> --
> H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread
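The wcsncmp length guard that appears in the patch below (the shrq $56
before falling back to __wcscmp_avx2) amounts to the following check. A
rough C sketch with an illustrative helper name, not glibc code:

#include <stddef.h>

/* Sketch only: a wchar_t count with any of its top 8 bits set cannot
   bound real memory on x86_64 and could overflow when scaled by
   sizeof (wchar_t), so such lengths are simply handed to the unbounded
   wcscmp.  */
static inline int
use_unbounded_wcscmp (size_t n)
{
  return (n >> 56) != 0;
}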

* Re: [PATCH v2 5/7] x86: Optimize strcmp-avx2.S
  2022-01-10  1:58         ` H.J. Lu
@ 2022-01-10  2:54           ` Noah Goldstein
  0 siblings, 0 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10  2:54 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library

On Sun, Jan 9, 2022 at 7:59 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Sun, Jan 9, 2022 at 5:06 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Sun, Jan 9, 2022 at 6:41 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Sun, Jan 9, 2022 at 4:31 PM Noah Goldstein via Libc-alpha
> > > <libc-alpha@sourceware.org> wrote:
> > > >
> > > > Optimization are primarily to the loop logic and how the page cross
> > > > logic interacts with the loop.
> > > >
> > > > The page cross logic is at times more expensive for short strings near
> > > > the end of a page but not crossing the page. This is done to retest
> > > > the page cross conditions with a non-faulty check and to improve the
> > > > logic for entering the loop afterwards. This only affects particular
> > > > cases, however, and is generally made up for by more than 10x
> > > > improvements on the transition from the page cross -> loop case.
> > > >
> > > > The non-page cross cases are improved most for smaller sizes [0, 128]
> > > > and go about even for (128, 4096]. The loop page cross logic is
> > > > improved so some more significant speedup is seen there as well.
> > > >
> > > > test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
> > > >
> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > ---
> > > >  sysdeps/x86_64/multiarch/strcmp-avx2.S | 1590 ++++++++++++++----------
> > > >  1 file changed, 939 insertions(+), 651 deletions(-)
> > > >
> > > > diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > > index 9c73b5899d..28d6a0025a 100644
> > > > --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > > +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > > @@ -26,35 +26,57 @@
> > > >
> > > >  # define PAGE_SIZE     4096
> > > >
> > > > -/* VEC_SIZE = Number of bytes in a ymm register */
> > > > +       /* VEC_SIZE = Number of bytes in a ymm register.  */
> > > >  # define VEC_SIZE      32
> > > >
> > > > -/* Shift for dividing by (VEC_SIZE * 4).  */
> > > > -# define DIVIDE_BY_VEC_4_SHIFT 7
> > > > -# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
> > > > -#  error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
> > > > -# endif
> > > > +# define VMOVU vmovdqu
> > > > +# define VMOVA vmovdqa
> > > >
> > > >  # ifdef USE_AS_WCSCMP
> > > > -/* Compare packed dwords.  */
> > > > +       /* Compare packed dwords.  */
> > > >  #  define VPCMPEQ      vpcmpeqd
> > > > -/* Compare packed dwords and store minimum.  */
> > > > +       /* Compare packed dwords and store minimum.  */
> > > >  #  define VPMINU       vpminud
> > > > -/* 1 dword char == 4 bytes.  */
> > > > +       /* 1 dword char == 4 bytes.  */
> > > >  #  define SIZE_OF_CHAR 4
> > > >  # else
> > > > -/* Compare packed bytes.  */
> > > > +       /* Compare packed bytes.  */
> > > >  #  define VPCMPEQ      vpcmpeqb
> > > > -/* Compare packed bytes and store minimum.  */
> > > > +       /* Compare packed bytes and store minimum.  */
> > > >  #  define VPMINU       vpminub
> > > > -/* 1 byte char == 1 byte.  */
> > > > +       /* 1 byte char == 1 byte.  */
> > > >  #  define SIZE_OF_CHAR 1
> > > >  # endif
> > > >
> > > > +# ifdef USE_AS_STRNCMP
> > > > +#  define LOOP_REG     r9d
> > > > +#  define LOOP_REG64   r9
> > > > +
> > > > +#  define OFFSET_REG8  r9b
> > > > +#  define OFFSET_REG   r9d
> > > > +#  define OFFSET_REG64 r9
> > > > +# else
> > > > +#  define LOOP_REG     edx
> > > > +#  define LOOP_REG64   rdx
> > > > +
> > > > +#  define OFFSET_REG8  dl
> > > > +#  define OFFSET_REG   edx
> > > > +#  define OFFSET_REG64 rdx
> > > > +# endif
> > > > +
> > > >  # ifndef VZEROUPPER
> > > >  #  define VZEROUPPER   vzeroupper
> > > >  # endif
> > > >
> > > > +# if defined USE_AS_STRNCMP
> > > > +#  define VEC_OFFSET   0
> > > > +# else
> > > > +#  define VEC_OFFSET   (-VEC_SIZE)
> > > > +# endif
> > > > +
> > > > +# define xmmZERO       xmm15
> > > > +# define ymmZERO       ymm15
> > > > +
> > > >  # ifndef SECTION
> > > >  #  define SECTION(p)   p##.avx
> > > >  # endif
> > > > @@ -79,783 +101,1049 @@
> > > >     the maximum offset is reached before a difference is found, zero is
> > > >     returned.  */
> > > >
> > > > -       .section SECTION(.text),"ax",@progbits
> > > > -ENTRY (STRCMP)
> > > > +       .section SECTION(.text), "ax", @progbits
> > > > +ENTRY(STRCMP)
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       /* Check for simple cases (0 or 1) in offset.  */
> > > > +#  ifdef __ILP32__
> > > > +       /* Clear the upper 32 bits.  */
> > > > +       movl    %edx, %rdx
> > > > +#  endif
> > > >         cmp     $1, %RDX_LP
> > > > -       je      L(char0)
> > > > -       jb      L(zero)
> > > > +       /* Signed comparison intentional. We use this branch to also
> > > > +          test cases where length >= 2^63. These very large sizes can be
> > > > +          handled with strcmp as there is no way for that length to
> > > > +          actually bound the buffer.  */
> > > > +       jle     L(one_or_less)
> > > >  #  ifdef USE_AS_WCSCMP
> > > > -#  ifndef __ILP32__
> > > >         movq    %rdx, %rcx
> > > > -       /* Check if length could overflow when multiplied by
> > > > -          sizeof(wchar_t). Checking top 8 bits will cover all potential
> > > > -          overflow cases as well as redirect cases where its impossible to
> > > > -          length to bound a valid memory region. In these cases just use
> > > > -          'wcscmp'.  */
> > > > +
> > > > +          Check if that is possible. All cases where overflow is possible
> > > > +          Check if that is possible. All cases where overflow are possible
> > > > +          are cases where length is large enough that it can never be a
> > > > +          bound on valid memory so just use wcscmp.  */
> > > >         shrq    $56, %rcx
> > > >         jnz     __wcscmp_avx2
> > > > +
> > > > +       leaq    (, %rdx, 4), %rdx
> > > >  #  endif
> > > > -       /* Convert units: from wide to byte char.  */
> > > > -       shl     $2, %RDX_LP
> > > > -#  endif
> > > > -       /* Register %r11 tracks the maximum offset.  */
> > > > -       mov     %RDX_LP, %R11_LP
> > > >  # endif
> > > > +       vpxor   %xmmZERO, %xmmZERO, %xmmZERO
> > > >         movl    %edi, %eax
> > > > -       xorl    %edx, %edx
> > > > -       /* Make %xmm7 (%ymm7) all zeros in this function.  */
> > > > -       vpxor   %xmm7, %xmm7, %xmm7
> > > >         orl     %esi, %eax
> > > > -       andl    $(PAGE_SIZE - 1), %eax
> > > > -       cmpl    $(PAGE_SIZE - (VEC_SIZE * 4)), %eax
> > > > -       jg      L(cross_page)
> > > > -       /* Start comparing 4 vectors.  */
> > > > -       vmovdqu (%rdi), %ymm1
> > > > -       VPCMPEQ (%rsi), %ymm1, %ymm0
> > > > -       VPMINU  %ymm1, %ymm0, %ymm0
> > > > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > > > -       vpmovmskb %ymm0, %ecx
> > > > -       testl   %ecx, %ecx
> > > > -       je      L(next_3_vectors)
> > > > -       tzcntl  %ecx, %edx
> > > > +       sall    $20, %eax
> > > > +       /* Check if s1 or s2 may cross a page in the next 4x VEC loads.  */
> > > > +       cmpl    $((PAGE_SIZE -(VEC_SIZE * 4)) << 20), %eax
> > > > +       ja      L(page_cross)
> > > > +
> > > > +L(no_page_cross):
> > > > +       /* Safe to compare 4x vectors.  */
> > > > +       VMOVU   (%rdi), %ymm0
> > > > +       /* 1s where s1 and s2 equal.  */
> > > > +       VPCMPEQ (%rsi), %ymm0, %ymm1
> > > > +       /* 1s at null CHAR.  */
> > > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > > +       /* 1s where s1 and s2 equal AND not null CHAR.  */
> > > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > > +
> > > > +       /* All 1s -> keep going, any 0s -> return.  */
> > > > +       vpmovmskb %ymm1, %ecx
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       /* Return 0 if the mismatched index (%rdx) is after the maximum
> > > > -          offset (%r11).   */
> > > > -       cmpq    %r11, %rdx
> > > > -       jae     L(zero)
> > > > +       cmpq    $VEC_SIZE, %rdx
> > > > +       jbe     L(vec_0_test_len)
> > > >  # endif
> > > > +
> > > > +       /* All 1s represents all equals. incl will overflow to zero in
> > > > +          the all-equals case. Otherwise the 1s will carry until the
> > > > +          position of the first mismatch.  */
> > > > +       incl    %ecx
> > > > +       jz      L(more_3x_vec)
> > > > +
> > > > +       .p2align 4,, 4
> > > > +L(return_vec_0):
> > > > +       tzcntl  %ecx, %ecx
> > > >  # ifdef USE_AS_WCSCMP
> > > > +       movl    (%rdi, %rcx), %edx
> > > >         xorl    %eax, %eax
> > > > -       movl    (%rdi, %rdx), %ecx
> > > > -       cmpl    (%rsi, %rdx), %ecx
> > > > -       je      L(return)
> > > > -L(wcscmp_return):
> > > > +       cmpl    (%rsi, %rcx), %edx
> > > > +       je      L(ret0)
> > > >         setl    %al
> > > >         negl    %eax
> > > >         orl     $1, %eax
> > > > -L(return):
> > > >  # else
> > > > -       movzbl  (%rdi, %rdx), %eax
> > > > -       movzbl  (%rsi, %rdx), %edx
> > > > -       subl    %edx, %eax
> > > > +       movzbl  (%rdi, %rcx), %eax
> > > > +       movzbl  (%rsi, %rcx), %ecx
> > > > +       subl    %ecx, %eax
> > > >  # endif
> > > > +L(ret0):
> > > >  L(return_vzeroupper):
> > > >         ZERO_UPPER_VEC_REGISTERS_RETURN
> > > >
> > > > -       .p2align 4
> > > > -L(return_vec_size):
> > > > -       tzcntl  %ecx, %edx
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       /* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after
> > > > -          the maximum offset (%r11).  */
> > > > -       addq    $VEC_SIZE, %rdx
> > > > -       cmpq    %r11, %rdx
> > > > -       jae     L(zero)
> > > > -#  ifdef USE_AS_WCSCMP
> > > > +       .p2align 4,, 8
> > > > +L(vec_0_test_len):
> > > > +       notl    %ecx
> > > > +       bzhil   %edx, %ecx, %eax
> > > > +       jnz     L(return_vec_0)
> > > > +       /* Align if it will cross a fetch block.  */
> > > > +       .p2align 4,, 2
> > > > +L(ret_zero):
> > > >         xorl    %eax, %eax
> > > > -       movl    (%rdi, %rdx), %ecx
> > > > -       cmpl    (%rsi, %rdx), %ecx
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (%rdi, %rdx), %eax
> > > > -       movzbl  (%rsi, %rdx), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > -# else
> > > > +       VZEROUPPER_RETURN
> > > > +
> > > > +       .p2align 4,, 5
> > > > +L(one_or_less):
> > > > +       jb      L(ret_zero)
> > > >  #  ifdef USE_AS_WCSCMP
> > > > +       /* 'nbe' covers the case where length is negative (large
> > > > +          unsigned).  */
> > > > +       jnbe    __wcscmp_avx2
> > > > +       movl    (%rdi), %edx
> > > >         xorl    %eax, %eax
> > > > -       movl    VEC_SIZE(%rdi, %rdx), %ecx
> > > > -       cmpl    VEC_SIZE(%rsi, %rdx), %ecx
> > > > -       jne     L(wcscmp_return)
> > > > +       cmpl    (%rsi), %edx
> > > > +       je      L(ret1)
> > > > +       setl    %al
> > > > +       negl    %eax
> > > > +       orl     $1, %eax
> > > >  #  else
> > > > -       movzbl  VEC_SIZE(%rdi, %rdx), %eax
> > > > -       movzbl  VEC_SIZE(%rsi, %rdx), %edx
> > > > -       subl    %edx, %eax
> > > > +       /* 'nbe' covers the case where length is negative (large
> > > > +          unsigned).  */
> > > > +
> > > > +       jnbe    __strcmp_avx2
> > > > +       movzbl  (%rdi), %eax
> > > > +       movzbl  (%rsi), %ecx
> > > > +       subl    %ecx, %eax
> > > >  #  endif
> > > > +L(ret1):
> > > > +       ret
> > > >  # endif
> > > > -       VZEROUPPER_RETURN
> > > >
> > > > -       .p2align 4
> > > > -L(return_2_vec_size):
> > > > -       tzcntl  %ecx, %edx
> > > > +       .p2align 4,, 10
> > > > +L(return_vec_1):
> > > > +       tzcntl  %ecx, %ecx
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       /* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is
> > > > -          after the maximum offset (%r11).  */
> > > > -       addq    $(VEC_SIZE * 2), %rdx
> > > > -       cmpq    %r11, %rdx
> > > > -       jae     L(zero)
> > > > -#  ifdef USE_AS_WCSCMP
> > > > +       /* rdx must be > CHAR_PER_VEC so it is safe to subtract without
> > > > +          fear of overflow.  */
> > > > +       addq    $-VEC_SIZE, %rdx
> > > > +       cmpq    %rcx, %rdx
> > > > +       jbe     L(ret_zero)
> > > > +# endif
> > > > +# ifdef USE_AS_WCSCMP
> > > > +       movl    VEC_SIZE(%rdi, %rcx), %edx
> > > >         xorl    %eax, %eax
> > > > -       movl    (%rdi, %rdx), %ecx
> > > > -       cmpl    (%rsi, %rdx), %ecx
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (%rdi, %rdx), %eax
> > > > -       movzbl  (%rsi, %rdx), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > +       cmpl    VEC_SIZE(%rsi, %rcx), %edx
> > > > +       je      L(ret2)
> > > > +       setl    %al
> > > > +       negl    %eax
> > > > +       orl     $1, %eax
> > > >  # else
> > > > -#  ifdef USE_AS_WCSCMP
> > > > -       xorl    %eax, %eax
> > > > -       movl    (VEC_SIZE * 2)(%rdi, %rdx), %ecx
> > > > -       cmpl    (VEC_SIZE * 2)(%rsi, %rdx), %ecx
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (VEC_SIZE * 2)(%rdi, %rdx), %eax
> > > > -       movzbl  (VEC_SIZE * 2)(%rsi, %rdx), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > +       movzbl  VEC_SIZE(%rdi, %rcx), %eax
> > > > +       movzbl  VEC_SIZE(%rsi, %rcx), %ecx
> > > > +       subl    %ecx, %eax
> > > >  # endif
> > > > +L(ret2):
> > > >         VZEROUPPER_RETURN
> > > >
> > > > -       .p2align 4
> > > > -L(return_3_vec_size):
> > > > -       tzcntl  %ecx, %edx
> > > > +       .p2align 4,, 10
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       /* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is
> > > > -          after the maximum offset (%r11).  */
> > > > -       addq    $(VEC_SIZE * 3), %rdx
> > > > -       cmpq    %r11, %rdx
> > > > -       jae     L(zero)
> > > > -#  ifdef USE_AS_WCSCMP
> > > > +L(return_vec_3):
> > > > +       salq    $32, %rcx
> > > > +# endif
> > > > +
> > > > +L(return_vec_2):
> > > > +# ifndef USE_AS_STRNCMP
> > > > +       tzcntl  %ecx, %ecx
> > > > +# else
> > > > +       tzcntq  %rcx, %rcx
> > > > +       cmpq    %rcx, %rdx
> > > > +       jbe     L(ret_zero)
> > > > +# endif
> > > > +
> > > > +# ifdef USE_AS_WCSCMP
> > > > +       movl    (VEC_SIZE * 2)(%rdi, %rcx), %edx
> > > >         xorl    %eax, %eax
> > > > -       movl    (%rdi, %rdx), %ecx
> > > > -       cmpl    (%rsi, %rdx), %ecx
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (%rdi, %rdx), %eax
> > > > -       movzbl  (%rsi, %rdx), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > +       cmpl    (VEC_SIZE * 2)(%rsi, %rcx), %edx
> > > > +       je      L(ret3)
> > > > +       setl    %al
> > > > +       negl    %eax
> > > > +       orl     $1, %eax
> > > >  # else
> > > > +       movzbl  (VEC_SIZE * 2)(%rdi, %rcx), %eax
> > > > +       movzbl  (VEC_SIZE * 2)(%rsi, %rcx), %ecx
> > > > +       subl    %ecx, %eax
> > > > +# endif
> > > > +L(ret3):
> > > > +       VZEROUPPER_RETURN
> > > > +
> > > > +# ifndef USE_AS_STRNCMP
> > > > +       .p2align 4,, 10
> > > > +L(return_vec_3):
> > > > +       tzcntl  %ecx, %ecx
> > > >  #  ifdef USE_AS_WCSCMP
> > > > +       movl    (VEC_SIZE * 3)(%rdi, %rcx), %edx
> > > >         xorl    %eax, %eax
> > > > -       movl    (VEC_SIZE * 3)(%rdi, %rdx), %ecx
> > > > -       cmpl    (VEC_SIZE * 3)(%rsi, %rdx), %ecx
> > > > -       jne     L(wcscmp_return)
> > > > +       cmpl    (VEC_SIZE * 3)(%rsi, %rcx), %edx
> > > > +       je      L(ret4)
> > > > +       setl    %al
> > > > +       negl    %eax
> > > > +       orl     $1, %eax
> > > >  #  else
> > > > -       movzbl  (VEC_SIZE * 3)(%rdi, %rdx), %eax
> > > > -       movzbl  (VEC_SIZE * 3)(%rsi, %rdx), %edx
> > > > -       subl    %edx, %eax
> > > > +       movzbl  (VEC_SIZE * 3)(%rdi, %rcx), %eax
> > > > +       movzbl  (VEC_SIZE * 3)(%rsi, %rcx), %ecx
> > > > +       subl    %ecx, %eax
> > > >  #  endif
> > > > -# endif
> > > > +L(ret4):
> > > >         VZEROUPPER_RETURN
> > > > +# endif
> > > > +
> > > > +       .p2align 4,, 10
> > > > +L(more_3x_vec):
> > > > +       /* Safe to compare 4x vectors.  */
> > > > +       VMOVU   VEC_SIZE(%rdi), %ymm0
> > > > +       VPCMPEQ VEC_SIZE(%rsi), %ymm0, %ymm1
> > > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +       incl    %ecx
> > > > +       jnz     L(return_vec_1)
> > > > +
> > > > +# ifdef USE_AS_STRNCMP
> > > > +       subq    $(VEC_SIZE * 2), %rdx
> > > > +       jbe     L(ret_zero)
> > > > +# endif
> > > > +
> > > > +       VMOVU   (VEC_SIZE * 2)(%rdi), %ymm0
> > > > +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm0, %ymm1
> > > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +       incl    %ecx
> > > > +       jnz     L(return_vec_2)
> > > > +
> > > > +       VMOVU   (VEC_SIZE * 3)(%rdi), %ymm0
> > > > +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm0, %ymm1
> > > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +       incl    %ecx
> > > > +       jnz     L(return_vec_3)
> > > >
> > > > -       .p2align 4
> > > > -L(next_3_vectors):
> > > > -       vmovdqu VEC_SIZE(%rdi), %ymm6
> > > > -       VPCMPEQ VEC_SIZE(%rsi), %ymm6, %ymm3
> > > > -       VPMINU  %ymm6, %ymm3, %ymm3
> > > > -       VPCMPEQ %ymm7, %ymm3, %ymm3
> > > > -       vpmovmskb %ymm3, %ecx
> > > > -       testl   %ecx, %ecx
> > > > -       jne     L(return_vec_size)
> > > > -       vmovdqu (VEC_SIZE * 2)(%rdi), %ymm5
> > > > -       vmovdqu (VEC_SIZE * 3)(%rdi), %ymm4
> > > > -       vmovdqu (VEC_SIZE * 3)(%rsi), %ymm0
> > > > -       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm5, %ymm2
> > > > -       VPMINU  %ymm5, %ymm2, %ymm2
> > > > -       VPCMPEQ %ymm4, %ymm0, %ymm0
> > > > -       VPCMPEQ %ymm7, %ymm2, %ymm2
> > > > -       vpmovmskb %ymm2, %ecx
> > > > -       testl   %ecx, %ecx
> > > > -       jne     L(return_2_vec_size)
> > > > -       VPMINU  %ymm4, %ymm0, %ymm0
> > > > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > > > -       vpmovmskb %ymm0, %ecx
> > > > -       testl   %ecx, %ecx
> > > > -       jne     L(return_3_vec_size)
> > > > -L(main_loop_header):
> > > > -       leaq    (VEC_SIZE * 4)(%rdi), %rdx
> > > > -       movl    $PAGE_SIZE, %ecx
> > > > -       /* Align load via RAX.  */
> > > > -       andq    $-(VEC_SIZE * 4), %rdx
> > > > -       subq    %rdi, %rdx
> > > > -       leaq    (%rdi, %rdx), %rax
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       /* Starting from this point, the maximum offset, or simply the
> > > > -          'offset', DECREASES by the same amount when base pointers are
> > > > -          moved forward.  Return 0 when:
> > > > -            1) On match: offset <= the matched vector index.
> > > > -            2) On mistmach, offset is before the mistmatched index.
> > > > +       cmpq    $(VEC_SIZE * 2), %rdx
> > > > +       jbe     L(ret_zero)
> > > > +# endif
> > > > +
> > > > +# ifdef USE_AS_WCSCMP
> > > > +       /* Any non-zero positive value that doesn't interfere with 0x1.
> > > > +        */
> > > > -       subq    %rdx, %r11
> > > > -       jbe     L(zero)
> > > > -# endif
> > > > -       addq    %rsi, %rdx
> > > > -       movq    %rdx, %rsi
> > > > -       andl    $(PAGE_SIZE - 1), %esi
> > > > -       /* Number of bytes before page crossing.  */
> > > > -       subq    %rsi, %rcx
> > > > -       /* Number of VEC_SIZE * 4 blocks before page crossing.  */
> > > > -       shrq    $DIVIDE_BY_VEC_4_SHIFT, %rcx
> > > > -       /* ESI: Number of VEC_SIZE * 4 blocks before page crossing.   */
> > > > -       movl    %ecx, %esi
> > > > -       jmp     L(loop_start)
> > > > +       movl    $2, %r8d
> > > >
> > > > +# else
> > > > +       xorl    %r8d, %r8d
> > > > +# endif
> > > > +
> > > > +       /* The prepare labels are various entry points from the page
> > > > +          cross logic.  */
> > > > +L(prepare_loop):
> > > > +
> > > > +# ifdef USE_AS_STRNCMP
> > > > +       /* Store N + (VEC_SIZE * 4) and place check at the beginning of
> > > > +          the loop.  */
> > > > +       leaq    (VEC_SIZE * 2)(%rdi, %rdx), %rdx
> > > > +# endif
> > > > +L(prepare_loop_no_len):
> > > > +
> > > > +       /* Align s1 and adjust s2 accordingly.  */
> > > > +       subq    %rdi, %rsi
> > > > +       andq    $-(VEC_SIZE * 4), %rdi
> > > > +       addq    %rdi, %rsi
> > > > +
> > > > +# ifdef USE_AS_STRNCMP
> > > > +       subq    %rdi, %rdx
> > > > +# endif
> > > > +
> > > > +L(prepare_loop_aligned):
> > > > +       /* eax stores distance from rsi to next page cross. These cases
> > > > +          need to be handled specially as the 4x loop could potentially
> > > > +          read memory past the length of s1 or s2 and across a page
> > > > +          boundary.  */
> > > > +       movl    $-(VEC_SIZE * 4), %eax
> > > > +       subl    %esi, %eax
> > > > +       andl    $(PAGE_SIZE - 1), %eax
> > > > +
> > > > +       /* Loop 4x comparisons at a time.  */
> > > >         .p2align 4
> > > >  L(loop):
> > > > +
> > > > +       /* End condition for strncmp.  */
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       /* Base pointers are moved forward by 4 * VEC_SIZE.  Decrease
> > > > -          the maximum offset (%r11) by the same amount.  */
> > > > -       subq    $(VEC_SIZE * 4), %r11
> > > > -       jbe     L(zero)
> > > > -# endif
> > > > -       addq    $(VEC_SIZE * 4), %rax
> > > > -       addq    $(VEC_SIZE * 4), %rdx
> > > > -L(loop_start):
> > > > -       testl   %esi, %esi
> > > > -       leal    -1(%esi), %esi
> > > > -       je      L(loop_cross_page)
> > > > -L(back_to_loop):
> > > > -       /* Main loop, comparing 4 vectors are a time.  */
> > > > -       vmovdqa (%rax), %ymm0
> > > > -       vmovdqa VEC_SIZE(%rax), %ymm3
> > > > -       VPCMPEQ (%rdx), %ymm0, %ymm4
> > > > -       VPCMPEQ VEC_SIZE(%rdx), %ymm3, %ymm1
> > > > -       VPMINU  %ymm0, %ymm4, %ymm4
> > > > -       VPMINU  %ymm3, %ymm1, %ymm1
> > > > -       vmovdqa (VEC_SIZE * 2)(%rax), %ymm2
> > > > -       VPMINU  %ymm1, %ymm4, %ymm0
> > > > -       vmovdqa (VEC_SIZE * 3)(%rax), %ymm3
> > > > -       VPCMPEQ (VEC_SIZE * 2)(%rdx), %ymm2, %ymm5
> > > > -       VPCMPEQ (VEC_SIZE * 3)(%rdx), %ymm3, %ymm6
> > > > -       VPMINU  %ymm2, %ymm5, %ymm5
> > > > -       VPMINU  %ymm3, %ymm6, %ymm6
> > > > -       VPMINU  %ymm5, %ymm0, %ymm0
> > > > -       VPMINU  %ymm6, %ymm0, %ymm0
> > > > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > > > -
> > > > -       /* Test each mask (32 bits) individually because for VEC_SIZE
> > > > -          == 32 is not possible to OR the four masks and keep all bits
> > > > -          in a 64-bit integer register, differing from SSE2 strcmp
> > > > -          where ORing is possible.  */
> > > > -       vpmovmskb %ymm0, %ecx
> > > > +       subq    $(VEC_SIZE * 4), %rdx
> > > > +       jbe     L(ret_zero)
> > > > +# endif
> > > > +
> > > > +       subq    $-(VEC_SIZE * 4), %rdi
> > > > +       subq    $-(VEC_SIZE * 4), %rsi
> > > > +
> > > > +       /* Check if rsi loads will cross a page boundary.  */
> > > > +       addl    $-(VEC_SIZE * 4), %eax
> > > > +       jnb     L(page_cross_during_loop)
> > > > +
> > > > +       /* Loop entry after handling page cross during loop.  */
> > > > +L(loop_skip_page_cross_check):
> > > > +       VMOVA   (VEC_SIZE * 0)(%rdi), %ymm0
> > > > +       VMOVA   (VEC_SIZE * 1)(%rdi), %ymm2
> > > > +       VMOVA   (VEC_SIZE * 2)(%rdi), %ymm4
> > > > +       VMOVA   (VEC_SIZE * 3)(%rdi), %ymm6
> > > > +
> > > > +       /* ymm1 all 1s where s1 and s2 equal. All 0s otherwise.  */
> > > > +       VPCMPEQ (VEC_SIZE * 0)(%rsi), %ymm0, %ymm1
> > > > +
> > > > +       VPCMPEQ (VEC_SIZE * 1)(%rsi), %ymm2, %ymm3
> > > > +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
> > > > +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
> > > > +
> > > > +
> > > > +       /* A CHAR is 0 if there was a mismatch or a null CHAR, otherwise
> > > > +          non-zero.  */
> > > > +       vpand   %ymm0, %ymm1, %ymm1
> > > > +
> > > > +
> > > > +       vpand   %ymm2, %ymm3, %ymm3
> > > > +       vpand   %ymm4, %ymm5, %ymm5
> > > > +       vpand   %ymm6, %ymm7, %ymm7
> > > > +
> > > > +       VPMINU  %ymm1, %ymm3, %ymm3
> > > > +       VPMINU  %ymm5, %ymm7, %ymm7
> > > > +
> > > > +       /* Reduce all 0 CHARs for the 4x VEC into ymm7.  */
> > > > +       VPMINU  %ymm3, %ymm7, %ymm7
> > > > +
> > > > +       /* If any 0 CHAR then done.  */
> > > > +       VPCMPEQ %ymm7, %ymmZERO, %ymm7
> > > > +       vpmovmskb %ymm7, %LOOP_REG
> > > > +       testl   %LOOP_REG, %LOOP_REG
> > > > +       jz      L(loop)
> > > > +
> > > > +       /* Find which VEC has the mismatch or end of string.  */
> > > > +       VPCMPEQ %ymm1, %ymmZERO, %ymm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > >         testl   %ecx, %ecx
> > > > -       je      L(loop)
> > > > -       VPCMPEQ %ymm7, %ymm4, %ymm0
> > > > -       vpmovmskb %ymm0, %edi
> > > > -       testl   %edi, %edi
> > > > -       je      L(test_vec)
> > > > -       tzcntl  %edi, %ecx
> > > > +       jnz     L(return_vec_0_end)
> > > > +
> > > > +
> > > > +       VPCMPEQ %ymm3, %ymmZERO, %ymm3
> > > > +       vpmovmskb %ymm3, %ecx
> > > > +       testl   %ecx, %ecx
> > > > +       jnz     L(return_vec_1_end)
> > > > +
> > > > +L(return_vec_2_3_end):
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       cmpq    %rcx, %r11
> > > > -       jbe     L(zero)
> > > > -#  ifdef USE_AS_WCSCMP
> > > > -       movq    %rax, %rsi
> > > > +       subq    $(VEC_SIZE * 2), %rdx
> > > > +       jbe     L(ret_zero_end)
> > > > +# endif
> > > > +
> > > > +       VPCMPEQ %ymm5, %ymmZERO, %ymm5
> > > > +       vpmovmskb %ymm5, %ecx
> > > > +       testl   %ecx, %ecx
> > > > +       jnz     L(return_vec_2_end)
> > > > +
> > > > +       /* LOOP_REG contains matches for null/mismatch from the loop. If
> > > > +          VEC 0, 1, and 2 all have no null and no mismatches then the
> > > > +          mismatch must be entirely from VEC 3, which is fully
> > > > +          represented by LOOP_REG.  */
> > > > +       tzcntl  %LOOP_REG, %LOOP_REG
> > > > +
> > > > +# ifdef USE_AS_STRNCMP
> > > > +       subl    $-(VEC_SIZE), %LOOP_REG
> > > > +       cmpq    %LOOP_REG64, %rdx
> > > > +       jbe     L(ret_zero_end)
> > > > +# endif
> > > > +
> > > > +# ifdef USE_AS_WCSCMP
> > > > +       movl    (VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %ecx
> > > >         xorl    %eax, %eax
> > > > -       movl    (%rsi, %rcx), %edi
> > > > -       cmpl    (%rdx, %rcx), %edi
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (%rax, %rcx), %eax
> > > > -       movzbl  (%rdx, %rcx), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > +       cmpl    (VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
> > > > +       je      L(ret5)
> > > > +       setl    %al
> > > > +       negl    %eax
> > > > +       xorl    %r8d, %eax
> > > >  # else
> > > > -#  ifdef USE_AS_WCSCMP
> > > > -       movq    %rax, %rsi
> > > > -       xorl    %eax, %eax
> > > > -       movl    (%rsi, %rcx), %edi
> > > > -       cmpl    (%rdx, %rcx), %edi
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (%rax, %rcx), %eax
> > > > -       movzbl  (%rdx, %rcx), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > +       movzbl  (VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %eax
> > > > +       movzbl  (VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
> > > > +       subl    %ecx, %eax
> > > > +       xorl    %r8d, %eax
> > > > +       subl    %r8d, %eax
> > > >  # endif
> > > > +L(ret5):
> > > >         VZEROUPPER_RETURN
> > > >
> > > > -       .p2align 4
> > > > -L(test_vec):
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       /* The first vector matched.  Return 0 if the maximum offset
> > > > -          (%r11) <= VEC_SIZE.  */
> > > > -       cmpq    $VEC_SIZE, %r11
> > > > -       jbe     L(zero)
> > > > +       .p2align 4,, 2
> > > > +L(ret_zero_end):
> > > > +       xorl    %eax, %eax
> > > > +       VZEROUPPER_RETURN
> > > >  # endif
> > > > -       VPCMPEQ %ymm7, %ymm1, %ymm1
> > > > -       vpmovmskb %ymm1, %ecx
> > > > -       testl   %ecx, %ecx
> > > > -       je      L(test_2_vec)
> > > > -       tzcntl  %ecx, %edi
> > > > +
> > > > +
> > > > +       /* The L(return_vec_N_end) labels differ from L(return_vec_N) in
> > > > +          that they use the value of `r8` to negate the return value.
> > > > +          This is because the page cross logic can swap `rdi` and `rsi`.
> > > > +        */
> > > > +       .p2align 4,, 10
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       addq    $VEC_SIZE, %rdi
> > > > -       cmpq    %rdi, %r11
> > > > -       jbe     L(zero)
> > > > -#  ifdef USE_AS_WCSCMP
> > > > -       movq    %rax, %rsi
> > > > +L(return_vec_1_end):
> > > > +       salq    $32, %rcx
> > > > +# endif
> > > > +L(return_vec_0_end):
> > > > +# ifndef USE_AS_STRNCMP
> > > > +       tzcntl  %ecx, %ecx
> > > > +# else
> > > > +       tzcntq  %rcx, %rcx
> > > > +       cmpq    %rcx, %rdx
> > > > +       jbe     L(ret_zero_end)
> > > > +# endif
> > > > +
> > > > +# ifdef USE_AS_WCSCMP
> > > > +       movl    (%rdi, %rcx), %edx
> > > >         xorl    %eax, %eax
> > > > -       movl    (%rsi, %rdi), %ecx
> > > > -       cmpl    (%rdx, %rdi), %ecx
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (%rax, %rdi), %eax
> > > > -       movzbl  (%rdx, %rdi), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > +       cmpl    (%rsi, %rcx), %edx
> > > > +       je      L(ret6)
> > > > +       setl    %al
> > > > +       negl    %eax
> > > > +       xorl    %r8d, %eax
> > > >  # else
> > > > +       movzbl  (%rdi, %rcx), %eax
> > > > +       movzbl  (%rsi, %rcx), %ecx
> > > > +       subl    %ecx, %eax
> > > > +       xorl    %r8d, %eax
> > > > +       subl    %r8d, %eax
> > > > +# endif
> > > > +L(ret6):
> > > > +       VZEROUPPER_RETURN
> > > > +
> > > > +# ifndef USE_AS_STRNCMP
> > > > +       .p2align 4,, 10
> > > > +L(return_vec_1_end):
> > > > +       tzcntl  %ecx, %ecx
> > > >  #  ifdef USE_AS_WCSCMP
> > > > -       movq    %rax, %rsi
> > > > +       movl    VEC_SIZE(%rdi, %rcx), %edx
> > > >         xorl    %eax, %eax
> > > > -       movl    VEC_SIZE(%rsi, %rdi), %ecx
> > > > -       cmpl    VEC_SIZE(%rdx, %rdi), %ecx
> > > > -       jne     L(wcscmp_return)
> > > > +       cmpl    VEC_SIZE(%rsi, %rcx), %edx
> > > > +       je      L(ret7)
> > > > +       setl    %al
> > > > +       negl    %eax
> > > > +       xorl    %r8d, %eax
> > > >  #  else
> > > > -       movzbl  VEC_SIZE(%rax, %rdi), %eax
> > > > -       movzbl  VEC_SIZE(%rdx, %rdi), %edx
> > > > -       subl    %edx, %eax
> > > > +       movzbl  VEC_SIZE(%rdi, %rcx), %eax
> > > > +       movzbl  VEC_SIZE(%rsi, %rcx), %ecx
> > > > +       subl    %ecx, %eax
> > > > +       xorl    %r8d, %eax
> > > > +       subl    %r8d, %eax
> > > >  #  endif
> > > > -# endif
> > > > +L(ret7):
> > > >         VZEROUPPER_RETURN
> > > > +# endif
> > > >
> > > > -       .p2align 4
> > > > -L(test_2_vec):
> > > > +       .p2align 4,, 10
> > > > +L(return_vec_2_end):
> > > > +       tzcntl  %ecx, %ecx
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       /* The first 2 vectors matched.  Return 0 if the maximum offset
> > > > -          (%r11) <= 2 * VEC_SIZE.  */
> > > > -       cmpq    $(VEC_SIZE * 2), %r11
> > > > -       jbe     L(zero)
> > > > +       cmpq    %rcx, %rdx
> > > > +       jbe     L(ret_zero_page_cross)
> > > >  # endif
> > > > -       VPCMPEQ %ymm7, %ymm5, %ymm5
> > > > -       vpmovmskb %ymm5, %ecx
> > > > -       testl   %ecx, %ecx
> > > > -       je      L(test_3_vec)
> > > > -       tzcntl  %ecx, %edi
> > > > -# ifdef USE_AS_STRNCMP
> > > > -       addq    $(VEC_SIZE * 2), %rdi
> > > > -       cmpq    %rdi, %r11
> > > > -       jbe     L(zero)
> > > > -#  ifdef USE_AS_WCSCMP
> > > > -       movq    %rax, %rsi
> > > > +# ifdef USE_AS_WCSCMP
> > > > +       movl    (VEC_SIZE * 2)(%rdi, %rcx), %edx
> > > >         xorl    %eax, %eax
> > > > -       movl    (%rsi, %rdi), %ecx
> > > > -       cmpl    (%rdx, %rdi), %ecx
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (%rax, %rdi), %eax
> > > > -       movzbl  (%rdx, %rdi), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > +       cmpl    (VEC_SIZE * 2)(%rsi, %rcx), %edx
> > > > +       je      L(ret11)
> > > > +       setl    %al
> > > > +       negl    %eax
> > > > +       xorl    %r8d, %eax
> > > >  # else
> > > > -#  ifdef USE_AS_WCSCMP
> > > > -       movq    %rax, %rsi
> > > > -       xorl    %eax, %eax
> > > > -       movl    (VEC_SIZE * 2)(%rsi, %rdi), %ecx
> > > > -       cmpl    (VEC_SIZE * 2)(%rdx, %rdi), %ecx
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (VEC_SIZE * 2)(%rax, %rdi), %eax
> > > > -       movzbl  (VEC_SIZE * 2)(%rdx, %rdi), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > +       movzbl  (VEC_SIZE * 2)(%rdi, %rcx), %eax
> > > > +       movzbl  (VEC_SIZE * 2)(%rsi, %rcx), %ecx
> > > > +       subl    %ecx, %eax
> > > > +       xorl    %r8d, %eax
> > > > +       subl    %r8d, %eax
> > > >  # endif
> > > > +L(ret11):
> > > >         VZEROUPPER_RETURN
> > > >
> > > > -       .p2align 4
> > > > -L(test_3_vec):
> > > > +
> > > > +       /* Page cross in rsi in next 4x VEC.  */
> > > > +
> > > > +       /* TODO: Improve logic here.  */
> > > > +       .p2align 4,, 10
> > > > +L(page_cross_during_loop):
> > > > +       /* eax contains [distance_from_page - (VEC_SIZE * 4)].  */
> > > > +
> > > > +       /* Optimistically rsi and rdi are both aligned, in which case we
> > > > +          don't need any logic here.  */
> > > > +       cmpl    $-(VEC_SIZE * 4), %eax
> > > > +       /* Don't adjust eax before jumping back to the loop; we will
> > > > +          never hit the page cross case again.  */
> > > > +       je      L(loop_skip_page_cross_check)
> > > > +
> > > > +       /* Check if we can safely load a VEC.  */
> > > > +       cmpl    $-(VEC_SIZE * 3), %eax
> > > > +       jle     L(less_1x_vec_till_page_cross)
> > > > +
> > > > +       VMOVA   (%rdi), %ymm0
> > > > +       VPCMPEQ (%rsi), %ymm0, %ymm1
> > > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +       incl    %ecx
> > > > +       jnz     L(return_vec_0_end)
> > > > +
> > > > +       /* if distance >= 2x VEC then eax > -(VEC_SIZE * 2).  */
> > > > +       cmpl    $-(VEC_SIZE * 2), %eax
> > > > +       jg      L(more_2x_vec_till_page_cross)
> > > > +
> > > > +       .p2align 4,, 4
> > > > +L(less_1x_vec_till_page_cross):
> > > > +       subl    $-(VEC_SIZE * 4), %eax
> > > > +       /* Guaranteed safe to read from rdi - VEC_SIZE here. The only
> > > > +          concerning case is the first iteration if incoming s1 was near
> > > > +          the start of a page and s2 near the end. If s1 was near the
> > > > +          start of the page we already aligned up to the nearest
> > > > +          VEC_SIZE * 4 so it is guaranteed safe to read back -VEC_SIZE.
> > > > +          If rdi is truly at the start of a page here, it means the
> > > > +          previous page (rdi - VEC_SIZE) has already been loaded earlier
> > > > +          so it must be valid.  */
> > > > +       VMOVU   -VEC_SIZE(%rdi, %rax), %ymm0
> > > > +       VPCMPEQ -VEC_SIZE(%rsi, %rax), %ymm0, %ymm1
> > > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +
> > > > +       /* Mask of potentially valid bits. The lower bits can come from
> > > > +          out of range comparisons (but are safe w.r.t. page crosses).  */
> > > > +       movl    $-1, %r10d
> > > > +       shlxl   %esi, %r10d, %r10d
> > > > +       notl    %ecx
> > > > +
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       /* The first 3 vectors matched.  Return 0 if the maximum offset
> > > > -          (%r11) <= 3 * VEC_SIZE.  */
> > > > -       cmpq    $(VEC_SIZE * 3), %r11
> > > > -       jbe     L(zero)
> > > > -# endif
> > > > -       VPCMPEQ %ymm7, %ymm6, %ymm6
> > > > -       vpmovmskb %ymm6, %esi
> > > > -       tzcntl  %esi, %ecx
> > > > +       cmpq    %rax, %rdx
> > > > +       jbe     L(return_page_cross_end_check)
> > > > +# endif
> > > > +       movl    %eax, %OFFSET_REG
> > > > +       addl    $(PAGE_SIZE - VEC_SIZE * 4), %eax
> > > > +
> > > > +       andl    %r10d, %ecx
> > > > +       jz      L(loop_skip_page_cross_check)
> > > > +
> > > > +       .p2align 4,, 3
> > > > +L(return_page_cross_end):
> > > > +       tzcntl  %ecx, %ecx
> > > > +
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       addq    $(VEC_SIZE * 3), %rcx
> > > > -       cmpq    %rcx, %r11
> > > > -       jbe     L(zero)
> > > > -#  ifdef USE_AS_WCSCMP
> > > > -       movq    %rax, %rsi
> > > > -       xorl    %eax, %eax
> > > > -       movl    (%rsi, %rcx), %esi
> > > > -       cmpl    (%rdx, %rcx), %esi
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (%rax, %rcx), %eax
> > > > -       movzbl  (%rdx, %rcx), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > +       leal    -VEC_SIZE(%OFFSET_REG64, %rcx), %ecx
> > > > +L(return_page_cross_cmp_mem):
> > > >  # else
> > > > -#  ifdef USE_AS_WCSCMP
> > > > -       movq    %rax, %rsi
> > > > +       addl    %OFFSET_REG, %ecx
> > > > +# endif
> > > > +# ifdef USE_AS_WCSCMP
> > > > +       movl    VEC_OFFSET(%rdi, %rcx), %edx
> > > >         xorl    %eax, %eax
> > > > -       movl    (VEC_SIZE * 3)(%rsi, %rcx), %esi
> > > > -       cmpl    (VEC_SIZE * 3)(%rdx, %rcx), %esi
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (VEC_SIZE * 3)(%rax, %rcx), %eax
> > > > -       movzbl  (VEC_SIZE * 3)(%rdx, %rcx), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > +       cmpl    VEC_OFFSET(%rsi, %rcx), %edx
> > > > +       je      L(ret8)
> > > > +       setl    %al
> > > > +       negl    %eax
> > > > +       xorl    %r8d, %eax
> > > > +# else
> > > > +       movzbl  VEC_OFFSET(%rdi, %rcx), %eax
> > > > +       movzbl  VEC_OFFSET(%rsi, %rcx), %ecx
> > > > +       subl    %ecx, %eax
> > > > +       xorl    %r8d, %eax
> > > > +       subl    %r8d, %eax
> > > >  # endif
> > > > +L(ret8):
> > > >         VZEROUPPER_RETURN
> > > >
> > > > -       .p2align 4
> > > > -L(loop_cross_page):
> > > > -       xorl    %r10d, %r10d
> > > > -       movq    %rdx, %rcx
> > > > -       /* Align load via RDX.  We load the extra ECX bytes which should
> > > > -          be ignored.  */
> > > > -       andl    $((VEC_SIZE * 4) - 1), %ecx
> > > > -       /* R10 is -RCX.  */
> > > > -       subq    %rcx, %r10
> > > > -
> > > > -       /* This works only if VEC_SIZE * 2 == 64. */
> > > > -# if (VEC_SIZE * 2) != 64
> > > > -#  error (VEC_SIZE * 2) != 64
> > > > -# endif
> > > > -
> > > > -       /* Check if the first VEC_SIZE * 2 bytes should be ignored.  */
> > > > -       cmpl    $(VEC_SIZE * 2), %ecx
> > > > -       jge     L(loop_cross_page_2_vec)
> > > > -
> > > > -       vmovdqu (%rax, %r10), %ymm2
> > > > -       vmovdqu VEC_SIZE(%rax, %r10), %ymm3
> > > > -       VPCMPEQ (%rdx, %r10), %ymm2, %ymm0
> > > > -       VPCMPEQ VEC_SIZE(%rdx, %r10), %ymm3, %ymm1
> > > > -       VPMINU  %ymm2, %ymm0, %ymm0
> > > > -       VPMINU  %ymm3, %ymm1, %ymm1
> > > > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > > > -       VPCMPEQ %ymm7, %ymm1, %ymm1
> > > > -
> > > > -       vpmovmskb %ymm0, %edi
> > > > -       vpmovmskb %ymm1, %esi
> > > > -
> > > > -       salq    $32, %rsi
> > > > -       xorq    %rsi, %rdi
> > > > -
> > > > -       /* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
> > > > -       shrq    %cl, %rdi
> > > > -
> > > > -       testq   %rdi, %rdi
> > > > -       je      L(loop_cross_page_2_vec)
> > > > -       tzcntq  %rdi, %rcx
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       cmpq    %rcx, %r11
> > > > -       jbe     L(zero)
> > > > -#  ifdef USE_AS_WCSCMP
> > > > -       movq    %rax, %rsi
> > > > +       .p2align 4,, 10
> > > > +L(return_page_cross_end_check):
> > > > +       tzcntl  %ecx, %ecx
> > > > +       leal    -VEC_SIZE(%rax, %rcx), %ecx
> > > > +       cmpl    %ecx, %edx
> > > > +       ja      L(return_page_cross_cmp_mem)
> > > >         xorl    %eax, %eax
> > > > -       movl    (%rsi, %rcx), %edi
> > > > -       cmpl    (%rdx, %rcx), %edi
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (%rax, %rcx), %eax
> > > > -       movzbl  (%rdx, %rcx), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > -# else
> > > > -#  ifdef USE_AS_WCSCMP
> > > > -       movq    %rax, %rsi
> > > > -       xorl    %eax, %eax
> > > > -       movl    (%rsi, %rcx), %edi
> > > > -       cmpl    (%rdx, %rcx), %edi
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (%rax, %rcx), %eax
> > > > -       movzbl  (%rdx, %rcx), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > -# endif
> > > >         VZEROUPPER_RETURN
> > > > +# endif
> > > >
> > > > -       .p2align 4
> > > > -L(loop_cross_page_2_vec):
> > > > -       /* The first VEC_SIZE * 2 bytes match or are ignored.  */
> > > > -       vmovdqu (VEC_SIZE * 2)(%rax, %r10), %ymm2
> > > > -       vmovdqu (VEC_SIZE * 3)(%rax, %r10), %ymm3
> > > > -       VPCMPEQ (VEC_SIZE * 2)(%rdx, %r10), %ymm2, %ymm5
> > > > -       VPMINU  %ymm2, %ymm5, %ymm5
> > > > -       VPCMPEQ (VEC_SIZE * 3)(%rdx, %r10), %ymm3, %ymm6
> > > > -       VPCMPEQ %ymm7, %ymm5, %ymm5
> > > > -       VPMINU  %ymm3, %ymm6, %ymm6
> > > > -       VPCMPEQ %ymm7, %ymm6, %ymm6
> > > > -
> > > > -       vpmovmskb %ymm5, %edi
> > > > -       vpmovmskb %ymm6, %esi
> > > > -
> > > > -       salq    $32, %rsi
> > > > -       xorq    %rsi, %rdi
> > > >
> > > > -       xorl    %r8d, %r8d
> > > > -       /* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
> > > > -       subl    $(VEC_SIZE * 2), %ecx
> > > > -       jle     1f
> > > > -       /* Skip ECX bytes.  */
> > > > -       shrq    %cl, %rdi
> > > > -       /* R8 has number of bytes skipped.  */
> > > > -       movl    %ecx, %r8d
> > > > -1:
> > > > -       /* Before jumping back to the loop, set ESI to the number of
> > > > -          VEC_SIZE * 4 blocks before page crossing.  */
> > > > -       movl    $(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
> > > > -
> > > > -       testq   %rdi, %rdi
> > > > +       .p2align 4,, 10
> > > > +L(more_2x_vec_till_page_cross):
> > > > +       /* If there is more than 2x VEC till the page cross, complete a
> > > > +          full loop iteration here.  */
> > > > +
> > > > +       VMOVU   VEC_SIZE(%rdi), %ymm0
> > > > +       VPCMPEQ VEC_SIZE(%rsi), %ymm0, %ymm1
> > > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +       incl    %ecx
> > > > +       jnz     L(return_vec_1_end)
> > > > +
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       /* At this point, if %rdi value is 0, it already tested
> > > > -          VEC_SIZE*4+%r10 byte starting from %rax. This label
> > > > -          checks whether strncmp maximum offset reached or not.  */
> > > > -       je      L(string_nbyte_offset_check)
> > > > -# else
> > > > -       je      L(back_to_loop)
> > > > +       cmpq    $(VEC_SIZE * 2), %rdx
> > > > +       jbe     L(ret_zero_in_loop_page_cross)
> > > >  # endif
> > > > -       tzcntq  %rdi, %rcx
> > > > -       addq    %r10, %rcx
> > > > -       /* Adjust for number of bytes skipped.  */
> > > > -       addq    %r8, %rcx
> > > > +
> > > > +       subl    $-(VEC_SIZE * 4), %eax
> > > > +
> > > > +       /* Safe to include comparisons from lower bytes.  */
> > > > +       VMOVU   -(VEC_SIZE * 2)(%rdi, %rax), %ymm0
> > > > +       VPCMPEQ -(VEC_SIZE * 2)(%rsi, %rax), %ymm0, %ymm1
> > > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +       incl    %ecx
> > > > +       jnz     L(return_vec_page_cross_0)
> > > > +
> > > > +       VMOVU   -(VEC_SIZE * 1)(%rdi, %rax), %ymm0
> > > > +       VPCMPEQ -(VEC_SIZE * 1)(%rsi, %rax), %ymm0, %ymm1
> > > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +       incl    %ecx
> > > > +       jnz     L(return_vec_page_cross_1)
> > > > +
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       addq    $(VEC_SIZE * 2), %rcx
> > > > -       subq    %rcx, %r11
> > > > -       jbe     L(zero)
> > > > -#  ifdef USE_AS_WCSCMP
> > > > -       movq    %rax, %rsi
> > > > +       /* Must check length here as length might preclude reading the
> > > > +          next page.  */
> > > > +       cmpq    %rax, %rdx
> > > > +       jbe     L(ret_zero_in_loop_page_cross)
> > > > +# endif
> > > > +
> > > > +       /* Finish the loop.  */
> > > > +       VMOVA   (VEC_SIZE * 2)(%rdi), %ymm4
> > > > +       VMOVA   (VEC_SIZE * 3)(%rdi), %ymm6
> > > > +
> > > > +       VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
> > > > +       VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
> > > > +       vpand   %ymm4, %ymm5, %ymm5
> > > > +       vpand   %ymm6, %ymm7, %ymm7
> > > > +       VPMINU  %ymm5, %ymm7, %ymm7
> > > > +       VPCMPEQ %ymm7, %ymmZERO, %ymm7
> > > > +       vpmovmskb %ymm7, %LOOP_REG
> > > > +       testl   %LOOP_REG, %LOOP_REG
> > > > +       jnz     L(return_vec_2_3_end)
> > > > +
> > > > +       /* Best for code size to include an unconditional jmp here. If
> > > > +          this case is hot it would be faster to duplicate the
> > > > +          L(return_vec_2_3_end) code as the fall-through and jump back
> > > > +          to the loop on the mismatch comparison.  */
> > > > +       subq    $-(VEC_SIZE * 4), %rdi
> > > > +       subq    $-(VEC_SIZE * 4), %rsi
> > > > +       addl    $(PAGE_SIZE - VEC_SIZE * 8), %eax
> > > > +# ifdef USE_AS_STRNCMP
> > > > +       subq    $(VEC_SIZE * 4), %rdx
> > > > +       ja      L(loop_skip_page_cross_check)
> > > > +L(ret_zero_in_loop_page_cross):
> > > >         xorl    %eax, %eax
> > > > -       movl    (%rsi, %rcx), %edi
> > > > -       cmpl    (%rdx, %rcx), %edi
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (%rax, %rcx), %eax
> > > > -       movzbl  (%rdx, %rcx), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > +       VZEROUPPER_RETURN
> > > >  # else
> > > > -#  ifdef USE_AS_WCSCMP
> > > > -       movq    %rax, %rsi
> > > > -       xorl    %eax, %eax
> > > > -       movl    (VEC_SIZE * 2)(%rsi, %rcx), %edi
> > > > -       cmpl    (VEC_SIZE * 2)(%rdx, %rcx), %edi
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (VEC_SIZE * 2)(%rax, %rcx), %eax
> > > > -       movzbl  (VEC_SIZE * 2)(%rdx, %rcx), %edx
> > > > -       subl    %edx, %eax
> > > > -#  endif
> > > > +       jmp     L(loop_skip_page_cross_check)
> > > >  # endif
> > > > -       VZEROUPPER_RETURN
> > > >
> > > > +
> > > > +       .p2align 4,, 10
> > > > +L(return_vec_page_cross_0):
> > > > +       addl    $-VEC_SIZE, %eax
> > > > +L(return_vec_page_cross_1):
> > > > +       tzcntl  %ecx, %ecx
> > > >  # ifdef USE_AS_STRNCMP
> > > > -L(string_nbyte_offset_check):
> > > > -       leaq    (VEC_SIZE * 4)(%r10), %r10
> > > > -       cmpq    %r10, %r11
> > > > -       jbe     L(zero)
> > > > -       jmp     L(back_to_loop)
> > > > +       leal    -VEC_SIZE(%rax, %rcx), %ecx
> > > > +       cmpq    %rcx, %rdx
> > > > +       jbe     L(ret_zero_in_loop_page_cross)
> > > > +# else
> > > > +       addl    %eax, %ecx
> > > >  # endif
> > > >
> > > > -       .p2align 4
> > > > -L(cross_page_loop):
> > > > -       /* Check one byte/dword at a time.  */
> > > >  # ifdef USE_AS_WCSCMP
> > > > -       cmpl    %ecx, %eax
> > > > +       movl    VEC_OFFSET(%rdi, %rcx), %edx
> > > > +       xorl    %eax, %eax
> > > > +       cmpl    VEC_OFFSET(%rsi, %rcx), %edx
> > > > +       je      L(ret9)
> > > > +       setl    %al
> > > > +       negl    %eax
> > > > +       xorl    %r8d, %eax
> > > >  # else
> > > > +       movzbl  VEC_OFFSET(%rdi, %rcx), %eax
> > > > +       movzbl  VEC_OFFSET(%rsi, %rcx), %ecx
> > > >         subl    %ecx, %eax
> > > > +       xorl    %r8d, %eax
> > > > +       subl    %r8d, %eax
> > > >  # endif
> > > > -       jne     L(different)
> > > > -       addl    $SIZE_OF_CHAR, %edx
> > > > -       cmpl    $(VEC_SIZE * 4), %edx
> > > > -       je      L(main_loop_header)
> > > > -# ifdef USE_AS_STRNCMP
> > > > -       cmpq    %r11, %rdx
> > > > -       jae     L(zero)
> > > > +L(ret9):
> > > > +       VZEROUPPER_RETURN
> > > > +
> > > > +
> > > > +       .p2align 4,, 10
> > > > +L(page_cross):
> > > > +# ifndef USE_AS_STRNCMP
> > > > +       /* If both are VEC aligned we don't need any special logic here.
> > > > +          Only valid for strcmp where the stop condition is guaranteed
> > > > +          to be reachable by just reading memory.  */
> > > > +       testl   $((VEC_SIZE - 1) << 20), %eax
> > > > +       jz      L(no_page_cross)
> > > >  # endif
> > > > +
> > > > +       movl    %edi, %eax
> > > > +       movl    %esi, %ecx
> > > > +       andl    $(PAGE_SIZE - 1), %eax
> > > > +       andl    $(PAGE_SIZE - 1), %ecx
> > > > +
> > > > +       xorl    %OFFSET_REG, %OFFSET_REG
> > > > +
> > > > +       /* Check which is closer to page cross, s1 or s2.  */
> > > > +       cmpl    %eax, %ecx
> > > > +       jg      L(page_cross_s2)
> > > > +
> > > > +       /* The previous page cross check has false positives. Check for
> > > > +          true positive as page cross logic is very expensive.  */
> > > > +       subl    $(PAGE_SIZE - VEC_SIZE * 4), %eax
> > > > +       jbe     L(no_page_cross)
> > > > +
> > > > +       /* Set r8 to not interfere with normal return value (rdi and rsi
> > > > +          did not swap).  */
> > > >  # ifdef USE_AS_WCSCMP
> > > > -       movl    (%rdi, %rdx), %eax
> > > > -       movl    (%rsi, %rdx), %ecx
> > > > +       /* Any non-zero positive value that doesn't interfere with 0x1.
> > > > +        */
> > > > +       movl    $2, %r8d
> > > >  # else
> > > > -       movzbl  (%rdi, %rdx), %eax
> > > > -       movzbl  (%rsi, %rdx), %ecx
> > > > +       xorl    %r8d, %r8d
> > > >  # endif
> > > > -       /* Check null char.  */
> > > > -       testl   %eax, %eax
> > > > -       jne     L(cross_page_loop)
> > > > -       /* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
> > > > -          comparisons.  */
> > > > -       subl    %ecx, %eax
> > > > -# ifndef USE_AS_WCSCMP
> > > > -L(different):
> > > > +
> > > > +       /* Check if less than 1x VEC till page cross.  */
> > > > +       subl    $(VEC_SIZE * 3), %eax
> > > > +       jg      L(less_1x_vec_till_page)
> > > > +
> > > > +       /* If more than 1x VEC till page cross, loop through safely
> > > > +          loadable memory until within 1x VEC of the page cross.  */
> > > > +
> > > > +       .p2align 4,, 10
> > > > +L(page_cross_loop):
> > > > +
> > > > +       VMOVU   (%rdi, %OFFSET_REG64), %ymm0
> > > > +       VPCMPEQ (%rsi, %OFFSET_REG64), %ymm0, %ymm1
> > > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +       incl    %ecx
> > > > +
> > > > +       jnz     L(check_ret_vec_page_cross)
> > > > +       addl    $VEC_SIZE, %OFFSET_REG
> > > > +# ifdef USE_AS_STRNCMP
> > > > +       cmpq    %OFFSET_REG64, %rdx
> > > > +       jbe     L(ret_zero_page_cross)
> > > >  # endif
> > > > -       VZEROUPPER_RETURN
> > > > +       addl    $VEC_SIZE, %eax
> > > > +       jl      L(page_cross_loop)
> > > > +
> > > > +       subl    %eax, %OFFSET_REG
> > > > +       /* OFFSET_REG has the distance to the page cross - VEC_SIZE.
> > > > +          Guaranteed to not cross a page so it is safe to load. Since we
> > > > +          have already loaded at least 1 VEC from rsi it is also
> > > > +          guaranteed to be safe.  */
> > > > +
> > > > +       VMOVU   (%rdi, %OFFSET_REG64), %ymm0
> > > > +       VPCMPEQ (%rsi, %OFFSET_REG64), %ymm0, %ymm1
> > > > +       VPCMPEQ %ymm0, %ymmZERO, %ymm2
> > > > +       vpandn  %ymm1, %ymm2, %ymm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +
> > > > +# ifdef USE_AS_STRNCMP
> > > > +       leal    VEC_SIZE(%OFFSET_REG64), %eax
> > > > +       cmpq    %rax, %rdx
> > > > +       jbe     L(check_ret_vec_page_cross2)
> > > > +       addq    %rdi, %rdx
> > > > +# endif
> > > > +       incl    %ecx
> > > > +       jz      L(prepare_loop_no_len)
> > > >
> > > > +       .p2align 4,, 4
> > > > +L(ret_vec_page_cross):
> > > > +# ifndef USE_AS_STRNCMP
> > > > +L(check_ret_vec_page_cross):
> > > > +# endif
> > > > +       tzcntl  %ecx, %ecx
> > > > +       addl    %OFFSET_REG, %ecx
> > > > +L(ret_vec_page_cross_cont):
> > > >  # ifdef USE_AS_WCSCMP
> > > > -       .p2align 4
> > > > -L(different):
> > > > -       /* Use movl to avoid modifying EFLAGS.  */
> > > > -       movl    $0, %eax
> > > > +       movl    (%rdi, %rcx), %edx
> > > > +       xorl    %eax, %eax
> > > > +       cmpl    (%rsi, %rcx), %edx
> > > > +       je      L(ret12)
> > > >         setl    %al
> > > >         negl    %eax
> > > > -       orl     $1, %eax
> > > > -       VZEROUPPER_RETURN
> > > > +       xorl    %r8d, %eax
> > > > +# else
> > > > +       movzbl  (%rdi, %rcx), %eax
> > > > +       movzbl  (%rsi, %rcx), %ecx
> > > > +       subl    %ecx, %eax
> > > > +       xorl    %r8d, %eax
> > > > +       subl    %r8d, %eax
> > > >  # endif
> > > > +L(ret12):
> > > > +       VZEROUPPER_RETURN
> > > >
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       .p2align 4
> > > > -L(zero):
> > > > +       .p2align 4,, 10
> > > > +L(check_ret_vec_page_cross2):
> > > > +       incl    %ecx
> > > > +L(check_ret_vec_page_cross):
> > > > +       tzcntl  %ecx, %ecx
> > > > +       addl    %OFFSET_REG, %ecx
> > > > +       cmpq    %rcx, %rdx
> > > > +       ja      L(ret_vec_page_cross_cont)
> > > > +       .p2align 4,, 2
> > > > +L(ret_zero_page_cross):
> > > >         xorl    %eax, %eax
> > > >         VZEROUPPER_RETURN
> > > > +# endif
> > > >
> > > > -       .p2align 4
> > > > -L(char0):
> > > > -#  ifdef USE_AS_WCSCMP
> > > > -       xorl    %eax, %eax
> > > > -       movl    (%rdi), %ecx
> > > > -       cmpl    (%rsi), %ecx
> > > > -       jne     L(wcscmp_return)
> > > > -#  else
> > > > -       movzbl  (%rsi), %ecx
> > > > -       movzbl  (%rdi), %eax
> > > > -       subl    %ecx, %eax
> > > > -#  endif
> > > > -       VZEROUPPER_RETURN
> > > > +       .p2align 4,, 4
> > > > +L(page_cross_s2):
> > > > +       /* Ensure this is a true page cross.  */
> > > > +       subl    $(PAGE_SIZE - VEC_SIZE * 4), %ecx
> > > > +       jbe     L(no_page_cross)
> > > > +
> > > > +
> > > > +       movl    %ecx, %eax
> > > > +       movq    %rdi, %rcx
> > > > +       movq    %rsi, %rdi
> > > > +       movq    %rcx, %rsi
> > > > +
> > > > +       /* Set r8 to negate the return value, as rdi and rsi swapped.  */
> > > > +# ifdef USE_AS_WCSCMP
> > > > +       movl    $-4, %r8d
> > > > +# else
> > > > +       movl    $-1, %r8d
> > > >  # endif
> > > > +       xorl    %OFFSET_REG, %OFFSET_REG
> > > >
> > > > -       .p2align 4
> > > > -L(last_vector):
> > > > -       addq    %rdx, %rdi
> > > > -       addq    %rdx, %rsi
> > > > +       /* Check if more than 1x VEC till page cross.  */
> > > > +       subl    $(VEC_SIZE * 3), %eax
> > > > +       jle     L(page_cross_loop)
> > > > +
> > > > +       .p2align 4,, 6
> > > > +L(less_1x_vec_till_page):
> > > > +       /* Find largest load size we can use.  */
> > > > +       cmpl    $16, %eax
> > > > +       ja      L(less_16_till_page)
> > > > +
> > > > +       VMOVU   (%rdi), %xmm0
> > > > +       VPCMPEQ (%rsi), %xmm0, %xmm1
> > > > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > > > +       vpandn  %xmm1, %xmm2, %xmm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +       incw    %cx
> > > > +       jnz     L(check_ret_vec_page_cross)
> > > > +       movl    $16, %OFFSET_REG
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       subq    %rdx, %r11
> > > > +       cmpq    %OFFSET_REG64, %rdx
> > > > +       jbe     L(ret_zero_page_cross_slow_case0)
> > > > +       subl    %eax, %OFFSET_REG
> > > > +# else
> > > > +       /* Explicit check for 16 byte alignment.  */
> > > > +       subl    %eax, %OFFSET_REG
> > > > +       jz      L(prepare_loop)
> > > >  # endif
> > > > -       tzcntl  %ecx, %edx
> > > > +
> > > > +       VMOVU   (%rdi, %OFFSET_REG64), %xmm0
> > > > +       VPCMPEQ (%rsi, %OFFSET_REG64), %xmm0, %xmm1
> > > > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > > > +       vpandn  %xmm1, %xmm2, %xmm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +       incw    %cx
> > > > +       jnz     L(check_ret_vec_page_cross)
> > > > +
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       cmpq    %r11, %rdx
> > > > -       jae     L(zero)
> > > > +       addl    $16, %OFFSET_REG
> > > > +       subq    %OFFSET_REG64, %rdx
> > > > +       jbe     L(ret_zero_page_cross_slow_case0)
> > > > +       subq    $-(VEC_SIZE * 4), %rdx
> > > > +
> > > > +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > > > +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > > > +# else
> > > > +       leaq    (16 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > > > +       leaq    (16 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > > >  # endif
> > > > -# ifdef USE_AS_WCSCMP
> > > > +       jmp     L(prepare_loop_aligned)
> > > > +
> > > > +# ifdef USE_AS_STRNCMP
> > > > +       .p2align 4,, 2
> > > > +L(ret_zero_page_cross_slow_case0):
> > > >         xorl    %eax, %eax
> > > > -       movl    (%rdi, %rdx), %ecx
> > > > -       cmpl    (%rsi, %rdx), %ecx
> > > > -       jne     L(wcscmp_return)
> > > > -# else
> > > > -       movzbl  (%rdi, %rdx), %eax
> > > > -       movzbl  (%rsi, %rdx), %edx
> > > > -       subl    %edx, %eax
> > > > +       ret
> > > >  # endif
> > > > -       VZEROUPPER_RETURN
> > > >
> > > > -       /* Comparing on page boundary region requires special treatment:
> > > > -          It must done one vector at the time, starting with the wider
> > > > -          ymm vector if possible, if not, with xmm. If fetching 16 bytes
> > > > -          (xmm) still passes the boundary, byte comparison must be done.
> > > > -        */
> > > > -       .p2align 4
> > > > -L(cross_page):
> > > > -       /* Try one ymm vector at a time.  */
> > > > -       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> > > > -       jg      L(cross_page_1_vector)
> > > > -L(loop_1_vector):
> > > > -       vmovdqu (%rdi, %rdx), %ymm1
> > > > -       VPCMPEQ (%rsi, %rdx), %ymm1, %ymm0
> > > > -       VPMINU  %ymm1, %ymm0, %ymm0
> > > > -       VPCMPEQ %ymm7, %ymm0, %ymm0
> > > > -       vpmovmskb %ymm0, %ecx
> > > > -       testl   %ecx, %ecx
> > > > -       jne     L(last_vector)
> > > >
> > > > -       addl    $VEC_SIZE, %edx
> > > > +       .p2align 4,, 10
> > > > +L(less_16_till_page):
> > > > +       /* Find largest load size we can use.  */
> > > > +       cmpl    $24, %eax
> > > > +       ja      L(less_8_till_page)
> > > >
> > > > -       addl    $VEC_SIZE, %eax
> > > > -# ifdef USE_AS_STRNCMP
> > > > -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> > > > -          (%r11).  */
> > > > -       cmpq    %r11, %rdx
> > > > -       jae     L(zero)
> > > > -# endif
> > > > -       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> > > > -       jle     L(loop_1_vector)
> > > > -L(cross_page_1_vector):
> > > > -       /* Less than 32 bytes to check, try one xmm vector.  */
> > > > -       cmpl    $(PAGE_SIZE - 16), %eax
> > > > -       jg      L(cross_page_1_xmm)
> > > > -       vmovdqu (%rdi, %rdx), %xmm1
> > > > -       VPCMPEQ (%rsi, %rdx), %xmm1, %xmm0
> > > > -       VPMINU  %xmm1, %xmm0, %xmm0
> > > > -       VPCMPEQ %xmm7, %xmm0, %xmm0
> > > > -       vpmovmskb %xmm0, %ecx
> > > > -       testl   %ecx, %ecx
> > > > -       jne     L(last_vector)
> > > > +       vmovq   (%rdi), %xmm0
> > > > +       vmovq   (%rsi), %xmm1
> > > > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > > > +       VPCMPEQ %xmm1, %xmm0, %xmm1
> > > > +       vpandn  %xmm1, %xmm2, %xmm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +       incb    %cl
> > > > +       jnz     L(check_ret_vec_page_cross)
> > > >
> > > > -       addl    $16, %edx
> > > > -# ifndef USE_AS_WCSCMP
> > > > -       addl    $16, %eax
> > > > +
> > > > +# ifdef USE_AS_STRNCMP
> > > > +       cmpq    $8, %rdx
> > > > +       jbe     L(ret_zero_page_cross_slow_case0)
> > > >  # endif
> > > > +       movl    $24, %OFFSET_REG
> > > > +       /* Explicit check for 16 byte alignment.  */
> > > > +       subl    %eax, %OFFSET_REG
> > > > +
> > > > +
> > > > +
> > > > +       vmovq   (%rdi, %OFFSET_REG64), %xmm0
> > > > +       vmovq   (%rsi, %OFFSET_REG64), %xmm1
> > > > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > > > +       VPCMPEQ %xmm1, %xmm0, %xmm1
> > > > +       vpandn  %xmm1, %xmm2, %xmm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +       incb    %cl
> > > > +       jnz     L(check_ret_vec_page_cross)
> > > > +
> > > >  # ifdef USE_AS_STRNCMP
> > > > -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> > > > -          (%r11).  */
> > > > -       cmpq    %r11, %rdx
> > > > -       jae     L(zero)
> > > > -# endif
> > > > -
> > > > -L(cross_page_1_xmm):
> > > > -# ifndef USE_AS_WCSCMP
> > > > -       /* Less than 16 bytes to check, try 8 byte vector.  NB: No need
> > > > -          for wcscmp nor wcsncmp since wide char is 4 bytes.   */
> > > > -       cmpl    $(PAGE_SIZE - 8), %eax
> > > > -       jg      L(cross_page_8bytes)
> > > > -       vmovq   (%rdi, %rdx), %xmm1
> > > > -       vmovq   (%rsi, %rdx), %xmm0
> > > > -       VPCMPEQ %xmm0, %xmm1, %xmm0
> > > > -       VPMINU  %xmm1, %xmm0, %xmm0
> > > > -       VPCMPEQ %xmm7, %xmm0, %xmm0
> > > > -       vpmovmskb %xmm0, %ecx
> > > > -       /* Only last 8 bits are valid.  */
> > > > -       andl    $0xff, %ecx
> > > > -       testl   %ecx, %ecx
> > > > -       jne     L(last_vector)
> > > > +       addl    $8, %OFFSET_REG
> > > > +       subq    %OFFSET_REG64, %rdx
> > > > +       jbe     L(ret_zero_page_cross_slow_case0)
> > > > +       subq    $-(VEC_SIZE * 4), %rdx
> > > >
> > > > -       addl    $8, %edx
> > > > -       addl    $8, %eax
> > > > +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > > > +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > > > +# else
> > > > +       leaq    (8 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > > > +       leaq    (8 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > > > +# endif
> > > > +       jmp     L(prepare_loop_aligned)
> > > > +
> > > > +
> > > > +       .p2align 4,, 10
> > > > +L(less_8_till_page):
> > > > +# ifdef USE_AS_WCSCMP
> > > > +       /* If using wchar then this is the only check before we reach
> > > > +          the page boundary.  */
> > > > +       movl    (%rdi), %eax
> > > > +       movl    (%rsi), %ecx
> > > > +       cmpl    %ecx, %eax
> > > > +       jnz     L(ret_less_8_wcs)
> > > >  #  ifdef USE_AS_STRNCMP
> > > > -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> > > > -          (%r11).  */
> > > > -       cmpq    %r11, %rdx
> > > > -       jae     L(zero)
> > > > +       addq    %rdi, %rdx
> > > > +       /* We already checked for len <= 1 so we cannot hit that case
> > > > +          here.  */
> > > >  #  endif
> > > > +       testl   %eax, %eax
> > > > +       jnz     L(prepare_loop_no_len)
> > > > +       ret
> > > >
> > > > -L(cross_page_8bytes):
> > > > -       /* Less than 8 bytes to check, try 4 byte vector.  */
> > > > -       cmpl    $(PAGE_SIZE - 4), %eax
> > > > -       jg      L(cross_page_4bytes)
> > > > -       vmovd   (%rdi, %rdx), %xmm1
> > > > -       vmovd   (%rsi, %rdx), %xmm0
> > > > -       VPCMPEQ %xmm0, %xmm1, %xmm0
> > > > -       VPMINU  %xmm1, %xmm0, %xmm0
> > > > -       VPCMPEQ %xmm7, %xmm0, %xmm0
> > > > -       vpmovmskb %xmm0, %ecx
> > > > -       /* Only last 4 bits are valid.  */
> > > > -       andl    $0xf, %ecx
> > > > -       testl   %ecx, %ecx
> > > > -       jne     L(last_vector)
> > > > +       .p2align 4,, 8
> > > > +L(ret_less_8_wcs):
> > > > +       setl    %OFFSET_REG8
> > > > +       negl    %OFFSET_REG
> > > > +       movl    %OFFSET_REG, %eax
> > > > +       xorl    %r8d, %eax
> > > > +       ret
> > > > +
> > > > +# else
> > > > +
> > > > +       /* Find largest load size we can use.  */
> > > > +       cmpl    $28, %eax
> > > > +       ja      L(less_4_till_page)
> > > > +
> > > > +       vmovd   (%rdi), %xmm0
> > > > +       vmovd   (%rsi), %xmm1
> > > > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > > > +       VPCMPEQ %xmm1, %xmm0, %xmm1
> > > > +       vpandn  %xmm1, %xmm2, %xmm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +       subl    $0xf, %ecx
> > > > +       jnz     L(check_ret_vec_page_cross)
> > > >
> > > > -       addl    $4, %edx
> > > >  #  ifdef USE_AS_STRNCMP
> > > > -       /* Return 0 if the current offset (%rdx) >= the maximum offset
> > > > -          (%r11).  */
> > > > -       cmpq    %r11, %rdx
> > > > -       jae     L(zero)
> > > > +       cmpq    $4, %rdx
> > > > +       jbe     L(ret_zero_page_cross_slow_case1)
> > > >  #  endif
> > > > +       movl    $28, %OFFSET_REG
> > > > +       /* Explicit check for 16 byte alignment.  */
> > > > +       subl    %eax, %OFFSET_REG
> > > >
> > > > -L(cross_page_4bytes):
> > > > -# endif
> > > > -       /* Less than 4 bytes to check, try one byte/dword at a time.  */
> > > > -# ifdef USE_AS_STRNCMP
> > > > -       cmpq    %r11, %rdx
> > > > -       jae     L(zero)
> > > > -# endif
> > > > -# ifdef USE_AS_WCSCMP
> > > > -       movl    (%rdi, %rdx), %eax
> > > > -       movl    (%rsi, %rdx), %ecx
> > > > -# else
> > > > -       movzbl  (%rdi, %rdx), %eax
> > > > -       movzbl  (%rsi, %rdx), %ecx
> > > > -# endif
> > > > -       testl   %eax, %eax
> > > > -       jne     L(cross_page_loop)
> > > > +
> > > > +
> > > > +       vmovd   (%rdi, %OFFSET_REG64), %xmm0
> > > > +       vmovd   (%rsi, %OFFSET_REG64), %xmm1
> > > > +       VPCMPEQ %xmm0, %xmmZERO, %xmm2
> > > > +       VPCMPEQ %xmm1, %xmm0, %xmm1
> > > > +       vpandn  %xmm1, %xmm2, %xmm1
> > > > +       vpmovmskb %ymm1, %ecx
> > > > +       subl    $0xf, %ecx
> > > > +       jnz     L(check_ret_vec_page_cross)
> > > > +
> > > > +#  ifdef USE_AS_STRNCMP
> > > > +       addl    $4, %OFFSET_REG
> > > > +       subq    %OFFSET_REG64, %rdx
> > > > +       jbe     L(ret_zero_page_cross_slow_case1)
> > > > +       subq    $-(VEC_SIZE * 4), %rdx
> > > > +
> > > > +       leaq    -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > > > +       leaq    -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > > > +#  else
> > > > +       leaq    (4 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
> > > > +       leaq    (4 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
> > > > +#  endif
> > > > +       jmp     L(prepare_loop_aligned)
> > > > +
> > > > +#  ifdef USE_AS_STRNCMP
> > > > +       .p2align 4,, 2
> > > > +L(ret_zero_page_cross_slow_case1):
> > > > +       xorl    %eax, %eax
> > > > +       ret
> > > > +#  endif
> > > > +
> > > > +       .p2align 4,, 10
> > > > +L(less_4_till_page):
> > > > +       subq    %rdi, %rsi
> > > > +       /* Extremely slow byte comparison loop.  */
> > > > +L(less_4_loop):
> > > > +       movzbl  (%rdi), %eax
> > > > +       movzbl  (%rsi, %rdi), %ecx
> > > >         subl    %ecx, %eax
> > > > -       VZEROUPPER_RETURN
> > > > -END (STRCMP)
> > > > +       jnz     L(ret_less_4_loop)
> > > > +       testl   %ecx, %ecx
> > > > +       jz      L(ret_zero_4_loop)
> > > > +#  ifdef USE_AS_STRNCMP
> > > > +       decq    %rdx
> > > > +       jz      L(ret_zero_4_loop)
> > > > +#  endif
> > > > +       incq    %rdi
> > > > +       /* End condition is reaching the page boundary (rdi is aligned).  */
> > > > +       testl   $31, %edi
> > > > +       jnz     L(less_4_loop)
> > > > +       leaq    -(VEC_SIZE * 4)(%rdi, %rsi), %rsi
> > > > +       addq    $-(VEC_SIZE * 4), %rdi
> > > > +#  ifdef USE_AS_STRNCMP
> > > > +       subq    $-(VEC_SIZE * 4), %rdx
> > > > +#  endif
> > > > +       jmp     L(prepare_loop_aligned)
> > > > +
> > > > +L(ret_zero_4_loop):
> > > > +       xorl    %eax, %eax
> > > > +       ret
> > > > +L(ret_less_4_loop):
> > > > +       xorl    %r8d, %eax
> > > > +       subl    %r8d, %eax
> > > > +       ret
> > > > +# endif
> > > > +END(STRCMP)
> > > >  #endif
> > > > --
> > > > 2.25.1
> > > >
> > >
> > > LGTM.
> >
> > Should I wait until 2.36 release to push the optimized versions?
>
> Yes, please.
>
> > There are a lot of edge cases with these functions and last time we
> > tried to improve them in:
> >
> > commit c46e9afb2df5fc9e39ff4d13777e4b4c26e04e55
> > Author: H.J. Lu <hjl.tools@gmail.com>
> > Date:   Fri Oct 29 12:40:20 2021 -0700
> >
> >     x86-64: Improve EVEX strcmp with masked load
> >
> >
> > We missed a case:
> > https://bugzilla.redhat.com/show_bug.cgi?id=2026399#c19
>
> Is it a correctness bug?

Yes see: https://sourceware.org/bugzilla/show_bug.cgi?id=28646

Although it was fixed.

>
> > >
> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
> > >
> > > Thanks.
> > >
> > > --
> > > H.J.
>
> Thanks.
>
> --
> H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
  2022-01-09 12:29 [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755] Noah Goldstein
                   ` (5 preceding siblings ...)
  2022-01-10  0:27 ` [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S " Noah Goldstein
@ 2022-01-10 21:35 ` Noah Goldstein
  2022-01-10 21:35   ` [PATCH v3 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S " Noah Goldstein
                     ` (6 more replies)
  6 siblings, 7 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10 21:35 UTC (permalink / raw)
  To: libc-alpha

Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_avx2. For x86_64 this covers the entire address range so any
length larger could not possibly be used to bound `s1` or `s2`.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
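Note for reviewers: a rough C model of the check added below may help; the
function here is purely illustrative (its name and fallback calls are made
up for the example and are not part of the patch):

#include <stddef.h>
#include <wchar.h>

/* Hypothetical sketch of the dispatch.  If any of the top 8 bits of N are
   set, converting N to a byte count may overflow and, on x86_64, such an N
   can never bound a real buffer anyway, so plain wcscmp gives the same
   answer.  */
int
wcsncmp_dispatch_sketch (const wchar_t *s1, const wchar_t *s2, size_t n)
{
  if (n >> 56)
    return wcscmp (s1, s2);

  /* Only now is it safe to convert the unit from wide chars to bytes.  */
  size_t byte_len = n * sizeof (wchar_t);
  (void) byte_len;

  return wcsncmp (s1, s2, n);	/* Stand-in for the vectorized path.  */
}
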
 sysdeps/x86_64/multiarch/strcmp-avx2.S | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
index a45f9d2749..9c73b5899d 100644
--- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
+++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
@@ -87,6 +87,16 @@ ENTRY (STRCMP)
 	je	L(char0)
 	jb	L(zero)
 #  ifdef USE_AS_WCSCMP
+#  ifndef __ILP32__
+	movq	%rdx, %rcx
+	/* Check if length could overflow when multiplied by
+	   sizeof(wchar_t). Checking top 8 bits will cover all potential
+	   overflow cases as well as redirect cases where it's impossible for the
+	   length to bound a valid memory region. In these cases just use
+	   'wcscmp'.  */
+	shrq	$56, %rcx
+	jnz	__wcscmp_avx2
+#  endif
 	/* Convert units: from wide to byte char.  */
 	shl	$2, %RDX_LP
 #  endif
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S [BZ# 28755]
  2022-01-10 21:35 ` [PATCH v3 " Noah Goldstein
@ 2022-01-10 21:35   ` Noah Goldstein
  2022-01-11  2:15     ` H.J. Lu
  2022-01-10 21:35   ` [PATCH v3 3/7] string/test-str*cmp: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp] Noah Goldstein
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10 21:35 UTC (permalink / raw)
  To: libc-alpha

Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_evex. For x86_64 this covers the entire address range so any
length larger could not possibly be used to bound `s1` or `s2`.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 sysdeps/x86_64/multiarch/strcmp-evex.S | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
index 1d971f3889..0cd939d5af 100644
--- a/sysdeps/x86_64/multiarch/strcmp-evex.S
+++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
@@ -104,6 +104,16 @@ ENTRY (STRCMP)
 	je	L(char0)
 	jb	L(zero)
 #  ifdef USE_AS_WCSCMP
+#  ifndef __ILP32__
+	movq	%rdx, %rcx
+	/* Check if length could overflow when multiplied by
+	   sizeof(wchar_t). Checking top 8 bits will cover all potential
+	   overflow cases as well as redirect cases where it's impossible for the
+	   length to bound a valid memory region. In these cases just use
+	   'wcscmp'.  */
+	shrq	$56, %rcx
+	jnz	__wcscmp_evex
+#  endif
 	/* Convert units: from wide to byte char.  */
 	shl	$2, %RDX_LP
 #  endif
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 3/7] string/test-str*cmp: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp].
  2022-01-10 21:35 ` [PATCH v3 " Noah Goldstein
  2022-01-10 21:35   ` [PATCH v3 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S " Noah Goldstein
@ 2022-01-10 21:35   ` Noah Goldstein
  2022-01-10 21:35   ` [PATCH v3 4/7] string: Improve coverage in test-strcmp.c and test-strncmp.c Noah Goldstein
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10 21:35 UTC (permalink / raw)
  To: libc-alpha

These implementations just add to test duration. Since we have
simple_* implementations we already have a safe reference
implementation.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 string/test-strcmp.c  | 35 -----------------------------------
 string/test-strncmp.c | 34 ----------------------------------
 2 files changed, 69 deletions(-)

diff --git a/string/test-strcmp.c b/string/test-strcmp.c
index 3c75076fb8..97d7bf5043 100644
--- a/string/test-strcmp.c
+++ b/string/test-strcmp.c
@@ -34,7 +34,6 @@
 # define STRLEN wcslen
 # define MEMCPY wmemcpy
 # define SIMPLE_STRCMP simple_wcscmp
-# define STUPID_STRCMP stupid_wcscmp
 # define CHAR wchar_t
 # define UCHAR wchar_t
 # define CHARBYTES 4
@@ -64,25 +63,6 @@ simple_wcscmp (const wchar_t *s1, const wchar_t *s2)
   return c1 < c2 ? -1 : 1;
 }
 
-int
-stupid_wcscmp (const wchar_t *s1, const wchar_t *s2)
-{
-  size_t ns1 = wcslen (s1) + 1;
-  size_t ns2 = wcslen (s2) + 1;
-  size_t n = ns1 < ns2 ? ns1 : ns2;
-  int ret = 0;
-
-  wchar_t c1, c2;
-
-  while (n--) {
-    c1 = *s1++;
-    c2 = *s2++;
-    if ((ret = c1 < c2 ? -1 : c1 == c2 ? 0 : 1) != 0)
-      break;
-  }
-  return ret;
-}
-
 #else
 # include <limits.h>
 
@@ -92,7 +72,6 @@ stupid_wcscmp (const wchar_t *s1, const wchar_t *s2)
 # define STRLEN strlen
 # define MEMCPY memcpy
 # define SIMPLE_STRCMP simple_strcmp
-# define STUPID_STRCMP stupid_strcmp
 # define CHAR char
 # define UCHAR unsigned char
 # define CHARBYTES 1
@@ -113,24 +92,10 @@ simple_strcmp (const char *s1, const char *s2)
   return ret;
 }
 
-int
-stupid_strcmp (const char *s1, const char *s2)
-{
-  size_t ns1 = strlen (s1) + 1;
-  size_t ns2 = strlen (s2) + 1;
-  size_t n = ns1 < ns2 ? ns1 : ns2;
-  int ret = 0;
-
-  while (n--)
-    if ((ret = *(unsigned char *) s1++ - *(unsigned char *) s2++) != 0)
-      break;
-  return ret;
-}
 #endif
 
 typedef int (*proto_t) (const CHAR *, const CHAR *);
 
-IMPL (STUPID_STRCMP, 1)
 IMPL (SIMPLE_STRCMP, 1)
 IMPL (STRCMP, 1)
 
diff --git a/string/test-strncmp.c b/string/test-strncmp.c
index e7d5edea39..61a283a0af 100644
--- a/string/test-strncmp.c
+++ b/string/test-strncmp.c
@@ -33,7 +33,6 @@
 # define STRDUP wcsdup
 # define MEMCPY wmemcpy
 # define SIMPLE_STRNCMP simple_wcsncmp
-# define STUPID_STRNCMP stupid_wcsncmp
 # define CHAR wchar_t
 # define UCHAR wchar_t
 # define CHARBYTES 4
@@ -57,25 +56,6 @@ simple_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n)
   return 0;
 }
 
-int
-stupid_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n)
-{
-  wchar_t c1, c2;
-  size_t ns1 = wcsnlen (s1, n) + 1, ns2 = wcsnlen (s2, n) + 1;
-
-  n = ns1 < n ? ns1 : n;
-  n = ns2 < n ? ns2 : n;
-
-  while (n--)
-    {
-      c1 = *s1++;
-      c2 = *s2++;
-      if (c1 != c2)
-	return c1 > c2 ? 1 : -1;
-    }
-  return 0;
-}
-
 #else
 # define L(str) str
 # define STRNCMP strncmp
@@ -83,7 +63,6 @@ stupid_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n)
 # define STRDUP strdup
 # define MEMCPY memcpy
 # define SIMPLE_STRNCMP simple_strncmp
-# define STUPID_STRNCMP stupid_strncmp
 # define CHAR char
 # define UCHAR unsigned char
 # define CHARBYTES 1
@@ -101,23 +80,10 @@ simple_strncmp (const char *s1, const char *s2, size_t n)
   return ret;
 }
 
-int
-stupid_strncmp (const char *s1, const char *s2, size_t n)
-{
-  size_t ns1 = strnlen (s1, n) + 1, ns2 = strnlen (s2, n) + 1;
-  int ret = 0;
-
-  n = ns1 < n ? ns1 : n;
-  n = ns2 < n ? ns2 : n;
-  while (n-- && (ret = *(unsigned char *) s1++ - * (unsigned char *) s2++) == 0);
-  return ret;
-}
-
 #endif
 
 typedef int (*proto_t) (const CHAR *, const CHAR *, size_t);
 
-IMPL (STUPID_STRNCMP, 0)
 IMPL (SIMPLE_STRNCMP, 0)
 IMPL (STRNCMP, 1)
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 4/7] string: Improve coverage in test-strcmp.c and test-strncmp.c
  2022-01-10 21:35 ` [PATCH v3 " Noah Goldstein
  2022-01-10 21:35   ` [PATCH v3 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S " Noah Goldstein
  2022-01-10 21:35   ` [PATCH v3 3/7] string/test-str*cmp: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp] Noah Goldstein
@ 2022-01-10 21:35   ` Noah Goldstein
  2022-01-10 21:35   ` [PATCH v3 5/7] x86: Optimize strcmp-avx2.S Noah Goldstein
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10 21:35 UTC (permalink / raw)
  To: libc-alpha

Add additional test cases for small / medium sizes.

Add tests in test-strncmp.c where `n` is near ULONG_MAX or LONG_MIN to
test for overflow bugs in length handling.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
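A quick illustration of the property the new overflow tests pin down (the
strings and values below are invented for the example, not taken from the
test files): for short, properly null-terminated inputs, a huge `n` must not
change the result, because a correct strncmp stops at the terminator first.

#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <string.h>

static void
huge_n_property_sketch (void)
{
  const char *a = "abc";
  const char *b = "abd";

  /* n values near ULONG_MAX / LONG_MIN would overflow naive offset
     arithmetic, yet the result must match plain strcmp.  */
  assert (strncmp (a, b, SIZE_MAX) < 0);
  assert (strncmp (a, b, (size_t) LONG_MIN) < 0);
  assert (strncmp (a, a, SIZE_MAX) == 0);
}
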
 string/test-strcmp.c  |  70 ++++++++++--
 string/test-strncmp.c | 257 +++++++++++++++++++++++++++++++++++++++---
 2 files changed, 306 insertions(+), 21 deletions(-)

diff --git a/string/test-strcmp.c b/string/test-strcmp.c
index 97d7bf5043..eacbdc8857 100644
--- a/string/test-strcmp.c
+++ b/string/test-strcmp.c
@@ -16,6 +16,9 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
+#define TEST_LEN (4096 * 3)
+#define MIN_PAGE_SIZE (TEST_LEN + 2 * getpagesize ())
+
 #define TEST_MAIN
 #ifdef WIDE
 # define TEST_NAME "wcscmp"
@@ -129,7 +132,7 @@ do_one_test (impl_t *impl,
 
 static void
 do_test (size_t align1, size_t align2, size_t len, int max_char,
-	 int exp_result)
+         int exp_result)
 {
   size_t i;
 
@@ -138,19 +141,22 @@ do_test (size_t align1, size_t align2, size_t len, int max_char,
   if (len == 0)
     return;
 
-  align1 &= 63;
+  align1 &= ~(CHARBYTES - 1);
+  align2 &= ~(CHARBYTES - 1);
+
+  align1 &= getpagesize () - 1;
   if (align1 + (len + 1) * CHARBYTES >= page_size)
     return;
 
-  align2 &= 63;
+  align2 &= getpagesize () - 1;
   if (align2 + (len + 1) * CHARBYTES >= page_size)
     return;
 
   /* Put them close to the end of page.  */
   i = align1 + CHARBYTES * (len + 2);
-  s1 = (CHAR *) (buf1 + ((page_size - i) / 16 * 16) + align1);
+  s1 = (CHAR *)(buf1 + ((page_size - i) / 16 * 16) + align1);
   i = align2 + CHARBYTES * (len + 2);
-  s2 = (CHAR *) (buf2 + ((page_size - i) / 16 * 16)  + align2);
+  s2 = (CHAR *)(buf2 + ((page_size - i) / 16 * 16) + align2);
 
   for (i = 0; i < len; i++)
     s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
@@ -161,9 +167,10 @@ do_test (size_t align1, size_t align2, size_t len, int max_char,
   s2[len - 1] -= exp_result;
 
   FOR_EACH_IMPL (impl, 0)
-    do_one_test (impl, s1, s2, exp_result);
+  do_one_test (impl, s1, s2, exp_result);
 }
 
+
 static void
 do_random_tests (void)
 {
@@ -385,7 +392,7 @@ check3 (void)
 int
 test_main (void)
 {
-  size_t i;
+  size_t i, j;
 
   test_init ();
   check();
@@ -426,6 +433,55 @@ test_main (void)
       do_test (2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, -1);
     }
 
+  for (j = 0; j < 160; ++j)
+    {
+      for (i = 0; i < TEST_LEN;)
+        {
+          do_test (getpagesize () - j - 1, 0, i, 127, 0);
+          do_test (getpagesize () - j - 1, 0, i, 127, 1);
+          do_test (getpagesize () - j - 1, 0, i, 127, -1);
+
+          do_test (getpagesize () - j - 1, j, i, 127, 0);
+          do_test (getpagesize () - j - 1, j, i, 127, 1);
+          do_test (getpagesize () - j - 1, j, i, 127, -1);
+
+          do_test (0, getpagesize () - j - 1, i, 127, 0);
+          do_test (0, getpagesize () - j - 1, i, 127, 1);
+          do_test (0, getpagesize () - j - 1, i, 127, -1);
+
+          do_test (j, getpagesize () - j - 1, i, 127, 0);
+          do_test (j, getpagesize () - j - 1, i, 127, 1);
+          do_test (j, getpagesize () - j - 1, i, 127, -1);
+
+          if (i < 32)
+            {
+              i += 1;
+            }
+          else if (i < 161)
+            {
+              i += 7;
+            }
+          else if (i + 161 < TEST_LEN)
+            {
+              i += 31;
+              i *= 17;
+              i /= 16;
+              if (i + 161 > TEST_LEN)
+                {
+                  i = TEST_LEN - 160;
+                }
+            }
+          else if (i + 32 < TEST_LEN)
+            {
+              i += 7;
+            }
+          else
+            {
+              i += 1;
+            }
+        }
+    }
+
   do_random_tests ();
   return ret;
 }
diff --git a/string/test-strncmp.c b/string/test-strncmp.c
index 61a283a0af..1a3cee1792 100644
--- a/string/test-strncmp.c
+++ b/string/test-strncmp.c
@@ -16,6 +16,9 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
+#define TEST_LEN (4096 * 3)
+#define MIN_PAGE_SIZE (TEST_LEN + 2 * getpagesize ())
+
 #define TEST_MAIN
 #ifdef WIDE
 # define TEST_NAME "wcsncmp"
@@ -166,11 +169,11 @@ do_test_limit (size_t align1, size_t align2, size_t len, size_t n, int max_char,
 }
 
 static void
-do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
-	 int exp_result)
+do_test_n (size_t align1, size_t align2, size_t len, size_t n, int n_in_bounds,
+           int max_char, int exp_result)
 {
-  size_t i;
-  CHAR *s1, *s2;
+  size_t i, buf_bound;
+  CHAR *s1, *s2, *s1_end, *s2_end;
 
   align1 &= ~(CHARBYTES - 1);
   align2 &= ~(CHARBYTES - 1);
@@ -178,22 +181,28 @@ do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
   if (n == 0)
     return;
 
-  align1 &= 63;
-  if (align1 + (n + 1) * CHARBYTES >= page_size)
+  buf_bound = n_in_bounds ? n : len;
+
+  align1 &= getpagesize () - 1;
+  if (align1 + (buf_bound + 2) * CHARBYTES >= page_size)
     return;
 
-  align2 &= 63;
-  if (align2 + (n + 1) * CHARBYTES >= page_size)
+  align2 &= getpagesize () - 1;
+  if (align2 + (buf_bound + 2) * CHARBYTES >= page_size)
     return;
 
-  s1 = (CHAR *) (buf1 + align1);
-  s2 = (CHAR *) (buf2 + align2);
+  s1 = (CHAR *)(buf1 + align1);
+  s2 = (CHAR *)(buf2 + align2);
 
-  for (i = 0; i < n; i++)
+  if (n_in_bounds)
+    {
+      s1[n] = 24 + exp_result;
+      s2[n] = 23;
+    }
+
+  for (i = 0; i < buf_bound; i++)
     s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
 
-  s1[n] = 24 + exp_result;
-  s2[n] = 23;
   s1[len] = 0;
   s2[len] = 0;
   if (exp_result < 0)
@@ -203,10 +212,24 @@ do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
   if (len >= n)
     s2[n - 1] -= exp_result;
 
+  /* Ensure that both s1 and s2 are valid null terminated strings. This is
+     required by the standard.  */
+  s1_end = (CHAR *)(buf1 + MIN_PAGE_SIZE - CHARBYTES);
+  s2_end = (CHAR *)(buf2 + MIN_PAGE_SIZE - CHARBYTES);
+  *s1_end = 0;
+  *s2_end = 0;
+
   FOR_EACH_IMPL (impl, 0)
     do_one_test (impl, s1, s2, n, exp_result);
 }
 
+static void
+do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char,
+         int exp_result)
+{
+  do_test_n (align1, align2, len, n, 1, max_char, exp_result);
+}
+
 static void
 do_page_test (size_t offset1, size_t offset2, CHAR *s2)
 {
@@ -400,10 +423,123 @@ check3 (void)
 	}
 }
 
+static void
+check_overflow (void)
+{
+  size_t i, j, of_mask, of_idx;
+  const size_t of_masks[]
+      = { ULONG_MAX, LONG_MIN, ULONG_MAX - (ULONG_MAX >> 2),
+          ((size_t)LONG_MAX) >> 1 };
+
+  for (of_idx = 0; of_idx < sizeof (of_masks) / sizeof (of_masks[0]); ++of_idx)
+    {
+      of_mask = of_masks[of_idx];
+      for (j = 0; j < 160; ++j)
+        {
+          for (i = 1; i <= 161; i += (32 / sizeof (CHAR)))
+            {
+              do_test_n (j, 0, i, of_mask, 0, 127, 0);
+              do_test_n (j, 0, i, of_mask, 0, 127, 1);
+              do_test_n (j, 0, i, of_mask, 0, 127, -1);
+
+              do_test_n (j, 0, i, of_mask - j / 2, 0, 127, 0);
+              do_test_n (j, 0, i, of_mask - j * 2, 0, 127, 1);
+              do_test_n (j, 0, i, of_mask - j, 0, 127, -1);
+
+              do_test_n (j / 2, j, i, of_mask, 0, 127, 0);
+              do_test_n (j / 2, j, i, of_mask, 0, 127, 1);
+              do_test_n (j / 2, j, i, of_mask, 0, 127, -1);
+
+              do_test_n (j / 2, j, i, of_mask - j, 0, 127, 0);
+              do_test_n (j / 2, j, i, of_mask - j / 2, 0, 127, 1);
+              do_test_n (j / 2, j, i, of_mask - j * 2, 0, 127, -1);
+
+              do_test_n (0, j, i, of_mask - j * 2, 0, 127, 0);
+              do_test_n (0, j, i, of_mask - j, 0, 127, 1);
+              do_test_n (0, j, i, of_mask - j / 2, 0, 127, -1);
+
+              do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, 0);
+              do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, 1);
+              do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, -1);
+
+              do_test_n (getpagesize () - j - 1, 0, i, of_mask - j / 2, 0, 127,
+                         0);
+              do_test_n (getpagesize () - j - 1, 0, i, of_mask - j * 2, 0, 127,
+                         1);
+              do_test_n (getpagesize () - j - 1, 0, i, of_mask - j, 0, 127,
+                         -1);
+
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
+                         of_mask, 0, 127, 0);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
+                         of_mask, 0, 127, 1);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
+                         of_mask, 0, 127, -1);
+
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
+                         of_mask - j, 0, 127, 0);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
+                         of_mask - j / 2, 0, 127, 1);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i,
+                         of_mask - j * 2, 0, 127, -1);
+            }
+
+          for (i = 1; i < TEST_LEN; i += i)
+            {
+              do_test_n (j, 0, i - 1, of_mask, 0, 127, 0);
+              do_test_n (j, 0, i - 1, of_mask, 0, 127, 1);
+              do_test_n (j, 0, i - 1, of_mask, 0, 127, -1);
+
+              do_test_n (j, 0, i - 1, of_mask - j / 2, 0, 127, 0);
+              do_test_n (j, 0, i - 1, of_mask - j * 2, 0, 127, 1);
+              do_test_n (j, 0, i - 1, of_mask - j, 0, 127, -1);
+
+              do_test_n (j / 2, j, i - 1, of_mask, 0, 127, 0);
+              do_test_n (j / 2, j, i - 1, of_mask, 0, 127, 1);
+              do_test_n (j / 2, j, i - 1, of_mask, 0, 127, -1);
+
+              do_test_n (j / 2, j, i - 1, of_mask - j, 0, 127, 0);
+              do_test_n (j / 2, j, i - 1, of_mask - j / 2, 0, 127, 1);
+              do_test_n (j / 2, j, i - 1, of_mask - j * 2, 0, 127, -1);
+
+              do_test_n (0, j, i - 1, of_mask - j * 2, 0, 127, 0);
+              do_test_n (0, j, i - 1, of_mask - j, 0, 127, 1);
+              do_test_n (0, j, i - 1, of_mask - j / 2, 0, 127, -1);
+
+              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127, 0);
+              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127, 1);
+              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127,
+                         -1);
+
+              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j / 2, 0,
+                         127, 0);
+              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j * 2, 0,
+                         127, 1);
+              do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j, 0, 127,
+                         -1);
+
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
+                         i - 1, of_mask, 0, 127, 0);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
+                         i - 1, of_mask, 0, 127, 1);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
+                         i - 1, of_mask, 0, 127, -1);
+
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
+                         i - 1, of_mask - j, 0, 127, 0);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
+                         i - 1, of_mask - j / 2, 0, 127, 1);
+              do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1,
+                         i - 1, of_mask - j * 2, 0, 127, -1);
+            }
+        }
+    }
+}
+
 int
 test_main (void)
 {
-  size_t i;
+  size_t i, j;
 
   test_init ();
 
@@ -470,6 +606,99 @@ test_main (void)
       do_test_limit (0, 0, 15 - i, 16 - i, 255, -1);
     }
 
+  for (j = 0; j < 160; ++j)
+    {
+      for (i = 0; i < TEST_LEN;)
+        {
+          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, 0, i, i, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, 0, i, i - 1, 0, 127, 0);
+
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, j, i, i, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, j, i, i - 1, 0, 127, 0);
+
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, -1);
+
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, 0);
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, 1);
+          do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, -1);
+
+          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, 0);
+          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, 1);
+          do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, -1);
+
+          do_test_n (0, getpagesize () - j - 1, i, i, 0, 127, 0);
+          do_test_n (0, getpagesize () - j - 1, i, i - 1, 0, 127, 0);
+
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 0);
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 1);
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, -1);
+
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 0);
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 1);
+          do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, -1);
+
+          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, 0);
+          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, 1);
+          do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, -1);
+
+          do_test_n (j, getpagesize () - j - 1, i, i, 0, 127, 0);
+          do_test_n (j, getpagesize () - j - 1, i, i - 1, 0, 127, 0);
+
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 0);
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 1);
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, -1);
+
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 0);
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 1);
+          do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, -1);
+          if (i < 32)
+            {
+              i += 1;
+            }
+          else if (i < 161)
+            {
+              i += 7;
+            }
+          else if (i + 161 < TEST_LEN)
+            {
+              i += 31;
+              i *= 17;
+              i /= 16;
+              if (i + 161 > TEST_LEN)
+                {
+                  i = TEST_LEN - 160;
+                }
+            }
+          else if (i + 32 < TEST_LEN)
+            {
+              i += 7;
+            }
+          else
+            {
+              i += 1;
+            }
+        }
+    }
+
+  check_overflow ();
   do_random_tests ();
   return ret;
 }
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-01-10 21:35 ` [PATCH v3 " Noah Goldstein
                     ` (2 preceding siblings ...)
  2022-01-10 21:35   ` [PATCH v3 4/7] string: Improve coverage in test-strcmp.c and test-strncmp.c Noah Goldstein
@ 2022-01-10 21:35   ` Noah Goldstein
  2022-02-14 14:10     ` Andreas Schwab
  2022-01-10 21:35   ` [PATCH v3 6/7] x86: Optimize strcmp-evex.S Noah Goldstein
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10 21:35 UTC (permalink / raw)
  To: libc-alpha

Optimizations are primarily to the loop logic and how the page cross
logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulting check and to improve the
logic for entering the loop afterwards. This only affects particular cases,
however, and is generally made up for by more than 10x improvements on
the transition from the page cross -> loop case.

The non-page cross cases are improved most for smaller sizes [0, 128]
and are about even for (128, 4096]. The loop page cross logic is
also improved, so a more significant speedup is seen there as well.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
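For reference, here is a C model of the cheap page-cross filter the
rewritten entry path performs before the first 4x-vector compare; the macro
values match the file, but the function itself is only an illustration, not
code from the patch:

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define VEC_SIZE  32	/* Bytes per ymm register.  */

/* True when fewer than 4 * VEC_SIZE bytes remain before the end of the
   page for either string, i.e. when the first four vector loads might
   fault.  OR-ing the two page offsets over-approximates (false positives),
   which is why the slow path re-checks the condition per string.  */
static bool
may_cross_page_sketch (const char *s1, const char *s2)
{
  uintptr_t off = ((uintptr_t) s1 | (uintptr_t) s2) & (PAGE_SIZE - 1);
  return off > PAGE_SIZE - (VEC_SIZE * 4);
}
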
 sysdeps/x86_64/multiarch/strcmp-avx2.S | 1590 ++++++++++++++----------
 1 file changed, 939 insertions(+), 651 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
index 9c73b5899d..28d6a0025a 100644
--- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
+++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
@@ -26,35 +26,57 @@
 
 # define PAGE_SIZE	4096
 
-/* VEC_SIZE = Number of bytes in a ymm register */
+	/* VEC_SIZE = Number of bytes in a ymm register.  */
 # define VEC_SIZE	32
 
-/* Shift for dividing by (VEC_SIZE * 4).  */
-# define DIVIDE_BY_VEC_4_SHIFT	7
-# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
-#  error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
-# endif
+# define VMOVU	vmovdqu
+# define VMOVA	vmovdqa
 
 # ifdef USE_AS_WCSCMP
-/* Compare packed dwords.  */
+	/* Compare packed dwords.  */
 #  define VPCMPEQ	vpcmpeqd
-/* Compare packed dwords and store minimum.  */
+	/* Compare packed dwords and store minimum.  */
 #  define VPMINU	vpminud
-/* 1 dword char == 4 bytes.  */
+	/* 1 dword char == 4 bytes.  */
 #  define SIZE_OF_CHAR	4
 # else
-/* Compare packed bytes.  */
+	/* Compare packed bytes.  */
 #  define VPCMPEQ	vpcmpeqb
-/* Compare packed bytes and store minimum.  */
+	/* Compare packed bytes and store minimum.  */
 #  define VPMINU	vpminub
-/* 1 byte char == 1 byte.  */
+	/* 1 byte char == 1 byte.  */
 #  define SIZE_OF_CHAR	1
 # endif
 
+# ifdef USE_AS_STRNCMP
+#  define LOOP_REG	r9d
+#  define LOOP_REG64	r9
+
+#  define OFFSET_REG8	r9b
+#  define OFFSET_REG	r9d
+#  define OFFSET_REG64	r9
+# else
+#  define LOOP_REG	edx
+#  define LOOP_REG64	rdx
+
+#  define OFFSET_REG8	dl
+#  define OFFSET_REG	edx
+#  define OFFSET_REG64	rdx
+# endif
+
 # ifndef VZEROUPPER
 #  define VZEROUPPER	vzeroupper
 # endif
 
+# if defined USE_AS_STRNCMP
+#  define VEC_OFFSET	0
+# else
+#  define VEC_OFFSET	(-VEC_SIZE)
+# endif
+
+# define xmmZERO	xmm15
+# define ymmZERO	ymm15
+
 # ifndef SECTION
 #  define SECTION(p)	p##.avx
 # endif
@@ -79,783 +101,1049 @@
    the maximum offset is reached before a difference is found, zero is
    returned.  */
 
-	.section SECTION(.text),"ax",@progbits
-ENTRY (STRCMP)
+	.section SECTION(.text), "ax", @progbits
+ENTRY(STRCMP)
 # ifdef USE_AS_STRNCMP
-	/* Check for simple cases (0 or 1) in offset.  */
+#  ifdef __ILP32__
+	/* Clear the upper 32 bits.  */
+	movl	%edx, %rdx
+#  endif
 	cmp	$1, %RDX_LP
-	je	L(char0)
-	jb	L(zero)
+	/* Signed comparison intentional. We use this branch to also
+	   test cases where length >= 2^63. These very large sizes can be
+	   handled with strcmp as there is no way for that length to
+	   actually bound the buffer.  */
+	jle	L(one_or_less)
 #  ifdef USE_AS_WCSCMP
-#  ifndef __ILP32__
 	movq	%rdx, %rcx
-	/* Check if length could overflow when multiplied by
-	   sizeof(wchar_t). Checking top 8 bits will cover all potential
-	   overflow cases as well as redirect cases where it's impossible for the
-	   length to bound a valid memory region. In these cases just use
-	   'wcscmp'.  */
+
+	/* Multiplying length by sizeof(wchar_t) can result in overflow.
+	   Check if that is possible. All cases where overflow is possible
+	   are cases where length is large enough that it can never be a
+	   bound on valid memory so just use wcscmp.  */
 	shrq	$56, %rcx
 	jnz	__wcscmp_avx2
+
+	leaq	(, %rdx, 4), %rdx
 #  endif
-	/* Convert units: from wide to byte char.  */
-	shl	$2, %RDX_LP
-#  endif
-	/* Register %r11 tracks the maximum offset.  */
-	mov	%RDX_LP, %R11_LP
 # endif
+	vpxor	%xmmZERO, %xmmZERO, %xmmZERO
 	movl	%edi, %eax
-	xorl	%edx, %edx
-	/* Make %xmm7 (%ymm7) all zeros in this function.  */
-	vpxor	%xmm7, %xmm7, %xmm7
 	orl	%esi, %eax
-	andl	$(PAGE_SIZE - 1), %eax
-	cmpl	$(PAGE_SIZE - (VEC_SIZE * 4)), %eax
-	jg	L(cross_page)
-	/* Start comparing 4 vectors.  */
-	vmovdqu	(%rdi), %ymm1
-	VPCMPEQ	(%rsi), %ymm1, %ymm0
-	VPMINU	%ymm1, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-	vpmovmskb %ymm0, %ecx
-	testl	%ecx, %ecx
-	je	L(next_3_vectors)
-	tzcntl	%ecx, %edx
+	sall	$20, %eax
+	/* Check if s1 or s2 may cross a page  in next 4x VEC loads.  */
+	cmpl	$((PAGE_SIZE -(VEC_SIZE * 4)) << 20), %eax
+	ja	L(page_cross)
+
+L(no_page_cross):
+	/* Safe to compare 4x vectors.  */
+	VMOVU	(%rdi), %ymm0
+	/* 1s where s1 and s2 equal.  */
+	VPCMPEQ	(%rsi), %ymm0, %ymm1
+	/* 1s at null CHAR.  */
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	/* 1s where s1 and s2 equal AND not null CHAR.  */
+	vpandn	%ymm1, %ymm2, %ymm1
+
+	/* All 1s -> keep going, any 0s -> return.  */
+	vpmovmskb %ymm1, %ecx
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx) is after the maximum
-	   offset (%r11).   */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$VEC_SIZE, %rdx
+	jbe	L(vec_0_test_len)
 # endif
+
+	/* All 1s represents all equals. incl will overflow to zero in
+	   all equals case. Otherwise 1s will carry until position of first
+	   mismatch.  */
+	incl	%ecx
+	jz	L(more_3x_vec)
+
+	.p2align 4,, 4
+L(return_vec_0):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_WCSCMP
+	movl	(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	je	L(return)
-L(wcscmp_return):
+	cmpl	(%rsi, %rcx), %edx
+	je	L(ret0)
 	setl	%al
 	negl	%eax
 	orl	$1, %eax
-L(return):
 # else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 # endif
+L(ret0):
 L(return_vzeroupper):
 	ZERO_UPPER_VEC_REGISTERS_RETURN
 
-	.p2align 4
-L(return_vec_size):
-	tzcntl	%ecx, %edx
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after
-	   the maximum offset (%r11).  */
-	addq	$VEC_SIZE, %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+	.p2align 4,, 8
+L(vec_0_test_len):
+	notl	%ecx
+	bzhil	%edx, %ecx, %eax
+	jnz	L(return_vec_0)
+	/* Align if will cross fetch block.  */
+	.p2align 4,, 2
+L(ret_zero):
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
-# else
+	VZEROUPPER_RETURN
+
+	.p2align 4,, 5
+L(one_or_less):
+	jb	L(ret_zero)
 #  ifdef USE_AS_WCSCMP
+	/* 'nbe' covers the case where length is negative (large
+	   unsigned).  */
+	jnbe	__wcscmp_avx2
+	movl	(%rdi), %edx
 	xorl	%eax, %eax
-	movl	VEC_SIZE(%rdi, %rdx), %ecx
-	cmpl	VEC_SIZE(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	cmpl	(%rsi), %edx
+	je	L(ret1)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 #  else
-	movzbl	VEC_SIZE(%rdi, %rdx), %eax
-	movzbl	VEC_SIZE(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	/* 'nbe' covers the case where length is negative (large
+	   unsigned).  */
+
+	jnbe	__strcmp_avx2
+	movzbl	(%rdi), %eax
+	movzbl	(%rsi), %ecx
+	subl	%ecx, %eax
 #  endif
+L(ret1):
+	ret
 # endif
-	VZEROUPPER_RETURN
 
-	.p2align 4
-L(return_2_vec_size):
-	tzcntl	%ecx, %edx
+	.p2align 4,, 10
+L(return_vec_1):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is
-	   after the maximum offset (%r11).  */
-	addq	$(VEC_SIZE * 2), %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+	/* rdx must be > CHAR_PER_VEC so it is safe to subtract without fear
+	   of overflow.  */
+	addq	$-VEC_SIZE, %rdx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero)
+# endif
+# ifdef USE_AS_WCSCMP
+	movl	VEC_SIZE(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	VEC_SIZE(%rsi, %rcx), %edx
+	je	L(ret2)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 # else
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rdi, %rdx), %ecx
-	cmpl	(VEC_SIZE * 2)(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rdi, %rdx), %eax
-	movzbl	(VEC_SIZE * 2)(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	movzbl	VEC_SIZE(%rdi, %rcx), %eax
+	movzbl	VEC_SIZE(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 # endif
+L(ret2):
 	VZEROUPPER_RETURN
 
-	.p2align 4
-L(return_3_vec_size):
-	tzcntl	%ecx, %edx
+	.p2align 4,, 10
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is
-	   after the maximum offset (%r11).  */
-	addq	$(VEC_SIZE * 3), %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+L(return_vec_3):
+	salq	$32, %rcx
+# endif
+
+L(return_vec_2):
+# ifndef USE_AS_STRNCMP
+	tzcntl	%ecx, %ecx
+# else
+	tzcntq	%rcx, %rcx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 2)(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(VEC_SIZE * 2)(%rsi, %rcx), %edx
+	je	L(ret3)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 # else
+	movzbl	(VEC_SIZE * 2)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 2)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+# endif
+L(ret3):
+	VZEROUPPER_RETURN
+
+# ifndef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_vec_3):
+	tzcntl	%ecx, %ecx
 #  ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 3)(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(VEC_SIZE * 3)(%rdi, %rdx), %ecx
-	cmpl	(VEC_SIZE * 3)(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	cmpl	(VEC_SIZE * 3)(%rsi, %rcx), %edx
+	je	L(ret4)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 #  else
-	movzbl	(VEC_SIZE * 3)(%rdi, %rdx), %eax
-	movzbl	(VEC_SIZE * 3)(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	movzbl	(VEC_SIZE * 3)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 3)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 #  endif
-# endif
+L(ret4):
 	VZEROUPPER_RETURN
+# endif
+
+	.p2align 4,, 10
+L(more_3x_vec):
+	/* Safe to compare 4x vectors.  */
+	VMOVU	VEC_SIZE(%rdi), %ymm0
+	VPCMPEQ	VEC_SIZE(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_1)
+
+# ifdef USE_AS_STRNCMP
+	subq	$(VEC_SIZE * 2), %rdx
+	jbe	L(ret_zero)
+# endif
+
+	VMOVU	(VEC_SIZE * 2)(%rdi), %ymm0
+	VPCMPEQ	(VEC_SIZE * 2)(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_2)
+
+	VMOVU	(VEC_SIZE * 3)(%rdi), %ymm0
+	VPCMPEQ	(VEC_SIZE * 3)(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_3)
 
-	.p2align 4
-L(next_3_vectors):
-	vmovdqu	VEC_SIZE(%rdi), %ymm6
-	VPCMPEQ	VEC_SIZE(%rsi), %ymm6, %ymm3
-	VPMINU	%ymm6, %ymm3, %ymm3
-	VPCMPEQ	%ymm7, %ymm3, %ymm3
-	vpmovmskb %ymm3, %ecx
-	testl	%ecx, %ecx
-	jne	L(return_vec_size)
-	vmovdqu	(VEC_SIZE * 2)(%rdi), %ymm5
-	vmovdqu	(VEC_SIZE * 3)(%rdi), %ymm4
-	vmovdqu	(VEC_SIZE * 3)(%rsi), %ymm0
-	VPCMPEQ	(VEC_SIZE * 2)(%rsi), %ymm5, %ymm2
-	VPMINU	%ymm5, %ymm2, %ymm2
-	VPCMPEQ	%ymm4, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm2, %ymm2
-	vpmovmskb %ymm2, %ecx
-	testl	%ecx, %ecx
-	jne	L(return_2_vec_size)
-	VPMINU	%ymm4, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-	vpmovmskb %ymm0, %ecx
-	testl	%ecx, %ecx
-	jne	L(return_3_vec_size)
-L(main_loop_header):
-	leaq	(VEC_SIZE * 4)(%rdi), %rdx
-	movl	$PAGE_SIZE, %ecx
-	/* Align load via RAX.  */
-	andq	$-(VEC_SIZE * 4), %rdx
-	subq	%rdi, %rdx
-	leaq	(%rdi, %rdx), %rax
 # ifdef USE_AS_STRNCMP
-	/* Starting from this point, the maximum offset, or simply the
-	   'offset', DECREASES by the same amount when base pointers are
-	   moved forward.  Return 0 when:
-	     1) On match: offset <= the matched vector index.
-	     2) On mistmach, offset is before the mistmatched index.
+	cmpq	$(VEC_SIZE * 2), %rdx
+	jbe	L(ret_zero)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	/* Any non-zero positive value that doesn't interfere with 0x1.
 	 */
-	subq	%rdx, %r11
-	jbe	L(zero)
-# endif
-	addq	%rsi, %rdx
-	movq	%rdx, %rsi
-	andl	$(PAGE_SIZE - 1), %esi
-	/* Number of bytes before page crossing.  */
-	subq	%rsi, %rcx
-	/* Number of VEC_SIZE * 4 blocks before page crossing.  */
-	shrq	$DIVIDE_BY_VEC_4_SHIFT, %rcx
-	/* ESI: Number of VEC_SIZE * 4 blocks before page crossing.   */
-	movl	%ecx, %esi
-	jmp	L(loop_start)
+	movl	$2, %r8d
 
+# else
+	xorl	%r8d, %r8d
+# endif
+
+	/* The prepare labels are various entry points from the page
+	   cross logic.  */
+L(prepare_loop):
+
+# ifdef USE_AS_STRNCMP
+	/* Store N + (VEC_SIZE * 4) and place check at the beginning of
+	   the loop.  */
+	leaq	(VEC_SIZE * 2)(%rdi, %rdx), %rdx
+# endif
+L(prepare_loop_no_len):
+
+	/* Align s1 and adjust s2 accordingly.  */
+	subq	%rdi, %rsi
+	andq	$-(VEC_SIZE * 4), %rdi
+	addq	%rdi, %rsi
+
+# ifdef USE_AS_STRNCMP
+	subq	%rdi, %rdx
+# endif
+
+L(prepare_loop_aligned):
+	/* eax stores distance from rsi to next page cross. These cases
+	   need to be handled specially as the 4x loop could potentially
+	   read memory past the length of s1 or s2 and across a page
+	   boundary.  */
+	movl	$-(VEC_SIZE * 4), %eax
+	subl	%esi, %eax
+	andl	$(PAGE_SIZE - 1), %eax
+
+	/* Loop 4x comparisons at a time.  */
 	.p2align 4
 L(loop):
+
+	/* End condition for strncmp.  */
 # ifdef USE_AS_STRNCMP
-	/* Base pointers are moved forward by 4 * VEC_SIZE.  Decrease
-	   the maximum offset (%r11) by the same amount.  */
-	subq	$(VEC_SIZE * 4), %r11
-	jbe	L(zero)
-# endif
-	addq	$(VEC_SIZE * 4), %rax
-	addq	$(VEC_SIZE * 4), %rdx
-L(loop_start):
-	testl	%esi, %esi
-	leal	-1(%esi), %esi
-	je	L(loop_cross_page)
-L(back_to_loop):
-	/* Main loop, comparing 4 vectors are a time.  */
-	vmovdqa	(%rax), %ymm0
-	vmovdqa	VEC_SIZE(%rax), %ymm3
-	VPCMPEQ	(%rdx), %ymm0, %ymm4
-	VPCMPEQ	VEC_SIZE(%rdx), %ymm3, %ymm1
-	VPMINU	%ymm0, %ymm4, %ymm4
-	VPMINU	%ymm3, %ymm1, %ymm1
-	vmovdqa	(VEC_SIZE * 2)(%rax), %ymm2
-	VPMINU	%ymm1, %ymm4, %ymm0
-	vmovdqa	(VEC_SIZE * 3)(%rax), %ymm3
-	VPCMPEQ	(VEC_SIZE * 2)(%rdx), %ymm2, %ymm5
-	VPCMPEQ	(VEC_SIZE * 3)(%rdx), %ymm3, %ymm6
-	VPMINU	%ymm2, %ymm5, %ymm5
-	VPMINU	%ymm3, %ymm6, %ymm6
-	VPMINU	%ymm5, %ymm0, %ymm0
-	VPMINU	%ymm6, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-
-	/* Test each mask (32 bits) individually because for VEC_SIZE
-	   == 32 is not possible to OR the four masks and keep all bits
-	   in a 64-bit integer register, differing from SSE2 strcmp
-	   where ORing is possible.  */
-	vpmovmskb %ymm0, %ecx
+	subq	$(VEC_SIZE * 4), %rdx
+	jbe	L(ret_zero)
+# endif
+
+	subq	$-(VEC_SIZE * 4), %rdi
+	subq	$-(VEC_SIZE * 4), %rsi
+
+	/* Check if rsi loads will cross a page boundary.  */
+	addl	$-(VEC_SIZE * 4), %eax
+	jnb	L(page_cross_during_loop)
+
+	/* Loop entry after handling page cross during loop.  */
+L(loop_skip_page_cross_check):
+	VMOVA	(VEC_SIZE * 0)(%rdi), %ymm0
+	VMOVA	(VEC_SIZE * 1)(%rdi), %ymm2
+	VMOVA	(VEC_SIZE * 2)(%rdi), %ymm4
+	VMOVA	(VEC_SIZE * 3)(%rdi), %ymm6
+
+	/* ymm1 all 1s where s1 and s2 equal. All 0s otherwise.  */
+	VPCMPEQ	(VEC_SIZE * 0)(%rsi), %ymm0, %ymm1
+
+	VPCMPEQ	(VEC_SIZE * 1)(%rsi), %ymm2, %ymm3
+	VPCMPEQ	(VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
+	VPCMPEQ	(VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
+
+
+	/* If any mismatches or null CHAR then 0 CHAR, otherwise non-
+	   zero.  */
+	vpand	%ymm0, %ymm1, %ymm1
+
+
+	vpand	%ymm2, %ymm3, %ymm3
+	vpand	%ymm4, %ymm5, %ymm5
+	vpand	%ymm6, %ymm7, %ymm7
+
+	VPMINU	%ymm1, %ymm3, %ymm3
+	VPMINU	%ymm5, %ymm7, %ymm7
+
+	/* Reduce all 0 CHARs for the 4x VEC into ymm7.  */
+	VPMINU	%ymm3, %ymm7, %ymm7
+
+	/* If any 0 CHAR then done.  */
+	VPCMPEQ	%ymm7, %ymmZERO, %ymm7
+	vpmovmskb %ymm7, %LOOP_REG
+	testl	%LOOP_REG, %LOOP_REG
+	jz	L(loop)
+
+	/* Find which VEC has the mismatch of end of string.  */
+	VPCMPEQ	%ymm1, %ymmZERO, %ymm1
+	vpmovmskb %ymm1, %ecx
 	testl	%ecx, %ecx
-	je	L(loop)
-	VPCMPEQ	%ymm7, %ymm4, %ymm0
-	vpmovmskb %ymm0, %edi
-	testl	%edi, %edi
-	je	L(test_vec)
-	tzcntl	%edi, %ecx
+	jnz	L(return_vec_0_end)
+
+
+	VPCMPEQ	%ymm3, %ymmZERO, %ymm3
+	vpmovmskb %ymm3, %ecx
+	testl	%ecx, %ecx
+	jnz	L(return_vec_1_end)
+
+L(return_vec_2_3_end):
 # ifdef USE_AS_STRNCMP
-	cmpq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	subq	$(VEC_SIZE * 2), %rdx
+	jbe	L(ret_zero_end)
+# endif
+
+	VPCMPEQ	%ymm5, %ymmZERO, %ymm5
+	vpmovmskb %ymm5, %ecx
+	testl	%ecx, %ecx
+	jnz	L(return_vec_2_end)
+
+	/* LOOP_REG contains matches for null/mismatch from the loop. If
+	   VEC 0,1,and 2 all have no null and no mismatches then mismatch
+	   must entirely be from VEC 3 which is fully represented by
+	   LOOP_REG.  */
+	tzcntl	%LOOP_REG, %LOOP_REG
+
+# ifdef USE_AS_STRNCMP
+	subl	$-(VEC_SIZE), %LOOP_REG
+	cmpq	%LOOP_REG64, %rdx
+	jbe	L(ret_zero_end)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %ecx
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
+	je	L(ret5)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	movzbl	(VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %eax
+	movzbl	(VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret5):
 	VZEROUPPER_RETURN
 
-	.p2align 4
-L(test_vec):
 # ifdef USE_AS_STRNCMP
-	/* The first vector matched.  Return 0 if the maximum offset
-	   (%r11) <= VEC_SIZE.  */
-	cmpq	$VEC_SIZE, %r11
-	jbe	L(zero)
+	.p2align 4,, 2
+L(ret_zero_end):
+	xorl	%eax, %eax
+	VZEROUPPER_RETURN
 # endif
-	VPCMPEQ	%ymm7, %ymm1, %ymm1
-	vpmovmskb %ymm1, %ecx
-	testl	%ecx, %ecx
-	je	L(test_2_vec)
-	tzcntl	%ecx, %edi
+
+
+	/* The L(return_vec_N_end) differ from L(return_vec_N) in that
+	   they use the value of `r8` to negate the return value. This is
+	   because the page cross logic can swap `rdi` and `rsi`.  */
+	.p2align 4,, 10
 # ifdef USE_AS_STRNCMP
-	addq	$VEC_SIZE, %rdi
-	cmpq	%rdi, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+L(return_vec_1_end):
+	salq	$32, %rcx
+# endif
+L(return_vec_0_end):
+# ifndef USE_AS_STRNCMP
+	tzcntl	%ecx, %ecx
+# else
+	tzcntq	%rcx, %rcx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero_end)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	movl	(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rsi, %rdi), %ecx
-	cmpl	(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rdi), %eax
-	movzbl	(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(%rsi, %rcx), %edx
+	je	L(ret6)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
+# endif
+L(ret6):
+	VZEROUPPER_RETURN
+
+# ifndef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_vec_1_end):
+	tzcntl	%ecx, %ecx
 #  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	movl	VEC_SIZE(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	VEC_SIZE(%rsi, %rdi), %ecx
-	cmpl	VEC_SIZE(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
+	cmpl	VEC_SIZE(%rsi, %rcx), %edx
+	je	L(ret7)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 #  else
-	movzbl	VEC_SIZE(%rax, %rdi), %eax
-	movzbl	VEC_SIZE(%rdx, %rdi), %edx
-	subl	%edx, %eax
+	movzbl	VEC_SIZE(%rdi, %rcx), %eax
+	movzbl	VEC_SIZE(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 #  endif
-# endif
+L(ret7):
 	VZEROUPPER_RETURN
+# endif
 
-	.p2align 4
-L(test_2_vec):
+	.p2align 4,, 10
+L(return_vec_2_end):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_STRNCMP
-	/* The first 2 vectors matched.  Return 0 if the maximum offset
-	   (%r11) <= 2 * VEC_SIZE.  */
-	cmpq	$(VEC_SIZE * 2), %r11
-	jbe	L(zero)
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero_page_cross)
 # endif
-	VPCMPEQ	%ymm7, %ymm5, %ymm5
-	vpmovmskb %ymm5, %ecx
-	testl	%ecx, %ecx
-	je	L(test_3_vec)
-	tzcntl	%ecx, %edi
-# ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 2), %rdi
-	cmpq	%rdi, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+# ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 2)(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rsi, %rdi), %ecx
-	cmpl	(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rdi), %eax
-	movzbl	(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(VEC_SIZE * 2)(%rsi, %rcx), %edx
+	je	L(ret11)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rsi, %rdi), %ecx
-	cmpl	(VEC_SIZE * 2)(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rax, %rdi), %eax
-	movzbl	(VEC_SIZE * 2)(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
+	movzbl	(VEC_SIZE * 2)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 2)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret11):
 	VZEROUPPER_RETURN
 
-	.p2align 4
-L(test_3_vec):
+
+	/* Page cross in rsi in next 4x VEC.  */
+
+	/* TODO: Improve logic here.  */
+	.p2align 4,, 10
+L(page_cross_during_loop):
+	/* eax contains [distance_from_page - (VEC_SIZE * 4)].  */
+
+	/* Optimistically rsi and rdi are both aligned, in which case we
+	   don't need any logic here.  */
+	cmpl	$-(VEC_SIZE * 4), %eax
+	/* Don't adjust eax before jumping back to loop and we will
+	   never hit page cross case again.  */
+	je	L(loop_skip_page_cross_check)
+
+	/* Check if we can safely load a VEC.  */
+	cmpl	$-(VEC_SIZE * 3), %eax
+	jle	L(less_1x_vec_till_page_cross)
+
+	VMOVA	(%rdi), %ymm0
+	VPCMPEQ	(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_0_end)
+
+	/* if distance >= 2x VEC then eax > -(VEC_SIZE * 2).  */
+	cmpl	$-(VEC_SIZE * 2), %eax
+	jg	L(more_2x_vec_till_page_cross)
+
+	.p2align 4,, 4
+L(less_1x_vec_till_page_cross):
+	subl	$-(VEC_SIZE * 4), %eax
+	/* Guaranteed safe to read from rdi - VEC_SIZE here. The only
+	   concerning case is first iteration if incoming s1 was near start
+	   of a page and s2 near end. If s1 was near the start of the page
+	   we already aligned up to nearest VEC_SIZE * 4 so guaranteed safe
+	   to read back -VEC_SIZE. If rdi is truly at the start of a page
+	   here, it means the previous page (rdi - VEC_SIZE) has already
+	   been loaded earlier so must be valid.  */
+	VMOVU	-VEC_SIZE(%rdi, %rax), %ymm0
+	VPCMPEQ	-VEC_SIZE(%rsi, %rax), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+
+	/* Mask of potentially valid bits. The lower bits can be out of
+	   range comparisons (but safe regarding page crosses).  */
+	movl	$-1, %r10d
+	shlxl	%esi, %r10d, %r10d
+	notl	%ecx
+
 # ifdef USE_AS_STRNCMP
-	/* The first 3 vectors matched.  Return 0 if the maximum offset
-	   (%r11) <= 3 * VEC_SIZE.  */
-	cmpq	$(VEC_SIZE * 3), %r11
-	jbe	L(zero)
-# endif
-	VPCMPEQ	%ymm7, %ymm6, %ymm6
-	vpmovmskb %ymm6, %esi
-	tzcntl	%esi, %ecx
+	cmpq	%rax, %rdx
+	jbe	L(return_page_cross_end_check)
+# endif
+	movl	%eax, %OFFSET_REG
+	addl	$(PAGE_SIZE - VEC_SIZE * 4), %eax
+
+	andl	%r10d, %ecx
+	jz	L(loop_skip_page_cross_check)
+
+	.p2align 4,, 3
+L(return_page_cross_end):
+	tzcntl	%ecx, %ecx
+
 # ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 3), %rcx
-	cmpq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %esi
-	cmpl	(%rdx, %rcx), %esi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	leal	-VEC_SIZE(%OFFSET_REG64, %rcx), %ecx
+L(return_page_cross_cmp_mem):
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	addl	%OFFSET_REG, %ecx
+# endif
+# ifdef USE_AS_WCSCMP
+	movl	VEC_OFFSET(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(VEC_SIZE * 3)(%rsi, %rcx), %esi
-	cmpl	(VEC_SIZE * 3)(%rdx, %rcx), %esi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 3)(%rax, %rcx), %eax
-	movzbl	(VEC_SIZE * 3)(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	VEC_OFFSET(%rsi, %rcx), %edx
+	je	L(ret8)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
+# else
+	movzbl	VEC_OFFSET(%rdi, %rcx), %eax
+	movzbl	VEC_OFFSET(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret8):
 	VZEROUPPER_RETURN
 
-	.p2align 4
-L(loop_cross_page):
-	xorl	%r10d, %r10d
-	movq	%rdx, %rcx
-	/* Align load via RDX.  We load the extra ECX bytes which should
-	   be ignored.  */
-	andl	$((VEC_SIZE * 4) - 1), %ecx
-	/* R10 is -RCX.  */
-	subq	%rcx, %r10
-
-	/* This works only if VEC_SIZE * 2 == 64. */
-# if (VEC_SIZE * 2) != 64
-#  error (VEC_SIZE * 2) != 64
-# endif
-
-	/* Check if the first VEC_SIZE * 2 bytes should be ignored.  */
-	cmpl	$(VEC_SIZE * 2), %ecx
-	jge	L(loop_cross_page_2_vec)
-
-	vmovdqu	(%rax, %r10), %ymm2
-	vmovdqu	VEC_SIZE(%rax, %r10), %ymm3
-	VPCMPEQ	(%rdx, %r10), %ymm2, %ymm0
-	VPCMPEQ	VEC_SIZE(%rdx, %r10), %ymm3, %ymm1
-	VPMINU	%ymm2, %ymm0, %ymm0
-	VPMINU	%ymm3, %ymm1, %ymm1
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm1, %ymm1
-
-	vpmovmskb %ymm0, %edi
-	vpmovmskb %ymm1, %esi
-
-	salq	$32, %rsi
-	xorq	%rsi, %rdi
-
-	/* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
-	shrq	%cl, %rdi
-
-	testq	%rdi, %rdi
-	je	L(loop_cross_page_2_vec)
-	tzcntq	%rdi, %rcx
 # ifdef USE_AS_STRNCMP
-	cmpq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	.p2align 4,, 10
+L(return_page_cross_end_check):
+	tzcntl	%ecx, %ecx
+	leal	-VEC_SIZE(%rax, %rcx), %ecx
+	cmpl	%ecx, %edx
+	ja	L(return_page_cross_cmp_mem)
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# endif
 	VZEROUPPER_RETURN
+# endif
 
-	.p2align 4
-L(loop_cross_page_2_vec):
-	/* The first VEC_SIZE * 2 bytes match or are ignored.  */
-	vmovdqu	(VEC_SIZE * 2)(%rax, %r10), %ymm2
-	vmovdqu	(VEC_SIZE * 3)(%rax, %r10), %ymm3
-	VPCMPEQ	(VEC_SIZE * 2)(%rdx, %r10), %ymm2, %ymm5
-	VPMINU	%ymm2, %ymm5, %ymm5
-	VPCMPEQ	(VEC_SIZE * 3)(%rdx, %r10), %ymm3, %ymm6
-	VPCMPEQ	%ymm7, %ymm5, %ymm5
-	VPMINU	%ymm3, %ymm6, %ymm6
-	VPCMPEQ	%ymm7, %ymm6, %ymm6
-
-	vpmovmskb %ymm5, %edi
-	vpmovmskb %ymm6, %esi
-
-	salq	$32, %rsi
-	xorq	%rsi, %rdi
 
-	xorl	%r8d, %r8d
-	/* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
-	subl	$(VEC_SIZE * 2), %ecx
-	jle	1f
-	/* Skip ECX bytes.  */
-	shrq	%cl, %rdi
-	/* R8 has number of bytes skipped.  */
-	movl	%ecx, %r8d
-1:
-	/* Before jumping back to the loop, set ESI to the number of
-	   VEC_SIZE * 4 blocks before page crossing.  */
-	movl	$(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
-
-	testq	%rdi, %rdi
+	.p2align 4,, 10
+L(more_2x_vec_till_page_cross):
+	/* If 2x VEC or more till cross we will complete a full loop
+	   iteration here.  */
+
+	VMOVU	VEC_SIZE(%rdi), %ymm0
+	VPCMPEQ	VEC_SIZE(%rsi), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_1_end)
+
 # ifdef USE_AS_STRNCMP
-	/* At this point, if %rdi value is 0, it already tested
-	   VEC_SIZE*4+%r10 byte starting from %rax. This label
-	   checks whether strncmp maximum offset reached or not.  */
-	je	L(string_nbyte_offset_check)
-# else
-	je	L(back_to_loop)
+	cmpq	$(VEC_SIZE * 2), %rdx
+	jbe	L(ret_zero_in_loop_page_cross)
 # endif
-	tzcntq	%rdi, %rcx
-	addq	%r10, %rcx
-	/* Adjust for number of bytes skipped.  */
-	addq	%r8, %rcx
+
+	subl	$-(VEC_SIZE * 4), %eax
+
+	/* Safe to include comparisons from lower bytes.  */
+	VMOVU	-(VEC_SIZE * 2)(%rdi, %rax), %ymm0
+	VPCMPEQ	-(VEC_SIZE * 2)(%rsi, %rax), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_page_cross_0)
+
+	VMOVU	-(VEC_SIZE * 1)(%rdi, %rax), %ymm0
+	VPCMPEQ	-(VEC_SIZE * 1)(%rsi, %rax), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+	jnz	L(return_vec_page_cross_1)
+
 # ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 2), %rcx
-	subq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	/* Must check length here as length might preclude reading next
+	   page.  */
+	cmpq	%rax, %rdx
+	jbe	L(ret_zero_in_loop_page_cross)
+# endif
+
+	/* Finish the loop.  */
+	VMOVA	(VEC_SIZE * 2)(%rdi), %ymm4
+	VMOVA	(VEC_SIZE * 3)(%rdi), %ymm6
+
+	VPCMPEQ	(VEC_SIZE * 2)(%rsi), %ymm4, %ymm5
+	VPCMPEQ	(VEC_SIZE * 3)(%rsi), %ymm6, %ymm7
+	vpand	%ymm4, %ymm5, %ymm5
+	vpand	%ymm6, %ymm7, %ymm7
+	VPMINU	%ymm5, %ymm7, %ymm7
+	VPCMPEQ	%ymm7, %ymmZERO, %ymm7
+	vpmovmskb %ymm7, %LOOP_REG
+	testl	%LOOP_REG, %LOOP_REG
+	jnz	L(return_vec_2_3_end)
+
+	/* Best for code size to include ucond-jmp here. Would be faster
+	   if this case is hot to duplicate the L(return_vec_2_3_end) code
+	   as fall-through and have jump back to loop on mismatch
+	   comparison.  */
+	subq	$-(VEC_SIZE * 4), %rdi
+	subq	$-(VEC_SIZE * 4), %rsi
+	addl	$(PAGE_SIZE - VEC_SIZE * 8), %eax
+# ifdef USE_AS_STRNCMP
+	subq	$(VEC_SIZE * 4), %rdx
+	ja	L(loop_skip_page_cross_check)
+L(ret_zero_in_loop_page_cross):
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	VZEROUPPER_RETURN
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rsi, %rcx), %edi
-	cmpl	(VEC_SIZE * 2)(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rax, %rcx), %eax
-	movzbl	(VEC_SIZE * 2)(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	jmp	L(loop_skip_page_cross_check)
 # endif
-	VZEROUPPER_RETURN
 
+
+	.p2align 4,, 10
+L(return_vec_page_cross_0):
+	addl	$-VEC_SIZE, %eax
+L(return_vec_page_cross_1):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_STRNCMP
-L(string_nbyte_offset_check):
-	leaq	(VEC_SIZE * 4)(%r10), %r10
-	cmpq	%r10, %r11
-	jbe	L(zero)
-	jmp	L(back_to_loop)
+	leal	-VEC_SIZE(%rax, %rcx), %ecx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero_in_loop_page_cross)
+# else
+	addl	%eax, %ecx
 # endif
 
-	.p2align 4
-L(cross_page_loop):
-	/* Check one byte/dword at a time.  */
 # ifdef USE_AS_WCSCMP
-	cmpl	%ecx, %eax
+	movl	VEC_OFFSET(%rdi, %rcx), %edx
+	xorl	%eax, %eax
+	cmpl	VEC_OFFSET(%rsi, %rcx), %edx
+	je	L(ret9)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
+	movzbl	VEC_OFFSET(%rdi, %rcx), %eax
+	movzbl	VEC_OFFSET(%rsi, %rcx), %ecx
 	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
-	jne	L(different)
-	addl	$SIZE_OF_CHAR, %edx
-	cmpl	$(VEC_SIZE * 4), %edx
-	je	L(main_loop_header)
-# ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
+L(ret9):
+	VZEROUPPER_RETURN
+
+
+	.p2align 4,, 10
+L(page_cross):
+# ifndef USE_AS_STRNCMP
+	/* If both are VEC aligned we don't need any special logic here.
+	   Only valid for strcmp where the stop condition is guaranteed
+	   to be reachable by just reading memory.  */
+	testl	$((VEC_SIZE - 1) << 20), %eax
+	jz	L(no_page_cross)
 # endif
+
+	movl	%edi, %eax
+	movl	%esi, %ecx
+	andl	$(PAGE_SIZE - 1), %eax
+	andl	$(PAGE_SIZE - 1), %ecx
+
+	xorl	%OFFSET_REG, %OFFSET_REG
+
+	/* Check which is closer to page cross, s1 or s2.  */
+	cmpl	%eax, %ecx
+	jg	L(page_cross_s2)
+
+	/* The previous page cross check has false positives. Check for
+	   true positive as page cross logic is very expensive.  */
+	subl	$(PAGE_SIZE - VEC_SIZE * 4), %eax
+	jbe	L(no_page_cross)
+
+	/* Set r8 to not interfere with normal return value (rdi and rsi
+	   did not swap).  */
 # ifdef USE_AS_WCSCMP
-	movl	(%rdi, %rdx), %eax
-	movl	(%rsi, %rdx), %ecx
+	/* Any non-zero positive value that doesn't interfere with 0x1.
+	 */
+	movl	$2, %r8d
 # else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %ecx
+	xorl	%r8d, %r8d
 # endif
-	/* Check null char.  */
-	testl	%eax, %eax
-	jne	L(cross_page_loop)
-	/* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
-	   comparisons.  */
-	subl	%ecx, %eax
-# ifndef USE_AS_WCSCMP
-L(different):
+
+	/* Check if less than 1x VEC till page cross.  */
+	subl	$(VEC_SIZE * 3), %eax
+	jg	L(less_1x_vec_till_page)
+
+	/* If more than 1x VEC till page cross, loop through safely
+	   loadable memory until within 1x VEC of page cross.  */
+
+	.p2align 4,, 10
+L(page_cross_loop):
+
+	VMOVU	(%rdi, %OFFSET_REG64), %ymm0
+	VPCMPEQ	(%rsi, %OFFSET_REG64), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+	incl	%ecx
+
+	jnz	L(check_ret_vec_page_cross)
+	addl	$VEC_SIZE, %OFFSET_REG
+# ifdef USE_AS_STRNCMP
+	cmpq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross)
 # endif
-	VZEROUPPER_RETURN
+	addl	$VEC_SIZE, %eax
+	jl	L(page_cross_loop)
+
+	subl	%eax, %OFFSET_REG
+	/* OFFSET_REG has distance to page cross - VEC_SIZE. Guaranteed
+	   to not cross page so is safe to load. Since we have already
+	   loaded at least 1 VEC from rsi it is also guaranteed to be safe.
+	 */
+
+	VMOVU	(%rdi, %OFFSET_REG64), %ymm0
+	VPCMPEQ	(%rsi, %OFFSET_REG64), %ymm0, %ymm1
+	VPCMPEQ	%ymm0, %ymmZERO, %ymm2
+	vpandn	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %ecx
+
+# ifdef USE_AS_STRNCMP
+	leal	VEC_SIZE(%OFFSET_REG64), %eax
+	cmpq	%rax, %rdx
+	jbe	L(check_ret_vec_page_cross2)
+	addq	%rdi, %rdx
+# endif
+	incl	%ecx
+	jz	L(prepare_loop_no_len)
 
+	.p2align 4,, 4
+L(ret_vec_page_cross):
+# ifndef USE_AS_STRNCMP
+L(check_ret_vec_page_cross):
+# endif
+	tzcntl	%ecx, %ecx
+	addl	%OFFSET_REG, %ecx
+L(ret_vec_page_cross_cont):
 # ifdef USE_AS_WCSCMP
-	.p2align 4
-L(different):
-	/* Use movl to avoid modifying EFLAGS.  */
-	movl	$0, %eax
+	movl	(%rdi, %rcx), %edx
+	xorl	%eax, %eax
+	cmpl	(%rsi, %rcx), %edx
+	je	L(ret12)
 	setl	%al
 	negl	%eax
-	orl	$1, %eax
-	VZEROUPPER_RETURN
+	xorl	%r8d, %eax
+# else
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret12):
+	VZEROUPPER_RETURN
 
 # ifdef USE_AS_STRNCMP
-	.p2align 4
-L(zero):
+	.p2align 4,, 10
+L(check_ret_vec_page_cross2):
+	incl	%ecx
+L(check_ret_vec_page_cross):
+	tzcntl	%ecx, %ecx
+	addl	%OFFSET_REG, %ecx
+	cmpq	%rcx, %rdx
+	ja	L(ret_vec_page_cross_cont)
+	.p2align 4,, 2
+L(ret_zero_page_cross):
 	xorl	%eax, %eax
 	VZEROUPPER_RETURN
+# endif
 
-	.p2align 4
-L(char0):
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(%rdi), %ecx
-	cmpl	(%rsi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rsi), %ecx
-	movzbl	(%rdi), %eax
-	subl	%ecx, %eax
-#  endif
-	VZEROUPPER_RETURN
+	.p2align 4,, 4
+L(page_cross_s2):
+	/* Ensure this is a true page cross.  */
+	subl	$(PAGE_SIZE - VEC_SIZE * 4), %ecx
+	jbe	L(no_page_cross)
+
+
+	movl	%ecx, %eax
+	movq	%rdi, %rcx
+	movq	%rsi, %rdi
+	movq	%rcx, %rsi
+
+	/* Set r8 to negate return value as rdi and rsi were swapped.  */
+# ifdef USE_AS_WCSCMP
+	movl	$-4, %r8d
+# else
+	movl	$-1, %r8d
 # endif
+	xorl	%OFFSET_REG, %OFFSET_REG
 
-	.p2align 4
-L(last_vector):
-	addq	%rdx, %rdi
-	addq	%rdx, %rsi
+	/* Check if more than 1x VEC till page cross.  */
+	subl	$(VEC_SIZE * 3), %eax
+	jle	L(page_cross_loop)
+
+	.p2align 4,, 6
+L(less_1x_vec_till_page):
+	/* Find largest load size we can use.  */
+	cmpl	$16, %eax
+	ja	L(less_16_till_page)
+
+	VMOVU	(%rdi), %xmm0
+	VPCMPEQ	(%rsi), %xmm0, %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	incw	%cx
+	jnz	L(check_ret_vec_page_cross)
+	movl	$16, %OFFSET_REG
 # ifdef USE_AS_STRNCMP
-	subq	%rdx, %r11
+	cmpq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subl	%eax, %OFFSET_REG
+# else
+	/* Explicit check for 16 byte alignment.  */
+	subl	%eax, %OFFSET_REG
+	jz	L(prepare_loop)
 # endif
-	tzcntl	%ecx, %edx
+
+	VMOVU	(%rdi, %OFFSET_REG64), %xmm0
+	VPCMPEQ	(%rsi, %OFFSET_REG64), %xmm0, %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	incw	%cx
+	jnz	L(check_ret_vec_page_cross)
+
 # ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	addl	$16, %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subq	$-(VEC_SIZE * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+# else
+	leaq	(16 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	(16 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
 # endif
-# ifdef USE_AS_WCSCMP
+	jmp	L(prepare_loop_aligned)
+
+# ifdef USE_AS_STRNCMP
+	.p2align 4,, 2
+L(ret_zero_page_cross_slow_case0):
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-# else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	ret
 # endif
-	VZEROUPPER_RETURN
 
-	/* Comparing on page boundary region requires special treatment:
-	   It must done one vector at the time, starting with the wider
-	   ymm vector if possible, if not, with xmm. If fetching 16 bytes
-	   (xmm) still passes the boundary, byte comparison must be done.
-	 */
-	.p2align 4
-L(cross_page):
-	/* Try one ymm vector at a time.  */
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
-	jg	L(cross_page_1_vector)
-L(loop_1_vector):
-	vmovdqu	(%rdi, %rdx), %ymm1
-	VPCMPEQ	(%rsi, %rdx), %ymm1, %ymm0
-	VPMINU	%ymm1, %ymm0, %ymm0
-	VPCMPEQ	%ymm7, %ymm0, %ymm0
-	vpmovmskb %ymm0, %ecx
-	testl	%ecx, %ecx
-	jne	L(last_vector)
 
-	addl	$VEC_SIZE, %edx
+	.p2align 4,, 10
+L(less_16_till_page):
+	/* Find largest load size we can use.  */
+	cmpl	$24, %eax
+	ja	L(less_8_till_page)
 
-	addl	$VEC_SIZE, %eax
-# ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
-# endif
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
-	jle	L(loop_1_vector)
-L(cross_page_1_vector):
-	/* Less than 32 bytes to check, try one xmm vector.  */
-	cmpl	$(PAGE_SIZE - 16), %eax
-	jg	L(cross_page_1_xmm)
-	vmovdqu	(%rdi, %rdx), %xmm1
-	VPCMPEQ	(%rsi, %rdx), %xmm1, %xmm0
-	VPMINU	%xmm1, %xmm0, %xmm0
-	VPCMPEQ	%xmm7, %xmm0, %xmm0
-	vpmovmskb %xmm0, %ecx
-	testl	%ecx, %ecx
-	jne	L(last_vector)
+	vmovq	(%rdi), %xmm0
+	vmovq	(%rsi), %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	VPCMPEQ	%xmm1, %xmm0, %xmm1
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	incb	%cl
+	jnz	L(check_ret_vec_page_cross)
 
-	addl	$16, %edx
-# ifndef USE_AS_WCSCMP
-	addl	$16, %eax
+
+# ifdef USE_AS_STRNCMP
+	cmpq	$8, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
 # endif
+	movl	$24, %OFFSET_REG
+	/* Explicit check for 16 byte alignment.  */
+	subl	%eax, %OFFSET_REG
+
+
+
+	vmovq	(%rdi, %OFFSET_REG64), %xmm0
+	vmovq	(%rsi, %OFFSET_REG64), %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	VPCMPEQ	%xmm1, %xmm0, %xmm1
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	incb	%cl
+	jnz	L(check_ret_vec_page_cross)
+
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
-# endif
-
-L(cross_page_1_xmm):
-# ifndef USE_AS_WCSCMP
-	/* Less than 16 bytes to check, try 8 byte vector.  NB: No need
-	   for wcscmp nor wcsncmp since wide char is 4 bytes.   */
-	cmpl	$(PAGE_SIZE - 8), %eax
-	jg	L(cross_page_8bytes)
-	vmovq	(%rdi, %rdx), %xmm1
-	vmovq	(%rsi, %rdx), %xmm0
-	VPCMPEQ	%xmm0, %xmm1, %xmm0
-	VPMINU	%xmm1, %xmm0, %xmm0
-	VPCMPEQ	%xmm7, %xmm0, %xmm0
-	vpmovmskb %xmm0, %ecx
-	/* Only last 8 bits are valid.  */
-	andl	$0xff, %ecx
-	testl	%ecx, %ecx
-	jne	L(last_vector)
+	addl	$8, %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subq	$-(VEC_SIZE * 4), %rdx
 
-	addl	$8, %edx
-	addl	$8, %eax
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+# else
+	leaq	(8 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	(8 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+# endif
+	jmp	L(prepare_loop_aligned)
+
+
+	.p2align 4,, 10
+L(less_8_till_page):
+# ifdef USE_AS_WCSCMP
+	/* If using wchar then this is the only check before we reach
+	   the page boundary.  */
+	movl	(%rdi), %eax
+	movl	(%rsi), %ecx
+	cmpl	%ecx, %eax
+	jnz	L(ret_less_8_wcs)
 #  ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	addq	%rdi, %rdx
+	/* We already checked for len <= 1 so cannot hit that case here.
+	 */
 #  endif
+	testl	%eax, %eax
+	jnz	L(prepare_loop_no_len)
+	ret
 
-L(cross_page_8bytes):
-	/* Less than 8 bytes to check, try 4 byte vector.  */
-	cmpl	$(PAGE_SIZE - 4), %eax
-	jg	L(cross_page_4bytes)
-	vmovd	(%rdi, %rdx), %xmm1
-	vmovd	(%rsi, %rdx), %xmm0
-	VPCMPEQ	%xmm0, %xmm1, %xmm0
-	VPMINU	%xmm1, %xmm0, %xmm0
-	VPCMPEQ	%xmm7, %xmm0, %xmm0
-	vpmovmskb %xmm0, %ecx
-	/* Only last 4 bits are valid.  */
-	andl	$0xf, %ecx
-	testl	%ecx, %ecx
-	jne	L(last_vector)
+	.p2align 4,, 8
+L(ret_less_8_wcs):
+	setl	%OFFSET_REG8
+	negl	%OFFSET_REG
+	movl	%OFFSET_REG, %eax
+	xorl	%r8d, %eax
+	ret
+
+# else
+
+	/* Find largest load size we can use.  */
+	cmpl	$28, %eax
+	ja	L(less_4_till_page)
+
+	vmovd	(%rdi), %xmm0
+	vmovd	(%rsi), %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	VPCMPEQ	%xmm1, %xmm0, %xmm1
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	subl	$0xf, %ecx
+	jnz	L(check_ret_vec_page_cross)
 
-	addl	$4, %edx
 #  ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$4, %rdx
+	jbe	L(ret_zero_page_cross_slow_case1)
 #  endif
+	movl	$28, %OFFSET_REG
+	/* Explicit check for 16 byte alignment.  */
+	subl	%eax, %OFFSET_REG
 
-L(cross_page_4bytes):
-# endif
-	/* Less than 4 bytes to check, try one byte/dword at a time.  */
-# ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
-# endif
-# ifdef USE_AS_WCSCMP
-	movl	(%rdi, %rdx), %eax
-	movl	(%rsi, %rdx), %ecx
-# else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %ecx
-# endif
-	testl	%eax, %eax
-	jne	L(cross_page_loop)
+
+
+	vmovd	(%rdi, %OFFSET_REG64), %xmm0
+	vmovd	(%rsi, %OFFSET_REG64), %xmm1
+	VPCMPEQ	%xmm0, %xmmZERO, %xmm2
+	VPCMPEQ	%xmm1, %xmm0, %xmm1
+	vpandn	%xmm1, %xmm2, %xmm1
+	vpmovmskb %ymm1, %ecx
+	subl	$0xf, %ecx
+	jnz	L(check_ret_vec_page_cross)
+
+#  ifdef USE_AS_STRNCMP
+	addl	$4, %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case1)
+	subq	$-(VEC_SIZE * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+#  else
+	leaq	(4 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi
+	leaq	(4 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi
+#  endif
+	jmp	L(prepare_loop_aligned)
+
+#  ifdef USE_AS_STRNCMP
+	.p2align 4,, 2
+L(ret_zero_page_cross_slow_case1):
+	xorl	%eax, %eax
+	ret
+#  endif
+
+	.p2align 4,, 10
+L(less_4_till_page):
+	subq	%rdi, %rsi
+	/* Extremely slow byte comparison loop.  */
+L(less_4_loop):
+	movzbl	(%rdi), %eax
+	movzbl	(%rsi, %rdi), %ecx
 	subl	%ecx, %eax
-	VZEROUPPER_RETURN
-END (STRCMP)
+	jnz	L(ret_less_4_loop)
+	testl	%ecx, %ecx
+	jz	L(ret_zero_4_loop)
+#  ifdef USE_AS_STRNCMP
+	decq	%rdx
+	jz	L(ret_zero_4_loop)
+#  endif
+	incq	%rdi
+	/* End condition is reaching the page boundary (rdi is aligned).  */
+	testl	$31, %edi
+	jnz	L(less_4_loop)
+	leaq	-(VEC_SIZE * 4)(%rdi, %rsi), %rsi
+	addq	$-(VEC_SIZE * 4), %rdi
+#  ifdef USE_AS_STRNCMP
+	subq	$-(VEC_SIZE * 4), %rdx
+#  endif
+	jmp	L(prepare_loop_aligned)
+
+L(ret_zero_4_loop):
+	xorl	%eax, %eax
+	ret
+L(ret_less_4_loop):
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
+	ret
+# endif
+END(STRCMP)
 #endif
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 6/7] x86: Optimize strcmp-evex.S
  2022-01-10 21:35 ` [PATCH v3 " Noah Goldstein
                     ` (3 preceding siblings ...)
  2022-01-10 21:35   ` [PATCH v3 5/7] x86: Optimize strcmp-avx2.S Noah Goldstein
@ 2022-01-10 21:35   ` Noah Goldstein
  2022-01-10 21:35   ` [PATCH v3 7/7] benchtests: Add more coverage for strcmp and strncmp benchmarks Noah Goldstein
  2022-01-11  2:15   ` [PATCH v3 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755] H.J. Lu
  6 siblings, 0 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10 21:35 UTC (permalink / raw)
  To: libc-alpha

Optimizations are primarily to the loop logic and how the page cross
logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page that do not actually cross the page. This is done to
retest the page cross conditions with a non-faulting check and to
improve the logic for entering the loop afterwards. This only affects
particular cases, however, and is generally made up for by more than
10x improvements on the transition from the page cross -> loop case.
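
Roughly, the coarse check that gates the page cross path amounts to the
following C sketch (the PAGE_SIZE and VEC_SIZE values are assumptions
matching the assembly: 4096-byte pages, 32-byte vectors):

    #include <stdint.h>

    #define PAGE_SIZE 4096
    #define VEC_SIZE 32

    /* Nonzero if loading VEC_SIZE * 4 bytes from either s1 or s2 could
       cross into the next page.  OR-ing the two page offsets can give
       false positives (retested by the slow path) but never false
       negatives.  */
    int
    may_cross_page (const char *s1, const char *s2)
    {
      uintptr_t off = ((uintptr_t) s1 | (uintptr_t) s2) & (PAGE_SIZE - 1);
      return off > PAGE_SIZE - VEC_SIZE * 4;
    }

The assembly performs the same test by shifting the combined page
offset left by 20 bits and comparing against
(PAGE_SIZE - VEC_SIZE * 4) << 20.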

The non-page cross cases are also nearly universally improved.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 sysdeps/x86_64/multiarch/strcmp-evex.S | 1712 +++++++++++++-----------
 1 file changed, 919 insertions(+), 793 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
index 0cd939d5af..e5070f3d53 100644
--- a/sysdeps/x86_64/multiarch/strcmp-evex.S
+++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
@@ -26,54 +26,69 @@
 
 # define PAGE_SIZE	4096
 
-/* VEC_SIZE = Number of bytes in a ymm register */
+	/* VEC_SIZE = Number of bytes in a ymm register.  */
 # define VEC_SIZE	32
+# define CHAR_PER_VEC	(VEC_SIZE / SIZE_OF_CHAR)
 
-/* Shift for dividing by (VEC_SIZE * 4).  */
-# define DIVIDE_BY_VEC_4_SHIFT	7
-# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
-#  error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
-# endif
-
-# define VMOVU		vmovdqu64
-# define VMOVA		vmovdqa64
+# define VMOVU	vmovdqu64
+# define VMOVA	vmovdqa64
 
 # ifdef USE_AS_WCSCMP
-/* Compare packed dwords.  */
-#  define VPCMP		vpcmpd
+#  define TESTEQ	subl	$0xff,
+	/* Compare packed dwords.  */
+#  define VPCMP	vpcmpd
 #  define VPMINU	vpminud
 #  define VPTESTM	vptestmd
-#  define SHIFT_REG32	r8d
-#  define SHIFT_REG64	r8
-/* 1 dword char == 4 bytes.  */
+	/* 1 dword char == 4 bytes.  */
 #  define SIZE_OF_CHAR	4
 # else
-/* Compare packed bytes.  */
-#  define VPCMP		vpcmpb
+#  define TESTEQ	incl
+	/* Compare packed bytes.  */
+#  define VPCMP	vpcmpb
 #  define VPMINU	vpminub
 #  define VPTESTM	vptestmb
-#  define SHIFT_REG32	ecx
-#  define SHIFT_REG64	rcx
-/* 1 byte char == 1 byte.  */
+	/* 1 byte char == 1 byte.  */
 #  define SIZE_OF_CHAR	1
 # endif
 
+# ifdef USE_AS_STRNCMP
+#  define LOOP_REG	r9d
+#  define LOOP_REG64	r9
+
+#  define OFFSET_REG8	r9b
+#  define OFFSET_REG	r9d
+#  define OFFSET_REG64	r9
+# else
+#  define LOOP_REG	edx
+#  define LOOP_REG64	rdx
+
+#  define OFFSET_REG8	dl
+#  define OFFSET_REG	edx
+#  define OFFSET_REG64	rdx
+# endif
+
+# if defined USE_AS_STRNCMP || defined USE_AS_WCSCMP
+#  define VEC_OFFSET	0
+# else
+#  define VEC_OFFSET	(-VEC_SIZE)
+# endif
+
 # define XMMZERO	xmm16
-# define XMM0		xmm17
-# define XMM1		xmm18
+# define XMM0	xmm17
+# define XMM1	xmm18
 
 # define YMMZERO	ymm16
-# define YMM0		ymm17
-# define YMM1		ymm18
-# define YMM2		ymm19
-# define YMM3		ymm20
-# define YMM4		ymm21
-# define YMM5		ymm22
-# define YMM6		ymm23
-# define YMM7		ymm24
-# define YMM8		ymm25
-# define YMM9		ymm26
-# define YMM10		ymm27
+# define YMM0	ymm17
+# define YMM1	ymm18
+# define YMM2	ymm19
+# define YMM3	ymm20
+# define YMM4	ymm21
+# define YMM5	ymm22
+# define YMM6	ymm23
+# define YMM7	ymm24
+# define YMM8	ymm25
+# define YMM9	ymm26
+# define YMM10	ymm27
 
 /* Warning!
            wcscmp/wcsncmp have to use SIGNED comparison for elements.
@@ -96,985 +111,1096 @@
    the maximum offset is reached before a difference is found, zero is
    returned.  */
 
-	.section .text.evex,"ax",@progbits
-ENTRY (STRCMP)
+	.section .text.evex, "ax", @progbits
+ENTRY(STRCMP)
 # ifdef USE_AS_STRNCMP
-	/* Check for simple cases (0 or 1) in offset.  */
-	cmp	$1, %RDX_LP
-	je	L(char0)
-	jb	L(zero)
-#  ifdef USE_AS_WCSCMP
-#  ifndef __ILP32__
-	movq	%rdx, %rcx
-	/* Check if length could overflow when multiplied by
-	   sizeof(wchar_t). Checking top 8 bits will cover all potential
-	   overflow cases as well as redirect cases where its impossible to
-	   length to bound a valid memory region. In these cases just use
-	   'wcscmp'.  */
-	shrq	$56, %rcx
-	jnz	__wcscmp_evex
-#  endif
-	/* Convert units: from wide to byte char.  */
-	shl	$2, %RDX_LP
+#  ifdef __ILP32__
+	/* Clear the upper 32 bits.  */
+	movl	%edx, %edx
 #  endif
-	/* Register %r11 tracks the maximum offset.  */
-	mov	%RDX_LP, %R11_LP
+	cmp	$1, %RDX_LP
+	/* Signed comparison intentional. We use this branch to also
+	   test cases where length >= 2^63. These very large sizes can be
+	   handled with strcmp as there is no way for that length to
+	   actually bound the buffer.  */
+	jle	L(one_or_less)
 # endif
 	movl	%edi, %eax
-	xorl	%edx, %edx
-	/* Make %XMMZERO (%YMMZERO) all zeros in this function.  */
-	vpxorq	%XMMZERO, %XMMZERO, %XMMZERO
 	orl	%esi, %eax
-	andl	$(PAGE_SIZE - 1), %eax
-	cmpl	$(PAGE_SIZE - (VEC_SIZE * 4)), %eax
-	jg	L(cross_page)
-	/* Start comparing 4 vectors.  */
+	/* Shift out the bits irrelevant to page boundary ([63:12]).  */
+	sall	$20, %eax
+	/* Check if s1 or s2 may cross a page in next 4x VEC loads.  */
+	cmpl	$((PAGE_SIZE -(VEC_SIZE * 4)) << 20), %eax
+	ja	L(page_cross)
+
+L(no_page_cross):
+	/* Safe to compare 4x vectors.  */
 	VMOVU	(%rdi), %YMM0
-
-	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
 	VPTESTM	%YMM0, %YMM0, %k2
-
 	/* Each bit cleared in K1 represents a mismatch or a null CHAR
 	   in YMM0 and 32 bytes at (%rsi).  */
 	VPCMP	$0, (%rsi), %YMM0, %k1{%k2}
-
 	kmovd	%k1, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	je	L(next_3_vectors)
-	tzcntl	%ecx, %edx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
-# endif
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx) is after the maximum
-	   offset (%r11).   */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$CHAR_PER_VEC, %rdx
+	jbe	L(vec_0_test_len)
 # endif
+
+	/* TESTEQ is `incl` for strcmp/strncmp and `subl $0xff` for
+	   wcscmp/wcsncmp.  */
+
+	/* All 1s represents all chars equal. TESTEQ will overflow to
+	   zero in the all-equals case. Otherwise the 1s will carry until
+	   the position of the first mismatch.  */
+	TESTEQ	%ecx
+	jz	L(more_3x_vec)
+
+	.p2align 4,, 4
+L(return_vec_0):
+	tzcntl	%ecx, %ecx
 # ifdef USE_AS_WCSCMP
+	movl	(%rdi, %rcx, SIZE_OF_CHAR), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	je	L(return)
-L(wcscmp_return):
+	cmpl	(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret0)
 	setl	%al
 	negl	%eax
 	orl	$1, %eax
-L(return):
 # else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 # endif
+L(ret0):
 	ret
 
-L(return_vec_size):
-	tzcntl	%ecx, %edx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
-# endif
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after
-	   the maximum offset (%r11).  */
-	addq	$VEC_SIZE, %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+	.p2align 4,, 4
+L(vec_0_test_len):
+	notl	%ecx
+	bzhil	%edx, %ecx, %eax
+	jnz	L(return_vec_0)
+	/* Align if will cross fetch block.  */
+	.p2align 4,, 2
+L(ret_zero):
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
-# else
+	ret
+
+	.p2align 4,, 5
+L(one_or_less):
+	jb	L(ret_zero)
 #  ifdef USE_AS_WCSCMP
+	/* 'nbe' covers the case where length is negative (large
+	   unsigned).  */
+	jnbe	__wcscmp_evex
+	movl	(%rdi), %edx
 	xorl	%eax, %eax
-	movl	VEC_SIZE(%rdi, %rdx), %ecx
-	cmpl	VEC_SIZE(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	cmpl	(%rsi), %edx
+	je	L(ret1)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 #  else
-	movzbl	VEC_SIZE(%rdi, %rdx), %eax
-	movzbl	VEC_SIZE(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	/* 'nbe' covers the case where length is negative (large
+	   unsigned).  */
+	jnbe	__strcmp_evex
+	movzbl	(%rdi), %eax
+	movzbl	(%rsi), %ecx
+	subl	%ecx, %eax
 #  endif
-# endif
+L(ret1):
 	ret
+# endif
 
-L(return_2_vec_size):
-	tzcntl	%ecx, %edx
+	.p2align 4,, 10
+L(return_vec_1):
+	tzcntl	%ecx, %ecx
+# ifdef USE_AS_STRNCMP
+	/* rdx must be > CHAR_PER_VEC so it's safe to subtract without
+	   worrying about underflow.  */
+	addq	$-CHAR_PER_VEC, %rdx
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero)
+# endif
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
+	movl	VEC_SIZE(%rdi, %rcx, SIZE_OF_CHAR), %edx
+	xorl	%eax, %eax
+	cmpl	VEC_SIZE(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret2)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
+# else
+	movzbl	VEC_SIZE(%rdi, %rcx), %eax
+	movzbl	VEC_SIZE(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 # endif
+L(ret2):
+	ret
+
+	.p2align 4,, 10
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is
-	   after the maximum offset (%r11).  */
-	addq	$(VEC_SIZE * 2), %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+L(return_vec_3):
+#  if CHAR_PER_VEC <= 16
+	sall	$CHAR_PER_VEC, %ecx
 #  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	salq	$CHAR_PER_VEC, %rcx
 #  endif
+# endif
+L(return_vec_2):
+# if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP)
+	tzcntl	%ecx, %ecx
 # else
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rdi, %rdx), %ecx
-	cmpl	(VEC_SIZE * 2)(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rdi, %rdx), %eax
-	movzbl	(VEC_SIZE * 2)(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	tzcntq	%rcx, %rcx
 # endif
-	ret
 
-L(return_3_vec_size):
-	tzcntl	%ecx, %edx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
-# endif
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is
-	   after the maximum offset (%r11).  */
-	addq	$(VEC_SIZE * 3), %rdx
-	cmpq	%r11, %rdx
-	jae	L(zero)
-#  ifdef USE_AS_WCSCMP
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero)
+# endif
+
+# ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 2)(%rdi, %rcx, SIZE_OF_CHAR), %edx
 	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	(VEC_SIZE * 2)(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret3)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 # else
+	movzbl	(VEC_SIZE * 2)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 2)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+# endif
+L(ret3):
+	ret
+
+# ifndef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_vec_3):
+	tzcntl	%ecx, %ecx
 #  ifdef USE_AS_WCSCMP
+	movl	(VEC_SIZE * 3)(%rdi, %rcx, SIZE_OF_CHAR), %edx
 	xorl	%eax, %eax
-	movl	(VEC_SIZE * 3)(%rdi, %rdx), %ecx
-	cmpl	(VEC_SIZE * 3)(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	cmpl	(VEC_SIZE * 3)(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret4)
+	setl	%al
+	negl	%eax
+	orl	$1, %eax
 #  else
-	movzbl	(VEC_SIZE * 3)(%rdi, %rdx), %eax
-	movzbl	(VEC_SIZE * 3)(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	movzbl	(VEC_SIZE * 3)(%rdi, %rcx), %eax
+	movzbl	(VEC_SIZE * 3)(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
 #  endif
-# endif
+L(ret4):
 	ret
+# endif
 
-	.p2align 4
-L(next_3_vectors):
-	VMOVU	VEC_SIZE(%rdi), %YMM0
-	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
+	/* 32 byte alignment here ensures the main loop is ideally
+	   aligned for the DSB.  */
+	.p2align 5
+L(more_3x_vec):
+	/* Safe to compare 4x vectors.  */
+	VMOVU	(VEC_SIZE)(%rdi), %YMM0
 	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at VEC_SIZE(%rsi).  */
-	VPCMP	$0, VEC_SIZE(%rsi), %YMM0, %k1{%k2}
+	VPCMP	$0, (VEC_SIZE)(%rsi), %YMM0, %k1{%k2}
 	kmovd	%k1, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_1)
+
+# ifdef USE_AS_STRNCMP
+	subq	$(CHAR_PER_VEC * 2), %rdx
+	jbe	L(ret_zero)
 # endif
-	jne	L(return_vec_size)
 
 	VMOVU	(VEC_SIZE * 2)(%rdi), %YMM0
-	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
 	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rsi).  */
 	VPCMP	$0, (VEC_SIZE * 2)(%rsi), %YMM0, %k1{%k2}
 	kmovd	%k1, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	jne	L(return_2_vec_size)
+	TESTEQ	%ecx
+	jnz	L(return_vec_2)
 
 	VMOVU	(VEC_SIZE * 3)(%rdi), %YMM0
-	/* Each bit set in K2 represents a non-null CHAR in YMM0.  */
 	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rsi).  */
 	VPCMP	$0, (VEC_SIZE * 3)(%rsi), %YMM0, %k1{%k2}
 	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_3)
+
+# ifdef USE_AS_STRNCMP
+	cmpq	$(CHAR_PER_VEC * 2), %rdx
+	jbe	L(ret_zero)
+# endif
+
+
 # ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
+	/* Any non-zero positive value that doesn't interfere with 0x1.
+	 */
+	movl	$2, %r8d
+
 # else
-	incl	%ecx
+	xorl	%r8d, %r8d
 # endif
-	jne	L(return_3_vec_size)
-L(main_loop_header):
-	leaq	(VEC_SIZE * 4)(%rdi), %rdx
-	movl	$PAGE_SIZE, %ecx
-	/* Align load via RAX.  */
-	andq	$-(VEC_SIZE * 4), %rdx
-	subq	%rdi, %rdx
-	leaq	(%rdi, %rdx), %rax
+
+	/* The prepare labels are various entry points from the page
+	   cross logic.  */
+L(prepare_loop):
+
 # ifdef USE_AS_STRNCMP
-	/* Starting from this point, the maximum offset, or simply the
-	   'offset', DECREASES by the same amount when base pointers are
-	   moved forward.  Return 0 when:
-	     1) On match: offset <= the matched vector index.
-	     2) On mistmach, offset is before the mistmatched index.
-	 */
-	subq	%rdx, %r11
-	jbe	L(zero)
+#  ifdef USE_AS_WCSCMP
+L(prepare_loop_no_len):
+	movl	%edi, %ecx
+	andl	$(VEC_SIZE * 4 - 1), %ecx
+	shrl	$2, %ecx
+	leaq	(CHAR_PER_VEC * 2)(%rdx, %rcx), %rdx
+#  else
+	/* Store N + (VEC_SIZE * 4) and place check at the beginning of
+	   the loop.  */
+	leaq	(VEC_SIZE * 2)(%rdi, %rdx), %rdx
+L(prepare_loop_no_len):
+#  endif
+# else
+L(prepare_loop_no_len):
 # endif
-	addq	%rsi, %rdx
-	movq	%rdx, %rsi
-	andl	$(PAGE_SIZE - 1), %esi
-	/* Number of bytes before page crossing.  */
-	subq	%rsi, %rcx
-	/* Number of VEC_SIZE * 4 blocks before page crossing.  */
-	shrq	$DIVIDE_BY_VEC_4_SHIFT, %rcx
-	/* ESI: Number of VEC_SIZE * 4 blocks before page crossing.   */
-	movl	%ecx, %esi
-	jmp	L(loop_start)
 
+	/* Align s1 and adjust s2 accordingly.  */
+	subq	%rdi, %rsi
+	andq	$-(VEC_SIZE * 4), %rdi
+L(prepare_loop_readj):
+	addq	%rdi, %rsi
+# if (defined USE_AS_STRNCMP) && !(defined USE_AS_WCSCMP)
+	subq	%rdi, %rdx
+# endif
+
+L(prepare_loop_aligned):
+	/* eax stores distance from rsi to next page cross. These cases
+	   need to be handled specially as the 4x loop could potentially
+	   read memory past the length of s1 or s2 and across a page
+	   boundary.  */
+	movl	$-(VEC_SIZE * 4), %eax
+	subl	%esi, %eax
+	andl	$(PAGE_SIZE - 1), %eax
+
+	vpxorq	%YMMZERO, %YMMZERO, %YMMZERO
+
+	/* Loop 4x comparisons at a time.  */
 	.p2align 4
 L(loop):
+
+	/* End condition for strncmp.  */
 # ifdef USE_AS_STRNCMP
-	/* Base pointers are moved forward by 4 * VEC_SIZE.  Decrease
-	   the maximum offset (%r11) by the same amount.  */
-	subq	$(VEC_SIZE * 4), %r11
-	jbe	L(zero)
+	subq	$(CHAR_PER_VEC * 4), %rdx
+	jbe	L(ret_zero)
 # endif
-	addq	$(VEC_SIZE * 4), %rax
-	addq	$(VEC_SIZE * 4), %rdx
-L(loop_start):
-	testl	%esi, %esi
-	leal	-1(%esi), %esi
-	je	L(loop_cross_page)
-L(back_to_loop):
-	/* Main loop, comparing 4 vectors are a time.  */
-	VMOVA	(%rax), %YMM0
-	VMOVA	VEC_SIZE(%rax), %YMM2
-	VMOVA	(VEC_SIZE * 2)(%rax), %YMM4
-	VMOVA	(VEC_SIZE * 3)(%rax), %YMM6
+
+	subq	$-(VEC_SIZE * 4), %rdi
+	subq	$-(VEC_SIZE * 4), %rsi
+
+	/* Check if rsi loads will cross a page boundary.  */
+	addl	$-(VEC_SIZE * 4), %eax
+	jnb	L(page_cross_during_loop)
+
+	/* Loop entry after handling page cross during loop.  */
+L(loop_skip_page_cross_check):
+	VMOVA	(VEC_SIZE * 0)(%rdi), %YMM0
+	VMOVA	(VEC_SIZE * 1)(%rdi), %YMM2
+	VMOVA	(VEC_SIZE * 2)(%rdi), %YMM4
+	VMOVA	(VEC_SIZE * 3)(%rdi), %YMM6
 
 	VPMINU	%YMM0, %YMM2, %YMM8
 	VPMINU	%YMM4, %YMM6, %YMM9
 
-	/* A zero CHAR in YMM8 means that there is a null CHAR.  */
-	VPMINU	%YMM8, %YMM9, %YMM8
+	/* A zero CHAR in YMM9 means that there is a null CHAR.  */
+	VPMINU	%YMM8, %YMM9, %YMM9
 
 	/* Each bit set in K1 represents a non-null CHAR in YMM8.  */
-	VPTESTM	%YMM8, %YMM8, %k1
+	VPTESTM	%YMM9, %YMM9, %k1
 
-	/* (YMM ^ YMM): A non-zero CHAR represents a mismatch.  */
-	vpxorq	(%rdx), %YMM0, %YMM1
-	vpxorq	VEC_SIZE(%rdx), %YMM2, %YMM3
-	vpxorq	(VEC_SIZE * 2)(%rdx), %YMM4, %YMM5
-	vpxorq	(VEC_SIZE * 3)(%rdx), %YMM6, %YMM7
+	vpxorq	(VEC_SIZE * 0)(%rsi), %YMM0, %YMM1
+	vpxorq	(VEC_SIZE * 1)(%rsi), %YMM2, %YMM3
+	vpxorq	(VEC_SIZE * 2)(%rsi), %YMM4, %YMM5
+	/* Ternary logic to xor (VEC_SIZE * 3)(%rsi) with YMM6 while
+	   oring with YMM1. Result is stored in YMM6.  */
+	vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM1, %YMM6
 
-	vporq	%YMM1, %YMM3, %YMM9
-	vporq	%YMM5, %YMM7, %YMM10
+	/* Or together YMM3, YMM5, and YMM6.  */
+	vpternlogd $0xfe, %YMM3, %YMM5, %YMM6
 
-	/* A non-zero CHAR in YMM9 represents a mismatch.  */
-	vporq	%YMM9, %YMM10, %YMM9
 
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR.  */
-	VPCMP	$0, %YMMZERO, %YMM9, %k0{%k1}
-	kmovd   %k0, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	je	 L(loop)
+	/* A non-zero CHAR in YMM6 represents a mismatch.  */
+	VPCMP	$0, %YMMZERO, %YMM6, %k0{%k1}
+	kmovd	%k0, %LOOP_REG
 
-	/* Each bit set in K1 represents a non-null CHAR in YMM0.  */
+	TESTEQ	%LOOP_REG
+	jz	L(loop)
+
+
+	/* Find which VEC has the mismatch or end of string.  */
 	VPTESTM	%YMM0, %YMM0, %k1
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR
-	   in YMM0 and (%rdx).  */
 	VPCMP	$0, %YMMZERO, %YMM1, %k0{%k1}
 	kmovd	%k0, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	je	L(test_vec)
-	tzcntl	%ecx, %ecx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %ecx
-# endif
-# ifdef USE_AS_STRNCMP
-	cmpq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# endif
-	ret
+	TESTEQ	%ecx
+	jnz	L(return_vec_0_end)
 
-	.p2align 4
-L(test_vec):
-# ifdef USE_AS_STRNCMP
-	/* The first vector matched.  Return 0 if the maximum offset
-	   (%r11) <= VEC_SIZE.  */
-	cmpq	$VEC_SIZE, %r11
-	jbe	L(zero)
-# endif
-	/* Each bit set in K1 represents a non-null CHAR in YMM2.  */
 	VPTESTM	%YMM2, %YMM2, %k1
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR
-	   in YMM2 and VEC_SIZE(%rdx).  */
 	VPCMP	$0, %YMMZERO, %YMM3, %k0{%k1}
 	kmovd	%k0, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
-# else
-	incl	%ecx
-# endif
-	je	L(test_2_vec)
-	tzcntl	%ecx, %edi
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edi
-# endif
-# ifdef USE_AS_STRNCMP
-	addq	$VEC_SIZE, %rdi
-	cmpq	%rdi, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rdi), %ecx
-	cmpl	(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rdi), %eax
-	movzbl	(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
-# else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	VEC_SIZE(%rsi, %rdi), %ecx
-	cmpl	VEC_SIZE(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	VEC_SIZE(%rax, %rdi), %eax
-	movzbl	VEC_SIZE(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
-# endif
-	ret
+	TESTEQ	%ecx
+	jnz	L(return_vec_1_end)
 
-	.p2align 4
-L(test_2_vec):
+
+	/* Handle VEC 2 and 3 without branches.  */
+L(return_vec_2_3_end):
 # ifdef USE_AS_STRNCMP
-	/* The first 2 vectors matched.  Return 0 if the maximum offset
-	   (%r11) <= 2 * VEC_SIZE.  */
-	cmpq	$(VEC_SIZE * 2), %r11
-	jbe	L(zero)
+	subq	$(CHAR_PER_VEC * 2), %rdx
+	jbe	L(ret_zero_end)
 # endif
-	/* Each bit set in K1 represents a non-null CHAR in YMM4.  */
+
 	VPTESTM	%YMM4, %YMM4, %k1
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR
-	   in YMM4 and (VEC_SIZE * 2)(%rdx).  */
 	VPCMP	$0, %YMMZERO, %YMM5, %k0{%k1}
 	kmovd	%k0, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
+	TESTEQ	%ecx
+# if CHAR_PER_VEC <= 16
+	sall	$CHAR_PER_VEC, %LOOP_REG
+	orl	%ecx, %LOOP_REG
 # else
-	incl	%ecx
+	salq	$CHAR_PER_VEC, %LOOP_REG64
+	orq	%rcx, %LOOP_REG64
+# endif
+L(return_vec_3_end):
+	/* LOOP_REG contains matches for null/mismatch from the loop. If
+	   VEC 0, 1, and 2 all have no null and no mismatches then the
+	   mismatch must entirely be from VEC 3 which is fully represented
+	   by LOOP_REG.  */
+# if CHAR_PER_VEC <= 16
+	tzcntl	%LOOP_REG, %LOOP_REG
+# else
+	tzcntq	%LOOP_REG64, %LOOP_REG64
+# endif
+# ifdef USE_AS_STRNCMP
+	cmpq	%LOOP_REG64, %rdx
+	jbe	L(ret_zero_end)
 # endif
-	je	L(test_3_vec)
-	tzcntl	%ecx, %edi
+
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edi
+	movl	(VEC_SIZE * 2)(%rdi, %LOOP_REG64, SIZE_OF_CHAR), %ecx
+	xorl	%eax, %eax
+	cmpl	(VEC_SIZE * 2)(%rsi, %LOOP_REG64, SIZE_OF_CHAR), %ecx
+	je	L(ret5)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
+# else
+	movzbl	(VEC_SIZE * 2)(%rdi, %LOOP_REG64), %eax
+	movzbl	(VEC_SIZE * 2)(%rsi, %LOOP_REG64), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret5):
+	ret
+
 # ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 2), %rdi
-	cmpq	%rdi, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	.p2align 4,, 2
+L(ret_zero_end):
 	xorl	%eax, %eax
-	movl	(%rsi, %rdi), %ecx
-	cmpl	(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
+	ret
+# endif
+
+
+	/* The L(return_vec_N_end) labels differ from L(return_vec_N) in
+	   that they use the value of `r8` to negate the return value.
+	   This is because the page cross logic can swap `rdi` and `rsi`.  */
+	.p2align 4,, 10
+# ifdef USE_AS_STRNCMP
+L(return_vec_1_end):
+#  if CHAR_PER_VEC <= 16
+	sall	$CHAR_PER_VEC, %ecx
 #  else
-	movzbl	(%rax, %rdi), %eax
-	movzbl	(%rdx, %rdi), %edx
-	subl	%edx, %eax
+	salq	$CHAR_PER_VEC, %rcx
 #  endif
+# endif
+L(return_vec_0_end):
+# if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP)
+	tzcntl	%ecx, %ecx
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rsi, %rdi), %ecx
-	cmpl	(VEC_SIZE * 2)(%rdx, %rdi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rax, %rdi), %eax
-	movzbl	(VEC_SIZE * 2)(%rdx, %rdi), %edx
-	subl	%edx, %eax
-#  endif
+	tzcntq	%rcx, %rcx
 # endif
-	ret
 
-	.p2align 4
-L(test_3_vec):
 # ifdef USE_AS_STRNCMP
-	/* The first 3 vectors matched.  Return 0 if the maximum offset
-	   (%r11) <= 3 * VEC_SIZE.  */
-	cmpq	$(VEC_SIZE * 3), %r11
-	jbe	L(zero)
+	cmpq	%rcx, %rdx
+	jbe	L(ret_zero_end)
 # endif
-	/* Each bit set in K1 represents a non-null CHAR in YMM6.  */
-	VPTESTM	%YMM6, %YMM6, %k1
-	/* Each bit cleared in K0 represents a mismatch or a null CHAR
-	   in YMM6 and (VEC_SIZE * 3)(%rdx).  */
-	VPCMP	$0, %YMMZERO, %YMM7, %k0{%k1}
-	kmovd	%k0, %ecx
+
 # ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
+	movl	(%rdi, %rcx, SIZE_OF_CHAR), %edx
+	xorl	%eax, %eax
+	cmpl	(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret6)
+	setl	%al
+	negl	%eax
+	/* This is the non-zero case for `eax` so just xorl with `r8d`
+	   to flip the sign if `rdi` and `rsi` were swapped.  */
+	xorl	%r8d, %eax
 # else
-	incl	%ecx
+	movzbl	(%rdi, %rcx), %eax
+	movzbl	(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	/* Flip `eax` if `rdi` and `rsi` were swapped in the page cross
+	   logic. Subtract `r8d` after the xor for the zero case.  */
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret6):
+	ret
+
+# ifndef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_vec_1_end):
 	tzcntl	%ecx, %ecx
-# ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %ecx
-# endif
-# ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 3), %rcx
-	cmpq	%rcx, %r11
-	jbe	L(zero)
 #  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	movl	VEC_SIZE(%rdi, %rcx, SIZE_OF_CHAR), %edx
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %esi
-	cmpl	(%rdx, %rcx), %esi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
-# else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 3)(%rsi, %rcx), %esi
-	cmpl	(VEC_SIZE * 3)(%rdx, %rcx), %esi
-	jne	L(wcscmp_return)
+	cmpl	VEC_SIZE(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret7)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 #  else
-	movzbl	(VEC_SIZE * 3)(%rax, %rcx), %eax
-	movzbl	(VEC_SIZE * 3)(%rdx, %rcx), %edx
-	subl	%edx, %eax
+	movzbl	VEC_SIZE(%rdi, %rcx), %eax
+	movzbl	VEC_SIZE(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 #  endif
-# endif
+L(ret7):
 	ret
-
-	.p2align 4
-L(loop_cross_page):
-	xorl	%r10d, %r10d
-	movq	%rdx, %rcx
-	/* Align load via RDX.  We load the extra ECX bytes which should
-	   be ignored.  */
-	andl	$((VEC_SIZE * 4) - 1), %ecx
-	/* R10 is -RCX.  */
-	subq	%rcx, %r10
-
-	/* This works only if VEC_SIZE * 2 == 64. */
-# if (VEC_SIZE * 2) != 64
-#  error (VEC_SIZE * 2) != 64
 # endif
 
-	/* Check if the first VEC_SIZE * 2 bytes should be ignored.  */
-	cmpl	$(VEC_SIZE * 2), %ecx
-	jge	L(loop_cross_page_2_vec)
 
-	VMOVU	(%rax, %r10), %YMM2
-	VMOVU	VEC_SIZE(%rax, %r10), %YMM3
+	/* Page cross in rsi in next 4x VEC.  */
 
-	/* Each bit set in K2 represents a non-null CHAR in YMM2.  */
-	VPTESTM	%YMM2, %YMM2, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM2 and 32 bytes at (%rdx, %r10).  */
-	VPCMP	$0, (%rdx, %r10), %YMM2, %k1{%k2}
-	kmovd	%k1, %r9d
-	/* Don't use subl since it is the lower 16/32 bits of RDI
-	   below.  */
-	notl	%r9d
-# ifdef USE_AS_WCSCMP
-	/* Only last 8 bits are valid.  */
-	andl	$0xff, %r9d
-# endif
+	/* TODO: Improve logic here.  */
+	.p2align 4,, 10
+L(page_cross_during_loop):
+	/* eax contains [distance_from_page - (VEC_SIZE * 4)].  */
 
-	/* Each bit set in K4 represents a non-null CHAR in YMM3.  */
-	VPTESTM	%YMM3, %YMM3, %k4
-	/* Each bit cleared in K3 represents a mismatch or a null CHAR
-	   in YMM3 and 32 bytes at VEC_SIZE(%rdx, %r10).  */
-	VPCMP	$0, VEC_SIZE(%rdx, %r10), %YMM3, %k3{%k4}
-	kmovd	%k3, %edi
-    /* Must use notl %edi here as lower bits are for CHAR
-	   comparisons potentially out of range thus can be 0 without
-	   indicating mismatch.  */
-	notl	%edi
-# ifdef USE_AS_WCSCMP
-	/* Don't use subl since it is the upper 8 bits of EDI below.  */
-	andl	$0xff, %edi
-# endif
+	/* Optimistically rsi and rdi are both aligned, in which case we
+	   don't need any logic here.  */
+	cmpl	$-(VEC_SIZE * 4), %eax
+	/* Don't adjust eax before jumping back to the loop; that way we
+	   will never hit the page cross case again.  */
+	je	L(loop_skip_page_cross_check)
 
-# ifdef USE_AS_WCSCMP
-	/* NB: Each bit in EDI/R9D represents 4-byte element.  */
-	sall	$8, %edi
-	/* NB: Divide shift count by 4 since each bit in K1 represent 4
-	   bytes.  */
-	movl	%ecx, %SHIFT_REG32
-	sarl	$2, %SHIFT_REG32
-
-	/* Each bit in EDI represents a null CHAR or a mismatch.  */
-	orl	%r9d, %edi
-# else
-	salq	$32, %rdi
+	/* Check if we can safely load a VEC.  */
+	cmpl	$-(VEC_SIZE * 3), %eax
+	jle	L(less_1x_vec_till_page_cross)
 
-	/* Each bit in RDI represents a null CHAR or a mismatch.  */
-	orq	%r9, %rdi
-# endif
+	VMOVA	(%rdi), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, (%rsi), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_0_end)
+
+	/* if distance >= 2x VEC then eax > -(VEC_SIZE * 2).  */
+	cmpl	$-(VEC_SIZE * 2), %eax
+	jg	L(more_2x_vec_till_page_cross)
+
+	.p2align 4,, 4
+L(less_1x_vec_till_page_cross):
+	subl	$-(VEC_SIZE * 4), %eax
+	/* Guaranteed safe to read from rdi - VEC_SIZE here. The only
+	   concerning case is the first iteration if incoming s1 was near
+	   the start of a page and s2 near the end. If s1 was near the
+	   start of the page we already aligned up to the nearest
+	   VEC_SIZE * 4 so it is guaranteed safe to read back -VEC_SIZE.
+	   If rdi is truly at the start of a page here, it means the
+	   previous page (rdi - VEC_SIZE) has already been loaded earlier
+	   so must be valid.  */
+	VMOVU	-VEC_SIZE(%rdi, %rax), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, -VEC_SIZE(%rsi, %rax), %YMM0, %k1{%k2}
+
+	/* Mask of potentially valid bits. The lower bits can be out of
+	   range comparisons (but safe regarding page crosses).  */
 
-	/* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
-	shrxq	%SHIFT_REG64, %rdi, %rdi
-	testq	%rdi, %rdi
-	je	L(loop_cross_page_2_vec)
-	tzcntq	%rdi, %rcx
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %ecx
+	movl	$-1, %r10d
+	movl	%esi, %ecx
+	andl	$(VEC_SIZE - 1), %ecx
+	shrl	$2, %ecx
+	shlxl	%ecx, %r10d, %ecx
+	movzbl	%cl, %r10d
+# else
+	movl	$-1, %ecx
+	shlxl	%esi, %ecx, %r10d
 # endif
+
+	kmovd	%k1, %ecx
+	notl	%ecx
+
+
 # ifdef USE_AS_STRNCMP
-	cmpq	%rcx, %r11
-	jbe	L(zero)
 #  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
+	movl	%eax, %r11d
+	shrl	$2, %r11d
+	cmpq	%r11, %rdx
 #  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
+	cmpq	%rax, %rdx
 #  endif
+	jbe	L(return_page_cross_end_check)
+# endif
+	movl	%eax, %OFFSET_REG
+
+	/* Readjust eax before potentially returning to the loop.  */
+	addl	$(PAGE_SIZE - VEC_SIZE * 4), %eax
+
+	andl	%r10d, %ecx
+	jz	L(loop_skip_page_cross_check)
+
+	.p2align 4,, 3
+L(return_page_cross_end):
+	tzcntl	%ecx, %ecx
+
+# if (defined USE_AS_STRNCMP) || (defined USE_AS_WCSCMP)
+	leal	-VEC_SIZE(%OFFSET_REG64, %rcx, SIZE_OF_CHAR), %ecx
+L(return_page_cross_cmp_mem):
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	addl	%OFFSET_REG, %ecx
+# endif
+# ifdef USE_AS_WCSCMP
+	movl	VEC_OFFSET(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	VEC_OFFSET(%rsi, %rcx), %edx
+	je	L(ret8)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
+# else
+	movzbl	VEC_OFFSET(%rdi, %rcx), %eax
+	movzbl	VEC_OFFSET(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret8):
 	ret
 
-	.p2align 4
-L(loop_cross_page_2_vec):
-	/* The first VEC_SIZE * 2 bytes match or are ignored.  */
-	VMOVU	(VEC_SIZE * 2)(%rax, %r10), %YMM0
-	VMOVU	(VEC_SIZE * 3)(%rax, %r10), %YMM1
+# ifdef USE_AS_STRNCMP
+	.p2align 4,, 10
+L(return_page_cross_end_check):
+	tzcntl	%ecx, %ecx
+	leal	-VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx
+#  ifdef USE_AS_WCSCMP
+	sall	$2, %edx
+#  endif
+	cmpl	%ecx, %edx
+	ja	L(return_page_cross_cmp_mem)
+	xorl	%eax, %eax
+	ret
+# endif
+
 
+	.p2align 4,, 10
+L(more_2x_vec_till_page_cross):
+	/* If more than 2x vec till cross, we will complete a full loop
+	   iteration here.  */
+
+	VMOVA	VEC_SIZE(%rdi), %YMM0
 	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rdx, %r10).  */
-	VPCMP	$0, (VEC_SIZE * 2)(%rdx, %r10), %YMM0, %k1{%k2}
-	kmovd	%k1, %r9d
-	/* Don't use subl since it is the lower 16/32 bits of RDI
-	   below.  */
-	notl	%r9d
-# ifdef USE_AS_WCSCMP
-	/* Only last 8 bits are valid.  */
-	andl	$0xff, %r9d
-# endif
+	VPCMP	$0, VEC_SIZE(%rsi), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_1_end)
 
-	VPTESTM	%YMM1, %YMM1, %k4
-	/* Each bit cleared in K3 represents a mismatch or a null CHAR
-	   in YMM1 and 32 bytes at (VEC_SIZE * 3)(%rdx, %r10).  */
-	VPCMP	$0, (VEC_SIZE * 3)(%rdx, %r10), %YMM1, %k3{%k4}
-	kmovd	%k3, %edi
-	/* Must use notl %edi here as lower bits are for CHAR
-	   comparisons potentially out of range thus can be 0 without
-	   indicating mismatch.  */
-	notl	%edi
-# ifdef USE_AS_WCSCMP
-	/* Don't use subl since it is the upper 8 bits of EDI below.  */
-	andl	$0xff, %edi
+# ifdef USE_AS_STRNCMP
+	cmpq	$(CHAR_PER_VEC * 2), %rdx
+	jbe	L(ret_zero_in_loop_page_cross)
 # endif
 
-# ifdef USE_AS_WCSCMP
-	/* NB: Each bit in EDI/R9D represents 4-byte element.  */
-	sall	$8, %edi
+	subl	$-(VEC_SIZE * 4), %eax
 
-	/* Each bit in EDI represents a null CHAR or a mismatch.  */
-	orl	%r9d, %edi
-# else
-	salq	$32, %rdi
+	/* Safe to include comparisons from lower bytes.  */
+	VMOVU	-(VEC_SIZE * 2)(%rdi, %rax), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, -(VEC_SIZE * 2)(%rsi, %rax), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_page_cross_0)
+
+	VMOVU	-(VEC_SIZE * 1)(%rdi, %rax), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, -(VEC_SIZE * 1)(%rsi, %rax), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(return_vec_page_cross_1)
 
-	/* Each bit in RDI represents a null CHAR or a mismatch.  */
-	orq	%r9, %rdi
+# ifdef USE_AS_STRNCMP
+	/* Must check length here as length might preclude reading the
+	   next page.  */
+#  ifdef USE_AS_WCSCMP
+	movl	%eax, %r11d
+	shrl	$2, %r11d
+	cmpq	%r11, %rdx
+#  else
+	cmpq	%rax, %rdx
+#  endif
+	jbe	L(ret_zero_in_loop_page_cross)
 # endif
 
-	xorl	%r8d, %r8d
-	/* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
-	subl	$(VEC_SIZE * 2), %ecx
-	jle	1f
-	/* R8 has number of bytes skipped.  */
-	movl	%ecx, %r8d
-# ifdef USE_AS_WCSCMP
-	/* NB: Divide shift count by 4 since each bit in RDI represent 4
-	   bytes.  */
-	sarl	$2, %ecx
-	/* Skip ECX bytes.  */
-	shrl	%cl, %edi
+	/* Finish the loop.  */
+	VMOVA	(VEC_SIZE * 2)(%rdi), %YMM4
+	VMOVA	(VEC_SIZE * 3)(%rdi), %YMM6
+	VPMINU	%YMM4, %YMM6, %YMM9
+	VPTESTM	%YMM9, %YMM9, %k1
+
+	vpxorq	(VEC_SIZE * 2)(%rsi), %YMM4, %YMM5
+	/* YMM6 = YMM5 | ((VEC_SIZE * 3)(%rsi) ^ YMM6).  */
+	vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM5, %YMM6
+
+	VPCMP	$0, %YMMZERO, %YMM6, %k0{%k1}
+	kmovd	%k0, %LOOP_REG
+	TESTEQ	%LOOP_REG
+	jnz	L(return_vec_2_3_end)
+
+	/* Best for code size to include an unconditional jmp here. If
+	   this case is hot it would be faster to duplicate the
+	   L(return_vec_2_3_end) code as fall-through and jump back to the
+	   loop on the mismatch comparison.  */
+	subq	$-(VEC_SIZE * 4), %rdi
+	subq	$-(VEC_SIZE * 4), %rsi
+	addl	$(PAGE_SIZE - VEC_SIZE * 8), %eax
+# ifdef USE_AS_STRNCMP
+	subq	$(CHAR_PER_VEC * 4), %rdx
+	ja	L(loop_skip_page_cross_check)
+L(ret_zero_in_loop_page_cross):
+	xorl	%eax, %eax
+	ret
 # else
-	/* Skip ECX bytes.  */
-	shrq	%cl, %rdi
+	jmp	L(loop_skip_page_cross_check)
 # endif
-1:
-	/* Before jumping back to the loop, set ESI to the number of
-	   VEC_SIZE * 4 blocks before page crossing.  */
-	movl	$(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
 
-	testq	%rdi, %rdi
-# ifdef USE_AS_STRNCMP
-	/* At this point, if %rdi value is 0, it already tested
-	   VEC_SIZE*4+%r10 byte starting from %rax. This label
-	   checks whether strncmp maximum offset reached or not.  */
-	je	L(string_nbyte_offset_check)
+
+	.p2align 4,, 10
+L(return_vec_page_cross_0):
+	addl	$-VEC_SIZE, %eax
+L(return_vec_page_cross_1):
+	tzcntl	%ecx, %ecx
+# if defined USE_AS_STRNCMP || defined USE_AS_WCSCMP
+	leal	-VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx
+#  ifdef USE_AS_STRNCMP
+#   ifdef USE_AS_WCSCMP
+	/* Must divide ecx instead of multiplying rdx due to overflow.  */
+	movl	%ecx, %eax
+	shrl	$2, %eax
+	cmpq	%rax, %rdx
+#   else
+	cmpq	%rcx, %rdx
+#   endif
+	jbe	L(ret_zero_in_loop_page_cross)
+#  endif
 # else
-	je	L(back_to_loop)
+	addl	%eax, %ecx
 # endif
-	tzcntq	%rdi, %rcx
+
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %ecx
-# endif
-	addq	%r10, %rcx
-	/* Adjust for number of bytes skipped.  */
-	addq	%r8, %rcx
-# ifdef USE_AS_STRNCMP
-	addq	$(VEC_SIZE * 2), %rcx
-	subq	%rcx, %r11
-	jbe	L(zero)
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
+	movl	VEC_OFFSET(%rdi, %rcx), %edx
 	xorl	%eax, %eax
-	movl	(%rsi, %rcx), %edi
-	cmpl	(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rax, %rcx), %eax
-	movzbl	(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	cmpl	VEC_OFFSET(%rsi, %rcx), %edx
+	je	L(ret9)
+	setl	%al
+	negl	%eax
+	xorl	%r8d, %eax
 # else
-#  ifdef USE_AS_WCSCMP
-	movq	%rax, %rsi
-	xorl	%eax, %eax
-	movl	(VEC_SIZE * 2)(%rsi, %rcx), %edi
-	cmpl	(VEC_SIZE * 2)(%rdx, %rcx), %edi
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(VEC_SIZE * 2)(%rax, %rcx), %eax
-	movzbl	(VEC_SIZE * 2)(%rdx, %rcx), %edx
-	subl	%edx, %eax
-#  endif
+	movzbl	VEC_OFFSET(%rdi, %rcx), %eax
+	movzbl	VEC_OFFSET(%rsi, %rcx), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret9):
 	ret
 
-# ifdef USE_AS_STRNCMP
-L(string_nbyte_offset_check):
-	leaq	(VEC_SIZE * 4)(%r10), %r10
-	cmpq	%r10, %r11
-	jbe	L(zero)
-	jmp	L(back_to_loop)
+
+	.p2align 4,, 10
+L(page_cross):
+# ifndef USE_AS_STRNCMP
+	/* If both are VEC aligned we don't need any special logic here.
+	   Only valid for strcmp where the stop condition is guaranteed
+	   to be reachable by just reading memory.  */
+	testl	$((VEC_SIZE - 1) << 20), %eax
+	jz	L(no_page_cross)
 # endif
 
-	.p2align 4
-L(cross_page_loop):
-	/* Check one byte/dword at a time.  */
+	movl	%edi, %eax
+	movl	%esi, %ecx
+	andl	$(PAGE_SIZE - 1), %eax
+	andl	$(PAGE_SIZE - 1), %ecx
+
+	xorl	%OFFSET_REG, %OFFSET_REG
+
+	/* Check which is closer to page cross, s1 or s2.  */
+	cmpl	%eax, %ecx
+	jg	L(page_cross_s2)
+
+	/* The previous page cross check has false positives. Check for
+	   true positive as page cross logic is very expensive.  */
+	subl	$(PAGE_SIZE - VEC_SIZE * 4), %eax
+	jbe	L(no_page_cross)
+
+
+	/* Set r8 to not interfere with normal return value (rdi and rsi
+	   did not swap).  */
 # ifdef USE_AS_WCSCMP
-	cmpl	%ecx, %eax
+	/* Any non-zero positive value that doesn't interfere with 0x1.
+	 */
+	movl	$2, %r8d
 # else
-	subl	%ecx, %eax
+	xorl	%r8d, %r8d
 # endif
-	jne	L(different)
-	addl	$SIZE_OF_CHAR, %edx
-	cmpl	$(VEC_SIZE * 4), %edx
-	je	L(main_loop_header)
+
+	/* Check if less than 1x VEC till page cross.  */
+	subl	$(VEC_SIZE * 3), %eax
+	jg	L(less_1x_vec_till_page)
+
+
+	/* If more than 1x VEC till page cross, loop through safely
+	   loadable memory until within 1x VEC of page cross.  */
+	.p2align 4,, 8
+L(page_cross_loop):
+	VMOVU	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0, %k1{%k2}
+	kmovd	%k1, %ecx
+	TESTEQ	%ecx
+	jnz	L(check_ret_vec_page_cross)
+	addl	$CHAR_PER_VEC, %OFFSET_REG
 # ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross)
 # endif
+	addl	$VEC_SIZE, %eax
+	jl	L(page_cross_loop)
+
 # ifdef USE_AS_WCSCMP
-	movl	(%rdi, %rdx), %eax
-	movl	(%rsi, %rdx), %ecx
-# else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %ecx
+	shrl	$2, %eax
 # endif
-	/* Check null CHAR.  */
-	testl	%eax, %eax
-	jne	L(cross_page_loop)
-	/* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
-	   comparisons.  */
-	subl	%ecx, %eax
-# ifndef USE_AS_WCSCMP
-L(different):
+
+
+	subl	%eax, %OFFSET_REG
+	/* OFFSET_REG has distance to page cross - VEC_SIZE. Guaranteed
+	   to not cross page so is safe to load. Since we have already
+	   loaded at least 1 VEC from rsi it is also guaranteed to be safe.
+	 */
+	VMOVU	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0
+	VPTESTM	%YMM0, %YMM0, %k2
+	VPCMP	$0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0, %k1{%k2}
+
+	kmovd	%k1, %ecx
+# ifdef USE_AS_STRNCMP
+	leal	CHAR_PER_VEC(%OFFSET_REG64), %eax
+	cmpq	%rax, %rdx
+	jbe	L(check_ret_vec_page_cross2)
+#  ifdef USE_AS_WCSCMP
+	addq	$-(CHAR_PER_VEC * 2), %rdx
+#  else
+	addq	%rdi, %rdx
+#  endif
 # endif
-	ret
+	TESTEQ	%ecx
+	jz	L(prepare_loop_no_len)
 
+	.p2align 4,, 4
+L(ret_vec_page_cross):
+# ifndef USE_AS_STRNCMP
+L(check_ret_vec_page_cross):
+# endif
+	tzcntl	%ecx, %ecx
+	addl	%OFFSET_REG, %ecx
+L(ret_vec_page_cross_cont):
 # ifdef USE_AS_WCSCMP
-	.p2align 4
-L(different):
-	/* Use movl to avoid modifying EFLAGS.  */
-	movl	$0, %eax
+	movl	(%rdi, %rcx, SIZE_OF_CHAR), %edx
+	xorl	%eax, %eax
+	cmpl	(%rsi, %rcx, SIZE_OF_CHAR), %edx
+	je	L(ret12)
 	setl	%al
 	negl	%eax
-	orl	$1, %eax
-	ret
+	xorl	%r8d, %eax
+# else
+	movzbl	(%rdi, %rcx, SIZE_OF_CHAR), %eax
+	movzbl	(%rsi, %rcx, SIZE_OF_CHAR), %ecx
+	subl	%ecx, %eax
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 # endif
+L(ret12):
+	ret
+
 
 # ifdef USE_AS_STRNCMP
-	.p2align 4
-L(zero):
+	.p2align 4,, 10
+L(check_ret_vec_page_cross2):
+	TESTEQ	%ecx
+L(check_ret_vec_page_cross):
+	tzcntl	%ecx, %ecx
+	addl	%OFFSET_REG, %ecx
+	cmpq	%rcx, %rdx
+	ja	L(ret_vec_page_cross_cont)
+	.p2align 4,, 2
+L(ret_zero_page_cross):
 	xorl	%eax, %eax
 	ret
+# endif
 
-	.p2align 4
-L(char0):
-#  ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(%rdi), %ecx
-	cmpl	(%rsi), %ecx
-	jne	L(wcscmp_return)
-#  else
-	movzbl	(%rsi), %ecx
-	movzbl	(%rdi), %eax
-	subl	%ecx, %eax
-#  endif
-	ret
+	.p2align 4,, 4
+L(page_cross_s2):
+	/* Ensure this is a true page cross.  */
+	subl	$(PAGE_SIZE - VEC_SIZE * 4), %ecx
+	jbe	L(no_page_cross)
+
+
+	movl	%ecx, %eax
+	movq	%rdi, %rcx
+	movq	%rsi, %rdi
+	movq	%rcx, %rsi
+
+	/* Set r8 to negate return value as rdi and rsi were swapped.  */
+# ifdef USE_AS_WCSCMP
+	movl	$-4, %r8d
+# else
+	movl	$-1, %r8d
 # endif
+	xorl	%OFFSET_REG, %OFFSET_REG
 
-	.p2align 4
-L(last_vector):
-	addq	%rdx, %rdi
-	addq	%rdx, %rsi
-# ifdef USE_AS_STRNCMP
-	subq	%rdx, %r11
+	/* Check if more than 1x VEC till page cross.  */
+	subl	$(VEC_SIZE * 3), %eax
+	jle	L(page_cross_loop)
+
+	.p2align 4,, 6
+L(less_1x_vec_till_page):
+# ifdef USE_AS_WCSCMP
+	shrl	$2, %eax
 # endif
-	tzcntl	%ecx, %edx
+	/* Find largest load size we can use.  */
+	cmpl	$(16 / SIZE_OF_CHAR), %eax
+	ja	L(less_16_till_page)
+
+	/* Use 16 byte comparison.  */
+	vmovdqu	(%rdi), %xmm0
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, (%rsi), %xmm0, %k1{%k2}
+	kmovd	%k1, %ecx
 # ifdef USE_AS_WCSCMP
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %edx
+	subl	$0xf, %ecx
+# else
+	incw	%cx
 # endif
+	jnz	L(check_ret_vec_page_cross)
+	movl	$(16 / SIZE_OF_CHAR), %OFFSET_REG
 # ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subl	%eax, %OFFSET_REG
+# else
+	/* Explicit check for 16 byte alignment.  */
+	subl	%eax, %OFFSET_REG
+	jz	L(prepare_loop)
 # endif
+	vmovdqu	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0, %k1{%k2}
+	kmovd	%k1, %ecx
 # ifdef USE_AS_WCSCMP
-	xorl	%eax, %eax
-	movl	(%rdi, %rdx), %ecx
-	cmpl	(%rsi, %rdx), %ecx
-	jne	L(wcscmp_return)
+	subl	$0xf, %ecx
 # else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %edx
-	subl	%edx, %eax
+	incw	%cx
 # endif
+	jnz	L(check_ret_vec_page_cross)
+# ifdef USE_AS_STRNCMP
+	addl	$(16 / SIZE_OF_CHAR), %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subq	$-(CHAR_PER_VEC * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+# else
+	leaq	(16 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	(16 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+# endif
+	jmp	L(prepare_loop_aligned)
+
+# ifdef USE_AS_STRNCMP
+	.p2align 4,, 2
+L(ret_zero_page_cross_slow_case0):
+	xorl	%eax, %eax
 	ret
+# endif
 
-	/* Comparing on page boundary region requires special treatment:
-	   It must done one vector at the time, starting with the wider
-	   ymm vector if possible, if not, with xmm. If fetching 16 bytes
-	   (xmm) still passes the boundary, byte comparison must be done.
-	 */
-	.p2align 4
-L(cross_page):
-	/* Try one ymm vector at a time.  */
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
-	jg	L(cross_page_1_vector)
-L(loop_1_vector):
-	VMOVU	(%rdi, %rdx), %YMM0
 
-	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in YMM0 and 32 bytes at (%rsi, %rdx).  */
-	VPCMP	$0, (%rsi, %rdx), %YMM0, %k1{%k2}
+	.p2align 4,, 10
+L(less_16_till_page):
+	cmpl	$(24 / SIZE_OF_CHAR), %eax
+	ja	L(less_8_till_page)
+
+	/* Use 8 byte comparison.  */
+	vmovq	(%rdi), %xmm0
+	vmovq	(%rsi), %xmm1
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, %xmm1, %xmm0, %k1{%k2}
 	kmovd	%k1, %ecx
 # ifdef USE_AS_WCSCMP
-	subl	$0xff, %ecx
+	subl	$0x3, %ecx
 # else
-	incl	%ecx
+	incb	%cl
 # endif
-	jne	L(last_vector)
+	jnz	L(check_ret_vec_page_cross)
 
-	addl	$VEC_SIZE, %edx
 
-	addl	$VEC_SIZE, %eax
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$(8 / SIZE_OF_CHAR), %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
 # endif
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
-	jle	L(loop_1_vector)
-L(cross_page_1_vector):
-	/* Less than 32 bytes to check, try one xmm vector.  */
-	cmpl	$(PAGE_SIZE - 16), %eax
-	jg	L(cross_page_1_xmm)
-	VMOVU	(%rdi, %rdx), %XMM0
+	movl	$(24 / SIZE_OF_CHAR), %OFFSET_REG
+	subl	%eax, %OFFSET_REG
 
-	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in XMM0 and 16 bytes at (%rsi, %rdx).  */
-	VPCMP	$0, (%rsi, %rdx), %XMM0, %k1{%k2}
+	vmovq	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0
+	vmovq	(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm1
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, %xmm1, %xmm0, %k1{%k2}
 	kmovd	%k1, %ecx
 # ifdef USE_AS_WCSCMP
-	subl	$0xf, %ecx
+	subl	$0x3, %ecx
 # else
-	subl	$0xffff, %ecx
+	incb	%cl
 # endif
-	jne	L(last_vector)
+	jnz	L(check_ret_vec_page_cross)
+
 
-	addl	$16, %edx
-# ifndef USE_AS_WCSCMP
-	addl	$16, %eax
-# endif
 # ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	addl	$(8 / SIZE_OF_CHAR), %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case0)
+	subq	$-(CHAR_PER_VEC * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+# else
+	leaq	(8 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	(8 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
 # endif
+	jmp	L(prepare_loop_aligned)
 
-L(cross_page_1_xmm):
-# ifndef USE_AS_WCSCMP
-	/* Less than 16 bytes to check, try 8 byte vector.  NB: No need
-	   for wcscmp nor wcsncmp since wide char is 4 bytes.   */
-	cmpl	$(PAGE_SIZE - 8), %eax
-	jg	L(cross_page_8bytes)
-	vmovq	(%rdi, %rdx), %XMM0
-	vmovq	(%rsi, %rdx), %XMM1
 
-	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in XMM0 and XMM1.  */
-	VPCMP	$0, %XMM1, %XMM0, %k1{%k2}
-	kmovb	%k1, %ecx
+
+
+	.p2align 4,, 10
+L(less_8_till_page):
 # ifdef USE_AS_WCSCMP
-	subl	$0x3, %ecx
+	/* If using wchar then this is the only check before we reach
+	   the page boundary.  */
+	movl	(%rdi), %eax
+	movl	(%rsi), %ecx
+	cmpl	%ecx, %eax
+	jnz	L(ret_less_8_wcs)
+#  ifdef USE_AS_STRNCMP
+	addq	$-(CHAR_PER_VEC * 2), %rdx
+	/* We already checked for len <= 1 so cannot hit that case here.
+	 */
+#  endif
+	testl	%eax, %eax
+	jnz	L(prepare_loop)
+	ret
+
+	.p2align 4,, 8
+L(ret_less_8_wcs):
+	setl	%OFFSET_REG8
+	negl	%OFFSET_REG
+	movl	%OFFSET_REG, %eax
+	xorl	%r8d, %eax
+	ret
+
 # else
-	subl	$0xff, %ecx
-# endif
-	jne	L(last_vector)
+	cmpl	$28, %eax
+	ja	L(less_4_till_page)
+
+	vmovd	(%rdi), %xmm0
+	vmovd	(%rsi), %xmm1
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, %xmm1, %xmm0, %k1{%k2}
+	kmovd	%k1, %ecx
+	subl	$0xf, %ecx
+	jnz	L(check_ret_vec_page_cross)
 
-	addl	$8, %edx
-	addl	$8, %eax
 #  ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	cmpq	$4, %rdx
+	jbe	L(ret_zero_page_cross_slow_case1)
 #  endif
+	movl	$(28 / SIZE_OF_CHAR), %OFFSET_REG
+	subl	%eax, %OFFSET_REG
 
-L(cross_page_8bytes):
-	/* Less than 8 bytes to check, try 4 byte vector.  */
-	cmpl	$(PAGE_SIZE - 4), %eax
-	jg	L(cross_page_4bytes)
-	vmovd	(%rdi, %rdx), %XMM0
-	vmovd	(%rsi, %rdx), %XMM1
-
-	VPTESTM	%YMM0, %YMM0, %k2
-	/* Each bit cleared in K1 represents a mismatch or a null CHAR
-	   in XMM0 and XMM1.  */
-	VPCMP	$0, %XMM1, %XMM0, %k1{%k2}
+	vmovd	(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0
+	vmovd	(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm1
+	VPTESTM	%xmm0, %xmm0, %k2
+	VPCMP	$0, %xmm1, %xmm0, %k1{%k2}
 	kmovd	%k1, %ecx
-# ifdef USE_AS_WCSCMP
-	subl	$0x1, %ecx
-# else
 	subl	$0xf, %ecx
-# endif
-	jne	L(last_vector)
+	jnz	L(check_ret_vec_page_cross)
+#  ifdef USE_AS_STRNCMP
+	addl	$(4 / SIZE_OF_CHAR), %OFFSET_REG
+	subq	%OFFSET_REG64, %rdx
+	jbe	L(ret_zero_page_cross_slow_case1)
+	subq	$-(CHAR_PER_VEC * 4), %rdx
+
+	leaq	-(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	-(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+#  else
+	leaq	(4 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi
+	leaq	(4 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi
+#  endif
+	jmp	L(prepare_loop_aligned)
+
 
-	addl	$4, %edx
 #  ifdef USE_AS_STRNCMP
-	/* Return 0 if the current offset (%rdx) >= the maximum offset
-	   (%r11).  */
-	cmpq	%r11, %rdx
-	jae	L(zero)
+	.p2align 4,, 2
+L(ret_zero_page_cross_slow_case1):
+	xorl	%eax, %eax
+	ret
 #  endif
 
-L(cross_page_4bytes):
-# endif
-	/* Less than 4 bytes to check, try one byte/dword at a time.  */
-# ifdef USE_AS_STRNCMP
-	cmpq	%r11, %rdx
-	jae	L(zero)
-# endif
-# ifdef USE_AS_WCSCMP
-	movl	(%rdi, %rdx), %eax
-	movl	(%rsi, %rdx), %ecx
-# else
-	movzbl	(%rdi, %rdx), %eax
-	movzbl	(%rsi, %rdx), %ecx
-# endif
-	testl	%eax, %eax
-	jne	L(cross_page_loop)
+	.p2align 4,, 10
+L(less_4_till_page):
+	subq	%rdi, %rsi
+	/* Extremely slow byte comparison loop.  */
+L(less_4_loop):
+	movzbl	(%rdi), %eax
+	movzbl	(%rsi, %rdi), %ecx
 	subl	%ecx, %eax
+	jnz	L(ret_less_4_loop)
+	testl	%ecx, %ecx
+	jz	L(ret_zero_4_loop)
+#  ifdef USE_AS_STRNCMP
+	decq	%rdx
+	jz	L(ret_zero_4_loop)
+#  endif
+	incq	%rdi
+	/* End condition is reaching the page boundary (rdi is aligned).  */
+	testl	$31, %edi
+	jnz	L(less_4_loop)
+	leaq	-(VEC_SIZE * 4)(%rdi, %rsi), %rsi
+	addq	$-(VEC_SIZE * 4), %rdi
+#  ifdef USE_AS_STRNCMP
+	subq	$-(CHAR_PER_VEC * 4), %rdx
+#  endif
+	jmp	L(prepare_loop_aligned)
+
+L(ret_zero_4_loop):
+	xorl	%eax, %eax
+	ret
+L(ret_less_4_loop):
+	xorl	%r8d, %eax
+	subl	%r8d, %eax
 	ret
-END (STRCMP)
+# endif
+END(STRCMP)
 #endif
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 7/7] benchtests: Add more coverage for strcmp and strncmp benchmarks
  2022-01-10 21:35 ` [PATCH v3 " Noah Goldstein
                     ` (4 preceding siblings ...)
  2022-01-10 21:35   ` [PATCH v3 6/7] x86: Optimize strcmp-evex.S Noah Goldstein
@ 2022-01-10 21:35   ` Noah Goldstein
  2022-01-11  2:15   ` [PATCH v3 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755] H.J. Lu
  6 siblings, 0 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-10 21:35 UTC (permalink / raw)
  To: libc-alpha

Add more small and medium-sized tests for strcmp and strncmp.

For strcmp, also add an option for more direct control of
alignment. Previously, alignment was always pushed to the end of the
page. While this is the most difficult case to implement, it is far
from the common case and so shouldn't be the only benchmark.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 benchtests/bench-strcmp.c  | 142 ++++++++++++++++++++++++++-----------
 benchtests/bench-strncmp.c | 110 ++++++++++++++++++++--------
 2 files changed, 183 insertions(+), 69 deletions(-)

diff --git a/benchtests/bench-strcmp.c b/benchtests/bench-strcmp.c
index 387e76fcfb..3a60edfb15 100644
--- a/benchtests/bench-strcmp.c
+++ b/benchtests/bench-strcmp.c
@@ -99,8 +99,8 @@ do_one_test (json_ctx_t *json_ctx, impl_t *impl,
 }
 
 static void
-do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, int
-	 max_char, int exp_result)
+do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len,
+         int max_char, int exp_result, int at_end)
 {
   size_t i;
 
@@ -109,19 +109,28 @@ do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, int
   if (len == 0)
     return;
 
-  align1 &= 63;
+  align1 &= ~(CHARBYTES - 1);
+  align2 &= ~(CHARBYTES - 1);
+
+  align1 &= (getpagesize () - 1);
   if (align1 + (len + 1) * CHARBYTES >= page_size)
     return;
 
-  align2 &= 63;
+  align2 &= (getpagesize () - 1);
   if (align2 + (len + 1) * CHARBYTES >= page_size)
     return;
 
   /* Put them close to the end of page.  */
-  i = align1 + CHARBYTES * (len + 2);
-  s1 = (CHAR *) (buf1 + ((page_size - i) / 16 * 16) + align1);
-  i = align2 + CHARBYTES * (len + 2);
-  s2 = (CHAR *) (buf2 + ((page_size - i) / 16 * 16)  + align2);
+  if (at_end)
+    {
+      i = align1 + CHARBYTES * (len + 2);
+      align1 = ((page_size - i) / 16 * 16) + align1;
+      i = align2 + CHARBYTES * (len + 2);
+      align2 = ((page_size - i) / 16 * 16) + align2;
+    }
+
+  s1 = (CHAR *)(buf1 + align1);
+  s2 = (CHAR *)(buf2 + align2);
 
   for (i = 0; i < len; i++)
     s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
@@ -132,9 +141,9 @@ do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, int
   s2[len - 1] -= exp_result;
 
   json_element_object_begin (json_ctx);
-  json_attr_uint (json_ctx, "length", (double) len);
-  json_attr_uint (json_ctx, "align1", (double) align1);
-  json_attr_uint (json_ctx, "align2", (double) align2);
+  json_attr_uint (json_ctx, "length", (double)len);
+  json_attr_uint (json_ctx, "align1", (double)align1);
+  json_attr_uint (json_ctx, "align2", (double)align2);
   json_array_begin (json_ctx, "timings");
 
   FOR_EACH_IMPL (impl, 0)
@@ -202,7 +211,8 @@ int
 test_main (void)
 {
   json_ctx_t json_ctx;
-  size_t i;
+  size_t i, j, k;
+  size_t pg_sz = getpagesize ();
 
   test_init ();
 
@@ -221,36 +231,88 @@ test_main (void)
   json_array_end (&json_ctx);
 
   json_array_begin (&json_ctx, "results");
-
-  for (i = 1; i < 32; ++i)
-    {
-      do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 0);
-      do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 1);
-      do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, -1);
-    }
-
-  for (i = 1; i < 10 + CHARBYTESLOG; ++i)
+  for (k = 0; k < 2; ++k)
     {
-      do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 0);
-      do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 0);
-      do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 1);
-      do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 1);
-      do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, -1);
-      do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, -1);
-      do_test (&json_ctx, 0, CHARBYTES * i, 2 << i, MIDCHAR, 1);
-      do_test (&json_ctx, CHARBYTES * i, CHARBYTES * (i + 1), 2 << i, LARGECHAR, 1);
+      for (i = 1; i < 32; ++i)
+        {
+          do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 0, k);
+          do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 1, k);
+          do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, -1, k);
+        }
+
+      for (i = 1; i <= 8192;)
+        {
+          /* No page crosses.  */
+          do_test (&json_ctx, 0, 0, i, MIDCHAR, 0, k);
+          do_test (&json_ctx, i * CHARBYTES, 0, i, MIDCHAR, 0, k);
+          do_test (&json_ctx, 0, i * CHARBYTES, i, MIDCHAR, 0, k);
+
+          /* False page crosses.  */
+          do_test (&json_ctx, pg_sz / 2, pg_sz / 2 - CHARBYTES, i, MIDCHAR, 0,
+                   k);
+          do_test (&json_ctx, pg_sz / 2 - CHARBYTES, pg_sz / 2, i, MIDCHAR, 0,
+                   k);
+
+          do_test (&json_ctx, pg_sz - (i * CHARBYTES), 0, i, MIDCHAR, 0, k);
+          do_test (&json_ctx, 0, pg_sz - (i * CHARBYTES), i, MIDCHAR, 0, k);
+
+          /* Real page cross.  */
+          for (j = 16; j < 128; j += 16)
+            {
+              do_test (&json_ctx, pg_sz - j, 0, i, MIDCHAR, 0, k);
+              do_test (&json_ctx, 0, pg_sz - j, i, MIDCHAR, 0, k);
+
+              do_test (&json_ctx, pg_sz - j, pg_sz - j / 2, i, MIDCHAR, 0, k);
+              do_test (&json_ctx, pg_sz - j / 2, pg_sz - j, i, MIDCHAR, 0, k);
+            }
+
+          if (i < 32)
+            {
+              ++i;
+            }
+          else if (i < 160)
+            {
+              i += 8;
+            }
+          else if (i < 512)
+            {
+              i += 32;
+            }
+          else
+            {
+              i *= 2;
+            }
+        }
+
+      for (i = 1; i < 10 + CHARBYTESLOG; ++i)
+        {
+          do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 0, k);
+          do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 0, k);
+          do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 1, k);
+          do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 1, k);
+          do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, -1, k);
+          do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, -1, k);
+          do_test (&json_ctx, 0, CHARBYTES * i, 2 << i, MIDCHAR, 1, k);
+          do_test (&json_ctx, CHARBYTES * i, CHARBYTES * (i + 1), 2 << i,
+                   LARGECHAR, 1, k);
+        }
+
+      for (i = 1; i < 8; ++i)
+        {
+          do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i,
+                   MIDCHAR, 0, k);
+          do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i,
+                   LARGECHAR, 0, k);
+          do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i,
+                   MIDCHAR, 1, k);
+          do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i,
+                   LARGECHAR, 1, k);
+          do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i,
+                   MIDCHAR, -1, k);
+          do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i,
+                   LARGECHAR, -1, k);
+        }
     }
-
-  for (i = 1; i < 8; ++i)
-    {
-      do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, MIDCHAR, 0);
-      do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, 0);
-      do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, MIDCHAR, 1);
-      do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, 1);
-      do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, MIDCHAR, -1);
-      do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, -1);
-    }
-
   do_test_page_boundary (&json_ctx);
 
   json_array_end (&json_ctx);
diff --git a/benchtests/bench-strncmp.c b/benchtests/bench-strncmp.c
index b7a01fde64..6673a53521 100644
--- a/benchtests/bench-strncmp.c
+++ b/benchtests/bench-strncmp.c
@@ -150,43 +150,43 @@ do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, size_t
   if (n == 0)
     return;
 
-  align1 &= 63;
+  align1 &= getpagesize () - 1;
   if (align1 + (n + 1) * CHARBYTES >= page_size)
     return;
 
-  align2 &= 7;
+  align2 &= getpagesize () - 1;
   if (align2 + (n + 1) * CHARBYTES >= page_size)
     return;
 
   json_element_object_begin (json_ctx);
-  json_attr_uint (json_ctx, "strlen", (double) len);
-  json_attr_uint (json_ctx, "len", (double) n);
-  json_attr_uint (json_ctx, "align1", (double) align1);
-  json_attr_uint (json_ctx, "align2", (double) align2);
+  json_attr_uint (json_ctx, "strlen", (double)len);
+  json_attr_uint (json_ctx, "len", (double)n);
+  json_attr_uint (json_ctx, "align1", (double)align1);
+  json_attr_uint (json_ctx, "align2", (double)align2);
   json_array_begin (json_ctx, "timings");
 
   FOR_EACH_IMPL (impl, 0)
-    {
-      alloc_bufs ();
-      s1 = (CHAR *) (buf1 + align1);
-      s2 = (CHAR *) (buf2 + align2);
-
-      for (i = 0; i < n; i++)
-	s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
-
-      s1[n] = 24 + exp_result;
-      s2[n] = 23;
-      s1[len] = 0;
-      s2[len] = 0;
-      if (exp_result < 0)
-	s2[len] = 32;
-      else if (exp_result > 0)
-	s1[len] = 64;
-      if (len >= n)
-	s2[n - 1] -= exp_result;
+  {
+    alloc_bufs ();
+    s1 = (CHAR *)(buf1 + align1);
+    s2 = (CHAR *)(buf2 + align2);
+
+    for (i = 0; i < n; i++)
+      s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char;
+
+    s1[n] = 24 + exp_result;
+    s2[n] = 23;
+    s1[len] = 0;
+    s2[len] = 0;
+    if (exp_result < 0)
+      s2[len] = 32;
+    else if (exp_result > 0)
+      s1[len] = 64;
+    if (len >= n)
+      s2[n - 1] -= exp_result;
 
-      do_one_test (json_ctx, impl, s1, s2, n, exp_result);
-    }
+    do_one_test (json_ctx, impl, s1, s2, n, exp_result);
+  }
 
   json_array_end (json_ctx);
   json_element_object_end (json_ctx);
@@ -319,7 +319,8 @@ int
 test_main (void)
 {
   json_ctx_t json_ctx;
-  size_t i;
+  size_t i, j, len;
+  size_t pg_sz = getpagesize ();
 
   test_init ();
 
@@ -334,12 +335,12 @@ test_main (void)
 
   json_array_begin (&json_ctx, "ifuncs");
   FOR_EACH_IMPL (impl, 0)
-    json_element_string (&json_ctx, impl->name);
+  json_element_string (&json_ctx, impl->name);
   json_array_end (&json_ctx);
 
   json_array_begin (&json_ctx, "results");
 
-  for (i =0; i < 16; ++i)
+  for (i = 0; i < 16; ++i)
     {
       do_test (&json_ctx, 0, 0, 8, i, 127, 0);
       do_test (&json_ctx, 0, 0, 8, i, 127, -1);
@@ -361,6 +362,57 @@ test_main (void)
       do_test (&json_ctx, i, 3 * i, 8, i, 255, -1);
     }
 
+  for (len = 0; len <= 128; len += 64)
+    {
+      for (i = 1; i <= 8192;)
+        {
+          /* No page crosses.  */
+          do_test (&json_ctx, 0, 0, i, i + len, 127, 0);
+          do_test (&json_ctx, i * CHARBYTES, 0, i, i + len, 127, 0);
+          do_test (&json_ctx, 0, i * CHARBYTES, i, i + len, 127, 0);
+
+          /* False page crosses.  */
+          do_test (&json_ctx, pg_sz / 2, pg_sz / 2 - CHARBYTES, i, i + len,
+                   127, 0);
+          do_test (&json_ctx, pg_sz / 2 - CHARBYTES, pg_sz / 2, i, i + len,
+                   127, 0);
+
+          do_test (&json_ctx, pg_sz - (i * CHARBYTES), 0, i, i + len, 127,
+                   0);
+          do_test (&json_ctx, 0, pg_sz - (i * CHARBYTES), i, i + len, 127,
+                   0);
+
+          /* Real page cross.  */
+          for (j = 16; j < 128; j += 16)
+            {
+              do_test (&json_ctx, pg_sz - j, 0, i, i + len, 127, 0);
+              do_test (&json_ctx, 0, pg_sz - j, i, i + len, 127, 0);
+
+              do_test (&json_ctx, pg_sz - j, pg_sz - j / 2, i, i + len,
+                       127, 0);
+              do_test (&json_ctx, pg_sz - j / 2, pg_sz - j, i, i + len,
+                       127, 0);
+            }
+
+          if (i < 32)
+            {
+              ++i;
+            }
+          else if (i < 160)
+            {
+              i += 8;
+            }
+          else if (i < 256)
+            {
+              i += 32;
+            }
+          else
+            {
+              i *= 2;
+            }
+        }
+    }
+
   for (i = 1; i < 8; ++i)
     {
       do_test (&json_ctx, 0, 0, 8 << i, 16 << i, 127, 0);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
  2022-01-10 21:35 ` [PATCH v3 " Noah Goldstein
                     ` (5 preceding siblings ...)
  2022-01-10 21:35   ` [PATCH v3 7/7] benchtests: Add more coverage for strcmp and strncmp benchmarks Noah Goldstein
@ 2022-01-11  2:15   ` H.J. Lu
  2022-01-26 22:05     ` H.J. Lu
  6 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2022-01-11  2:15 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library

On Mon, Jan 10, 2022 at 1:36 PM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
> __wcscmp_avx2. For x86_64 this covers the entire address range so any
> length larger could not possibly be used to bound `s1` or `s2`.
>
> test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
>
> Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> ---
>  sysdeps/x86_64/multiarch/strcmp-avx2.S | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> index a45f9d2749..9c73b5899d 100644
> --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> @@ -87,6 +87,16 @@ ENTRY (STRCMP)
>         je      L(char0)
>         jb      L(zero)
>  #  ifdef USE_AS_WCSCMP
> +#  ifndef __ILP32__
> +       movq    %rdx, %rcx
> +       /* Check if length could overflow when multiplied by
> +          sizeof(wchar_t). Checking top 8 bits will cover all potential
> +          overflow cases as well as redirect cases where its impossible to
> +          length to bound a valid memory region. In these cases just use
> +          'wcscmp'.  */
> +       shrq    $56, %rcx
> +       jnz     __wcscmp_avx2
> +#  endif
>         /* Convert units: from wide to byte char.  */
>         shl     $2, %RDX_LP
>  #  endif
> --
> 2.25.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S [BZ# 28755]
  2022-01-10 21:35   ` [PATCH v3 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S " Noah Goldstein
@ 2022-01-11  2:15     ` H.J. Lu
  2022-01-26 22:04       ` H.J. Lu
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2022-01-11  2:15 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library

On Mon, Jan 10, 2022 at 1:36 PM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
> __wcscmp_evex. For x86_64 this covers the entire address range so any
> length larger could not possibly be used to bound `s1` or `s2`.
>
> test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
>
> Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> ---
>  sysdeps/x86_64/multiarch/strcmp-evex.S | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
> index 1d971f3889..0cd939d5af 100644
> --- a/sysdeps/x86_64/multiarch/strcmp-evex.S
> +++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
> @@ -104,6 +104,16 @@ ENTRY (STRCMP)
>         je      L(char0)
>         jb      L(zero)
>  #  ifdef USE_AS_WCSCMP
> +#  ifndef __ILP32__
> +       movq    %rdx, %rcx
> +       /* Check if length could overflow when multiplied by
> +          sizeof(wchar_t). Checking top 8 bits will cover all potential
> +          overflow cases as well as redirect cases where its impossible to
> +          length to bound a valid memory region. In these cases just use
> +          'wcscmp'.  */
> +       shrq    $56, %rcx
> +       jnz     __wcscmp_evex
> +#  endif
>         /* Convert units: from wide to byte char.  */
>         shl     $2, %RDX_LP
>  #  endif
> --
> 2.25.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S [BZ# 28755]
  2022-01-11  2:15     ` H.J. Lu
@ 2022-01-26 22:04       ` H.J. Lu
  2022-04-29 22:05         ` Sunil Pandey
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2022-01-26 22:04 UTC (permalink / raw)
  To: Noah Goldstein, GNU C Library; +Cc: Libc-stable Mailing List

On Mon, Jan 10, 2022 at 6:15 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Jan 10, 2022 at 1:36 PM Noah Goldstein via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> > Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
> > __wcscmp_evex. For x86_64 this covers the entire address range so any
> > length larger could not possibly be used to bound `s1` or `s2`.
> >
> > test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
> >
> > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > ---
> >  sysdeps/x86_64/multiarch/strcmp-evex.S | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
> > index 1d971f3889..0cd939d5af 100644
> > --- a/sysdeps/x86_64/multiarch/strcmp-evex.S
> > +++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
> > @@ -104,6 +104,16 @@ ENTRY (STRCMP)
> >         je      L(char0)
> >         jb      L(zero)
> >  #  ifdef USE_AS_WCSCMP
> > +#  ifndef __ILP32__
> > +       movq    %rdx, %rcx
> > +       /* Check if length could overflow when multiplied by
> > +          sizeof(wchar_t). Checking top 8 bits will cover all potential
> > +          overflow cases as well as redirect cases where its impossible to
> > +          length to bound a valid memory region. In these cases just use
> > +          'wcscmp'.  */
> > +       shrq    $56, %rcx
> > +       jnz     __wcscmp_evex
> > +#  endif
> >         /* Convert units: from wide to byte char.  */
> >         shl     $2, %RDX_LP
> >  #  endif
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

I am backporting this to 2.34 branch.

-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
  2022-01-11  2:15   ` [PATCH v3 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755] H.J. Lu
@ 2022-01-26 22:05     ` H.J. Lu
  2022-01-27  4:29       ` H.J. Lu
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2022-01-26 22:05 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Libc-stable Mailing List

On Mon, Jan 10, 2022 at 6:15 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Jan 10, 2022 at 1:36 PM Noah Goldstein via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> > Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
> > __wcscmp_avx2. For x86_64 this covers the entire address range so any
> > length larger could not possibly be used to bound `s1` or `s2`.
> >
> > test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
> >
> > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > ---
> >  sysdeps/x86_64/multiarch/strcmp-avx2.S | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > index a45f9d2749..9c73b5899d 100644
> > --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > @@ -87,6 +87,16 @@ ENTRY (STRCMP)
> >         je      L(char0)
> >         jb      L(zero)
> >  #  ifdef USE_AS_WCSCMP
> > +#  ifndef __ILP32__
> > +       movq    %rdx, %rcx
> > +       /* Check if length could overflow when multiplied by
> > +          sizeof(wchar_t). Checking top 8 bits will cover all potential
> > +          overflow cases as well as redirect cases where its impossible to
> > +          length to bound a valid memory region. In these cases just use
> > +          'wcscmp'.  */
> > +       shrq    $56, %rcx
> > +       jnz     __wcscmp_avx2
> > +#  endif
> >         /* Convert units: from wide to byte char.  */
> >         shl     $2, %RDX_LP
> >  #  endif
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

I am backporting this to 2.34 branch.


-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
  2022-01-26 22:05     ` H.J. Lu
@ 2022-01-27  4:29       ` H.J. Lu
  2022-01-27  5:10         ` H.J. Lu
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2022-01-27  4:29 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Libc-stable Mailing List

On Wed, Jan 26, 2022 at 2:05 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Jan 10, 2022 at 6:15 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Mon, Jan 10, 2022 at 1:36 PM Noah Goldstein via Libc-alpha
> > <libc-alpha@sourceware.org> wrote:
> > >
> > > Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
> > > __wcscmp_avx2. For x86_64 this covers the entire address range so any
> > > length larger could not possibly be used to bound `s1` or `s2`.
> > >
> > > test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
> > >
> > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > > ---
> > >  sysdeps/x86_64/multiarch/strcmp-avx2.S | 10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > >
> > > diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > index a45f9d2749..9c73b5899d 100644
> > > --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > @@ -87,6 +87,16 @@ ENTRY (STRCMP)
> > >         je      L(char0)
> > >         jb      L(zero)
> > >  #  ifdef USE_AS_WCSCMP
> > > +#  ifndef __ILP32__
> > > +       movq    %rdx, %rcx
> > > +       /* Check if length could overflow when multiplied by
> > > +          sizeof(wchar_t). Checking top 8 bits will cover all potential
> > > +          overflow cases as well as redirect cases where its impossible to
> > > +          length to bound a valid memory region. In these cases just use
> > > +          'wcscmp'.  */
> > > +       shrq    $56, %rcx
> > > +       jnz     __wcscmp_avx2
> > > +#  endif
> > >         /* Convert units: from wide to byte char.  */
> > >         shl     $2, %RDX_LP
> > >  #  endif
> > > --
> > > 2.25.1
> > >
> >
> > LGTM.
> >
> > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
> >
> > Thanks.
> >
> > --
> > H.J.
>
> I am backporting this to 2.34 branch.
>

I am backporting this to 2.33 branch.

-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
  2022-01-27  4:29       ` H.J. Lu
@ 2022-01-27  5:10         ` H.J. Lu
  2022-01-27  5:52           ` Noah Goldstein
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2022-01-27  5:10 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Libc-stable Mailing List

On Wed, Jan 26, 2022 at 8:29 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Jan 26, 2022 at 2:05 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Mon, Jan 10, 2022 at 6:15 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Mon, Jan 10, 2022 at 1:36 PM Noah Goldstein via Libc-alpha
> > > <libc-alpha@sourceware.org> wrote:
> > > >
> > > > Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
> > > > __wcscmp_avx2. For x86_64 this covers the entire address range so any
> > > > length larger could not possibly be used to bound `s1` or `s2`.
> > > >
> > > > test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
> > > >
> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > ---
> > > >  sysdeps/x86_64/multiarch/strcmp-avx2.S | 10 ++++++++++
> > > >  1 file changed, 10 insertions(+)
> > > >
> > > > diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > > index a45f9d2749..9c73b5899d 100644
> > > > --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > > +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > > @@ -87,6 +87,16 @@ ENTRY (STRCMP)
> > > >         je      L(char0)
> > > >         jb      L(zero)
> > > >  #  ifdef USE_AS_WCSCMP
> > > > +#  ifndef __ILP32__
> > > > +       movq    %rdx, %rcx
> > > > +       /* Check if length could overflow when multiplied by
> > > > +          sizeof(wchar_t). Checking top 8 bits will cover all potential
> > > > +          overflow cases as well as redirect cases where its impossible to
> > > > +          length to bound a valid memory region. In these cases just use
> > > > +          'wcscmp'.  */
> > > > +       shrq    $56, %rcx
> > > > +       jnz     __wcscmp_avx2
> > > > +#  endif
> > > >         /* Convert units: from wide to byte char.  */
> > > >         shl     $2, %RDX_LP
> > > >  #  endif
> > > > --
> > > > 2.25.1
> > > >
> > >
> > > LGTM.
> > >
> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
> > >
> > > Thanks.
> > >
> > > --
> > > H.J.
> >
> > I am backporting this to 2.34 branch.
> >
>
> I am backporting this to 2.33 branch.
>

I am backporting this to all affected release branches.

-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
  2022-01-27  5:10         ` H.J. Lu
@ 2022-01-27  5:52           ` Noah Goldstein
  0 siblings, 0 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-01-27  5:52 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library, Libc-stable Mailing List

On Wed, Jan 26, 2022 at 11:11 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Jan 26, 2022 at 8:29 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Wed, Jan 26, 2022 at 2:05 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Mon, Jan 10, 2022 at 6:15 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > > On Mon, Jan 10, 2022 at 1:36 PM Noah Goldstein via Libc-alpha
> > > > <libc-alpha@sourceware.org> wrote:
> > > > >
> > > > > Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
> > > > > __wcscmp_avx2. For x86_64 this covers the entire address range so any
> > > > > length larger could not possibly be used to bound `s1` or `s2`.
> > > > >
> > > > > test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
> > > > >
> > > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > > ---
> > > > >  sysdeps/x86_64/multiarch/strcmp-avx2.S | 10 ++++++++++
> > > > >  1 file changed, 10 insertions(+)
> > > > >
> > > > > diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > > > index a45f9d2749..9c73b5899d 100644
> > > > > --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > > > +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> > > > > @@ -87,6 +87,16 @@ ENTRY (STRCMP)
> > > > >         je      L(char0)
> > > > >         jb      L(zero)
> > > > >  #  ifdef USE_AS_WCSCMP
> > > > > +#  ifndef __ILP32__
> > > > > +       movq    %rdx, %rcx
> > > > > +       /* Check if length could overflow when multiplied by
> > > > > +          sizeof(wchar_t). Checking top 8 bits will cover all potential
> > > > > +          overflow cases as well as redirect cases where its impossible to
> > > > > +          length to bound a valid memory region. In these cases just use
> > > > > +          'wcscmp'.  */
> > > > > +       shrq    $56, %rcx
> > > > > +       jnz     __wcscmp_avx2
> > > > > +#  endif
> > > > >         /* Convert units: from wide to byte char.  */
> > > > >         shl     $2, %RDX_LP
> > > > >  #  endif
> > > > > --
> > > > > 2.25.1
> > > > >
> > > >
> > > > LGTM.
> > > >
> > > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
> > > >
> > > > Thanks.
> > > >
> > > > --
> > > > H.J.
> > >
> > > I am backporting this to 2.34 branch.
> > >
> >
> > I am backporting this to 2.33 branch.
> >
>
> I am backporting this to all affected release branches.

Should we also backport the stuff for [BZ #27974]?
It was essentially the same bug.

The two commits that fixed the issues where:

commit a775a7a3eb1e85b54af0b4ee5ff4dcf66772a1fb
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Jun 23 01:56:29 2021 -0400

    x86: Fix overflow bug in wcsnlen-sse4_1 and wcsnlen-avx2 [BZ #27974]

and

commit 645a158978f9520e74074e8c14047503be4db0f0
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Jun 9 16:25:32 2021 -0400

    x86: Fix overflow bug with wmemchr-sse2 and wmemchr-avx2 [BZ #27974]


The only thing is the avx2 fixes are based on some other changes to the file.

>
> --
> H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-01-10 21:35   ` [PATCH v3 5/7] x86: Optimize strcmp-avx2.S Noah Goldstein
@ 2022-02-14 14:10     ` Andreas Schwab
  2022-02-14 18:23       ` H.J. Lu
  0 siblings, 1 reply; 59+ messages in thread
From: Andreas Schwab @ 2022-02-14 14:10 UTC (permalink / raw)
  To: Noah Goldstein via Libc-alpha

I'm seeing erroneous behaviour with this.  There are random cases of
misbehaviour on build workers with AVX2, for example:

https://build.opensuse.org/package/live_build_log/home:Andreas_Schwab:glibc/glibc:cross-riscv64/f/x86_64

riscv64-suse-linux-gcc: error: unrecognized command-line option '-frounding-math'
make[2]: *** [../o-iterator.mk:9: /home/abuild/rpmbuild/BUILD/glibc-2.35.9000.58.g7912236f4a/cc-base/time/tzset.o] Error 1
make[2]: *** Waiting for unfinished jobs....

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-14 14:10     ` Andreas Schwab
@ 2022-02-14 18:23       ` H.J. Lu
  2022-02-14 19:16         ` Andreas Schwab
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2022-02-14 18:23 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Noah Goldstein via Libc-alpha

On Mon, Feb 14, 2022 at 6:10 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> I'm seeing erroneous behaviour with this.  There are random cases of
> misbehaviour on build workers with AVX2, for example:
>
> https://build.opensuse.org/package/live_build_log/home:Andreas_Schwab:glibc/glibc:cross-riscv64/f/x86_64
>
> riscv64-suse-linux-gcc: error: unrecognized command-line option '-frounding-math'
> make[2]: *** [../o-iterator.mk:9: /home/abuild/rpmbuild/BUILD/glibc-2.35.9000.58.g7912236f4a/cc-base/time/tzset.o] Error 1
> make[2]: *** Waiting for unfinished jobs....

How reproducible is it?

-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-14 18:23       ` H.J. Lu
@ 2022-02-14 19:16         ` Andreas Schwab
  2022-02-14 19:30           ` H.J. Lu
  0 siblings, 1 reply; 59+ messages in thread
From: Andreas Schwab @ 2022-02-14 19:16 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Noah Goldstein via Libc-alpha

On Feb 14 2022, H.J. Lu wrote:

> On Mon, Feb 14, 2022 at 6:10 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>>
>> I'm seeing erroneous behaviour with this.  There are random cases of
>> misbehaviour on build workers with AVX2, for example:
>>
>> https://build.opensuse.org/package/live_build_log/home:Andreas_Schwab:glibc/glibc:cross-riscv64/f/x86_64
>>
>> riscv64-suse-linux-gcc: error: unrecognized command-line option '-frounding-math'
>> make[2]: *** [../o-iterator.mk:9: /home/abuild/rpmbuild/BUILD/glibc-2.35.9000.58.g7912236f4a/cc-base/time/tzset.o] Error 1
>> make[2]: *** Waiting for unfinished jobs....
>
> How reproducible is it?

100%.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-14 19:16         ` Andreas Schwab
@ 2022-02-14 19:30           ` H.J. Lu
  2022-02-14 19:35             ` Andreas Schwab
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2022-02-14 19:30 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Noah Goldstein via Libc-alpha

On Mon, Feb 14, 2022 at 11:16 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> On Feb 14 2022, H.J. Lu wrote:
>
> > On Mon, Feb 14, 2022 at 6:10 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> >>
> >> I'm seeing erroneous behaviour with this.  There are random cases of
> >> misbehaviour on build workers with AVX2, for example:
> >>
> >> https://build.opensuse.org/package/live_build_log/home:Andreas_Schwab:glibc/glibc:cross-riscv64/f/x86_64
> >>
> >> riscv64-suse-linux-gcc: error: unrecognized command-line option '-frounding-math'
> >> make[2]: *** [../o-iterator.mk:9: /home/abuild/rpmbuild/BUILD/glibc-2.35.9000.58.g7912236f4a/cc-base/time/tzset.o] Error 1
> >> make[2]: *** Waiting for unfinished jobs....
> >
> > How reproducible is it?
>
> 100%.
>

Can I reproduce it with scripts/build-many-glibcs.py on
any machine which uses strcmp-avx2.S?


-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-14 19:30           ` H.J. Lu
@ 2022-02-14 19:35             ` Andreas Schwab
  2022-02-14 20:59               ` H.J. Lu
  0 siblings, 1 reply; 59+ messages in thread
From: Andreas Schwab @ 2022-02-14 19:35 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Noah Goldstein via Libc-alpha

On Feb 14 2022, H.J. Lu wrote:

> On Mon, Feb 14, 2022 at 11:16 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>>
>> On Feb 14 2022, H.J. Lu wrote:
>>
>> > On Mon, Feb 14, 2022 at 6:10 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>> >>
>> >> I'm seeing erroneous behaviour with this.  There are random cases of
>> >> misbehaviour on build workers with AVX2, for example:
>> >>
>> >> https://build.opensuse.org/package/live_build_log/home:Andreas_Schwab:glibc/glibc:cross-riscv64/f/x86_64
>> >>
>> >> riscv64-suse-linux-gcc: error: unrecognized command-line option '-frounding-math'
>> >> make[2]: *** [../o-iterator.mk:9: /home/abuild/rpmbuild/BUILD/glibc-2.35.9000.58.g7912236f4a/cc-base/time/tzset.o] Error 1
>> >> make[2]: *** Waiting for unfinished jobs....
>> >
>> > How reproducible is it?
>>
>> 100%.
>>
>
> Can I reproduce it with scripts/build-many-glibcs.py on
> any machine which uses strcmp-avx2.S?

Maybe.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-14 19:35             ` Andreas Schwab
@ 2022-02-14 20:59               ` H.J. Lu
  2022-02-14 21:10                 ` H.J. Lu
  2022-02-14 23:42                 ` Noah Goldstein
  0 siblings, 2 replies; 59+ messages in thread
From: H.J. Lu @ 2022-02-14 20:59 UTC (permalink / raw)
  To: Andreas Schwab, Sunil K Pandey
  Cc: Noah Goldstein via Libc-alpha, Noah Goldstein

On Mon, Feb 14, 2022 at 11:35 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> On Feb 14 2022, H.J. Lu wrote:
>
> > On Mon, Feb 14, 2022 at 11:16 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> >>
> >> On Feb 14 2022, H.J. Lu wrote:
> >>
> >> > On Mon, Feb 14, 2022 at 6:10 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> >> >>
> >> >> I'm seeing erroneous behaviour with this.  There are random cases of
> >> >> misbehaviour on build workers with AVX2, for example:
> >> >>
> >> >> https://build.opensuse.org/package/live_build_log/home:Andreas_Schwab:glibc/glibc:cross-riscv64/f/x86_64
> >> >>
> >> >> riscv64-suse-linux-gcc: error: unrecognized command-line option '-frounding-math'
> >> >> make[2]: *** [../o-iterator.mk:9: /home/abuild/rpmbuild/BUILD/glibc-2.35.9000.58.g7912236f4a/cc-base/time/tzset.o] Error 1
> >> >> make[2]: *** Waiting for unfinished jobs....
> >> >
> >> > How reproducible is it?
> >>
> >> 100%.
> >>
> >
> > Can I reproduce it with scripts/build-many-glibcs.py on
> > any machine which uses strcmp-avx2.S?
>
> Maybe.
>

I can't reproduce it.  It sounds very similar to

https://sourceware.org/bugzilla/show_bug.cgi?id=28646

The failure can only be triggered by a specific setup.
Noah, can you figure out what went wrong?

-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-14 20:59               ` H.J. Lu
@ 2022-02-14 21:10                 ` H.J. Lu
  2022-02-15 11:11                   ` Andreas Schwab
  2022-02-15 12:55                   ` Andreas Schwab
  2022-02-14 23:42                 ` Noah Goldstein
  1 sibling, 2 replies; 59+ messages in thread
From: H.J. Lu @ 2022-02-14 21:10 UTC (permalink / raw)
  To: Andreas Schwab, Sunil K Pandey
  Cc: Noah Goldstein via Libc-alpha, Noah Goldstein

On Mon, Feb 14, 2022 at 12:59 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Feb 14, 2022 at 11:35 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> >
> > On Feb 14 2022, H.J. Lu wrote:
> >
> > > On Mon, Feb 14, 2022 at 11:16 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> > >>
> > >> On Feb 14 2022, H.J. Lu wrote:
> > >>
> > >> > On Mon, Feb 14, 2022 at 6:10 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> > >> >>
> > >> >> I'm seeing erroneous behaviour with this.  There are random cases of
> > >> >> misbehaviour on build workers with AVX2, for example:
> > >> >>
> > >> >> https://build.opensuse.org/package/live_build_log/home:Andreas_Schwab:glibc/glibc:cross-riscv64/f/x86_64
> > >> >>
> > >> >> riscv64-suse-linux-gcc: error: unrecognized command-line option '-frounding-math'
> > >> >> make[2]: *** [../o-iterator.mk:9: /home/abuild/rpmbuild/BUILD/glibc-2.35.9000.58.g7912236f4a/cc-base/time/tzset.o] Error 1
> > >> >> make[2]: *** Waiting for unfinished jobs....
> > >> >
> > >> > How reproducible is it?
> > >>
> > >> 100%.
> > >>
> > >
> > > Can I reproduce it with scripts/build-many-glibcs.py on
> > > any machine which uses strcmp-avx2.S?
> >
> > Maybe.
> >
>
> I can't reproduce it.  It sounds very similar to
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=28646
>
> The failure can only be triggered by a specific setup.

Andreas, I need your help to create a testcase.  You
can build a special glibc and use it to build riscv64 glibc.
In the special glibc, you compare AVX2 strcmp result
against SSE2 strcmp.  If they don't match, do

asm ("hlt")

with a core dump.  Then use gdb to get 2 pointers
with their contents and addresses.  I can extract
a testcase from this info.
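
Roughly, such a checking wrapper might look like the sketch below (only
an illustration, not the actual patch; the wrapper's name is hypothetical,
and as noted further down the thread, comparing raw return values can
flag benign magnitude differences):

/* Hypothetical checking wrapper along the lines described above.
   __strcmp_avx2 and __strcmp_sse2 are assumed to be the glibc ifunc
   variants; the wrapper itself exists only for this illustration.  */
extern int __strcmp_avx2 (const char *, const char *);
extern int __strcmp_sse2 (const char *, const char *);

int
__strcmp_avx2_check (const char *a, const char *b)
{
  int n1 = __strcmp_avx2 (a, b);
  int n2 = __strcmp_sse2 (a, b);
  if (n1 != n2)
    /* Trap so the core dump captures a and b for a testcase.  */
    asm ("hlt");
  return n1;
}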

> Noah, can you figure out what went wrong?
>

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-14 20:59               ` H.J. Lu
  2022-02-14 21:10                 ` H.J. Lu
@ 2022-02-14 23:42                 ` Noah Goldstein
  2022-02-15 10:43                   ` Andreas Schwab
  2022-02-15 11:22                   ` Andreas Schwab
  1 sibling, 2 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-02-14 23:42 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Andreas Schwab, Sunil K Pandey, Noah Goldstein via Libc-alpha

On Mon, Feb 14, 2022 at 3:00 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Feb 14, 2022 at 11:35 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> >
> > On Feb 14 2022, H.J. Lu wrote:
> >
> > > On Mon, Feb 14, 2022 at 11:16 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> > >>
> > >> On Feb 14 2022, H.J. Lu wrote:
> > >>
> > >> > On Mon, Feb 14, 2022 at 6:10 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> > >> >>
> > >> >> I'm seeing erroneous behaviour with this.  There are random cases of
> > >> >> misbehaviour on build workers with AVX2, for example:
> > >> >>
> > >> >> https://build.opensuse.org/package/live_build_log/home:Andreas_Schwab:glibc/glibc:cross-riscv64/f/x86_64
> > >> >>
> > >> >> riscv64-suse-linux-gcc: error: unrecognized command-line option '-frounding-math'
> > >> >> make[2]: *** [../o-iterator.mk:9: /home/abuild/rpmbuild/BUILD/glibc-2.35.9000.58.g7912236f4a/cc-base/time/tzset.o] Error 1
> > >> >> make[2]: *** Waiting for unfinished jobs....
> > >> >
> > >> > How reproducible is it?
> > >>
> > >> 100%.
> > >>
> > >
> > > Can I reproduce it with scripts/build-many-glibcs.py on
> > > any machine which uses strcmp-avx2.S?
> >
> > Maybe.
> >
>
> I can't reproduce it.  It sounds very similar to
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=28646

Do you change the ifunc to prefer avx2?

>
> The failure can only be triggered by a specific setup.
> Noah, can you figure out what went wrong?

Looking into it. Andreas where can I see the build command you used?

>
> --
> H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-14 23:42                 ` Noah Goldstein
@ 2022-02-15 10:43                   ` Andreas Schwab
  2022-02-15 11:22                   ` Andreas Schwab
  1 sibling, 0 replies; 59+ messages in thread
From: Andreas Schwab @ 2022-02-15 10:43 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: H.J. Lu, Sunil K Pandey, Noah Goldstein via Libc-alpha

On Feb 14 2022, Noah Goldstein wrote:

> Looking into it. Andreas where can I see the build command you used?

You can find all logs here:
https://build.opensuse.org/project/monitor/home:Andreas_Schwab:glibc?defaults=0&succeeded=1&failed=1&arch_x86_64=1&repo_f=1

(but it didn't fail today as the job was picked up by a worker without
AVX2).

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-14 21:10                 ` H.J. Lu
@ 2022-02-15 11:11                   ` Andreas Schwab
  2022-02-15 12:55                   ` Andreas Schwab
  1 sibling, 0 replies; 59+ messages in thread
From: Andreas Schwab @ 2022-02-15 11:11 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Sunil K Pandey, Noah Goldstein via Libc-alpha, Noah Goldstein

On Feb 14 2022, H.J. Lu wrote:

> Andreas, I need your help to create a testcase.  You
> can build a special glibc and use it to build riscv64 glibc.
> In the special glibc, you compare AVX2 strcmp result
> against SSE2 strcmp.  If they don't match, do
>
> asm ("hlt")

I tried
https://build.opensuse.org/package/view_file/home:Andreas_Schwab:glibc:test/glibc/strcmp-avx2_w.patch
but that never triggers.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-14 23:42                 ` Noah Goldstein
  2022-02-15 10:43                   ` Andreas Schwab
@ 2022-02-15 11:22                   ` Andreas Schwab
  2022-02-15 11:28                     ` Noah Goldstein
  1 sibling, 1 reply; 59+ messages in thread
From: Andreas Schwab @ 2022-02-15 11:22 UTC (permalink / raw)
  To: Noah Goldstein via Libc-alpha

On Feb 14 2022, Noah Goldstein via Libc-alpha wrote:

> Looking into it. Andreas where can I see the build command you used?

You can find a failing log here:

https://build.opensuse.org/package/live_build_log/home:Andreas_Schwab:glibc:test/glibc:cross-riscv64/f/x86_64

The error appears to depend on the exact memory layout.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-15 11:22                   ` Andreas Schwab
@ 2022-02-15 11:28                     ` Noah Goldstein
  2022-02-15 12:24                       ` Andreas Schwab
  0 siblings, 1 reply; 59+ messages in thread
From: Noah Goldstein @ 2022-02-15 11:28 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Noah Goldstein via Libc-alpha, H.J. Lu

On Tue, Feb 15, 2022 at 5:22 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> On Feb 14 2022, Noah Goldstein via Libc-alpha wrote:
>
> > Looking into it. Andreas where can I see the build command you used?
>
> You can find a failing log here:
>
> https://build.opensuse.org/package/live_build_log/home:Andreas_Schwab:glibc:test/glibc:cross-riscv64/f/x86_64
>
> The error appears to depend on the exact memory layout.

Did the build still succeed? It may be strncmp/wcscmp/wcsncmp.
>
> --
> Andreas Schwab, schwab@linux-m68k.org
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> "And now for something completely different."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-15 11:28                     ` Noah Goldstein
@ 2022-02-15 12:24                       ` Andreas Schwab
  0 siblings, 0 replies; 59+ messages in thread
From: Andreas Schwab @ 2022-02-15 12:24 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: Noah Goldstein via Libc-alpha, H.J. Lu

On Feb 15 2022, Noah Goldstein wrote:

> It may be strncmp

That's it.  With the strncmp wrapper it triggers even more:

https://build.opensuse.org/project/monitor/home:Andreas_Schwab:glibc:test?arch_x86_64=1&defaults=0&failed=1&repo_f=1

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-14 21:10                 ` H.J. Lu
  2022-02-15 11:11                   ` Andreas Schwab
@ 2022-02-15 12:55                   ` Andreas Schwab
  2022-02-15 12:58                     ` Noah Goldstein
  1 sibling, 1 reply; 59+ messages in thread
From: Andreas Schwab @ 2022-02-15 12:55 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Sunil K Pandey, Noah Goldstein via Libc-alpha, Noah Goldstein

#0  0x00007f4fd4a61df3 in __strncmp_avx2_w (a=0x55aaa1906ffa "tst-tlsmod%", 
    b=0x55aaa1940fed "tst-tls-manydynamic73mod", c=10)
    at ../sysdeps/x86_64/multiarch/strncmp-avx2_w.c:11
11        if (n1 != n2) asm("hlt");

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-15 12:55                   ` Andreas Schwab
@ 2022-02-15 12:58                     ` Noah Goldstein
  2022-02-15 13:09                       ` Noah Goldstein
  0 siblings, 1 reply; 59+ messages in thread
From: Noah Goldstein @ 2022-02-15 12:58 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: H.J. Lu, Sunil K Pandey, Noah Goldstein via Libc-alpha

On Tue, Feb 15, 2022 at 6:55 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> #0  0x00007f4fd4a61df3 in __strncmp_avx2_w (a=0x55aaa1906ffa "tst-tlsmod%",
>     b=0x55aaa1940fed "tst-tls-manydynamic73mod", c=10)
>     at ../sysdeps/x86_64/multiarch/strncmp-avx2_w.c:11
> 11        if (n1 != n2) asm("hlt");

I'll check that input thanks!

One thing: your check has some false positives, since all that matters
is that n1 and n2 have the same zero/non-zero status or the same sign.
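
Concretely, a check without those false positives compares only the sign
class of each result (illustration only, not the actual wrapper code):

/* Illustration only: strcmp/strncmp results are specified only up to
   sign, so two correct implementations may return different magnitudes.
   Normalize each result to -1/0/1 before comparing.  */
static int
sign_of (int x)
{
  return (x > 0) - (x < 0);
}

static int
results_agree (int n1, int n2)
{
  return sign_of (n1) == sign_of (n2);
}

In a checking wrapper this would replace the raw `if (n1 != n2)` test.
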
>
> --
> Andreas Schwab, schwab@linux-m68k.org
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> "And now for something completely different."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-15 12:58                     ` Noah Goldstein
@ 2022-02-15 13:09                       ` Noah Goldstein
  2022-02-15 13:32                         ` Noah Goldstein
  0 siblings, 1 reply; 59+ messages in thread
From: Noah Goldstein @ 2022-02-15 13:09 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: H.J. Lu, Sunil K Pandey, Noah Goldstein via Libc-alpha

On Tue, Feb 15, 2022 at 6:58 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Tue, Feb 15, 2022 at 6:55 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> >
> > #0  0x00007f4fd4a61df3 in __strncmp_avx2_w (a=0x55aaa1906ffa "tst-tlsmod%",
> >     b=0x55aaa1940fed "tst-tls-manydynamic73mod", c=10)
> >     at ../sysdeps/x86_64/multiarch/strncmp-avx2_w.c:11
> > 11        if (n1 != n2) asm("hlt");
>
> I'll check that input, thanks!
>
> One thing: your check has some false positives, since all that matters
> is that n1 / n2 have the same zero/non-zero status or the same sign.

Confirmed. Sorry for the bug, will ping back when fix is up.
> >
> > --
> > Andreas Schwab, schwab@linux-m68k.org
> > GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> > "And now for something completely different."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-15 13:09                       ` Noah Goldstein
@ 2022-02-15 13:32                         ` Noah Goldstein
  2022-02-15 13:37                           ` Noah Goldstein
  0 siblings, 1 reply; 59+ messages in thread
From: Noah Goldstein @ 2022-02-15 13:32 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: H.J. Lu, Sunil K Pandey, Noah Goldstein via Libc-alpha

On Tue, Feb 15, 2022 at 7:09 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Tue, Feb 15, 2022 at 6:58 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Tue, Feb 15, 2022 at 6:55 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> > >
> > > #0  0x00007f4fd4a61df3 in __strncmp_avx2_w (a=0x55aaa1906ffa "tst-tlsmod%",
> > >     b=0x55aaa1940fed "tst-tls-manydynamic73mod", c=10)
> > >     at ../sysdeps/x86_64/multiarch/strncmp-avx2_w.c:11
> > > 11        if (n1 != n2) asm("hlt");
> >
> > I'll check that input, thanks!
> >
> > One thing: your check has some false positives, since all that matters
> > is that n1 / n2 have the same zero/non-zero status or the same sign.
>
> Confirmed. Sorry for the bug, will ping back when fix is up.

Found a bug (hopefully the bug) in strncmp. Did you see this at all
in strcmp-avx2 or was it just the commit you were referencing?
> > >
> > > --
> > > Andreas Schwab, schwab@linux-m68k.org
> > > GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> > > "And now for something completely different."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-15 13:32                         ` Noah Goldstein
@ 2022-02-15 13:37                           ` Noah Goldstein
  2022-02-15 16:33                             ` Noah Goldstein
  0 siblings, 1 reply; 59+ messages in thread
From: Noah Goldstein @ 2022-02-15 13:37 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: H.J. Lu, Sunil K Pandey, Noah Goldstein via Libc-alpha

On Tue, Feb 15, 2022 at 7:32 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Tue, Feb 15, 2022 at 7:09 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Tue, Feb 15, 2022 at 6:58 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Tue, Feb 15, 2022 at 6:55 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> > > >
> > > > #0  0x00007f4fd4a61df3 in __strncmp_avx2_w (a=0x55aaa1906ffa "tst-tlsmod%",
> > > >     b=0x55aaa1940fed "tst-tls-manydynamic73mod", c=10)
> > > >     at ../sysdeps/x86_64/multiarch/strncmp-avx2_w.c:11
> > > > 11        if (n1 != n2) asm("hlt");
> > >
> > > I'll check that input, thanks!
> > >
> > > One thing: your check has some false positives, since all that matters
> > > is that n1 / n2 have the same zero/non-zero status or the same sign.
> >
> > Confirmed. Sorry for the bug, will ping back when fix is up.
>
> Found a bug (hopefully the bug) in strncmp. Did you see this at all
> in strcmp-avx2 or was it just the commit you were referencing?

Made bugzilla for the one I found at least:
https://sourceware.org/bugzilla/show_bug.cgi?id=28895
> > > >
> > > > --
> > > > Andreas Schwab, schwab@linux-m68k.org
> > > > GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> > > > "And now for something completely different."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
  2022-02-15 13:37                           ` Noah Goldstein
@ 2022-02-15 16:33                             ` Noah Goldstein
  0 siblings, 0 replies; 59+ messages in thread
From: Noah Goldstein @ 2022-02-15 16:33 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: H.J. Lu, Sunil K Pandey, Noah Goldstein via Libc-alpha

On Tue, Feb 15, 2022 at 7:37 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Tue, Feb 15, 2022 at 7:32 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Tue, Feb 15, 2022 at 7:09 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Tue, Feb 15, 2022 at 6:58 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > >
> > > > On Tue, Feb 15, 2022 at 6:55 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> > > > >
> > > > > #0  0x00007f4fd4a61df3 in __strncmp_avx2_w (a=0x55aaa1906ffa "tst-tlsmod%",
> > > > >     b=0x55aaa1940fed "tst-tls-manydynamic73mod", c=10)
> > > > >     at ../sysdeps/x86_64/multiarch/strncmp-avx2_w.c:11
> > > > > 11        if (n1 != n2) asm("hlt");
> > > >
> > > > I'll check that input, thanks!
> > > >
> > > > One thing: your check has some false positives, since all that matters
> > > > is that n1 / n2 have the same zero/non-zero status or the same sign.
> > >
> > > Confirmed. Sorry for the bug, will ping back when fix is up.
> >
> > Found a bug (hopefully the bug) in strncmp. Did you see this at all
> > in strcmp-avx2 or was it just the commit you were referencing?
>
> Made bugzilla for the one I found at least:
> https://sourceware.org/bugzilla/show_bug.cgi?id=28895

Hopefully this is the fix:
https://patchwork.sourceware.org/project/glibc/patch/20220215162829.282223-1-goldstein.w.n@gmail.com/

> > > > >
> > > > > --
> > > > > Andreas Schwab, schwab@linux-m68k.org
> > > > > GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> > > > > "And now for something completely different."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S [BZ# 28755]
  2022-01-26 22:04       ` H.J. Lu
@ 2022-04-29 22:05         ` Sunil Pandey
  0 siblings, 0 replies; 59+ messages in thread
From: Sunil Pandey @ 2022-04-29 22:05 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library, Libc-stable Mailing List

On Wed, Jan 26, 2022 at 2:06 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Mon, Jan 10, 2022 at 6:15 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Mon, Jan 10, 2022 at 1:36 PM Noah Goldstein via Libc-alpha
> > <libc-alpha@sourceware.org> wrote:
> > >
> > > Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
> > > __wcscmp_evex. For x86_64 this covers the entire address range so any
> > > length larger could not possibly be used to bound `s1` or `s2`.
> > >
> > > test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
> > >
> > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > > ---
> > >  sysdeps/x86_64/multiarch/strcmp-evex.S | 10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > >
> > > diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
> > > index 1d971f3889..0cd939d5af 100644
> > > --- a/sysdeps/x86_64/multiarch/strcmp-evex.S
> > > +++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
> > > @@ -104,6 +104,16 @@ ENTRY (STRCMP)
> > >         je      L(char0)
> > >         jb      L(zero)
> > >  #  ifdef USE_AS_WCSCMP
> > > +#  ifndef __ILP32__
> > > +       movq    %rdx, %rcx
> > > +       /* Check if the length could overflow when multiplied by
> > > +          sizeof(wchar_t). Checking the top 8 bits covers all potential
> > > +          overflow cases as well as redirecting cases where it is
> > > +          impossible for the length to bound a valid memory region. In
> > > +          those cases just use 'wcscmp'.  */
> > > +       shrq    $56, %rcx
> > > +       jnz     __wcscmp_evex
> > > +#  endif
> > >         /* Convert units: from wide to byte char.  */
> > >         shl     $2, %RDX_LP
> > >  #  endif
> > > --
> > > 2.25.1
> > >
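
(In rough C terms, the guard added above amounts to the following.
This is an illustrative sketch rather than code from the patch, and it
assumes the usual 4-byte wchar_t on x86_64; s1, s2 and len stand for
the wcsncmp arguments.)

    /* If any of the top 8 bits of the length are set, converting the
       length to bytes could overflow, and no valid object can be that
       large anyway, so fall back to the length-unbounded wcscmp.  */
    if (len >> 56 != 0)
      return __wcscmp_evex (s1, s2);
    len *= sizeof (wchar_t);   /* the `shl $2, %RDX_LP` step */
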
> >
> > LGTM.
> >
> > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
> >
> > Thanks.
> >
> > --
> > H.J.
>
> I am backporting this to 2.34 branch.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2022-04-29 22:06 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-09 12:29 [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755] Noah Goldstein
2022-01-09 12:29 ` [PATCH v1 2/5] x86: Optimize strcmp-evex.S " Noah Goldstein
2022-01-09 12:29 ` [PATCH v1 3/5] string: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp] Noah Goldstein
2022-01-09 12:29 ` [PATCH v1 4/5] string: Improve coverage in test-strcmp.c and test-strncmp.c Noah Goldstein
2022-01-09 12:29 ` [PATCH v1 5/5] benchtests: Add more coverage for strcmp and strncmp benchmarks Noah Goldstein
2022-01-09 12:35 ` [PATCH v1 1/5] x86: Optimize strcmp-avx2.S and fix for [BZ# 28755] Noah Goldstein
2022-01-09 14:07   ` H.J. Lu
2022-01-10  0:29     ` Noah Goldstein
2022-01-10  0:27 ` [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S " Noah Goldstein
2022-01-10  0:27   ` [PATCH v2 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S " Noah Goldstein
2022-01-10  0:35     ` H.J. Lu
2022-01-10  0:27   ` [PATCH v2 3/7] string/test-str*cmp: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp] Noah Goldstein
2022-01-10  0:37     ` H.J. Lu
2022-01-10  0:27   ` [PATCH v2 4/7] string: Improve coverage in test-strcmp.c and test-strncmp.c Noah Goldstein
2022-01-10  0:38     ` H.J. Lu
2022-01-10  2:51       ` Noah Goldstein
2022-01-10  0:27   ` [PATCH v2 5/7] x86: Optimize strcmp-avx2.S Noah Goldstein
2022-01-10  0:41     ` H.J. Lu
2022-01-10  1:06       ` Noah Goldstein
2022-01-10  1:58         ` H.J. Lu
2022-01-10  2:54           ` Noah Goldstein
2022-01-10  0:27   ` [PATCH v2 6/7] x86: Optimize strcmp-evex.S Noah Goldstein
2022-01-10  0:41     ` H.J. Lu
2022-01-10  0:27   ` [PATCH v2 7/7] benchtests: Add more coverage for strcmp and strncmp benchmarks Noah Goldstein
2022-01-10  0:34   ` [PATCH v2 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755] H.J. Lu
2022-01-10 21:35 ` [PATCH v3 " Noah Goldstein
2022-01-10 21:35   ` [PATCH v3 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S " Noah Goldstein
2022-01-11  2:15     ` H.J. Lu
2022-01-26 22:04       ` H.J. Lu
2022-04-29 22:05         ` Sunil Pandey
2022-01-10 21:35   ` [PATCH v3 3/7] string/test-str*cmp: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp] Noah Goldstein
2022-01-10 21:35   ` [PATCH v3 4/7] string: Improve coverage in test-strcmp.c and test-strncmp.c Noah Goldstein
2022-01-10 21:35   ` [PATCH v3 5/7] x86: Optimize strcmp-avx2.S Noah Goldstein
2022-02-14 14:10     ` Andreas Schwab
2022-02-14 18:23       ` H.J. Lu
2022-02-14 19:16         ` Andreas Schwab
2022-02-14 19:30           ` H.J. Lu
2022-02-14 19:35             ` Andreas Schwab
2022-02-14 20:59               ` H.J. Lu
2022-02-14 21:10                 ` H.J. Lu
2022-02-15 11:11                   ` Andreas Schwab
2022-02-15 12:55                   ` Andreas Schwab
2022-02-15 12:58                     ` Noah Goldstein
2022-02-15 13:09                       ` Noah Goldstein
2022-02-15 13:32                         ` Noah Goldstein
2022-02-15 13:37                           ` Noah Goldstein
2022-02-15 16:33                             ` Noah Goldstein
2022-02-14 23:42                 ` Noah Goldstein
2022-02-15 10:43                   ` Andreas Schwab
2022-02-15 11:22                   ` Andreas Schwab
2022-02-15 11:28                     ` Noah Goldstein
2022-02-15 12:24                       ` Andreas Schwab
2022-01-10 21:35   ` [PATCH v3 6/7] x86: Optimize strcmp-evex.S Noah Goldstein
2022-01-10 21:35   ` [PATCH v3 7/7] benchtests: Add more coverage for strcmp and strncmp benchmarks Noah Goldstein
2022-01-11  2:15   ` [PATCH v3 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755] H.J. Lu
2022-01-26 22:05     ` H.J. Lu
2022-01-27  4:29       ` H.J. Lu
2022-01-27  5:10         ` H.J. Lu
2022-01-27  5:52           ` Noah Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).