public inbox for libc-alpha@sourceware.org
From: Noah Goldstein <goldstein.w.n@gmail.com>
To: libc-alpha@sourceware.org
Subject: [PATCH v1 3/3] x86: Optimize memchr-evex.S
Date: Mon,  3 May 2021 04:44:38 -0400
Message-ID: <20210503084435.160548-3-goldstein.w.n@gmail.com>
In-Reply-To: <20210503084435.160548-1-goldstein.w.n@gmail.com>

No bug. This commit optimizes memchr-evex.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict,
saving some ALU in the alignment process, and most importantly
increasing ILP in the 4x loop. test-memchr, test-rawmemchr, and
test-wmemchr are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
Tests were run on the following CPUs:

Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

All times are the geometric mean of N=20. The unit of time is
seconds.
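
As a point of reference for how such a figure is aggregated, here is a
minimal Python sketch of a geometric mean over a list of timings. The
function name `geometric_mean` is hypothetical, and the benchmark
harness used for this patch may compute it differently.

```python
import math


def geometric_mean(times):
    """Geometric mean of positive timings, computed via logs to avoid overflow."""
    if not times:
        raise ValueError("need at least one timing")
    if any(t <= 0 for t in times):
        raise ValueError("timings must be positive")
    # exp(mean(log(t))) equals the n-th root of the product of the timings.
    return math.exp(sum(math.log(t) for t in times) / len(times))
```

Compared with an arithmetic mean, a geometric mean is less sensitive to
a single slow outlier run.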

"Cur" refers to the current implementation
"New" refers to this patch's implementation
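
The "Win" and "Dif" columns in the result tables appear to follow
directly from "Cur T" and "New T". A minimal Python sketch of that
relationship follows; it assumes Dif is the absolute difference of the
two printed times rounded to two decimals, and the function name
`compare` is hypothetical.

```python
def compare(cur_t, new_t):
    """Return (winner, dif) for one row, given the Cur and New times."""
    # Assumption: Dif is |Cur T - New T| rounded to two decimal places.
    dif = round(abs(cur_t - new_t), 2)
    if dif == 0:
        return "Eq", dif
    # The lower time wins.
    return ("New" if new_t < cur_t else "Cur"), dif
```

For example, a row with Cur T = 5.58 and New T = 5.22 yields a win for
New with a difference of 0.36.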

Note: The numbers for size = [1, 32] are highly dependent on function
alignment. That said, the new implementation uses cmovcc instead of a
branch for the [1, 32] case (mostly because the branch showed high
variance across alignments); it is far more consistent and performs
about as well. It should be a bigger improvement still in cases where
the sizes / positions are not 100% predictable.
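
To illustrate the idea of a conditional select in place of a branch for
the short-length case, the Python sketch below models only the data
flow: the candidate index is always computed, then either kept or
replaced by "not found" depending on the length. This is not the
patch's assembly. The names `VEC_SIZE`, `match_mask`, `tzcnt`, and
`first_vec_select` are hypothetical, and the 32-byte vector width is an
assumption.

```python
VEC_SIZE = 32  # assumed vector width in bytes


def match_mask(block, c):
    """Bit i is set when block[i] == c; models a vector compare plus mask move."""
    mask = 0
    for i, b in enumerate(block[:VEC_SIZE]):
        mask |= (b == c) << i
    return mask


def tzcnt(mask, width=VEC_SIZE):
    """Count trailing zero bits; returns width when no bit is set."""
    mask |= 1 << width  # sentinel bit so an empty mask yields width
    return (mask & -mask).bit_length() - 1


def first_vec_select(block, c, n):
    """Index of the first c within the first n bytes of block, else None."""
    idx = tzcnt(match_mask(block, c))
    # The selection below stands in for cmovcc: both outcomes are already
    # available and one is chosen from the comparison result.
    return idx if idx < n else None
```

With `block = bytes(range(32))`, `first_vec_select(block, 5, 10)` finds
the match, while `first_vec_select(block, 5, 5)` reports no match
because the candidate lies at or past the length.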

For memchr-evex the numbers are a near-universal improvement. The one
case where the current implementation is better is size = 0. For size =
[1, 32] with pos < size the two implementations are about the same. For
size = [1, 32] with pos > size, for medium-range sizes, and for large
sizes, however, the new implementation is faster.
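
One way to check a summary like this against the raw rows is to tally
winners separately for pos < size and pos >= size. The Python sketch
below does that; the helper names `parse_row` and `tally` are
hypothetical, the four sample rows are values taken from the Tigerlake
memchr-evex results, and the parser splits on commas and whitespace so
it does not depend on the exact column separators.

```python
import re
from collections import Counter


def parse_row(line):
    """Split one result row into (size, align, pos, cur_t, new_t)."""
    tok = re.findall(r"[^\s,]+", line)
    size, align, pos = (int(x) for x in tok[:3])
    return size, align, pos, float(tok[3]), float(tok[4])


def tally(lines):
    """Count winners separately for pos < size and pos >= size."""
    counts = {"pos < size": Counter(), "pos >= size": Counter()}
    for line in lines:
        size, _align, pos, cur_t, new_t = parse_row(line)
        key = "pos < size" if pos < size else "pos >= size"
        if new_t < cur_t:
            winner = "New"
        elif cur_t < new_t:
            winner = "Cur"
        else:
            winner = "Eq"
        counts[key][winner] += 1
    return counts


# Sample rows: size, align, pos, Cur T, New T, Win, Dif.
ROWS = [
    "2048, 0, 32, 5.58, 5.22, New, 0.36",
    "2, 0, 1, 3.0, 3.0, Eq, 0.0",
    "0, 0, 1, 3.01, 3.6, Cur, 0.59",
    "1, 0, 2, 3.6, 3.0, New, 0.6",
]
```

Calling `tally(ROWS)` groups the size = 0 row with the pos >= size
cases, so that row would need to be split out separately to mirror the
wording above.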

Results For Tigerlake memchr-evex
size  , algn  , Pos   , Cur T , New T , Win   , Dif
2048  , 0     , 32    , 5.58  , 5.22  , New   , 0.36
256   , 1     , 64    , 5.22  , 4.93  , New   , 0.29
2048  , 0     , 64    , 5.22  , 4.89  , New   , 0.33
256   , 2     , 64    , 5.14  , 4.81  , New   , 0.33
2048  , 0     , 128   , 6.3   , 5.67  , New   , 0.63
256   , 3     , 64    , 5.22  , 4.9   , New   , 0.32
2048  , 0     , 256   , 11.07 , 10.92 , New   , 0.15
256   , 4     , 64    , 5.16  , 4.86  , New   , 0.3
2048  , 0     , 512   , 15.66 , 14.81 , New   , 0.85
256   , 5     , 64    , 5.15  , 4.84  , New   , 0.31
2048  , 0     , 1024  , 25.7  , 23.02 , New   , 2.68
256   , 6     , 64    , 5.12  , 4.89  , New   , 0.23
2048  , 0     , 2048  , 42.34 , 37.71 , New   , 4.63
256   , 7     , 64    , 5.03  , 4.62  , New   , 0.41
192   , 1     , 32    , 4.96  , 4.28  , New   , 0.68
256   , 1     , 32    , 4.95  , 4.28  , New   , 0.67
512   , 1     , 32    , 4.94  , 4.29  , New   , 0.65
192   , 2     , 64    , 5.1   , 4.8   , New   , 0.3
512   , 2     , 64    , 5.12  , 4.72  , New   , 0.4
192   , 3     , 96    , 5.54  , 5.12  , New   , 0.42
256   , 3     , 96    , 5.52  , 5.15  , New   , 0.37
512   , 3     , 96    , 5.51  , 5.16  , New   , 0.35
192   , 4     , 128   , 6.1   , 5.53  , New   , 0.57
256   , 4     , 128   , 6.09  , 5.49  , New   , 0.6
512   , 4     , 128   , 6.08  , 5.48  , New   , 0.6
192   , 5     , 160   , 7.42  , 6.71  , New   , 0.71
256   , 5     , 160   , 6.86  , 6.71  , New   , 0.15
512   , 5     , 160   , 9.28  , 8.68  , New   , 0.6
192   , 6     , 192   , 7.94  , 7.47  , New   , 0.47
256   , 6     , 192   , 7.62  , 7.17  , New   , 0.45
512   , 6     , 192   , 9.2   , 9.16  , New   , 0.04
192   , 7     , 224   , 8.02  , 7.43  , New   , 0.59
256   , 7     , 224   , 8.34  , 7.85  , New   , 0.49
512   , 7     , 224   , 9.89  , 9.16  , New   , 0.73
2     , 0     , 1     , 3.0   , 3.0   , Eq    , 0.0
2     , 1     , 1     , 3.0   , 3.0   , Eq    , 0.0
0     , 0     , 1     , 3.01  , 3.6   , Cur   , 0.59
0     , 1     , 1     , 3.01  , 3.6   , Cur   , 0.59
3     , 0     , 2     , 3.0   , 3.0   , Eq    , 0.0
3     , 2     , 2     , 3.0   , 3.0   , Eq    , 0.0
1     , 0     , 2     , 3.6   , 3.0   , New   , 0.6
1     , 2     , 2     , 3.6   , 3.0   , New   , 0.6
4     , 0     , 3     , 3.01  , 3.01  , Eq    , 0.0
4     , 3     , 3     , 3.01  , 3.01  , Eq    , 0.0
2     , 0     , 3     , 3.62  , 3.02  , New   , 0.6
2     , 3     , 3     , 3.62  , 3.03  , New   , 0.59
5     , 0     , 4     , 3.02  , 3.03  , Cur   , 0.01
5     , 4     , 4     , 3.02  , 3.02  , Eq    , 0.0
3     , 0     , 4     , 3.63  , 3.02  , New   , 0.61
3     , 4     , 4     , 3.63  , 3.04  , New   , 0.59
6     , 0     , 5     , 3.05  , 3.04  , New   , 0.01
6     , 5     , 5     , 3.02  , 3.02  , Eq    , 0.0
4     , 0     , 5     , 3.63  , 3.02  , New   , 0.61
4     , 5     , 5     , 3.64  , 3.03  , New   , 0.61
7     , 0     , 6     , 3.03  , 3.03  , Eq    , 0.0
7     , 6     , 6     , 3.02  , 3.02  , Eq    , 0.0
5     , 0     , 6     , 3.64  , 3.01  , New   , 0.63
5     , 6     , 6     , 3.64  , 3.03  , New   , 0.61
8     , 0     , 7     , 3.03  , 3.04  , Cur   , 0.01
8     , 7     , 7     , 3.04  , 3.04  , Eq    , 0.0
6     , 0     , 7     , 3.67  , 3.04  , New   , 0.63
6     , 7     , 7     , 3.65  , 3.05  , New   , 0.6
9     , 0     , 8     , 3.05  , 3.05  , Eq    , 0.0
7     , 0     , 8     , 3.67  , 3.05  , New   , 0.62
10    , 0     , 9     , 3.06  , 3.06  , Eq    , 0.0
10    , 1     , 9     , 3.06  , 3.06  , Eq    , 0.0
8     , 0     , 9     , 3.67  , 3.06  , New   , 0.61
8     , 1     , 9     , 3.67  , 3.06  , New   , 0.61
11    , 0     , 10    , 3.06  , 3.06  , Eq    , 0.0
11    , 2     , 10    , 3.07  , 3.06  , New   , 0.01
9     , 0     , 10    , 3.67  , 3.05  , New   , 0.62
9     , 2     , 10    , 3.67  , 3.06  , New   , 0.61
12    , 0     , 11    , 3.06  , 3.06  , Eq    , 0.0
12    , 3     , 11    , 3.06  , 3.06  , Eq    , 0.0
10    , 0     , 11    , 3.67  , 3.06  , New   , 0.61
10    , 3     , 11    , 3.67  , 3.06  , New   , 0.61
13    , 0     , 12    , 3.06  , 3.07  , Cur   , 0.01
13    , 4     , 12    , 3.06  , 3.07  , Cur   , 0.01
11    , 0     , 12    , 3.67  , 3.11  , New   , 0.56
11    , 4     , 12    , 3.68  , 3.12  , New   , 0.56
14    , 0     , 13    , 3.07  , 3.1   , Cur   , 0.03
14    , 5     , 13    , 3.06  , 3.07  , Cur   , 0.01
12    , 0     , 13    , 3.67  , 3.07  , New   , 0.6
12    , 5     , 13    , 3.67  , 3.08  , New   , 0.59
15    , 0     , 14    , 3.06  , 3.06  , Eq    , 0.0
15    , 6     , 14    , 3.07  , 3.06  , New   , 0.01
13    , 0     , 14    , 3.67  , 3.06  , New   , 0.61
13    , 6     , 14    , 3.68  , 3.06  , New   , 0.62
16    , 0     , 15    , 3.06  , 3.06  , Eq    , 0.0
16    , 7     , 15    , 3.06  , 3.05  , New   , 0.01
14    , 0     , 15    , 3.68  , 3.06  , New   , 0.62
14    , 7     , 15    , 3.67  , 3.06  , New   , 0.61
17    , 0     , 16    , 3.07  , 3.06  , New   , 0.01
15    , 0     , 16    , 3.68  , 3.06  , New   , 0.62
18    , 0     , 17    , 3.06  , 3.06  , Eq    , 0.0
18    , 1     , 17    , 3.06  , 3.06  , Eq    , 0.0
16    , 0     , 17    , 3.67  , 3.06  , New   , 0.61
16    , 1     , 17    , 3.67  , 3.05  , New   , 0.62
19    , 0     , 18    , 3.07  , 3.06  , New   , 0.01
19    , 2     , 18    , 3.06  , 3.06  , Eq    , 0.0
17    , 0     , 18    , 3.68  , 3.08  , New   , 0.6
17    , 2     , 18    , 3.68  , 3.06  , New   , 0.62
20    , 0     , 19    , 3.06  , 3.06  , Eq    , 0.0
20    , 3     , 19    , 3.06  , 3.06  , Eq    , 0.0
18    , 0     , 19    , 3.68  , 3.06  , New   , 0.62
18    , 3     , 19    , 3.68  , 3.06  , New   , 0.62
21    , 0     , 20    , 3.06  , 3.06  , Eq    , 0.0
21    , 4     , 20    , 3.06  , 3.06  , Eq    , 0.0
19    , 0     , 20    , 3.67  , 3.06  , New   , 0.61
19    , 4     , 20    , 3.67  , 3.06  , New   , 0.61
22    , 0     , 21    , 3.06  , 3.06  , Eq    , 0.0
22    , 5     , 21    , 3.06  , 3.06  , Eq    , 0.0
20    , 0     , 21    , 3.67  , 3.05  , New   , 0.62
20    , 5     , 21    , 3.68  , 3.06  , New   , 0.62
23    , 0     , 22    , 3.07  , 3.06  , New   , 0.01
23    , 6     , 22    , 3.06  , 3.06  , Eq    , 0.0
21    , 0     , 22    , 3.68  , 3.07  , New   , 0.61
21    , 6     , 22    , 3.67  , 3.06  , New   , 0.61
24    , 0     , 23    , 3.19  , 3.06  , New   , 0.13
24    , 7     , 23    , 3.08  , 3.06  , New   , 0.02
22    , 0     , 23    , 3.69  , 3.06  , New   , 0.63
22    , 7     , 23    , 3.68  , 3.06  , New   , 0.62
25    , 0     , 24    , 3.07  , 3.06  , New   , 0.01
23    , 0     , 24    , 3.68  , 3.06  , New   , 0.62
26    , 0     , 25    , 3.06  , 3.05  , New   , 0.01
26    , 1     , 25    , 3.07  , 3.06  , New   , 0.01
24    , 0     , 25    , 3.67  , 3.05  , New   , 0.62
24    , 1     , 25    , 3.68  , 3.06  , New   , 0.62
27    , 0     , 26    , 3.12  , 3.06  , New   , 0.06
27    , 2     , 26    , 3.08  , 3.06  , New   , 0.02
25    , 0     , 26    , 3.69  , 3.06  , New   , 0.63
25    , 2     , 26    , 3.67  , 3.06  , New   , 0.61
28    , 0     , 27    , 3.06  , 3.06  , Eq    , 0.0
28    , 3     , 27    , 3.06  , 3.06  , Eq    , 0.0
26    , 0     , 27    , 3.67  , 3.06  , New   , 0.61
26    , 3     , 27    , 3.67  , 3.06  , New   , 0.61
29    , 0     , 28    , 3.06  , 3.06  , Eq    , 0.0
29    , 4     , 28    , 3.06  , 3.06  , Eq    , 0.0
27    , 0     , 28    , 3.68  , 3.05  , New   , 0.63
27    , 4     , 28    , 3.67  , 3.06  , New   , 0.61
30    , 0     , 29    , 3.06  , 3.06  , Eq    , 0.0
30    , 5     , 29    , 3.06  , 3.06  , Eq    , 0.0
28    , 0     , 29    , 3.67  , 3.06  , New   , 0.61
28    , 5     , 29    , 3.68  , 3.06  , New   , 0.62
31    , 0     , 30    , 3.06  , 3.06  , Eq    , 0.0
31    , 6     , 30    , 3.06  , 3.06  , Eq    , 0.0
29    , 0     , 30    , 3.68  , 3.06  , New   , 0.62
29    , 6     , 30    , 3.7   , 3.06  , New   , 0.64
32    , 0     , 31    , 3.17  , 3.06  , New   , 0.11
32    , 7     , 31    , 3.12  , 3.06  , New   , 0.06
30    , 0     , 31    , 3.68  , 3.06  , New   , 0.62
30    , 7     , 31    , 3.68  , 3.06  , New   , 0.62

Results For Icelake memchr-evex
size  , algn  , Pos   , Cur T , New T , Win   , Dif
2048  , 0     , 32    , 4.94  , 4.26  , New   , 0.68
256   , 1     , 64    , 4.5   , 4.13  , New   , 0.37
2048  , 0     , 64    , 4.19  , 3.9   , New   , 0.29
256   , 2     , 64    , 4.19  , 3.87  , New   , 0.32
2048  , 0     , 128   , 4.96  , 4.53  , New   , 0.43
256   , 3     , 64    , 4.07  , 3.86  , New   , 0.21
2048  , 0     , 256   , 8.77  , 8.61  , New   , 0.16
256   , 4     , 64    , 4.08  , 3.87  , New   , 0.21
2048  , 0     , 512   , 12.22 , 11.67 , New   , 0.55
256   , 5     , 64    , 4.12  , 3.83  , New   , 0.29
2048  , 0     , 1024  , 20.06 , 18.09 , New   , 1.97
256   , 6     , 64    , 4.2   , 3.95  , New   , 0.25
2048  , 0     , 2048  , 33.83 , 30.62 , New   , 3.21
256   , 7     , 64    , 4.3   , 4.04  , New   , 0.26
192   , 1     , 32    , 4.2   , 3.71  , New   , 0.49
256   , 1     , 32    , 4.24  , 3.76  , New   , 0.48
512   , 1     , 32    , 4.29  , 3.74  , New   , 0.55
192   , 2     , 64    , 4.42  , 4.0   , New   , 0.42
512   , 2     , 64    , 4.17  , 3.83  , New   , 0.34
192   , 3     , 96    , 4.44  , 4.26  , New   , 0.18
256   , 3     , 96    , 4.45  , 4.14  , New   , 0.31
512   , 3     , 96    , 4.42  , 4.15  , New   , 0.27
192   , 4     , 128   , 4.93  , 4.45  , New   , 0.48
256   , 4     , 128   , 4.93  , 4.47  , New   , 0.46
512   , 4     , 128   , 4.95  , 4.47  , New   , 0.48
192   , 5     , 160   , 5.95  , 5.44  , New   , 0.51
256   , 5     , 160   , 5.59  , 5.47  , New   , 0.12
512   , 5     , 160   , 7.59  , 7.34  , New   , 0.25
192   , 6     , 192   , 6.53  , 6.08  , New   , 0.45
256   , 6     , 192   , 6.2   , 5.88  , New   , 0.32
512   , 6     , 192   , 7.53  , 7.62  , Cur   , 0.09
192   , 7     , 224   , 6.62  , 6.12  , New   , 0.5
256   , 7     , 224   , 6.79  , 6.51  , New   , 0.28
512   , 7     , 224   , 8.12  , 7.61  , New   , 0.51
2     , 0     , 1     , 2.5   , 2.54  , Cur   , 0.04
2     , 1     , 1     , 2.56  , 2.55  , New   , 0.01
0     , 0     , 1     , 2.57  , 3.12  , Cur   , 0.55
0     , 1     , 1     , 2.59  , 3.14  , Cur   , 0.55
3     , 0     , 2     , 2.62  , 2.63  , Cur   , 0.01
3     , 2     , 2     , 2.66  , 2.67  , Cur   , 0.01
1     , 0     , 2     , 3.24  , 2.72  , New   , 0.52
1     , 2     , 2     , 3.28  , 2.75  , New   , 0.53
4     , 0     , 3     , 2.78  , 2.8   , Cur   , 0.02
4     , 3     , 3     , 2.8   , 2.82  , Cur   , 0.02
2     , 0     , 3     , 3.38  , 2.86  , New   , 0.52
2     , 3     , 3     , 3.41  , 2.89  , New   , 0.52
5     , 0     , 4     , 2.88  , 2.91  , Cur   , 0.03
5     , 4     , 4     , 2.88  , 2.92  , Cur   , 0.04
3     , 0     , 4     , 3.48  , 2.93  , New   , 0.55
3     , 4     , 4     , 3.47  , 2.93  , New   , 0.54
6     , 0     , 5     , 2.95  , 2.94  , New   , 0.01
6     , 5     , 5     , 2.91  , 2.92  , Cur   , 0.01
4     , 0     , 5     , 3.47  , 2.9   , New   , 0.57
4     , 5     , 5     , 3.43  , 2.91  , New   , 0.52
7     , 0     , 6     , 2.87  , 2.9   , Cur   , 0.03
7     , 6     , 6     , 2.87  , 2.89  , Cur   , 0.02
5     , 0     , 6     , 3.44  , 2.88  , New   , 0.56
5     , 6     , 6     , 3.41  , 2.87  , New   , 0.54
8     , 0     , 7     , 2.86  , 2.87  , Cur   , 0.01
8     , 7     , 7     , 2.86  , 2.87  , Cur   , 0.01
6     , 0     , 7     , 3.43  , 2.87  , New   , 0.56
6     , 7     , 7     , 3.44  , 2.87  , New   , 0.57
9     , 0     , 8     , 2.86  , 2.88  , Cur   , 0.02
7     , 0     , 8     , 3.41  , 2.89  , New   , 0.52
10    , 0     , 9     , 2.83  , 2.87  , Cur   , 0.04
10    , 1     , 9     , 2.82  , 2.87  , Cur   , 0.05
8     , 0     , 9     , 3.4   , 2.89  , New   , 0.51
8     , 1     , 9     , 3.41  , 2.87  , New   , 0.54
11    , 0     , 10    , 2.83  , 2.88  , Cur   , 0.05
11    , 2     , 10    , 2.84  , 2.88  , Cur   , 0.04
9     , 0     , 10    , 3.41  , 2.87  , New   , 0.54
9     , 2     , 10    , 3.41  , 2.88  , New   , 0.53
12    , 0     , 11    , 2.83  , 2.89  , Cur   , 0.06
12    , 3     , 11    , 2.85  , 2.87  , Cur   , 0.02
10    , 0     , 11    , 3.41  , 2.87  , New   , 0.54
10    , 3     , 11    , 3.42  , 2.88  , New   , 0.54
13    , 0     , 12    , 2.86  , 2.87  , Cur   , 0.01
13    , 4     , 12    , 2.84  , 2.88  , Cur   , 0.04
11    , 0     , 12    , 3.43  , 2.87  , New   , 0.56
11    , 4     , 12    , 3.49  , 2.87  , New   , 0.62
14    , 0     , 13    , 2.85  , 2.86  , Cur   , 0.01
14    , 5     , 13    , 2.85  , 2.86  , Cur   , 0.01
12    , 0     , 13    , 3.41  , 2.86  , New   , 0.55
12    , 5     , 13    , 3.44  , 2.85  , New   , 0.59
15    , 0     , 14    , 2.83  , 2.87  , Cur   , 0.04
15    , 6     , 14    , 2.82  , 2.86  , Cur   , 0.04
13    , 0     , 14    , 3.41  , 2.86  , New   , 0.55
13    , 6     , 14    , 3.4   , 2.86  , New   , 0.54
16    , 0     , 15    , 2.84  , 2.86  , Cur   , 0.02
16    , 7     , 15    , 2.83  , 2.85  , Cur   , 0.02
14    , 0     , 15    , 3.41  , 2.85  , New   , 0.56
14    , 7     , 15    , 3.39  , 2.87  , New   , 0.52
17    , 0     , 16    , 2.83  , 2.87  , Cur   , 0.04
15    , 0     , 16    , 3.4   , 2.85  , New   , 0.55
18    , 0     , 17    , 2.83  , 2.86  , Cur   , 0.03
18    , 1     , 17    , 2.85  , 2.84  , New   , 0.01
16    , 0     , 17    , 3.41  , 2.85  , New   , 0.56
16    , 1     , 17    , 3.4   , 2.86  , New   , 0.54
19    , 0     , 18    , 2.8   , 2.84  , Cur   , 0.04
19    , 2     , 18    , 2.82  , 2.83  , Cur   , 0.01
17    , 0     , 18    , 3.39  , 2.86  , New   , 0.53
17    , 2     , 18    , 3.39  , 2.84  , New   , 0.55
20    , 0     , 19    , 2.85  , 2.87  , Cur   , 0.02
20    , 3     , 19    , 2.88  , 2.87  , New   , 0.01
18    , 0     , 19    , 3.38  , 2.85  , New   , 0.53
18    , 3     , 19    , 3.4   , 2.85  , New   , 0.55
21    , 0     , 20    , 2.83  , 2.85  , Cur   , 0.02
21    , 4     , 20    , 2.88  , 2.85  , New   , 0.03
19    , 0     , 20    , 3.39  , 2.84  , New   , 0.55
19    , 4     , 20    , 3.39  , 2.96  , New   , 0.43
22    , 0     , 21    , 2.84  , 2.9   , Cur   , 0.06
22    , 5     , 21    , 2.81  , 2.84  , Cur   , 0.03
20    , 0     , 21    , 3.41  , 2.81  , New   , 0.6
20    , 5     , 21    , 3.38  , 2.83  , New   , 0.55
23    , 0     , 22    , 2.8   , 2.82  , Cur   , 0.02
23    , 6     , 22    , 2.81  , 2.83  , Cur   , 0.02
21    , 0     , 22    , 3.35  , 2.81  , New   , 0.54
21    , 6     , 22    , 3.34  , 2.81  , New   , 0.53
24    , 0     , 23    , 2.77  , 2.84  , Cur   , 0.07
24    , 7     , 23    , 2.78  , 2.8   , Cur   , 0.02
22    , 0     , 23    , 3.34  , 2.79  , New   , 0.55
22    , 7     , 23    , 3.32  , 2.79  , New   , 0.53
25    , 0     , 24    , 2.77  , 2.8   , Cur   , 0.03
23    , 0     , 24    , 3.29  , 2.79  , New   , 0.5
26    , 0     , 25    , 2.73  , 2.78  , Cur   , 0.05
26    , 1     , 25    , 2.75  , 2.79  , Cur   , 0.04
24    , 0     , 25    , 3.27  , 2.79  , New   , 0.48
24    , 1     , 25    , 3.27  , 2.77  , New   , 0.5
27    , 0     , 26    , 2.72  , 2.78  , Cur   , 0.06
27    , 2     , 26    , 2.75  , 2.76  , Cur   , 0.01
25    , 0     , 26    , 3.29  , 2.73  , New   , 0.56
25    , 2     , 26    , 3.3   , 2.76  , New   , 0.54
28    , 0     , 27    , 2.75  , 2.79  , Cur   , 0.04
28    , 3     , 27    , 2.77  , 2.77  , Eq    , 0.0
26    , 0     , 27    , 3.28  , 2.78  , New   , 0.5
26    , 3     , 27    , 3.29  , 2.78  , New   , 0.51
29    , 0     , 28    , 2.74  , 2.76  , Cur   , 0.02
29    , 4     , 28    , 2.74  , 2.77  , Cur   , 0.03
27    , 0     , 28    , 3.3   , 2.76  , New   , 0.54
27    , 4     , 28    , 3.3   , 2.74  , New   , 0.56
30    , 0     , 29    , 2.72  , 2.76  , Cur   , 0.04
30    , 5     , 29    , 2.74  , 2.75  , Cur   , 0.01
28    , 0     , 29    , 3.25  , 2.73  , New   , 0.52
28    , 5     , 29    , 3.3   , 2.73  , New   , 0.57
31    , 0     , 30    , 2.73  , 2.77  , Cur   , 0.04
31    , 6     , 30    , 2.74  , 2.76  , Cur   , 0.02
29    , 0     , 30    , 3.25  , 2.73  , New   , 0.52
29    , 6     , 30    , 3.26  , 2.74  , New   , 0.52
32    , 0     , 31    , 2.73  , 2.74  , Cur   , 0.01
32    , 7     , 31    , 2.73  , 2.75  , Cur   , 0.02
30    , 0     , 31    , 3.24  , 2.72  , New   , 0.52
30    , 7     , 31    , 3.24  , 2.72  , New   , 0.52

For memchr-avx2 the improvements are more modest, though again near
universal. The improvement is most significant for medium sizes and
for small sizes with pos > size. For small sizes with pos < size and
for large sizes, the two implementations perform roughly the same.

Results For Tigerlake memchr-avx2
size  , algn  , Pos   , Cur T , New T , Win   , Dif
2048  , 0     , 32    , 6.15  , 6.27  , Cur   , 0.12
256   , 1     , 64    , 6.21  , 6.03  , New   , 0.18
2048  , 0     , 64    , 6.07  , 5.95  , New   , 0.12
256   , 2     , 64    , 6.01  , 5.8   , New   , 0.21
2048  , 0     , 128   , 7.05  , 6.55  , New   , 0.5
256   , 3     , 64    , 6.14  , 5.83  , New   , 0.31
2048  , 0     , 256   , 11.78 , 11.78 , Eq    , 0.0
256   , 4     , 64    , 6.1   , 5.85  , New   , 0.25
2048  , 0     , 512   , 16.32 , 15.96 , New   , 0.36
256   , 5     , 64    , 6.1   , 5.77  , New   , 0.33
2048  , 0     , 1024  , 25.38 , 25.18 , New   , 0.2
256   , 6     , 64    , 6.08  , 5.88  , New   , 0.2
2048  , 0     , 2048  , 38.56 , 38.32 , New   , 0.24
256   , 7     , 64    , 5.93  , 5.68  , New   , 0.25
192   , 1     , 32    , 5.49  , 5.3   , New   , 0.19
256   , 1     , 32    , 5.5   , 5.28  , New   , 0.22
512   , 1     , 32    , 5.48  , 5.32  , New   , 0.16
192   , 2     , 64    , 6.1   , 5.73  , New   , 0.37
512   , 2     , 64    , 5.88  , 5.72  , New   , 0.16
192   , 3     , 96    , 6.31  , 5.93  , New   , 0.38
256   , 3     , 96    , 6.32  , 5.93  , New   , 0.39
512   , 3     , 96    , 6.2   , 5.94  , New   , 0.26
192   , 4     , 128   , 6.65  , 6.4   , New   , 0.25
256   , 4     , 128   , 6.6   , 6.37  , New   , 0.23
512   , 4     , 128   , 6.74  , 6.33  , New   , 0.41
192   , 5     , 160   , 7.78  , 7.4   , New   , 0.38
256   , 5     , 160   , 7.18  , 7.4   , Cur   , 0.22
512   , 5     , 160   , 9.81  , 9.44  , New   , 0.37
192   , 6     , 192   , 9.12  , 7.77  , New   , 1.35
256   , 6     , 192   , 7.97  , 7.66  , New   , 0.31
512   , 6     , 192   , 10.14 , 9.95  , New   , 0.19
192   , 7     , 224   , 8.96  , 7.78  , New   , 1.18
256   , 7     , 224   , 8.52  , 8.23  , New   , 0.29
512   , 7     , 224   , 10.33 , 9.98  , New   , 0.35
2     , 0     , 1     , 3.61  , 3.6   , New   , 0.01
2     , 1     , 1     , 3.6   , 3.6   , Eq    , 0.0
0     , 0     , 1     , 3.02  , 3.0   , New   , 0.02
0     , 1     , 1     , 3.0   , 3.0   , Eq    , 0.0
3     , 0     , 2     , 3.6   , 3.6   , Eq    , 0.0
3     , 2     , 2     , 3.61  , 3.6   , New   , 0.01
1     , 0     , 2     , 4.82  , 3.6   , New   , 1.22
1     , 2     , 2     , 4.81  , 3.6   , New   , 1.21
4     , 0     , 3     , 3.61  , 3.61  , Eq    , 0.0
4     , 3     , 3     , 3.62  , 3.61  , New   , 0.01
2     , 0     , 3     , 4.82  , 3.62  , New   , 1.2
2     , 3     , 3     , 4.83  , 3.63  , New   , 1.2
5     , 0     , 4     , 3.63  , 3.64  , Cur   , 0.01
5     , 4     , 4     , 3.63  , 3.62  , New   , 0.01
3     , 0     , 4     , 4.84  , 3.62  , New   , 1.22
3     , 4     , 4     , 4.84  , 3.64  , New   , 1.2
6     , 0     , 5     , 3.66  , 3.64  , New   , 0.02
6     , 5     , 5     , 3.65  , 3.62  , New   , 0.03
4     , 0     , 5     , 4.83  , 3.63  , New   , 1.2
4     , 5     , 5     , 4.85  , 3.64  , New   , 1.21
7     , 0     , 6     , 3.76  , 3.79  , Cur   , 0.03
7     , 6     , 6     , 3.76  , 3.72  , New   , 0.04
5     , 0     , 6     , 4.84  , 3.62  , New   , 1.22
5     , 6     , 6     , 4.85  , 3.64  , New   , 1.21
8     , 0     , 7     , 3.64  , 3.65  , Cur   , 0.01
8     , 7     , 7     , 3.65  , 3.65  , Eq    , 0.0
6     , 0     , 7     , 4.88  , 3.64  , New   , 1.24
6     , 7     , 7     , 4.87  , 3.65  , New   , 1.22
9     , 0     , 8     , 3.66  , 3.66  , Eq    , 0.0
7     , 0     , 8     , 4.89  , 3.66  , New   , 1.23
10    , 0     , 9     , 3.67  , 3.67  , Eq    , 0.0
10    , 1     , 9     , 3.67  , 3.67  , Eq    , 0.0
8     , 0     , 9     , 4.9   , 3.67  , New   , 1.23
8     , 1     , 9     , 4.9   , 3.67  , New   , 1.23
11    , 0     , 10    , 3.68  , 3.67  , New   , 0.01
11    , 2     , 10    , 3.69  , 3.67  , New   , 0.02
9     , 0     , 10    , 4.9   , 3.67  , New   , 1.23
9     , 2     , 10    , 4.9   , 3.67  , New   , 1.23
12    , 0     , 11    , 3.71  , 3.68  , New   , 0.03
12    , 3     , 11    , 3.71  , 3.67  , New   , 0.04
10    , 0     , 11    , 4.9   , 3.67  , New   , 1.23
10    , 3     , 11    , 4.9   , 3.67  , New   , 1.23
13    , 0     , 12    , 4.24  , 4.23  , New   , 0.01
13    , 4     , 12    , 4.23  , 4.23  , Eq    , 0.0
11    , 0     , 12    , 4.9   , 3.7   , New   , 1.2
11    , 4     , 12    , 4.9   , 3.73  , New   , 1.17
14    , 0     , 13    , 3.99  , 4.01  , Cur   , 0.02
14    , 5     , 13    , 3.98  , 3.98  , Eq    , 0.0
12    , 0     , 13    , 4.9   , 3.69  , New   , 1.21
12    , 5     , 13    , 4.9   , 3.69  , New   , 1.21
15    , 0     , 14    , 3.99  , 3.97  , New   , 0.02
15    , 6     , 14    , 4.0   , 4.0   , Eq    , 0.0
13    , 0     , 14    , 4.9   , 3.67  , New   , 1.23
13    , 6     , 14    , 4.9   , 3.67  , New   , 1.23
16    , 0     , 15    , 3.99  , 4.02  , Cur   , 0.03
16    , 7     , 15    , 4.01  , 3.96  , New   , 0.05
14    , 0     , 15    , 4.93  , 3.67  , New   , 1.26
14    , 7     , 15    , 4.92  , 3.67  , New   , 1.25
17    , 0     , 16    , 4.04  , 3.99  , New   , 0.05
15    , 0     , 16    , 5.42  , 4.22  , New   , 1.2
18    , 0     , 17    , 4.01  , 3.97  , New   , 0.04
18    , 1     , 17    , 3.99  , 3.98  , New   , 0.01
16    , 0     , 17    , 5.22  , 3.98  , New   , 1.24
16    , 1     , 17    , 5.19  , 3.98  , New   , 1.21
19    , 0     , 18    , 4.0   , 3.99  , New   , 0.01
19    , 2     , 18    , 4.03  , 3.97  , New   , 0.06
17    , 0     , 18    , 5.18  , 3.99  , New   , 1.19
17    , 2     , 18    , 5.18  , 3.98  , New   , 1.2
20    , 0     , 19    , 4.02  , 3.98  , New   , 0.04
20    , 3     , 19    , 4.0   , 3.98  , New   , 0.02
18    , 0     , 19    , 5.19  , 3.97  , New   , 1.22
18    , 3     , 19    , 5.21  , 3.98  , New   , 1.23
21    , 0     , 20    , 3.98  , 4.0   , Cur   , 0.02
21    , 4     , 20    , 4.0   , 4.0   , Eq    , 0.0
19    , 0     , 20    , 5.19  , 3.99  , New   , 1.2
19    , 4     , 20    , 5.17  , 3.99  , New   , 1.18
22    , 0     , 21    , 4.03  , 3.98  , New   , 0.05
22    , 5     , 21    , 4.01  , 3.95  , New   , 0.06
20    , 0     , 21    , 5.19  , 4.0   , New   , 1.19
20    , 5     , 21    , 5.21  , 3.99  , New   , 1.22
23    , 0     , 22    , 4.06  , 3.97  , New   , 0.09
23    , 6     , 22    , 4.02  , 3.98  , New   , 0.04
21    , 0     , 22    , 5.2   , 4.02  , New   , 1.18
21    , 6     , 22    , 5.22  , 4.0   , New   , 1.22
24    , 0     , 23    , 4.15  , 3.98  , New   , 0.17
24    , 7     , 23    , 4.0   , 4.01  , Cur   , 0.01
22    , 0     , 23    , 5.28  , 4.0   , New   , 1.28
22    , 7     , 23    , 5.22  , 3.99  , New   , 1.23
25    , 0     , 24    , 4.1   , 4.04  , New   , 0.06
23    , 0     , 24    , 5.23  , 4.04  , New   , 1.19
26    , 0     , 25    , 4.1   , 4.06  , New   , 0.04
26    , 1     , 25    , 4.07  , 3.99  , New   , 0.08
24    , 0     , 25    , 5.26  , 4.02  , New   , 1.24
24    , 1     , 25    , 5.21  , 4.0   , New   , 1.21
27    , 0     , 26    , 4.17  , 4.03  , New   , 0.14
27    , 2     , 26    , 4.09  , 4.03  , New   , 0.06
25    , 0     , 26    , 5.29  , 4.1   , New   , 1.19
25    , 2     , 26    , 5.25  , 4.0   , New   , 1.25
28    , 0     , 27    , 4.06  , 4.1   , Cur   , 0.04
28    , 3     , 27    , 4.09  , 4.04  , New   , 0.05
26    , 0     , 27    , 5.26  , 4.04  , New   , 1.22
26    , 3     , 27    , 5.28  , 4.01  , New   , 1.27
29    , 0     , 28    , 4.07  , 4.02  , New   , 0.05
29    , 4     , 28    , 4.07  , 4.05  , New   , 0.02
27    , 0     , 28    , 5.25  , 4.02  , New   , 1.23
27    , 4     , 28    , 5.25  , 4.03  , New   , 1.22
30    , 0     , 29    , 4.14  , 4.06  , New   , 0.08
30    , 5     , 29    , 4.08  , 4.04  , New   , 0.04
28    , 0     , 29    , 5.26  , 4.07  , New   , 1.19
28    , 5     , 29    , 5.28  , 4.04  , New   , 1.24
31    , 0     , 30    , 4.09  , 4.08  , New   , 0.01
31    , 6     , 30    , 4.1   , 4.08  , New   , 0.02
29    , 0     , 30    , 5.28  , 4.05  , New   , 1.23
29    , 6     , 30    , 5.24  , 4.07  , New   , 1.17
32    , 0     , 31    , 4.1   , 4.13  , Cur   , 0.03
32    , 7     , 31    , 4.16  , 4.09  , New   , 0.07
30    , 0     , 31    , 5.31  , 4.09  , New   , 1.22
30    , 7     , 31    , 5.28  , 4.08  , New   , 1.2

Results For Icelake memchr-avx2
size  , algn  , Pos   , Cur T , New T , Win   , Dif
2048  , 0     , 32    , 5.74  , 5.08  , New   , 0.66
256   , 1     , 64    , 5.16  , 4.93  , New   , 0.23
2048  , 0     , 64    , 4.86  , 4.69  , New   , 0.17
256   , 2     , 64    , 4.78  , 4.7   , New   , 0.08
2048  , 0     , 128   , 5.64  , 5.0   , New   , 0.64
256   , 3     , 64    , 4.64  , 4.59  , New   , 0.05
2048  , 0     , 256   , 9.07  , 9.17  , Cur   , 0.1
256   , 4     , 64    , 4.7   , 4.6   , New   , 0.1
2048  , 0     , 512   , 12.56 , 12.33 , New   , 0.23
256   , 5     , 64    , 4.72  , 4.61  , New   , 0.11
2048  , 0     , 1024  , 19.36 , 19.49 , Cur   , 0.13
256   , 6     , 64    , 4.82  , 4.69  , New   , 0.13
2048  , 0     , 2048  , 29.99 , 30.53 , Cur   , 0.54
256   , 7     , 64    , 4.9   , 4.85  , New   , 0.05
192   , 1     , 32    , 4.89  , 4.45  , New   , 0.44
256   , 1     , 32    , 4.93  , 4.44  , New   , 0.49
512   , 1     , 32    , 4.97  , 4.45  , New   , 0.52
192   , 2     , 64    , 5.04  , 4.65  , New   , 0.39
512   , 2     , 64    , 4.75  , 4.66  , New   , 0.09
192   , 3     , 96    , 5.14  , 4.66  , New   , 0.48
256   , 3     , 96    , 5.12  , 4.66  , New   , 0.46
512   , 3     , 96    , 5.13  , 4.62  , New   , 0.51
192   , 4     , 128   , 5.65  , 4.95  , New   , 0.7
256   , 4     , 128   , 5.63  , 4.95  , New   , 0.68
512   , 4     , 128   , 5.68  , 4.96  , New   , 0.72
192   , 5     , 160   , 6.1   , 5.84  , New   , 0.26
256   , 5     , 160   , 5.58  , 5.84  , Cur   , 0.26
512   , 5     , 160   , 7.95  , 7.74  , New   , 0.21
192   , 6     , 192   , 7.07  , 6.23  , New   , 0.84
256   , 6     , 192   , 6.34  , 6.09  , New   , 0.25
512   , 6     , 192   , 8.17  , 8.13  , New   , 0.04
192   , 7     , 224   , 7.06  , 6.23  , New   , 0.83
256   , 7     , 224   , 6.76  , 6.65  , New   , 0.11
512   , 7     , 224   , 8.29  , 8.08  , New   , 0.21
2     , 0     , 1     , 3.0   , 3.04  , Cur   , 0.04
2     , 1     , 1     , 3.06  , 3.07  , Cur   , 0.01
0     , 0     , 1     , 2.57  , 2.59  , Cur   , 0.02
0     , 1     , 1     , 2.6   , 2.61  , Cur   , 0.01
3     , 0     , 2     , 3.15  , 3.17  , Cur   , 0.02
3     , 2     , 2     , 3.19  , 3.21  , Cur   , 0.02
1     , 0     , 2     , 4.32  , 3.25  , New   , 1.07
1     , 2     , 2     , 4.36  , 3.31  , New   , 1.05
4     , 0     , 3     , 3.5   , 3.52  , Cur   , 0.02
4     , 3     , 3     , 3.52  , 3.54  , Cur   , 0.02
2     , 0     , 3     , 4.51  , 3.43  , New   , 1.08
2     , 3     , 3     , 4.56  , 3.47  , New   , 1.09
5     , 0     , 4     , 3.61  , 3.65  , Cur   , 0.04
5     , 4     , 4     , 3.63  , 3.67  , Cur   , 0.04
3     , 0     , 4     , 4.64  , 3.51  , New   , 1.13
3     , 4     , 4     , 4.7   , 3.51  , New   , 1.19
6     , 0     , 5     , 3.66  , 3.68  , Cur   , 0.02
6     , 5     , 5     , 3.69  , 3.65  , New   , 0.04
4     , 0     , 5     , 4.7   , 3.49  , New   , 1.21
4     , 5     , 5     , 4.58  , 3.48  , New   , 1.1
7     , 0     , 6     , 3.6   , 3.65  , Cur   , 0.05
7     , 6     , 6     , 3.59  , 3.64  , Cur   , 0.05
5     , 0     , 6     , 4.74  , 3.65  , New   , 1.09
5     , 6     , 6     , 4.73  , 3.64  , New   , 1.09
8     , 0     , 7     , 3.6   , 3.61  , Cur   , 0.01
8     , 7     , 7     , 3.6   , 3.61  , Cur   , 0.01
6     , 0     , 7     , 4.73  , 3.6   , New   , 1.13
6     , 7     , 7     , 4.73  , 3.62  , New   , 1.11
9     , 0     , 8     , 3.59  , 3.62  , Cur   , 0.03
7     , 0     , 8     , 4.72  , 3.64  , New   , 1.08
10    , 0     , 9     , 3.57  , 3.62  , Cur   , 0.05
10    , 1     , 9     , 3.56  , 3.61  , Cur   , 0.05
8     , 0     , 9     , 4.69  , 3.63  , New   , 1.06
8     , 1     , 9     , 4.71  , 3.61  , New   , 1.1
11    , 0     , 10    , 3.58  , 3.62  , Cur   , 0.04
11    , 2     , 10    , 3.59  , 3.63  , Cur   , 0.04
9     , 0     , 10    , 4.72  , 3.61  , New   , 1.11
9     , 2     , 10    , 4.7   , 3.61  , New   , 1.09
12    , 0     , 11    , 3.58  , 3.63  , Cur   , 0.05
12    , 3     , 11    , 3.58  , 3.62  , Cur   , 0.04
10    , 0     , 11    , 4.7   , 3.6   , New   , 1.1
10    , 3     , 11    , 4.73  , 3.64  , New   , 1.09
13    , 0     , 12    , 3.6   , 3.6   , Eq    , 0.0
13    , 4     , 12    , 3.57  , 3.62  , Cur   , 0.05
11    , 0     , 12    , 4.73  , 3.62  , New   , 1.11
11    , 4     , 12    , 4.79  , 3.61  , New   , 1.18
14    , 0     , 13    , 3.61  , 3.62  , Cur   , 0.01
14    , 5     , 13    , 3.59  , 3.59  , Eq    , 0.0
12    , 0     , 13    , 4.7   , 3.61  , New   , 1.09
12    , 5     , 13    , 4.75  , 3.58  , New   , 1.17
15    , 0     , 14    , 3.58  , 3.62  , Cur   , 0.04
15    , 6     , 14    , 3.59  , 3.62  , Cur   , 0.03
13    , 0     , 14    , 4.68  , 3.6   , New   , 1.08
13    , 6     , 14    , 4.68  , 3.63  , New   , 1.05
16    , 0     , 15    , 3.57  , 3.6   , Cur   , 0.03
16    , 7     , 15    , 3.55  , 3.59  , Cur   , 0.04
14    , 0     , 15    , 4.69  , 3.61  , New   , 1.08
14    , 7     , 15    , 4.69  , 3.61  , New   , 1.08
17    , 0     , 16    , 3.56  , 3.61  , Cur   , 0.05
15    , 0     , 16    , 4.71  , 3.58  , New   , 1.13
18    , 0     , 17    , 3.57  , 3.65  , Cur   , 0.08
18    , 1     , 17    , 3.58  , 3.59  , Cur   , 0.01
16    , 0     , 17    , 4.7   , 3.58  , New   , 1.12
16    , 1     , 17    , 4.68  , 3.59  , New   , 1.09
19    , 0     , 18    , 3.51  , 3.58  , Cur   , 0.07
19    , 2     , 18    , 3.55  , 3.58  , Cur   , 0.03
17    , 0     , 18    , 4.69  , 3.61  , New   , 1.08
17    , 2     , 18    , 4.68  , 3.61  , New   , 1.07
20    , 0     , 19    , 3.57  , 3.6   , Cur   , 0.03
20    , 3     , 19    , 3.59  , 3.59  , Eq    , 0.0
18    , 0     , 19    , 4.68  , 3.59  , New   , 1.09
18    , 3     , 19    , 4.67  , 3.57  , New   , 1.1
21    , 0     , 20    , 3.61  , 3.58  , New   , 0.03
21    , 4     , 20    , 3.62  , 3.6   , New   , 0.02
19    , 0     , 20    , 4.74  , 3.57  , New   , 1.17
19    , 4     , 20    , 4.69  , 3.7   , New   , 0.99
22    , 0     , 21    , 3.57  , 3.64  , Cur   , 0.07
22    , 5     , 21    , 3.55  , 3.6   , Cur   , 0.05
20    , 0     , 21    , 4.72  , 3.55  , New   , 1.17
20    , 5     , 21    , 4.66  , 3.55  , New   , 1.11
23    , 0     , 22    , 3.56  , 3.56  , Eq    , 0.0
23    , 6     , 22    , 3.54  , 3.56  , Cur   , 0.02
21    , 0     , , 22    4.65  , 3.53  , New   , 1.12  
21    , 6     , , 22    4.62  , 3.56  , New   , 1.06  
24    , 0     , , 23    3.5   , 3.54  , Cur   , 0.04  
24    , 7     , , 23    3.52  , 3.53  , Cur   , 0.01  
22    , 0     , , 23    4.61  , 3.51  , New   , 1.1   
22    , 7     , , 23    4.6   , 3.51  , New   , 1.09  
25    , 0     , , 24    3.5   , 3.53  , Cur   , 0.03  
23    , 0     , , 24    4.54  , 3.5   , New   , 1.04  
26    , 0     , , 25    3.47  , 3.49  , Cur   , 0.02  
26    , 1     , , 25    3.46  , 3.51  , Cur   , 0.05  
24    , 0     , , 25    4.53  , 3.51  , New   , 1.02  
24    , 1     , , 25    4.51  , 3.51  , New   , 1.0   
27    , 0     , , 26    3.44  , 3.51  , Cur   , 0.07  
27    , 2     , , 26    3.51  , 3.52  , Cur   , 0.01  
25    , 0     , , 26    4.56  , 3.46  , New   , 1.1   
25    , 2     , , 26    4.55  , 3.47  , New   , 1.08  
28    , 0     , , 27    3.47  , 3.5   , Cur   , 0.03  
28    , 3     , , 27    3.48  , 3.47  , New   , 0.01  
26    , 0     , , 27    4.52  , 3.44  , New   , 1.08  
26    , 3     , , 27    4.55  , 3.46  , New   , 1.09  
29    , 0     , , 28    3.45  , 3.49  , Cur   , 0.04  
29    , 4     , , 28    3.5   , 3.5   , Eq    , 0.0
27    , 0     , , 28    4.56  , 3.49  , New   , 1.07  
27    , 4     , , 28    4.5   , 3.49  , New   , 1.01  
30    , 0     , , 29    3.44  , 3.48  , Cur   , 0.04  
30    , 5     , , 29    3.46  , 3.47  , Cur   , 0.01  
28    , 0     , , 29    4.49  , 3.43  , New   , 1.06  
28    , 5     , , 29    4.57  , 3.45  , New   , 1.12  
31    , 0     , , 30    3.48  , 3.48  , Eq    , 0.0
31    , 6     , , 30    3.46  , 3.49  , Cur   , 0.03  
29    , 0     , , 30    4.49  , 3.44  , New   , 1.05  
29    , 6     , , 30    4.53  , 3.44  , New   , 1.09  
32    , 0     , , 31    3.44  , 3.45  , Cur   , 0.01  
32    , 7     , , 31    3.46  , 3.51  , Cur   , 0.05  
30    , 0     , , 31    4.48  , 3.42  , New   , 1.06  
30    , 7     , , 31    4.48  , 3.44  , New   , 1.04
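
For reading the tables: the Win and Dif columns appear to follow
directly from the two timing columns. A minimal C sketch, assuming
Dif is the absolute difference of the (rounded) times and Win names
the side with the lower time. summarize_row, struct row_summary and
enum winner are hypothetical names used only for illustration; they
are not part of the benchmark.

```c
#include <assert.h>

/* Hypothetical types, for illustration only.  */
enum winner { WIN_CUR, WIN_NEW, WIN_EQ };

struct row_summary {
    enum winner win;   /* which implementation had the lower time */
    double dif;        /* absolute difference of the two times */
};

/* Derive the Win and Dif columns from the Cur T and New T columns.  */
struct row_summary summarize_row(double cur_t, double new_t)
{
    struct row_summary s;

    assert(cur_t >= 0.0 && new_t >= 0.0);
    s.dif = cur_t > new_t ? cur_t - new_t : new_t - cur_t;
    if (cur_t < new_t)
        s.win = WIN_CUR;
    else if (new_t < cur_t)
        s.win = WIN_NEW;
    else
        s.win = WIN_EQ;
    return s;
}
```

For the row with Cur T = 4.51 and New T = 3.43 this gives Win = New
and Dif = 1.08 (up to floating-point rounding), matching the table.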


Results For Skylake memchr-avx2
size  , algn  , Pos   , Cur T , New T , Win   , Dif   
2048  , 0     , , 32    6.61  , 5.4   , New   , 1.21  
256   , 1     , , 64    6.52  , 5.68  , New   , 0.84  
2048  , 0     , , 64    6.03  , 5.47  , New   , 0.56  
256   , 2     , , 64    6.07  , 5.42  , New   , 0.65  
2048  , 0     , , 128   7.01  , 5.83  , New   , 1.18  
256   , 3     , , 64    6.24  , 5.68  , New   , 0.56  
2048  , 0     , , 256   11.03 , 9.86  , New   , 1.17  
256   , 4     , , 64    6.17  , 5.49  , New   , 0.68  
2048  , 0     , , 512   14.11 , 13.41 , New   , 0.7   
256   , 5     , , 64    6.03  , 5.45  , New   , 0.58  
2048  , 0     , , 1024  19.82 , 19.92 , Cur   , 0.1   
256   , 6     , , 64    6.14  , 5.7   , New   , 0.44  
2048  , 0     , , 2048  30.9  , 30.59 , New   , 0.31  
256   , 7     , , 64    6.05  , 5.64  , New   , 0.41  
192   , 1     , , 32    5.6   , 4.89  , New   , 0.71  
256   , 1     , , 32    5.59  , 5.07  , New   , 0.52  
512   , 1     , , 32    5.58  , 4.93  , New   , 0.65  
192   , 2     , , 64    6.14  , 5.46  , New   , 0.68  
512   , 2     , , 64    5.95  , 5.38  , New   , 0.57  
192   , 3     , , 96    6.6   , 5.74  , New   , 0.86  
256   , 3     , , 96    6.48  , 5.37  , New   , 1.11  
512   , 3     , , 96    6.56  , 5.44  , New   , 1.12  
192   , 4     , , 128   7.04  , 6.02  , New   , 1.02  
256   , 4     , , 128   6.96  , 5.89  , New   , 1.07  
512   , 4     , , 128   6.97  , 5.99  , New   , 0.98  
192   , 5     , , 160   8.49  , 7.07  , New   , 1.42  
256   , 5     , , 160   8.1   , 6.96  , New   , 1.14  
512   , 5     , , 160   10.48 , 9.14  , New   , 1.34  
192   , 6     , , 192   8.46  , 8.52  , Cur   , 0.06  
256   , 6     , , 192   8.53  , 7.58  , New   , 0.95  
512   , 6     , , 192   10.88 , 9.06  , New   , 1.82  
192   , 7     , , 224   8.59  , 8.35  , New   , 0.24  
256   , 7     , , 224   8.86  , 7.91  , New   , 0.95  
512   , 7     , , 224   10.89 , 8.98  , New   , 1.91  
2     , 0     , , 1     4.28  , 3.62  , New   , 0.66  
2     , 1     , , 1     4.32  , 3.75  , New   , 0.57  
0     , 0     , , 1     3.76  , 3.24  , New   , 0.52  
0     , 1     , , 1     3.7   , 3.19  , New   , 0.51  
3     , 0     , , 2     4.16  , 3.67  , New   , 0.49  
3     , 2     , , 2     4.21  , 3.68  , New   , 0.53  
1     , 0     , , 2     4.25  , 3.74  , New   , 0.51  
1     , 2     , , 2     4.4   , 3.82  , New   , 0.58  
4     , 0     , , 3     4.43  , 3.88  , New   , 0.55  
4     , 3     , , 3     4.34  , 3.8   , New   , 0.54  
2     , 0     , , 3     4.33  , 3.79  , New   , 0.54  
2     , 3     , , 3     4.37  , 3.84  , New   , 0.53  
5     , 0     , , 4     4.45  , 3.87  , New   , 0.58  
5     , 4     , , 4     4.41  , 3.84  , New   , 0.57  
3     , 0     , , 4     4.34  , 3.83  , New   , 0.51  
3     , 4     , , 4     4.35  , 3.82  , New   , 0.53  
6     , 0     , , 5     4.41  , 3.88  , New   , 0.53  
6     , 5     , , 5     4.41  , 3.88  , New   , 0.53  
4     , 0     , , 5     4.35  , 3.84  , New   , 0.51  
4     , 5     , , 5     4.37  , 3.85  , New   , 0.52  
7     , 0     , , 6     4.4   , 3.84  , New   , 0.56  
7     , 6     , , 6     4.39  , 3.83  , New   , 0.56  
5     , 0     , , 6     4.37  , 3.85  , New   , 0.52  
5     , 6     , , 6     4.4   , 3.86  , New   , 0.54  
8     , 0     , , 7     4.39  , 3.88  , New   , 0.51  
8     , 7     , , 7     4.4   , 3.83  , New   , 0.57  
6     , 0     , , 7     4.39  , 3.85  , New   , 0.54  
6     , 7     , , 7     4.38  , 3.87  , New   , 0.51  
9     , 0     , , 8     4.47  , 3.96  , New   , 0.51  
7     , 0     , , 8     4.37  , 3.85  , New   , 0.52  
10    , 0     , , 9     4.61  , 4.08  , New   , 0.53  
10    , 1     , , 9     4.61  , 4.09  , New   , 0.52  
8     , 0     , , 9     4.37  , 3.85  , New   , 0.52  
8     , 1     , , 9     4.37  , 3.85  , New   , 0.52  
11    , 0     , , 10    4.68  , 4.06  , New   , 0.62  
11    , 2     , , 10    4.56  , 4.1   , New   , 0.46  
9     , 0     , , 10    4.36  , 3.83  , New   , 0.53  
9     , 2     , , 10    4.37  , 3.83  , New   , 0.54  
12    , 0     , , 11    4.62  , 4.05  , New   , 0.57  
12    , 3     , , 11    4.63  , 4.06  , New   , 0.57  
10    , 0     , , 11    4.38  , 3.86  , New   , 0.52  
10    , 3     , , 11    4.41  , 3.86  , New   , 0.55  
13    , 0     , , 12    4.57  , 4.08  , New   , 0.49  
13    , 4     , , 12    4.59  , 4.12  , New   , 0.47  
11    , 0     , , 12    4.45  , 4.0   , New   , 0.45  
11    , 4     , , 12    4.51  , 4.04  , New   , 0.47  
14    , 0     , , 13    4.64  , 4.16  , New   , 0.48  
14    , 5     , , 13    4.67  , 4.1   , New   , 0.57  
12    , 0     , , 13    4.58  , 4.08  , New   , 0.5   
12    , 5     , , 13    4.6   , 4.1   , New   , 0.5   
15    , 0     , , 14    4.61  , 4.05  , New   , 0.56  
15    , 6     , , 14    4.59  , 4.06  , New   , 0.53  
13    , 0     , , 14    4.57  , 4.06  , New   , 0.51  
13    , 6     , , 14    4.57  , 4.05  , New   , 0.52  
16    , 0     , , 15    4.62  , 4.05  , New   , 0.57  
16    , 7     , , 15    4.63  , 4.06  , New   , 0.57  
14    , 0     , , 15    4.61  , 4.06  , New   , 0.55  
14    , 7     , , 15    4.59  , 4.05  , New   , 0.54  
17    , 0     , , 16    4.58  , 4.08  , New   , 0.5   
15    , 0     , , 16    4.64  , 4.06  , New   , 0.58  
18    , 0     , , 17    4.56  , 4.17  , New   , 0.39  
18    , 1     , , 17    4.59  , 4.09  , New   , 0.5   
16    , 0     , , 17    4.59  , 4.07  , New   , 0.52  
16    , 1     , , 17    4.58  , 4.04  , New   , 0.54  
19    , 0     , , 18    4.61  , 4.05  , New   , 0.56  
19    , 2     , , 18    4.6   , 4.08  , New   , 0.52  
17    , 0     , , 18    4.64  , 4.11  , New   , 0.53  
17    , 2     , , 18    4.56  , 4.13  , New   , 0.43  
20    , 0     , , 19    4.77  , 4.3   , New   , 0.47  
20    , 3     , , 19    4.6   , 4.14  , New   , 0.46  
18    , 0     , , 19    4.72  , 4.02  , New   , 0.7   
18    , 3     , , 19    4.53  , 4.01  , New   , 0.52  
21    , 0     , , 20    4.66  , 4.26  , New   , 0.4   
21    , 4     , , 20    4.74  , 4.07  , New   , 0.67  
19    , 0     , , 20    4.62  , 4.12  , New   , 0.5   
19    , 4     , , 20    4.57  , 4.04  , New   , 0.53  
22    , 0     , , 21    4.61  , 4.13  , New   , 0.48  
22    , 5     , , 21    4.64  , 4.08  , New   , 0.56  
20    , 0     , , 21    4.49  , 4.01  , New   , 0.48  
20    , 5     , , 21    4.58  , 4.06  , New   , 0.52  
23    , 0     , , 22    4.62  , 4.13  , New   , 0.49  
23    , 6     , , 22    4.72  , 4.27  , New   , 0.45  
21    , 0     , , 22    4.65  , 3.97  , New   , 0.68  
21    , 6     , , 22    4.5   , 4.02  , New   , 0.48  
24    , 0     , , 23    4.78  , 4.07  , New   , 0.71  
24    , 7     , , 23    4.67  , 4.23  , New   , 0.44  
22    , 0     , , 23    4.49  , 3.99  , New   , 0.5   
22    , 7     , , 23    4.56  , 4.03  , New   , 0.53  
25    , 0     , , 24    4.6   , 4.15  , New   , 0.45  
23    , 0     , , 24    4.57  , 4.06  , New   , 0.51  
26    , 0     , , 25    4.54  , 4.14  , New   , 0.4   
26    , 1     , , 25    4.72  , 4.1   , New   , 0.62  
24    , 0     , , 25    4.52  , 4.13  , New   , 0.39  
24    , 1     , , 25    4.55  , 4.0   , New   , 0.55  
27    , 0     , , 26    4.51  , 4.06  , New   , 0.45  
27    , 2     , , 26    4.53  , 4.16  , New   , 0.37  
25    , 0     , , 26    4.59  , 4.09  , New   , 0.5   
25    , 2     , , 26    4.55  , 4.01  , New   , 0.54  
28    , 0     , , 27    4.59  , 3.99  , New   , 0.6   
28    , 3     , , 27    4.57  , 3.95  , New   , 0.62  
26    , 0     , , 27    4.55  , 4.15  , New   , 0.4   
26    , 3     , , 27    4.57  , 3.99  , New   , 0.58  
29    , 0     , , 28    4.41  , 4.03  , New   , 0.38  
29    , 4     , , 28    4.59  , 4.02  , New   , 0.57  
27    , 0     , , 28    4.63  , 4.08  , New   , 0.55  
27    , 4     , , 28    4.44  , 4.02  , New   , 0.42  
30    , 0     , , 29    4.53  , 3.93  , New   , 0.6   
30    , 5     , , 29    4.55  , 3.88  , New   , 0.67  
28    , 0     , , 29    4.49  , 3.9   , New   , 0.59  
28    , 5     , , 29    4.44  , 3.94  , New   , 0.5   
31    , 0     , , 30    4.41  , 3.85  , New   , 0.56  
31    , 6     , , 30    4.48  , 3.86  , New   , 0.62  
29    , 0     , , 30    4.55  , 3.94  , New   , 0.61  
29    , 6     , , 30    4.32  , 3.95  , New   , 0.37  
32    , 0     , , 31    4.36  , 3.91  , New   , 0.45  
32    , 7     , , 31    4.37  , 3.89  , New   , 0.48  
30    , 0     , , 31    4.65  , 3.9   , New   , 0.75  
30    , 7     , , 31    4.42  , 3.93  , New   , 0.49  
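
On the relaxed page-cross check in the patch: the old entry test
sent any pointer in the upper half of a 2 * VEC_SIZE block to the
slow path, while the new one only does so when a VEC_SIZE load could
actually run past the end of the page. A small C model of the two
predicates, as a sketch only; old_may_cross and new_may_cross are
hypothetical names and the constants mirror the patch's VEC_SIZE and
PAGE_SIZE defines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VEC_SIZE  32
#define PAGE_SIZE 4096

/* Old test: andl $(2 * VEC_SIZE - 1), %ecx; cmpl $VEC_SIZE, %ecx; ja  */
bool old_may_cross(uintptr_t p)
{
    return (p & (2 * VEC_SIZE - 1)) > VEC_SIZE;
}

/* New test: andl $(PAGE_SIZE - 1), %eax;
   cmpl $(PAGE_SIZE - VEC_SIZE), %eax; ja  */
bool new_may_cross(uintptr_t p)
{
    return (p & (PAGE_SIZE - 1)) > PAGE_SIZE - VEC_SIZE;
}

/* Count how many starting offsets within one page take the slow path.  */
unsigned count_slow_offsets(bool (*pred)(uintptr_t))
{
    unsigned n = 0;

    assert(pred != 0);
    for (uintptr_t p = 0; p < PAGE_SIZE; p++)
        n += pred(p) ? 1u : 0u;
    return n;
}
```

Over the 4096 possible starting offsets in a page, the old predicate
fires for 1984 of them and the new one for 31.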

 sysdeps/x86_64/multiarch/memchr-evex.S | 580 +++++++++++++++----------
 1 file changed, 349 insertions(+), 231 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memchr-evex.S b/sysdeps/x86_64/multiarch/memchr-evex.S
index 6dd5d67b90..65c16ef8a4 100644
--- a/sysdeps/x86_64/multiarch/memchr-evex.S
+++ b/sysdeps/x86_64/multiarch/memchr-evex.S
@@ -26,14 +26,28 @@
 
 # ifdef USE_AS_WMEMCHR
 #  define VPBROADCAST	vpbroadcastd
-#  define VPCMP		vpcmpd
-#  define SHIFT_REG	r8d
+#  define VPMINU	vpminud
+#  define VPCMP	vpcmpd
+#  define VPCMPEQ	vpcmpeqd
+#  define CHAR_SIZE	4
 # else
 #  define VPBROADCAST	vpbroadcastb
-#  define VPCMP		vpcmpb
-#  define SHIFT_REG	ecx
+#  define VPMINU	vpminub
+#  define VPCMP	vpcmpb
+#  define VPCMPEQ	vpcmpeqb
+#  define CHAR_SIZE	1
 # endif
 
+# ifdef USE_AS_RAWMEMCHR
+#  define RAW_PTR_REG	rcx
+#  define ALGN_PTR_REG	rdi
+# else
+#  define RAW_PTR_REG	rdi
+#  define ALGN_PTR_REG	rcx
+# endif
+
+#define XZERO		xmm23
+#define YZERO		ymm23
 # define XMMMATCH	xmm16
 # define YMMMATCH	ymm16
 # define YMM1		ymm17
@@ -44,18 +58,16 @@
 # define YMM6		ymm22
 
 # define VEC_SIZE 32
+# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE)
+# define PAGE_SIZE 4096
 
 	.section .text.evex,"ax",@progbits
-ENTRY (MEMCHR)
+ENTRY(MEMCHR)
 # ifndef USE_AS_RAWMEMCHR
 	/* Check for zero length.  */
 	test	%RDX_LP, %RDX_LP
 	jz	L(zero)
-# endif
-	movl	%edi, %ecx
-# ifdef USE_AS_WMEMCHR
-	shl	$2, %RDX_LP
-# else
+
 #  ifdef __ILP32__
 	/* Clear the upper 32 bits.  */
 	movl	%edx, %edx
@@ -63,319 +75,425 @@ ENTRY (MEMCHR)
 # endif
 	/* Broadcast CHAR to YMMMATCH.  */
 	VPBROADCAST %esi, %YMMMATCH
-	/* Check if we may cross page boundary with one vector load.  */
-	andl	$(2 * VEC_SIZE - 1), %ecx
-	cmpl	$VEC_SIZE, %ecx
-	ja	L(cros_page_boundary)
+	/* Check if we may cross page boundary with one
+	   vector load.  */
+	movl	%edi, %eax
+	andl	$(PAGE_SIZE - 1), %eax
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
+	ja	L(cross_page_boundary)
 
 	/* Check the first VEC_SIZE bytes.  */
-	VPCMP	$0, (%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
-	testl	%eax, %eax
-
+	VPCMP	$0, (%rdi), %YMMMATCH, %k0
+	kmovd	%k0, %eax
 # ifndef USE_AS_RAWMEMCHR
-	jnz	L(first_vec_x0_check)
-	/* Adjust length and check the end of data.  */
-	subq	$VEC_SIZE, %rdx
-	jbe	L(zero)
+	/* If length < CHAR_PER_VEC handle special.  */
+	cmpq	$CHAR_PER_VEC, %rdx
+	jbe	L(first_vec_x0)
+# endif
+	testl	%eax, %eax
+	jz	L(aligned_more)
+	tzcntl	%eax, %eax
+# ifdef USE_AS_WMEMCHR
+	/* NB: Multiply wchar_t count by CHAR_SIZE to get the
+	   number of bytes.  */
+	leaq	(%rdi, %rax, CHAR_SIZE), %rax
 # else
-	jnz	L(first_vec_x0)
+	addq	%rdi, %rax
 # endif
-
-	/* Align data for aligned loads in the loop.  */
-	addq	$VEC_SIZE, %rdi
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
+	ret
 
 # ifndef USE_AS_RAWMEMCHR
-	/* Adjust length.  */
-	addq	%rcx, %rdx
-
-	subq	$(VEC_SIZE * 4), %rdx
-	jbe	L(last_4x_vec_or_less)
-# endif
-	jmp	L(more_4x_vec)
+L(zero):
+	xorl	%eax, %eax
+	ret
 
+	.p2align 5
+L(first_vec_x0):
+	/* Check if first match was before length.  */
+	tzcntl	%eax, %eax
+	xorl	%ecx, %ecx
+	cmpl	%eax, %edx
+	leaq	(%rdi, %rax, CHAR_SIZE), %rax
+	cmovle	%rcx, %rax
+	ret
+# else
+	/* NB: first_vec_x0 is 17 bytes which will leave
+	   cross_page_boundary (which is relatively cold) close
+	   enough to ideal alignment. So only realign
+	   L(cross_page_boundary) if rawmemchr.  */
 	.p2align 4
-L(cros_page_boundary):
-	andl	$(VEC_SIZE - 1), %ecx
+# endif
+L(cross_page_boundary):
+	/* Save pointer before aligning as its original
+	   value is necessary for computing the return address if
+	   byte is found or adjusting length if it is not and this
+	   is memchr.  */
+	movq	%rdi, %rcx
+	/* Align data to VEC_SIZE. ALGN_PTR_REG is rcx
+	   for memchr and rdi for rawmemchr.  */
+	andq	$-VEC_SIZE, %ALGN_PTR_REG
+	VPCMP	$0, (%ALGN_PTR_REG), %YMMMATCH, %k0
+	kmovd	%k0, %r8d
 # ifdef USE_AS_WMEMCHR
-	/* NB: Divide shift count by 4 since each bit in K1 represent 4
-	   bytes.  */
-	movl	%ecx, %SHIFT_REG
-	sarl	$2, %SHIFT_REG
+	/* NB: Divide shift count by 4 since each bit in
+	   K0 represents 4 bytes.  */
+	sarl	$2, %eax
+# endif
+# ifndef USE_AS_RAWMEMCHR
+	movl	$(PAGE_SIZE / CHAR_SIZE), %esi
+	subl	%eax, %esi
 # endif
-	andq	$-VEC_SIZE, %rdi
-	VPCMP	$0, (%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
-	/* Remove the leading bytes.  */
-	sarxl	%SHIFT_REG, %eax, %eax
-	testl	%eax, %eax
-	jz	L(aligned_more)
-	tzcntl	%eax, %eax
 # ifdef USE_AS_WMEMCHR
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %eax
+	andl	$(CHAR_PER_VEC - 1), %eax
 # endif
+	/* Remove the leading bytes.  */
+	sarxl	%eax, %r8d, %eax
 # ifndef USE_AS_RAWMEMCHR
 	/* Check the end of data.  */
-	cmpq	%rax, %rdx
-	jbe	L(zero)
+	cmpq	%rsi, %rdx
+	jbe	L(first_vec_x0)
+# endif
+	testl	%eax, %eax
+	jz	L(cross_page_continue)
+	tzcntl	%eax, %eax
+# ifdef USE_AS_WMEMCHR
+	/* NB: Multiply wchar_t count by CHAR_SIZE to get the
+	   number of bytes.  */
+	leaq	(%RAW_PTR_REG, %rax, CHAR_SIZE), %rax
+# else
+	addq	%RAW_PTR_REG, %rax
 # endif
-	addq	%rdi, %rax
-	addq	%rcx, %rax
 	ret
 
 	.p2align 4
-L(aligned_more):
-# ifndef USE_AS_RAWMEMCHR
-        /* Calculate "rdx + rcx - VEC_SIZE" with "rdx - (VEC_SIZE - rcx)"
-	   instead of "(rdx + rcx) - VEC_SIZE" to void possible addition
-	   overflow.  */
-	negq	%rcx
-	addq	$VEC_SIZE, %rcx
+L(first_vec_x1):
+	tzcntl	%eax, %eax
+	leaq	VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax
+	ret
 
-	/* Check the end of data.  */
-	subq	%rcx, %rdx
-	jbe	L(zero)
-# endif
+	.p2align 4
+L(first_vec_x2):
+	tzcntl	%eax, %eax
+	leaq	(VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax
+	ret
 
-	addq	$VEC_SIZE, %rdi
+	.p2align 4
+L(first_vec_x3):
+	tzcntl	%eax, %eax
+	leaq	(VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax
+	ret
+
+	.p2align 4
+L(first_vec_x4):
+	tzcntl	%eax, %eax
+	leaq	(VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax
+	ret
+
+	.p2align 5
+L(aligned_more):
+	/* Check the first 4 * VEC_SIZE.  Only one
+	   VEC_SIZE at a time since data is only aligned to
+	   VEC_SIZE.  */
 
 # ifndef USE_AS_RAWMEMCHR
-	subq	$(VEC_SIZE * 4), %rdx
+	/* Align data to VEC_SIZE.  */
+L(cross_page_continue):
+	xorl	%ecx, %ecx
+	subl	%edi, %ecx
+	andq	$-VEC_SIZE, %rdi
+	/* esi is for adjusting length to see if near the
+	   end.  */
+	leal	(VEC_SIZE * 5)(%rdi, %rcx), %esi
+#  ifdef USE_AS_WMEMCHR
+	/* NB: Divide bytes by 4 to get the wchar_t
+	   count.  */
+	sarl	$2, %esi
+#  endif
+# else
+	andq	$-VEC_SIZE, %rdi
+L(cross_page_continue):
+# endif
+	/* Load first VEC regardless.  */
+	VPCMP	$0, (VEC_SIZE)(%rdi), %YMMMATCH, %k0
+	kmovd	%k0, %eax
+# ifndef USE_AS_RAWMEMCHR
+	/* Adjust length. If near end handle specially.
+	 */
+	subq	%rsi, %rdx
 	jbe	L(last_4x_vec_or_less)
 # endif
-
-L(more_4x_vec):
-	/* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
-	   since data is only aligned to VEC_SIZE.  */
-	VPCMP	$0, (%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x0)
-
-	VPCMP	$0, VEC_SIZE(%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
 	testl	%eax, %eax
 	jnz	L(first_vec_x1)
 
-	VPCMP	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
+	VPCMP	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0
+	kmovd	%k0, %eax
 	testl	%eax, %eax
 	jnz	L(first_vec_x2)
 
-	VPCMP	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
+	VPCMP	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k0
+	kmovd	%k0, %eax
 	testl	%eax, %eax
 	jnz	L(first_vec_x3)
 
-	addq	$(VEC_SIZE * 4), %rdi
+	VPCMP	$0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0
+	kmovd	%k0, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x4)
+
 
 # ifndef USE_AS_RAWMEMCHR
-	subq	$(VEC_SIZE * 4), %rdx
-	jbe	L(last_4x_vec_or_less)
-# endif
+	/* Check if at last CHAR_PER_VEC * 4 length.  */
+	subq	$(CHAR_PER_VEC * 4), %rdx
+	jbe	L(last_4x_vec_or_less_cmpeq)
+	addq	$VEC_SIZE, %rdi
 
-	/* Align data to 4 * VEC_SIZE.  */
-	movq	%rdi, %rcx
-	andl	$(4 * VEC_SIZE - 1), %ecx
+	/* Align data to VEC_SIZE * 4 for the loop and
+	   readjust length.  */
+#  ifdef USE_AS_WMEMCHR
+	movl	%edi, %ecx
 	andq	$-(4 * VEC_SIZE), %rdi
-
-# ifndef USE_AS_RAWMEMCHR
-	/* Adjust length.  */
+	andl	$(VEC_SIZE * 4 - 1), %ecx
+	/* NB: Divide bytes by 4 to get the wchar_t
+	   count.  */
+	sarl	$2, %ecx
 	addq	%rcx, %rdx
+#  else
+	addq	%rdi, %rdx
+	andq	$-(4 * VEC_SIZE), %rdi
+	subq	%rdi, %rdx
+#  endif
+# else
+	addq	$VEC_SIZE, %rdi
+	andq	$-(4 * VEC_SIZE), %rdi
 # endif
 
+	vpxorq	%XZERO, %XZERO, %XZERO
+
+	/* Compare 4 * VEC at a time forward.  */
 	.p2align 4
 L(loop_4x_vec):
-	/* Compare 4 * VEC at a time forward.  */
-	VPCMP	$0, (%rdi), %YMMMATCH, %k1
-	VPCMP	$0, VEC_SIZE(%rdi), %YMMMATCH, %k2
-	kord	%k1, %k2, %k5
-	VPCMP	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k3
-	VPCMP	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k4
-
-	kord	%k3, %k4, %k6
-	kortestd %k5, %k6
-	jnz	L(4x_vec_end)
-
-	addq	$(VEC_SIZE * 4), %rdi
-
+	/* It would be possible to save some instructions
+	   using 4x VPCMP but bottleneck on port 5 makes it not
+	   worth it.  */
+	VPCMP	$4, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k1
+	/* xor will set bytes matching esi to zero.  */
+	vpxorq	(VEC_SIZE * 5)(%rdi), %YMMMATCH, %YMM2
+	vpxorq	(VEC_SIZE * 6)(%rdi), %YMMMATCH, %YMM3
+	VPCMP	$0, (VEC_SIZE * 7)(%rdi), %YMMMATCH, %k3
+	/* Reduce VEC2 / VEC3 with min and VEC1 with zero
+	   mask.  */
+	VPMINU	%YMM2, %YMM3, %YMM3 {%k1} {z}
+	VPCMP	$0, %YMM3, %YZERO, %k2
 # ifdef USE_AS_RAWMEMCHR
-	jmp	L(loop_4x_vec)
+	subq	$-(VEC_SIZE * 4), %rdi
+	kortestd %k2, %k3
+	jz	L(loop_4x_vec)
 # else
-	subq	$(VEC_SIZE * 4), %rdx
-	ja	L(loop_4x_vec)
+	kortestd %k2, %k3
+	jnz	L(loop_4x_vec_end)
 
-L(last_4x_vec_or_less):
-	/* Less than 4 * VEC and aligned to VEC_SIZE.  */
-	addl	$(VEC_SIZE * 2), %edx
-	jle	L(last_2x_vec)
+	subq	$-(VEC_SIZE * 4), %rdi
 
-	VPCMP	$0, (%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x0)
+	subq	$(CHAR_PER_VEC * 4), %rdx
+	ja	L(loop_4x_vec)
 
-	VPCMP	$0, VEC_SIZE(%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
+	/* Fall through into less than 4 remaining
+	   vectors of length case.  */
+	VPCMP	$0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0
+	kmovd	%k0, %eax
+	addq	$(VEC_SIZE * 3), %rdi
+	.p2align 4
+L(last_4x_vec_or_less):
+	/* Check if first VEC contained match.  */
 	testl	%eax, %eax
-	jnz	L(first_vec_x1)
+	jnz	L(first_vec_x1_check)
 
-	VPCMP	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
-	testl	%eax, %eax
+	/* If remaining length > CHAR_PER_VEC * 2.  */
+	addl	$(CHAR_PER_VEC * 2), %edx
+	jg	L(last_4x_vec)
 
-	jnz	L(first_vec_x2_check)
-	subl	$VEC_SIZE, %edx
-	jle	L(zero)
+L(last_2x_vec):
+	/* If remaining length < CHAR_PER_VEC.  */
+	addl	$CHAR_PER_VEC, %edx
+	jle	L(zero_end)
+
+	/* Check VEC2 and compare any match with
+	   remaining length.  */
+	VPCMP	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0
+	kmovd	%k0, %eax
+	tzcntl	%eax, %eax
+	cmpl	%eax, %edx
+	jbe	L(set_zero_end)
+	leaq	(VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax
+L(zero_end):
+	ret
 
-	VPCMP	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
-	testl	%eax, %eax
 
-	jnz	L(first_vec_x3_check)
+	.p2align 4
+L(first_vec_x1_check):
+	tzcntl	%eax, %eax
+	/* Adjust length.  */
+	subl	$-(CHAR_PER_VEC * 4), %edx
+	/* Check if match within remaining length.  */
+	cmpl	%eax, %edx
+	jbe	L(set_zero_end)
+	/* NB: Multiply wchar_t count by CHAR_SIZE to get the
+	   number of bytes.  */
+	leaq	VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax
+	ret
+L(set_zero_end):
 	xorl	%eax, %eax
 	ret
 
 	.p2align 4
-L(last_2x_vec):
-	addl	$(VEC_SIZE * 2), %edx
-	VPCMP	$0, (%rdi), %YMMMATCH, %k1
+L(loop_4x_vec_end):
+# endif
+	/* rawmemchr will fall through into this if match
+	   was found in loop.  */
+
+	/* k1 has the inverse (not) of matches with VEC1.  */
 	kmovd	%k1, %eax
-	testl	%eax, %eax
+# ifdef USE_AS_WMEMCHR
+	subl	$((1 << CHAR_PER_VEC) - 1), %eax
+# else
+	incl	%eax
+# endif
+	jnz	L(last_vec_x1_return)
 
-	jnz	L(first_vec_x0_check)
-	subl	$VEC_SIZE, %edx
-	jle	L(zero)
+	VPCMP	$0, %YMM2, %YZERO, %k0
+	kmovd	%k0, %eax
+	testl	%eax, %eax
+	jnz	L(last_vec_x2_return)
 
-	VPCMP	$0, VEC_SIZE(%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
+	kmovd	%k2, %eax
 	testl	%eax, %eax
-	jnz	L(first_vec_x1_check)
-	xorl	%eax, %eax
-	ret
+	jnz	L(last_vec_x3_return)
 
-	.p2align 4
-L(first_vec_x0_check):
+	kmovd	%k3, %eax
 	tzcntl	%eax, %eax
-# ifdef USE_AS_WMEMCHR
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %eax
+# ifdef USE_AS_RAWMEMCHR
+	leaq	(VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax
+# else
+	leaq	(VEC_SIZE * 7)(%rdi, %rax, CHAR_SIZE), %rax
 # endif
-	/* Check the end of data.  */
-	cmpq	%rax, %rdx
-	jbe	L(zero)
-	addq	%rdi, %rax
 	ret
 
 	.p2align 4
-L(first_vec_x1_check):
+L(last_vec_x1_return):
 	tzcntl	%eax, %eax
-# ifdef USE_AS_WMEMCHR
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %eax
-# endif
-	/* Check the end of data.  */
-	cmpq	%rax, %rdx
-	jbe	L(zero)
-	addq	$VEC_SIZE, %rax
+# ifdef USE_AS_RAWMEMCHR
+#  ifdef USE_AS_WMEMCHR
+	/* NB: Multiply wchar_t count by CHAR_SIZE to get the
+	   number of bytes.  */
+	leaq	(%rdi, %rax, CHAR_SIZE), %rax
+#  else
 	addq	%rdi, %rax
-	ret
-
-	.p2align 4
-L(first_vec_x2_check):
-	tzcntl	%eax, %eax
-# ifdef USE_AS_WMEMCHR
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %eax
+#  endif
+# else
+	/* NB: Multiply wchar_t count by CHAR_SIZE to get the
+	   number of bytes.  */
+	leaq	(VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax
 # endif
-	/* Check the end of data.  */
-	cmpq	%rax, %rdx
-	jbe	L(zero)
-	addq	$(VEC_SIZE * 2), %rax
-	addq	%rdi, %rax
 	ret
 
 	.p2align 4
-L(first_vec_x3_check):
+L(last_vec_x2_return):
 	tzcntl	%eax, %eax
-# ifdef USE_AS_WMEMCHR
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	sall	$2, %eax
+# ifdef USE_AS_RAWMEMCHR
+	/* NB: Multiply wchar_t count by CHAR_SIZE to get the
+	   number of bytes.  */
+	leaq	VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax
+# else
+	/* NB: Multiply wchar_t count by CHAR_SIZE to get the
+	   number of bytes.  */
+	leaq	(VEC_SIZE * 5)(%rdi, %rax, CHAR_SIZE), %rax
 # endif
-	/* Check the end of data.  */
-	cmpq	%rax, %rdx
-	jbe	L(zero)
-	addq	$(VEC_SIZE * 3), %rax
-	addq	%rdi, %rax
 	ret
 
 	.p2align 4
-L(zero):
-	xorl	%eax, %eax
-	ret
-# endif
-
-	.p2align 4
-L(first_vec_x0):
+L(last_vec_x3_return):
 	tzcntl	%eax, %eax
-# ifdef USE_AS_WMEMCHR
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	leaq	(%rdi, %rax, 4), %rax
+# ifdef USE_AS_RAWMEMCHR
+	/* NB: Multiply wchar_t count by CHAR_SIZE to get the
+	   number of bytes.  */
+	leaq	(VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax
 # else
-	addq	%rdi, %rax
+	/* NB: Multiply wchar_t count by CHAR_SIZE to get the
+	   number of bytes.  */
+	leaq	(VEC_SIZE * 6)(%rdi, %rax, CHAR_SIZE), %rax
 # endif
 	ret
 
+
+# ifndef USE_AS_RAWMEMCHR
+L(last_4x_vec_or_less_cmpeq):
+	VPCMP	$0, (VEC_SIZE * 5)(%rdi), %YMMMATCH, %k0
+	kmovd	%k0, %eax
+	subq	$-(VEC_SIZE * 4), %rdi
+	/* Check first VEC regardless.  */
+	testl	%eax, %eax
+	jnz	L(first_vec_x1_check)
+
+	/* If remaining length <= CHAR_PER_VEC * 2.  */
+	addl	$(CHAR_PER_VEC * 2), %edx
+	jle	L(last_2x_vec)
+
 	.p2align 4
-L(first_vec_x1):
+L(last_4x_vec):
+	VPCMP	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0
+	kmovd	%k0, %eax
+	testl	%eax, %eax
+	jnz	L(last_vec_x2)
+
+
+	VPCMP	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k0
+	kmovd	%k0, %eax
+	/* Create mask for possible matches within
+	   remaining length.  */
+#  ifdef USE_AS_WMEMCHR
+	movl	$((1 << (CHAR_PER_VEC * 2)) - 1), %ecx
+	bzhil	%edx, %ecx, %ecx
+#  else
+	movq	$-1, %rcx
+	bzhiq	%rdx, %rcx, %rcx
+#  endif
+	/* Test matches in data against length mask.  */
+	andl	%ecx, %eax
+	jnz	L(last_vec_x3)
+
+	/* If remaining length <= CHAR_PER_VEC * 3 (note
+	   this is after remaining length was found to be >
+	   CHAR_PER_VEC * 2).  */
+	subl	$CHAR_PER_VEC, %edx
+	jbe	L(zero_end2)
+
+
+	VPCMP	$0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0
+	kmovd	%k0, %eax
+	/* Shift remaining length mask for last VEC.  */
+#  ifdef USE_AS_WMEMCHR
+	shrl	$CHAR_PER_VEC, %ecx
+#  else
+	shrq	$CHAR_PER_VEC, %rcx
+#  endif
+	andl	%ecx, %eax
+	jz	L(zero_end2)
 	tzcntl	%eax, %eax
-# ifdef USE_AS_WMEMCHR
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	leaq	VEC_SIZE(%rdi, %rax, 4), %rax
-# else
-	addq	$VEC_SIZE, %rax
-	addq	%rdi, %rax
-# endif
+	leaq	(VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax
+L(zero_end2):
 	ret
 
-	.p2align 4
-L(first_vec_x2):
+L(last_vec_x2):
 	tzcntl	%eax, %eax
-# ifdef USE_AS_WMEMCHR
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	leaq	(VEC_SIZE * 2)(%rdi, %rax, 4), %rax
-# else
-	addq	$(VEC_SIZE * 2), %rax
-	addq	%rdi, %rax
-# endif
+	leaq	(VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax
 	ret
 
 	.p2align 4
-L(4x_vec_end):
-	kmovd	%k1, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x0)
-	kmovd	%k2, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x1)
-	kmovd	%k3, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x2)
-	kmovd	%k4, %eax
-	testl	%eax, %eax
-L(first_vec_x3):
+L(last_vec_x3):
 	tzcntl	%eax, %eax
-# ifdef USE_AS_WMEMCHR
-	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
-	leaq	(VEC_SIZE * 3)(%rdi, %rax, 4), %rax
-# else
-	addq	$(VEC_SIZE * 3), %rax
-	addq	%rdi, %rax
-# endif
+	leaq	(VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax
 	ret
+# endif
 
-END (MEMCHR)
+END(MEMCHR)
 #endif
-- 
2.29.2
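
A note on the new L(loop_4x_vec) body, since the reduction is not
obvious from the diff. VEC1 is compared for not-equal into k1, VEC2
and VEC3 are XORed with the broadcast CHAR (a matching lane becomes
zero), their unsigned minimum is taken under a zeroing k1 mask, and
the result is compared against zero. The following is a scalar C
model of one iteration for the byte (memchr) variant, as a sketch
only: struct loop_masks, loop_iteration, loop_found and
first_vec1_match are hypothetical names, and __builtin_ctz assumes
GCC or Clang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VEC_SIZE 32

/* Hypothetical container for the three mask registers.  */
struct loop_masks { uint32_t k1, k2, k3; };

/* Model one loop iteration over 4 * VEC_SIZE bytes at p.  */
struct loop_masks loop_iteration(const uint8_t *p, uint8_t c)
{
    struct loop_masks m = { 0, 0, 0 };

    assert(p != 0);
    for (int i = 0; i < VEC_SIZE; i++) {
        uint8_t ne1 = p[i] != c;               /* VPCMP $4: VEC1 != CHAR */
        uint8_t x2 = p[VEC_SIZE + i] ^ c;      /* vpxorq: 0 iff VEC2 matches */
        uint8_t x3 = p[2 * VEC_SIZE + i] ^ c;  /* vpxorq: 0 iff VEC3 matches */
        uint8_t mn = x2 < x3 ? x2 : x3;        /* VPMINU */
        uint8_t red = ne1 ? mn : 0;            /* {%k1}{z}: zero on VEC1 match */

        m.k1 |= (uint32_t) ne1 << i;
        m.k2 |= (uint32_t) (red == 0) << i;    /* VPCMP $0 against YZERO */
        m.k3 |= (uint32_t) (p[3 * VEC_SIZE + i] == c) << i;
    }
    return m;
}

/* kortestd %k2, %k3: nonzero iff any of the four vectors matched.  */
bool loop_found(struct loop_masks m)
{
    return (m.k2 | m.k3) != 0;
}

/* k1 is inverted, so "incl %eax" wraps to zero when VEC1 had no match;
   otherwise the lowest set bit of k1 + 1 is the first matching lane.
   Returns -1 if VEC1 had no match.  */
int first_vec1_match(uint32_t k1)
{
    uint32_t t = k1 + 1;

    return t != 0 ? __builtin_ctz(t) : -1;
}
```

The point of the reduction is that only two of the four vectors go
through VPCMP inside the loop; the other two use vpxorq/VPMINU, which
is what the in-loop comment about the port 5 bottleneck refers to.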

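
A note on the length masking in L(last_4x_vec). Instead of branching
on the remaining length before each of the last two vectors, the
patch builds one bit mask with bzhi and ANDs it with the match bits;
the same mask, shifted right by CHAR_PER_VEC, serves the final
vector. A C model for the byte variant follows, as a sketch under my
reading of the code: "remaining" is taken to be the number of
in-bounds chars counted from the start of the third vector, and
length_mask and last_two_vecs are hypothetical names. __builtin_ctz
assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

#define CHAR_PER_VEC 32   /* byte variant: VEC_SIZE / CHAR_SIZE */

/* Model of "movq $-1, %rcx; bzhiq %rdx, %rcx, %rcx": keep low n bits.
   bzhi leaves the source unchanged when n >= operand width.  */
uint64_t length_mask(unsigned n)
{
    return n >= 64 ? UINT64_MAX : (UINT64_C(1) << n) - 1;
}

/* match3 / match4 are the raw match bits of the third and fourth
   vectors.  Returns the index of the first in-bounds match, counted
   from the start of the third vector, or -1 if there is none.  */
int last_two_vecs(uint32_t match3, uint32_t match4, unsigned remaining)
{
    uint64_t mask;
    uint32_t hits;

    assert(remaining > 0 && remaining <= 2 * CHAR_PER_VEC);
    mask = length_mask(remaining);

    hits = match3 & (uint32_t) mask;            /* andl %ecx, %eax */
    if (hits != 0)
        return __builtin_ctz(hits);
    if (remaining <= CHAR_PER_VEC)              /* subl; jbe L(zero_end2) */
        return -1;
    hits = match4 & (uint32_t) (mask >> CHAR_PER_VEC);  /* shrq; andl */
    if (hits != 0)
        return CHAR_PER_VEC + __builtin_ctz(hits);
    return -1;
}
```

A match bit at or beyond the remaining length is simply masked away,
so an out-of-bounds match in the last vectors needs no extra compare.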

Thread overview: 20+ messages
2021-05-03  8:44 [PATCH v1 1/3] Bench: Expand bench-memchr.c Noah Goldstein
2021-05-03  8:44 ` [PATCH v1 2/3] x86: Optimize memchr-avx2.S Noah Goldstein
2021-05-03 18:50   ` H.J. Lu
2021-05-03 20:06     ` Noah Goldstein
2021-05-03 20:06   ` [PATCH v2 " Noah Goldstein
2021-05-03 20:06     ` [PATCH v2 3/3] x86: Optimize memchr-evex.S Noah Goldstein
2021-05-03 22:26       ` H.J. Lu
2021-05-03 22:58         ` Noah Goldstein
2021-05-03 22:25     ` [PATCH v2 2/3] x86: Optimize memchr-avx2.S H.J. Lu
2021-05-03 22:58       ` Noah Goldstein
2021-05-03 22:58   ` [PATCH v3 " Noah Goldstein
2021-05-03 22:58     ` [PATCH v3 3/3] x86: Optimize memchr-evex.S Noah Goldstein
2021-05-03 22:59       ` Noah Goldstein
2021-05-03 22:59     ` [PATCH v3 2/3] x86: Optimize memchr-avx2.S Noah Goldstein
2021-05-03  8:44 ` Noah Goldstein [this message]
2021-05-03 18:58   ` [PATCH v1 3/3] x86: Optimize memchr-evex.S H.J. Lu
2021-05-03 20:06     ` Noah Goldstein
2021-05-03 17:17 ` [PATCH v1 1/3] Bench: Expand bench-memchr.c H.J. Lu
2021-05-03 19:51   ` Noah Goldstein
2021-05-03 20:59     ` H.J. Lu
