From: Noah Goldstein <goldstein.w.n@gmail.com>
To: libc-alpha@sourceware.org
Subject: [PATCH v1 1/2] x86: Optimize strlen-evex.S
Date: Fri, 16 Apr 2021 22:52:16 -0400
Message-ID: <20210417025215.874105-1-goldstein.w.n@gmail.com>
No bug. This commit optimizes strlen-evex.S. The
optimizations are mostly small things but they add up to roughly
10-30% performance improvement for strlen. The results for strnlen are
a bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and
test-wcsnlen are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
Tests were run on the following CPUs:
Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html
Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html
Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html
All times are the geometric mean of N=20. The unit of time is
seconds.
"Cur" refers to the current implementation
"New" refers to this patch's implementation
The strlen numbers are universal improvements:
Results For Skylake strlen-avx2
size, algn, Cur T , New T , Win , Dif
1 , 0 , 4.76 , 4.27 , New , 0.49
2 , 0 , 4.77 , 4.165 , New , 0.6
3 , 0 , 4.617 , 4.095 , New , 0.52
4 , 0 , 4.579 , 4.006 , New , 0.57
5 , 0 , 4.608 , 4.008 , New , 0.6
6 , 0 , 4.655 , 4.086 , New , 0.57
7 , 0 , 4.661 , 4.071 , New , 0.59
8 , 0 , 4.625 , 4.092 , New , 0.53
16 , 0 , 4.608 , 4.021 , New , 0.59
10 , 0 , 4.645 , 4.111 , New , 0.53
32 , 0 , 5.532 , 4.817 , New , 0.71
21 , 0 , 4.636 , 3.775 , New , 0.86
64 , 0 , 5.991 , 5.352 , New , 0.64
42 , 0 , 5.529 , 4.789 , New , 0.74
128 , 0 , 7.042 , 5.473 , New , 1.57
85 , 0 , 6.118 , 5.466 , New , 0.65
256 , 0 , 10.64 , 7.954 , New , 2.69
170 , 0 , 9.918 , 9.585 , New , 0.33
512 , 0 , 12.916, 10.242, New , 2.67
341 , 0 , 10.764, 10.216, New , 0.55
1024, 0 , 18.163, 14.844, New , 3.32
682 , 0 , 15.292, 13.382, New , 1.91
2048, 0 , 38.732, 24.396, New , 14.34
1365, 0 , 22.299, 20.08 , New , 2.22
4096, 0 , 79.054, 68.682, New , 10.37
2730, 0 , 61.47 , 40.705, New , 20.77
Results For Icelake strlen-avx2
size, algn, Cur T , New T , Win , Dif
1 , 0 , 2.681 , 1.99 , New , 0.69
2 , 0 , 2.823 , 2.232 , New , 0.59
3 , 0 , 2.57 , 2.077 , New , 0.49
4 , 0 , 2.659 , 2.128 , New , 0.53
5 , 0 , 2.666 , 2.109 , New , 0.56
6 , 0 , 2.596 , 2.053 , New , 0.54
7 , 0 , 2.623 , 2.152 , New , 0.47
8 , 0 , 2.675 , 2.178 , New , 0.5
16 , 0 , 2.675 , 2.202 , New , 0.47
10 , 0 , 2.672 , 2.2 , New , 0.47
32 , 0 , 3.383 , 2.868 , New , 0.52
21 , 0 , 2.693 , 2.032 , New , 0.66
64 , 0 , 3.404 , 3.056 , New , 0.35
42 , 0 , 3.511 , 2.967 , New , 0.54
128 , 0 , 4.191 , 3.627 , New , 0.56
85 , 0 , 3.559 , 2.922 , New , 0.64
256 , 0 , 6.782 , 5.493 , New , 1.29
170 , 0 , 6.24 , 4.988 , New , 1.25
512 , 0 , 9.305 , 7.308 , New , 2.0
341 , 0 , 7.626 , 6.272 , New , 1.35
1024, 0 , 14.455, 11.544, New , 2.91
682 , 0 , 10.728, 8.738 , New , 1.99
2048, 0 , 24.171, 24.101, New , 0.07
1365, 0 , 17.474, 14.387, New , 3.09
4096, 0 , 57.659, 51.675, New , 5.98
2730, 0 , 44.702, 28.04 , New , 16.66
Results For Tigerlake strlen-avx2
size, algn, Cur T , New T , Win , Dif
1 , 0 , 4.369 , 3.008 , New , 1.36
2 , 0 , 4.054 , 3.231 , New , 0.82
3 , 0 , 4.081 , 3.243 , New , 0.84
4 , 0 , 3.904 , 3.17 , New , 0.73
5 , 0 , 3.915 , 3.178 , New , 0.74
6 , 0 , 3.924 , 3.184 , New , 0.74
7 , 0 , 3.917 , 3.177 , New , 0.74
8 , 0 , 3.889 , 3.209 , New , 0.68
16 , 0 , 3.878 , 3.03 , New , 0.85
10 , 0 , 3.892 , 3.004 , New , 0.89
32 , 0 , 4.957 , 4.162 , New , 0.79
21 , 0 , 3.866 , 3.18 , New , 0.69
64 , 0 , 5.035 , 4.521 , New , 0.51
42 , 0 , 5.039 , 4.276 , New , 0.76
128 , 0 , 6.117 , 5.253 , New , 0.86
85 , 0 , 4.932 , 4.421 , New , 0.51
256 , 0 , 10.019, 8.221 , New , 1.8
170 , 0 , 8.954 , 7.404 , New , 1.55
512 , 0 , 14.071, 10.948, New , 3.12
341 , 0 , 11.177, 9.246 , New , 1.93
1024, 0 , 21.808, 17.034, New , 4.77
682 , 0 , 16.07 , 12.941, New , 3.13
2048, 0 , 37.332, 29.853, New , 7.48
1365, 0 , 26.394, 21.516, New , 4.88
4096, 0 , 87.951, 80.35 , New , 7.6
2730, 0 , 62.768, 44.247, New , 18.52
Results For Icelake strlen-evex
size, algn, Cur T , New T , Win , Dif
1 , 0 , 2.681 , 1.99 , New , 0.69
2 , 0 , 2.823 , 2.232 , New , 0.59
3 , 0 , 2.57 , 2.077 , New , 0.49
4 , 0 , 2.659 , 2.128 , New , 0.53
5 , 0 , 2.666 , 2.109 , New , 0.56
6 , 0 , 2.596 , 2.053 , New , 0.54
7 , 0 , 2.623 , 2.152 , New , 0.47
8 , 0 , 2.675 , 2.178 , New , 0.5
16 , 0 , 2.675 , 2.202 , New , 0.47
10 , 0 , 2.672 , 2.2 , New , 0.47
32 , 0 , 3.383 , 2.868 , New , 0.52
21 , 0 , 2.693 , 2.032 , New , 0.66
64 , 0 , 3.404 , 3.056 , New , 0.35
42 , 0 , 3.511 , 2.967 , New , 0.54
128 , 0 , 4.191 , 3.627 , New , 0.56
85 , 0 , 3.559 , 2.922 , New , 0.64
256 , 0 , 6.782 , 5.493 , New , 1.29
170 , 0 , 6.24 , 4.988 , New , 1.25
512 , 0 , 9.305 , 7.308 , New , 2.0
341 , 0 , 7.626 , 6.272 , New , 1.35
1024, 0 , 14.455, 11.544, New , 2.91
682 , 0 , 10.728, 8.738 , New , 1.99
2048, 0 , 24.171, 24.101, New , 0.07
1365, 0 , 17.474, 14.387, New , 3.09
4096, 0 , 57.659, 51.675, New , 5.98
2730, 0 , 44.702, 28.04 , New , 16.66
Results For Tigerlake strlen-evex
size, algn, Cur T , New T , Win , Dif
1 , 0 , 4.369 , 3.008 , New , 1.36
2 , 0 , 4.054 , 3.231 , New , 0.82
3 , 0 , 4.081 , 3.243 , New , 0.84
4 , 0 , 3.904 , 3.17 , New , 0.73
5 , 0 , 3.915 , 3.178 , New , 0.74
6 , 0 , 3.924 , 3.184 , New , 0.74
7 , 0 , 3.917 , 3.177 , New , 0.74
8 , 0 , 3.889 , 3.209 , New , 0.68
16 , 0 , 3.878 , 3.03 , New , 0.85
10 , 0 , 3.892 , 3.004 , New , 0.89
32 , 0 , 4.957 , 4.162 , New , 0.79
21 , 0 , 3.866 , 3.18 , New , 0.69
64 , 0 , 5.035 , 4.521 , New , 0.51
42 , 0 , 5.039 , 4.276 , New , 0.76
128 , 0 , 6.117 , 5.253 , New , 0.86
85 , 0 , 4.932 , 4.421 , New , 0.51
256 , 0 , 10.019, 8.221 , New , 1.8
170 , 0 , 8.954 , 7.404 , New , 1.55
512 , 0 , 14.071, 10.948, New , 3.12
341 , 0 , 11.177, 9.246 , New , 1.93
1024, 0 , 21.808, 17.034, New , 4.77
682 , 0 , 16.07 , 12.941, New , 3.13
2048, 0 , 37.332, 29.853, New , 7.48
1365, 0 , 26.394, 21.516, New , 4.88
4096, 0 , 87.951, 80.35 , New , 7.6
2730, 0 , 62.768, 44.247, New , 18.52
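The Win and Dif columns are not defined in the text. Judging from the rows, Dif appears to be the absolute difference between the two timings and Win the faster side ("Eq" on a tie). The sketch below shows that reading, plus the relative improvement behind the "10-30%" figure. The function names are made up for illustration and are not part of the benchmark harness.

```c
#include <assert.h>

/* Absolute difference between the two timings, as in the "Dif" column. */
double abs_dif(double cur_t, double new_t)
{
    return cur_t > new_t ? cur_t - new_t : new_t - cur_t;
}

/* Label for the "Win" column: the faster side, or "Eq" on a tie. */
const char *winner(double cur_t, double new_t)
{
    if (new_t < cur_t)
        return "New";
    if (cur_t < new_t)
        return "Cur";
    return "Eq";
}

/* Fraction of the current time saved by the new code (negative if the
   new code is slower). */
double relative_improvement(double cur_t, double new_t)
{
    assert(cur_t > 0.0);
    return (cur_t - new_t) / cur_t;
}
```

For the first Skylake row (Cur T = 4.76, New T = 4.27) this gives a Dif of about 0.49 and a relative improvement of about 10%.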
The strnlen numbers are a bit more of a mixed bag but I think
generally positive. It's possible that the current version should be
kept. Let me know.
Results For Skylake strnlen-avx2
size, algn, Cur T , New T , Win , Dif
1 , 0 , 4.06 , 4.1 , Cur , 0.04
2 , 0 , 4.15 , 4.08 , New , 0.07
3 , 0 , 4.1 , 4.03 , New , 0.07
4 , 0 , 3.95 , 3.91 , New , 0.04
5 , 0 , 4.07 , 3.9 , New , 0.17
6 , 0 , 4.04 , 3.92 , New , 0.12
7 , 0 , 4.03 , 3.89 , New , 0.14
1 , 1 , 3.75 , 3.79 , Cur , 0.04
2 , 2 , 4.0 , 3.91 , New , 0.09
3 , 3 , 4.04 , 3.92 , New , 0.12
4 , 4 , 3.95 , 3.86 , New , 0.09
5 , 5 , 3.97 , 3.91 , New , 0.06
6 , 6 , 3.96 , 3.92 , New , 0.04
7 , 7 , 3.97 , 3.91 , New , 0.06
4 , 1 , 3.76 , 3.83 , Cur , 0.07
8 , 0 , 3.73 , 4.01 , Cur , 0.28
8 , 1 , 3.73 , 3.88 , Cur , 0.15
16 , 0 , 3.68 , 3.84 , Cur , 0.16
16 , 1 , 3.75 , 3.92 , Cur , 0.17
32 , 0 , 5.93 , 5.95 , Cur , 0.02
32 , 1 , 5.95 , 5.98 , Cur , 0.03
64 , 0 , 6.46 , 5.31 , New , 1.15
64 , 1 , 6.66 , 5.43 , New , 1.23
128 , 0 , 7.42 , 6.02 , New , 1.4
128 , 1 , 7.57 , 5.81 , New , 1.76
256 , 0 , 12.02 , 9.89 , New , 2.13
256 , 1 , 11.91 , 9.84 , New , 2.07
512 , 0 , 15.06 , 11.77 , New , 3.29
512 , 1 , 14.79 , 11.75 , New , 3.04
1024, 0 , 23.61 , 16.98 , New , 6.63
1024, 1 , 23.63 , 16.91 , New , 6.72
Results For Icelake strnlen-avx2
size, algn, Cur T , New T , Win , Dif
1 , 0 , 2.81 , 2.51 , New , 0.3
2 , 0 , 2.8 , 2.53 , New , 0.27
3 , 0 , 2.7 , 2.57 , New , 0.13
4 , 0 , 2.68 , 2.55 , New , 0.13
5 , 0 , 2.7 , 2.57 , New , 0.13
6 , 0 , 2.73 , 2.6 , New , 0.13
7 , 0 , 2.69 , 2.61 , New , 0.08
1 , 1 , 2.53 , 2.5 , New , 0.03
2 , 2 , 2.67 , 2.6 , New , 0.07
3 , 3 , 2.67 , 2.59 , New , 0.08
4 , 4 , 2.66 , 2.57 , New , 0.09
5 , 5 , 2.65 , 2.56 , New , 0.09
6 , 6 , 2.67 , 2.59 , New , 0.08
7 , 7 , 2.65 , 2.62 , New , 0.03
4 , 1 , 2.65 , 2.41 , New , 0.24
8 , 0 , 2.68 , 2.56 , New , 0.12
8 , 1 , 2.62 , 2.55 , New , 0.07
16 , 0 , 2.66 , 2.56 , New , 0.1
16 , 1 , 2.63 , 2.55 , New , 0.08
32 , 0 , 3.62 , 3.19 , New , 0.43
32 , 1 , 3.74 , 3.45 , New , 0.29
64 , 0 , 3.9 , 3.7 , New , 0.2
64 , 1 , 4.13 , 3.68 , New , 0.45
128 , 0 , 4.34 , 4.17 , New , 0.17
128 , 1 , 4.59 , 4.07 , New , 0.52
256 , 0 , 6.74 , 6.56 , New , 0.18
256 , 1 , 7.34 , 7.13 , New , 0.21
512 , 0 , 9.64 , 8.67 , New , 0.97
512 , 1 , 9.49 , 8.56 , New , 0.93
1024, 0 , 13.57 , 12.35 , New , 1.22
1024, 1 , 13.57 , 12.59 , New , 0.98
Results For Tigerlake strnlen-avx2
size, algn, Cur T , New T , Win , Dif
1 , 0 , 4.21 , 3.91 , New , 0.3
2 , 0 , 4.1 , 3.79 , New , 0.31
3 , 0 , 4.02 , 3.81 , New , 0.21
4 , 0 , 4.06 , 3.82 , New , 0.24
5 , 0 , 4.1 , 3.81 , New , 0.29
6 , 0 , 4.08 , 3.82 , New , 0.26
7 , 0 , 4.07 , 3.87 , New , 0.2
1 , 1 , 3.95 , 3.8 , New , 0.15
2 , 2 , 4.11 , 3.88 , New , 0.23
3 , 3 , 4.08 , 3.88 , New , 0.2
4 , 4 , 4.05 , 3.94 , New , 0.11
5 , 5 , 4.02 , 3.89 , New , 0.13
6 , 6 , 4.02 , 3.89 , New , 0.13
7 , 7 , 4.08 , 3.84 , New , 0.24
4 , 1 , 4.07 , 3.7 , New , 0.37
8 , 0 , 4.08 , 3.95 , New , 0.13
8 , 1 , 4.01 , 4.02 , Cur , 0.01
16 , 0 , 4.03 , 4.03 , Eq , 0.0
16 , 1 , 4.05 , 4.0 , New , 0.05
32 , 0 , 5.86 , 5.23 , New , 0.63
32 , 1 , 5.88 , 5.36 , New , 0.52
64 , 0 , 6.38 , 5.73 , New , 0.65
64 , 1 , 6.49 , 5.56 , New , 0.93
128 , 0 , 7.17 , 6.39 , New , 0.78
128 , 1 , 7.1 , 6.41 , New , 0.69
256 , 0 , 11.65 , 11.0 , New , 0.65
256 , 1 , 11.37 , 10.97 , New , 0.4
512 , 0 , 14.86 , 13.43 , New , 1.43
512 , 1 , 14.63 , 13.35 , New , 1.28
1024, 0 , 20.92 , 19.33 , New , 1.59
1024, 1 , 20.85 , 19.38 , New , 1.47
Results For Icelake strnlen-evex
size, algn, Cur T , New T , Win , Dif
1 , 0 , 2.9 , 2.66 , New , 0.24
2 , 0 , 2.99 , 2.72 , New , 0.27
3 , 0 , 2.93 , 2.64 , New , 0.29
4 , 0 , 2.83 , 2.55 , New , 0.28
5 , 0 , 2.92 , 2.64 , New , 0.28
6 , 0 , 2.95 , 2.64 , New , 0.31
7 , 0 , 2.91 , 2.65 , New , 0.26
1 , 1 , 2.63 , 2.49 , New , 0.14
2 , 2 , 2.89 , 2.6 , New , 0.29
3 , 3 , 2.89 , 2.59 , New , 0.3
4 , 4 , 2.9 , 2.58 , New , 0.32
5 , 5 , 2.87 , 2.57 , New , 0.3
6 , 6 , 2.9 , 2.57 , New , 0.33
7 , 7 , 2.88 , 2.64 , New , 0.24
4 , 1 , 2.65 , 2.39 , New , 0.26
8 , 0 , 2.85 , 2.57 , New , 0.28
8 , 1 , 2.62 , 2.4 , New , 0.22
16 , 0 , 2.83 , 2.56 , New , 0.27
16 , 1 , 2.63 , 2.39 , New , 0.24
32 , 0 , 3.95 , 3.06 , New , 0.89
32 , 1 , 3.95 , 3.15 , New , 0.8
64 , 0 , 3.98 , 3.6 , New , 0.38
64 , 1 , 3.88 , 3.48 , New , 0.4
128 , 0 , 4.45 , 4.19 , New , 0.26
128 , 1 , 4.57 , 4.21 , New , 0.36
256 , 0 , 6.75 , 6.97 , Cur , 0.22
256 , 1 , 7.55 , 7.76 , Cur , 0.21
512 , 0 , 9.75 , 10.09 , Cur , 0.34
512 , 1 , 9.84 , 10.13 , Cur , 0.29
1024, 0 , 14.45 , 14.4 , New , 0.05
1024, 1 , 14.39 , 14.26 , New , 0.13
Results For Tigerlake strnlen-evex
size, algn, Cur T , New T , Win , Dif
1 , 0 , 3.86 , 3.59 , New , 0.27
2 , 0 , 3.78 , 3.41 , New , 0.37
3 , 0 , 3.69 , 3.4 , New , 0.29
4 , 0 , 3.62 , 3.33 , New , 0.29
5 , 0 , 3.76 , 3.37 , New , 0.39
6 , 0 , 3.73 , 3.39 , New , 0.34
7 , 0 , 3.7 , 3.4 , New , 0.3
1 , 1 , 3.58 , 3.35 , New , 0.23
2 , 2 , 3.75 , 3.34 , New , 0.41
3 , 3 , 3.72 , 3.39 , New , 0.33
4 , 4 , 3.69 , 3.38 , New , 0.31
5 , 5 , 3.69 , 3.37 , New , 0.32
6 , 6 , 3.68 , 3.37 , New , 0.31
7 , 7 , 3.74 , 3.35 , New , 0.39
4 , 1 , 3.39 , 3.27 , New , 0.12
8 , 0 , 3.4 , 3.29 , New , 0.11
8 , 1 , 3.34 , 3.32 , New , 0.02
16 , 0 , 3.36 , 3.34 , New , 0.02
16 , 1 , 3.39 , 3.3 , New , 0.09
32 , 0 , 5.13 , 5.13 , Eq , 0.0
32 , 1 , 5.18 , 5.16 , New , 0.02
64 , 0 , 5.87 , 5.44 , New , 0.43
64 , 1 , 5.97 , 5.44 , New , 0.53
128 , 0 , 7.14 , 6.48 , New , 0.66
128 , 1 , 7.08 , 6.63 , New , 0.45
256 , 0 , 11.68 , 12.57 , Cur , 0.89
256 , 1 , 11.67 , 12.23 , Cur , 0.56
512 , 0 , 15.64 , 15.74 , Cur , 0.1
512 , 1 , 15.52 , 15.69 , Cur , 0.17
1024, 0 , 23.02 , 22.57 , New , 0.45
1024, 1 , 23.0 , 22.8 , New , 0.2
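For reviewers, one change in the patch is the entry check: the old `andl $(2 * VEC_SIZE - 1)` test is replaced by a page-offset test (`sall $20` followed by `cmpl $((PAGE_SIZE - VEC_SIZE) << 20)` and `ja`). A C sketch of why the shifted compare matches the plain "offset within page" test follows. It assumes 4096-byte pages and 32-byte vectors as in the patch; the `SKETCH_*` macros and function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_VEC_SIZE  32u

/* Straightforward form: would a VEC_SIZE-byte load starting at addr run
   past the end of its page? */
int crosses_page_mask(uintptr_t addr)
{
    return (addr & (SKETCH_PAGE_SIZE - 1)) > SKETCH_PAGE_SIZE - SKETCH_VEC_SIZE;
}

/* Shifted form: moving the low 32 bits left by 20 discards everything
   except the 12 page-offset bits (now in bits 20..31), so one unsigned
   compare against the shifted threshold is enough. */
int crosses_page_shift(uintptr_t addr)
{
    uint32_t shifted = (uint32_t) addr << 20;
    return shifted > ((SKETCH_PAGE_SIZE - SKETCH_VEC_SIZE) << 20);
}

/* Check that both forms agree for every offset across two pages. */
void check_page_cross_forms(void)
{
    for (uintptr_t addr = 0; addr < 2 * SKETCH_PAGE_SIZE; addr++)
        assert(crosses_page_mask(addr) == crosses_page_shift(addr));
}
```

With these constants a load at page offset 4064 still fits in the page, while offset 4065 takes the cross-page path.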
sysdeps/x86_64/multiarch/strlen-evex.S | 569 +++++++++++++------------
1 file changed, 307 insertions(+), 262 deletions(-)
diff --git a/sysdeps/x86_64/multiarch/strlen-evex.S b/sysdeps/x86_64/multiarch/strlen-evex.S
index 0583819078..d1aafac76f 100644
--- a/sysdeps/x86_64/multiarch/strlen-evex.S
+++ b/sysdeps/x86_64/multiarch/strlen-evex.S
@@ -29,11 +29,13 @@
# ifdef USE_AS_WCSLEN
# define VPCMP vpcmpd
# define VPMINU vpminud
-# define SHIFT_REG r9d
+# define SHIFT_REG ecx
+# define CHAR_SIZE 4
# else
# define VPCMP vpcmpb
# define VPMINU vpminub
-# define SHIFT_REG ecx
+# define SHIFT_REG edx
+# define CHAR_SIZE 1
# endif
# define XMMZERO xmm16
@@ -46,132 +48,169 @@
# define YMM6 ymm22
# define VEC_SIZE 32
+# define PAGE_SIZE 4096
+# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE)
.section .text.evex,"ax",@progbits
ENTRY (STRLEN)
# ifdef USE_AS_STRNLEN
- /* Check for zero length. */
+ /* Check zero length. */
test %RSI_LP, %RSI_LP
jz L(zero)
-# ifdef USE_AS_WCSLEN
- shl $2, %RSI_LP
-# elif defined __ILP32__
+# if !defined USE_AS_WCSLEN && defined __ILP32__
/* Clear the upper 32 bits. */
movl %esi, %esi
# endif
mov %RSI_LP, %R8_LP
# endif
- movl %edi, %ecx
- movq %rdi, %rdx
+ movl %edi, %eax
+ sall $20, %eax
vpxorq %XMMZERO, %XMMZERO, %XMMZERO
-
/* Check if we may cross page boundary with one vector load. */
- andl $(2 * VEC_SIZE - 1), %ecx
- cmpl $VEC_SIZE, %ecx
- ja L(cros_page_boundary)
+ cmpl $((PAGE_SIZE - VEC_SIZE) << 20), %eax
+ ja L(cross_page_boundary)
- /* Check the first VEC_SIZE bytes. Each bit in K0 represents a
- null byte. */
+ /* Check the first VEC_SIZE bytes. */
VPCMP $0, (%rdi), %YMMZERO, %k0
kmovd %k0, %eax
- testl %eax, %eax
-
# ifdef USE_AS_STRNLEN
- jnz L(first_vec_x0_check)
- /* Adjust length and check the end of data. */
- subq $VEC_SIZE, %rsi
- jbe L(max)
-# else
- jnz L(first_vec_x0)
+ /* If length < VEC_SIZE handle special. */
+ cmpq $CHAR_PER_VEC, %rsi
+ jbe L(first_vec_x0)
# endif
-
- /* Align data for aligned loads in the loop. */
- addq $VEC_SIZE, %rdi
- andl $(VEC_SIZE - 1), %ecx
- andq $-VEC_SIZE, %rdi
-
+ testl %eax, %eax
+ jz L(aligned_more)
+ tzcntl %eax, %eax
+ ret
# ifdef USE_AS_STRNLEN
- /* Adjust length. */
- addq %rcx, %rsi
+L(zero):
+ xorl %eax, %eax
+ ret
- subq $(VEC_SIZE * 4), %rsi
- jbe L(last_4x_vec_or_less)
+ .p2align 4
+L(first_vec_x0):
+ /* Select min of length and position of first null. */
+ btsq %rsi, %rax
+ tzcntl %eax, %eax
+ ret
# endif
- jmp L(more_4x_vec)
.p2align 4
-L(cros_page_boundary):
- andl $(VEC_SIZE - 1), %ecx
- andq $-VEC_SIZE, %rdi
-
-# ifdef USE_AS_WCSLEN
- /* NB: Divide shift count by 4 since each bit in K0 represent 4
- bytes. */
- movl %ecx, %SHIFT_REG
- sarl $2, %SHIFT_REG
+L(first_vec_x1):
+ tzcntl %eax, %eax
+ /* Safe to use 32 bit instructions as these are only called for
+ size = [1, 159]. */
+# ifdef USE_AS_STRNLEN
+ /* Use ecx which was computed earlier to compute correct value.
+ */
+# ifdef USE_AS_WCSLEN
+ sarl $2, %ecx
+# endif
+ leal -(CHAR_PER_VEC * 4 + 1)(%rcx, %rax), %eax
+# else
+ subl %edx, %edi
+# ifdef USE_AS_WCSLEN
+ sarl $2, %edi
+# endif
+ leal CHAR_PER_VEC(%rdi, %rax), %eax
# endif
- VPCMP $0, (%rdi), %YMMZERO, %k0
- kmovd %k0, %eax
+ ret
- /* Remove the leading bytes. */
- sarxl %SHIFT_REG, %eax, %eax
- testl %eax, %eax
- jz L(aligned_more)
+ .p2align 4
+L(first_vec_x2):
tzcntl %eax, %eax
-# ifdef USE_AS_WCSLEN
- /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
- sall $2, %eax
-# endif
+ /* Safe to use 32 bit instructions as these are only called for
+ size = [1, 159]. */
# ifdef USE_AS_STRNLEN
- /* Check the end of data. */
- cmpq %rax, %rsi
- jbe L(max)
-# endif
- addq %rdi, %rax
- addq %rcx, %rax
- subq %rdx, %rax
-# ifdef USE_AS_WCSLEN
- shrq $2, %rax
+ /* Use ecx which was computed earlier to compute correct value.
+ */
+# ifdef USE_AS_WCSLEN
+ sarl $2, %ecx
+# endif
+ leal -(CHAR_PER_VEC * 3 + 1)(%rcx, %rax), %eax
+# else
+ subl %edx, %edi
+# ifdef USE_AS_WCSLEN
+ sarl $2, %edi
+# endif
+ leal (CHAR_PER_VEC * 2)(%rdi, %rax), %eax
# endif
ret
.p2align 4
-L(aligned_more):
+L(first_vec_x3):
+ tzcntl %eax, %eax
+ /* Safe to use 32 bit instructions as these are only called for
+ size = [1, 159]. */
# ifdef USE_AS_STRNLEN
- /* "rcx" is less than VEC_SIZE. Calculate "rdx + rcx - VEC_SIZE"
- with "rdx - (VEC_SIZE - rcx)" instead of "(rdx + rcx) - VEC_SIZE"
- to void possible addition overflow. */
- negq %rcx
- addq $VEC_SIZE, %rcx
-
- /* Check the end of data. */
- subq %rcx, %rsi
- jbe L(max)
+ /* Use ecx which was computed earlier to compute correct value.
+ */
+# ifdef USE_AS_WCSLEN
+ sarl $2, %ecx
+# endif
+ leal -(CHAR_PER_VEC * 2 + 1)(%rcx, %rax), %eax
+# else
+ subl %edx, %edi
+# ifdef USE_AS_WCSLEN
+ sarl $2, %edi
+# endif
+ leal (CHAR_PER_VEC * 3)(%rdi, %rax), %eax
# endif
+ ret
- addq $VEC_SIZE, %rdi
-
+ .p2align 4
+L(first_vec_x4):
+ tzcntl %eax, %eax
+ /* Safe to use 32 bit instructions as these are only called for
+ size = [1, 159]. */
# ifdef USE_AS_STRNLEN
- subq $(VEC_SIZE * 4), %rsi
- jbe L(last_4x_vec_or_less)
+ /* Use ecx which was computed earlier to compute correct value.
+ */
+# ifdef USE_AS_WCSLEN
+ sarl $2, %ecx
+# endif
+ leal -(CHAR_PER_VEC + 1)(%rcx, %rax), %eax
+# else
+ subl %edx, %edi
+# ifdef USE_AS_WCSLEN
+ sarl $2, %edi
+# endif
+ leal (CHAR_PER_VEC * 4)(%rdi, %rax), %eax
# endif
+ ret
-L(more_4x_vec):
+ /* strnlen jumps here. strlen falls through. */
+ .p2align 5
+L(aligned_more):
+ movq %rdi, %rdx
+ /* Align data to VEC_SIZE. */
+ andq $-(VEC_SIZE), %rdi
+L(cross_page_continue):
/* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time
since data is only aligned to VEC_SIZE. */
- VPCMP $0, (%rdi), %YMMZERO, %k0
- kmovd %k0, %eax
- testl %eax, %eax
- jnz L(first_vec_x0)
-
+# ifdef USE_AS_STRNLEN
+# ifdef USE_AS_WCSLEN
+ salq $2, %rsi
+# endif
+ /* + CHAR_SIZE because it simplifies the logic in
+ last_4x_vec_or_less. */
+ leaq (VEC_SIZE * 5 + CHAR_SIZE)(%rdi), %rcx
+ subq %rdx, %rcx
+# endif
+ /* Load first VEC regardless. */
VPCMP $0, VEC_SIZE(%rdi), %YMMZERO, %k0
+# ifdef USE_AS_STRNLEN
+ /* Adjust length. If near end handle specially. */
+ subq %rcx, %rsi
+ jb L(last_4x_vec_or_less)
+# endif
kmovd %k0, %eax
testl %eax, %eax
jnz L(first_vec_x1)
VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0
kmovd %k0, %eax
- testl %eax, %eax
+ test %eax, %eax
jnz L(first_vec_x2)
VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0
@@ -179,258 +218,264 @@ L(more_4x_vec):
testl %eax, %eax
jnz L(first_vec_x3)
- addq $(VEC_SIZE * 4), %rdi
-
-# ifdef USE_AS_STRNLEN
- subq $(VEC_SIZE * 4), %rsi
- jbe L(last_4x_vec_or_less)
-# endif
-
- /* Align data to 4 * VEC_SIZE. */
- movq %rdi, %rcx
- andl $(4 * VEC_SIZE - 1), %ecx
- andq $-(4 * VEC_SIZE), %rdi
+ VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMZERO, %k0
+ kmovd %k0, %eax
+ testl %eax, %eax
+ jnz L(first_vec_x4)
+ addq $VEC_SIZE, %rdi
# ifdef USE_AS_STRNLEN
- /* Adjust length. */
+ /* Check if at last VEC_SIZE * 4 length. */
+ cmpq $(VEC_SIZE * 4 - 1), %rsi
+ jbe L(last_4x_vec_or_less_load)
+ movl %edi, %ecx
+ andl $(VEC_SIZE * 4 - 1), %ecx
+ /* Readjust length. */
addq %rcx, %rsi
# endif
+ /* Align data to VEC_SIZE * 4. */
+ andq $-(VEC_SIZE * 4), %rdi
+ /* Compare 4 * VEC at a time forward. */
.p2align 4
L(loop_4x_vec):
- /* Compare 4 * VEC at a time forward. */
- VMOVA (%rdi), %YMM1
- VMOVA VEC_SIZE(%rdi), %YMM2
- VMOVA (VEC_SIZE * 2)(%rdi), %YMM3
- VMOVA (VEC_SIZE * 3)(%rdi), %YMM4
-
- VPMINU %YMM1, %YMM2, %YMM5
- VPMINU %YMM3, %YMM4, %YMM6
+ /* Load first VEC regardless. */
+ VMOVA (VEC_SIZE * 4)(%rdi), %YMM1
+# ifdef USE_AS_STRNLEN
+ /* Break if at end of length. */
+ subq $(VEC_SIZE * 4), %rsi
+ jb L(last_4x_vec_or_less_cmpeq)
+# endif
+ VPMINU (VEC_SIZE * 5)(%rdi), %YMM1, %YMM2
+ VMOVA (VEC_SIZE * 6)(%rdi), %YMM3
+ VPMINU (VEC_SIZE * 7)(%rdi), %YMM3, %YMM4
+ VPCMP $0, %YMM2, %YMMZERO, %k0
+ VPCMP $0, %YMM4, %YMMZERO, %k1
+ subq $-(VEC_SIZE * 4), %rdi
+ kortestd %k0, %k1
+ jz L(loop_4x_vec)
+
+ /* Check if end was in first half. */
+ kmovd %k0, %eax
+ subq %rdx, %rdi
+# ifdef USE_AS_WCSLEN
+ shrq $2, %rdi
+# endif
+ testl %eax, %eax
+ jz L(second_vec_return)
- VPMINU %YMM5, %YMM6, %YMM5
- VPCMP $0, %YMM5, %YMMZERO, %k0
- ktestd %k0, %k0
- jnz L(4x_vec_end)
+ VPCMP $0, %YMM1, %YMMZERO, %k2
+ kmovd %k2, %edx
+ /* Combine YMM1 matches (k2) with YMM2 matches (k0). */
+# ifdef USE_AS_WCSLEN
+ sall $CHAR_PER_VEC, %eax
+ orl %edx, %eax
+ tzcntl %eax, %eax
+# else
+ salq $CHAR_PER_VEC, %rax
+ orq %rdx, %rax
+ tzcntq %rax, %rax
+# endif
+ addq %rdi, %rax
+ ret
- addq $(VEC_SIZE * 4), %rdi
-# ifndef USE_AS_STRNLEN
- jmp L(loop_4x_vec)
-# else
- subq $(VEC_SIZE * 4), %rsi
- ja L(loop_4x_vec)
+# ifdef USE_AS_STRNLEN
+L(last_4x_vec_or_less_load):
+ /* Depending on entry adjust rdi / prepare first VEC in YMM1. */
+ VMOVA (VEC_SIZE * 4)(%rdi), %YMM1
+L(last_4x_vec_or_less_cmpeq):
+ VPCMP $0, %YMM1, %YMMZERO, %k0
+ addq $(VEC_SIZE * 3), %rdi
L(last_4x_vec_or_less):
- /* Less than 4 * VEC and aligned to VEC_SIZE. */
- addl $(VEC_SIZE * 2), %esi
- jle L(last_2x_vec)
-
- VPCMP $0, (%rdi), %YMMZERO, %k0
kmovd %k0, %eax
- testl %eax, %eax
- jnz L(first_vec_x0)
+ /* If remaining length > VEC_SIZE * 2. */
+ testl $(VEC_SIZE * 2), %esi
+ jnz L(last_4x_vec)
- VPCMP $0, VEC_SIZE(%rdi), %YMMZERO, %k0
- kmovd %k0, %eax
+ /* length may have been negative or positive depending on where
+ this was called from. This fixes that. */
+ andl $(VEC_SIZE * 4 - 1), %esi
testl %eax, %eax
- jnz L(first_vec_x1)
+ jnz L(last_vec_x1_check)
- VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0
- kmovd %k0, %eax
- testl %eax, %eax
- jnz L(first_vec_x2_check)
subl $VEC_SIZE, %esi
- jle L(max)
+ jb L(max)
- VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0
+ VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0
kmovd %k0, %eax
- testl %eax, %eax
- jnz L(first_vec_x3_check)
- movq %r8, %rax
+ tzcntl %eax, %eax
# ifdef USE_AS_WCSLEN
- shrq $2, %rax
+ sarl $2, %esi
# endif
- ret
-
- .p2align 4
-L(last_2x_vec):
- addl $(VEC_SIZE * 2), %esi
-
- VPCMP $0, (%rdi), %YMMZERO, %k0
- kmovd %k0, %eax
- testl %eax, %eax
- jnz L(first_vec_x0_check)
- subl $VEC_SIZE, %esi
- jle L(max)
+ /* Check the end of data. */
+ cmpl %eax, %esi
+ jb L(max)
- VPCMP $0, VEC_SIZE(%rdi), %YMMZERO, %k0
- kmovd %k0, %eax
- testl %eax, %eax
- jnz L(first_vec_x1_check)
- movq %r8, %rax
+ subq %rdx, %rdi
# ifdef USE_AS_WCSLEN
- shrq $2, %rax
+ sarq $2, %rdi
# endif
+ leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax
+ ret
+L(max):
+ movq %r8, %rax
ret
+# endif
.p2align 4
-L(first_vec_x0_check):
+L(second_vec_return):
+ VPCMP $0, %YMM3, %YMMZERO, %k0
+ /* Combine YMM3 matches (k0) with YMM4 matches (k1). */
+# ifdef USE_AS_WCSLEN
+ kunpckbw %k0, %k1, %k0
+ kmovd %k0, %eax
+ tzcntl %eax, %eax
+# else
+ kunpckdq %k0, %k1, %k0
+ kmovq %k0, %rax
+ tzcntq %rax, %rax
+# endif
+ leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax
+ ret
+
+
+# ifdef USE_AS_STRNLEN
+L(last_vec_x1_check):
tzcntl %eax, %eax
# ifdef USE_AS_WCSLEN
- /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
- sall $2, %eax
+ sarl $2, %esi
# endif
/* Check the end of data. */
- cmpq %rax, %rsi
- jbe L(max)
- addq %rdi, %rax
- subq %rdx, %rax
+ cmpl %eax, %esi
+ jb L(max)
+ subq %rdx, %rdi
# ifdef USE_AS_WCSLEN
- shrq $2, %rax
+ sarq $2, %rdi
# endif
+ leaq (CHAR_PER_VEC)(%rdi, %rax), %rax
ret
.p2align 4
-L(first_vec_x1_check):
+L(last_4x_vec):
+ /* Test first 2x VEC normally. */
+ testl %eax, %eax
+ jnz L(last_vec_x1)
+
+ VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0
+ kmovd %k0, %eax
+ testl %eax, %eax
+ jnz L(last_vec_x2)
+
+ /* Normalize length. */
+ andl $(VEC_SIZE * 4 - 1), %esi
+ VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0
+ kmovd %k0, %eax
+ testl %eax, %eax
+ jnz L(last_vec_x3)
+
+ subl $(VEC_SIZE * 3), %esi
+ jb L(max)
+
+ VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMZERO, %k0
+ kmovd %k0, %eax
tzcntl %eax, %eax
# ifdef USE_AS_WCSLEN
- /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
- sall $2, %eax
+ sarl $2, %esi
# endif
/* Check the end of data. */
- cmpq %rax, %rsi
- jbe L(max)
- addq $VEC_SIZE, %rax
- addq %rdi, %rax
- subq %rdx, %rax
+ cmpl %eax, %esi
+ jb L(max_end)
+
+ subq %rdx, %rdi
# ifdef USE_AS_WCSLEN
- shrq $2, %rax
+ sarq $2, %rdi
# endif
+ leaq (CHAR_PER_VEC * 4)(%rdi, %rax), %rax
ret
.p2align 4
-L(first_vec_x2_check):
+L(last_vec_x1):
tzcntl %eax, %eax
+ subq %rdx, %rdi
# ifdef USE_AS_WCSLEN
- /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
- sall $2, %eax
+ sarq $2, %rdi
# endif
- /* Check the end of data. */
- cmpq %rax, %rsi
- jbe L(max)
- addq $(VEC_SIZE * 2), %rax
- addq %rdi, %rax
- subq %rdx, %rax
+ leaq (CHAR_PER_VEC)(%rdi, %rax), %rax
+ ret
+
+ .p2align 4
+L(last_vec_x2):
+ tzcntl %eax, %eax
+ subq %rdx, %rdi
# ifdef USE_AS_WCSLEN
- shrq $2, %rax
+ sarq $2, %rdi
# endif
+ leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax
ret
.p2align 4
-L(first_vec_x3_check):
+L(last_vec_x3):
tzcntl %eax, %eax
# ifdef USE_AS_WCSLEN
- /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
- sall $2, %eax
+ sarl $2, %esi
# endif
+ subl $(CHAR_PER_VEC * 2), %esi
/* Check the end of data. */
- cmpq %rax, %rsi
- jbe L(max)
- addq $(VEC_SIZE * 3), %rax
- addq %rdi, %rax
- subq %rdx, %rax
+ cmpl %eax, %esi
+ jb L(max_end)
+ subq %rdx, %rdi
# ifdef USE_AS_WCSLEN
- shrq $2, %rax
+ sarq $2, %rdi
# endif
+ leaq (CHAR_PER_VEC * 3)(%rdi, %rax), %rax
ret
-
- .p2align 4
-L(max):
+L(max_end):
movq %r8, %rax
-# ifdef USE_AS_WCSLEN
- shrq $2, %rax
-# endif
- ret
-
- .p2align 4
-L(zero):
- xorl %eax, %eax
ret
# endif
+ /* Cold case for crossing page with first load. */
.p2align 4
-L(first_vec_x0):
- tzcntl %eax, %eax
-# ifdef USE_AS_WCSLEN
- /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
- sall $2, %eax
-# endif
- addq %rdi, %rax
- subq %rdx, %rax
+L(cross_page_boundary):
+ movq %rdi, %rdx
+ /* Align data to VEC_SIZE. */
+ andq $-VEC_SIZE, %rdi
+ VPCMP $0, (%rdi), %YMMZERO, %k0
+ kmovd %k0, %eax
+ /* Remove the leading bytes. */
# ifdef USE_AS_WCSLEN
- shrq $2, %rax
+ movl %edx, %ecx
+ shrl $2, %ecx
+ andl $(CHAR_PER_VEC - 1), %ecx
# endif
- ret
-
- .p2align 4
-L(first_vec_x1):
+ /* SHIFT_REG is ecx for USE_AS_WCSLEN and edx otherwise. */
+ sarxl %SHIFT_REG, %eax, %eax
+ testl %eax, %eax
+# ifndef USE_AS_STRNLEN
+ jz L(cross_page_continue)
tzcntl %eax, %eax
-# ifdef USE_AS_WCSLEN
- /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
- sall $2, %eax
-# endif
- addq $VEC_SIZE, %rax
- addq %rdi, %rax
- subq %rdx, %rax
-# ifdef USE_AS_WCSLEN
- shrq $2, %rax
-# endif
ret
-
- .p2align 4
-L(first_vec_x2):
- tzcntl %eax, %eax
-# ifdef USE_AS_WCSLEN
- /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
- sall $2, %eax
-# endif
- addq $(VEC_SIZE * 2), %rax
- addq %rdi, %rax
- subq %rdx, %rax
-# ifdef USE_AS_WCSLEN
- shrq $2, %rax
-# endif
+# else
+ jnz L(cross_page_less_vec)
+# ifndef USE_AS_WCSLEN
+ movl %edx, %ecx
+ andl $(CHAR_PER_VEC - 1), %ecx
+# endif
+ movl $CHAR_PER_VEC, %eax
+ subl %ecx, %eax
+ cmpq %rax, %rsi
+ ja L(cross_page_continue)
+ movl %esi, %eax
ret
-
- .p2align 4
-L(4x_vec_end):
- VPCMP $0, %YMM1, %YMMZERO, %k0
- kmovd %k0, %eax
- testl %eax, %eax
- jnz L(first_vec_x0)
- VPCMP $0, %YMM2, %YMMZERO, %k1
- kmovd %k1, %eax
- testl %eax, %eax
- jnz L(first_vec_x1)
- VPCMP $0, %YMM3, %YMMZERO, %k2
- kmovd %k2, %eax
- testl %eax, %eax
- jnz L(first_vec_x2)
- VPCMP $0, %YMM4, %YMMZERO, %k3
- kmovd %k3, %eax
-L(first_vec_x3):
+L(cross_page_less_vec):
tzcntl %eax, %eax
-# ifdef USE_AS_WCSLEN
- /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
- sall $2, %eax
-# endif
- addq $(VEC_SIZE * 3), %rax
- addq %rdi, %rax
- subq %rdx, %rax
-# ifdef USE_AS_WCSLEN
- shrq $2, %rax
-# endif
+ /* Select min of length and position of first null. */
+ cmpq %rax, %rsi
+ cmovb %esi, %eax
ret
+# endif
END (STRLEN)
#endif
--
2.29.2
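A note on the strnlen `L(first_vec_x0)` path in the patch above: as I read it, `btsq %rsi, %rax` followed by `tzcntl %eax, %eax` selects the minimum of the length and the position of the first null without a branch. Below is a C sketch of that idea for the byte case (32 characters per vector). The helper names are hypothetical, and the portable loop stands in for the `tzcnt` instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Count trailing zeros of a 32-bit value. Returns 32 for zero, which is
   what the x86 tzcnt instruction does for a 32-bit operand. */
unsigned tzcnt32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* null_mask: bit i is set when character i of the first vector is null.
   len: maximum length in characters, with len <= 32. */
unsigned min_len_or_first_null(uint32_t null_mask, unsigned len)
{
    assert(len <= 32);
    /* Like btsq: set bit `len` in the 64-bit copy of the mask. */
    uint64_t m = (uint64_t) null_mask | ((uint64_t) 1 << len);
    /* Like tzcntl: look only at the low 32 bits. If len == 32 the extra
       bit falls outside them and an empty mask yields 32. */
    return tzcnt32((uint32_t) m);
}
```

If the first null sits before `len`, its bit is the lowest one set and wins; otherwise the planted bit at `len` wins, so the result never exceeds the length.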