From: "H.J. Lu" <hjl.tools@gmail.com>
To: Noah Goldstein <goldstein.w.n@gmail.com>
Cc: libc-alpha@sourceware.org, carlos@systemhalted.org
Subject: Re: [PATCH v1 1/2] x86: Optimize strlen-evex.S
Date: Sat, 17 Apr 2021 11:56:53 -0700
Message-ID: <CAMe9rOo6N2=UZ4QLDfp4JzbdprmzzNZaYKtAhMWbSm2s4yZTAA@mail.gmail.com>
In-Reply-To: <20210417025215.874105-1-goldstein.w.n@gmail.com>
On Fri, Apr 16, 2021 at 7:53 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> No bug. This commit optimizes strlen-evex.S. The
> optimizations are mostly small things but they add up to roughly
> 10-30% performance improvement for strlen. The results for strnlen are
> a bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and
> test-wcsnlen are all passing.
>
> Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> ---
> Tests were run on the following CPUs:
>
> Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html
>
> Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html
>
> Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html
>
> All times are the geometric mean of N=20. The unit of time is
> seconds.
>
> "Cur" refers to the current implementation
> "New" refers to this patch's implementation
>
>
> The strlen numbers are universal improvements:
>
> Results For Skylake strlen-avx2
> size, algn, Cur T , New T , Win , Dif
> 1 , 0 , 4.76 , 4.27 , New , 0.49
> 2 , 0 , 4.77 , 4.165 , New , 0.6
> 3 , 0 , 4.617 , 4.095 , New , 0.52
> 4 , 0 , 4.579 , 4.006 , New , 0.57
> 5 , 0 , 4.608 , 4.008 , New , 0.6
> 6 , 0 , 4.655 , 4.086 , New , 0.57
> 7 , 0 , 4.661 , 4.071 , New , 0.59
> 8 , 0 , 4.625 , 4.092 , New , 0.53
> 16 , 0 , 4.608 , 4.021 , New , 0.59
> 10 , 0 , 4.645 , 4.111 , New , 0.53
> 32 , 0 , 5.532 , 4.817 , New , 0.71
> 21 , 0 , 4.636 , 3.775 , New , 0.86
> 64 , 0 , 5.991 , 5.352 , New , 0.64
> 42 , 0 , 5.529 , 4.789 , New , 0.74
> 128 , 0 , 7.042 , 5.473 , New , 1.57
> 85 , 0 , 6.118 , 5.466 , New , 0.65
> 256 , 0 , 10.64 , 7.954 , New , 2.69
> 170 , 0 , 9.918 , 9.585 , New , 0.33
> 512 , 0 , 12.916, 10.242, New , 2.67
> 341 , 0 , 10.764, 10.216, New , 0.55
> 1024, 0 , 18.163, 14.844, New , 3.32
> 682 , 0 , 15.292, 13.382, New , 1.91
> 2048, 0 , 38.732, 24.396, New , 14.34
> 1365, 0 , 22.299, 20.08 , New , 2.22
> 4096, 0 , 79.054, 68.682, New , 10.37
> 2730, 0 , 61.47 , 40.705, New , 20.77
>
> Results For Icelake strlen-avx2
> size, algn, Cur T , New T , Win , Dif
> 1 , 0 , 2.681 , 1.99 , New , 0.69
> 2 , 0 , 2.823 , 2.232 , New , 0.59
> 3 , 0 , 2.57 , 2.077 , New , 0.49
> 4 , 0 , 2.659 , 2.128 , New , 0.53
> 5 , 0 , 2.666 , 2.109 , New , 0.56
> 6 , 0 , 2.596 , 2.053 , New , 0.54
> 7 , 0 , 2.623 , 2.152 , New , 0.47
> 8 , 0 , 2.675 , 2.178 , New , 0.5
> 16 , 0 , 2.675 , 2.202 , New , 0.47
> 10 , 0 , 2.672 , 2.2 , New , 0.47
> 32 , 0 , 3.383 , 2.868 , New , 0.52
> 21 , 0 , 2.693 , 2.032 , New , 0.66
> 64 , 0 , 3.404 , 3.056 , New , 0.35
> 42 , 0 , 3.511 , 2.967 , New , 0.54
> 128 , 0 , 4.191 , 3.627 , New , 0.56
> 85 , 0 , 3.559 , 2.922 , New , 0.64
> 256 , 0 , 6.782 , 5.493 , New , 1.29
> 170 , 0 , 6.24 , 4.988 , New , 1.25
> 512 , 0 , 9.305 , 7.308 , New , 2.0
> 341 , 0 , 7.626 , 6.272 , New , 1.35
> 1024, 0 , 14.455, 11.544, New , 2.91
> 682 , 0 , 10.728, 8.738 , New , 1.99
> 2048, 0 , 24.171, 24.101, New , 0.07
> 1365, 0 , 17.474, 14.387, New , 3.09
> 4096, 0 , 57.659, 51.675, New , 5.98
> 2730, 0 , 44.702, 28.04 , New , 16.66
>
> Results For Tigerlake strlen-avx2
> size, algn, Cur T , New T , Win , Dif
> 1 , 0 , 4.369 , 3.008 , New , 1.36
> 2 , 0 , 4.054 , 3.231 , New , 0.82
> 3 , 0 , 4.081 , 3.243 , New , 0.84
> 4 , 0 , 3.904 , 3.17 , New , 0.73
> 5 , 0 , 3.915 , 3.178 , New , 0.74
> 6 , 0 , 3.924 , 3.184 , New , 0.74
> 7 , 0 , 3.917 , 3.177 , New , 0.74
> 8 , 0 , 3.889 , 3.209 , New , 0.68
> 16 , 0 , 3.878 , 3.03 , New , 0.85
> 10 , 0 , 3.892 , 3.004 , New , 0.89
> 32 , 0 , 4.957 , 4.162 , New , 0.79
> 21 , 0 , 3.866 , 3.18 , New , 0.69
> 64 , 0 , 5.035 , 4.521 , New , 0.51
> 42 , 0 , 5.039 , 4.276 , New , 0.76
> 128 , 0 , 6.117 , 5.253 , New , 0.86
> 85 , 0 , 4.932 , 4.421 , New , 0.51
> 256 , 0 , 10.019, 8.221 , New , 1.8
> 170 , 0 , 8.954 , 7.404 , New , 1.55
> 512 , 0 , 14.071, 10.948, New , 3.12
> 341 , 0 , 11.177, 9.246 , New , 1.93
> 1024, 0 , 21.808, 17.034, New , 4.77
> 682 , 0 , 16.07 , 12.941, New , 3.13
> 2048, 0 , 37.332, 29.853, New , 7.48
> 1365, 0 , 26.394, 21.516, New , 4.88
> 4096, 0 , 87.951, 80.35 , New , 7.6
> 2730, 0 , 62.768, 44.247, New , 18.52
>
> Results For Icelake strlen-evex
> size, algn, Cur T , New T , Win , Dif
> 1 , 0 , 2.681 , 1.99 , New , 0.69
> 2 , 0 , 2.823 , 2.232 , New , 0.59
> 3 , 0 , 2.57 , 2.077 , New , 0.49
> 4 , 0 , 2.659 , 2.128 , New , 0.53
> 5 , 0 , 2.666 , 2.109 , New , 0.56
> 6 , 0 , 2.596 , 2.053 , New , 0.54
> 7 , 0 , 2.623 , 2.152 , New , 0.47
> 8 , 0 , 2.675 , 2.178 , New , 0.5
> 16 , 0 , 2.675 , 2.202 , New , 0.47
> 10 , 0 , 2.672 , 2.2 , New , 0.47
> 32 , 0 , 3.383 , 2.868 , New , 0.52
> 21 , 0 , 2.693 , 2.032 , New , 0.66
> 64 , 0 , 3.404 , 3.056 , New , 0.35
> 42 , 0 , 3.511 , 2.967 , New , 0.54
> 128 , 0 , 4.191 , 3.627 , New , 0.56
> 85 , 0 , 3.559 , 2.922 , New , 0.64
> 256 , 0 , 6.782 , 5.493 , New , 1.29
> 170 , 0 , 6.24 , 4.988 , New , 1.25
> 512 , 0 , 9.305 , 7.308 , New , 2.0
> 341 , 0 , 7.626 , 6.272 , New , 1.35
> 1024, 0 , 14.455, 11.544, New , 2.91
> 682 , 0 , 10.728, 8.738 , New , 1.99
> 2048, 0 , 24.171, 24.101, New , 0.07
> 1365, 0 , 17.474, 14.387, New , 3.09
> 4096, 0 , 57.659, 51.675, New , 5.98
> 2730, 0 , 44.702, 28.04 , New , 16.66
>
> Results For Tigerlake strlen-evex
> size, algn, Cur T , New T , Win , Dif
> 1 , 0 , 4.369 , 3.008 , New , 1.36
> 2 , 0 , 4.054 , 3.231 , New , 0.82
> 3 , 0 , 4.081 , 3.243 , New , 0.84
> 4 , 0 , 3.904 , 3.17 , New , 0.73
> 5 , 0 , 3.915 , 3.178 , New , 0.74
> 6 , 0 , 3.924 , 3.184 , New , 0.74
> 7 , 0 , 3.917 , 3.177 , New , 0.74
> 8 , 0 , 3.889 , 3.209 , New , 0.68
> 16 , 0 , 3.878 , 3.03 , New , 0.85
> 10 , 0 , 3.892 , 3.004 , New , 0.89
> 32 , 0 , 4.957 , 4.162 , New , 0.79
> 21 , 0 , 3.866 , 3.18 , New , 0.69
> 64 , 0 , 5.035 , 4.521 , New , 0.51
> 42 , 0 , 5.039 , 4.276 , New , 0.76
> 128 , 0 , 6.117 , 5.253 , New , 0.86
> 85 , 0 , 4.932 , 4.421 , New , 0.51
> 256 , 0 , 10.019, 8.221 , New , 1.8
> 170 , 0 , 8.954 , 7.404 , New , 1.55
> 512 , 0 , 14.071, 10.948, New , 3.12
> 341 , 0 , 11.177, 9.246 , New , 1.93
> 1024, 0 , 21.808, 17.034, New , 4.77
> 682 , 0 , 16.07 , 12.941, New , 3.13
> 2048, 0 , 37.332, 29.853, New , 7.48
> 1365, 0 , 26.394, 21.516, New , 4.88
> 4096, 0 , 87.951, 80.35 , New , 7.6
> 2730, 0 , 62.768, 44.247, New , 18.52
>
> The strnlen numbers are a bit more of a mixed bag but I think
> generally positive. It's possible that the current version should be
> kept. Let me know.
>
> Results For Skylake strnlen-avx2
> size, algn, Cur T , New T , Win , Dif
> 1 , 0 , 4.06 , 4.1 , Cur , 0.04
> 2 , 0 , 4.15 , 4.08 , New , 0.07
> 3 , 0 , 4.1 , 4.03 , New , 0.07
> 4 , 0 , 3.95 , 3.91 , New , 0.04
> 5 , 0 , 4.07 , 3.9 , New , 0.17
> 6 , 0 , 4.04 , 3.92 , New , 0.12
> 7 , 0 , 4.03 , 3.89 , New , 0.14
> 1 , 1 , 3.75 , 3.79 , Cur , 0.04
> 2 , 2 , 4.0 , 3.91 , New , 0.09
> 3 , 3 , 4.04 , 3.92 , New , 0.12
> 4 , 4 , 3.95 , 3.86 , New , 0.09
> 5 , 5 , 3.97 , 3.91 , New , 0.06
> 6 , 6 , 3.96 , 3.92 , New , 0.04
> 7 , 7 , 3.97 , 3.91 , New , 0.06
> 4 , 1 , 3.76 , 3.83 , Cur , 0.07
> 8 , 0 , 3.73 , 4.01 , Cur , 0.28
> 8 , 1 , 3.73 , 3.88 , Cur , 0.15
> 16 , 0 , 3.68 , 3.84 , Cur , 0.16
> 16 , 1 , 3.75 , 3.92 , Cur , 0.17
> 32 , 0 , 5.93 , 5.95 , Cur , 0.02
> 32 , 1 , 5.95 , 5.98 , Cur , 0.03
> 64 , 0 , 6.46 , 5.31 , New , 1.15
> 64 , 1 , 6.66 , 5.43 , New , 1.23
> 128 , 0 , 7.42 , 6.02 , New , 1.4
> 128 , 1 , 7.57 , 5.81 , New , 1.76
> 256 , 0 , 12.02 , 9.89 , New , 2.13
> 256 , 1 , 11.91 , 9.84 , New , 2.07
> 512 , 0 , 15.06 , 11.77 , New , 3.29
> 512 , 1 , 14.79 , 11.75 , New , 3.04
> 1024, 0 , 23.61 , 16.98 , New , 6.63
> 1024, 1 , 23.63 , 16.91 , New , 6.72
>
> Results For Icelake strnlen-avx2
> size, algn, Cur T , New T , Win , Dif
> 1 , 0 , 2.81 , 2.51 , New , 0.3
> 2 , 0 , 2.8 , 2.53 , New , 0.27
> 3 , 0 , 2.7 , 2.57 , New , 0.13
> 4 , 0 , 2.68 , 2.55 , New , 0.13
> 5 , 0 , 2.7 , 2.57 , New , 0.13
> 6 , 0 , 2.73 , 2.6 , New , 0.13
> 7 , 0 , 2.69 , 2.61 , New , 0.08
> 1 , 1 , 2.53 , 2.5 , New , 0.03
> 2 , 2 , 2.67 , 2.6 , New , 0.07
> 3 , 3 , 2.67 , 2.59 , New , 0.08
> 4 , 4 , 2.66 , 2.57 , New , 0.09
> 5 , 5 , 2.65 , 2.56 , New , 0.09
> 6 , 6 , 2.67 , 2.59 , New , 0.08
> 7 , 7 , 2.65 , 2.62 , New , 0.03
> 4 , 1 , 2.65 , 2.41 , New , 0.24
> 8 , 0 , 2.68 , 2.56 , New , 0.12
> 8 , 1 , 2.62 , 2.55 , New , 0.07
> 16 , 0 , 2.66 , 2.56 , New , 0.1
> 16 , 1 , 2.63 , 2.55 , New , 0.08
> 32 , 0 , 3.62 , 3.19 , New , 0.43
> 32 , 1 , 3.74 , 3.45 , New , 0.29
> 64 , 0 , 3.9 , 3.7 , New , 0.2
> 64 , 1 , 4.13 , 3.68 , New , 0.45
> 128 , 0 , 4.34 , 4.17 , New , 0.17
> 128 , 1 , 4.59 , 4.07 , New , 0.52
> 256 , 0 , 6.74 , 6.56 , New , 0.18
> 256 , 1 , 7.34 , 7.13 , New , 0.21
> 512 , 0 , 9.64 , 8.67 , New , 0.97
> 512 , 1 , 9.49 , 8.56 , New , 0.93
> 1024, 0 , 13.57 , 12.35 , New , 1.22
> 1024, 1 , 13.57 , 12.59 , New , 0.98
>
> Results For Tigerlake strnlen-avx2
> size, algn, Cur T , New T , Win , Dif
> 1 , 0 , 4.21 , 3.91 , New , 0.3
> 2 , 0 , 4.1 , 3.79 , New , 0.31
> 3 , 0 , 4.02 , 3.81 , New , 0.21
> 4 , 0 , 4.06 , 3.82 , New , 0.24
> 5 , 0 , 4.1 , 3.81 , New , 0.29
> 6 , 0 , 4.08 , 3.82 , New , 0.26
> 7 , 0 , 4.07 , 3.87 , New , 0.2
> 1 , 1 , 3.95 , 3.8 , New , 0.15
> 2 , 2 , 4.11 , 3.88 , New , 0.23
> 3 , 3 , 4.08 , 3.88 , New , 0.2
> 4 , 4 , 4.05 , 3.94 , New , 0.11
> 5 , 5 , 4.02 , 3.89 , New , 0.13
> 6 , 6 , 4.02 , 3.89 , New , 0.13
> 7 , 7 , 4.08 , 3.84 , New , 0.24
> 4 , 1 , 4.07 , 3.7 , New , 0.37
> 8 , 0 , 4.08 , 3.95 , New , 0.13
> 8 , 1 , 4.01 , 4.02 , Cur , 0.01
> 16 , 0 , 4.03 , 4.03 , Eq , 0.0
> 16 , 1 , 4.05 , 4.0 , New , 0.05
> 32 , 0 , 5.86 , 5.23 , New , 0.63
> 32 , 1 , 5.88 , 5.36 , New , 0.52
> 64 , 0 , 6.38 , 5.73 , New , 0.65
> 64 , 1 , 6.49 , 5.56 , New , 0.93
> 128 , 0 , 7.17 , 6.39 , New , 0.78
> 128 , 1 , 7.1 , 6.41 , New , 0.69
> 256 , 0 , 11.65 , 11.0 , New , 0.65
> 256 , 1 , 11.37 , 10.97 , New , 0.4
> 512 , 0 , 14.86 , 13.43 , New , 1.43
> 512 , 1 , 14.63 , 13.35 , New , 1.28
> 1024, 0 , 20.92 , 19.33 , New , 1.59
> 1024, 1 , 20.85 , 19.38 , New , 1.47
>
> Results For Icelake strnlen-evex
> size, algn, Cur T , New T , Win , Dif
> 1 , 0 , 2.9 , 2.66 , New , 0.24
> 2 , 0 , 2.99 , 2.72 , New , 0.27
> 3 , 0 , 2.93 , 2.64 , New , 0.29
> 4 , 0 , 2.83 , 2.55 , New , 0.28
> 5 , 0 , 2.92 , 2.64 , New , 0.28
> 6 , 0 , 2.95 , 2.64 , New , 0.31
> 7 , 0 , 2.91 , 2.65 , New , 0.26
> 1 , 1 , 2.63 , 2.49 , New , 0.14
> 2 , 2 , 2.89 , 2.6 , New , 0.29
> 3 , 3 , 2.89 , 2.59 , New , 0.3
> 4 , 4 , 2.9 , 2.58 , New , 0.32
> 5 , 5 , 2.87 , 2.57 , New , 0.3
> 6 , 6 , 2.9 , 2.57 , New , 0.33
> 7 , 7 , 2.88 , 2.64 , New , 0.24
> 4 , 1 , 2.65 , 2.39 , New , 0.26
> 8 , 0 , 2.85 , 2.57 , New , 0.28
> 8 , 1 , 2.62 , 2.4 , New , 0.22
> 16 , 0 , 2.83 , 2.56 , New , 0.27
> 16 , 1 , 2.63 , 2.39 , New , 0.24
> 32 , 0 , 3.95 , 3.06 , New , 0.89
> 32 , 1 , 3.95 , 3.15 , New , 0.8
> 64 , 0 , 3.98 , 3.6 , New , 0.38
> 64 , 1 , 3.88 , 3.48 , New , 0.4
> 128 , 0 , 4.45 , 4.19 , New , 0.26
> 128 , 1 , 4.57 , 4.21 , New , 0.36
> 256 , 0 , 6.75 , 6.97 , Cur , 0.22
> 256 , 1 , 7.55 , 7.76 , Cur , 0.21
> 512 , 0 , 9.75 , 10.09 , Cur , 0.34
> 512 , 1 , 9.84 , 10.13 , Cur , 0.29
> 1024, 0 , 14.45 , 14.4 , New , 0.05
> 1024, 1 , 14.39 , 14.26 , New , 0.13
>
> Results For Tigerlake strnlen-evex
> size, algn, Cur T , New T , Win , Dif
> 1 , 0 , 3.86 , 3.59 , New , 0.27
> 2 , 0 , 3.78 , 3.41 , New , 0.37
> 3 , 0 , 3.69 , 3.4 , New , 0.29
> 4 , 0 , 3.62 , 3.33 , New , 0.29
> 5 , 0 , 3.76 , 3.37 , New , 0.39
> 6 , 0 , 3.73 , 3.39 , New , 0.34
> 7 , 0 , 3.7 , 3.4 , New , 0.3
> 1 , 1 , 3.58 , 3.35 , New , 0.23
> 2 , 2 , 3.75 , 3.34 , New , 0.41
> 3 , 3 , 3.72 , 3.39 , New , 0.33
> 4 , 4 , 3.69 , 3.38 , New , 0.31
> 5 , 5 , 3.69 , 3.37 , New , 0.32
> 6 , 6 , 3.68 , 3.37 , New , 0.31
> 7 , 7 , 3.74 , 3.35 , New , 0.39
> 4 , 1 , 3.39 , 3.27 , New , 0.12
> 8 , 0 , 3.4 , 3.29 , New , 0.11
> 8 , 1 , 3.34 , 3.32 , New , 0.02
> 16 , 0 , 3.36 , 3.34 , New , 0.02
> 16 , 1 , 3.39 , 3.3 , New , 0.09
> 32 , 0 , 5.13 , 5.13 , Eq , 0.0
> 32 , 1 , 5.18 , 5.16 , New , 0.02
> 64 , 0 , 5.87 , 5.44 , New , 0.43
> 64 , 1 , 5.97 , 5.44 , New , 0.53
> 128 , 0 , 7.14 , 6.48 , New , 0.66
> 128 , 1 , 7.08 , 6.63 , New , 0.45
> 256 , 0 , 11.68 , 12.57 , Cur , 0.89
> 256 , 1 , 11.67 , 12.23 , Cur , 0.56
> 512 , 0 , 15.64 , 15.74 , Cur , 0.1
> 512 , 1 , 15.52 , 15.69 , Cur , 0.17
> 1024, 0 , 23.02 , 22.57 , New , 0.45
> 1024, 1 , 23.0 , 22.8 , New , 0.2
>
> sysdeps/x86_64/multiarch/strlen-evex.S | 569 +++++++++++++------------
> 1 file changed, 307 insertions(+), 262 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/strlen-evex.S b/sysdeps/x86_64/multiarch/strlen-evex.S
> index 0583819078..d1aafac76f 100644
> --- a/sysdeps/x86_64/multiarch/strlen-evex.S
> +++ b/sysdeps/x86_64/multiarch/strlen-evex.S
> @@ -29,11 +29,13 @@
> # ifdef USE_AS_WCSLEN
> # define VPCMP vpcmpd
> # define VPMINU vpminud
> -# define SHIFT_REG r9d
> +# define SHIFT_REG ecx
> +# define CHAR_SIZE 4
> # else
> # define VPCMP vpcmpb
> # define VPMINU vpminub
> -# define SHIFT_REG ecx
> +# define SHIFT_REG edx
> +# define CHAR_SIZE 1
> # endif
>
> # define XMMZERO xmm16
> @@ -46,132 +48,169 @@
> # define YMM6 ymm22
>
> # define VEC_SIZE 32
> +# define PAGE_SIZE 4096
> +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE)
>
> .section .text.evex,"ax",@progbits
> ENTRY (STRLEN)
> # ifdef USE_AS_STRNLEN
> - /* Check for zero length. */
> + /* Check zero length. */
> test %RSI_LP, %RSI_LP
> jz L(zero)
> -# ifdef USE_AS_WCSLEN
> - shl $2, %RSI_LP
> -# elif defined __ILP32__
> +# if !defined USE_AS_WCSLEN && defined __ILP32__
> /* Clear the upper 32 bits. */
> movl %esi, %esi
> # endif
> mov %RSI_LP, %R8_LP
> # endif
> - movl %edi, %ecx
> - movq %rdi, %rdx
> + movl %edi, %eax
> + sall $20, %eax
> vpxorq %XMMZERO, %XMMZERO, %XMMZERO
> -
> /* Check if we may cross page boundary with one vector load. */
> - andl $(2 * VEC_SIZE - 1), %ecx
> - cmpl $VEC_SIZE, %ecx
> - ja L(cros_page_boundary)
> + cmpl $((PAGE_SIZE - VEC_SIZE) << 20), %eax
> + ja L(cross_page_boundary)
>
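The new page-cross check above (sall $20 / cmpl / ja) is one of the less obvious changes. A minimal C sketch of how it reads to me, assuming 4 KiB pages and a 32-byte vector as in the patch; the function names are hypothetical and only illustrate the idea:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define VEC_SIZE 32u

/* Straightforward form: isolate the page offset and compare it with the
   last offset from which a VEC_SIZE load still stays inside the page.  */
static bool may_cross_page_mask(uintptr_t addr)
{
    return (addr & (PAGE_SIZE - 1)) > PAGE_SIZE - VEC_SIZE;
}

/* Shift form, modeled on "sall $20, %eax; cmpl $(... << 20), %eax; ja".
   Shifting the low 32 address bits left by 20 discards everything above
   the 12-bit page offset, so no separate mask is needed.  The unsigned
   comparison corresponds to the "ja" branch.  */
static bool may_cross_page_shift(uintptr_t addr)
{
    uint32_t shifted = (uint32_t)addr << 20;
    return shifted > ((PAGE_SIZE - VEC_SIZE) << 20);
}
```

Both forms should agree for every address; the shift form just avoids the separate andl that the old code used.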
> - /* Check the first VEC_SIZE bytes. Each bit in K0 represents a
> - null byte. */
> + /* Check the first VEC_SIZE bytes. */

Here are some quick comments:

1. The change is incorrect for x32:

FAIL: wcsmbs/tst-size_t-wcsnlen
$ cat wcsmbs/tst-size_t-wcsnlen.out
Didn't expect signal from child: got `Segmentation fault'

Your patch removed/changed the special handling for x32.

2. Some of the comment changes removed detailed information, like

- /* Check the first VEC_SIZE bytes. Each bit in K0 represents a
- null byte. */
+ /* Check the first VEC_SIZE bytes. */

Please put those details back.

3. You need to add more comments for changes which aren't 100%
obvious.

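To illustrate comment 1: under x32 a size_t argument occupies only the low 32 bits of its register, and the upper 32 bits may hold garbage. The C sketch below models one plausible way the wcsnlen path could go wrong if the length is compared as a full 64-bit value without being zero-extended first. This is an assumption about the failure, not something verified against the failing test, and the helper names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* VEC_SIZE / CHAR_SIZE for the wchar_t variant: 32 / 4.  */
#define CHAR_PER_VEC 8u

/* Models "cmpq $CHAR_PER_VEC, %rsi" applied to the raw register, with
   whatever happens to be in the upper 32 bits.  */
static bool fits_in_first_vec_raw(uint64_t rsi)
{
    return rsi <= CHAR_PER_VEC;
}

/* Models "movl %esi, %esi" before the compare: only the low 32 bits
   carry the length on x32, so the upper half is cleared first.  */
static bool fits_in_first_vec_x32(uint64_t rsi)
{
    uint64_t maxlen = (uint32_t)rsi;
    return maxlen <= CHAR_PER_VEC;
}
```

With a register value such as 0xdeadbeef00000004, the raw compare treats the length as huge while the zero-extended compare sees 4, so the two take different branches.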
> VPCMP $0, (%rdi), %YMMZERO, %k0
> kmovd %k0, %eax
> - testl %eax, %eax
> -
> # ifdef USE_AS_STRNLEN
> - jnz L(first_vec_x0_check)
> - /* Adjust length and check the end of data. */
> - subq $VEC_SIZE, %rsi
> - jbe L(max)
> -# else
> - jnz L(first_vec_x0)
> + /* If length < VEC_SIZE handle special. */
> + cmpq $CHAR_PER_VEC, %rsi
> + jbe L(first_vec_x0)
> # endif
> -
> - /* Align data for aligned loads in the loop. */
> - addq $VEC_SIZE, %rdi
> - andl $(VEC_SIZE - 1), %ecx
> - andq $-VEC_SIZE, %rdi
> -
> + testl %eax, %eax
> + jz L(aligned_more)
> + tzcntl %eax, %eax
> + ret
> # ifdef USE_AS_STRNLEN
> - /* Adjust length. */
> - addq %rcx, %rsi
> +L(zero):
> + xorl %eax, %eax
> + ret
>
> - subq $(VEC_SIZE * 4), %rsi
> - jbe L(last_4x_vec_or_less)
> + .p2align 4
> +L(first_vec_x0):
> + /* Select min of length and position of first null. */
> + btsq %rsi, %rax
> + tzcntl %eax, %eax
> + ret
> # endif
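The btsq/tzcntl pair in L(first_vec_x0) above looks like a branchless min(length, index of first null). A small C sketch of that reading, assuming the mask has one bit per character and the length is at most 32; __builtin_ctz is a GCC/Clang builtin and the function name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Returns min(maxlen, index of the lowest set bit in null_mask).
   Setting bit `maxlen` acts as a sentinel, so the trailing-zero count
   stops at the length limit when no null is found earlier.  */
static unsigned min_len_or_first_null(uint32_t null_mask, unsigned maxlen)
{
    assert(maxlen <= 32);
    uint64_t with_sentinel = (uint64_t)null_mask | (1ull << maxlen);
    /* tzcntl only looks at the low 32 bits and yields 32 when they are
       all zero; __builtin_ctz is undefined for 0, so handle it here.  */
    uint32_t low = (uint32_t)with_sentinel;
    return low ? (unsigned)__builtin_ctz(low) : 32u;
}
```

For example, a mask with bit 3 set and maxlen 5 gives 3, while a mask with only bit 10 set and maxlen 5 gives 5.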
> - jmp L(more_4x_vec)
>
> .p2align 4
> -L(cros_page_boundary):
> - andl $(VEC_SIZE - 1), %ecx
> - andq $-VEC_SIZE, %rdi
> -
> -# ifdef USE_AS_WCSLEN
> - /* NB: Divide shift count by 4 since each bit in K0 represent 4
> - bytes. */
> - movl %ecx, %SHIFT_REG
> - sarl $2, %SHIFT_REG
> +L(first_vec_x1):
> + tzcntl %eax, %eax
> + /* Safe to use 32 bit instructions as these are only called for
> + size = [1, 159]. */
> +# ifdef USE_AS_STRNLEN
> + /* Use ecx which was computed earlier to compute correct value.
> + */
> +# ifdef USE_AS_WCSLEN
> + sarl $2, %ecx
> +# endif
> + leal -(CHAR_PER_VEC * 4 + 1)(%rcx, %rax), %eax
> +# else
> + subl %edx, %edi
> +# ifdef USE_AS_WCSLEN
> + sarl $2, %edi
> +# endif
> + leal CHAR_PER_VEC(%rdi, %rax), %eax
> # endif
> - VPCMP $0, (%rdi), %YMMZERO, %k0
> - kmovd %k0, %eax
> + ret
>
> - /* Remove the leading bytes. */
> - sarxl %SHIFT_REG, %eax, %eax
> - testl %eax, %eax
> - jz L(aligned_more)
> + .p2align 4
> +L(first_vec_x2):
> tzcntl %eax, %eax
> -# ifdef USE_AS_WCSLEN
> - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
> - sall $2, %eax
> -# endif
> + /* Safe to use 32 bit instructions as these are only called for
> + size = [1, 159]. */
> # ifdef USE_AS_STRNLEN
> - /* Check the end of data. */
> - cmpq %rax, %rsi
> - jbe L(max)
> -# endif
> - addq %rdi, %rax
> - addq %rcx, %rax
> - subq %rdx, %rax
> -# ifdef USE_AS_WCSLEN
> - shrq $2, %rax
> + /* Use ecx which was computed earlier to compute correct value.
> + */
> +# ifdef USE_AS_WCSLEN
> + sarl $2, %ecx
> +# endif
> + leal -(CHAR_PER_VEC * 3 + 1)(%rcx, %rax), %eax
> +# else
> + subl %edx, %edi
> +# ifdef USE_AS_WCSLEN
> + sarl $2, %edi
> +# endif
> + leal (CHAR_PER_VEC * 2)(%rdi, %rax), %eax
> # endif
> ret
>
> .p2align 4
> -L(aligned_more):
> +L(first_vec_x3):
> + tzcntl %eax, %eax
> + /* Safe to use 32 bit instructions as these are only called for
> + size = [1, 159]. */
> # ifdef USE_AS_STRNLEN
> - /* "rcx" is less than VEC_SIZE. Calculate "rdx + rcx - VEC_SIZE"
> - with "rdx - (VEC_SIZE - rcx)" instead of "(rdx + rcx) - VEC_SIZE"
> - to void possible addition overflow. */
> - negq %rcx
> - addq $VEC_SIZE, %rcx
> -
> - /* Check the end of data. */
> - subq %rcx, %rsi
> - jbe L(max)
> + /* Use ecx which was computed earlier to compute correct value.
> + */
> +# ifdef USE_AS_WCSLEN
> + sarl $2, %ecx
> +# endif
> + leal -(CHAR_PER_VEC * 2 + 1)(%rcx, %rax), %eax
> +# else
> + subl %edx, %edi
> +# ifdef USE_AS_WCSLEN
> + sarl $2, %edi
> +# endif
> + leal (CHAR_PER_VEC * 3)(%rdi, %rax), %eax
> # endif
> + ret
>
> - addq $VEC_SIZE, %rdi
> -
> + .p2align 4
> +L(first_vec_x4):
> + tzcntl %eax, %eax
> + /* Safe to use 32 bit instructions as these are only called for
> + size = [1, 159]. */
> # ifdef USE_AS_STRNLEN
> - subq $(VEC_SIZE * 4), %rsi
> - jbe L(last_4x_vec_or_less)
> + /* Use ecx which was computed earlier to compute correct value.
> + */
> +# ifdef USE_AS_WCSLEN
> + sarl $2, %ecx
> +# endif
> + leal -(CHAR_PER_VEC + 1)(%rcx, %rax), %eax
> +# else
> + subl %edx, %edi
> +# ifdef USE_AS_WCSLEN
> + sarl $2, %edi
> +# endif
> + leal (CHAR_PER_VEC * 4)(%rdi, %rax), %eax
> # endif
> + ret
>
> -L(more_4x_vec):
> + /* strnlen jumps here. strlen falls through. */
> + .p2align 5
> +L(aligned_more):
> + movq %rdi, %rdx
> + /* Align data to VEC_SIZE. */
> + andq $-(VEC_SIZE), %rdi
> +L(cross_page_continue):
> /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time
> since data is only aligned to VEC_SIZE. */
> - VPCMP $0, (%rdi), %YMMZERO, %k0
> - kmovd %k0, %eax
> - testl %eax, %eax
> - jnz L(first_vec_x0)
> -
> +# ifdef USE_AS_STRNLEN
> +# ifdef USE_AS_WCSLEN
> + salq $2, %rsi
> +# endif
> + /* + CHAR_SIZE because it simplifies the logic in
> + last_4x_vec_or_less. */
> + leaq (VEC_SIZE * 5 + CHAR_SIZE)(%rdi), %rcx
> + subq %rdx, %rcx
> +# endif
> + /* Load first VEC regardless. */
> VPCMP $0, VEC_SIZE(%rdi), %YMMZERO, %k0
> +# ifdef USE_AS_STRNLEN
> + /* Adjust length. If near end handle specially. */
> + subq %rcx, %rsi
> + jb L(last_4x_vec_or_less)
> +# endif
> kmovd %k0, %eax
> testl %eax, %eax
> jnz L(first_vec_x1)
>
> VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0
> kmovd %k0, %eax
> - testl %eax, %eax
> + test %eax, %eax
> jnz L(first_vec_x2)
>
> VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0
> @@ -179,258 +218,264 @@ L(more_4x_vec):
> testl %eax, %eax
> jnz L(first_vec_x3)
>
> - addq $(VEC_SIZE * 4), %rdi
> -
> -# ifdef USE_AS_STRNLEN
> - subq $(VEC_SIZE * 4), %rsi
> - jbe L(last_4x_vec_or_less)
> -# endif
> -
> - /* Align data to 4 * VEC_SIZE. */
> - movq %rdi, %rcx
> - andl $(4 * VEC_SIZE - 1), %ecx
> - andq $-(4 * VEC_SIZE), %rdi
> + VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMZERO, %k0
> + kmovd %k0, %eax
> + testl %eax, %eax
> + jnz L(first_vec_x4)
>
> + addq $VEC_SIZE, %rdi
> # ifdef USE_AS_STRNLEN
> - /* Adjust length. */
> + /* Check if at last VEC_SIZE * 4 length. */
> + cmpq $(VEC_SIZE * 4 - 1), %rsi
> + jbe L(last_4x_vec_or_less_load)
> + movl %edi, %ecx
> + andl $(VEC_SIZE * 4 - 1), %ecx
> + /* Readjust length. */
> addq %rcx, %rsi
> # endif
> + /* Align data to VEC_SIZE * 4. */
> + andq $-(VEC_SIZE * 4), %rdi
>
> + /* Compare 4 * VEC at a time forward. */
> .p2align 4
> L(loop_4x_vec):
> - /* Compare 4 * VEC at a time forward. */
> - VMOVA (%rdi), %YMM1
> - VMOVA VEC_SIZE(%rdi), %YMM2
> - VMOVA (VEC_SIZE * 2)(%rdi), %YMM3
> - VMOVA (VEC_SIZE * 3)(%rdi), %YMM4
> -
> - VPMINU %YMM1, %YMM2, %YMM5
> - VPMINU %YMM3, %YMM4, %YMM6
> + /* Load first VEC regardless. */
> + VMOVA (VEC_SIZE * 4)(%rdi), %YMM1
> +# ifdef USE_AS_STRNLEN
> + /* Break if at end of length. */
> + subq $(VEC_SIZE * 4), %rsi
> + jb L(last_4x_vec_or_less_cmpeq)
> +# endif
> + VPMINU (VEC_SIZE * 5)(%rdi), %YMM1, %YMM2
> + VMOVA (VEC_SIZE * 6)(%rdi), %YMM3
> + VPMINU (VEC_SIZE * 7)(%rdi), %YMM3, %YMM4
> + VPCMP $0, %YMM2, %YMMZERO, %k0
> + VPCMP $0, %YMM4, %YMMZERO, %k1
> + subq $-(VEC_SIZE * 4), %rdi
> + kortestd %k0, %k1
> + jz L(loop_4x_vec)
> +
> + /* Check if end was in first half. */
> + kmovd %k0, %eax
> + subq %rdx, %rdi
> +# ifdef USE_AS_WCSLEN
> + shrq $2, %rdi
> +# endif
> + testl %eax, %eax
> + jz L(second_vec_return)
>
> - VPMINU %YMM5, %YMM6, %YMM5
> - VPCMP $0, %YMM5, %YMMZERO, %k0
> - ktestd %k0, %k0
> - jnz L(4x_vec_end)
> + VPCMP $0, %YMM1, %YMMZERO, %k2
> + kmovd %k2, %edx
> + /* Combine YMM1 matches (k2) with YMM2 matches (k0). */
> +# ifdef USE_AS_WCSLEN
> + sall $CHAR_PER_VEC, %eax
> + orl %edx, %eax
> + tzcntl %eax, %eax
> +# else
> + salq $CHAR_PER_VEC, %rax
> + orq %rdx, %rax
> + tzcntq %rax, %rax
> +# endif
> + addq %rdi, %rax
> + ret
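The mask-combining step above (shift the k0 mask left by CHAR_PER_VEC, OR in the YMM1 mask, then count trailing zeros) can be modeled in scalar C. This is a sketch for the byte variant only, assuming 32 characters per vector; the loops stand in for VPCMP and VPMINU, __builtin_ctzll is a GCC/Clang builtin, and all names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define VEC_SIZE 32

/* Bit i is set when byte i of v is zero (stand-in for VPCMP $0).  */
static uint32_t zero_mask(const unsigned char *v)
{
    uint32_t mask = 0;
    for (int i = 0; i < VEC_SIZE; i++)
        if (v[i] == 0)
            mask |= (uint32_t)1 << i;
    return mask;
}

/* Bit i is set when min(a[i], b[i]) is zero (stand-in for VPMINU
   followed by VPCMP $0).  */
static uint32_t zero_mask_of_min(const unsigned char *a,
                                 const unsigned char *b)
{
    uint32_t mask = 0;
    for (int i = 0; i < VEC_SIZE; i++) {
        unsigned char m = a[i] < b[i] ? a[i] : b[i];
        if (m == 0)
            mask |= (uint32_t)1 << i;
    }
    return mask;
}

/* Index of the first null across two consecutive vectors a, b.  The min
   mask goes in the high half; bits it inherits from `a` are harmless
   because the same position is already set in the low half.  */
static unsigned first_null_in_pair(const unsigned char *a,
                                   const unsigned char *b)
{
    uint64_t combined = ((uint64_t)zero_mask_of_min(a, b) << 32)
                        | zero_mask(a);
    assert(combined != 0); /* caller already knows a null exists */
    return (unsigned)__builtin_ctzll(combined);
}
```

The point of the arrangement is that only one extra compare (on the first vector) is needed after the loop exits, since the min mask already covers the second vector.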
>
> - addq $(VEC_SIZE * 4), %rdi
>
> -# ifndef USE_AS_STRNLEN
> - jmp L(loop_4x_vec)
> -# else
> - subq $(VEC_SIZE * 4), %rsi
> - ja L(loop_4x_vec)
> +# ifdef USE_AS_STRNLEN
>
> +L(last_4x_vec_or_less_load):
> + /* Depending on entry adjust rdi / prepare first VEC in YMM1. */
> + VMOVA (VEC_SIZE * 4)(%rdi), %YMM1
> +L(last_4x_vec_or_less_cmpeq):
> + VPCMP $0, %YMM1, %YMMZERO, %k0
> + addq $(VEC_SIZE * 3), %rdi
> L(last_4x_vec_or_less):
> - /* Less than 4 * VEC and aligned to VEC_SIZE. */
> - addl $(VEC_SIZE * 2), %esi
> - jle L(last_2x_vec)
> -
> - VPCMP $0, (%rdi), %YMMZERO, %k0
> kmovd %k0, %eax
> - testl %eax, %eax
> - jnz L(first_vec_x0)
> + /* If remaining length > VEC_SIZE * 2. */
> + testl $(VEC_SIZE * 2), %esi
> + jnz L(last_4x_vec)
>
> - VPCMP $0, VEC_SIZE(%rdi), %YMMZERO, %k0
> - kmovd %k0, %eax
> + /* length may have been negative or positive depending on where
> + this was called from. This fixes that. */
> + andl $(VEC_SIZE * 4 - 1), %esi
> testl %eax, %eax
> - jnz L(first_vec_x1)
> + jnz L(last_vec_x1_check)
>
> - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0
> - kmovd %k0, %eax
> - testl %eax, %eax
> - jnz L(first_vec_x2_check)
> subl $VEC_SIZE, %esi
> - jle L(max)
> + jb L(max)
>
> - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0
> + VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0
> kmovd %k0, %eax
> - testl %eax, %eax
> - jnz L(first_vec_x3_check)
> - movq %r8, %rax
> + tzcntl %eax, %eax
> # ifdef USE_AS_WCSLEN
> - shrq $2, %rax
> + sarl $2, %esi
> # endif
> - ret
> -
> - .p2align 4
> -L(last_2x_vec):
> - addl $(VEC_SIZE * 2), %esi
> -
> - VPCMP $0, (%rdi), %YMMZERO, %k0
> - kmovd %k0, %eax
> - testl %eax, %eax
> - jnz L(first_vec_x0_check)
> - subl $VEC_SIZE, %esi
> - jle L(max)
> + /* Check the end of data. */
> + cmpl %eax, %esi
> + jb L(max)
>
> - VPCMP $0, VEC_SIZE(%rdi), %YMMZERO, %k0
> - kmovd %k0, %eax
> - testl %eax, %eax
> - jnz L(first_vec_x1_check)
> - movq %r8, %rax
> + subq %rdx, %rdi
> # ifdef USE_AS_WCSLEN
> - shrq $2, %rax
> + sarq $2, %rdi
> # endif
> + leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax
> + ret
> +L(max):
> + movq %r8, %rax
> ret
> +# endif
>
> .p2align 4
> -L(first_vec_x0_check):
> +L(second_vec_return):
> + VPCMP $0, %YMM3, %YMMZERO, %k0
> + /* Combine YMM3 matches (k0) with YMM4 matches (k1). */
> +# ifdef USE_AS_WCSLEN
> + kunpckbw %k0, %k1, %k0
> + kmovd %k0, %eax
> + tzcntl %eax, %eax
> +# else
> + kunpckdq %k0, %k1, %k0
> + kmovq %k0, %rax
> + tzcntq %rax, %rax
> +# endif
> + leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax
> + ret
> +
> +
> +# ifdef USE_AS_STRNLEN
> +L(last_vec_x1_check):
> tzcntl %eax, %eax
> # ifdef USE_AS_WCSLEN
> - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
> - sall $2, %eax
> + sarl $2, %esi
> # endif
> /* Check the end of data. */
> - cmpq %rax, %rsi
> - jbe L(max)
> - addq %rdi, %rax
> - subq %rdx, %rax
> + cmpl %eax, %esi
> + jb L(max)
> + subq %rdx, %rdi
> # ifdef USE_AS_WCSLEN
> - shrq $2, %rax
> + sarq $2, %rdi
> # endif
> + leaq (CHAR_PER_VEC)(%rdi, %rax), %rax
> ret
>
> .p2align 4
> -L(first_vec_x1_check):
> +L(last_4x_vec):
> + /* Test first 2x VEC normally. */
> + testl %eax, %eax
> + jnz L(last_vec_x1)
> +
> + VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0
> + kmovd %k0, %eax
> + testl %eax, %eax
> + jnz L(last_vec_x2)
> +
> + /* Normalize length. */
> + andl $(VEC_SIZE * 4 - 1), %esi
> + VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0
> + kmovd %k0, %eax
> + testl %eax, %eax
> + jnz L(last_vec_x3)
> +
> + subl $(VEC_SIZE * 3), %esi
> + jb L(max)
> +
> + VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMZERO, %k0
> + kmovd %k0, %eax
> tzcntl %eax, %eax
> # ifdef USE_AS_WCSLEN
> - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
> - sall $2, %eax
> + sarl $2, %esi
> # endif
> /* Check the end of data. */
> - cmpq %rax, %rsi
> - jbe L(max)
> - addq $VEC_SIZE, %rax
> - addq %rdi, %rax
> - subq %rdx, %rax
> + cmpl %eax, %esi
> + jb L(max_end)
> +
> + subq %rdx, %rdi
> # ifdef USE_AS_WCSLEN
> - shrq $2, %rax
> + sarq $2, %rdi
> # endif
> + leaq (CHAR_PER_VEC * 4)(%rdi, %rax), %rax
> ret
>
> .p2align 4
> -L(first_vec_x2_check):
> +L(last_vec_x1):
> tzcntl %eax, %eax
> + subq %rdx, %rdi
> # ifdef USE_AS_WCSLEN
> - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
> - sall $2, %eax
> + sarq $2, %rdi
> # endif
> - /* Check the end of data. */
> - cmpq %rax, %rsi
> - jbe L(max)
> - addq $(VEC_SIZE * 2), %rax
> - addq %rdi, %rax
> - subq %rdx, %rax
> + leaq (CHAR_PER_VEC)(%rdi, %rax), %rax
> + ret
> +
> + .p2align 4
> +L(last_vec_x2):
> + tzcntl %eax, %eax
> + subq %rdx, %rdi
> # ifdef USE_AS_WCSLEN
> - shrq $2, %rax
> + sarq $2, %rdi
> # endif
> + leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax
> ret
>
> .p2align 4
> -L(first_vec_x3_check):
> +L(last_vec_x3):
> tzcntl %eax, %eax
> # ifdef USE_AS_WCSLEN
> - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
> - sall $2, %eax
> + sarl $2, %esi
> # endif
> + subl $(CHAR_PER_VEC * 2), %esi
> /* Check the end of data. */
> - cmpq %rax, %rsi
> - jbe L(max)
> - addq $(VEC_SIZE * 3), %rax
> - addq %rdi, %rax
> - subq %rdx, %rax
> + cmpl %eax, %esi
> + jb L(max_end)
> + subq %rdx, %rdi
> # ifdef USE_AS_WCSLEN
> - shrq $2, %rax
> + sarq $2, %rdi
> # endif
> + leaq (CHAR_PER_VEC * 3)(%rdi, %rax), %rax
> ret
> -
> - .p2align 4
> -L(max):
> +L(max_end):
> movq %r8, %rax
> -# ifdef USE_AS_WCSLEN
> - shrq $2, %rax
> -# endif
> - ret
> -
> - .p2align 4
> -L(zero):
> - xorl %eax, %eax
> ret
> # endif
>
> + /* Cold case for crossing page with first load. */
> .p2align 4
> -L(first_vec_x0):
> - tzcntl %eax, %eax
> -# ifdef USE_AS_WCSLEN
> - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
> - sall $2, %eax
> -# endif
> - addq %rdi, %rax
> - subq %rdx, %rax
> +L(cross_page_boundary):
> + movq %rdi, %rdx
> + /* Align data to VEC_SIZE. */
> + andq $-VEC_SIZE, %rdi
> + VPCMP $0, (%rdi), %YMMZERO, %k0
> + kmovd %k0, %eax
> + /* Remove the leading bytes. */
> # ifdef USE_AS_WCSLEN
> - shrq $2, %rax
> + movl %edx, %ecx
> + shrl $2, %ecx
> + andl $(CHAR_PER_VEC - 1), %ecx
> # endif
> - ret
> -
> - .p2align 4
> -L(first_vec_x1):
> + /* SHIFT_REG is ecx for USE_AS_WCSLEN and edx otherwise. */
> + sarxl %SHIFT_REG, %eax, %eax
> + testl %eax, %eax
> +# ifndef USE_AS_STRNLEN
> + jz L(cross_page_continue)
> tzcntl %eax, %eax
> -# ifdef USE_AS_WCSLEN
> - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
> - sall $2, %eax
> -# endif
> - addq $VEC_SIZE, %rax
> - addq %rdi, %rax
> - subq %rdx, %rax
> -# ifdef USE_AS_WCSLEN
> - shrq $2, %rax
> -# endif
> ret
> -
> - .p2align 4
> -L(first_vec_x2):
> - tzcntl %eax, %eax
> -# ifdef USE_AS_WCSLEN
> - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
> - sall $2, %eax
> -# endif
> - addq $(VEC_SIZE * 2), %rax
> - addq %rdi, %rax
> - subq %rdx, %rax
> -# ifdef USE_AS_WCSLEN
> - shrq $2, %rax
> -# endif
> +# else
> + jnz L(cross_page_less_vec)
> +# ifndef USE_AS_WCSLEN
> + movl %edx, %ecx
> + andl $(CHAR_PER_VEC - 1), %ecx
> +# endif
> + movl $CHAR_PER_VEC, %eax
> + subl %ecx, %eax
> + cmpq %rax, %rsi
> + ja L(cross_page_continue)
> + movl %esi, %eax
> ret
> -
> - .p2align 4
> -L(4x_vec_end):
> - VPCMP $0, %YMM1, %YMMZERO, %k0
> - kmovd %k0, %eax
> - testl %eax, %eax
> - jnz L(first_vec_x0)
> - VPCMP $0, %YMM2, %YMMZERO, %k1
> - kmovd %k1, %eax
> - testl %eax, %eax
> - jnz L(first_vec_x1)
> - VPCMP $0, %YMM3, %YMMZERO, %k2
> - kmovd %k2, %eax
> - testl %eax, %eax
> - jnz L(first_vec_x2)
> - VPCMP $0, %YMM4, %YMMZERO, %k3
> - kmovd %k3, %eax
> -L(first_vec_x3):
> +L(cross_page_less_vec):
> tzcntl %eax, %eax
> -# ifdef USE_AS_WCSLEN
> - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */
> - sall $2, %eax
> -# endif
> - addq $(VEC_SIZE * 3), %rax
> - addq %rdi, %rax
> - subq %rdx, %rax
> -# ifdef USE_AS_WCSLEN
> - shrq $2, %rax
> -# endif
> + /* Select min of length and position of first null. */
> + cmpq %rax, %rsi
> + cmovb %esi, %eax
> ret
> +# endif
>
> END (STRLEN)
> #endif
> --
> 2.29.2
>
--
H.J.