* [PATCH v1 1/8] x86: Create header for VEC classes in x86 strings library @ 2022-06-03 4:42 Noah Goldstein 2022-06-03 4:42 ` [PATCH v1 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein ` (7 more replies) 0 siblings, 8 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 4:42 UTC (permalink / raw) To: libc-alpha This patch does not touch any existing code and is only meant to be a tool for future patches so that simple source files can more easily be maintained to target multiple VEC classes. There is no difference in the objdump of libc.so before and after this patch. --- sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 33 +++++++++ sysdeps/x86_64/multiarch/avx-vecs.h | 53 ++++++++++++++ sysdeps/x86_64/multiarch/avx2-rtm-vecs.h | 33 +++++++++ sysdeps/x86_64/multiarch/avx2-vecs.h | 30 ++++++++ sysdeps/x86_64/multiarch/evex256-vecs.h | 50 +++++++++++++ sysdeps/x86_64/multiarch/evex512-vecs.h | 49 +++++++++++++ sysdeps/x86_64/multiarch/sse2-vecs.h | 48 +++++++++++++ sysdeps/x86_64/multiarch/vec-macros.h | 90 ++++++++++++++++++++++++ 8 files changed, 386 insertions(+) create mode 100644 sysdeps/x86_64/multiarch/avx-rtm-vecs.h create mode 100644 sysdeps/x86_64/multiarch/avx-vecs.h create mode 100644 sysdeps/x86_64/multiarch/avx2-rtm-vecs.h create mode 100644 sysdeps/x86_64/multiarch/avx2-vecs.h create mode 100644 sysdeps/x86_64/multiarch/evex256-vecs.h create mode 100644 sysdeps/x86_64/multiarch/evex512-vecs.h create mode 100644 sysdeps/x86_64/multiarch/sse2-vecs.h create mode 100644 sysdeps/x86_64/multiarch/vec-macros.h diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h new file mode 100644 index 0000000000..c00b83ea0e --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h @@ -0,0 +1,33 @@ +/* Common config for AVX-RTM VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX_RTM_VECS_H +#define _AVX_RTM_VECS_H 1 + +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST + +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) + +#define SECTION(p) p##.avx.rtm + +#define USE_WITH_RTM 1 +#include "avx-vecs.h" + +#endif diff --git a/sysdeps/x86_64/multiarch/avx-vecs.h b/sysdeps/x86_64/multiarch/avx-vecs.h new file mode 100644 index 0000000000..3b84d7e8b2 --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx-vecs.h @@ -0,0 +1,53 @@ +/* Common config for AVX VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX_VECS_H +#define _AVX_VECS_H 1 + +#ifdef HAS_VEC +# error "Multiple VEC configs included!" +#endif + +#define HAS_VEC 1 +#include "vec-macros.h" + +#ifndef USE_WITH_AVX2 +# define USE_WITH_AVX 1 +#endif +/* Included by RTM version. */ +#ifndef SECTION +# define SECTION(p) p##.avx +#endif + +#define VEC_SIZE 32 +/* 4-byte mov instructions with AVX2. */ +#define MOV_SIZE 4 +/* 1 (ret) + 3 (vzeroupper). */ +#define RET_SIZE 4 +#define VZEROUPPER vzeroupper + +#define VMOVU vmovdqu +#define VMOVA vmovdqa +#define VMOVNT vmovntdq + +/* Often need to access xmm portion. */ +#define VEC_xmm VEC_any_xmm +#define VEC VEC_any_ymm + +#endif diff --git a/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h new file mode 100644 index 0000000000..a5d46e8c66 --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h @@ -0,0 +1,33 @@ +/* Common config for AVX2-RTM VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX2_RTM_VECS_H +#define _AVX2_RTM_VECS_H 1 + +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST + +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) + +#define SECTION(p) p##.avx.rtm + +#define USE_WITH_RTM 1 +#include "avx2-vecs.h" + +#endif diff --git a/sysdeps/x86_64/multiarch/avx2-vecs.h b/sysdeps/x86_64/multiarch/avx2-vecs.h new file mode 100644 index 0000000000..4c029b4621 --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx2-vecs.h @@ -0,0 +1,30 @@ +/* Common config for AVX2 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX2_VECS_H +#define _AVX2_VECS_H 1 + +#define USE_WITH_AVX2 1 +/* Included by RTM version. */ +#ifndef SECTION +# define SECTION(p) p##.avx +#endif +#include "avx-vecs.h" + +#endif diff --git a/sysdeps/x86_64/multiarch/evex256-vecs.h b/sysdeps/x86_64/multiarch/evex256-vecs.h new file mode 100644 index 0000000000..ed7a32b0ec --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex256-vecs.h @@ -0,0 +1,50 @@ +/* Common config for EVEX256 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX256_VECS_H +#define _EVEX256_VECS_H 1 + +#ifdef HAS_VEC +# error "Multiple VEC configs included!" +#endif + +#define HAS_VEC 1 +#include "vec-macros.h" + +#define USE_WITH_EVEX256 1 +#ifndef SECTION +# define SECTION(p) p##.evex +#endif + +#define VEC_SIZE 32 +/* 6-byte mov instructions with EVEX. */ +#define MOV_SIZE 6 +/* No vzeroupper needed. */ +#define RET_SIZE 1 +#define VZEROUPPER + +#define VMOVU vmovdqu64 +#define VMOVA vmovdqa64 +#define VMOVNT vmovntdq + +/* Often need to access xmm portion. */ +#define VEC_xmm VEC_hi_xmm +#define VEC VEC_hi_ymm + +#endif diff --git a/sysdeps/x86_64/multiarch/evex512-vecs.h b/sysdeps/x86_64/multiarch/evex512-vecs.h new file mode 100644 index 0000000000..53597734fc --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex512-vecs.h @@ -0,0 +1,49 @@ +/* Common config for EVEX512 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX512_VECS_H +#define _EVEX512_VECS_H 1 + +#ifdef HAS_VEC +# error "Multiple VEC configs included!" +#endif + +#define HAS_VEC 1 +#include "vec-macros.h" + +#define USE_WITH_EVEX512 1 +#define SECTION(p) p##.evex512 + +#define VEC_SIZE 64 +/* 6-byte mov instructions with EVEX. */ +#define MOV_SIZE 6 +/* No vzeroupper needed. 
*/ +#define RET_SIZE 1 +#define VZEROUPPER + +#define VMOVU vmovdqu64 +#define VMOVA vmovdqa64 +#define VMOVNT vmovntdq + +/* Often need to access xmm/ymm portion. */ +#define VEC_xmm VEC_hi_xmm +#define VEC_ymm VEC_hi_ymm +#define VEC VEC_hi_zmm + +#endif diff --git a/sysdeps/x86_64/multiarch/sse2-vecs.h b/sysdeps/x86_64/multiarch/sse2-vecs.h new file mode 100644 index 0000000000..b645b93e3d --- /dev/null +++ b/sysdeps/x86_64/multiarch/sse2-vecs.h @@ -0,0 +1,48 @@ +/* Common config for SSE2 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _SSE2_VECS_H +#define _SSE2_VECS_H 1 + +#ifdef HAS_VEC +# error "Multiple VEC configs included!" +#endif + +#define HAS_VEC 1 +#include "vec-macros.h" + +#define USE_WITH_SSE2 1 +#define SECTION(p) p + +#define VEC_SIZE 16 +/* 3-byte mov instructions with SSE2. */ +#define MOV_SIZE 3 +/* No vzeroupper needed. */ +#define RET_SIZE 1 + +#define VMOVU movups +#define VMOVA movaps +#define VMOVNT movntdq +#define VZEROUPPER + +#define VEC_xmm VEC_any_xmm +#define VEC VEC_any_xmm + + +#endif diff --git a/sysdeps/x86_64/multiarch/vec-macros.h b/sysdeps/x86_64/multiarch/vec-macros.h new file mode 100644 index 0000000000..4dae4503c8 --- /dev/null +++ b/sysdeps/x86_64/multiarch/vec-macros.h @@ -0,0 +1,90 @@ +/* Macro helpers for VEC_{type}({vec_num}) + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _VEC_MACROS_H +# define _VEC_MACROS_H 1 + +# ifndef HAS_VEC +# error "Never include this file directly. Always include a vector config." +# endif + +/* Defines so we can use SSE2 / AVX2 / EVEX / EVEX512 encoding with same + VEC(N) values. 
*/ +#define VEC_hi_xmm0 xmm16 +#define VEC_hi_xmm1 xmm17 +#define VEC_hi_xmm2 xmm18 +#define VEC_hi_xmm3 xmm19 +#define VEC_hi_xmm4 xmm20 +#define VEC_hi_xmm5 xmm21 +#define VEC_hi_xmm6 xmm22 +#define VEC_hi_xmm7 xmm23 +#define VEC_hi_xmm8 xmm24 +#define VEC_hi_xmm9 xmm25 +#define VEC_hi_xmm10 xmm26 +#define VEC_hi_xmm11 xmm27 +#define VEC_hi_xmm12 xmm28 +#define VEC_hi_xmm13 xmm29 +#define VEC_hi_xmm14 xmm30 +#define VEC_hi_xmm15 xmm31 + +#define VEC_hi_ymm0 ymm16 +#define VEC_hi_ymm1 ymm17 +#define VEC_hi_ymm2 ymm18 +#define VEC_hi_ymm3 ymm19 +#define VEC_hi_ymm4 ymm20 +#define VEC_hi_ymm5 ymm21 +#define VEC_hi_ymm6 ymm22 +#define VEC_hi_ymm7 ymm23 +#define VEC_hi_ymm8 ymm24 +#define VEC_hi_ymm9 ymm25 +#define VEC_hi_ymm10 ymm26 +#define VEC_hi_ymm11 ymm27 +#define VEC_hi_ymm12 ymm28 +#define VEC_hi_ymm13 ymm29 +#define VEC_hi_ymm14 ymm30 +#define VEC_hi_ymm15 ymm31 + +#define VEC_hi_zmm0 zmm16 +#define VEC_hi_zmm1 zmm17 +#define VEC_hi_zmm2 zmm18 +#define VEC_hi_zmm3 zmm19 +#define VEC_hi_zmm4 zmm20 +#define VEC_hi_zmm5 zmm21 +#define VEC_hi_zmm6 zmm22 +#define VEC_hi_zmm7 zmm23 +#define VEC_hi_zmm8 zmm24 +#define VEC_hi_zmm9 zmm25 +#define VEC_hi_zmm10 zmm26 +#define VEC_hi_zmm11 zmm27 +#define VEC_hi_zmm12 zmm28 +#define VEC_hi_zmm13 zmm29 +#define VEC_hi_zmm14 zmm30 +#define VEC_hi_zmm15 zmm31 + +# define PRIMITIVE_VEC(vec, num) vec##num + +# define VEC_any_xmm(i) PRIMITIVE_VEC(xmm, i) +# define VEC_any_ymm(i) PRIMITIVE_VEC(ymm, i) +# define VEC_any_zmm(i) PRIMITIVE_VEC(zmm, i) + +# define VEC_hi_xmm(i) PRIMITIVE_VEC(VEC_hi_xmm, i) +# define VEC_hi_ymm(i) PRIMITIVE_VEC(VEC_hi_ymm, i) +# define VEC_hi_zmm(i) PRIMITIVE_VEC(VEC_hi_zmm, i) + +#endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
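[Illustration of the intended usage pattern -- a hedged sketch, not code from this series. The file name "copy-one-vec.S" and the function "copy_one_vec" are invented. The idea the headers enable: a generic source includes exactly one VEC class config and writes its vector code purely in terms of the class macros, so swapping the #include retargets it to another vector extension without source changes.

/* copy-one-vec.S: hypothetical demo built against the headers above.  */
#include <sysdep.h>
#include "evex256-vecs.h"	/* Or sse2-vecs.h, avx2-vecs.h, ...  */

	/* SECTION(.text) expands to .text.evex for this config.  */
	.section SECTION(.text), "ax", @progbits
/* void copy_one_vec (void *dst, const void *src) -- invented demo.  */
ENTRY (copy_one_vec)
	VMOVU	(%rsi), %VEC(0)	/* vmovdqu64 (%rsi), %ymm16 here.  */
	VMOVU	%VEC(0), (%rdi)
	VZEROUPPER		/* Expands to nothing for EVEX/SSE2.  */
	ret
END (copy_one_vec)

The same source built against sse2-vecs.h assembles to movups on %xmm0 in plain .text, with no source changes.]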
* [PATCH v1 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` 2022-06-03 4:42 [PATCH v1 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein @ 2022-06-03 4:42 ` Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (4 more replies) 2022-06-03 4:42 ` [PATCH v1 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein ` (6 subsequent siblings) 7 siblings, 5 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 4:42 UTC (permalink / raw) To: libc-alpha The RTM vzeroupper mitigation has no way of replacing inline vzeroupper not before a return. This code does not change any existing functionality. There is no difference in the objdump of libc.so before and after this patch. --- sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 1 + sysdeps/x86_64/multiarch/avx2-rtm-vecs.h | 1 + sysdeps/x86_64/sysdep.h | 16 ++++++++++++++++ 3 files changed, 18 insertions(+) diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h index c00b83ea0e..e954b8e1b0 100644 --- a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h @@ -20,6 +20,7 @@ #ifndef _AVX_RTM_VECS_H #define _AVX_RTM_VECS_H 1 +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h index a5d46e8c66..e20c3635a0 100644 --- a/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h +++ b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h @@ -20,6 +20,7 @@ #ifndef _AVX2_RTM_VECS_H #define _AVX2_RTM_VECS_H 1 +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h index f14d50786d..2cb31a558b 100644 --- a/sysdeps/x86_64/sysdep.h +++ b/sysdeps/x86_64/sysdep.h @@ -106,6 +106,22 @@ lose: \ vzeroupper; \ ret +/* Can be used to replace vzeroupper that is not directly before a + return. */ +#define COND_VZEROUPPER_XTEST \ + xtest; \ + jz 1f; \ + vzeroall; \ + jmp 2f; \ +1: \ + vzeroupper; \ +2: + +/* In RTM define this as COND_VZEROUPPER_XTEST. */ +#ifndef COND_VZEROUPPER +# define COND_VZEROUPPER vzeroupper +#endif + /* Zero upper vector registers and return. */ #ifndef ZERO_UPPER_VEC_REGISTERS_RETURN # define ZERO_UPPER_VEC_REGISTERS_RETURN \ -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
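[Illustration of the call pattern this macro enables -- a hedged sketch, not code from this series; the function name and body are invented. A function can clear the upper vector state once, mid-body, and then fall into scalar tail code that ends in a plain ret; under the RTM configs COND_VZEROUPPER expands to the xtest-guarded sequence (vzeroall inside a transaction, vzeroupper otherwise), matching the existing ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST mitigation.

#include <sysdep.h>
#include "avx2-rtm-vecs.h"	/* Selects COND_VZEROUPPER_XTEST.  */

	.section SECTION(.text), "ax", @progbits
/* int demo_has_zero_byte (const void *p) -- invented: nonzero if any
   of the first 32 bytes at p is zero.  */
ENTRY (demo_has_zero_byte)
	vpxor	%xmm0, %xmm0, %xmm0	/* ymm0 = 0 (VEX zeroes high).  */
	vpcmpeqb (%rdi), %ymm0, %ymm1	/* Match zero bytes.  */
	vpmovmskb %ymm1, %eax
	/* Done with ymm registers; under RTM this is the xtest/vzeroall/
	   vzeroupper sequence, otherwise a bare vzeroupper.  */
	COND_VZEROUPPER
	/* Scalar tail; a plain ret is now safe.  */
	testl	%eax, %eax
	setnz	%al
	movzbl	%al, %eax
	ret
END (demo_has_zero_byte)]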
* [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library 2022-06-03 4:42 ` [PATCH v1 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein @ 2022-06-03 20:04 ` Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein ` (7 more replies) 2022-06-03 23:49 ` [PATCH v3 " Noah Goldstein ` (3 subsequent siblings) 4 siblings, 8 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 20:04 UTC (permalink / raw) To: libc-alpha This patch does not touch any existing code and is only meant to be a tool for future patches so that simple source files can more easily be maintained to target multiple VEC classes. There is no difference in the objdump of libc.so before and after this patch. --- sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 33 +++++++++ sysdeps/x86_64/multiarch/avx-vecs.h | 53 ++++++++++++++ sysdeps/x86_64/multiarch/avx2-rtm-vecs.h | 33 +++++++++ sysdeps/x86_64/multiarch/avx2-vecs.h | 30 ++++++++ sysdeps/x86_64/multiarch/evex256-vecs.h | 50 +++++++++++++ sysdeps/x86_64/multiarch/evex512-vecs.h | 49 +++++++++++++ sysdeps/x86_64/multiarch/sse2-vecs.h | 48 +++++++++++++ sysdeps/x86_64/multiarch/vec-macros.h | 90 ++++++++++++++++++++++++ 8 files changed, 386 insertions(+) create mode 100644 sysdeps/x86_64/multiarch/avx-rtm-vecs.h create mode 100644 sysdeps/x86_64/multiarch/avx-vecs.h create mode 100644 sysdeps/x86_64/multiarch/avx2-rtm-vecs.h create mode 100644 sysdeps/x86_64/multiarch/avx2-vecs.h create mode 100644 sysdeps/x86_64/multiarch/evex256-vecs.h create mode 100644 sysdeps/x86_64/multiarch/evex512-vecs.h create mode 100644 sysdeps/x86_64/multiarch/sse2-vecs.h create mode 100644 sysdeps/x86_64/multiarch/vec-macros.h diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h new file mode 100644 index 0000000000..c00b83ea0e --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h @@ -0,0 +1,33 @@ +/* Common config for AVX-RTM VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX_RTM_VECS_H +#define _AVX_RTM_VECS_H 1 + +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST + +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) + +#define SECTION(p) p##.avx.rtm + +#define USE_WITH_RTM 1 +#include "avx-vecs.h" + +#endif diff --git a/sysdeps/x86_64/multiarch/avx-vecs.h b/sysdeps/x86_64/multiarch/avx-vecs.h new file mode 100644 index 0000000000..3b84d7e8b2 --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx-vecs.h @@ -0,0 +1,53 @@ +/* Common config for AVX VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX_VECS_H +#define _AVX_VECS_H 1 + +#ifdef HAS_VEC +# error "Multiple VEC configs included!" +#endif + +#define HAS_VEC 1 +#include "vec-macros.h" + +#ifndef USE_WITH_AVX2 +# define USE_WITH_AVX 1 +#endif +/* Included by RTM version. */ +#ifndef SECTION +# define SECTION(p) p##.avx +#endif + +#define VEC_SIZE 32 +/* 4-byte mov instructions with AVX2. */ +#define MOV_SIZE 4 +/* 1 (ret) + 3 (vzeroupper). */ +#define RET_SIZE 4 +#define VZEROUPPER vzeroupper + +#define VMOVU vmovdqu +#define VMOVA vmovdqa +#define VMOVNT vmovntdq + +/* Often need to access xmm portion. */ +#define VEC_xmm VEC_any_xmm +#define VEC VEC_any_ymm + +#endif diff --git a/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h new file mode 100644 index 0000000000..a5d46e8c66 --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h @@ -0,0 +1,33 @@ +/* Common config for AVX2-RTM VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX2_RTM_VECS_H +#define _AVX2_RTM_VECS_H 1 + +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST + +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) + +#define SECTION(p) p##.avx.rtm + +#define USE_WITH_RTM 1 +#include "avx2-vecs.h" + +#endif diff --git a/sysdeps/x86_64/multiarch/avx2-vecs.h b/sysdeps/x86_64/multiarch/avx2-vecs.h new file mode 100644 index 0000000000..4c029b4621 --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx2-vecs.h @@ -0,0 +1,30 @@ +/* Common config for AVX2 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX2_VECS_H +#define _AVX2_VECS_H 1 + +#define USE_WITH_AVX2 1 +/* Included by RTM version. */ +#ifndef SECTION +# define SECTION(p) p##.avx +#endif +#include "avx-vecs.h" + +#endif diff --git a/sysdeps/x86_64/multiarch/evex256-vecs.h b/sysdeps/x86_64/multiarch/evex256-vecs.h new file mode 100644 index 0000000000..ed7a32b0ec --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex256-vecs.h @@ -0,0 +1,50 @@ +/* Common config for EVEX256 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX256_VECS_H +#define _EVEX256_VECS_H 1 + +#ifdef HAS_VEC +# error "Multiple VEC configs included!" +#endif + +#define HAS_VEC 1 +#include "vec-macros.h" + +#define USE_WITH_EVEX256 1 +#ifndef SECTION +# define SECTION(p) p##.evex +#endif + +#define VEC_SIZE 32 +/* 6-byte mov instructions with EVEX. */ +#define MOV_SIZE 6 +/* No vzeroupper needed. */ +#define RET_SIZE 1 +#define VZEROUPPER + +#define VMOVU vmovdqu64 +#define VMOVA vmovdqa64 +#define VMOVNT vmovntdq + +/* Often need to access xmm portion. */ +#define VEC_xmm VEC_hi_xmm +#define VEC VEC_hi_ymm + +#endif diff --git a/sysdeps/x86_64/multiarch/evex512-vecs.h b/sysdeps/x86_64/multiarch/evex512-vecs.h new file mode 100644 index 0000000000..53597734fc --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex512-vecs.h @@ -0,0 +1,49 @@ +/* Common config for EVEX512 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX512_VECS_H +#define _EVEX512_VECS_H 1 + +#ifdef HAS_VEC +# error "Multiple VEC configs included!" +#endif + +#define HAS_VEC 1 +#include "vec-macros.h" + +#define USE_WITH_EVEX512 1 +#define SECTION(p) p##.evex512 + +#define VEC_SIZE 64 +/* 6-byte mov instructions with EVEX. */ +#define MOV_SIZE 6 +/* No vzeroupper needed. 
*/ +#define RET_SIZE 1 +#define VZEROUPPER + +#define VMOVU vmovdqu64 +#define VMOVA vmovdqa64 +#define VMOVNT vmovntdq + +/* Often need to access xmm/ymm portion. */ +#define VEC_xmm VEC_hi_xmm +#define VEC_ymm VEC_hi_ymm +#define VEC VEC_hi_zmm + +#endif diff --git a/sysdeps/x86_64/multiarch/sse2-vecs.h b/sysdeps/x86_64/multiarch/sse2-vecs.h new file mode 100644 index 0000000000..b645b93e3d --- /dev/null +++ b/sysdeps/x86_64/multiarch/sse2-vecs.h @@ -0,0 +1,48 @@ +/* Common config for SSE2 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _SSE2_VECS_H +#define _SSE2_VECS_H 1 + +#ifdef HAS_VEC +# error "Multiple VEC configs included!" +#endif + +#define HAS_VEC 1 +#include "vec-macros.h" + +#define USE_WITH_SSE2 1 +#define SECTION(p) p + +#define VEC_SIZE 16 +/* 3-byte mov instructions with SSE2. */ +#define MOV_SIZE 3 +/* No vzeroupper needed. */ +#define RET_SIZE 1 + +#define VMOVU movups +#define VMOVA movaps +#define VMOVNT movntdq +#define VZEROUPPER + +#define VEC_xmm VEC_any_xmm +#define VEC VEC_any_xmm + + +#endif diff --git a/sysdeps/x86_64/multiarch/vec-macros.h b/sysdeps/x86_64/multiarch/vec-macros.h new file mode 100644 index 0000000000..4dae4503c8 --- /dev/null +++ b/sysdeps/x86_64/multiarch/vec-macros.h @@ -0,0 +1,90 @@ +/* Macro helpers for VEC_{type}({vec_num}) + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _VEC_MACROS_H +# define _VEC_MACROS_H 1 + +# ifndef HAS_VEC +# error "Never include this file directly. Always include a vector config." +# endif + +/* Defines so we can use SSE2 / AVX2 / EVEX / EVEX512 encoding with same + VEC(N) values. 
*/ +#define VEC_hi_xmm0 xmm16 +#define VEC_hi_xmm1 xmm17 +#define VEC_hi_xmm2 xmm18 +#define VEC_hi_xmm3 xmm19 +#define VEC_hi_xmm4 xmm20 +#define VEC_hi_xmm5 xmm21 +#define VEC_hi_xmm6 xmm22 +#define VEC_hi_xmm7 xmm23 +#define VEC_hi_xmm8 xmm24 +#define VEC_hi_xmm9 xmm25 +#define VEC_hi_xmm10 xmm26 +#define VEC_hi_xmm11 xmm27 +#define VEC_hi_xmm12 xmm28 +#define VEC_hi_xmm13 xmm29 +#define VEC_hi_xmm14 xmm30 +#define VEC_hi_xmm15 xmm31 + +#define VEC_hi_ymm0 ymm16 +#define VEC_hi_ymm1 ymm17 +#define VEC_hi_ymm2 ymm18 +#define VEC_hi_ymm3 ymm19 +#define VEC_hi_ymm4 ymm20 +#define VEC_hi_ymm5 ymm21 +#define VEC_hi_ymm6 ymm22 +#define VEC_hi_ymm7 ymm23 +#define VEC_hi_ymm8 ymm24 +#define VEC_hi_ymm9 ymm25 +#define VEC_hi_ymm10 ymm26 +#define VEC_hi_ymm11 ymm27 +#define VEC_hi_ymm12 ymm28 +#define VEC_hi_ymm13 ymm29 +#define VEC_hi_ymm14 ymm30 +#define VEC_hi_ymm15 ymm31 + +#define VEC_hi_zmm0 zmm16 +#define VEC_hi_zmm1 zmm17 +#define VEC_hi_zmm2 zmm18 +#define VEC_hi_zmm3 zmm19 +#define VEC_hi_zmm4 zmm20 +#define VEC_hi_zmm5 zmm21 +#define VEC_hi_zmm6 zmm22 +#define VEC_hi_zmm7 zmm23 +#define VEC_hi_zmm8 zmm24 +#define VEC_hi_zmm9 zmm25 +#define VEC_hi_zmm10 zmm26 +#define VEC_hi_zmm11 zmm27 +#define VEC_hi_zmm12 zmm28 +#define VEC_hi_zmm13 zmm29 +#define VEC_hi_zmm14 zmm30 +#define VEC_hi_zmm15 zmm31 + +# define PRIMITIVE_VEC(vec, num) vec##num + +# define VEC_any_xmm(i) PRIMITIVE_VEC(xmm, i) +# define VEC_any_ymm(i) PRIMITIVE_VEC(ymm, i) +# define VEC_any_zmm(i) PRIMITIVE_VEC(zmm, i) + +# define VEC_hi_xmm(i) PRIMITIVE_VEC(VEC_hi_xmm, i) +# define VEC_hi_ymm(i) PRIMITIVE_VEC(VEC_hi_ymm, i) +# define VEC_hi_zmm(i) PRIMITIVE_VEC(VEC_hi_zmm, i) + +#endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH v2 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` 2022-06-03 20:04 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein @ 2022-06-03 20:04 ` Noah Goldstein 2022-06-03 23:12 ` H.J. Lu 2022-06-03 20:04 ` [PATCH v2 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein ` (6 subsequent siblings) 7 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 20:04 UTC (permalink / raw) To: libc-alpha The RTM vzeroupper mitigation has no way of replacing inline vzeroupper not before a return. This code does not change any existing functionality. There is no difference in the objdump of libc.so before and after this patch. --- sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 1 + sysdeps/x86_64/multiarch/avx2-rtm-vecs.h | 1 + sysdeps/x86_64/sysdep.h | 16 ++++++++++++++++ 3 files changed, 18 insertions(+) diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h index c00b83ea0e..e954b8e1b0 100644 --- a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h @@ -20,6 +20,7 @@ #ifndef _AVX_RTM_VECS_H #define _AVX_RTM_VECS_H 1 +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h index a5d46e8c66..e20c3635a0 100644 --- a/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h +++ b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h @@ -20,6 +20,7 @@ #ifndef _AVX2_RTM_VECS_H #define _AVX2_RTM_VECS_H 1 +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h index f14d50786d..2cb31a558b 100644 --- a/sysdeps/x86_64/sysdep.h +++ b/sysdeps/x86_64/sysdep.h @@ -106,6 +106,22 @@ lose: \ vzeroupper; \ ret +/* Can be used to replace vzeroupper that is not directly before a + return. */ +#define COND_VZEROUPPER_XTEST \ + xtest; \ + jz 1f; \ + vzeroall; \ + jmp 2f; \ +1: \ + vzeroupper; \ +2: + +/* In RTM define this as COND_VZEROUPPER_XTEST. */ +#ifndef COND_VZEROUPPER +# define COND_VZEROUPPER vzeroupper +#endif + /* Zero upper vector registers and return. */ #ifndef ZERO_UPPER_VEC_REGISTERS_RETURN # define ZERO_UPPER_VEC_REGISTERS_RETURN \ -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v2 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` 2022-06-03 20:04 ` [PATCH v2 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein @ 2022-06-03 23:12 ` H.J. Lu 2022-06-03 23:33 ` Noah Goldstein 0 siblings, 1 reply; 82+ messages in thread From: H.J. Lu @ 2022-06-03 23:12 UTC (permalink / raw) To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell On Fri, Jun 3, 2022 at 1:04 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > The RTM vzeroupper mitigation has no way of replacing inline > vzeroupper not before a return. > > This code does not change any existing functionality. > > There is no difference in the objdump of libc.so before and after this > patch. > --- > sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 1 + > sysdeps/x86_64/multiarch/avx2-rtm-vecs.h | 1 + > sysdeps/x86_64/sysdep.h | 16 ++++++++++++++++ > 3 files changed, 18 insertions(+) > > diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > index c00b83ea0e..e954b8e1b0 100644 > --- a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > @@ -20,6 +20,7 @@ > #ifndef _AVX_RTM_VECS_H > #define _AVX_RTM_VECS_H 1 > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > diff --git a/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > index a5d46e8c66..e20c3635a0 100644 > --- a/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > +++ b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > @@ -20,6 +20,7 @@ > #ifndef _AVX2_RTM_VECS_H > #define _AVX2_RTM_VECS_H 1 > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h > index f14d50786d..2cb31a558b 100644 > --- a/sysdeps/x86_64/sysdep.h > +++ b/sysdeps/x86_64/sysdep.h > @@ -106,6 +106,22 @@ lose: \ > vzeroupper; \ > ret > > +/* Can be used to replace vzeroupper that is not directly before a > + return. */ > +#define COND_VZEROUPPER_XTEST \ > + xtest; \ > + jz 1f; \ > + vzeroall; \ > + jmp 2f; \ > +1: \ > + vzeroupper; \ > +2: Will "ret" always be after "2:"? > +/* In RTM define this as COND_VZEROUPPER_XTEST. */ > +#ifndef COND_VZEROUPPER > +# define COND_VZEROUPPER vzeroupper > +#endif > + > /* Zero upper vector registers and return. */ > #ifndef ZERO_UPPER_VEC_REGISTERS_RETURN > # define ZERO_UPPER_VEC_REGISTERS_RETURN \ > -- > 2.34.1 > -- H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v2 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` 2022-06-03 23:12 ` H.J. Lu @ 2022-06-03 23:33 ` Noah Goldstein 0 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 23:33 UTC (permalink / raw) To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell On Fri, Jun 3, 2022 at 6:12 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Fri, Jun 3, 2022 at 1:04 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > The RTM vzeroupper mitigation has no way of replacing inline > > vzeroupper not before a return. > > > > This code does not change any existing functionality. > > > > There is no difference in the objdump of libc.so before and after this > > patch. > > --- > > sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 1 + > > sysdeps/x86_64/multiarch/avx2-rtm-vecs.h | 1 + > > sysdeps/x86_64/sysdep.h | 16 ++++++++++++++++ > > 3 files changed, 18 insertions(+) > > > > diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > index c00b83ea0e..e954b8e1b0 100644 > > --- a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > @@ -20,6 +20,7 @@ > > #ifndef _AVX_RTM_VECS_H > > #define _AVX_RTM_VECS_H 1 > > > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > > > diff --git a/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > > index a5d46e8c66..e20c3635a0 100644 > > --- a/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > > +++ b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > > @@ -20,6 +20,7 @@ > > #ifndef _AVX2_RTM_VECS_H > > #define _AVX2_RTM_VECS_H 1 > > > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > > > diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h > > index f14d50786d..2cb31a558b 100644 > > --- a/sysdeps/x86_64/sysdep.h > > +++ b/sysdeps/x86_64/sysdep.h > > @@ -106,6 +106,22 @@ lose: \ > > vzeroupper; \ > > ret > > > > +/* Can be used to replace vzeroupper that is not directly before a > > + return. */ > > +#define COND_VZEROUPPER_XTEST \ > > + xtest; \ > > + jz 1f; \ > > + vzeroall; \ > > + jmp 2f; \ > > +1: \ > > + vzeroupper; \ > > +2: > > Will "ret" always be after "2:"? At some point but not immediately afterwards. For example: L(zero): xorl %eax, %eax VZEROUPPER_RETURN L(check): tzcntl %eax, %eax cmpl %eax, %edx jle L(zero) addq %rdi, %rax VZEROUPPER_RETURN Can become: L(zero): xorl %eax, %eax ret L(check): tzcntl %eax, %eax COND_VZEROUPPER cmpl %eax, %edx jle L(zero) addq %rdi, %rax ret Which saves code size. > > > +/* In RTM define this as COND_VZEROUPPER_XTEST. */ > > +#ifndef COND_VZEROUPPER > > +# define COND_VZEROUPPER vzeroupper > > +#endif > > + > > /* Zero upper vector registers and return. */ > > #ifndef ZERO_UPPER_VEC_REGISTERS_RETURN > > # define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > -- > > 2.34.1 > > > > > -- > H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH v2 3/8] Benchtests: Improve memrchr benchmarks 2022-06-03 20:04 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein @ 2022-06-03 20:04 ` Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein ` (5 subsequent siblings) 7 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 20:04 UTC (permalink / raw) To: libc-alpha Add a second iteration for memrchr to set `pos` starting from the end of the buffer. Previously `pos` was only set relative to the beginning of the buffer. This isn't really useful for memrchr because the beginning of the search space is (buf + len). --- benchtests/bench-memchr.c | 110 ++++++++++++++++++++++---------------- 1 file changed, 65 insertions(+), 45 deletions(-) diff --git a/benchtests/bench-memchr.c b/benchtests/bench-memchr.c index 4d7212332f..0facda2fa0 100644 --- a/benchtests/bench-memchr.c +++ b/benchtests/bench-memchr.c @@ -76,7 +76,7 @@ do_one_test (json_ctx_t *json_ctx, impl_t *impl, const CHAR *s, int c, static void do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, - int seek_char) + int seek_char, int invert_pos) { size_t i; @@ -96,7 +96,10 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, if (pos < len) { - buf[align + pos] = seek_char; + if (invert_pos) + buf[align + len - pos] = seek_char; + else + buf[align + pos] = seek_char; buf[align + len] = -seek_char; } else @@ -109,6 +112,7 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, json_attr_uint (json_ctx, "pos", pos); json_attr_uint (json_ctx, "len", len); json_attr_uint (json_ctx, "seek_char", seek_char); + json_attr_uint (json_ctx, "invert_pos", invert_pos); json_array_begin (json_ctx, "timings"); @@ -123,6 +127,7 @@ int test_main (void) { size_t i; + int repeats; json_ctx_t json_ctx; test_init (); @@ -142,53 +147,68 @@ test_main (void) json_array_begin (&json_ctx, "results"); - for (i = 1; i < 8; ++i) + for (repeats = 0; repeats < 2; ++repeats) { - do_test (&json_ctx, 0, 16 << i, 2048, 23); - do_test (&json_ctx, i, 64, 256, 23); - do_test (&json_ctx, 0, 16 << i, 2048, 0); - do_test (&json_ctx, i, 64, 256, 0); - - do_test (&json_ctx, getpagesize () - 15, 64, 256, 0); + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, 0, 16 << i, 2048, 23, repeats); + do_test (&json_ctx, i, 64, 256, 23, repeats); + do_test (&json_ctx, 0, 16 << i, 2048, 0, repeats); + do_test (&json_ctx, i, 64, 256, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, 64, 256, 0, repeats); #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr. */ - do_test (&json_ctx, 0, i, 256, 23); - do_test (&json_ctx, 0, i, 256, 0); - do_test (&json_ctx, i, i, 256, 23); - do_test (&json_ctx, i, i, 256, 0); + /* Also test the position close to the beginning for memrchr.
*/ + do_test (&json_ctx, 0, i, 256, 23, repeats); + do_test (&json_ctx, 0, i, 256, 0, repeats); + do_test (&json_ctx, i, i, 256, 23, repeats); + do_test (&json_ctx, i, i, 256, 0, repeats); #endif - } - for (i = 1; i < 8; ++i) - { - do_test (&json_ctx, i, i << 5, 192, 23); - do_test (&json_ctx, i, i << 5, 192, 0); - do_test (&json_ctx, i, i << 5, 256, 23); - do_test (&json_ctx, i, i << 5, 256, 0); - do_test (&json_ctx, i, i << 5, 512, 23); - do_test (&json_ctx, i, i << 5, 512, 0); - - do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23); - } - for (i = 1; i < 32; ++i) - { - do_test (&json_ctx, 0, i, i + 1, 23); - do_test (&json_ctx, 0, i, i + 1, 0); - do_test (&json_ctx, i, i, i + 1, 23); - do_test (&json_ctx, i, i, i + 1, 0); - do_test (&json_ctx, 0, i, i - 1, 23); - do_test (&json_ctx, 0, i, i - 1, 0); - do_test (&json_ctx, i, i, i - 1, 23); - do_test (&json_ctx, i, i, i - 1, 0); - - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23); - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0); - - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23); - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0); + } + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, i, i << 5, 192, 23, repeats); + do_test (&json_ctx, i, i << 5, 192, 0, repeats); + do_test (&json_ctx, i, i << 5, 256, 23, repeats); + do_test (&json_ctx, i, i << 5, 256, 0, repeats); + do_test (&json_ctx, i, i << 5, 512, 23, repeats); + do_test (&json_ctx, i, i << 5, 512, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23, repeats); + } + for (i = 1; i < 32; ++i) + { + do_test (&json_ctx, 0, i, i + 1, 23, repeats); + do_test (&json_ctx, 0, i, i + 1, 0, repeats); + do_test (&json_ctx, i, i, i + 1, 23, repeats); + do_test (&json_ctx, i, i, i + 1, 0, repeats); + do_test (&json_ctx, 0, i, i - 1, 23, repeats); + do_test (&json_ctx, 0, i, i - 1, 0, repeats); + do_test (&json_ctx, i, i, i - 1, 23, repeats); + do_test (&json_ctx, i, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0, repeats); + #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr. */ - do_test (&json_ctx, 0, 1, i + 1, 23); - do_test (&json_ctx, 0, 2, i + 1, 0); + do_test (&json_ctx, 0, 1, i + 1, 23, repeats); + do_test (&json_ctx, 0, 2, i + 1, 0, repeats); +#endif + } +#ifndef USE_AS_MEMRCHR + break; #endif } -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
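[Standalone C illustration of the new parameter -- not part of the benchmark harness; the buffer size and values are invented. It mirrors the invert_pos placement logic in do_test above: invert_pos == 1 places the seek character relative to the end of the buffer, which for memrchr (scanning backwards from buf + len) puts the match in the first vector loaded, a hot path the old harness never measured.

#define _GNU_SOURCE	/* For memrchr.  */
#include <assert.h>
#include <string.h>

int
main (void)
{
  char buf[512];
  const size_t align = 0, pos = 16, len = 256;
  for (int invert_pos = 0; invert_pos < 2; ++invert_pos)
    {
      memset (buf, 1, sizeof buf);
      if (invert_pos)
	buf[align + len - pos] = 23;	/* 16 bytes from the end.  */
      else
	buf[align + pos] = 23;		/* 16 bytes from the start.  */
      assert (memrchr (buf + align, 23, len)
	      == (invert_pos ? buf + align + len - pos
			     : buf + align + pos));
    }
  return 0;
}]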
* [PATCH v2 4/8] x86: Optimize memrchr-sse2.S 2022-06-03 20:04 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (2 preceding siblings ...) 2022-06-03 20:04 ` [PATCH v2 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein @ 2022-06-03 20:04 ` Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 5/8] x86: Optimize memrchr-evex.S Noah Goldstein ` (4 subsequent siblings) 7 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 20:04 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller lengths more. 2. optimizes target placement more carefully. 3. reuses logic more. 4. fixes up various inefficiencies in the logic. The total code size saving is: 394 bytes Geometric Mean of all benchmarks New / Old: 0.874 Regressions: 1. The page cross case is now colder, especially re-entry from the page cross case if a match is not found in the first VEC (roughly 50%). My general opinion with this patch is this is acceptable given the "coldness" of this case (less than 4%) and the general performance improvement in the other, far more common cases. 2. There are some regressions of 5-15% for medium/large user-arg lengths that have a match in the first VEC. This is because the logic was rewritten to optimize finds in the first VEC if the user-arg length is shorter (where we see roughly 20-50% performance improvements). It is not always the case that this is a regression. My intuition is that some frontend quirk partially explains the data, although I haven't been able to find the root cause. Full xcheck passes on x86_64. --- sysdeps/x86_64/memrchr.S | 613 +++++++++++++++++++-------------------- 1 file changed, 292 insertions(+), 321 deletions(-) diff --git a/sysdeps/x86_64/memrchr.S b/sysdeps/x86_64/memrchr.S index d1a9f47911..b0dffd2ae2 100644 --- a/sysdeps/x86_64/memrchr.S +++ b/sysdeps/x86_64/memrchr.S @@ -18,362 +18,333 @@ <https://www.gnu.org/licenses/>. */ #include <sysdep.h> +#define VEC_SIZE 16 +#define PAGE_SIZE 4096 .text -ENTRY (__memrchr) - movd %esi, %xmm1 - - sub $16, %RDX_LP - jbe L(length_less16) - - punpcklbw %xmm1, %xmm1 - punpcklbw %xmm1, %xmm1 - - add %RDX_LP, %RDI_LP - pshufd $0, %xmm1, %xmm1 - - movdqu (%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - -/* Check if there is a match.
*/ - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches0) - - sub $64, %rdi - mov %edi, %ecx - and $15, %ecx - jz L(loop_prolog) - - add $16, %rdi - add $16, %rdx - and $-16, %rdi - sub %rcx, %rdx - - .p2align 4 -L(loop_prolog): - sub $64, %rdx - jbe L(exit_loop) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16) - - movdqa (%rdi), %xmm4 - pcmpeqb %xmm1, %xmm4 - pmovmskb %xmm4, %eax - test %eax, %eax - jnz L(matches0) - - sub $64, %rdi - sub $64, %rdx - jbe L(exit_loop) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16) - - movdqa (%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches0) - - mov %edi, %ecx - and $63, %ecx - jz L(align64_loop) - - add $64, %rdi - add $64, %rdx - and $-64, %rdi - sub %rcx, %rdx - - .p2align 4 -L(align64_loop): - sub $64, %rdi - sub $64, %rdx - jbe L(exit_loop) - - movdqa (%rdi), %xmm0 - movdqa 16(%rdi), %xmm2 - movdqa 32(%rdi), %xmm3 - movdqa 48(%rdi), %xmm4 - - pcmpeqb %xmm1, %xmm0 - pcmpeqb %xmm1, %xmm2 - pcmpeqb %xmm1, %xmm3 - pcmpeqb %xmm1, %xmm4 - - pmaxub %xmm3, %xmm0 - pmaxub %xmm4, %xmm2 - pmaxub %xmm0, %xmm2 - pmovmskb %xmm2, %eax - - test %eax, %eax - jz L(align64_loop) - - pmovmskb %xmm4, %eax - test %eax, %eax - jnz L(matches48) - - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm2 - - pcmpeqb %xmm1, %xmm2 - pcmpeqb (%rdi), %xmm1 - - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches16) - - pmovmskb %xmm1, %eax - bsr %eax, %eax - - add %rdi, %rax +ENTRY_P2ALIGN(__memrchr, 6) +#ifdef __ILP32__ + /* Clear upper bits. */ + mov %RDX_LP, %RDX_LP +#endif + movd %esi, %xmm0 + + /* Get end pointer. */ + leaq (%rdx, %rdi), %rcx + + punpcklbw %xmm0, %xmm0 + punpcklwd %xmm0, %xmm0 + pshufd $0, %xmm0, %xmm0 + + /* Check if we can load 1x VEC without crossing a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %ecx + jz L(page_cross) + + /* NB: This load happens regardless of whether rdx (len) is zero. Since + it doesn't cross a page and the standard guarantees any pointer has + at least one valid byte, this load must be safe. For the entire + history of the x86 memrchr implementation this has been possible, so + no code "should" be relying on a zero-length check before this load. + The zero-length check is moved to the page cross case because it is + 1) pretty cold and 2) including it pushes the hot case len <= VEC_SIZE + onto 2 cache lines. */ + movups -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subq $VEC_SIZE, %rdx + ja L(more_1x_vec) +L(ret_vec_x0_test): + /* Zero-flag set if eax (src) is zero. Destination unchanged if src is + zero. */ + bsrl %eax, %eax + jz L(ret_0) + /* Check if the CHAR match is in bounds. Need to truly zero `eax` here + if out of bounds. */ + addl %edx, %eax + jl L(zero_0) + /* Since we subtracted VEC_SIZE from rdx earlier we can just add to base + ptr.
*/ + addq %rdi, %rax +L(ret_0): ret - .p2align 4 -L(exit_loop): - add $64, %edx - cmp $32, %edx - jbe L(exit_loop_32) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16_1) - cmp $48, %edx - jbe L(return_null) - - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax - test %eax, %eax - jnz L(matches0_1) - xor %eax, %eax + .p2align 4,, 5 +L(ret_vec_x0): + bsrl %eax, %eax + leaq -(VEC_SIZE)(%rcx, %rax), %rax ret - .p2align 4 -L(exit_loop_32): - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48_1) - cmp $16, %edx - jbe L(return_null) - - pcmpeqb 32(%rdi), %xmm1 - pmovmskb %xmm1, %eax - test %eax, %eax - jnz L(matches32_1) - xor %eax, %eax + .p2align 4,, 2 +L(zero_0): + xorl %eax, %eax ret - .p2align 4 -L(matches0): - bsr %eax, %eax - add %rdi, %rax - ret - - .p2align 4 -L(matches16): - bsr %eax, %eax - lea 16(%rax, %rdi), %rax - ret - .p2align 4 -L(matches32): - bsr %eax, %eax - lea 32(%rax, %rdi), %rax + .p2align 4,, 8 +L(more_1x_vec): + testl %eax, %eax + jnz L(ret_vec_x0) + + /* Align rcx (pointer to string). */ + decq %rcx + andq $-VEC_SIZE, %rcx + + movq %rcx, %rdx + /* NB: We could consistently save 1 byte in this pattern with `movaps + %xmm0, %xmm1; pcmpeq IMM8(r), %xmm1; ...`. The reason against it is + that it adds more frontend uops (even if the moves can be eliminated) and + some percentage of the time actual backend uops. */ + movaps -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + subq %rdi, %rdx + pmovmskb %xmm1, %eax + + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) +L(last_2x_vec): + subl $VEC_SIZE, %edx + jbe L(ret_vec_x0_test) + + testl %eax, %eax + jnz L(ret_vec_x0) + + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subl $VEC_SIZE, %edx + bsrl %eax, %eax + jz L(ret_1) + addl %edx, %eax + jl L(zero_0) + addq %rdi, %rax +L(ret_1): ret + /* Don't align. Otherwise losing the 2-byte encoding in the jump to L(page_cross) + causes the hot path (length <= VEC_SIZE) to span multiple cache + lines. Naturally aligned % 16 to 8 bytes. */ +L(page_cross): + /* Zero length check. */ + testq %rdx, %rdx + jz L(zero_0) + + leaq -1(%rcx), %r8 + andq $-(VEC_SIZE), %r8 + + movaps (%r8), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %esi + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + negl %ecx + /* 32-bit shift but VEC_SIZE=16 so need to mask the shift count + explicitly. */ + andl $(VEC_SIZE - 1), %ecx + shl %cl, %esi + movzwl %si, %eax + leaq (%rdi, %rdx), %rcx + cmpq %rdi, %r8 + ja L(more_1x_vec) + subl $VEC_SIZE, %edx + bsrl %eax, %eax + jz L(ret_2) + addl %edx, %eax + jl L(zero_1) + addq %rdi, %rax +L(ret_2): ret + /* Fits in the aligning bytes.
*/ +L(zero_1): + xorl %eax, %eax ret - .p2align 4 -L(matches16_1): - bsr %eax, %eax - sub $48, %rdx - add %rax, %rdx - jl L(return_null) - lea 16(%rdi, %rax), %rax + .p2align 4,, 5 +L(ret_vec_x1): + bsrl %eax, %eax + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax ret - .p2align 4 -L(matches32_1): - bsr %eax, %eax - sub $32, %rdx - add %rax, %rdx - jl L(return_null) - lea 32(%rdi, %rax), %rax - ret + .p2align 4,, 8 +L(more_2x_vec): + testl %eax, %eax + jnz L(ret_vec_x0) - .p2align 4 -L(matches48_1): - bsr %eax, %eax - sub $16, %rdx - add %rax, %rdx - jl L(return_null) - lea 48(%rdi, %rax), %rax - ret + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + testl %eax, %eax + jnz L(ret_vec_x1) - .p2align 4 -L(return_null): - xor %eax, %eax - ret - .p2align 4 -L(length_less16_offset0): - test %edx, %edx - jz L(return_null) + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax - mov %dl, %cl - pcmpeqb (%rdi), %xmm1 + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) - mov $1, %edx - sal %cl, %edx - sub $1, %edx + addl $(VEC_SIZE), %edx + jle L(ret_vec_x2_test) - pmovmskb %xmm1, %eax +L(last_vec): + testl %eax, %eax + jnz L(ret_vec_x2) - and %edx, %eax - test %eax, %eax - jz L(return_null) + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax - bsr %eax, %eax - add %rdi, %rax + subl $(VEC_SIZE), %edx + bsrl %eax, %eax + jz L(ret_3) + addl %edx, %eax + jl L(zero_2) + addq %rdi, %rax +L(ret_3): ret - .p2align 4 -L(length_less16): - punpcklbw %xmm1, %xmm1 - punpcklbw %xmm1, %xmm1 - - add $16, %edx - - pshufd $0, %xmm1, %xmm1 - - mov %edi, %ecx - and $15, %ecx - jz L(length_less16_offset0) - - mov %cl, %dh - mov %ecx, %esi - add %dl, %dh - and $-16, %rdi - - sub $16, %dh - ja L(length_less16_part2) - - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax - - sar %cl, %eax - mov %dl, %cl - - mov $1, %edx - sal %cl, %edx - sub $1, %edx - - and %edx, %eax - test %eax, %eax - jz L(return_null) - - bsr %eax, %eax - add %rdi, %rax - add %rsi, %rax + .p2align 4,, 6 +L(ret_vec_x2_test): + bsrl %eax, %eax + jz L(zero_2) + addl %edx, %eax + jl L(zero_2) + addq %rdi, %rax ret - .p2align 4 -L(length_less16_part2): - movdqa 16(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - - mov %dh, %cl - mov $1, %edx - sal %cl, %edx - sub $1, %edx - - and %edx, %eax +L(zero_2): + xorl %eax, %eax + ret - test %eax, %eax - jnz L(length_less16_part2_return) - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax + .p2align 4,, 5 +L(ret_vec_x2): + bsrl %eax, %eax + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax + ret - mov %esi, %ecx - sar %cl, %eax - test %eax, %eax - jz L(return_null) + .p2align 4,, 5 +L(ret_vec_x3): + bsrl %eax, %eax + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax + ret - bsr %eax, %eax - add %rdi, %rax - add %rsi, %rax + .p2align 4,, 8 +L(more_4x_vec): + testl %eax, %eax + jnz L(ret_vec_x2) + + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + testl %eax, %eax + jnz L(ret_vec_x3) + + addq $-(VEC_SIZE * 4), %rcx + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) + + /* Offset everything by 4x VEC_SIZE here to save a few bytes at the end + keeping the code from spilling to the next cache line. 
*/ + addq $(VEC_SIZE * 4 - 1), %rcx + andq $-(VEC_SIZE * 4), %rcx + leaq (VEC_SIZE * 4)(%rdi), %rdx + andq $-(VEC_SIZE * 4), %rdx + + .p2align 4,, 11 +L(loop_4x_vec): + movaps (VEC_SIZE * -1)(%rcx), %xmm1 + movaps (VEC_SIZE * -2)(%rcx), %xmm2 + movaps (VEC_SIZE * -3)(%rcx), %xmm3 + movaps (VEC_SIZE * -4)(%rcx), %xmm4 + pcmpeqb %xmm0, %xmm1 + pcmpeqb %xmm0, %xmm2 + pcmpeqb %xmm0, %xmm3 + pcmpeqb %xmm0, %xmm4 + + por %xmm1, %xmm2 + por %xmm3, %xmm4 + por %xmm2, %xmm4 + + pmovmskb %xmm4, %esi + testl %esi, %esi + jnz L(loop_end) + + addq $-(VEC_SIZE * 4), %rcx + cmpq %rdx, %rcx + jne L(loop_4x_vec) + + subl %edi, %edx + + /* Ends up being 1-byte nop. */ + .p2align 4,, 2 +L(last_4x_vec): + movaps -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) + + testl %eax, %eax + jnz L(ret_vec_x0) + + + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + testl %eax, %eax + jnz L(ret_vec_end) + + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subl $(VEC_SIZE * 3), %edx + ja L(last_vec) + bsrl %eax, %eax + jz L(ret_4) + addl %edx, %eax + jl L(zero_3) + addq %rdi, %rax +L(ret_4): ret - .p2align 4 -L(length_less16_part2_return): - bsr %eax, %eax - lea 16(%rax, %rdi), %rax + /* Ends up being 1-byte nop. */ + .p2align 4,, 3 +L(loop_end): + pmovmskb %xmm1, %eax + sall $16, %eax + jnz L(ret_vec_end) + + pmovmskb %xmm2, %eax + testl %eax, %eax + jnz L(ret_vec_end) + + pmovmskb %xmm3, %eax + /* Combine last 2 VEC matches. If eax (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If eax is non-zero + then there is a CHAR in VEC3 and bsrl will use that position. */ + sall $16, %eax + orl %esi, %eax + bsrl %eax, %eax + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax ret -END (__memrchr) +L(ret_vec_end): + bsrl %eax, %eax + leaq (VEC_SIZE * -2)(%rax, %rcx), %rax + ret + /* Used in L(last_4x_vec). In the same cache line. These are just spare + aligning bytes. */ +L(zero_3): + xorl %eax, %eax + ret + /* 2 bytes from next cache line. */ +END(__memrchr) weak_alias (__memrchr, memrchr) -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
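A minimal C sketch of the L(page_cross) mask trick above, for readers following along. This is illustrative only: the function name and scaffolding are this note's assumptions, not part of the patch, VEC_SIZE is fixed at 16, and zero length is assumed to be handled beforehand, as the assembly does.

    #include <emmintrin.h>  /* SSE2 */
    #include <stdint.h>

    /* endptr points one past the last byte to search.  The aligned load
       cannot fault: it stays inside the page containing endptr - 1.  */
    static unsigned int
    page_cross_mask (const unsigned char *endptr, __m128i match)
    {
      const __m128i *aligned
        = (const __m128i *) ((uintptr_t) (endptr - 1) & ~(uintptr_t) 15);
      unsigned int raw
        = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 (aligned),
                                             match));
      /* negl %ecx; andl $(VEC_SIZE - 1), %ecx: the count of in-vector
         bytes at or past endptr.  Shifting left drops exactly those
         bytes; the movzwl in the assembly is the cast back to 16 bits
         here.  */
      unsigned int shift = (unsigned int) (-(uintptr_t) endptr) & 15;
      return (unsigned short) (raw << shift);
    }

A non-zero result feeds bsr exactly as an in-bounds vector's mask would, which is why the code can jump straight back into L(more_1x_vec) afterwards.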
* [PATCH v2 5/8] x86: Optimize memrchr-evex.S 2022-06-03 20:04 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (2 preceding siblings ...) 2022-06-03 20:04 ` [PATCH v2 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein @ 2022-06-03 20:04 ` Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein ` (3 subsequent siblings) 7 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 20:04 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully. 3. reuses logic more. 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns, which saves either a branch or multiple instructions. The total code size saving is: 263 bytes Geometric Mean of all benchmarks New / Old: 0.755 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the beginning of the string (in first VEC). This case has roughly a 20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). The shorter-length cases see roughly a 35% speedup. Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memrchr-evex.S | 539 ++++++++++++------------ 1 file changed, 268 insertions(+), 271 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S index 0b99709c6b..ad541c0e50 100644 --- a/sysdeps/x86_64/multiarch/memrchr-evex.S +++ b/sysdeps/x86_64/multiarch/memrchr-evex.S @@ -19,319 +19,316 @@ #if IS_IN (libc) # include <sysdep.h> +# include "evex256-vecs.h" +# if VEC_SIZE != 32 +# error "VEC_SIZE != 32 unimplemented" +# endif + +# ifndef MEMRCHR +# define MEMRCHR __memrchr_evex +# endif + +# define PAGE_SIZE 4096 +# define VECMATCH VEC(0) + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN(MEMRCHR, 6) +# ifdef __ILP32__ + /* Clear upper bits. */ + and %RDX_LP, %RDX_LP +# else + test %RDX_LP, %RDX_LP +# endif + jz L(zero_0) + + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a + correct page cross check and 2) it correctly sets up the end ptr so + the match position can be computed by subtracting lzcnt from it. */ + leaq -1(%rdi, %rdx), %rax + vpbroadcastb %esi, %VECMATCH + + /* Check if we can load 1x VEC without crossing a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %eax + jz L(page_cross) + + /* Don't use rax for pointer here because EVEX has better encoding with + offset % VEC_SIZE == 0. */ + vpcmpb $0, -(VEC_SIZE)(%rdi, %rdx), %VECMATCH, %k0 + kmovd %k0, %ecx + + /* Fall through for rdx (len) <= VEC_SIZE (expect small sizes). */ + cmpq $VEC_SIZE, %rdx + ja L(more_1x_vec) +L(ret_vec_x0_test): + + /* If ecx is zero (no matches) lzcnt will set it to 32 (VEC_SIZE), which + guarantees edx (len) is no greater than it. */ + lzcntl %ecx, %ecx + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret -# define VMOVA vmovdqa64 - -# define YMMMATCH ymm16 - -# define VEC_SIZE 32 - - .section .text.evex,"ax",@progbits -ENTRY (__memrchr_evex) - /* Broadcast CHAR to YMMMATCH. */ - vpbroadcastb %esi, %YMMMATCH - - sub $VEC_SIZE, %RDX_LP - jbe L(last_vec_or_less) - - add %RDX_LP, %RDI_LP - - /* Check the last VEC_SIZE bytes. 
*/ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - subq $(VEC_SIZE * 4), %rdi - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(aligned_more) - - /* Align data for aligned loads in the loop. */ - addq $VEC_SIZE, %rdi - addq $VEC_SIZE, %rdx - andq $-VEC_SIZE, %rdi - subq %rcx, %rdx - - .p2align 4 -L(aligned_more): - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpcmpb $0, (%rdi), %YMMMATCH, %k4 - kmovd %k4, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - /* Align data to 4 * VEC_SIZE for loop with fewer branches. - There are some overlaps with above if data isn't aligned - to 4 * VEC_SIZE. */ - movl %edi, %ecx - andl $(VEC_SIZE * 4 - 1), %ecx - jz L(loop_4x_vec) - - addq $(VEC_SIZE * 4), %rdi - addq $(VEC_SIZE * 4), %rdx - andq $-(VEC_SIZE * 4), %rdi - subq %rcx, %rdx + /* Fits in aligning bytes of first cache line. */ +L(zero_0): + xorl %eax, %eax + ret - .p2align 4 -L(loop_4x_vec): - /* Compare 4 * VEC at a time forward. */ - subq $(VEC_SIZE * 4), %rdi - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k2 - kord %k1, %k2, %k5 - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k3 - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k4 - - kord %k3, %k4, %k6 - kortestd %k5, %k6 - jz L(loop_4x_vec) - - /* There is a match. */ - kmovd %k4, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - kmovd %k1, %eax - bsrl %eax, %eax - addq %rdi, %rax + .p2align 4,, 9 +L(ret_vec_x0_dec): + decq %rax +L(ret_vec_x0): + lzcntl %ecx, %ecx + subq %rcx, %rax ret - .p2align 4 -L(last_4x_vec_or_less): - addl $(VEC_SIZE * 4), %edx - cmpl $(VEC_SIZE * 2), %edx - jbe L(last_2x_vec) + .p2align 4,, 10 +L(more_1x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3) + /* Align rax (pointer to string). */ + andq $-VEC_SIZE, %rax - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x2) + /* Recompute length after aligning. */ + movq %rax, %rdx - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x1_check) - cmpl $(VEC_SIZE * 3), %edx - jbe L(zero) + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - vpcmpb $0, (%rdi), %YMMMATCH, %k4 - kmovd %k4, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 4), %rdx - addq %rax, %rdx - jl L(zero) - addq %rdi, %rax - ret + subq %rdi, %rdx - .p2align 4 + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) L(last_2x_vec): - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3_check) + + /* Must dec rax because L(ret_vec_x0_test) expects it. 
*/ + decq %rax cmpl $VEC_SIZE, %edx - jbe L(zero) - - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 2), %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + jbe L(ret_vec_x0_test) + + testl %ecx, %ecx + jnz L(ret_vec_x0) + + /* Don't use rax for pointer here because EVEX has better encoding with + offset % VEC_SIZE == 0. */ + vpcmpb $0, -(VEC_SIZE * 2)(%rdi, %rdx), %VECMATCH, %k0 + kmovd %k0, %ecx + /* NB: 64-bit lzcnt. This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret - .p2align 4 -L(last_vec_x0): - bsrl %eax, %eax - addq %rdi, %rax + /* Inexpensive place to put this regarding code size / target alignments + / ICache NLP. Necessary for 2-byte encoding of jump to page cross + case which in turn is necessary for the hot path (len <= VEC_SIZE) to fit + in the first cache line. */ +L(page_cross): + movq %rax, %rsi + andq $-VEC_SIZE, %rsi + vpcmpb $0, (%rsi), %VECMATCH, %k0 + kmovd %k0, %r8d + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + movl %eax, %ecx + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ + notl %ecx + shlxl %ecx, %r8d, %ecx + cmpq %rdi, %rsi + ja L(more_1x_vec) + lzcntl %ecx, %ecx + cmpl %ecx, %edx + jle L(zero_1) + subq %rcx, %rax + ret + /* Continue creating zero labels that fit in aligning bytes and get + 2-byte encoding / are in the same cache line as the condition. */ +L(zero_1): + xorl %eax, %eax + ret + .p2align 4,, 8 +L(ret_vec_x1): + /* This will naturally add 32 to position. */ + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax + ret + .p2align 4,, 8 +L(more_2x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0_dec) - .p2align 4 -L(last_vec_x1): - bsrl %eax, %eax - addl $VEC_SIZE, %eax - addq %rdi, %rax + vpcmpb $0, -(VEC_SIZE * 2)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1) - .p2align 4 -L(last_vec_x2): - bsrl %eax, %eax - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE * 3)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - .p2align 4 -L(last_vec_x3): - bsrl %eax, %eax - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) + + cmpl $(VEC_SIZE * -1), %edx + jle L(ret_vec_x2_test) +L(last_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) + + + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 3 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_1) ret - .p2align 4 -L(last_vec_x1_check): - bsrl %eax, %eax - subq $(VEC_SIZE * 3), %rdx - addq %rax, %rdx - jl L(zero) - addl $VEC_SIZE, %eax - addq %rdi, %rax - ret - .p2align 4 -L(last_vec_x3_check): - bsrl %eax, %eax - subq $VEC_SIZE, %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret - .p2align 4 -L(zero): - xorl %eax, %eax + .p2align 4,, 8 +L(ret_vec_x2_test): + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_1) ret - .p2align 4 -L(last_vec_or_less_aligned): - movl %edx, %ecx - - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - - movl $1, %edx - /* Support rdx << 32. */ - salq %cl, %rdx - subq $1, %rdx - - kmovd %k1, %eax - - /* Remove the trailing bytes. 
*/ - andl %edx, %eax - testl %eax, %eax - jz L(zero) - - bsrl %eax, %eax - addq %rdi, %rax + .p2align 4,, 8 +L(ret_vec_x2_test): + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_1) ret - .p2align 4 -L(last_vec_or_less): - addl $VEC_SIZE, %edx - - /* Check for zero length. */ - testl %edx, %edx - jz L(zero) - - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(last_vec_or_less_aligned) - - movl %ecx, %esi - movl %ecx, %r8d - addl %edx, %esi - andq $-VEC_SIZE, %rdi + .p2align 4,, 8 +L(ret_vec_x2): + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax + ret - subl $VEC_SIZE, %esi - ja L(last_vec_2x_aligned) + .p2align 4,, 8 +L(ret_vec_x3): + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax + ret - /* Check the last VEC. */ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax + .p2align 4,, 8 +L(more_4x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) - /* Remove the leading and trailing bytes. */ - sarl %cl, %eax - movl %edx, %ecx + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + testl %ecx, %ecx + jnz L(ret_vec_x3) - andl %edx, %eax - testl %eax, %eax - jz L(zero) + /* Check if near end before re-aligning (otherwise might do an + unnecessary loop iteration). */ + addq $-(VEC_SIZE * 4), %rax + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax - ret + decq %rax + andq $-(VEC_SIZE * 4), %rax + movq %rdi, %rdx + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because + lengths that overflow can be valid and break the comparison. */ + andq $-(VEC_SIZE * 4), %rdx .p2align 4 -L(last_vec_2x_aligned): - movl %esi, %ecx - - /* Check the last VEC. */ - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k1 +L(loop_4x_vec): + /* Store 1 where not-equals and 0 where equals in k1 (used to mask later + on). */ + vpcmpb $4, (VEC_SIZE * 3)(%rax), %VECMATCH, %k1 + + /* VEC(2/3) will have zero-byte where we found a CHAR. */ + vpxorq (VEC_SIZE * 2)(%rax), %VECMATCH, %VEC(2) + vpxorq (VEC_SIZE * 1)(%rax), %VECMATCH, %VEC(3) + vpcmpb $0, (VEC_SIZE * 0)(%rax), %VECMATCH, %k4 + + /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit where + CHAR is found and VEC(2/3) have zero-byte where CHAR is found). */ + vpminub %VEC(2), %VEC(3), %VEC(3){%k1}{z} + vptestnmb %VEC(3), %VEC(3), %k2 + + /* Any 1s and we found CHAR. */ + kortestd %k2, %k4 + jnz L(loop_end) + + addq $-(VEC_SIZE * 4), %rax + cmpq %rdx, %rax + jne L(loop_4x_vec) + + /* Need to re-adjust rdx / rax for L(last_4x_vec). */ + subq $-(VEC_SIZE * 4), %rdx + movq %rdx, %rax + subl %edi, %edx +L(last_4x_vec): + + /* Used no matter what. */ + vpcmpb $0, (VEC_SIZE * -1)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) - kmovd %k1, %eax + testl %ecx, %ecx + jnz L(ret_vec_x0_dec) - /* Remove the trailing bytes. */ - andl %edx, %eax - testl %eax, %eax - jnz L(last_vec_x1) + vpcmpb $0, (VEC_SIZE * -2)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - /* Check the second last VEC. */ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 + testl %ecx, %ecx + jnz L(ret_vec_x1) - movl %r8d, %ecx + /* Used no matter what. */ + vpcmpb $0, (VEC_SIZE * -3)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - kmovd %k1, %eax + cmpl $(VEC_SIZE * 3), %edx + ja L(last_vec) - /* Remove the leading bytes. Must use unsigned right shift for - bsrl below. 
*/ - shrl %cl, %eax - testl %eax, %eax - jz L(zero) + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + jbe L(ret_1) + xorl %eax, %eax +L(ret_1): + ret - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax + .p2align 4,, 6 +L(loop_end): + kmovd %k1, %ecx + notl %ecx + testl %ecx, %ecx + jnz L(ret_vec_x0_end) + + vptestnmb %VEC(2), %VEC(2), %k0 + kmovd %k0, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) + + kmovd %k2, %ecx + kmovd %k4, %esi + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If ecx is non-zero + then CHAR in VEC3 and bsrq will use that position. */ + salq $32, %rcx + orq %rsi, %rcx + bsrq %rcx, %rcx + addq %rcx, %rax + ret + .p2align 4,, 4 +L(ret_vec_x0_end): + addq $(VEC_SIZE), %rax +L(ret_vec_x1_end): + bsrl %ecx, %ecx + leaq (VEC_SIZE * 2)(%rax, %rcx), %rax ret -END (__memrchr_evex) + +END(MEMRCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
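Two of the tricks above are easier to see in C. The sketches below are illustrative only: the function names are invented for this note, the first one uses AVX2-style movemask where the patch uses EVEX mask registers, and the assumed build flags are -mavx2 -mlzcnt (plus -mavx512bw -mavx512vl for the second).

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* The lzcnt return convention for 1 <= len <= 32: lzcnt of a zero
       mask is 32, so one compare against len rejects both "no match"
       and "match before the buffer start", replacing the old bsr + add
       + bounds branch.  Assumes the 32 bytes ending at s + len - 1 are
       readable (the page-cross check in the patch guarantees this).  */
    static void *
    last_vec_lzcnt (const void *s, int c, size_t len)
    {
      const unsigned char *end = (const unsigned char *) s + len - 1;
      __m256i v = _mm256_loadu_si256 ((const __m256i *) (end - 31));
      uint32_t mask = (uint32_t) _mm256_movemask_epi8
        (_mm256_cmpeq_epi8 (v, _mm256_set1_epi8 ((char) c)));
      uint32_t from_end = _lzcnt_u32 (mask); /* distance back from end */
      return from_end < len ? (void *) (end - from_end) : NULL;
    }

    /* The 4x loop check: one compare-not-equal, two xors, a zero-masked
       min and one test-against-zero cover four vectors, mirroring the
       vpcmpb / vpxorq / vpminub / vptestnmb block in the loop.  */
    static int
    any_match_4x (__m256i v1, __m256i v2, __m256i v3, __m256i v4,
                  __m256i match)
    {
      __mmask32 k1 = _mm256_cmpneq_epi8_mask (v1, match); /* 0 on match */
      __m256i x2 = _mm256_xor_si256 (v2, match);  /* zero byte on match */
      __m256i x3 = _mm256_xor_si256 (v3, match);
      /* Lanes where v1 matched are forced to zero by the maskz, so the
         min has a zero byte iff v1, v2 or v3 matched.  */
      __m256i m = _mm256_maskz_min_epu8 (k1, x2, x3);
      __mmask32 k2 = _mm256_testn_epi8_mask (m, m);
      __mmask32 k4 = _mm256_cmpeq_epi8_mask (v4, match);
      return (k2 | k4) != 0;                      /* kortestd k2, k4 */
    }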
* [PATCH v2 6/8] x86: Optimize memrchr-avx2.S 2022-06-03 20:04 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (3 preceding siblings ...) 2022-06-03 20:04 ` [PATCH v2 5/8] x86: Optimize memrchr-evex.S Noah Goldstein @ 2022-06-03 20:04 ` Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein ` (2 subsequent siblings) 7 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 20:04 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully. 3. reuses logic more. 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns, which saves either a branch or multiple instructions. The total code size saving is: 306 bytes Geometric Mean of all benchmarks New / Old: 0.760 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the beginning of the string (in first VEC). This case has roughly a 10-20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). The shorter-length cases see roughly a 15-45% speedup. Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S | 1 + sysdeps/x86_64/multiarch/memrchr-avx2.S | 538 ++++++++++---------- 2 files changed, 260 insertions(+), 279 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S index cea2d2a72d..5e9beeeef2 100644 --- a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S +++ b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S @@ -2,6 +2,7 @@ # define MEMRCHR __memrchr_avx2_rtm #endif +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2.S b/sysdeps/x86_64/multiarch/memrchr-avx2.S index ba2ce7cb03..6915e1c373 100644 --- a/sysdeps/x86_64/multiarch/memrchr-avx2.S +++ b/sysdeps/x86_64/multiarch/memrchr-avx2.S @@ -21,340 +21,320 @@ # include <sysdep.h> # ifndef MEMRCHR -# define MEMRCHR __memrchr_avx2 +# define MEMRCHR __memrchr_avx2 # endif # ifndef VZEROUPPER -# define VZEROUPPER vzeroupper +# define VZEROUPPER vzeroupper # endif +// abf-off # ifndef SECTION # define SECTION(p) p##.avx # endif +// abf-on + +# define VEC_SIZE 32 +# define PAGE_SIZE 4096 + .section SECTION(.text), "ax", @progbits +ENTRY(MEMRCHR) +# ifdef __ILP32__ + /* Clear upper bits. */ + and %RDX_LP, %RDX_LP +# else + test %RDX_LP, %RDX_LP +# endif + jz L(zero_0) -# define VEC_SIZE 32 - - .section SECTION(.text),"ax",@progbits -ENTRY (MEMRCHR) - /* Broadcast CHAR to YMM0. */ vmovd %esi, %xmm0 - vpbroadcastb %xmm0, %ymm0 - - sub $VEC_SIZE, %RDX_LP - jbe L(last_vec_or_less) - - add %RDX_LP, %RDI_LP + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a + correct page cross check and 2) it correctly sets up the end ptr so + the match position can be computed by subtracting lzcnt from it. */ + leaq -1(%rdx, %rdi), %rax - /* Check the last VEC_SIZE bytes. */ - vpcmpeqb (%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - subq $(VEC_SIZE * 4), %rdi - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(aligned_more) + vpbroadcastb %xmm0, %ymm0 - /* Align data for aligned loads in the loop. 
*/ - addq $VEC_SIZE, %rdi - addq $VEC_SIZE, %rdx - andq $-VEC_SIZE, %rdi - subq %rcx, %rdx + /* Check if we can load 1x VEC without crossing a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %eax + jz L(page_cross) + + vpcmpeqb -(VEC_SIZE - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + cmpq $VEC_SIZE, %rdx + ja L(more_1x_vec) + +L(ret_vec_x0_test): + /* If ecx is zero (no matches) lzcnt will set it to 32 (VEC_SIZE), which + guarantees edx (len) is no greater than it. */ + lzcntl %ecx, %ecx + + /* Hoist vzeroupper (not great for RTM) to save code size. This allows + all logic for edx (len) <= VEC_SIZE to fit in first cache line. */ + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret - .p2align 4 -L(aligned_more): - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpcmpeqb (%rdi), %ymm0, %ymm4 - vpmovmskb %ymm4, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - /* Align data to 4 * VEC_SIZE for loop with fewer branches. - There are some overlaps with above if data isn't aligned - to 4 * VEC_SIZE. */ - movl %edi, %ecx - andl $(VEC_SIZE * 4 - 1), %ecx - jz L(loop_4x_vec) - - addq $(VEC_SIZE * 4), %rdi - addq $(VEC_SIZE * 4), %rdx - andq $-(VEC_SIZE * 4), %rdi - subq %rcx, %rdx + /* Fits in aligning bytes of first cache line. */ +L(zero_0): + xorl %eax, %eax + ret - .p2align 4 -L(loop_4x_vec): - /* Compare 4 * VEC at a time forward. */ - subq $(VEC_SIZE * 4), %rdi - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - vmovdqa (%rdi), %ymm1 - vmovdqa VEC_SIZE(%rdi), %ymm2 - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm3 - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm4 - - vpcmpeqb %ymm1, %ymm0, %ymm1 - vpcmpeqb %ymm2, %ymm0, %ymm2 - vpcmpeqb %ymm3, %ymm0, %ymm3 - vpcmpeqb %ymm4, %ymm0, %ymm4 - - vpor %ymm1, %ymm2, %ymm5 - vpor %ymm3, %ymm4, %ymm6 - vpor %ymm5, %ymm6, %ymm5 - - vpmovmskb %ymm5, %eax - testl %eax, %eax - jz L(loop_4x_vec) - - /* There is a match. */ - vpmovmskb %ymm4, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpmovmskb %ymm1, %eax - bsrl %eax, %eax - addq %rdi, %rax L(return_vzeroupper): ZERO_UPPER_VEC_REGISTERS_RETURN - .p2align 4 -L(last_4x_vec_or_less): - addl $(VEC_SIZE * 4), %edx - cmpl $(VEC_SIZE * 2), %edx - jbe L(last_2x_vec) - - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x1_check) - cmpl $(VEC_SIZE * 3), %edx - jbe L(zero) - - vpcmpeqb (%rdi), %ymm0, %ymm4 - vpmovmskb %ymm4, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 4), %rdx - addq %rax, %rdx - jl L(zero) - addq %rdi, %rax - VZEROUPPER_RETURN - - .p2align 4 + .p2align 4,, 10 +L(more_1x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) + + /* Align rax (string pointer). 
*/ + andq $-VEC_SIZE, %rax + + /* Recompute remaining length after aligning. */ + movq %rax, %rdx + /* Need this comparison next no matter what. */ + vpcmpeqb -(VEC_SIZE)(%rax), %ymm0, %ymm1 + subq %rdi, %rdx + decq %rax + vpmovmskb %ymm1, %ecx + /* Fall through for short lengths (the hotter case). */ + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) L(last_2x_vec): - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3_check) cmpl $VEC_SIZE, %edx - jbe L(zero) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 2), %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax - VZEROUPPER_RETURN - - .p2align 4 -L(last_vec_x0): - bsrl %eax, %eax - addq %rdi, %rax - VZEROUPPER_RETURN + jbe L(ret_vec_x0_test) + + testl %ecx, %ecx + jnz L(ret_vec_x0) + + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + /* 64-bit lzcnt. This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret - .p2align 4 -L(last_vec_x1): - bsrl %eax, %eax - addl $VEC_SIZE, %eax - addq %rdi, %rax - VZEROUPPER_RETURN - .p2align 4 -L(last_vec_x2): - bsrl %eax, %eax - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + /* Inexpensive place to put this regarding code size / target alignments + / ICache NLP. Necessary for 2-byte encoding of jump to page cross + case which in turn is necessary for the hot path (len <= VEC_SIZE) to fit + in the first cache line. */ +L(page_cross): + movq %rax, %rsi + andq $-VEC_SIZE, %rsi + vpcmpeqb (%rsi), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + movl %eax, %r8d + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ + notl %r8d + shlxl %r8d, %ecx, %ecx + cmpq %rdi, %rsi + ja L(more_1x_vec) + lzcntl %ecx, %ecx + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret + .p2align 4,, 11 +L(ret_vec_x1): + /* This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + subq %rcx, %rax + VZEROUPPER_RETURN + .p2align 4,, 10 +L(more_2x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) - .p2align 4 -L(last_vec_x3): - bsrl %eax, %eax - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1) - .p2align 4 -L(last_vec_x1_check): - bsrl %eax, %eax - subq $(VEC_SIZE * 3), %rdx - addq %rax, %rdx - jl L(zero) - addl $VEC_SIZE, %eax - addq %rdi, %rax - VZEROUPPER_RETURN - .p2align 4 -L(last_vec_x3_check): - bsrl %eax, %eax - subq $VEC_SIZE, %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - VZEROUPPER_RETURN + /* Needed no matter what. */ + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - .p2align 4 -L(zero): - xorl %eax, %eax - VZEROUPPER_RETURN + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) + + cmpl $(VEC_SIZE * -1), %edx + jle L(ret_vec_x2_test) + +L(last_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) + + /* Needed no matter what. */ + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 3), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_2) + ret - .p2align 4 -L(null): + /* First in aligning bytes. 
*/ +L(zero_2): xorl %eax, %eax ret - .p2align 4 -L(last_vec_or_less_aligned): - movl %edx, %ecx + .p2align 4,, 4 +L(ret_vec_x2_test): + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_2) + ret - vpcmpeqb (%rdi), %ymm0, %ymm1 - movl $1, %edx - /* Support rdx << 32. */ - salq %cl, %rdx - subq $1, %rdx + .p2align 4,, 11 +L(ret_vec_x2): + /* ecx must be non-zero. */ + bsrl %ecx, %ecx + leaq (VEC_SIZE * -3 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - vpmovmskb %ymm1, %eax + .p2align 4,, 14 +L(ret_vec_x3): + /* ecx must be non-zero. */ + bsrl %ecx, %ecx + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - /* Remove the trailing bytes. */ - andl %edx, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - addq %rdi, %rax - VZEROUPPER_RETURN .p2align 4 -L(last_vec_or_less): - addl $VEC_SIZE, %edx +L(more_4x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) - /* Check for zero length. */ - testl %edx, %edx - jz L(null) + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(last_vec_or_less_aligned) + testl %ecx, %ecx + jnz L(ret_vec_x3) - movl %ecx, %esi - movl %ecx, %r8d - addl %edx, %esi - andq $-VEC_SIZE, %rdi + /* Check if near end before re-aligning (otherwise might do an + unnecessary loop iteration). */ + addq $-(VEC_SIZE * 4), %rax + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) - subl $VEC_SIZE, %esi - ja L(last_vec_2x_aligned) + /* Align rax to (VEC_SIZE * 4 - 1). */ + orq $(VEC_SIZE * 4 - 1), %rax + movq %rdi, %rdx + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because + lengths that overflow can be valid and break the comparison. */ + orq $(VEC_SIZE * 4 - 1), %rdx - /* Check the last VEC. */ - vpcmpeqb (%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - - /* Remove the leading and trailing bytes. */ - sarl %cl, %eax - movl %edx, %ecx + .p2align 4 +L(loop_4x_vec): + /* Need this comparison next no matter what. */ + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm2 + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm3 + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm4 - movl $1, %edx - sall %cl, %edx - subl $1, %edx + vpor %ymm1, %ymm2, %ymm2 + vpor %ymm3, %ymm4, %ymm4 + vpor %ymm2, %ymm4, %ymm4 + vpmovmskb %ymm4, %esi - andl %edx, %eax - testl %eax, %eax - jz L(zero) + testl %esi, %esi + jnz L(loop_end) - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax - VZEROUPPER_RETURN + addq $(VEC_SIZE * -4), %rax + cmpq %rdx, %rax + jne L(loop_4x_vec) - .p2align 4 -L(last_vec_2x_aligned): - movl %esi, %ecx + subl %edi, %edx + incl %edx - /* Check the last VEC. */ - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm1 +L(last_4x_vec): + /* Used no matter what. */ + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) - vpmovmskb %ymm1, %eax + testl %ecx, %ecx + jnz L(ret_vec_x0_end) - /* Remove the trailing bytes. */ - andl %edx, %eax + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) - testl %eax, %eax - jnz L(last_vec_x1) + /* Used no matter what. */ + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - /* Check the second last VEC. 
*/ - vpcmpeqb (%rdi), %ymm0, %ymm1 + cmpl $(VEC_SIZE * 3), %edx + ja L(last_vec) + + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + jbe L(ret0) + xorl %eax, %eax +L(ret0): + ret - movl %r8d, %ecx - vpmovmskb %ymm1, %eax + .p2align 4 +L(loop_end): + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x0_end) + + vpmovmskb %ymm2, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) + + vpmovmskb %ymm3, %ecx + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If ecx is non-zero + then CHAR in VEC3 and bsrq will use that position. */ + salq $32, %rcx + orq %rsi, %rcx + bsrq %rcx, %rcx + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - /* Remove the leading bytes. Must use unsigned right shift for - bsrl below. */ - shrl %cl, %eax - testl %eax, %eax - jz L(zero) + .p2align 4,, 4 +L(ret_vec_x1_end): + /* 64-bit version will automatically add 32 (VEC_SIZE). */ + lzcntq %rcx, %rcx + subq %rcx, %rax + VZEROUPPER_RETURN - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax + .p2align 4,, 4 +L(ret_vec_x0_end): + lzcntl %ecx, %ecx + subq %rcx, %rax VZEROUPPER_RETURN -END (MEMRCHR) + + /* 2 bytes until next cache line. */ +END(MEMRCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
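The L(loop_end) fold (salq $32 / orq / bsrq) that both the SSE2 and AVX2 versions use also reads cleanly in C. A hedged sketch -- the name is invented for this note, and the caller must guarantee at least one of the two masks is non-zero, exactly as the jnz into L(loop_end) does:

    #include <stdint.h>

    /* mask3/mask4 are the vpmovmskb results for the two lowest vectors
       of the failing loop iteration; base points at the lowest byte
       covered by mask4.  */
    static const unsigned char *
    loop_end_fold (uint32_t mask3, uint32_t mask4,
                   const unsigned char *base)
    {
      /* If mask3 is zero it cannot disturb mask4's bits; if non-zero,
         its bits sit 32 higher, so bsr picks the VEC3 match first.  */
      uint64_t both = ((uint64_t) mask3 << 32) | mask4;
      unsigned int pos = 63 - (unsigned int) __builtin_clzll (both);
      return base + pos;
    }

One bit-scan over the merged value replaces a test-and-branch per vector, which is part of how the branch count in the tail drops relative to the old code.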
* [PATCH v2 7/8] x86: Shrink code size of memchr-avx2.S 2022-06-03 20:04 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (4 preceding siblings ...) 2022-06-03 20:04 ` [PATCH v2 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein @ 2022-06-03 20:04 ` Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 2022-06-03 23:09 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library H.J. Lu 7 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 20:04 UTC (permalink / raw) To: libc-alpha This is not meant as a performance optimization. The previous code was far too liberal in aligning targets and wasted code size unnecessarily. The total code size saving is: 59 bytes There are no major changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 0.967 Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memchr-avx2-rtm.S | 1 + sysdeps/x86_64/multiarch/memchr-avx2.S | 109 +++++++++++---------- 2 files changed, 60 insertions(+), 50 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S index 87b076c7c4..c4d71938c5 100644 --- a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S +++ b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S @@ -2,6 +2,7 @@ # define MEMCHR __memchr_avx2_rtm #endif +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/multiarch/memchr-avx2.S b/sysdeps/x86_64/multiarch/memchr-avx2.S index 75bd7262e0..28a01280ec 100644 --- a/sysdeps/x86_64/multiarch/memchr-avx2.S +++ b/sysdeps/x86_64/multiarch/memchr-avx2.S @@ -57,7 +57,7 @@ # define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) .section SECTION(.text),"ax",@progbits -ENTRY (MEMCHR) +ENTRY_P2ALIGN (MEMCHR, 5) # ifndef USE_AS_RAWMEMCHR /* Check for zero length. */ # ifdef __ILP32__ @@ -87,12 +87,14 @@ ENTRY (MEMCHR) # endif testl %eax, %eax jz L(aligned_more) - tzcntl %eax, %eax + bsfl %eax, %eax addq %rdi, %rax - VZEROUPPER_RETURN +L(return_vzeroupper): + ZERO_UPPER_VEC_REGISTERS_RETURN + # ifndef USE_AS_RAWMEMCHR - .p2align 5 + .p2align 4 L(first_vec_x0): /* Check if first match was before length. */ tzcntl %eax, %eax @@ -100,58 +102,31 @@ L(first_vec_x0): /* NB: Multiply length by 4 to get byte count. */ sall $2, %edx # endif - xorl %ecx, %ecx + COND_VZEROUPPER + /* Use branch instead of cmovcc so L(first_vec_x0) fits in one fetch + block. Branching here as opposed to cmovcc is not that costly. Common + usage of memchr is to check if the return was NULL (if string was + known to contain CHAR user would use rawmemchr). This branch will be + highly correlated with the user branch and can be used by most + modern branch predictors to predict the user branch. */ cmpl %eax, %edx - leaq (%rdi, %rax), %rax - cmovle %rcx, %rax - VZEROUPPER_RETURN - -L(null): - xorl %eax, %eax - ret -# endif - .p2align 4 -L(cross_page_boundary): - /* Save pointer before aligning as its original value is - necessary for computer return address if byte is found or - adjusting length if it is not and this is memchr. */ - movq %rdi, %rcx - /* Align data to VEC_SIZE - 1. ALGN_PTR_REG is rcx for memchr - and rdi for rawmemchr. */ - orq $(VEC_SIZE - 1), %ALGN_PTR_REG - VPCMPEQ -(VEC_SIZE - 1)(%ALGN_PTR_REG), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax -# ifndef USE_AS_RAWMEMCHR - /* Calculate length until end of page (length checked for a - match). 
*/ - leaq 1(%ALGN_PTR_REG), %rsi - subq %RRAW_PTR_REG, %rsi -# ifdef USE_AS_WMEMCHR - /* NB: Divide bytes by 4 to get wchar_t count. */ - shrl $2, %esi -# endif -# endif - /* Remove the leading bytes. */ - sarxl %ERAW_PTR_REG, %eax, %eax -# ifndef USE_AS_RAWMEMCHR - /* Check the end of data. */ - cmpq %rsi, %rdx - jbe L(first_vec_x0) + jle L(null) + addq %rdi, %rax + ret # endif - testl %eax, %eax - jz L(cross_page_continue) - tzcntl %eax, %eax - addq %RRAW_PTR_REG, %rax -L(return_vzeroupper): - ZERO_UPPER_VEC_REGISTERS_RETURN - .p2align 4 + .p2align 4,, 10 L(first_vec_x1): - tzcntl %eax, %eax + bsfl %eax, %eax incq %rdi addq %rdi, %rax VZEROUPPER_RETURN - +# ifndef USE_AS_RAWMEMCHR + /* First in aligning bytes here. */ +L(null): + xorl %eax, %eax + ret +# endif .p2align 4 L(first_vec_x2): tzcntl %eax, %eax @@ -340,7 +315,7 @@ L(first_vec_x1_check): incq %rdi addq %rdi, %rax VZEROUPPER_RETURN - .p2align 4 + .p2align 4,, 6 L(set_zero_end): xorl %eax, %eax VZEROUPPER_RETURN @@ -428,5 +403,39 @@ L(last_vec_x3): VZEROUPPER_RETURN # endif + .p2align 4 +L(cross_page_boundary): + /* Save pointer before aligning as its original value is necessary for + computing the return address if a byte is found, or for adjusting the + length if it is not and this is memchr. */ + movq %rdi, %rcx + /* Align data to VEC_SIZE. ALGN_PTR_REG is rcx for memchr and rdi for + rawmemchr. */ + andq $-VEC_SIZE, %ALGN_PTR_REG + VPCMPEQ (%ALGN_PTR_REG), %ymm0, %ymm1 + vpmovmskb %ymm1, %eax +# ifndef USE_AS_RAWMEMCHR + /* Calculate length until end of page (length checked for a match). */ + leal VEC_SIZE(%ALGN_PTR_REG), %esi + subl %ERAW_PTR_REG, %esi +# ifdef USE_AS_WMEMCHR + /* NB: Divide bytes by 4 to get wchar_t count. */ + shrl $2, %esi +# endif +# endif + /* Remove the leading bytes. */ + sarxl %ERAW_PTR_REG, %eax, %eax +# ifndef USE_AS_RAWMEMCHR + /* Check the end of data. */ + cmpq %rsi, %rdx + jbe L(first_vec_x0) +# endif + testl %eax, %eax + jz L(cross_page_continue) + bsfl %eax, %eax + addq %RRAW_PTR_REG, %rax + VZEROUPPER_RETURN + + END (MEMCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
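For contrast with the memrchr page-cross handling earlier in the series: memchr's cross_page_boundary works in the forward direction, so the mask is shifted right rather than left. A sketch under the same caveats (the name is this note's invention; the assembly uses sarxl with the raw pointer itself as the shift count, relying on the hardware masking the count to 5 bits, and the arithmetic shift is harmless there because only the lowest set bit is ever consumed):

    #include <immintrin.h>  /* AVX2 */
    #include <stdint.h>

    /* One in-page aligned load covering p; bit 0 of the result
       corresponds to the byte at p itself.  */
    static uint32_t
    forward_page_cross_mask (const unsigned char *p, __m256i match)
    {
      const __m256i *aligned
        = (const __m256i *) ((uintptr_t) p & ~(uintptr_t) 31);
      uint32_t raw = (uint32_t) _mm256_movemask_epi8
        (_mm256_cmpeq_epi8 (_mm256_load_si256 (aligned), match));
      return raw >> ((uintptr_t) p & 31);  /* drop leading bytes */
    }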
* [PATCH v2 8/8] x86: Shrink code size of memchr-evex.S 2022-06-03 20:04 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (5 preceding siblings ...) 2022-06-03 20:04 ` [PATCH v2 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein @ 2022-06-03 20:04 ` Noah Goldstein 2022-06-03 23:09 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library H.J. Lu 7 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 20:04 UTC (permalink / raw) To: libc-alpha This is not meant as a performance optimization. The previous code was far too liberal in aligning targets and wasted code size unnecessarily. The total code size saving is: 64 bytes There are no non-negligible changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 1.000 Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memchr-evex.S | 46 ++++++++++++++------------ 1 file changed, 25 insertions(+), 21 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memchr-evex.S b/sysdeps/x86_64/multiarch/memchr-evex.S index cfaf02907d..0fd11b7632 100644 --- a/sysdeps/x86_64/multiarch/memchr-evex.S +++ b/sysdeps/x86_64/multiarch/memchr-evex.S @@ -88,7 +88,7 @@ # define PAGE_SIZE 4096 .section SECTION(.text),"ax",@progbits -ENTRY (MEMCHR) +ENTRY_P2ALIGN (MEMCHR, 6) # ifndef USE_AS_RAWMEMCHR /* Check for zero length. */ test %RDX_LP, %RDX_LP @@ -131,22 +131,24 @@ L(zero): xorl %eax, %eax ret - .p2align 5 + .p2align 4 L(first_vec_x0): - /* Check if first match was before length. */ - tzcntl %eax, %eax - xorl %ecx, %ecx - cmpl %eax, %edx - leaq (%rdi, %rax, CHAR_SIZE), %rax - cmovle %rcx, %rax + /* Check if first match was before length. NB: tzcnt has false data- + dependency on destination. eax already had a data-dependency on esi + so this should have no effect here. */ + tzcntl %eax, %esi +# ifdef USE_AS_WMEMCHR + leaq (%rdi, %rsi, CHAR_SIZE), %rdi +# else + addq %rsi, %rdi +# endif + xorl %eax, %eax + cmpl %esi, %edx + cmovg %rdi, %rax ret -# else - /* NB: first_vec_x0 is 17 bytes which will leave - cross_page_boundary (which is relatively cold) close enough - to ideal alignment. So only realign L(cross_page_boundary) if - rawmemchr. */ - .p2align 4 # endif + + .p2align 4 L(cross_page_boundary): /* Save pointer before aligning as its original value is necessary for computer return address if byte is found or @@ -400,10 +402,14 @@ L(last_2x_vec): L(zero_end): ret +L(set_zero_end): + xorl %eax, %eax + ret .p2align 4 L(first_vec_x1_check): - tzcntl %eax, %eax + /* eax must be non-zero. Use bsfl to save code size. */ + bsfl %eax, %eax /* Adjust length. */ subl $-(CHAR_PER_VEC * 4), %edx /* Check if match within remaining length. */ @@ -412,9 +418,6 @@ L(first_vec_x1_check): /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. 
*/ leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax ret -L(set_zero_end): - xorl %eax, %eax - ret .p2align 4 L(loop_4x_vec_end): @@ -464,7 +467,7 @@ L(loop_4x_vec_end): # endif ret - .p2align 4 + .p2align 4,, 10 L(last_vec_x1_return): tzcntl %eax, %eax # if defined USE_AS_WMEMCHR || RET_OFFSET != 0 @@ -496,6 +499,7 @@ L(last_vec_x3_return): # endif # ifndef USE_AS_RAWMEMCHR + .p2align 4,, 5 L(last_4x_vec_or_less_cmpeq): VPCMP $0, (VEC_SIZE * 5)(%rdi), %YMMMATCH, %k0 kmovd %k0, %eax @@ -546,7 +550,7 @@ L(last_4x_vec): # endif andl %ecx, %eax jz L(zero_end2) - tzcntl %eax, %eax + bsfl %eax, %eax leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax L(zero_end2): ret @@ -562,6 +566,6 @@ L(last_vec_x3): leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax ret # endif - + /* 7 bytes from next cache line. */ END (MEMCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
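The reworked L(first_vec_x0) above (tzcnt false-dependency note plus the cmovg) boils down to a bounds-checked select. A byte-wise C sketch, illustrative only -- the name is invented here, and the wmemchr variant additionally scales the position by CHAR_SIZE:

    #include <immintrin.h>  /* _tzcnt_u32, needs BMI1 */
    #include <stdint.h>
    #include <stddef.h>

    /* tzcnt of a zero mask is 32, so "no match at all" and "match past
       len" fail the same test; the assembly materializes the select
       with xorl + cmovg instead of a branch.  */
    static void *
    first_vec_ret (const unsigned char *p, uint32_t mask, size_t len)
    {
      uint32_t pos = _tzcnt_u32 (mask);
      return pos < len ? (void *) (p + pos) : NULL;
    }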
* Re: [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library 2022-06-03 20:04 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (6 preceding siblings ...) 2022-06-03 20:04 ` [PATCH v2 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein @ 2022-06-03 23:09 ` H.J. Lu 2022-06-03 23:49 ` Noah Goldstein 7 siblings, 1 reply; 82+ messages in thread From: H.J. Lu @ 2022-06-03 23:09 UTC (permalink / raw) To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell On Fri, Jun 3, 2022 at 1:04 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > This patch does not touch any existing code and is only meant to be a > tool for future patches so that simple source files can more easily be > maintained to target multiple VEC classes. > > There is no difference in the objdump of libc.so before and after this > patch. > --- > sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 33 +++++++++ > sysdeps/x86_64/multiarch/avx-vecs.h | 53 ++++++++++++++ > sysdeps/x86_64/multiarch/avx2-rtm-vecs.h | 33 +++++++++ > sysdeps/x86_64/multiarch/avx2-vecs.h | 30 ++++++++ > sysdeps/x86_64/multiarch/evex256-vecs.h | 50 +++++++++++++ > sysdeps/x86_64/multiarch/evex512-vecs.h | 49 +++++++++++++ > sysdeps/x86_64/multiarch/sse2-vecs.h | 48 +++++++++++++ > sysdeps/x86_64/multiarch/vec-macros.h | 90 ++++++++++++++++++++++++ > 8 files changed, 386 insertions(+) > create mode 100644 sysdeps/x86_64/multiarch/avx-rtm-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/avx-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/avx2-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/evex256-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/evex512-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/sse2-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/vec-macros.h > > diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > new file mode 100644 > index 0000000000..c00b83ea0e > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > @@ -0,0 +1,33 @@ > +/* Common config for AVX-RTM VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. 
*/ > + > +#ifndef _AVX_RTM_VECS_H > +#define _AVX_RTM_VECS_H 1 > + > +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ > + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > + > +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) > + > +#define SECTION(p) p##.avx.rtm > + > +#define USE_WITH_RTM 1 > +#include "avx-vecs.h" > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/avx-vecs.h b/sysdeps/x86_64/multiarch/avx-vecs.h > new file mode 100644 > index 0000000000..3b84d7e8b2 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/avx-vecs.h > @@ -0,0 +1,53 @@ > +/* Common config for AVX VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _AVX_VECS_H > +#define _AVX_VECS_H 1 > + > +#ifdef HAS_VEC > +# error "Multiple VEC configs included!" > +#endif > + > +#define HAS_VEC 1 > +#include "vec-macros.h" > + > +#ifndef USE_WITH_AVX2 > +# define USE_WITH_AVX 1 > +#endif > +/* Included by RTM version. */ > +#ifndef SECTION > +# define SECTION(p) p##.avx > +#endif Can SECTION be defined unconditionally? If a different SECTION is needed, you can undef it first, > + > +#define VEC_SIZE 32 > +/* 4-byte mov instructions with AVX2. */ > +#define MOV_SIZE 4 > +/* 1 (ret) + 3 (vzeroupper). */ > +#define RET_SIZE 4 > +#define VZEROUPPER vzeroupper > + > +#define VMOVU vmovdqu > +#define VMOVA vmovdqa > +#define VMOVNT vmovntdq > + > +/* Often need to access xmm portion. */ > +#define VEC_xmm VEC_any_xmm > +#define VEC VEC_any_ymm Can we check VEC or VEC_SIZE instead of HAS_VEC? > + > +#endif Do we need both AVX and AVX2? Will AVX2 be sufficient? > diff --git a/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > new file mode 100644 > index 0000000000..a5d46e8c66 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > @@ -0,0 +1,33 @@ > +/* Common config for AVX2-RTM VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. 
*/ > + > +#ifndef _AVX2_RTM_VECS_H > +#define _AVX2_RTM_VECS_H 1 > + > +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ > + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > + > +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) > + > +#define SECTION(p) p##.avx.rtm > + > +#define USE_WITH_RTM 1 > +#include "avx2-vecs.h" > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/avx2-vecs.h b/sysdeps/x86_64/multiarch/avx2-vecs.h > new file mode 100644 > index 0000000000..4c029b4621 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/avx2-vecs.h > @@ -0,0 +1,30 @@ > +/* Common config for AVX2 VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _AVX2_VECS_H > +#define _AVX2_VECS_H 1 > + > +#define USE_WITH_AVX2 1 > +/* Included by RTM version. */ > +#ifndef SECTION > +# define SECTION(p) p##.avx > +#endif > +#include "avx-vecs.h" > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/evex256-vecs.h b/sysdeps/x86_64/multiarch/evex256-vecs.h > new file mode 100644 > index 0000000000..ed7a32b0ec > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/evex256-vecs.h > @@ -0,0 +1,50 @@ > +/* Common config for EVEX256 VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _EVEX256_VECS_H > +#define _EVEX256_VECS_H 1 > + > +#ifdef HAS_VEC > +# error "Multiple VEC configs included!" > +#endif > + > +#define HAS_VEC 1 > +#include "vec-macros.h" > + > +#define USE_WITH_EVEX256 1 > +#ifndef SECTION > +# define SECTION(p) p##.evex > +#endif > + > +#define VEC_SIZE 32 > +/* 6-byte mov instructions with EVEX. */ > +#define MOV_SIZE 6 > +/* No vzeroupper needed. */ > +#define RET_SIZE 1 > +#define VZEROUPPER > + > +#define VMOVU vmovdqu64 > +#define VMOVA vmovdqa64 > +#define VMOVNT vmovntdq > + > +/* Often need to access xmm portion. */ > +#define VEC_xmm VEC_hi_xmm > +#define VEC VEC_hi_ymm Can we add evex-vecs.h for common macros? 
> +#endif > diff --git a/sysdeps/x86_64/multiarch/evex512-vecs.h b/sysdeps/x86_64/multiarch/evex512-vecs.h > new file mode 100644 > index 0000000000..53597734fc > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/evex512-vecs.h > @@ -0,0 +1,49 @@ > +/* Common config for EVEX512 VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _EVEX512_VECS_H > +#define _EVEX512_VECS_H 1 > + > +#ifdef HAS_VEC > +# error "Multiple VEC configs included!" > +#endif > + > +#define HAS_VEC 1 > +#include "vec-macros.h" > + > +#define USE_WITH_EVEX512 1 > +#define SECTION(p) p##.evex512 > + > +#define VEC_SIZE 64 > +/* 6-byte mov instructions with EVEX. */ > +#define MOV_SIZE 6 > +/* No vzeroupper needed. */ > +#define RET_SIZE 1 > +#define VZEROUPPER > + > +#define VMOVU vmovdqu64 > +#define VMOVA vmovdqa64 > +#define VMOVNT vmovntdq > + > +/* Often need to access xmm/ymm portion. */ > +#define VEC_xmm VEC_hi_xmm > +#define VEC_ymm VEC_hi_ymm > +#define VEC VEC_hi_zmm > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/sse2-vecs.h b/sysdeps/x86_64/multiarch/sse2-vecs.h > new file mode 100644 > index 0000000000..b645b93e3d > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/sse2-vecs.h > @@ -0,0 +1,48 @@ > +/* Common config for SSE2 VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _SSE2_VECS_H > +#define _SSE2_VECS_H 1 > + > +#ifdef HAS_VEC > +# error "Multiple VEC configs included!" > +#endif > + > +#define HAS_VEC 1 > +#include "vec-macros.h" > + > +#define USE_WITH_SSE2 1 > +#define SECTION(p) p > + > +#define VEC_SIZE 16 > +/* 3-byte mov instructions with SSE2. */ > +#define MOV_SIZE 3 > +/* No vzeroupper needed. 
*/ > +#define RET_SIZE 1 > + > +#define VMOVU movups > +#define VMOVA movaps > +#define VMOVNT movntdq > +#define VZEROUPPER > + > +#define VEC_xmm VEC_any_xmm > +#define VEC VEC_any_xmm > + > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/vec-macros.h b/sysdeps/x86_64/multiarch/vec-macros.h > new file mode 100644 > index 0000000000..4dae4503c8 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/vec-macros.h > @@ -0,0 +1,90 @@ > +/* Macro helpers for VEC_{type}({vec_num}) > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _VEC_MACROS_H > +# define _VEC_MACROS_H 1 Remove a space after #. > + > +# ifndef HAS_VEC > +# error "Never include this file directly. Always include a vector config." > +# endif Remove a space after #. > + > +/* Defines so we can use SSE2 / AVX2 / EVEX / EVEX512 encoding with same > + VEC(N) values. */ > +#define VEC_hi_xmm0 xmm16 > +#define VEC_hi_xmm1 xmm17 > +#define VEC_hi_xmm2 xmm18 > +#define VEC_hi_xmm3 xmm19 > +#define VEC_hi_xmm4 xmm20 > +#define VEC_hi_xmm5 xmm21 > +#define VEC_hi_xmm6 xmm22 > +#define VEC_hi_xmm7 xmm23 > +#define VEC_hi_xmm8 xmm24 > +#define VEC_hi_xmm9 xmm25 > +#define VEC_hi_xmm10 xmm26 > +#define VEC_hi_xmm11 xmm27 > +#define VEC_hi_xmm12 xmm28 > +#define VEC_hi_xmm13 xmm29 > +#define VEC_hi_xmm14 xmm30 > +#define VEC_hi_xmm15 xmm31 > + > +#define VEC_hi_ymm0 ymm16 > +#define VEC_hi_ymm1 ymm17 > +#define VEC_hi_ymm2 ymm18 > +#define VEC_hi_ymm3 ymm19 > +#define VEC_hi_ymm4 ymm20 > +#define VEC_hi_ymm5 ymm21 > +#define VEC_hi_ymm6 ymm22 > +#define VEC_hi_ymm7 ymm23 > +#define VEC_hi_ymm8 ymm24 > +#define VEC_hi_ymm9 ymm25 > +#define VEC_hi_ymm10 ymm26 > +#define VEC_hi_ymm11 ymm27 > +#define VEC_hi_ymm12 ymm28 > +#define VEC_hi_ymm13 ymm29 > +#define VEC_hi_ymm14 ymm30 > +#define VEC_hi_ymm15 ymm31 > + > +#define VEC_hi_zmm0 zmm16 > +#define VEC_hi_zmm1 zmm17 > +#define VEC_hi_zmm2 zmm18 > +#define VEC_hi_zmm3 zmm19 > +#define VEC_hi_zmm4 zmm20 > +#define VEC_hi_zmm5 zmm21 > +#define VEC_hi_zmm6 zmm22 > +#define VEC_hi_zmm7 zmm23 > +#define VEC_hi_zmm8 zmm24 > +#define VEC_hi_zmm9 zmm25 > +#define VEC_hi_zmm10 zmm26 > +#define VEC_hi_zmm11 zmm27 > +#define VEC_hi_zmm12 zmm28 > +#define VEC_hi_zmm13 zmm29 > +#define VEC_hi_zmm14 zmm30 > +#define VEC_hi_zmm15 zmm31 > + > +# define PRIMITIVE_VEC(vec, num) vec##num > + > +# define VEC_any_xmm(i) PRIMITIVE_VEC(xmm, i) > +# define VEC_any_ymm(i) PRIMITIVE_VEC(ymm, i) > +# define VEC_any_zmm(i) PRIMITIVE_VEC(zmm, i) > + > +# define VEC_hi_xmm(i) PRIMITIVE_VEC(VEC_hi_xmm, i) > +# define VEC_hi_ymm(i) PRIMITIVE_VEC(VEC_hi_ymm, i) > +# define VEC_hi_zmm(i) PRIMITIVE_VEC(VEC_hi_zmm, i) > + > +#endif > -- > 2.34.1 > -- H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library 2022-06-03 23:09 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library H.J. Lu @ 2022-06-03 23:49 ` Noah Goldstein 0 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 23:49 UTC (permalink / raw) To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell On Fri, Jun 3, 2022 at 6:10 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Fri, Jun 3, 2022 at 1:04 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > This patch does not touch any existing code and is only meant to be a > > tool for future patches so that simple source files can more easily be > > maintained to target multiple VEC classes. > > > > There is no difference in the objdump of libc.so before and after this > > patch. > > --- > > sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 33 +++++++++ > > sysdeps/x86_64/multiarch/avx-vecs.h | 53 ++++++++++++++ > > sysdeps/x86_64/multiarch/avx2-rtm-vecs.h | 33 +++++++++ > > sysdeps/x86_64/multiarch/avx2-vecs.h | 30 ++++++++ > > sysdeps/x86_64/multiarch/evex256-vecs.h | 50 +++++++++++++ > > sysdeps/x86_64/multiarch/evex512-vecs.h | 49 +++++++++++++ > > sysdeps/x86_64/multiarch/sse2-vecs.h | 48 +++++++++++++ > > sysdeps/x86_64/multiarch/vec-macros.h | 90 ++++++++++++++++++++++++ > > 8 files changed, 386 insertions(+) > > create mode 100644 sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > create mode 100644 sysdeps/x86_64/multiarch/avx-vecs.h > > create mode 100644 sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > > create mode 100644 sysdeps/x86_64/multiarch/avx2-vecs.h > > create mode 100644 sysdeps/x86_64/multiarch/evex256-vecs.h > > create mode 100644 sysdeps/x86_64/multiarch/evex512-vecs.h > > create mode 100644 sysdeps/x86_64/multiarch/sse2-vecs.h > > create mode 100644 sysdeps/x86_64/multiarch/vec-macros.h > > > > diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > new file mode 100644 > > index 0000000000..c00b83ea0e > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > @@ -0,0 +1,33 @@ > > +/* Common config for AVX-RTM VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. 
*/ > > + > > +#ifndef _AVX_RTM_VECS_H > > +#define _AVX_RTM_VECS_H 1 > > + > > +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > + > > +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) > > + > > +#define SECTION(p) p##.avx.rtm > > + > > +#define USE_WITH_RTM 1 > > +#include "avx-vecs.h" > > + > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/avx-vecs.h b/sysdeps/x86_64/multiarch/avx-vecs.h > > new file mode 100644 > > index 0000000000..3b84d7e8b2 > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/avx-vecs.h > > @@ -0,0 +1,53 @@ > > +/* Common config for AVX VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. */ > > + > > +#ifndef _AVX_VECS_H > > +#define _AVX_VECS_H 1 > > + > > +#ifdef HAS_VEC > > +# error "Multiple VEC configs included!" > > +#endif > > + > > +#define HAS_VEC 1 > > +#include "vec-macros.h" > > + > > +#ifndef USE_WITH_AVX2 > > +# define USE_WITH_AVX 1 > > +#endif > > +/* Included by RTM version. */ > > +#ifndef SECTION > > +# define SECTION(p) p##.avx > > +#endif > > Can SECTION be defined unconditionally? If a different SECTION > is needed, you can undef it first, Fixed in V2. > > > + > > +#define VEC_SIZE 32 > > +/* 4-byte mov instructions with AVX2. */ > > +#define MOV_SIZE 4 > > +/* 1 (ret) + 3 (vzeroupper). */ > > +#define RET_SIZE 4 > > +#define VZEROUPPER vzeroupper > > + > > +#define VMOVU vmovdqu > > +#define VMOVA vmovdqa > > +#define VMOVNT vmovntdq > > + > > +/* Often need to access xmm portion. */ > > +#define VEC_xmm VEC_any_xmm > > +#define VEC VEC_any_ymm > > Can we check VEC or VEC_SIZE instead of HAS_VEC? Changed in V2. > > > + > > +#endif > > Do we need both AVX and AVX2? Will AVX2 be sufficient? Removed avx2 version in V2. > > > diff --git a/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > > new file mode 100644 > > index 0000000000..a5d46e8c66 > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > > @@ -0,0 +1,33 @@ > > +/* Common config for AVX2-RTM VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. */ > > + > > +#ifndef _AVX2_RTM_VECS_H > > +#define _AVX2_RTM_VECS_H 1 > > + > > +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > + > > +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) > > + > > +#define SECTION(p) p##.avx.rtm > > + > > +#define USE_WITH_RTM 1 > > +#include "avx2-vecs.h" > > + > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/avx2-vecs.h b/sysdeps/x86_64/multiarch/avx2-vecs.h > > new file mode 100644 > > index 0000000000..4c029b4621 > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/avx2-vecs.h > > @@ -0,0 +1,30 @@ > > +/* Common config for AVX2 VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. */ > > + > > +#ifndef _AVX2_VECS_H > > +#define _AVX2_VECS_H 1 > > + > > +#define USE_WITH_AVX2 1 > > +/* Included by RTM version. */ > > +#ifndef SECTION > > +# define SECTION(p) p##.avx > > +#endif > > +#include "avx-vecs.h" > > + > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/evex256-vecs.h b/sysdeps/x86_64/multiarch/evex256-vecs.h > > new file mode 100644 > > index 0000000000..ed7a32b0ec > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/evex256-vecs.h > > @@ -0,0 +1,50 @@ > > +/* Common config for EVEX256 VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. */ > > + > > +#ifndef _EVEX256_VECS_H > > +#define _EVEX256_VECS_H 1 > > + > > +#ifdef HAS_VEC > > +# error "Multiple VEC configs included!" 
> > +#endif > > + > > +#define HAS_VEC 1 > > +#include "vec-macros.h" > > + > > +#define USE_WITH_EVEX256 1 > > +#ifndef SECTION > > +# define SECTION(p) p##.evex > > +#endif > > + > > +#define VEC_SIZE 32 > > +/* 6-byte mov instructions with EVEX. */ > > +#define MOV_SIZE 6 > > +/* No vzeroupper needed. */ > > +#define RET_SIZE 1 > > +#define VZEROUPPER > > + > > +#define VMOVU vmovdqu64 > > +#define VMOVA vmovdqa64 > > +#define VMOVNT vmovntdq > > + > > +/* Often need to access xmm portion. */ > > +#define VEC_xmm VEC_hi_xmm > > +#define VEC VEC_hi_ymm > > Can we add evex-vecs.h for common macros? Done in V2. > > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/evex512-vecs.h b/sysdeps/x86_64/multiarch/evex512-vecs.h > > new file mode 100644 > > index 0000000000..53597734fc > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/evex512-vecs.h > > @@ -0,0 +1,49 @@ > > +/* Common config for EVEX512 VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. */ > > + > > +#ifndef _EVEX512_VECS_H > > +#define _EVEX512_VECS_H 1 > > + > > +#ifdef HAS_VEC > > +# error "Multiple VEC configs included!" > > +#endif > > + > > +#define HAS_VEC 1 > > +#include "vec-macros.h" > > + > > +#define USE_WITH_EVEX512 1 > > +#define SECTION(p) p##.evex512 > > + > > +#define VEC_SIZE 64 > > +/* 6-byte mov instructions with EVEX. */ > > +#define MOV_SIZE 6 > > +/* No vzeroupper needed. */ > > +#define RET_SIZE 1 > > +#define VZEROUPPER > > + > > +#define VMOVU vmovdqu64 > > +#define VMOVA vmovdqa64 > > +#define VMOVNT vmovntdq > > + > > +/* Often need to access xmm/ymm portion. */ > > +#define VEC_xmm VEC_hi_xmm > > +#define VEC_ymm VEC_hi_ymm > > +#define VEC VEC_hi_zmm > > + > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/sse2-vecs.h b/sysdeps/x86_64/multiarch/sse2-vecs.h > > new file mode 100644 > > index 0000000000..b645b93e3d > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/sse2-vecs.h > > @@ -0,0 +1,48 @@ > > +/* Common config for SSE2 VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. 
> > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. */ > > + > > +#ifndef _SSE2_VECS_H > > +#define _SSE2_VECS_H 1 > > + > > +#ifdef HAS_VEC > > +# error "Multiple VEC configs included!" > > +#endif > > + > > +#define HAS_VEC 1 > > +#include "vec-macros.h" > > + > > +#define USE_WITH_SSE2 1 > > +#define SECTION(p) p > > + > > +#define VEC_SIZE 16 > > +/* 3-byte mov instructions with SSE2. */ > > +#define MOV_SIZE 3 > > +/* No vzeroupper needed. */ > > +#define RET_SIZE 1 > > + > > +#define VMOVU movups > > +#define VMOVA movaps > > +#define VMOVNT movntdq > > +#define VZEROUPPER > > + > > +#define VEC_xmm VEC_any_xmm > > +#define VEC VEC_any_xmm > > + > > + > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/vec-macros.h b/sysdeps/x86_64/multiarch/vec-macros.h > > new file mode 100644 > > index 0000000000..4dae4503c8 > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/vec-macros.h > > @@ -0,0 +1,90 @@ > > +/* Macro helpers for VEC_{type}({vec_num}) > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. */ > > + > > +#ifndef _VEC_MACROS_H > > +# define _VEC_MACROS_H 1 > > Remove a space after #. Fixed in V2. > > > + > > +# ifndef HAS_VEC > > +# error "Never include this file directly. Always include a vector config." > > +# endif > > Remove a space after #. Fixed in V2. > > > + > > +/* Defines so we can use SSE2 / AVX2 / EVEX / EVEX512 encoding with same > > + VEC(N) values. 
*/ > > +#define VEC_hi_xmm0 xmm16 > > +#define VEC_hi_xmm1 xmm17 > > +#define VEC_hi_xmm2 xmm18 > > +#define VEC_hi_xmm3 xmm19 > > +#define VEC_hi_xmm4 xmm20 > > +#define VEC_hi_xmm5 xmm21 > > +#define VEC_hi_xmm6 xmm22 > > +#define VEC_hi_xmm7 xmm23 > > +#define VEC_hi_xmm8 xmm24 > > +#define VEC_hi_xmm9 xmm25 > > +#define VEC_hi_xmm10 xmm26 > > +#define VEC_hi_xmm11 xmm27 > > +#define VEC_hi_xmm12 xmm28 > > +#define VEC_hi_xmm13 xmm29 > > +#define VEC_hi_xmm14 xmm30 > > +#define VEC_hi_xmm15 xmm31 > > + > > +#define VEC_hi_ymm0 ymm16 > > +#define VEC_hi_ymm1 ymm17 > > +#define VEC_hi_ymm2 ymm18 > > +#define VEC_hi_ymm3 ymm19 > > +#define VEC_hi_ymm4 ymm20 > > +#define VEC_hi_ymm5 ymm21 > > +#define VEC_hi_ymm6 ymm22 > > +#define VEC_hi_ymm7 ymm23 > > +#define VEC_hi_ymm8 ymm24 > > +#define VEC_hi_ymm9 ymm25 > > +#define VEC_hi_ymm10 ymm26 > > +#define VEC_hi_ymm11 ymm27 > > +#define VEC_hi_ymm12 ymm28 > > +#define VEC_hi_ymm13 ymm29 > > +#define VEC_hi_ymm14 ymm30 > > +#define VEC_hi_ymm15 ymm31 > > + > > +#define VEC_hi_zmm0 zmm16 > > +#define VEC_hi_zmm1 zmm17 > > +#define VEC_hi_zmm2 zmm18 > > +#define VEC_hi_zmm3 zmm19 > > +#define VEC_hi_zmm4 zmm20 > > +#define VEC_hi_zmm5 zmm21 > > +#define VEC_hi_zmm6 zmm22 > > +#define VEC_hi_zmm7 zmm23 > > +#define VEC_hi_zmm8 zmm24 > > +#define VEC_hi_zmm9 zmm25 > > +#define VEC_hi_zmm10 zmm26 > > +#define VEC_hi_zmm11 zmm27 > > +#define VEC_hi_zmm12 zmm28 > > +#define VEC_hi_zmm13 zmm29 > > +#define VEC_hi_zmm14 zmm30 > > +#define VEC_hi_zmm15 zmm31 > > + > > +# define PRIMITIVE_VEC(vec, num) vec##num > > + > > +# define VEC_any_xmm(i) PRIMITIVE_VEC(xmm, i) > > +# define VEC_any_ymm(i) PRIMITIVE_VEC(ymm, i) > > +# define VEC_any_zmm(i) PRIMITIVE_VEC(zmm, i) > > + > > +# define VEC_hi_xmm(i) PRIMITIVE_VEC(VEC_hi_xmm, i) > > +# define VEC_hi_ymm(i) PRIMITIVE_VEC(VEC_hi_ymm, i) > > +# define VEC_hi_zmm(i) PRIMITIVE_VEC(VEC_hi_zmm, i) > > + Removed spaces here as well in V2. > > +#endif > > -- > > 2.34.1 > > > > > -- > H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH v3 1/8] x86: Create header for VEC classes in x86 strings library 2022-06-03 4:42 ` [PATCH v1 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein @ 2022-06-03 23:49 ` Noah Goldstein 2022-06-03 23:49 ` [PATCH v3 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein ` (6 more replies) 2022-06-06 22:37 ` [PATCH v4 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (2 subsequent siblings) 4 siblings, 7 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 23:49 UTC (permalink / raw) To: libc-alpha This patch does not touch any existing code and is only meant to be a tool for future patches so that simple source files can more easily be maintained to target multiple VEC classes. There is no difference in the objdump of libc.so before and after this patch. --- sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 34 ++++++++ sysdeps/x86_64/multiarch/avx-vecs.h | 47 +++++++++++ sysdeps/x86_64/multiarch/evex-vecs-common.h | 39 +++++++++ sysdeps/x86_64/multiarch/evex256-vecs.h | 35 ++++++++ sysdeps/x86_64/multiarch/evex512-vecs.h | 35 ++++++++ sysdeps/x86_64/multiarch/sse2-vecs.h | 47 +++++++++++ sysdeps/x86_64/multiarch/vec-macros.h | 90 +++++++++++++++++++++ 7 files changed, 327 insertions(+) create mode 100644 sysdeps/x86_64/multiarch/avx-rtm-vecs.h create mode 100644 sysdeps/x86_64/multiarch/avx-vecs.h create mode 100644 sysdeps/x86_64/multiarch/evex-vecs-common.h create mode 100644 sysdeps/x86_64/multiarch/evex256-vecs.h create mode 100644 sysdeps/x86_64/multiarch/evex512-vecs.h create mode 100644 sysdeps/x86_64/multiarch/sse2-vecs.h create mode 100644 sysdeps/x86_64/multiarch/vec-macros.h diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h new file mode 100644 index 0000000000..3f531dd47f --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h @@ -0,0 +1,34 @@ +/* Common config for AVX-RTM VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX_RTM_VECS_H +#define _AVX_RTM_VECS_H 1 + +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST + +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) + +#define USE_WITH_RTM 1 +#include "avx-vecs.h" + +#undef SECTION +#define SECTION(p) p##.avx.rtm + +#endif diff --git a/sysdeps/x86_64/multiarch/avx-vecs.h b/sysdeps/x86_64/multiarch/avx-vecs.h new file mode 100644 index 0000000000..89680f5db8 --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx-vecs.h @@ -0,0 +1,47 @@ +/* Common config for AVX VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX_VECS_H +#define _AVX_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 32 +#include "vec-macros.h" + +#define USE_WITH_AVX 1 +#define SECTION(p) p##.avx + +/* 4-byte mov instructions with AVX2. */ +#define MOV_SIZE 4 +/* 1 (ret) + 3 (vzeroupper). */ +#define RET_SIZE 4 +#define VZEROUPPER vzeroupper + +#define VMOVU vmovdqu +#define VMOVA vmovdqa +#define VMOVNT vmovntdq + +/* Often need to access xmm portion. */ +#define VEC_xmm VEC_any_xmm +#define VEC VEC_any_ymm + +#endif diff --git a/sysdeps/x86_64/multiarch/evex-vecs-common.h b/sysdeps/x86_64/multiarch/evex-vecs-common.h new file mode 100644 index 0000000000..99806ebcd7 --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex-vecs-common.h @@ -0,0 +1,39 @@ +/* Common config for EVEX256 and EVEX512 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX_VECS_COMMON_H +#define _EVEX_VECS_COMMON_H 1 + +#include "vec-macros.h" + +/* 6-byte mov instructions with EVEX. */ +#define MOV_SIZE 6 +/* No vzeroupper needed. */ +#define RET_SIZE 1 +#define VZEROUPPER + +#define VMOVU vmovdqu64 +#define VMOVA vmovdqa64 +#define VMOVNT vmovntdq + +#define VEC_xmm VEC_hi_xmm +#define VEC_ymm VEC_hi_ymm +#define VEC_zmm VEC_hi_zmm + +#endif diff --git a/sysdeps/x86_64/multiarch/evex256-vecs.h b/sysdeps/x86_64/multiarch/evex256-vecs.h new file mode 100644 index 0000000000..222ba46dc7 --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex256-vecs.h @@ -0,0 +1,35 @@ +/* Common config for EVEX256 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX256_VECS_H +#define _EVEX256_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 32 +#include "evex-vecs-common.h" + +#define USE_WITH_EVEX256 1 +#define SECTION(p) p##.evex + +#define VEC VEC_ymm + +#endif diff --git a/sysdeps/x86_64/multiarch/evex512-vecs.h b/sysdeps/x86_64/multiarch/evex512-vecs.h new file mode 100644 index 0000000000..d1784d5368 --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex512-vecs.h @@ -0,0 +1,35 @@ +/* Common config for EVEX512 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX512_VECS_H +#define _EVEX512_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 64 +#include "evex-vecs-common.h" + +#define USE_WITH_EVEX512 1 +#define SECTION(p) p##.evex512 + +#define VEC VEC_zmm + +#endif diff --git a/sysdeps/x86_64/multiarch/sse2-vecs.h b/sysdeps/x86_64/multiarch/sse2-vecs.h new file mode 100644 index 0000000000..2b77a59d56 --- /dev/null +++ b/sysdeps/x86_64/multiarch/sse2-vecs.h @@ -0,0 +1,47 @@ +/* Common config for SSE2 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _SSE2_VECS_H +#define _SSE2_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 16 +#include "vec-macros.h" + +#define USE_WITH_SSE2 1 +#define SECTION(p) p + +/* 3-byte mov instructions with SSE2. */ +#define MOV_SIZE 3 +/* No vzeroupper needed. 
*/ +#define RET_SIZE 1 +#define VZEROUPPER + +#define VMOVU movups +#define VMOVA movaps +#define VMOVNT movntdq + +#define VEC_xmm VEC_any_xmm +#define VEC VEC_any_xmm + + +#endif diff --git a/sysdeps/x86_64/multiarch/vec-macros.h b/sysdeps/x86_64/multiarch/vec-macros.h new file mode 100644 index 0000000000..9f3ffecede --- /dev/null +++ b/sysdeps/x86_64/multiarch/vec-macros.h @@ -0,0 +1,90 @@ +/* Macro helpers for VEC_{type}({vec_num}) + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _VEC_MACROS_H +#define _VEC_MACROS_H 1 + +#ifndef VEC_SIZE +# error "Never include this file directly. Always include a vector config." +#endif + +/* Defines so we can use SSE2 / AVX2 / EVEX / EVEX512 encoding with same + VEC(N) values. */ +#define VEC_hi_xmm0 xmm16 +#define VEC_hi_xmm1 xmm17 +#define VEC_hi_xmm2 xmm18 +#define VEC_hi_xmm3 xmm19 +#define VEC_hi_xmm4 xmm20 +#define VEC_hi_xmm5 xmm21 +#define VEC_hi_xmm6 xmm22 +#define VEC_hi_xmm7 xmm23 +#define VEC_hi_xmm8 xmm24 +#define VEC_hi_xmm9 xmm25 +#define VEC_hi_xmm10 xmm26 +#define VEC_hi_xmm11 xmm27 +#define VEC_hi_xmm12 xmm28 +#define VEC_hi_xmm13 xmm29 +#define VEC_hi_xmm14 xmm30 +#define VEC_hi_xmm15 xmm31 + +#define VEC_hi_ymm0 ymm16 +#define VEC_hi_ymm1 ymm17 +#define VEC_hi_ymm2 ymm18 +#define VEC_hi_ymm3 ymm19 +#define VEC_hi_ymm4 ymm20 +#define VEC_hi_ymm5 ymm21 +#define VEC_hi_ymm6 ymm22 +#define VEC_hi_ymm7 ymm23 +#define VEC_hi_ymm8 ymm24 +#define VEC_hi_ymm9 ymm25 +#define VEC_hi_ymm10 ymm26 +#define VEC_hi_ymm11 ymm27 +#define VEC_hi_ymm12 ymm28 +#define VEC_hi_ymm13 ymm29 +#define VEC_hi_ymm14 ymm30 +#define VEC_hi_ymm15 ymm31 + +#define VEC_hi_zmm0 zmm16 +#define VEC_hi_zmm1 zmm17 +#define VEC_hi_zmm2 zmm18 +#define VEC_hi_zmm3 zmm19 +#define VEC_hi_zmm4 zmm20 +#define VEC_hi_zmm5 zmm21 +#define VEC_hi_zmm6 zmm22 +#define VEC_hi_zmm7 zmm23 +#define VEC_hi_zmm8 zmm24 +#define VEC_hi_zmm9 zmm25 +#define VEC_hi_zmm10 zmm26 +#define VEC_hi_zmm11 zmm27 +#define VEC_hi_zmm12 zmm28 +#define VEC_hi_zmm13 zmm29 +#define VEC_hi_zmm14 zmm30 +#define VEC_hi_zmm15 zmm31 + +#define PRIMITIVE_VEC(vec, num) vec##num + +#define VEC_any_xmm(i) PRIMITIVE_VEC(xmm, i) +#define VEC_any_ymm(i) PRIMITIVE_VEC(ymm, i) +#define VEC_any_zmm(i) PRIMITIVE_VEC(zmm, i) + +#define VEC_hi_xmm(i) PRIMITIVE_VEC(VEC_hi_xmm, i) +#define VEC_hi_ymm(i) PRIMITIVE_VEC(VEC_hi_ymm, i) +#define VEC_hi_zmm(i) PRIMITIVE_VEC(VEC_hi_zmm, i) + +#endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
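Because the v3 headers are pure preprocessor, their effect can be sanity-checked from a plain C translation unit. A hedged sketch, assuming the sysdeps/x86_64/multiarch headers are reachable on the include path (illustrative only, not part of the patch):

#include <stdio.h>

/* Pick exactly one config; the VEC_SIZE guard in each header turns a
   double include into a compile error.  */
#include "sse2-vecs.h"	/* or "evex256-vecs.h" / "evex512-vecs.h" */

#define STR_(x) #x
#define STR(x) STR_(x)

int
main (void)
{
  /* Under sse2-vecs.h this prints "movups" and "xmm1"; under
     evex256-vecs.h it would print "vmovdqu64" and "ymm17".  */
  puts (STR (VMOVU));
  puts (STR (VEC (1)));
  printf ("VEC_SIZE=%d MOV_SIZE=%d RET_SIZE=%d\n",
	  VEC_SIZE, MOV_SIZE, RET_SIZE);
  return 0;
}

This is the sense in which a single implementation file can be "maintained to target multiple VEC classes": the body is written once against VEC(N)/VMOVU/SECTION, and the included config decides the mnemonics, registers, and section name.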
* [PATCH v3 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` 2022-06-03 23:49 ` [PATCH v3 " Noah Goldstein @ 2022-06-03 23:49 ` Noah Goldstein 2022-06-06 21:30 ` H.J. Lu 2022-06-03 23:49 ` [PATCH v3 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein ` (5 subsequent siblings) 6 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 23:49 UTC (permalink / raw) To: libc-alpha The RTM vzeroupper mitigation has no way of replacing inline vzeroupper not before a return. This code does not change any existing functionality. There is no difference in the objdump of libc.so before and after this patch. --- sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 1 + sysdeps/x86_64/sysdep.h | 16 ++++++++++++++++ 2 files changed, 17 insertions(+) diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h index 3f531dd47f..6ca9f5e6ba 100644 --- a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h @@ -20,6 +20,7 @@ #ifndef _AVX_RTM_VECS_H #define _AVX_RTM_VECS_H 1 +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h index f14d50786d..2cb31a558b 100644 --- a/sysdeps/x86_64/sysdep.h +++ b/sysdeps/x86_64/sysdep.h @@ -106,6 +106,22 @@ lose: \ vzeroupper; \ ret +/* Can be used to replace vzeroupper that is not directly before a + return. */ +#define COND_VZEROUPPER_XTEST \ + xtest; \ + jz 1f; \ + vzeroall; \ + jmp 2f; \ +1: \ + vzeroupper; \ +2: + +/* In RTM define this as COND_VZEROUPPER_XTEST. */ +#ifndef COND_VZEROUPPER +# define COND_VZEROUPPER vzeroupper +#endif + /* Zero upper vector registers and return. */ #ifndef ZERO_UPPER_VEC_REGISTERS_RETURN # define ZERO_UPPER_VEC_REGISTERS_RETURN \ -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v3 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` 2022-06-03 23:49 ` [PATCH v3 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein @ 2022-06-06 21:30 ` H.J. Lu 2022-06-06 22:38 ` Noah Goldstein 0 siblings, 1 reply; 82+ messages in thread From: H.J. Lu @ 2022-06-06 21:30 UTC (permalink / raw) To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell On Fri, Jun 3, 2022 at 4:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > The RTM vzeroupper mitigation has no way of replacing inline > vzeroupper not before a return. > > This code does not change any existing functionality. > > There is no difference in the objdump of libc.so before and after this > patch. > --- > sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 1 + > sysdeps/x86_64/sysdep.h | 16 ++++++++++++++++ > 2 files changed, 17 insertions(+) > > diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > index 3f531dd47f..6ca9f5e6ba 100644 > --- a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > @@ -20,6 +20,7 @@ > #ifndef _AVX_RTM_VECS_H > #define _AVX_RTM_VECS_H 1 > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h > index f14d50786d..2cb31a558b 100644 > --- a/sysdeps/x86_64/sysdep.h > +++ b/sysdeps/x86_64/sysdep.h > @@ -106,6 +106,22 @@ lose: \ > vzeroupper; \ > ret > > +/* Can be used to replace vzeroupper that is not directly before a > + return. */ Please mention that it should be used to reduce the number of vzerouppers. > +#define COND_VZEROUPPER_XTEST \ > + xtest; \ > + jz 1f; \ > + vzeroall; \ > + jmp 2f; \ > +1: \ > + vzeroupper; \ > +2: > + > +/* In RTM define this as COND_VZEROUPPER_XTEST. */ > +#ifndef COND_VZEROUPPER > +# define COND_VZEROUPPER vzeroupper > +#endif > + > /* Zero upper vector registers and return. */ > #ifndef ZERO_UPPER_VEC_REGISTERS_RETURN > # define ZERO_UPPER_VEC_REGISTERS_RETURN \ > -- > 2.34.1 > -- H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v3 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` 2022-06-06 21:30 ` H.J. Lu @ 2022-06-06 22:38 ` Noah Goldstein 0 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-06 22:38 UTC (permalink / raw) To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 2:31 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Fri, Jun 3, 2022 at 4:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > The RTM vzeroupper mitigation has no way of replacing inline > > vzeroupper not before a return. > > > > This code does not change any existing functionality. > > > > There is no difference in the objdump of libc.so before and after this > > patch. > > --- > > sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 1 + > > sysdeps/x86_64/sysdep.h | 16 ++++++++++++++++ > > 2 files changed, 17 insertions(+) > > > > diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > index 3f531dd47f..6ca9f5e6ba 100644 > > --- a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > @@ -20,6 +20,7 @@ > > #ifndef _AVX_RTM_VECS_H > > #define _AVX_RTM_VECS_H 1 > > > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > > > diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h > > index f14d50786d..2cb31a558b 100644 > > --- a/sysdeps/x86_64/sysdep.h > > +++ b/sysdeps/x86_64/sysdep.h > > @@ -106,6 +106,22 @@ lose: \ > > vzeroupper; \ > > ret > > > > +/* Can be used to replace vzeroupper that is not directly before a > > + return. */ > > Please mention that it should be used to reduce the number of > vzerouppers. Fixed in V4 Made things more explicit in the comment and commit message. > > > +#define COND_VZEROUPPER_XTEST \ > > + xtest; \ > > + jz 1f; \ > > + vzeroall; \ > > + jmp 2f; \ > > +1: \ > > + vzeroupper; \ > > +2: > > + > > +/* In RTM define this as COND_VZEROUPPER_XTEST. */ > > +#ifndef COND_VZEROUPPER > > +# define COND_VZEROUPPER vzeroupper > > +#endif > > + > > /* Zero upper vector registers and return. */ > > #ifndef ZERO_UPPER_VEC_REGISTERS_RETURN > > # define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > -- > > 2.34.1 > > > > > -- > H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
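To follow the control flow of COND_VZEROUPPER_XTEST above: `xtest` sets ZF when the processor is not in a transaction, so the `jz 1f` path runs `vzeroupper` during normal execution, and `vzeroall` is used inside an RTM transaction, mirroring the existing ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST mitigation. A C model of that branch structure (purely illustrative; the functions are invented stand-ins for the instructions):

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for xtest/vzeroall/vzeroupper; the real code is asm.  */
static bool in_rtm_transaction (void) { return false; }
static void vzeroall (void)   { puts ("vzeroall"); }
static void vzeroupper (void) { puts ("vzeroupper"); }

/* Models COND_VZEROUPPER_XTEST:
     xtest; jz 1f; vzeroall; jmp 2f; 1: vzeroupper; 2:  */
static void
cond_vzeroupper_xtest (void)
{
  if (in_rtm_transaction ())
    vzeroall ();	/* jz not taken: ZF clear inside a transaction.  */
  else
    vzeroupper ();	/* jz taken: ZF set outside a transaction.  */
}

int
main (void)
{
  cond_vzeroupper_xtest ();
  return 0;
}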
* [PATCH v3 3/8] Benchtests: Improve memrchr benchmarks 2022-06-03 23:49 ` [PATCH v3 " Noah Goldstein 2022-06-03 23:49 ` [PATCH v3 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein @ 2022-06-03 23:49 ` Noah Goldstein 2022-06-03 23:49 ` [PATCH v3 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein ` (4 subsequent siblings) 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 23:49 UTC (permalink / raw) To: libc-alpha Add a second iteration for memrchr to set `pos` starting from the end of the buffer. Previously `pos` was only set relative to the beginning of the buffer. This isn't really useful for memrchr because the beginning of the search space is (buf + len). --- benchtests/bench-memchr.c | 110 ++++++++++++++++++++++---------------- 1 file changed, 65 insertions(+), 45 deletions(-) diff --git a/benchtests/bench-memchr.c b/benchtests/bench-memchr.c index 4d7212332f..0facda2fa0 100644 --- a/benchtests/bench-memchr.c +++ b/benchtests/bench-memchr.c @@ -76,7 +76,7 @@ do_one_test (json_ctx_t *json_ctx, impl_t *impl, const CHAR *s, int c, static void do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, - int seek_char) + int seek_char, int invert_pos) { size_t i; @@ -96,7 +96,10 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, if (pos < len) { - buf[align + pos] = seek_char; + if (invert_pos) + buf[align + len - pos] = seek_char; + else + buf[align + pos] = seek_char; buf[align + len] = -seek_char; } else @@ -109,6 +112,7 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, json_attr_uint (json_ctx, "pos", pos); json_attr_uint (json_ctx, "len", len); json_attr_uint (json_ctx, "seek_char", seek_char); + json_attr_uint (json_ctx, "invert_pos", invert_pos); json_array_begin (json_ctx, "timings"); @@ -123,6 +127,7 @@ int test_main (void) { size_t i; + int repeats; json_ctx_t json_ctx; test_init (); @@ -142,53 +147,68 @@ test_main (void) json_array_begin (&json_ctx, "results"); - for (i = 1; i < 8; ++i) + for (repeats = 0; repeats < 2; ++repeats) { - do_test (&json_ctx, 0, 16 << i, 2048, 23); - do_test (&json_ctx, i, 64, 256, 23); - do_test (&json_ctx, 0, 16 << i, 2048, 0); - do_test (&json_ctx, i, 64, 256, 0); - - do_test (&json_ctx, getpagesize () - 15, 64, 256, 0); + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, 0, 16 << i, 2048, 23, repeats); + do_test (&json_ctx, i, 64, 256, 23, repeats); + do_test (&json_ctx, 0, 16 << i, 2048, 0, repeats); + do_test (&json_ctx, i, 64, 256, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, 64, 256, 0, repeats); #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr. */ - do_test (&json_ctx, 0, i, 256, 23); - do_test (&json_ctx, 0, i, 256, 0); - do_test (&json_ctx, i, i, 256, 23); - do_test (&json_ctx, i, i, 256, 0); + /* Also test the position close to the beginning for memrchr. 
*/ + do_test (&json_ctx, 0, i, 256, 23, repeats); + do_test (&json_ctx, 0, i, 256, 0, repeats); + do_test (&json_ctx, i, i, 256, 23, repeats); + do_test (&json_ctx, i, i, 256, 0, repeats); #endif - } - for (i = 1; i < 8; ++i) - { - do_test (&json_ctx, i, i << 5, 192, 23); - do_test (&json_ctx, i, i << 5, 192, 0); - do_test (&json_ctx, i, i << 5, 256, 23); - do_test (&json_ctx, i, i << 5, 256, 0); - do_test (&json_ctx, i, i << 5, 512, 23); - do_test (&json_ctx, i, i << 5, 512, 0); - - do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23); - } - for (i = 1; i < 32; ++i) - { - do_test (&json_ctx, 0, i, i + 1, 23); - do_test (&json_ctx, 0, i, i + 1, 0); - do_test (&json_ctx, i, i, i + 1, 23); - do_test (&json_ctx, i, i, i + 1, 0); - do_test (&json_ctx, 0, i, i - 1, 23); - do_test (&json_ctx, 0, i, i - 1, 0); - do_test (&json_ctx, i, i, i - 1, 23); - do_test (&json_ctx, i, i, i - 1, 0); - - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23); - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0); - - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23); - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0); + } + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, i, i << 5, 192, 23, repeats); + do_test (&json_ctx, i, i << 5, 192, 0, repeats); + do_test (&json_ctx, i, i << 5, 256, 23, repeats); + do_test (&json_ctx, i, i << 5, 256, 0, repeats); + do_test (&json_ctx, i, i << 5, 512, 23, repeats); + do_test (&json_ctx, i, i << 5, 512, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23, repeats); + } + for (i = 1; i < 32; ++i) + { + do_test (&json_ctx, 0, i, i + 1, 23, repeats); + do_test (&json_ctx, 0, i, i + 1, 0, repeats); + do_test (&json_ctx, i, i, i + 1, 23, repeats); + do_test (&json_ctx, i, i, i + 1, 0, repeats); + do_test (&json_ctx, 0, i, i - 1, 23, repeats); + do_test (&json_ctx, 0, i, i - 1, 0, repeats); + do_test (&json_ctx, i, i, i - 1, 23, repeats); + do_test (&json_ctx, i, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0, repeats); + #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr. */ - do_test (&json_ctx, 0, 1, i + 1, 23); - do_test (&json_ctx, 0, 2, i + 1, 0); + do_test (&json_ctx, 0, 1, i + 1, 23, repeats); + do_test (&json_ctx, 0, 2, i + 1, 0, repeats); +#endif + } +#ifndef USE_AS_MEMRCHR + break; #endif } -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
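To make the new invert_pos knob concrete: with invert_pos=1 the seek_char lands pos bytes from the end of the buffer instead of pos bytes from the start, which is the interesting direction for memrchr since it searches backwards from (buf + len). A tiny standalone illustration of the indexing added to do_test (not benchmark code, just the placement arithmetic):

#include <stdio.h>

int
main (void)
{
  size_t align = 0, pos = 3, len = 16;

  /* Mirrors the two branches added to do_test above.  */
  printf ("invert_pos=0: seek_char at buf[%zu] (from the start)\n",
	  align + pos);
  printf ("invert_pos=1: seek_char at buf[%zu] (from the end)\n",
	  align + len - pos);
  return 0;
}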
* [PATCH v3 4/8] x86: Optimize memrchr-sse2.S 2022-06-03 23:49 ` [PATCH v3 " Noah Goldstein 2022-06-03 23:49 ` [PATCH v3 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein 2022-06-03 23:49 ` [PATCH v3 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein @ 2022-06-03 23:49 ` Noah Goldstein 2022-06-03 23:49 ` [PATCH v3 5/8] x86: Optimize memrchr-evex.S Noah Goldstein ` (3 subsequent siblings) 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 23:49 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller lengths more. 2. optimizes target placement more carefully. 3. reuses logic more. 4. fixes up various inefficiencies in the logic. The total code size saving is: 394 bytes Geometric Mean of all benchmarks New / Old: 0.874 Regressions: 1. The page cross case is now colder, especially re-entry from the page cross case if a match is not found in the first VEC (roughly 50%). My general opinion with this patch is this is acceptable given the "coldness" of this case (less than 4%) and the general performance improvement in the other, far more common cases. 2. There are some 5-15% regressions for medium/large user-arg lengths that have a match in the first VEC. This is because the logic was rewritten to optimize finds in the first VEC if the user-arg length is shorter (where we see roughly 20-50% performance improvements). It is not always the case this is a regression. My intuition is that some frontend quirk partially explains the data, although I haven't been able to find the root cause. Full xcheck passes on x86_64. --- sysdeps/x86_64/memrchr.S | 613 +++++++++++++++++++-------------------- 1 file changed, 292 insertions(+), 321 deletions(-) diff --git a/sysdeps/x86_64/memrchr.S b/sysdeps/x86_64/memrchr.S index d1a9f47911..b0dffd2ae2 100644 --- a/sysdeps/x86_64/memrchr.S +++ b/sysdeps/x86_64/memrchr.S @@ -18,362 +18,333 @@ <https://www.gnu.org/licenses/>. */ #include <sysdep.h> +#define VEC_SIZE 16 +#define PAGE_SIZE 4096 .text -ENTRY (__memrchr) - movd %esi, %xmm1 - - sub $16, %RDX_LP - jbe L(length_less16) - - punpcklbw %xmm1, %xmm1 - punpcklbw %xmm1, %xmm1 - - add %RDX_LP, %RDI_LP - pshufd $0, %xmm1, %xmm1 - - movdqu (%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - -/* Check if there is a match. 
*/ - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches0) - - sub $64, %rdi - mov %edi, %ecx - and $15, %ecx - jz L(loop_prolog) - - add $16, %rdi - add $16, %rdx - and $-16, %rdi - sub %rcx, %rdx - - .p2align 4 -L(loop_prolog): - sub $64, %rdx - jbe L(exit_loop) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16) - - movdqa (%rdi), %xmm4 - pcmpeqb %xmm1, %xmm4 - pmovmskb %xmm4, %eax - test %eax, %eax - jnz L(matches0) - - sub $64, %rdi - sub $64, %rdx - jbe L(exit_loop) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16) - - movdqa (%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches0) - - mov %edi, %ecx - and $63, %ecx - jz L(align64_loop) - - add $64, %rdi - add $64, %rdx - and $-64, %rdi - sub %rcx, %rdx - - .p2align 4 -L(align64_loop): - sub $64, %rdi - sub $64, %rdx - jbe L(exit_loop) - - movdqa (%rdi), %xmm0 - movdqa 16(%rdi), %xmm2 - movdqa 32(%rdi), %xmm3 - movdqa 48(%rdi), %xmm4 - - pcmpeqb %xmm1, %xmm0 - pcmpeqb %xmm1, %xmm2 - pcmpeqb %xmm1, %xmm3 - pcmpeqb %xmm1, %xmm4 - - pmaxub %xmm3, %xmm0 - pmaxub %xmm4, %xmm2 - pmaxub %xmm0, %xmm2 - pmovmskb %xmm2, %eax - - test %eax, %eax - jz L(align64_loop) - - pmovmskb %xmm4, %eax - test %eax, %eax - jnz L(matches48) - - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm2 - - pcmpeqb %xmm1, %xmm2 - pcmpeqb (%rdi), %xmm1 - - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches16) - - pmovmskb %xmm1, %eax - bsr %eax, %eax - - add %rdi, %rax +ENTRY_P2ALIGN(__memrchr, 6) +#ifdef __ILP32__ + /* Clear upper bits. */ + mov %RDX_LP, %RDX_LP +#endif + movd %esi, %xmm0 + + /* Get end pointer. */ + leaq (%rdx, %rdi), %rcx + + punpcklbw %xmm0, %xmm0 + punpcklwd %xmm0, %xmm0 + pshufd $0, %xmm0, %xmm0 + + /* Check if we can load 1x VEC without crossing a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %ecx + jz L(page_cross) + + /* NB: This load happens regardless of whether rdx (len) is zero. Since + it doesn't cross a page and the standard guarantees any pointer has + at least one valid byte, this load must be safe. For the entire + history of the x86 memrchr implementation this has been possible so + no code "should" be relying on a zero-length check before this load. + The zero-length check is moved to the page cross case because it is + 1) pretty cold and 2) including it pushes the hot case len <= VEC_SIZE + into 2 cache lines. */ + movups -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subq $VEC_SIZE, %rdx + ja L(more_1x_vec) +L(ret_vec_x0_test): + /* Zero-flag set if eax (src) is zero. Destination unchanged if src is + zero. */ + bsrl %eax, %eax + jz L(ret_0) + /* Check if the CHAR match is in bounds. Need to truly zero `eax` here + if out of bounds. */ + addl %edx, %eax + jl L(zero_0) + /* Since we subtracted VEC_SIZE from rdx earlier we can just add to base + ptr. 
*/ + addq %rdi, %rax +L(ret_0): ret - .p2align 4 -L(exit_loop): - add $64, %edx - cmp $32, %edx - jbe L(exit_loop_32) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16_1) - cmp $48, %edx - jbe L(return_null) - - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax - test %eax, %eax - jnz L(matches0_1) - xor %eax, %eax + .p2align 4,, 5 +L(ret_vec_x0): + bsrl %eax, %eax + leaq -(VEC_SIZE)(%rcx, %rax), %rax ret - .p2align 4 -L(exit_loop_32): - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48_1) - cmp $16, %edx - jbe L(return_null) - - pcmpeqb 32(%rdi), %xmm1 - pmovmskb %xmm1, %eax - test %eax, %eax - jnz L(matches32_1) - xor %eax, %eax + .p2align 4,, 2 +L(zero_0): + xorl %eax, %eax ret - .p2align 4 -L(matches0): - bsr %eax, %eax - add %rdi, %rax - ret - - .p2align 4 -L(matches16): - bsr %eax, %eax - lea 16(%rax, %rdi), %rax - ret - .p2align 4 -L(matches32): - bsr %eax, %eax - lea 32(%rax, %rdi), %rax + .p2align 4,, 8 +L(more_1x_vec): + testl %eax, %eax + jnz L(ret_vec_x0) + + /* Align rcx (pointer to string). */ + decq %rcx + andq $-VEC_SIZE, %rcx + + movq %rcx, %rdx + /* NB: We could consistently save 1 byte in this pattern with `movaps + %xmm0, %xmm1; pcmpeq IMM8(r), %xmm1; ...`. The reason against it is + it adds more frontend uops (even if the moves can be eliminated) and + some percentage of the time actual backend uops. */ + movaps -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + subq %rdi, %rdx + pmovmskb %xmm1, %eax + + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) +L(last_2x_vec): + subl $VEC_SIZE, %edx + jbe L(ret_vec_x0_test) + + testl %eax, %eax + jnz L(ret_vec_x0) + + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subl $VEC_SIZE, %edx + bsrl %eax, %eax + jz L(ret_1) + addl %edx, %eax + jl L(zero_0) + addq %rdi, %rax +L(ret_1): ret - .p2align 4 -L(matches48): - bsr %eax, %eax - lea 48(%rax, %rdi), %rax + /* Don't align. Otherwise we lose the 2-byte encoding in the jump to + L(page_cross), which causes the hot path (length <= VEC_SIZE) to span + multiple cache lines. Naturally aligned % 16 to 8 bytes. */ +L(page_cross): + /* Zero length check. */ + testq %rdx, %rdx + jz L(zero_0) + + leaq -1(%rcx), %r8 + andq $-(VEC_SIZE), %r8 + + movaps (%r8), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %esi + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + negl %ecx + /* 32-bit shift but VEC_SIZE=16 so need to mask the shift count + explicitly. */ + andl $(VEC_SIZE - 1), %ecx + shl %cl, %esi + movzwl %si, %eax + leaq (%rdi, %rdx), %rcx + cmpq %rdi, %r8 + ja L(more_1x_vec) + subl $VEC_SIZE, %edx + bsrl %eax, %eax + jz L(ret_2) + addl %edx, %eax + jl L(zero_1) + addq %rdi, %rax +L(ret_2): ret - .p2align 4 -L(matches0_1): - bsr %eax, %eax - sub $64, %rdx - add %rax, %rdx - jl L(return_null) - add %rdi, %rax + /* Fits in aligning bytes. 
*/ +L(zero_1): + xorl %eax, %eax ret - .p2align 4 -L(matches16_1): - bsr %eax, %eax - sub $48, %rdx - add %rax, %rdx - jl L(return_null) - lea 16(%rdi, %rax), %rax + .p2align 4,, 5 +L(ret_vec_x1): + bsrl %eax, %eax + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax ret - .p2align 4 -L(matches32_1): - bsr %eax, %eax - sub $32, %rdx - add %rax, %rdx - jl L(return_null) - lea 32(%rdi, %rax), %rax - ret + .p2align 4,, 8 +L(more_2x_vec): + testl %eax, %eax + jnz L(ret_vec_x0) - .p2align 4 -L(matches48_1): - bsr %eax, %eax - sub $16, %rdx - add %rax, %rdx - jl L(return_null) - lea 48(%rdi, %rax), %rax - ret + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + testl %eax, %eax + jnz L(ret_vec_x1) - .p2align 4 -L(return_null): - xor %eax, %eax - ret - .p2align 4 -L(length_less16_offset0): - test %edx, %edx - jz L(return_null) + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax - mov %dl, %cl - pcmpeqb (%rdi), %xmm1 + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) - mov $1, %edx - sal %cl, %edx - sub $1, %edx + addl $(VEC_SIZE), %edx + jle L(ret_vec_x2_test) - pmovmskb %xmm1, %eax +L(last_vec): + testl %eax, %eax + jnz L(ret_vec_x2) - and %edx, %eax - test %eax, %eax - jz L(return_null) + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax - bsr %eax, %eax - add %rdi, %rax + subl $(VEC_SIZE), %edx + bsrl %eax, %eax + jz L(ret_3) + addl %edx, %eax + jl L(zero_2) + addq %rdi, %rax +L(ret_3): ret - .p2align 4 -L(length_less16): - punpcklbw %xmm1, %xmm1 - punpcklbw %xmm1, %xmm1 - - add $16, %edx - - pshufd $0, %xmm1, %xmm1 - - mov %edi, %ecx - and $15, %ecx - jz L(length_less16_offset0) - - mov %cl, %dh - mov %ecx, %esi - add %dl, %dh - and $-16, %rdi - - sub $16, %dh - ja L(length_less16_part2) - - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax - - sar %cl, %eax - mov %dl, %cl - - mov $1, %edx - sal %cl, %edx - sub $1, %edx - - and %edx, %eax - test %eax, %eax - jz L(return_null) - - bsr %eax, %eax - add %rdi, %rax - add %rsi, %rax + .p2align 4,, 6 +L(ret_vec_x2_test): + bsrl %eax, %eax + jz L(zero_2) + addl %edx, %eax + jl L(zero_2) + addq %rdi, %rax ret - .p2align 4 -L(length_less16_part2): - movdqa 16(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - - mov %dh, %cl - mov $1, %edx - sal %cl, %edx - sub $1, %edx - - and %edx, %eax +L(zero_2): + xorl %eax, %eax + ret - test %eax, %eax - jnz L(length_less16_part2_return) - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax + .p2align 4,, 5 +L(ret_vec_x2): + bsrl %eax, %eax + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax + ret - mov %esi, %ecx - sar %cl, %eax - test %eax, %eax - jz L(return_null) + .p2align 4,, 5 +L(ret_vec_x3): + bsrl %eax, %eax + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax + ret - bsr %eax, %eax - add %rdi, %rax - add %rsi, %rax + .p2align 4,, 8 +L(more_4x_vec): + testl %eax, %eax + jnz L(ret_vec_x2) + + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + testl %eax, %eax + jnz L(ret_vec_x3) + + addq $-(VEC_SIZE * 4), %rcx + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) + + /* Offset everything by 4x VEC_SIZE here to save a few bytes at the end + keeping the code from spilling to the next cache line. 
*/ + addq $(VEC_SIZE * 4 - 1), %rcx + andq $-(VEC_SIZE * 4), %rcx + leaq (VEC_SIZE * 4)(%rdi), %rdx + andq $-(VEC_SIZE * 4), %rdx + + .p2align 4,, 11 +L(loop_4x_vec): + movaps (VEC_SIZE * -1)(%rcx), %xmm1 + movaps (VEC_SIZE * -2)(%rcx), %xmm2 + movaps (VEC_SIZE * -3)(%rcx), %xmm3 + movaps (VEC_SIZE * -4)(%rcx), %xmm4 + pcmpeqb %xmm0, %xmm1 + pcmpeqb %xmm0, %xmm2 + pcmpeqb %xmm0, %xmm3 + pcmpeqb %xmm0, %xmm4 + + por %xmm1, %xmm2 + por %xmm3, %xmm4 + por %xmm2, %xmm4 + + pmovmskb %xmm4, %esi + testl %esi, %esi + jnz L(loop_end) + + addq $-(VEC_SIZE * 4), %rcx + cmpq %rdx, %rcx + jne L(loop_4x_vec) + + subl %edi, %edx + + /* Ends up being 1-byte nop. */ + .p2align 4,, 2 +L(last_4x_vec): + movaps -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) + + testl %eax, %eax + jnz L(ret_vec_x0) + + + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + testl %eax, %eax + jnz L(ret_vec_end) + + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subl $(VEC_SIZE * 3), %edx + ja L(last_vec) + bsrl %eax, %eax + jz L(ret_4) + addl %edx, %eax + jl L(zero_3) + addq %rdi, %rax +L(ret_4): ret - .p2align 4 -L(length_less16_part2_return): - bsr %eax, %eax - lea 16(%rax, %rdi), %rax + /* Ends up being 1-byte nop. */ + .p2align 4,, 3 +L(loop_end): + pmovmskb %xmm1, %eax + sall $16, %eax + jnz L(ret_vec_end) + + pmovmskb %xmm2, %eax + testl %eax, %eax + jnz L(ret_vec_end) + + pmovmskb %xmm3, %eax + /* Combine last 2 VEC matches. If eax (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If eax is non-zero + then CHAR is in VEC3 and bsrl will use that position. */ + sall $16, %eax + orl %esi, %eax + bsrl %eax, %eax + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax ret -END (__memrchr) +L(ret_vec_end): + bsrl %eax, %eax + leaq (VEC_SIZE * -2)(%rax, %rcx), %rax + ret + /* Use in L(last_4x_vec). In the same cache line. These are just spare + aligning bytes. */ +L(zero_3): + xorl %eax, %eax + ret + /* 2 bytes from next cache line. */ +END(__memrchr) weak_alias (__memrchr, memrchr) -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
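The entry sequence of the new memrchr-sse2 leans on one observation: a VEC_SIZE load ending at the end pointer stays within one page whenever end & (PAGE_SIZE - VEC_SIZE) is non-zero, which is what the `testl $(PAGE_SIZE - VEC_SIZE), %ecx; jz L(page_cross)` pair checks, conservatively treating a zero result as a possible crossing. A small C sketch of the predicate (illustrative only, not glibc code):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096
#define VEC_SIZE 16

/* Mirrors `testl $(PAGE_SIZE - VEC_SIZE), %ecx; jz L(page_cross)`:
   if this is zero, end % PAGE_SIZE < VEC_SIZE and the 16-byte load
   at [end - VEC_SIZE, end) might straddle a page boundary.  */
static int
may_cross_page (uintptr_t end)
{
  return (end & (PAGE_SIZE - VEC_SIZE)) == 0;
}

int
main (void)
{
  printf ("%d\n", may_cross_page (0x1008)); /* 1: ends 8 bytes into a page. */
  printf ("%d\n", may_cross_page (0x1040)); /* 0: load is page-internal. */
  return 0;
}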
* [PATCH v3 5/8] x86: Optimize memrchr-evex.S 2022-06-03 23:49 ` [PATCH v3 " Noah Goldstein ` (2 preceding siblings ...) 2022-06-03 23:49 ` [PATCH v3 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein @ 2022-06-03 23:49 ` Noah Goldstein 2022-06-03 23:49 ` [PATCH v3 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein ` (2 subsequent siblings) 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 23:49 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully 3. reuses logic more 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns which saves either a branch or multiple instructions. The total code size saving is: 263 bytes Geometric Mean of all benchmarks New / Old: 0.755 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the begining of the string (in first VEC). This case has roughly a 20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). This case has roughly a 35% speedup. Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memrchr-evex.S | 539 ++++++++++++------------ 1 file changed, 268 insertions(+), 271 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S index 0b99709c6b..ad541c0e50 100644 --- a/sysdeps/x86_64/multiarch/memrchr-evex.S +++ b/sysdeps/x86_64/multiarch/memrchr-evex.S @@ -19,319 +19,316 @@ #if IS_IN (libc) # include <sysdep.h> +# include "evex256-vecs.h" +# if VEC_SIZE != 32 +# error "VEC_SIZE != 32 unimplemented" +# endif + +# ifndef MEMRCHR +# define MEMRCHR __memrchr_evex +# endif + +# define PAGE_SIZE 4096 +# define VECMATCH VEC(0) + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN(MEMRCHR, 6) +# ifdef __ILP32__ + /* Clear upper bits. */ + and %RDX_LP, %RDX_LP +# else + test %RDX_LP, %RDX_LP +# endif + jz L(zero_0) + + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a + correct page cross check and 2) it correctly sets up end ptr to be + subtract by lzcnt aligned. */ + leaq -1(%rdi, %rdx), %rax + vpbroadcastb %esi, %VECMATCH + + /* Check if we can load 1x VEC without cross a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %eax + jz L(page_cross) + + /* Don't use rax for pointer here because EVEX has better encoding with + offset % VEC_SIZE == 0. */ + vpcmpb $0, -(VEC_SIZE)(%rdi, %rdx), %VECMATCH, %k0 + kmovd %k0, %ecx + + /* Fall through for rdx (len) <= VEC_SIZE (expect small sizes). */ + cmpq $VEC_SIZE, %rdx + ja L(more_1x_vec) +L(ret_vec_x0_test): + + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which + will gurantee edx (len) is less than it. */ + lzcntl %ecx, %ecx + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret -# define VMOVA vmovdqa64 - -# define YMMMATCH ymm16 - -# define VEC_SIZE 32 - - .section .text.evex,"ax",@progbits -ENTRY (__memrchr_evex) - /* Broadcast CHAR to YMMMATCH. */ - vpbroadcastb %esi, %YMMMATCH - - sub $VEC_SIZE, %RDX_LP - jbe L(last_vec_or_less) - - add %RDX_LP, %RDI_LP - - /* Check the last VEC_SIZE bytes. */ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - subq $(VEC_SIZE * 4), %rdi - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(aligned_more) - - /* Align data for aligned loads in the loop. 
*/ - addq $VEC_SIZE, %rdi - addq $VEC_SIZE, %rdx - andq $-VEC_SIZE, %rdi - subq %rcx, %rdx - - .p2align 4 -L(aligned_more): - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpcmpb $0, (%rdi), %YMMMATCH, %k4 - kmovd %k4, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - /* Align data to 4 * VEC_SIZE for loop with fewer branches. - There are some overlaps with above if data isn't aligned - to 4 * VEC_SIZE. */ - movl %edi, %ecx - andl $(VEC_SIZE * 4 - 1), %ecx - jz L(loop_4x_vec) - - addq $(VEC_SIZE * 4), %rdi - addq $(VEC_SIZE * 4), %rdx - andq $-(VEC_SIZE * 4), %rdi - subq %rcx, %rdx + /* Fits in aligning bytes of first cache line. */ +L(zero_0): + xorl %eax, %eax + ret - .p2align 4 -L(loop_4x_vec): - /* Compare 4 * VEC at a time forward. */ - subq $(VEC_SIZE * 4), %rdi - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k2 - kord %k1, %k2, %k5 - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k3 - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k4 - - kord %k3, %k4, %k6 - kortestd %k5, %k6 - jz L(loop_4x_vec) - - /* There is a match. */ - kmovd %k4, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - kmovd %k1, %eax - bsrl %eax, %eax - addq %rdi, %rax + .p2align 4,, 9 +L(ret_vec_x0_dec): + decq %rax +L(ret_vec_x0): + lzcntl %ecx, %ecx + subq %rcx, %rax ret - .p2align 4 -L(last_4x_vec_or_less): - addl $(VEC_SIZE * 4), %edx - cmpl $(VEC_SIZE * 2), %edx - jbe L(last_2x_vec) + .p2align 4,, 10 +L(more_1x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3) + /* Align rax (pointer to string). */ + andq $-VEC_SIZE, %rax - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x2) + /* Recompute length after aligning. */ + movq %rax, %rdx - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x1_check) - cmpl $(VEC_SIZE * 3), %edx - jbe L(zero) + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - vpcmpb $0, (%rdi), %YMMMATCH, %k4 - kmovd %k4, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 4), %rdx - addq %rax, %rdx - jl L(zero) - addq %rdi, %rax - ret + subq %rdi, %rdx - .p2align 4 + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) L(last_2x_vec): - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3_check) + + /* Must dec rax because L(ret_vec_x0_test) expects it. 
*/ + decq %rax cmpl $VEC_SIZE, %edx - jbe L(zero) - - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 2), %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + jbe L(ret_vec_x0_test) + + testl %ecx, %ecx + jnz L(ret_vec_x0) + + /* Don't use rax for pointer here because EVEX has better encoding with + offset % VEC_SIZE == 0. */ + vpcmpb $0, -(VEC_SIZE * 2)(%rdi, %rdx), %VECMATCH, %k0 + kmovd %k0, %ecx + /* NB: 64-bit lzcnt. This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax ret - .p2align 4 -L(last_vec_x0): - bsrl %eax, %eax - addq %rdi, %rax + /* Inexpensive place to put this regarding code size / target alignments + / ICache NLP. Necessary for 2-byte encoding of jump to page cross + case which inturn in necessray for hot path (len <= VEC_SIZE) to fit + in first cache line. */ +L(page_cross): + movq %rax, %rsi + andq $-VEC_SIZE, %rsi + vpcmpb $0, (%rsi), %VECMATCH, %k0 + kmovd %k0, %r8d + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + movl %eax, %ecx + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ + notl %ecx + shlxl %ecx, %r8d, %ecx + cmpq %rdi, %rsi + ja L(more_1x_vec) + lzcntl %ecx, %ecx + cmpl %ecx, %edx + jle L(zero_1) + subq %rcx, %rax ret - .p2align 4 -L(last_vec_x1): - bsrl %eax, %eax - addl $VEC_SIZE, %eax - addq %rdi, %rax + /* Continue creating zero labels that fit in aligning bytes and get + 2-byte encoding / are in the same cache line as condition. */ +L(zero_1): + xorl %eax, %eax ret - .p2align 4 -L(last_vec_x2): - bsrl %eax, %eax - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + .p2align 4,, 8 +L(ret_vec_x1): + /* This will naturally add 32 to position. */ + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax ret - .p2align 4 -L(last_vec_x3): - bsrl %eax, %eax - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret + .p2align 4,, 8 +L(more_2x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0_dec) - .p2align 4 -L(last_vec_x1_check): - bsrl %eax, %eax - subq $(VEC_SIZE * 3), %rdx - addq %rax, %rdx - jl L(zero) - addl $VEC_SIZE, %eax - addq %rdi, %rax - ret + vpcmpb $0, -(VEC_SIZE * 2)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1) - .p2align 4 -L(last_vec_x3_check): - bsrl %eax, %eax - subq $VEC_SIZE, %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE * 3)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - .p2align 4 -L(zero): - xorl %eax, %eax + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) + + cmpl $(VEC_SIZE * -1), %edx + jle L(ret_vec_x2_test) +L(last_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) + + + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 3 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_1) ret - .p2align 4 -L(last_vec_or_less_aligned): - movl %edx, %ecx - - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - - movl $1, %edx - /* Support rdx << 32. */ - salq %cl, %rdx - subq $1, %rdx - - kmovd %k1, %eax - - /* Remove the trailing bytes. 
*/ - andl %edx, %eax - testl %eax, %eax - jz L(zero) - - bsrl %eax, %eax - addq %rdi, %rax + .p2align 4,, 8 +L(ret_vec_x2_test): + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_1) ret - .p2align 4 -L(last_vec_or_less): - addl $VEC_SIZE, %edx - - /* Check for zero length. */ - testl %edx, %edx - jz L(zero) - - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(last_vec_or_less_aligned) - - movl %ecx, %esi - movl %ecx, %r8d - addl %edx, %esi - andq $-VEC_SIZE, %rdi + .p2align 4,, 8 +L(ret_vec_x2): + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax + ret - subl $VEC_SIZE, %esi - ja L(last_vec_2x_aligned) + .p2align 4,, 8 +L(ret_vec_x3): + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax + ret - /* Check the last VEC. */ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax + .p2align 4,, 8 +L(more_4x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) - /* Remove the leading and trailing bytes. */ - sarl %cl, %eax - movl %edx, %ecx + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + testl %ecx, %ecx + jnz L(ret_vec_x3) - andl %edx, %eax - testl %eax, %eax - jz L(zero) + /* Check if near end before re-aligning (otherwise might do an + unnecissary loop iteration). */ + addq $-(VEC_SIZE * 4), %rax + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax - ret + decq %rax + andq $-(VEC_SIZE * 4), %rax + movq %rdi, %rdx + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because + lengths that overflow can be valid and break the comparison. */ + andq $-(VEC_SIZE * 4), %rdx .p2align 4 -L(last_vec_2x_aligned): - movl %esi, %ecx - - /* Check the last VEC. */ - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k1 +L(loop_4x_vec): + /* Store 1 were not-equals and 0 where equals in k1 (used to mask later + on). */ + vpcmpb $4, (VEC_SIZE * 3)(%rax), %VECMATCH, %k1 + + /* VEC(2/3) will have zero-byte where we found a CHAR. */ + vpxorq (VEC_SIZE * 2)(%rax), %VECMATCH, %VEC(2) + vpxorq (VEC_SIZE * 1)(%rax), %VECMATCH, %VEC(3) + vpcmpb $0, (VEC_SIZE * 0)(%rax), %VECMATCH, %k4 + + /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit where + CHAR is found and VEC(2/3) have zero-byte where CHAR is found. */ + vpminub %VEC(2), %VEC(3), %VEC(3){%k1}{z} + vptestnmb %VEC(3), %VEC(3), %k2 + + /* Any 1s and we found CHAR. */ + kortestd %k2, %k4 + jnz L(loop_end) + + addq $-(VEC_SIZE * 4), %rax + cmpq %rdx, %rax + jne L(loop_4x_vec) + + /* Need to re-adjust rdx / rax for L(last_4x_vec). */ + subq $-(VEC_SIZE * 4), %rdx + movq %rdx, %rax + subl %edi, %edx +L(last_4x_vec): + + /* Used no matter what. */ + vpcmpb $0, (VEC_SIZE * -1)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) - kmovd %k1, %eax + testl %ecx, %ecx + jnz L(ret_vec_x0_dec) - /* Remove the trailing bytes. */ - andl %edx, %eax - testl %eax, %eax - jnz L(last_vec_x1) + vpcmpb $0, (VEC_SIZE * -2)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - /* Check the second last VEC. */ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 + testl %ecx, %ecx + jnz L(ret_vec_x1) - movl %r8d, %ecx + /* Used no matter what. */ + vpcmpb $0, (VEC_SIZE * -3)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - kmovd %k1, %eax + cmpl $(VEC_SIZE * 3), %edx + ja L(last_vec) - /* Remove the leading bytes. Must use unsigned right shift for - bsrl below. 
*/ - shrl %cl, %eax - testl %eax, %eax - jz L(zero) + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + jbe L(ret_1) + xorl %eax, %eax +L(ret_1): + ret - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax + .p2align 4,, 6 +L(loop_end): + kmovd %k1, %ecx + notl %ecx + testl %ecx, %ecx + jnz L(ret_vec_x0_end) + + vptestnmb %VEC(2), %VEC(2), %k0 + kmovd %k0, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) + + kmovd %k2, %ecx + kmovd %k4, %esi + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If ecx is non-zero + then CHAR in VEC3 and bsrq will use that position. */ + salq $32, %rcx + orq %rsi, %rcx + bsrq %rcx, %rcx + addq %rcx, %rax + ret + .p2align 4,, 4 +L(ret_vec_x0_end): + addq $(VEC_SIZE), %rax +L(ret_vec_x1_end): + bsrl %ecx, %ecx + leaq (VEC_SIZE * 2)(%rax, %rcx), %rax ret -END (__memrchr_evex) + +END(MEMRCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
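The lzcnt return idiom called out in the commit message relies on two things: the end pointer is biased by -1 at entry, and lzcnt of a zero mask yields 32 (VEC_SIZE), so the length comparison doubles as the no-match check. A rough C equivalent, assuming a 32-bit match mask whose bit 31 corresponds to the last byte of the buffer (names are illustrative, not from the patch):

```
#include <stddef.h>
#include <stdint.h>

/* Sketch of L(ret_vec_x0_test).  endp points at the LAST byte of the
   buffer (the -1 bias in the asm), mask covers the 32 bytes ending at
   endp (bit 31 == byte at endp), len is the remaining length.  The
   asm does this in three instructions: lzcnt, cmp/jle, sub.  */
static void *
ret_vec_test (unsigned char *endp, uint32_t mask, size_t len)
{
  /* Hardware lzcnt gives 32 for a zero input; __builtin_clz (0) is
     undefined in C, hence the explicit guard here.  A zero mask thus
     always fails the bounds check -- no separate match test.  */
  unsigned dist = mask ? __builtin_clz (mask) : 32;
  if ((size_t) dist >= len)	/* Match, if any, precedes the buffer.  */
    return NULL;
  return endp - dist;
}
```

This is the saving the commit message describes: the old code needed a `test`/`jz` on the mask plus a separate length adjustment, while the new code folds both into the one `cmp` that was needed anyway.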
* [PATCH v3 6/8] x86: Optimize memrchr-avx2.S 2022-06-03 23:49 ` [PATCH v3 " Noah Goldstein ` (3 preceding siblings ...) 2022-06-03 23:49 ` [PATCH v3 5/8] x86: Optimize memrchr-evex.S Noah Goldstein @ 2022-06-03 23:49 ` Noah Goldstein 2022-06-03 23:50 ` [PATCH v3 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein 2022-06-03 23:50 ` [PATCH v3 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 23:49 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully 3. reuses logic more 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns which saves either a branch or multiple instructions. The total code size saving is: 306 bytes Geometric Mean of all benchmarks New / Old: 0.760 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the begining of the string (in first VEC). This case has roughly a 10-20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). This case has roughly a 15-45% speedup. Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S | 1 + sysdeps/x86_64/multiarch/memrchr-avx2.S | 538 ++++++++++---------- 2 files changed, 260 insertions(+), 279 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S index cea2d2a72d..5e9beeeef2 100644 --- a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S +++ b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S @@ -2,6 +2,7 @@ # define MEMRCHR __memrchr_avx2_rtm #endif +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2.S b/sysdeps/x86_64/multiarch/memrchr-avx2.S index ba2ce7cb03..6915e1c373 100644 --- a/sysdeps/x86_64/multiarch/memrchr-avx2.S +++ b/sysdeps/x86_64/multiarch/memrchr-avx2.S @@ -21,340 +21,320 @@ # include <sysdep.h> # ifndef MEMRCHR -# define MEMRCHR __memrchr_avx2 +# define MEMRCHR __memrchr_avx2 # endif # ifndef VZEROUPPER -# define VZEROUPPER vzeroupper +# define VZEROUPPER vzeroupper # endif +// abf-off # ifndef SECTION # define SECTION(p) p##.avx # endif +// abf-on + +# define VEC_SIZE 32 +# define PAGE_SIZE 4096 + .section SECTION(.text), "ax", @progbits +ENTRY(MEMRCHR) +# ifdef __ILP32__ + /* Clear upper bits. */ + and %RDX_LP, %RDX_LP +# else + test %RDX_LP, %RDX_LP +# endif + jz L(zero_0) -# define VEC_SIZE 32 - - .section SECTION(.text),"ax",@progbits -ENTRY (MEMRCHR) - /* Broadcast CHAR to YMM0. */ vmovd %esi, %xmm0 - vpbroadcastb %xmm0, %ymm0 - - sub $VEC_SIZE, %RDX_LP - jbe L(last_vec_or_less) - - add %RDX_LP, %RDI_LP + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a + correct page cross check and 2) it correctly sets up end ptr to be + subtract by lzcnt aligned. */ + leaq -1(%rdx, %rdi), %rax - /* Check the last VEC_SIZE bytes. */ - vpcmpeqb (%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - subq $(VEC_SIZE * 4), %rdi - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(aligned_more) + vpbroadcastb %xmm0, %ymm0 - /* Align data for aligned loads in the loop. 
*/ - addq $VEC_SIZE, %rdi - addq $VEC_SIZE, %rdx - andq $-VEC_SIZE, %rdi - subq %rcx, %rdx + /* Check if we can load 1x VEC without cross a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %eax + jz L(page_cross) + + vpcmpeqb -(VEC_SIZE - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + cmpq $VEC_SIZE, %rdx + ja L(more_1x_vec) + +L(ret_vec_x0_test): + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which + will gurantee edx (len) is less than it. */ + lzcntl %ecx, %ecx + + /* Hoist vzeroupper (not great for RTM) to save code size. This allows + all logic for edx (len) <= VEC_SIZE to fit in first cache line. */ + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret - .p2align 4 -L(aligned_more): - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpcmpeqb (%rdi), %ymm0, %ymm4 - vpmovmskb %ymm4, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - /* Align data to 4 * VEC_SIZE for loop with fewer branches. - There are some overlaps with above if data isn't aligned - to 4 * VEC_SIZE. */ - movl %edi, %ecx - andl $(VEC_SIZE * 4 - 1), %ecx - jz L(loop_4x_vec) - - addq $(VEC_SIZE * 4), %rdi - addq $(VEC_SIZE * 4), %rdx - andq $-(VEC_SIZE * 4), %rdi - subq %rcx, %rdx + /* Fits in aligning bytes of first cache line. */ +L(zero_0): + xorl %eax, %eax + ret - .p2align 4 -L(loop_4x_vec): - /* Compare 4 * VEC at a time forward. */ - subq $(VEC_SIZE * 4), %rdi - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - vmovdqa (%rdi), %ymm1 - vmovdqa VEC_SIZE(%rdi), %ymm2 - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm3 - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm4 - - vpcmpeqb %ymm1, %ymm0, %ymm1 - vpcmpeqb %ymm2, %ymm0, %ymm2 - vpcmpeqb %ymm3, %ymm0, %ymm3 - vpcmpeqb %ymm4, %ymm0, %ymm4 - - vpor %ymm1, %ymm2, %ymm5 - vpor %ymm3, %ymm4, %ymm6 - vpor %ymm5, %ymm6, %ymm5 - - vpmovmskb %ymm5, %eax - testl %eax, %eax - jz L(loop_4x_vec) - - /* There is a match. */ - vpmovmskb %ymm4, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpmovmskb %ymm1, %eax - bsrl %eax, %eax - addq %rdi, %rax + .p2align 4,, 9 +L(ret_vec_x0): + lzcntl %ecx, %ecx + subq %rcx, %rax L(return_vzeroupper): ZERO_UPPER_VEC_REGISTERS_RETURN - .p2align 4 -L(last_4x_vec_or_less): - addl $(VEC_SIZE * 4), %edx - cmpl $(VEC_SIZE * 2), %edx - jbe L(last_2x_vec) - - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x1_check) - cmpl $(VEC_SIZE * 3), %edx - jbe L(zero) - - vpcmpeqb (%rdi), %ymm0, %ymm4 - vpmovmskb %ymm4, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 4), %rdx - addq %rax, %rdx - jl L(zero) - addq %rdi, %rax - VZEROUPPER_RETURN - - .p2align 4 + .p2align 4,, 10 +L(more_1x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) + + /* Align rax (string pointer). 
*/ + andq $-VEC_SIZE, %rax + + /* Recompute remaining length after aligning. */ + movq %rax, %rdx + /* Need this comparison next no matter what. */ + vpcmpeqb -(VEC_SIZE)(%rax), %ymm0, %ymm1 + subq %rdi, %rdx + decq %rax + vpmovmskb %ymm1, %ecx + /* Fall through for short (hotter than length). */ + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) L(last_2x_vec): - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3_check) cmpl $VEC_SIZE, %edx - jbe L(zero) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 2), %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax - VZEROUPPER_RETURN - - .p2align 4 -L(last_vec_x0): - bsrl %eax, %eax - addq %rdi, %rax - VZEROUPPER_RETURN + jbe L(ret_vec_x0_test) + + testl %ecx, %ecx + jnz L(ret_vec_x0) + + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + /* 64-bit lzcnt. This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret - .p2align 4 -L(last_vec_x1): - bsrl %eax, %eax - addl $VEC_SIZE, %eax - addq %rdi, %rax - VZEROUPPER_RETURN - .p2align 4 -L(last_vec_x2): - bsrl %eax, %eax - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + /* Inexpensive place to put this regarding code size / target alignments + / ICache NLP. Necessary for 2-byte encoding of jump to page cross + case which inturn in necessray for hot path (len <= VEC_SIZE) to fit + in first cache line. */ +L(page_cross): + movq %rax, %rsi + andq $-VEC_SIZE, %rsi + vpcmpeqb (%rsi), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + movl %eax, %r8d + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ + notl %r8d + shlxl %r8d, %ecx, %ecx + cmpq %rdi, %rsi + ja L(more_1x_vec) + lzcntl %ecx, %ecx + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret + .p2align 4,, 11 +L(ret_vec_x1): + /* This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + subq %rcx, %rax VZEROUPPER_RETURN + .p2align 4,, 10 +L(more_2x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) - .p2align 4 -L(last_vec_x3): - bsrl %eax, %eax - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1) - .p2align 4 -L(last_vec_x1_check): - bsrl %eax, %eax - subq $(VEC_SIZE * 3), %rdx - addq %rax, %rdx - jl L(zero) - addl $VEC_SIZE, %eax - addq %rdi, %rax - VZEROUPPER_RETURN - .p2align 4 -L(last_vec_x3_check): - bsrl %eax, %eax - subq $VEC_SIZE, %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - VZEROUPPER_RETURN + /* Needed no matter what. */ + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - .p2align 4 -L(zero): - xorl %eax, %eax - VZEROUPPER_RETURN + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) + + cmpl $(VEC_SIZE * -1), %edx + jle L(ret_vec_x2_test) + +L(last_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) + + /* Needed no matter what. */ + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 3), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_2) + ret - .p2align 4 -L(null): + /* First in aligning bytes. 
*/ +L(zero_2): xorl %eax, %eax ret - .p2align 4 -L(last_vec_or_less_aligned): - movl %edx, %ecx + .p2align 4,, 4 +L(ret_vec_x2_test): + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_2) + ret - vpcmpeqb (%rdi), %ymm0, %ymm1 - movl $1, %edx - /* Support rdx << 32. */ - salq %cl, %rdx - subq $1, %rdx + .p2align 4,, 11 +L(ret_vec_x2): + /* ecx must be non-zero. */ + bsrl %ecx, %ecx + leaq (VEC_SIZE * -3 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - vpmovmskb %ymm1, %eax + .p2align 4,, 14 +L(ret_vec_x3): + /* ecx must be non-zero. */ + bsrl %ecx, %ecx + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - /* Remove the trailing bytes. */ - andl %edx, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - addq %rdi, %rax - VZEROUPPER_RETURN .p2align 4 -L(last_vec_or_less): - addl $VEC_SIZE, %edx +L(more_4x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) - /* Check for zero length. */ - testl %edx, %edx - jz L(null) + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(last_vec_or_less_aligned) + testl %ecx, %ecx + jnz L(ret_vec_x3) - movl %ecx, %esi - movl %ecx, %r8d - addl %edx, %esi - andq $-VEC_SIZE, %rdi + /* Check if near end before re-aligning (otherwise might do an + unnecissary loop iteration). */ + addq $-(VEC_SIZE * 4), %rax + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) - subl $VEC_SIZE, %esi - ja L(last_vec_2x_aligned) + /* Align rax to (VEC_SIZE - 1). */ + orq $(VEC_SIZE * 4 - 1), %rax + movq %rdi, %rdx + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because + lengths that overflow can be valid and break the comparison. */ + orq $(VEC_SIZE * 4 - 1), %rdx - /* Check the last VEC. */ - vpcmpeqb (%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - - /* Remove the leading and trailing bytes. */ - sarl %cl, %eax - movl %edx, %ecx + .p2align 4 +L(loop_4x_vec): + /* Need this comparison next no matter what. */ + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm2 + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm3 + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm4 - movl $1, %edx - sall %cl, %edx - subl $1, %edx + vpor %ymm1, %ymm2, %ymm2 + vpor %ymm3, %ymm4, %ymm4 + vpor %ymm2, %ymm4, %ymm4 + vpmovmskb %ymm4, %esi - andl %edx, %eax - testl %eax, %eax - jz L(zero) + testl %esi, %esi + jnz L(loop_end) - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax - VZEROUPPER_RETURN + addq $(VEC_SIZE * -4), %rax + cmpq %rdx, %rax + jne L(loop_4x_vec) - .p2align 4 -L(last_vec_2x_aligned): - movl %esi, %ecx + subl %edi, %edx + incl %edx - /* Check the last VEC. */ - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm1 +L(last_4x_vec): + /* Used no matter what. */ + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) - vpmovmskb %ymm1, %eax + testl %ecx, %ecx + jnz L(ret_vec_x0_end) - /* Remove the trailing bytes. */ - andl %edx, %eax + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) - testl %eax, %eax - jnz L(last_vec_x1) + /* Used no matter what. */ + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - /* Check the second last VEC. 
*/ - vpcmpeqb (%rdi), %ymm0, %ymm1 + cmpl $(VEC_SIZE * 3), %edx + ja L(last_vec) + + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + jbe L(ret0) + xorl %eax, %eax +L(ret0): + ret - movl %r8d, %ecx - vpmovmskb %ymm1, %eax + .p2align 4 +L(loop_end): + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x0_end) + + vpmovmskb %ymm2, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) + + vpmovmskb %ymm3, %ecx + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If ecx is non-zero + then CHAR in VEC3 and bsrq will use that position. */ + salq $32, %rcx + orq %rsi, %rcx + bsrq %rcx, %rcx + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - /* Remove the leading bytes. Must use unsigned right shift for - bsrl below. */ - shrl %cl, %eax - testl %eax, %eax - jz L(zero) + .p2align 4,, 4 +L(ret_vec_x1_end): + /* 64-bit version will automatically add 32 (VEC_SIZE). */ + lzcntq %rcx, %rcx + subq %rcx, %rax + VZEROUPPER_RETURN - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax + .p2align 4,, 4 +L(ret_vec_x0_end): + lzcntl %ecx, %ecx + subq %rcx, %rax VZEROUPPER_RETURN -END (MEMRCHR) + + /* 2 bytes until next cache line. */ +END(MEMRCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
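The `notl` + `shlx` fixup in L(page_cross) uses the identity -x = ~(x - 1): because shlx takes its shift count mod 32, complementing the already-biased end pointer directly yields the count that discards mask bits past the buffer end while preserving the lzcnt-from-the-end invariant the return paths depend on. A C sketch under those assumptions (names are illustrative):

```
#include <stdint.h>

/* Sketch of the page-cross mask fixup.  base = endp_m1 & -32 is the
   aligned load address; mask has bit i set if byte base + i matched.
   Bits for bytes past endp_m1 (the biased end pointer, i.e. the last
   valid byte) must be shifted out, and byte endp_m1 must end up at
   bit 31 so lzcnt still measures distance from the end.  */
static uint32_t
shift_out_tail (uint32_t mask, uintptr_t endp_m1)
{
  /* -endp == ~(endp - 1) == ~endp_m1, and shlx masks the count to 5
     bits, so the asm is just notl + shlxl.  */
  unsigned shift = (unsigned) (~endp_m1) & 31;
  return mask << shift;		/* Byte endp_m1 now sits at bit 31.  */
}
```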
* [PATCH v3 7/8] x86: Shrink code size of memchr-avx2.S 2022-06-03 23:49 ` [PATCH v3 " Noah Goldstein ` (4 preceding siblings ...) 2022-06-03 23:49 ` [PATCH v3 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein @ 2022-06-03 23:50 ` Noah Goldstein 2022-06-03 23:50 ` [PATCH v3 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 23:50 UTC (permalink / raw) To: libc-alpha This is not meant as a performance optimization. The previous code was far to liberal in aligning targets and wasted code size unnecissarily. The total code size saving is: 59 bytes There are no major changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 0.967 Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memchr-avx2-rtm.S | 1 + sysdeps/x86_64/multiarch/memchr-avx2.S | 109 +++++++++++---------- 2 files changed, 60 insertions(+), 50 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S index 87b076c7c4..c4d71938c5 100644 --- a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S +++ b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S @@ -2,6 +2,7 @@ # define MEMCHR __memchr_avx2_rtm #endif +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/multiarch/memchr-avx2.S b/sysdeps/x86_64/multiarch/memchr-avx2.S index 75bd7262e0..28a01280ec 100644 --- a/sysdeps/x86_64/multiarch/memchr-avx2.S +++ b/sysdeps/x86_64/multiarch/memchr-avx2.S @@ -57,7 +57,7 @@ # define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) .section SECTION(.text),"ax",@progbits -ENTRY (MEMCHR) +ENTRY_P2ALIGN (MEMCHR, 5) # ifndef USE_AS_RAWMEMCHR /* Check for zero length. */ # ifdef __ILP32__ @@ -87,12 +87,14 @@ ENTRY (MEMCHR) # endif testl %eax, %eax jz L(aligned_more) - tzcntl %eax, %eax + bsfl %eax, %eax addq %rdi, %rax - VZEROUPPER_RETURN +L(return_vzeroupper): + ZERO_UPPER_VEC_REGISTERS_RETURN + # ifndef USE_AS_RAWMEMCHR - .p2align 5 + .p2align 4 L(first_vec_x0): /* Check if first match was before length. */ tzcntl %eax, %eax @@ -100,58 +102,31 @@ L(first_vec_x0): /* NB: Multiply length by 4 to get byte count. */ sall $2, %edx # endif - xorl %ecx, %ecx + COND_VZEROUPPER + /* Use branch instead of cmovcc so L(first_vec_x0) fits in one fetch + block. branch here as opposed to cmovcc is not that costly. Common + usage of memchr is to check if the return was NULL (if string was + known to contain CHAR user would use rawmemchr). This branch will be + highly correlated with the user branch and can be used by most + modern branch predictors to predict the user branch. */ cmpl %eax, %edx - leaq (%rdi, %rax), %rax - cmovle %rcx, %rax - VZEROUPPER_RETURN - -L(null): - xorl %eax, %eax - ret -# endif - .p2align 4 -L(cross_page_boundary): - /* Save pointer before aligning as its original value is - necessary for computer return address if byte is found or - adjusting length if it is not and this is memchr. */ - movq %rdi, %rcx - /* Align data to VEC_SIZE - 1. ALGN_PTR_REG is rcx for memchr - and rdi for rawmemchr. */ - orq $(VEC_SIZE - 1), %ALGN_PTR_REG - VPCMPEQ -(VEC_SIZE - 1)(%ALGN_PTR_REG), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax -# ifndef USE_AS_RAWMEMCHR - /* Calculate length until end of page (length checked for a - match). */ - leaq 1(%ALGN_PTR_REG), %rsi - subq %RRAW_PTR_REG, %rsi -# ifdef USE_AS_WMEMCHR - /* NB: Divide bytes by 4 to get wchar_t count. */ - shrl $2, %esi -# endif -# endif - /* Remove the leading bytes. 
*/ - sarxl %ERAW_PTR_REG, %eax, %eax -# ifndef USE_AS_RAWMEMCHR - /* Check the end of data. */ - cmpq %rsi, %rdx - jbe L(first_vec_x0) + jle L(null) + addq %rdi, %rax + ret # endif - testl %eax, %eax - jz L(cross_page_continue) - tzcntl %eax, %eax - addq %RRAW_PTR_REG, %rax -L(return_vzeroupper): - ZERO_UPPER_VEC_REGISTERS_RETURN - .p2align 4 + .p2align 4,, 10 L(first_vec_x1): - tzcntl %eax, %eax + bsfl %eax, %eax incq %rdi addq %rdi, %rax VZEROUPPER_RETURN - +# ifndef USE_AS_RAWMEMCHR + /* First in aligning bytes here. */ +L(null): + xorl %eax, %eax + ret +# endif .p2align 4 L(first_vec_x2): tzcntl %eax, %eax @@ -340,7 +315,7 @@ L(first_vec_x1_check): incq %rdi addq %rdi, %rax VZEROUPPER_RETURN - .p2align 4 + .p2align 4,, 6 L(set_zero_end): xorl %eax, %eax VZEROUPPER_RETURN @@ -428,5 +403,39 @@ L(last_vec_x3): VZEROUPPER_RETURN # endif + .p2align 4 +L(cross_page_boundary): + /* Save pointer before aligning as its original value is necessary for + computer return address if byte is found or adjusting length if it + is not and this is memchr. */ + movq %rdi, %rcx + /* Align data to VEC_SIZE. ALGN_PTR_REG is rcx for memchr and rdi for + rawmemchr. */ + andq $-VEC_SIZE, %ALGN_PTR_REG + VPCMPEQ (%ALGN_PTR_REG), %ymm0, %ymm1 + vpmovmskb %ymm1, %eax +# ifndef USE_AS_RAWMEMCHR + /* Calculate length until end of page (length checked for a match). */ + leal VEC_SIZE(%ALGN_PTR_REG), %esi + subl %ERAW_PTR_REG, %esi +# ifdef USE_AS_WMEMCHR + /* NB: Divide bytes by 4 to get wchar_t count. */ + shrl $2, %esi +# endif +# endif + /* Remove the leading bytes. */ + sarxl %ERAW_PTR_REG, %eax, %eax +# ifndef USE_AS_RAWMEMCHR + /* Check the end of data. */ + cmpq %rsi, %rdx + jbe L(first_vec_x0) +# endif + testl %eax, %eax + jz L(cross_page_continue) + bsfl %eax, %eax + addq %RRAW_PTR_REG, %rax + VZEROUPPER_RETURN + + END (MEMCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
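The cmov-to-branch change in L(first_vec_x0) trades a branchless select for a branch that mirrors the caller's own NULL test. A hedged C sketch of the two shapes — the real selection happens in asm, and a compiler is of course free to pick either form for code like this:

```
#include <stddef.h>

/* 'pos' is the tzcnt of the match mask within the first vector,
   'len' the user length; names are illustrative.  */

/* Old shape: branchless select (xor + cmovle in the asm).  It pays
   the extra zeroing and select cost on every call, hit or miss.  */
static char *
select_cmov (char *buf, unsigned pos, unsigned len)
{
  char *hit = buf + pos;
  char *res = NULL;
  res = (pos < len) ? hit : res;	/* cmovle in the old code.  */
  return res;
}

/* New shape: a plain branch.  It mirrors the caller's own
   'if (memchr (...) != NULL)' test, so per the patch comment the
   predictor state for the caller's branch covers this one too.  */
static char *
select_branch (char *buf, unsigned pos, unsigned len)
{
  if (pos >= len)
    return NULL;
  return buf + pos;
}
```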
* [PATCH v3 8/8] x86: Shrink code size of memchr-evex.S 2022-06-03 23:49 ` [PATCH v3 " Noah Goldstein ` (5 preceding siblings ...) 2022-06-03 23:50 ` [PATCH v3 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein @ 2022-06-03 23:50 ` Noah Goldstein 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 23:50 UTC (permalink / raw) To: libc-alpha This is not meant as a performance optimization. The previous code was far to liberal in aligning targets and wasted code size unnecissarily. The total code size saving is: 64 bytes There are no non-negligible changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 1.000 Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memchr-evex.S | 46 ++++++++++++++------------ 1 file changed, 25 insertions(+), 21 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memchr-evex.S b/sysdeps/x86_64/multiarch/memchr-evex.S index cfaf02907d..0fd11b7632 100644 --- a/sysdeps/x86_64/multiarch/memchr-evex.S +++ b/sysdeps/x86_64/multiarch/memchr-evex.S @@ -88,7 +88,7 @@ # define PAGE_SIZE 4096 .section SECTION(.text),"ax",@progbits -ENTRY (MEMCHR) +ENTRY_P2ALIGN (MEMCHR, 6) # ifndef USE_AS_RAWMEMCHR /* Check for zero length. */ test %RDX_LP, %RDX_LP @@ -131,22 +131,24 @@ L(zero): xorl %eax, %eax ret - .p2align 5 + .p2align 4 L(first_vec_x0): - /* Check if first match was before length. */ - tzcntl %eax, %eax - xorl %ecx, %ecx - cmpl %eax, %edx - leaq (%rdi, %rax, CHAR_SIZE), %rax - cmovle %rcx, %rax + /* Check if first match was before length. NB: tzcnt has false data- + dependency on destination. eax already had a data-dependency on esi + so this should have no affect here. */ + tzcntl %eax, %esi +# ifdef USE_AS_WMEMCHR + leaq (%rdi, %rsi, CHAR_SIZE), %rdi +# else + addq %rsi, %rdi +# endif + xorl %eax, %eax + cmpl %esi, %edx + cmovg %rdi, %rax ret -# else - /* NB: first_vec_x0 is 17 bytes which will leave - cross_page_boundary (which is relatively cold) close enough - to ideal alignment. So only realign L(cross_page_boundary) if - rawmemchr. */ - .p2align 4 # endif + + .p2align 4 L(cross_page_boundary): /* Save pointer before aligning as its original value is necessary for computer return address if byte is found or @@ -400,10 +402,14 @@ L(last_2x_vec): L(zero_end): ret +L(set_zero_end): + xorl %eax, %eax + ret .p2align 4 L(first_vec_x1_check): - tzcntl %eax, %eax + /* eax must be non-zero. Use bsfl to save code size. */ + bsfl %eax, %eax /* Adjust length. */ subl $-(CHAR_PER_VEC * 4), %edx /* Check if match within remaining length. */ @@ -412,9 +418,6 @@ L(first_vec_x1_check): /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax ret -L(set_zero_end): - xorl %eax, %eax - ret .p2align 4 L(loop_4x_vec_end): @@ -464,7 +467,7 @@ L(loop_4x_vec_end): # endif ret - .p2align 4 + .p2align 4,, 10 L(last_vec_x1_return): tzcntl %eax, %eax # if defined USE_AS_WMEMCHR || RET_OFFSET != 0 @@ -496,6 +499,7 @@ L(last_vec_x3_return): # endif # ifndef USE_AS_RAWMEMCHR + .p2align 4,, 5 L(last_4x_vec_or_less_cmpeq): VPCMP $0, (VEC_SIZE * 5)(%rdi), %YMMMATCH, %k0 kmovd %k0, %eax @@ -546,7 +550,7 @@ L(last_4x_vec): # endif andl %ecx, %eax jz L(zero_end2) - tzcntl %eax, %eax + bsfl %eax, %eax leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax L(zero_end2): ret @@ -562,6 +566,6 @@ L(last_vec_x3): leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax ret # endif - + /* 7 bytes from next cache line. */ END (MEMCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
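The tzcnt-to-bsf swaps in this and the previous patch lean on two facts: bsf and tzcnt return the same result for non-zero input, and tzcnt is encoded as `rep bsf` (one F3 prefix byte longer), hence the code-size saving. The false output dependency of tzcnt on some microarchitectures is why L(first_vec_x0) above writes into esi, which already carries a dependency. A small C note under the non-zero precondition (illustrative only):

```
#include <stdint.h>

/* bsf and tzcnt agree whenever the source is non-zero, which is
   guaranteed on the converted paths -- the mask was tested (or is
   known non-empty) just before the bit scan.  They differ only for a
   zero source: bsf leaves the destination undefined (unmodified in
   practice), tzcnt writes the operand width.  */
static int
trailing_zeros_nonzero (uint32_t mask)
{
  /* Compiles to bsf or tzcnt; only called with mask != 0, matching
     the asm's precondition.  */
  return __builtin_ctz (mask);
}
```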
* [PATCH v4 1/8] x86: Create header for VEC classes in x86 strings library 2022-06-03 4:42 ` [PATCH v1 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-03 23:49 ` [PATCH v3 " Noah Goldstein @ 2022-06-06 22:37 ` Noah Goldstein 2022-06-06 22:37 ` [PATCH v4 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein ` (6 more replies) 2022-06-07 4:05 ` [PATCH v5 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-07 4:11 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 4 siblings, 7 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-06 22:37 UTC (permalink / raw) To: libc-alpha This patch does not touch any existing code and is only meant to be a tool for future patches so that simple source files can more easily be maintained to target multiple VEC classes. There is no difference in the objdump of libc.so before and after this patch. --- sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 34 ++++++++ sysdeps/x86_64/multiarch/avx-vecs.h | 47 +++++++++++ sysdeps/x86_64/multiarch/evex-vecs-common.h | 39 +++++++++ sysdeps/x86_64/multiarch/evex256-vecs.h | 35 ++++++++ sysdeps/x86_64/multiarch/evex512-vecs.h | 35 ++++++++ sysdeps/x86_64/multiarch/sse2-vecs.h | 47 +++++++++++ sysdeps/x86_64/multiarch/vec-macros.h | 90 +++++++++++++++++++++ 7 files changed, 327 insertions(+) create mode 100644 sysdeps/x86_64/multiarch/avx-rtm-vecs.h create mode 100644 sysdeps/x86_64/multiarch/avx-vecs.h create mode 100644 sysdeps/x86_64/multiarch/evex-vecs-common.h create mode 100644 sysdeps/x86_64/multiarch/evex256-vecs.h create mode 100644 sysdeps/x86_64/multiarch/evex512-vecs.h create mode 100644 sysdeps/x86_64/multiarch/sse2-vecs.h create mode 100644 sysdeps/x86_64/multiarch/vec-macros.h diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h new file mode 100644 index 0000000000..3f531dd47f --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h @@ -0,0 +1,34 @@ +/* Common config for AVX-RTM VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#ifndef _AVX_RTM_VECS_H +#define _AVX_RTM_VECS_H 1 + +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST + +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) + +#define USE_WITH_RTM 1 +#include "avx-vecs.h" + +#undef SECTION +#define SECTION(p) p##.avx.rtm + +#endif diff --git a/sysdeps/x86_64/multiarch/avx-vecs.h b/sysdeps/x86_64/multiarch/avx-vecs.h new file mode 100644 index 0000000000..89680f5db8 --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx-vecs.h @@ -0,0 +1,47 @@ +/* Common config for AVX VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX_VECS_H +#define _AVX_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 32 +#include "vec-macros.h" + +#define USE_WITH_AVX 1 +#define SECTION(p) p##.avx + +/* 4-byte mov instructions with AVX2. */ +#define MOV_SIZE 4 +/* 1 (ret) + 3 (vzeroupper). */ +#define RET_SIZE 4 +#define VZEROUPPER vzeroupper + +#define VMOVU vmovdqu +#define VMOVA vmovdqa +#define VMOVNT vmovntdq + +/* Often need to access xmm portion. */ +#define VEC_xmm VEC_any_xmm +#define VEC VEC_any_ymm + +#endif diff --git a/sysdeps/x86_64/multiarch/evex-vecs-common.h b/sysdeps/x86_64/multiarch/evex-vecs-common.h new file mode 100644 index 0000000000..99806ebcd7 --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex-vecs-common.h @@ -0,0 +1,39 @@ +/* Common config for EVEX256 and EVEX512 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX_VECS_COMMON_H +#define _EVEX_VECS_COMMON_H 1 + +#include "vec-macros.h" + +/* 6-byte mov instructions with EVEX. */ +#define MOV_SIZE 6 +/* No vzeroupper needed. 
*/ +#define RET_SIZE 1 +#define VZEROUPPER + +#define VMOVU vmovdqu64 +#define VMOVA vmovdqa64 +#define VMOVNT vmovntdq + +#define VEC_xmm VEC_hi_xmm +#define VEC_ymm VEC_hi_ymm +#define VEC_zmm VEC_hi_zmm + +#endif diff --git a/sysdeps/x86_64/multiarch/evex256-vecs.h b/sysdeps/x86_64/multiarch/evex256-vecs.h new file mode 100644 index 0000000000..222ba46dc7 --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex256-vecs.h @@ -0,0 +1,35 @@ +/* Common config for EVEX256 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX256_VECS_H +#define _EVEX256_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 32 +#include "evex-vecs-common.h" + +#define USE_WITH_EVEX256 1 +#define SECTION(p) p##.evex + +#define VEC VEC_ymm + +#endif diff --git a/sysdeps/x86_64/multiarch/evex512-vecs.h b/sysdeps/x86_64/multiarch/evex512-vecs.h new file mode 100644 index 0000000000..d1784d5368 --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex512-vecs.h @@ -0,0 +1,35 @@ +/* Common config for EVEX512 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX512_VECS_H +#define _EVEX512_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 64 +#include "evex-vecs-common.h" + +#define USE_WITH_EVEX512 1 +#define SECTION(p) p##.evex512 + +#define VEC VEC_zmm + +#endif diff --git a/sysdeps/x86_64/multiarch/sse2-vecs.h b/sysdeps/x86_64/multiarch/sse2-vecs.h new file mode 100644 index 0000000000..2b77a59d56 --- /dev/null +++ b/sysdeps/x86_64/multiarch/sse2-vecs.h @@ -0,0 +1,47 @@ +/* Common config for SSE2 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _SSE2_VECS_H +#define _SSE2_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 16 +#include "vec-macros.h" + +#define USE_WITH_SSE2 1 +#define SECTION(p) p + +/* 3-byte mov instructions with SSE2. */ +#define MOV_SIZE 3 +/* No vzeroupper needed. */ +#define RET_SIZE 1 +#define VZEROUPPER + +#define VMOVU movups +#define VMOVA movaps +#define VMOVNT movntdq + +#define VEC_xmm VEC_any_xmm +#define VEC VEC_any_xmm + + +#endif diff --git a/sysdeps/x86_64/multiarch/vec-macros.h b/sysdeps/x86_64/multiarch/vec-macros.h new file mode 100644 index 0000000000..9f3ffecede --- /dev/null +++ b/sysdeps/x86_64/multiarch/vec-macros.h @@ -0,0 +1,90 @@ +/* Macro helpers for VEC_{type}({vec_num}) + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _VEC_MACROS_H +#define _VEC_MACROS_H 1 + +#ifndef VEC_SIZE +# error "Never include this file directly. Always include a vector config." +#endif + +/* Defines so we can use SSE2 / AVX2 / EVEX / EVEX512 encoding with same + VEC(N) values. 
*/ +#define VEC_hi_xmm0 xmm16 +#define VEC_hi_xmm1 xmm17 +#define VEC_hi_xmm2 xmm18 +#define VEC_hi_xmm3 xmm19 +#define VEC_hi_xmm4 xmm20 +#define VEC_hi_xmm5 xmm21 +#define VEC_hi_xmm6 xmm22 +#define VEC_hi_xmm7 xmm23 +#define VEC_hi_xmm8 xmm24 +#define VEC_hi_xmm9 xmm25 +#define VEC_hi_xmm10 xmm26 +#define VEC_hi_xmm11 xmm27 +#define VEC_hi_xmm12 xmm28 +#define VEC_hi_xmm13 xmm29 +#define VEC_hi_xmm14 xmm30 +#define VEC_hi_xmm15 xmm31 + +#define VEC_hi_ymm0 ymm16 +#define VEC_hi_ymm1 ymm17 +#define VEC_hi_ymm2 ymm18 +#define VEC_hi_ymm3 ymm19 +#define VEC_hi_ymm4 ymm20 +#define VEC_hi_ymm5 ymm21 +#define VEC_hi_ymm6 ymm22 +#define VEC_hi_ymm7 ymm23 +#define VEC_hi_ymm8 ymm24 +#define VEC_hi_ymm9 ymm25 +#define VEC_hi_ymm10 ymm26 +#define VEC_hi_ymm11 ymm27 +#define VEC_hi_ymm12 ymm28 +#define VEC_hi_ymm13 ymm29 +#define VEC_hi_ymm14 ymm30 +#define VEC_hi_ymm15 ymm31 + +#define VEC_hi_zmm0 zmm16 +#define VEC_hi_zmm1 zmm17 +#define VEC_hi_zmm2 zmm18 +#define VEC_hi_zmm3 zmm19 +#define VEC_hi_zmm4 zmm20 +#define VEC_hi_zmm5 zmm21 +#define VEC_hi_zmm6 zmm22 +#define VEC_hi_zmm7 zmm23 +#define VEC_hi_zmm8 zmm24 +#define VEC_hi_zmm9 zmm25 +#define VEC_hi_zmm10 zmm26 +#define VEC_hi_zmm11 zmm27 +#define VEC_hi_zmm12 zmm28 +#define VEC_hi_zmm13 zmm29 +#define VEC_hi_zmm14 zmm30 +#define VEC_hi_zmm15 zmm31 + +#define PRIMITIVE_VEC(vec, num) vec##num + +#define VEC_any_xmm(i) PRIMITIVE_VEC(xmm, i) +#define VEC_any_ymm(i) PRIMITIVE_VEC(ymm, i) +#define VEC_any_zmm(i) PRIMITIVE_VEC(zmm, i) + +#define VEC_hi_xmm(i) PRIMITIVE_VEC(VEC_hi_xmm, i) +#define VEC_hi_ymm(i) PRIMITIVE_VEC(VEC_hi_ymm, i) +#define VEC_hi_zmm(i) PRIMITIVE_VEC(VEC_hi_zmm, i) + +#endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
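How the VEC(N) indirection in vec-macros.h pays off can be seen by expanding it for two configs: under evex256-vecs.h, VEC(1) lands on an EVEX-only register (ymm17), which is what lets RET_SIZE stay 1 with no vzeroupper, while under avx-vecs.h the same VEC(1) lands on the legacy ymm1. A self-contained C demonstration of the expansion — the stringify helpers and main are not part of the patch, and only the macros needed for the example are reproduced:

```
#include <stdio.h>

/* Subset of vec-macros.h / the per-config headers.  */
#define PRIMITIVE_VEC(vec, num) vec##num
#define VEC_hi_ymm1 ymm17
#define VEC_hi_ymm(i) PRIMITIVE_VEC (VEC_hi_ymm, i)
#define VEC_any_ymm(i) PRIMITIVE_VEC (ymm, i)

#define STR_(x) #x
#define STR(x) STR_ (x)

int
main (void)
{
  /* evex256-vecs.h: #define VEC VEC_ymm, i.e. VEC_hi_ymm.  */
  printf ("evex256 VEC(1) -> %s\n", STR (VEC_hi_ymm (1)));	/* ymm17 */
  /* avx-vecs.h: #define VEC VEC_any_ymm.  */
  printf ("avx     VEC(1) -> %s\n", STR (VEC_any_ymm (1)));	/* ymm1 */
  return 0;
}
```

So an implementation written once against VEC(N), VMOVU, and SECTION compiles unchanged into the sse2, avx, avx-rtm, evex256, and evex512 variants, which is exactly the maintenance goal stated in the commit message.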
* [PATCH v4 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` 2022-06-06 22:37 ` [PATCH v4 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein @ 2022-06-06 22:37 ` Noah Goldstein 2022-06-07 2:45 ` H.J. Lu 2022-06-06 22:37 ` [PATCH v4 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein ` (5 subsequent siblings) 6 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-06 22:37 UTC (permalink / raw) To: libc-alpha The RTM vzeroupper mitigation has no way of replacing inline vzeroupper not before a return. This can be useful when hoisting a vzeroupper to save code size for example: ``` L(foo): cmpl %eax, %edx jz L(bar) tzcntl %eax, %eax addq %rdi, %rax VZEROUPPER_RETURN L(bar): xorl %eax, %eax VZEROUPPER_RETURN ``` Can become: ``` L(foo): COND_VZEROUPPER cmpl %eax, %edx jz L(bar) tzcntl %eax, %eax addq %rdi, %rax ret L(bar): xorl %eax, %eax ret ``` This code does not change any existing functionality. There is no difference in the objdump of libc.so before and after this patch. --- sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 1 + sysdeps/x86_64/sysdep.h | 18 ++++++++++++++++++ 2 files changed, 19 insertions(+) diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h index 3f531dd47f..6ca9f5e6ba 100644 --- a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h @@ -20,6 +20,7 @@ #ifndef _AVX_RTM_VECS_H #define _AVX_RTM_VECS_H 1 +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h index f14d50786d..4f512d5566 100644 --- a/sysdeps/x86_64/sysdep.h +++ b/sysdeps/x86_64/sysdep.h @@ -106,6 +106,24 @@ lose: \ vzeroupper; \ ret +/* Can be used to replace vzeroupper that is not directly before a + return. This is useful when hoisting a vzeroupper from multiple + return paths to decrease the total number of vzerouppers and code + size. */ +#define COND_VZEROUPPER_XTEST \ + xtest; \ + jz 1f; \ + vzeroall; \ + jmp 2f; \ +1: \ + vzeroupper; \ +2: + +/* In RTM define this as COND_VZEROUPPER_XTEST. */ +#ifndef COND_VZEROUPPER +# define COND_VZEROUPPER vzeroupper +#endif + /* Zero upper vector registers and return. */ #ifndef ZERO_UPPER_VEC_REGISTERS_RETURN # define ZERO_UPPER_VEC_REGISTERS_RETURN \ -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` 2022-06-06 22:37 ` [PATCH v4 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein @ 2022-06-07 2:45 ` H.J. Lu 2022-07-14 2:12 ` Sunil Pandey 0 siblings, 1 reply; 82+ messages in thread From: H.J. Lu @ 2022-06-07 2:45 UTC (permalink / raw) To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 3:37 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > The RTM vzeroupper mitigation has no way of replacing inline > vzeroupper not before a return. > > This can be useful when hoisting a vzeroupper to save code size > for example: > > ``` > L(foo): > cmpl %eax, %edx > jz L(bar) > tzcntl %eax, %eax > addq %rdi, %rax > VZEROUPPER_RETURN > > L(bar): > xorl %eax, %eax > VZEROUPPER_RETURN > ``` > > Can become: > > ``` > L(foo): > COND_VZEROUPPER > cmpl %eax, %edx > jz L(bar) > tzcntl %eax, %eax > addq %rdi, %rax > ret > > L(bar): > xorl %eax, %eax > ret > ``` > > This code does not change any existing functionality. > > There is no difference in the objdump of libc.so before and after this > patch. > --- > sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 1 + > sysdeps/x86_64/sysdep.h | 18 ++++++++++++++++++ > 2 files changed, 19 insertions(+) > > diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > index 3f531dd47f..6ca9f5e6ba 100644 > --- a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > @@ -20,6 +20,7 @@ > #ifndef _AVX_RTM_VECS_H > #define _AVX_RTM_VECS_H 1 > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h > index f14d50786d..4f512d5566 100644 > --- a/sysdeps/x86_64/sysdep.h > +++ b/sysdeps/x86_64/sysdep.h > @@ -106,6 +106,24 @@ lose: \ > vzeroupper; \ > ret > > +/* Can be used to replace vzeroupper that is not directly before a > + return. This is useful when hoisting a vzeroupper from multiple > + return paths to decrease the total number of vzerouppers and code > + size. */ > +#define COND_VZEROUPPER_XTEST \ > + xtest; \ > + jz 1f; \ > + vzeroall; \ > + jmp 2f; \ > +1: \ > + vzeroupper; \ > +2: > + > +/* In RTM define this as COND_VZEROUPPER_XTEST. */ > +#ifndef COND_VZEROUPPER > +# define COND_VZEROUPPER vzeroupper > +#endif > + > /* Zero upper vector registers and return. */ > #ifndef ZERO_UPPER_VEC_REGISTERS_RETURN > # define ZERO_UPPER_VEC_REGISTERS_RETURN \ > -- > 2.34.1 > LGTM. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Thanks. -- H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` 2022-06-07 2:45 ` H.J. Lu @ 2022-07-14 2:12 ` Sunil Pandey 0 siblings, 0 replies; 82+ messages in thread From: Sunil Pandey @ 2022-07-14 2:12 UTC (permalink / raw) To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library On Mon, Jun 6, 2022 at 7:46 PM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Mon, Jun 6, 2022 at 3:37 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > The RTM vzeroupper mitigation has no way of replacing inline > > vzeroupper not before a return. > > > > This can be useful when hoisting a vzeroupper to save code size > > for example: > > > > ``` > > L(foo): > > cmpl %eax, %edx > > jz L(bar) > > tzcntl %eax, %eax > > addq %rdi, %rax > > VZEROUPPER_RETURN > > > > L(bar): > > xorl %eax, %eax > > VZEROUPPER_RETURN > > ``` > > > > Can become: > > > > ``` > > L(foo): > > COND_VZEROUPPER > > cmpl %eax, %edx > > jz L(bar) > > tzcntl %eax, %eax > > addq %rdi, %rax > > ret > > > > L(bar): > > xorl %eax, %eax > > ret > > ``` > > > > This code does not change any existing functionality. > > > > There is no difference in the objdump of libc.so before and after this > > patch. > > --- > > sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 1 + > > sysdeps/x86_64/sysdep.h | 18 ++++++++++++++++++ > > 2 files changed, 19 insertions(+) > > > > diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > index 3f531dd47f..6ca9f5e6ba 100644 > > --- a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > @@ -20,6 +20,7 @@ > > #ifndef _AVX_RTM_VECS_H > > #define _AVX_RTM_VECS_H 1 > > > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > > > diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h > > index f14d50786d..4f512d5566 100644 > > --- a/sysdeps/x86_64/sysdep.h > > +++ b/sysdeps/x86_64/sysdep.h > > @@ -106,6 +106,24 @@ lose: \ > > vzeroupper; \ > > ret > > > > +/* Can be used to replace vzeroupper that is not directly before a > > + return. This is useful when hoisting a vzeroupper from multiple > > + return paths to decrease the total number of vzerouppers and code > > + size. */ > > +#define COND_VZEROUPPER_XTEST \ > > + xtest; \ > > + jz 1f; \ > > + vzeroall; \ > > + jmp 2f; \ > > +1: \ > > + vzeroupper; \ > > +2: > > + > > +/* In RTM define this as COND_VZEROUPPER_XTEST. */ > > +#ifndef COND_VZEROUPPER > > +# define COND_VZEROUPPER vzeroupper > > +#endif > > + > > /* Zero upper vector registers and return. */ > > #ifndef ZERO_UPPER_VEC_REGISTERS_RETURN > > # define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > -- > > 2.34.1 > > > > LGTM. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH v4 3/8] Benchtests: Improve memrchr benchmarks 2022-06-06 22:37 ` [PATCH v4 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-06 22:37 ` [PATCH v4 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein @ 2022-06-06 22:37 ` Noah Goldstein 2022-06-07 2:44 ` H.J. Lu 2022-06-06 22:37 ` [PATCH v4 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein ` (4 subsequent siblings) 6 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-06 22:37 UTC (permalink / raw) To: libc-alpha Add a second iteration for memrchr to set `pos` starting from the end of the buffer. Previously `pos` was only set relative to the begining of the buffer. This isn't really useful for memchr because the begining of the search space is (buf + len). --- benchtests/bench-memchr.c | 110 ++++++++++++++++++++++---------------- 1 file changed, 65 insertions(+), 45 deletions(-) diff --git a/benchtests/bench-memchr.c b/benchtests/bench-memchr.c index 4d7212332f..0facda2fa0 100644 --- a/benchtests/bench-memchr.c +++ b/benchtests/bench-memchr.c @@ -76,7 +76,7 @@ do_one_test (json_ctx_t *json_ctx, impl_t *impl, const CHAR *s, int c, static void do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, - int seek_char) + int seek_char, int invert_pos) { size_t i; @@ -96,7 +96,10 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, if (pos < len) { - buf[align + pos] = seek_char; + if (invert_pos) + buf[align + len - pos] = seek_char; + else + buf[align + pos] = seek_char; buf[align + len] = -seek_char; } else @@ -109,6 +112,7 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, json_attr_uint (json_ctx, "pos", pos); json_attr_uint (json_ctx, "len", len); json_attr_uint (json_ctx, "seek_char", seek_char); + json_attr_uint (json_ctx, "invert_pos", invert_pos); json_array_begin (json_ctx, "timings"); @@ -123,6 +127,7 @@ int test_main (void) { size_t i; + int repeats; json_ctx_t json_ctx; test_init (); @@ -142,53 +147,68 @@ test_main (void) json_array_begin (&json_ctx, "results"); - for (i = 1; i < 8; ++i) + for (repeats = 0; repeats < 2; ++repeats) { - do_test (&json_ctx, 0, 16 << i, 2048, 23); - do_test (&json_ctx, i, 64, 256, 23); - do_test (&json_ctx, 0, 16 << i, 2048, 0); - do_test (&json_ctx, i, 64, 256, 0); - - do_test (&json_ctx, getpagesize () - 15, 64, 256, 0); + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, 0, 16 << i, 2048, 23, repeats); + do_test (&json_ctx, i, 64, 256, 23, repeats); + do_test (&json_ctx, 0, 16 << i, 2048, 0, repeats); + do_test (&json_ctx, i, 64, 256, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, 64, 256, 0, repeats); #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr. */ - do_test (&json_ctx, 0, i, 256, 23); - do_test (&json_ctx, 0, i, 256, 0); - do_test (&json_ctx, i, i, 256, 23); - do_test (&json_ctx, i, i, 256, 0); + /* Also test the position close to the beginning for memrchr. 
*/ + do_test (&json_ctx, 0, i, 256, 23, repeats); + do_test (&json_ctx, 0, i, 256, 0, repeats); + do_test (&json_ctx, i, i, 256, 23, repeats); + do_test (&json_ctx, i, i, 256, 0, repeats); #endif - } - for (i = 1; i < 8; ++i) - { - do_test (&json_ctx, i, i << 5, 192, 23); - do_test (&json_ctx, i, i << 5, 192, 0); - do_test (&json_ctx, i, i << 5, 256, 23); - do_test (&json_ctx, i, i << 5, 256, 0); - do_test (&json_ctx, i, i << 5, 512, 23); - do_test (&json_ctx, i, i << 5, 512, 0); - - do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23); - } - for (i = 1; i < 32; ++i) - { - do_test (&json_ctx, 0, i, i + 1, 23); - do_test (&json_ctx, 0, i, i + 1, 0); - do_test (&json_ctx, i, i, i + 1, 23); - do_test (&json_ctx, i, i, i + 1, 0); - do_test (&json_ctx, 0, i, i - 1, 23); - do_test (&json_ctx, 0, i, i - 1, 0); - do_test (&json_ctx, i, i, i - 1, 23); - do_test (&json_ctx, i, i, i - 1, 0); - - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23); - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0); - - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23); - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0); + } + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, i, i << 5, 192, 23, repeats); + do_test (&json_ctx, i, i << 5, 192, 0, repeats); + do_test (&json_ctx, i, i << 5, 256, 23, repeats); + do_test (&json_ctx, i, i << 5, 256, 0, repeats); + do_test (&json_ctx, i, i << 5, 512, 23, repeats); + do_test (&json_ctx, i, i << 5, 512, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23, repeats); + } + for (i = 1; i < 32; ++i) + { + do_test (&json_ctx, 0, i, i + 1, 23, repeats); + do_test (&json_ctx, 0, i, i + 1, 0, repeats); + do_test (&json_ctx, i, i, i + 1, 23, repeats); + do_test (&json_ctx, i, i, i + 1, 0, repeats); + do_test (&json_ctx, 0, i, i - 1, 23, repeats); + do_test (&json_ctx, 0, i, i - 1, 0, repeats); + do_test (&json_ctx, i, i, i - 1, 23, repeats); + do_test (&json_ctx, i, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0, repeats); + #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr. */ - do_test (&json_ctx, 0, 1, i + 1, 23); - do_test (&json_ctx, 0, 2, i + 1, 0); + do_test (&json_ctx, 0, 1, i + 1, 23, repeats); + do_test (&json_ctx, 0, 2, i + 1, 0, repeats); +#endif + } +#ifndef USE_AS_MEMRCHR + break; #endif } -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
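To make the new `invert_pos` parameter concrete, here is a small standalone program (not part of the patch; it assumes the GNU `memrchr` extension) showing the end-relative placement the second benchmark iteration now exercises:

```c
#define _GNU_SOURCE		/* memrchr is a GNU extension.  */
#include <string.h>
#include <stdio.h>

int
main (void)
{
  char buf[256] = { 0 };
  size_t len = sizeof buf, pos = 5;

  /* invert_pos placement: the match sits `pos` bytes from the END of
     the buffer, which is where memrchr begins scanning.  */
  buf[len - pos] = 'x';

  char *hit = memrchr (buf, 'x', len);
  printf ("match %zu bytes before the end\n", len - (size_t) (hit - buf));
  return 0;
}
```

With the old placement (`buf[align + pos]`) the same pos = 5, len = 256 case puts the match near the start of the buffer, so memrchr scans almost all 256 bytes from the end before hitting it. The two placements therefore time very different parts of the implementation, which is the point of adding the second iteration.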
* Re: [PATCH v4 3/8] Benchtests: Improve memrchr benchmarks 2022-06-06 22:37 ` [PATCH v4 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein @ 2022-06-07 2:44 ` H.J. Lu 2022-06-07 4:10 ` Noah Goldstein 0 siblings, 1 reply; 82+ messages in thread From: H.J. Lu @ 2022-06-07 2:44 UTC (permalink / raw) To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 3:37 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > Add a second iteration for memrchr to set `pos` starting from the end > of the buffer. > > Previously `pos` was only set relative to the begining of the beginning > buffer. This isn't really useful for memchr because the beginning memrchr > of the search space is (buf + len). > --- > benchtests/bench-memchr.c | 110 ++++++++++++++++++++++---------------- > 1 file changed, 65 insertions(+), 45 deletions(-) > > diff --git a/benchtests/bench-memchr.c b/benchtests/bench-memchr.c > index 4d7212332f..0facda2fa0 100644 > --- a/benchtests/bench-memchr.c > +++ b/benchtests/bench-memchr.c > @@ -76,7 +76,7 @@ do_one_test (json_ctx_t *json_ctx, impl_t *impl, const CHAR *s, int c, > > static void > do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, > - int seek_char) > + int seek_char, int invert_pos) > { > size_t i; > > @@ -96,7 +96,10 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, > > if (pos < len) > { > - buf[align + pos] = seek_char; > + if (invert_pos) > + buf[align + len - pos] = seek_char; > + else > + buf[align + pos] = seek_char; > buf[align + len] = -seek_char; > } > else > @@ -109,6 +112,7 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, > json_attr_uint (json_ctx, "pos", pos); > json_attr_uint (json_ctx, "len", len); > json_attr_uint (json_ctx, "seek_char", seek_char); > + json_attr_uint (json_ctx, "invert_pos", invert_pos); > > json_array_begin (json_ctx, "timings"); > > @@ -123,6 +127,7 @@ int > test_main (void) > { > size_t i; > + int repeats; > json_ctx_t json_ctx; > test_init (); > > @@ -142,53 +147,68 @@ test_main (void) > > json_array_begin (&json_ctx, "results"); > > - for (i = 1; i < 8; ++i) > + for (repeats = 0; repeats < 2; ++repeats) > { > - do_test (&json_ctx, 0, 16 << i, 2048, 23); > - do_test (&json_ctx, i, 64, 256, 23); > - do_test (&json_ctx, 0, 16 << i, 2048, 0); > - do_test (&json_ctx, i, 64, 256, 0); > - > - do_test (&json_ctx, getpagesize () - 15, 64, 256, 0); > + for (i = 1; i < 8; ++i) > + { > + do_test (&json_ctx, 0, 16 << i, 2048, 23, repeats); > + do_test (&json_ctx, i, 64, 256, 23, repeats); > + do_test (&json_ctx, 0, 16 << i, 2048, 0, repeats); > + do_test (&json_ctx, i, 64, 256, 0, repeats); > + > + do_test (&json_ctx, getpagesize () - 15, 64, 256, 0, repeats); > #ifdef USE_AS_MEMRCHR > - /* Also test the position close to the beginning for memrchr. */ > - do_test (&json_ctx, 0, i, 256, 23); > - do_test (&json_ctx, 0, i, 256, 0); > - do_test (&json_ctx, i, i, 256, 23); > - do_test (&json_ctx, i, i, 256, 0); > + /* Also test the position close to the beginning for memrchr. 
*/ > + do_test (&json_ctx, 0, i, 256, 23, repeats); > + do_test (&json_ctx, 0, i, 256, 0, repeats); > + do_test (&json_ctx, i, i, 256, 23, repeats); > + do_test (&json_ctx, i, i, 256, 0, repeats); > #endif > - } > - for (i = 1; i < 8; ++i) > - { > - do_test (&json_ctx, i, i << 5, 192, 23); > - do_test (&json_ctx, i, i << 5, 192, 0); > - do_test (&json_ctx, i, i << 5, 256, 23); > - do_test (&json_ctx, i, i << 5, 256, 0); > - do_test (&json_ctx, i, i << 5, 512, 23); > - do_test (&json_ctx, i, i << 5, 512, 0); > - > - do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23); > - } > - for (i = 1; i < 32; ++i) > - { > - do_test (&json_ctx, 0, i, i + 1, 23); > - do_test (&json_ctx, 0, i, i + 1, 0); > - do_test (&json_ctx, i, i, i + 1, 23); > - do_test (&json_ctx, i, i, i + 1, 0); > - do_test (&json_ctx, 0, i, i - 1, 23); > - do_test (&json_ctx, 0, i, i - 1, 0); > - do_test (&json_ctx, i, i, i - 1, 23); > - do_test (&json_ctx, i, i, i - 1, 0); > - > - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23); > - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0); > - > - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23); > - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0); > + } > + for (i = 1; i < 8; ++i) > + { > + do_test (&json_ctx, i, i << 5, 192, 23, repeats); > + do_test (&json_ctx, i, i << 5, 192, 0, repeats); > + do_test (&json_ctx, i, i << 5, 256, 23, repeats); > + do_test (&json_ctx, i, i << 5, 256, 0, repeats); > + do_test (&json_ctx, i, i << 5, 512, 23, repeats); > + do_test (&json_ctx, i, i << 5, 512, 0, repeats); > + > + do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23, repeats); > + } > + for (i = 1; i < 32; ++i) > + { > + do_test (&json_ctx, 0, i, i + 1, 23, repeats); > + do_test (&json_ctx, 0, i, i + 1, 0, repeats); > + do_test (&json_ctx, i, i, i + 1, 23, repeats); > + do_test (&json_ctx, i, i, i + 1, 0, repeats); > + do_test (&json_ctx, 0, i, i - 1, 23, repeats); > + do_test (&json_ctx, 0, i, i - 1, 0, repeats); > + do_test (&json_ctx, i, i, i - 1, 23, repeats); > + do_test (&json_ctx, i, i, i - 1, 0, repeats); > + > + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 23, repeats); > + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 0, repeats); > + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 23, repeats); > + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 0, repeats); > + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 23, repeats); > + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 0, repeats); > + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 23, repeats); > + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 0, repeats); > + > + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23, repeats); > + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0, repeats); > + > + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23, repeats); > + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0, repeats); > + > #ifdef USE_AS_MEMRCHR > - /* Also test the position close to the beginning for memrchr. */ > - do_test (&json_ctx, 0, 1, i + 1, 23); > - do_test (&json_ctx, 0, 2, i + 1, 0); > + do_test (&json_ctx, 0, 1, i + 1, 23, repeats); > + do_test (&json_ctx, 0, 2, i + 1, 0, repeats); > +#endif > + } > +#ifndef USE_AS_MEMRCHR > + break; > #endif > } > > -- > 2.34.1 > -- H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 3/8] Benchtests: Improve memrchr benchmarks 2022-06-07 2:44 ` H.J. Lu @ 2022-06-07 4:10 ` Noah Goldstein 0 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:10 UTC (permalink / raw) To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 7:44 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Mon, Jun 6, 2022 at 3:37 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > Add a second iteration for memrchr to set `pos` starting from the end > > of the buffer. > > > > Previously `pos` was only set relative to the begining of the > > beginning > > buffer. This isn't really useful for memchr because the beginning > memrchr Fixed in V5. > > of the search space is (buf + len). > > --- > > benchtests/bench-memchr.c | 110 ++++++++++++++++++++++---------------- > > 1 file changed, 65 insertions(+), 45 deletions(-) > > > > diff --git a/benchtests/bench-memchr.c b/benchtests/bench-memchr.c > > index 4d7212332f..0facda2fa0 100644 > > --- a/benchtests/bench-memchr.c > > +++ b/benchtests/bench-memchr.c > > @@ -76,7 +76,7 @@ do_one_test (json_ctx_t *json_ctx, impl_t *impl, const CHAR *s, int c, > > > > static void > > do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, > > - int seek_char) > > + int seek_char, int invert_pos) > > { > > size_t i; > > > > @@ -96,7 +96,10 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, > > > > if (pos < len) > > { > > - buf[align + pos] = seek_char; > > + if (invert_pos) > > + buf[align + len - pos] = seek_char; > > + else > > + buf[align + pos] = seek_char; > > buf[align + len] = -seek_char; > > } > > else > > @@ -109,6 +112,7 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, > > json_attr_uint (json_ctx, "pos", pos); > > json_attr_uint (json_ctx, "len", len); > > json_attr_uint (json_ctx, "seek_char", seek_char); > > + json_attr_uint (json_ctx, "invert_pos", invert_pos); > > > > json_array_begin (json_ctx, "timings"); > > > > @@ -123,6 +127,7 @@ int > > test_main (void) > > { > > size_t i; > > + int repeats; > > json_ctx_t json_ctx; > > test_init (); > > > > @@ -142,53 +147,68 @@ test_main (void) > > > > json_array_begin (&json_ctx, "results"); > > > > - for (i = 1; i < 8; ++i) > > + for (repeats = 0; repeats < 2; ++repeats) > > { > > - do_test (&json_ctx, 0, 16 << i, 2048, 23); > > - do_test (&json_ctx, i, 64, 256, 23); > > - do_test (&json_ctx, 0, 16 << i, 2048, 0); > > - do_test (&json_ctx, i, 64, 256, 0); > > - > > - do_test (&json_ctx, getpagesize () - 15, 64, 256, 0); > > + for (i = 1; i < 8; ++i) > > + { > > + do_test (&json_ctx, 0, 16 << i, 2048, 23, repeats); > > + do_test (&json_ctx, i, 64, 256, 23, repeats); > > + do_test (&json_ctx, 0, 16 << i, 2048, 0, repeats); > > + do_test (&json_ctx, i, 64, 256, 0, repeats); > > + > > + do_test (&json_ctx, getpagesize () - 15, 64, 256, 0, repeats); > > #ifdef USE_AS_MEMRCHR > > - /* Also test the position close to the beginning for memrchr. */ > > - do_test (&json_ctx, 0, i, 256, 23); > > - do_test (&json_ctx, 0, i, 256, 0); > > - do_test (&json_ctx, i, i, 256, 23); > > - do_test (&json_ctx, i, i, 256, 0); > > + /* Also test the position close to the beginning for memrchr. 
*/ > > + do_test (&json_ctx, 0, i, 256, 23, repeats); > > + do_test (&json_ctx, 0, i, 256, 0, repeats); > > + do_test (&json_ctx, i, i, 256, 23, repeats); > > + do_test (&json_ctx, i, i, 256, 0, repeats); > > #endif > > - } > > - for (i = 1; i < 8; ++i) > > - { > > - do_test (&json_ctx, i, i << 5, 192, 23); > > - do_test (&json_ctx, i, i << 5, 192, 0); > > - do_test (&json_ctx, i, i << 5, 256, 23); > > - do_test (&json_ctx, i, i << 5, 256, 0); > > - do_test (&json_ctx, i, i << 5, 512, 23); > > - do_test (&json_ctx, i, i << 5, 512, 0); > > - > > - do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23); > > - } > > - for (i = 1; i < 32; ++i) > > - { > > - do_test (&json_ctx, 0, i, i + 1, 23); > > - do_test (&json_ctx, 0, i, i + 1, 0); > > - do_test (&json_ctx, i, i, i + 1, 23); > > - do_test (&json_ctx, i, i, i + 1, 0); > > - do_test (&json_ctx, 0, i, i - 1, 23); > > - do_test (&json_ctx, 0, i, i - 1, 0); > > - do_test (&json_ctx, i, i, i - 1, 23); > > - do_test (&json_ctx, i, i, i - 1, 0); > > - > > - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23); > > - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0); > > - > > - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23); > > - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0); > > + } > > + for (i = 1; i < 8; ++i) > > + { > > + do_test (&json_ctx, i, i << 5, 192, 23, repeats); > > + do_test (&json_ctx, i, i << 5, 192, 0, repeats); > > + do_test (&json_ctx, i, i << 5, 256, 23, repeats); > > + do_test (&json_ctx, i, i << 5, 256, 0, repeats); > > + do_test (&json_ctx, i, i << 5, 512, 23, repeats); > > + do_test (&json_ctx, i, i << 5, 512, 0, repeats); > > + > > + do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23, repeats); > > + } > > + for (i = 1; i < 32; ++i) > > + { > > + do_test (&json_ctx, 0, i, i + 1, 23, repeats); > > + do_test (&json_ctx, 0, i, i + 1, 0, repeats); > > + do_test (&json_ctx, i, i, i + 1, 23, repeats); > > + do_test (&json_ctx, i, i, i + 1, 0, repeats); > > + do_test (&json_ctx, 0, i, i - 1, 23, repeats); > > + do_test (&json_ctx, 0, i, i - 1, 0, repeats); > > + do_test (&json_ctx, i, i, i - 1, 23, repeats); > > + do_test (&json_ctx, i, i, i - 1, 0, repeats); > > + > > + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 23, repeats); > > + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 0, repeats); > > + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 23, repeats); > > + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 0, repeats); > > + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 23, repeats); > > + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 0, repeats); > > + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 23, repeats); > > + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 0, repeats); > > + > > + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23, repeats); > > + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0, repeats); > > + > > + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23, repeats); > > + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0, repeats); > > + > > #ifdef USE_AS_MEMRCHR > > - /* Also test the position close to the beginning for memrchr. */ > > - do_test (&json_ctx, 0, 1, i + 1, 23); > > - do_test (&json_ctx, 0, 2, i + 1, 0); > > + do_test (&json_ctx, 0, 1, i + 1, 23, repeats); > > + do_test (&json_ctx, 0, 2, i + 1, 0, repeats); > > +#endif > > + } > > +#ifndef USE_AS_MEMRCHR > > + break; > > #endif > > } > > > > -- > > 2.34.1 > > > > > -- > H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH v4 4/8] x86: Optimize memrchr-sse2.S 2022-06-06 22:37 ` [PATCH v4 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-06 22:37 ` [PATCH v4 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein 2022-06-06 22:37 ` [PATCH v4 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein @ 2022-06-06 22:37 ` Noah Goldstein 2022-06-06 22:37 ` [PATCH v4 5/8] x86: Optimize memrchr-evex.S Noah Goldstein ` (3 subsequent siblings) 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-06 22:37 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller lengths more. 2. optimizes target placement more carefully. 3. reuses logic more. 4. fixes up various inefficiencies in the logic. The total code size saving is: 394 bytes Geometric Mean of all benchmarks New / Old: 0.874 Regressions: 1. The page cross case is now colder, especially re-entry from the page cross case if a match is not found in the first VEC (roughly 50%). My general opinion with this patch is this is acceptable given the "coldness" of this case (less than 4%) and generally performance improvement in the other far more common cases. 2. There are some regressions 5-15% for medium/large user-arg lengths that have a match in the first VEC. This is because the logic was rewritten to optimize finds in the first VEC if the user-arg length is shorter (where we see roughly 20-50% performance improvements). It is not always the case this is a regression. My intuition is some frontend quirk is partially explaining the data although I haven't been able to find the root cause. Full xcheck passes on x86_64. --- sysdeps/x86_64/memrchr.S | 613 +++++++++++++++++++-------------------- 1 file changed, 292 insertions(+), 321 deletions(-) diff --git a/sysdeps/x86_64/memrchr.S b/sysdeps/x86_64/memrchr.S index d1a9f47911..b0dffd2ae2 100644 --- a/sysdeps/x86_64/memrchr.S +++ b/sysdeps/x86_64/memrchr.S @@ -18,362 +18,333 @@ <https://www.gnu.org/licenses/>. */ #include <sysdep.h> +#define VEC_SIZE 16 +#define PAGE_SIZE 4096 .text -ENTRY (__memrchr) - movd %esi, %xmm1 - - sub $16, %RDX_LP - jbe L(length_less16) - - punpcklbw %xmm1, %xmm1 - punpcklbw %xmm1, %xmm1 - - add %RDX_LP, %RDI_LP - pshufd $0, %xmm1, %xmm1 - - movdqu (%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - -/* Check if there is a match. 
*/ - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches0) - - sub $64, %rdi - mov %edi, %ecx - and $15, %ecx - jz L(loop_prolog) - - add $16, %rdi - add $16, %rdx - and $-16, %rdi - sub %rcx, %rdx - - .p2align 4 -L(loop_prolog): - sub $64, %rdx - jbe L(exit_loop) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16) - - movdqa (%rdi), %xmm4 - pcmpeqb %xmm1, %xmm4 - pmovmskb %xmm4, %eax - test %eax, %eax - jnz L(matches0) - - sub $64, %rdi - sub $64, %rdx - jbe L(exit_loop) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16) - - movdqa (%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches0) - - mov %edi, %ecx - and $63, %ecx - jz L(align64_loop) - - add $64, %rdi - add $64, %rdx - and $-64, %rdi - sub %rcx, %rdx - - .p2align 4 -L(align64_loop): - sub $64, %rdi - sub $64, %rdx - jbe L(exit_loop) - - movdqa (%rdi), %xmm0 - movdqa 16(%rdi), %xmm2 - movdqa 32(%rdi), %xmm3 - movdqa 48(%rdi), %xmm4 - - pcmpeqb %xmm1, %xmm0 - pcmpeqb %xmm1, %xmm2 - pcmpeqb %xmm1, %xmm3 - pcmpeqb %xmm1, %xmm4 - - pmaxub %xmm3, %xmm0 - pmaxub %xmm4, %xmm2 - pmaxub %xmm0, %xmm2 - pmovmskb %xmm2, %eax - - test %eax, %eax - jz L(align64_loop) - - pmovmskb %xmm4, %eax - test %eax, %eax - jnz L(matches48) - - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm2 - - pcmpeqb %xmm1, %xmm2 - pcmpeqb (%rdi), %xmm1 - - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches16) - - pmovmskb %xmm1, %eax - bsr %eax, %eax - - add %rdi, %rax +ENTRY_P2ALIGN(__memrchr, 6) +#ifdef __ILP32__ + /* Clear upper bits. */ + mov %RDX_LP, %RDX_LP +#endif + movd %esi, %xmm0 + + /* Get end pointer. */ + leaq (%rdx, %rdi), %rcx + + punpcklbw %xmm0, %xmm0 + punpcklwd %xmm0, %xmm0 + pshufd $0, %xmm0, %xmm0 + + /* Check if we can load 1x VEC without cross a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %ecx + jz L(page_cross) + + /* NB: This load happens regardless of whether rdx (len) is zero. Since + it doesn't cross a page and the standard gurantees any pointer have + at least one-valid byte this load must be safe. For the entire + history of the x86 memrchr implementation this has been possible so + no code "should" be relying on a zero-length check before this load. + The zero-length check is moved to the page cross case because it is + 1) pretty cold and including it pushes the hot case len <= VEC_SIZE + into 2-cache lines. */ + movups -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subq $VEC_SIZE, %rdx + ja L(more_1x_vec) +L(ret_vec_x0_test): + /* Zero-flag set if eax (src) is zero. Destination unchanged if src is + zero. */ + bsrl %eax, %eax + jz L(ret_0) + /* Check if the CHAR match is in bounds. Need to truly zero `eax` here + if out of bounds. */ + addl %edx, %eax + jl L(zero_0) + /* Since we subtracted VEC_SIZE from rdx earlier we can just add to base + ptr. 
*/ + addq %rdi, %rax +L(ret_0): ret - .p2align 4 -L(exit_loop): - add $64, %edx - cmp $32, %edx - jbe L(exit_loop_32) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16_1) - cmp $48, %edx - jbe L(return_null) - - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax - test %eax, %eax - jnz L(matches0_1) - xor %eax, %eax + .p2align 4,, 5 +L(ret_vec_x0): + bsrl %eax, %eax + leaq -(VEC_SIZE)(%rcx, %rax), %rax ret - .p2align 4 -L(exit_loop_32): - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48_1) - cmp $16, %edx - jbe L(return_null) - - pcmpeqb 32(%rdi), %xmm1 - pmovmskb %xmm1, %eax - test %eax, %eax - jnz L(matches32_1) - xor %eax, %eax + .p2align 4,, 2 +L(zero_0): + xorl %eax, %eax ret - .p2align 4 -L(matches0): - bsr %eax, %eax - add %rdi, %rax - ret - - .p2align 4 -L(matches16): - bsr %eax, %eax - lea 16(%rax, %rdi), %rax - ret - .p2align 4 -L(matches32): - bsr %eax, %eax - lea 32(%rax, %rdi), %rax + .p2align 4,, 8 +L(more_1x_vec): + testl %eax, %eax + jnz L(ret_vec_x0) + + /* Align rcx (pointer to string). */ + decq %rcx + andq $-VEC_SIZE, %rcx + + movq %rcx, %rdx + /* NB: We could consistenyl save 1-byte in this pattern with `movaps + %xmm0, %xmm1; pcmpeq IMM8(r), %xmm1; ...`. The reason against it is + it adds more frontend uops (even if the moves can be eliminated) and + some percentage of the time actual backend uops. */ + movaps -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + subq %rdi, %rdx + pmovmskb %xmm1, %eax + + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) +L(last_2x_vec): + subl $VEC_SIZE, %edx + jbe L(ret_vec_x0_test) + + testl %eax, %eax + jnz L(ret_vec_x0) + + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subl $VEC_SIZE, %edx + bsrl %eax, %eax + jz L(ret_1) + addl %edx, %eax + jl L(zero_0) + addq %rdi, %rax +L(ret_1): ret - .p2align 4 -L(matches48): - bsr %eax, %eax - lea 48(%rax, %rdi), %rax + /* Don't align. Otherwise lose 2-byte encoding in jump to L(page_cross) + causes the hot pause (length <= VEC_SIZE) to span multiple cache + lines. Naturally aligned % 16 to 8-bytes. */ +L(page_cross): + /* Zero length check. */ + testq %rdx, %rdx + jz L(zero_0) + + leaq -1(%rcx), %r8 + andq $-(VEC_SIZE), %r8 + + movaps (%r8), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %esi + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + negl %ecx + /* 32-bit shift but VEC_SIZE=16 so need to mask the shift count + explicitly. */ + andl $(VEC_SIZE - 1), %ecx + shl %cl, %esi + movzwl %si, %eax + leaq (%rdi, %rdx), %rcx + cmpq %rdi, %r8 + ja L(more_1x_vec) + subl $VEC_SIZE, %edx + bsrl %eax, %eax + jz L(ret_2) + addl %edx, %eax + jl L(zero_1) + addq %rdi, %rax +L(ret_2): ret - .p2align 4 -L(matches0_1): - bsr %eax, %eax - sub $64, %rdx - add %rax, %rdx - jl L(return_null) - add %rdi, %rax + /* Fits in aliging bytes. 
*/ +L(zero_1): + xorl %eax, %eax ret - .p2align 4 -L(matches16_1): - bsr %eax, %eax - sub $48, %rdx - add %rax, %rdx - jl L(return_null) - lea 16(%rdi, %rax), %rax + .p2align 4,, 5 +L(ret_vec_x1): + bsrl %eax, %eax + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax ret - .p2align 4 -L(matches32_1): - bsr %eax, %eax - sub $32, %rdx - add %rax, %rdx - jl L(return_null) - lea 32(%rdi, %rax), %rax - ret + .p2align 4,, 8 +L(more_2x_vec): + testl %eax, %eax + jnz L(ret_vec_x0) - .p2align 4 -L(matches48_1): - bsr %eax, %eax - sub $16, %rdx - add %rax, %rdx - jl L(return_null) - lea 48(%rdi, %rax), %rax - ret + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + testl %eax, %eax + jnz L(ret_vec_x1) - .p2align 4 -L(return_null): - xor %eax, %eax - ret - .p2align 4 -L(length_less16_offset0): - test %edx, %edx - jz L(return_null) + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax - mov %dl, %cl - pcmpeqb (%rdi), %xmm1 + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) - mov $1, %edx - sal %cl, %edx - sub $1, %edx + addl $(VEC_SIZE), %edx + jle L(ret_vec_x2_test) - pmovmskb %xmm1, %eax +L(last_vec): + testl %eax, %eax + jnz L(ret_vec_x2) - and %edx, %eax - test %eax, %eax - jz L(return_null) + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax - bsr %eax, %eax - add %rdi, %rax + subl $(VEC_SIZE), %edx + bsrl %eax, %eax + jz L(ret_3) + addl %edx, %eax + jl L(zero_2) + addq %rdi, %rax +L(ret_3): ret - .p2align 4 -L(length_less16): - punpcklbw %xmm1, %xmm1 - punpcklbw %xmm1, %xmm1 - - add $16, %edx - - pshufd $0, %xmm1, %xmm1 - - mov %edi, %ecx - and $15, %ecx - jz L(length_less16_offset0) - - mov %cl, %dh - mov %ecx, %esi - add %dl, %dh - and $-16, %rdi - - sub $16, %dh - ja L(length_less16_part2) - - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax - - sar %cl, %eax - mov %dl, %cl - - mov $1, %edx - sal %cl, %edx - sub $1, %edx - - and %edx, %eax - test %eax, %eax - jz L(return_null) - - bsr %eax, %eax - add %rdi, %rax - add %rsi, %rax + .p2align 4,, 6 +L(ret_vec_x2_test): + bsrl %eax, %eax + jz L(zero_2) + addl %edx, %eax + jl L(zero_2) + addq %rdi, %rax ret - .p2align 4 -L(length_less16_part2): - movdqa 16(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - - mov %dh, %cl - mov $1, %edx - sal %cl, %edx - sub $1, %edx - - and %edx, %eax +L(zero_2): + xorl %eax, %eax + ret - test %eax, %eax - jnz L(length_less16_part2_return) - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax + .p2align 4,, 5 +L(ret_vec_x2): + bsrl %eax, %eax + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax + ret - mov %esi, %ecx - sar %cl, %eax - test %eax, %eax - jz L(return_null) + .p2align 4,, 5 +L(ret_vec_x3): + bsrl %eax, %eax + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax + ret - bsr %eax, %eax - add %rdi, %rax - add %rsi, %rax + .p2align 4,, 8 +L(more_4x_vec): + testl %eax, %eax + jnz L(ret_vec_x2) + + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + testl %eax, %eax + jnz L(ret_vec_x3) + + addq $-(VEC_SIZE * 4), %rcx + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) + + /* Offset everything by 4x VEC_SIZE here to save a few bytes at the end + keeping the code from spilling to the next cache line. 
*/ + addq $(VEC_SIZE * 4 - 1), %rcx + andq $-(VEC_SIZE * 4), %rcx + leaq (VEC_SIZE * 4)(%rdi), %rdx + andq $-(VEC_SIZE * 4), %rdx + + .p2align 4,, 11 +L(loop_4x_vec): + movaps (VEC_SIZE * -1)(%rcx), %xmm1 + movaps (VEC_SIZE * -2)(%rcx), %xmm2 + movaps (VEC_SIZE * -3)(%rcx), %xmm3 + movaps (VEC_SIZE * -4)(%rcx), %xmm4 + pcmpeqb %xmm0, %xmm1 + pcmpeqb %xmm0, %xmm2 + pcmpeqb %xmm0, %xmm3 + pcmpeqb %xmm0, %xmm4 + + por %xmm1, %xmm2 + por %xmm3, %xmm4 + por %xmm2, %xmm4 + + pmovmskb %xmm4, %esi + testl %esi, %esi + jnz L(loop_end) + + addq $-(VEC_SIZE * 4), %rcx + cmpq %rdx, %rcx + jne L(loop_4x_vec) + + subl %edi, %edx + + /* Ends up being 1-byte nop. */ + .p2align 4,, 2 +L(last_4x_vec): + movaps -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) + + testl %eax, %eax + jnz L(ret_vec_x0) + + + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + testl %eax, %eax + jnz L(ret_vec_end) + + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subl $(VEC_SIZE * 3), %edx + ja L(last_vec) + bsrl %eax, %eax + jz L(ret_4) + addl %edx, %eax + jl L(zero_3) + addq %rdi, %rax +L(ret_4): ret - .p2align 4 -L(length_less16_part2_return): - bsr %eax, %eax - lea 16(%rax, %rdi), %rax + /* Ends up being 1-byte nop. */ + .p2align 4,, 3 +L(loop_end): + pmovmskb %xmm1, %eax + sall $16, %eax + jnz L(ret_vec_end) + + pmovmskb %xmm2, %eax + testl %eax, %eax + jnz L(ret_vec_end) + + pmovmskb %xmm3, %eax + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If ecx is non-zero + then CHAR in VEC3 and bsrq will use that position. */ + sall $16, %eax + orl %esi, %eax + bsrl %eax, %eax + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax ret -END (__memrchr) +L(ret_vec_end): + bsrl %eax, %eax + leaq (VEC_SIZE * -2)(%rax, %rcx), %rax + ret + /* Use in L(last_4x_vec). In the same cache line. This is just a spare + aligning bytes. */ +L(zero_3): + xorl %eax, %eax + ret + /* 2-bytes from next cache line. */ +END(__memrchr) weak_alias (__memrchr, memrchr) -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
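The rewrite above is built around one SSE2 idiom that is easier to see in intrinsics form. The following helper is illustrative only (it is not from the patch): it shows the compare/mask/highest-bit sequence behind L(ret_vec_x0) and friends.

```c
#include <emmintrin.h>
#include <stddef.h>

/* Illustrative helper: compare 16 bytes at once, compress the result
   to a 16-bit mask with pmovmskb, then take the HIGHEST set bit -- the
   asm's bsr -- to locate the last matching byte in the chunk.  */
static const char *
last_match_in_16 (const char *p, char c)
{
  __m128i chunk = _mm_loadu_si128 ((const __m128i *) p);
  __m128i eq = _mm_cmpeq_epi8 (chunk, _mm_set1_epi8 (c));
  unsigned int mask = _mm_movemask_epi8 (eq);	/* bit i set iff p[i] == c  */

  if (mask == 0)
    return NULL;
  /* 31 - clz (mask) is the index of the highest set bit, i.e. bsr.  */
  return p + (31 - __builtin_clz (mask));
}
```

Because the new code keeps an end pointer in %rcx and addresses every vector as -(VEC_SIZE * n)(%rcx), a hit in any vector resolves with a single bsr + lea instead of the old per-block matchesNN/matchesNN_1 tails, which is where much of the 394-byte size saving shows up.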
* [PATCH v4 5/8] x86: Optimize memrchr-evex.S 2022-06-06 22:37 ` [PATCH v4 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (2 preceding siblings ...) 2022-06-06 22:37 ` [PATCH v4 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein @ 2022-06-06 22:37 ` Noah Goldstein 2022-06-07 2:41 ` H.J. Lu 2022-06-06 22:37 ` [PATCH v4 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein ` (2 subsequent siblings) 6 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-06 22:37 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully 3. reuses logic more 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns which saves either a branch or multiple instructions. The total code size saving is: 263 bytes Geometric Mean of all benchmarks New / Old: 0.755 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the begining of the string (in first VEC). This case has roughly a 20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). This case has roughly a 35% speedup. Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memrchr-evex.S | 539 ++++++++++++------------ 1 file changed, 268 insertions(+), 271 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S index 0b99709c6b..ad541c0e50 100644 --- a/sysdeps/x86_64/multiarch/memrchr-evex.S +++ b/sysdeps/x86_64/multiarch/memrchr-evex.S @@ -19,319 +19,316 @@ #if IS_IN (libc) # include <sysdep.h> +# include "evex256-vecs.h" +# if VEC_SIZE != 32 +# error "VEC_SIZE != 32 unimplemented" +# endif + +# ifndef MEMRCHR +# define MEMRCHR __memrchr_evex +# endif + +# define PAGE_SIZE 4096 +# define VECMATCH VEC(0) + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN(MEMRCHR, 6) +# ifdef __ILP32__ + /* Clear upper bits. */ + and %RDX_LP, %RDX_LP +# else + test %RDX_LP, %RDX_LP +# endif + jz L(zero_0) + + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a + correct page cross check and 2) it correctly sets up end ptr to be + subtract by lzcnt aligned. */ + leaq -1(%rdi, %rdx), %rax + vpbroadcastb %esi, %VECMATCH + + /* Check if we can load 1x VEC without cross a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %eax + jz L(page_cross) + + /* Don't use rax for pointer here because EVEX has better encoding with + offset % VEC_SIZE == 0. */ + vpcmpb $0, -(VEC_SIZE)(%rdi, %rdx), %VECMATCH, %k0 + kmovd %k0, %ecx + + /* Fall through for rdx (len) <= VEC_SIZE (expect small sizes). */ + cmpq $VEC_SIZE, %rdx + ja L(more_1x_vec) +L(ret_vec_x0_test): + + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which + will gurantee edx (len) is less than it. */ + lzcntl %ecx, %ecx + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret -# define VMOVA vmovdqa64 - -# define YMMMATCH ymm16 - -# define VEC_SIZE 32 - - .section .text.evex,"ax",@progbits -ENTRY (__memrchr_evex) - /* Broadcast CHAR to YMMMATCH. */ - vpbroadcastb %esi, %YMMMATCH - - sub $VEC_SIZE, %RDX_LP - jbe L(last_vec_or_less) - - add %RDX_LP, %RDI_LP - - /* Check the last VEC_SIZE bytes. 
*/ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - subq $(VEC_SIZE * 4), %rdi - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(aligned_more) - - /* Align data for aligned loads in the loop. */ - addq $VEC_SIZE, %rdi - addq $VEC_SIZE, %rdx - andq $-VEC_SIZE, %rdi - subq %rcx, %rdx - - .p2align 4 -L(aligned_more): - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpcmpb $0, (%rdi), %YMMMATCH, %k4 - kmovd %k4, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - /* Align data to 4 * VEC_SIZE for loop with fewer branches. - There are some overlaps with above if data isn't aligned - to 4 * VEC_SIZE. */ - movl %edi, %ecx - andl $(VEC_SIZE * 4 - 1), %ecx - jz L(loop_4x_vec) - - addq $(VEC_SIZE * 4), %rdi - addq $(VEC_SIZE * 4), %rdx - andq $-(VEC_SIZE * 4), %rdi - subq %rcx, %rdx + /* Fits in aligning bytes of first cache line. */ +L(zero_0): + xorl %eax, %eax + ret - .p2align 4 -L(loop_4x_vec): - /* Compare 4 * VEC at a time forward. */ - subq $(VEC_SIZE * 4), %rdi - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k2 - kord %k1, %k2, %k5 - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k3 - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k4 - - kord %k3, %k4, %k6 - kortestd %k5, %k6 - jz L(loop_4x_vec) - - /* There is a match. */ - kmovd %k4, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - kmovd %k1, %eax - bsrl %eax, %eax - addq %rdi, %rax + .p2align 4,, 9 +L(ret_vec_x0_dec): + decq %rax +L(ret_vec_x0): + lzcntl %ecx, %ecx + subq %rcx, %rax ret - .p2align 4 -L(last_4x_vec_or_less): - addl $(VEC_SIZE * 4), %edx - cmpl $(VEC_SIZE * 2), %edx - jbe L(last_2x_vec) + .p2align 4,, 10 +L(more_1x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3) + /* Align rax (pointer to string). */ + andq $-VEC_SIZE, %rax - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x2) + /* Recompute length after aligning. */ + movq %rax, %rdx - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x1_check) - cmpl $(VEC_SIZE * 3), %edx - jbe L(zero) + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - vpcmpb $0, (%rdi), %YMMMATCH, %k4 - kmovd %k4, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 4), %rdx - addq %rax, %rdx - jl L(zero) - addq %rdi, %rax - ret + subq %rdi, %rdx - .p2align 4 + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) L(last_2x_vec): - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3_check) + + /* Must dec rax because L(ret_vec_x0_test) expects it. 
*/ + decq %rax cmpl $VEC_SIZE, %edx - jbe L(zero) - - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 2), %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + jbe L(ret_vec_x0_test) + + testl %ecx, %ecx + jnz L(ret_vec_x0) + + /* Don't use rax for pointer here because EVEX has better encoding with + offset % VEC_SIZE == 0. */ + vpcmpb $0, -(VEC_SIZE * 2)(%rdi, %rdx), %VECMATCH, %k0 + kmovd %k0, %ecx + /* NB: 64-bit lzcnt. This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax ret - .p2align 4 -L(last_vec_x0): - bsrl %eax, %eax - addq %rdi, %rax + /* Inexpensive place to put this regarding code size / target alignments + / ICache NLP. Necessary for 2-byte encoding of jump to page cross + case which inturn in necessray for hot path (len <= VEC_SIZE) to fit + in first cache line. */ +L(page_cross): + movq %rax, %rsi + andq $-VEC_SIZE, %rsi + vpcmpb $0, (%rsi), %VECMATCH, %k0 + kmovd %k0, %r8d + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + movl %eax, %ecx + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ + notl %ecx + shlxl %ecx, %r8d, %ecx + cmpq %rdi, %rsi + ja L(more_1x_vec) + lzcntl %ecx, %ecx + cmpl %ecx, %edx + jle L(zero_1) + subq %rcx, %rax ret - .p2align 4 -L(last_vec_x1): - bsrl %eax, %eax - addl $VEC_SIZE, %eax - addq %rdi, %rax + /* Continue creating zero labels that fit in aligning bytes and get + 2-byte encoding / are in the same cache line as condition. */ +L(zero_1): + xorl %eax, %eax ret - .p2align 4 -L(last_vec_x2): - bsrl %eax, %eax - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + .p2align 4,, 8 +L(ret_vec_x1): + /* This will naturally add 32 to position. */ + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax ret - .p2align 4 -L(last_vec_x3): - bsrl %eax, %eax - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret + .p2align 4,, 8 +L(more_2x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0_dec) - .p2align 4 -L(last_vec_x1_check): - bsrl %eax, %eax - subq $(VEC_SIZE * 3), %rdx - addq %rax, %rdx - jl L(zero) - addl $VEC_SIZE, %eax - addq %rdi, %rax - ret + vpcmpb $0, -(VEC_SIZE * 2)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1) - .p2align 4 -L(last_vec_x3_check): - bsrl %eax, %eax - subq $VEC_SIZE, %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE * 3)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - .p2align 4 -L(zero): - xorl %eax, %eax + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) + + cmpl $(VEC_SIZE * -1), %edx + jle L(ret_vec_x2_test) +L(last_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) + + + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 3 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_1) ret - .p2align 4 -L(last_vec_or_less_aligned): - movl %edx, %ecx - - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - - movl $1, %edx - /* Support rdx << 32. */ - salq %cl, %rdx - subq $1, %rdx - - kmovd %k1, %eax - - /* Remove the trailing bytes. 
*/ - andl %edx, %eax - testl %eax, %eax - jz L(zero) - - bsrl %eax, %eax - addq %rdi, %rax + .p2align 4,, 8 +L(ret_vec_x2_test): + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_1) ret - .p2align 4 -L(last_vec_or_less): - addl $VEC_SIZE, %edx - - /* Check for zero length. */ - testl %edx, %edx - jz L(zero) - - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(last_vec_or_less_aligned) - - movl %ecx, %esi - movl %ecx, %r8d - addl %edx, %esi - andq $-VEC_SIZE, %rdi + .p2align 4,, 8 +L(ret_vec_x2): + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax + ret - subl $VEC_SIZE, %esi - ja L(last_vec_2x_aligned) + .p2align 4,, 8 +L(ret_vec_x3): + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax + ret - /* Check the last VEC. */ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax + .p2align 4,, 8 +L(more_4x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) - /* Remove the leading and trailing bytes. */ - sarl %cl, %eax - movl %edx, %ecx + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + testl %ecx, %ecx + jnz L(ret_vec_x3) - andl %edx, %eax - testl %eax, %eax - jz L(zero) + /* Check if near end before re-aligning (otherwise might do an + unnecissary loop iteration). */ + addq $-(VEC_SIZE * 4), %rax + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax - ret + decq %rax + andq $-(VEC_SIZE * 4), %rax + movq %rdi, %rdx + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because + lengths that overflow can be valid and break the comparison. */ + andq $-(VEC_SIZE * 4), %rdx .p2align 4 -L(last_vec_2x_aligned): - movl %esi, %ecx - - /* Check the last VEC. */ - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k1 +L(loop_4x_vec): + /* Store 1 were not-equals and 0 where equals in k1 (used to mask later + on). */ + vpcmpb $4, (VEC_SIZE * 3)(%rax), %VECMATCH, %k1 + + /* VEC(2/3) will have zero-byte where we found a CHAR. */ + vpxorq (VEC_SIZE * 2)(%rax), %VECMATCH, %VEC(2) + vpxorq (VEC_SIZE * 1)(%rax), %VECMATCH, %VEC(3) + vpcmpb $0, (VEC_SIZE * 0)(%rax), %VECMATCH, %k4 + + /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit where + CHAR is found and VEC(2/3) have zero-byte where CHAR is found. */ + vpminub %VEC(2), %VEC(3), %VEC(3){%k1}{z} + vptestnmb %VEC(3), %VEC(3), %k2 + + /* Any 1s and we found CHAR. */ + kortestd %k2, %k4 + jnz L(loop_end) + + addq $-(VEC_SIZE * 4), %rax + cmpq %rdx, %rax + jne L(loop_4x_vec) + + /* Need to re-adjust rdx / rax for L(last_4x_vec). */ + subq $-(VEC_SIZE * 4), %rdx + movq %rdx, %rax + subl %edi, %edx +L(last_4x_vec): + + /* Used no matter what. */ + vpcmpb $0, (VEC_SIZE * -1)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) - kmovd %k1, %eax + testl %ecx, %ecx + jnz L(ret_vec_x0_dec) - /* Remove the trailing bytes. */ - andl %edx, %eax - testl %eax, %eax - jnz L(last_vec_x1) + vpcmpb $0, (VEC_SIZE * -2)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - /* Check the second last VEC. */ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 + testl %ecx, %ecx + jnz L(ret_vec_x1) - movl %r8d, %ecx + /* Used no matter what. */ + vpcmpb $0, (VEC_SIZE * -3)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - kmovd %k1, %eax + cmpl $(VEC_SIZE * 3), %edx + ja L(last_vec) - /* Remove the leading bytes. Must use unsigned right shift for - bsrl below. 
*/ - shrl %cl, %eax - testl %eax, %eax - jz L(zero) + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + jbe L(ret_1) + xorl %eax, %eax +L(ret_1): + ret - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax + .p2align 4,, 6 +L(loop_end): + kmovd %k1, %ecx + notl %ecx + testl %ecx, %ecx + jnz L(ret_vec_x0_end) + + vptestnmb %VEC(2), %VEC(2), %k0 + kmovd %k0, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) + + kmovd %k2, %ecx + kmovd %k4, %esi + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If ecx is non-zero + then CHAR in VEC3 and bsrq will use that position. */ + salq $32, %rcx + orq %rsi, %rcx + bsrq %rcx, %rcx + addq %rcx, %rax + ret + .p2align 4,, 4 +L(ret_vec_x0_end): + addq $(VEC_SIZE), %rax +L(ret_vec_x1_end): + bsrl %ecx, %ecx + leaq (VEC_SIZE * 2)(%rax, %rcx), %rax ret -END (__memrchr_evex) + +END(MEMRCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
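The lzcnt trick the commit message credits for "saving either a branch or multiple instructions" is easiest to check in scalar form. Below is a model of L(ret_vec_x0_test) with illustrative names (none of this is in the patch); the pointer/mask conventions follow the asm above.

```c
#include <stddef.h>

/* Scalar model of L(ret_vec_x0_test).  `end` is base + len - 1,
   exactly how the asm sets up %rax, and `mask` covers the 32 bytes
   ending at `end` (bit 31 corresponds to *end).  lzcnt of an all-zero
   mask is 32, so a single signed compare against the remaining length
   rejects both "no match at all" and "match that lies before the
   start of the buffer".  */
static const char *
last_match_bounded (const char *end, unsigned int mask, long int len)
{
  int lz = mask ? __builtin_clz (mask) : 32;	/* lzcnt semantics  */

  if (len <= lz)
    return NULL;	/* the asm's jle L(zero_0)  */
  return end - lz;	/* the asm's subq %rcx, %rax  */
}
```

The old code needed a bsrl, a separate test for the zero mask, and a per-exit-path length adjustment; folding all three into one lzcnt plus one compare is what the commit message means by that item.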
* Re: [PATCH v4 5/8] x86: Optimize memrchr-evex.S 2022-06-06 22:37 ` [PATCH v4 5/8] x86: Optimize memrchr-evex.S Noah Goldstein @ 2022-06-07 2:41 ` H.J. Lu 2022-06-07 4:09 ` Noah Goldstein 0 siblings, 1 reply; 82+ messages in thread From: H.J. Lu @ 2022-06-07 2:41 UTC (permalink / raw) To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 3:37 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > The new code: > 1. prioritizes smaller user-arg lengths more. > 2. optimizes target placement more carefully > 3. reuses logic more > 4. fixes up various inefficiencies in the logic. The biggest > case here is the `lzcnt` logic for checking returns which > saves either a branch or multiple instructions. > > The total code size saving is: 263 bytes > Geometric Mean of all benchmarks New / Old: 0.755 > > Regressions: > There are some regressions. Particularly where the length (user arg > length) is large but the position of the match char is near the > begining of the string (in first VEC). This case has roughly a beginning > 20% regression. > > This is because the new logic gives the hot path for immediate matches > to shorter lengths (the more common input). This case has roughly > a 35% speedup. > > Full xcheck passes on x86_64. > --- > sysdeps/x86_64/multiarch/memrchr-evex.S | 539 ++++++++++++------------ > 1 file changed, 268 insertions(+), 271 deletions(-) > > diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S > index 0b99709c6b..ad541c0e50 100644 > --- a/sysdeps/x86_64/multiarch/memrchr-evex.S > +++ b/sysdeps/x86_64/multiarch/memrchr-evex.S > @@ -19,319 +19,316 @@ > #if IS_IN (libc) > > # include <sysdep.h> > +# include "evex256-vecs.h" > +# if VEC_SIZE != 32 > +# error "VEC_SIZE != 32 unimplemented" > +# endif > + > +# ifndef MEMRCHR > +# define MEMRCHR __memrchr_evex > +# endif > + > +# define PAGE_SIZE 4096 > +# define VECMATCH VEC(0) > + > + .section SECTION(.text), "ax", @progbits > +ENTRY_P2ALIGN(MEMRCHR, 6) > +# ifdef __ILP32__ > + /* Clear upper bits. */ > + and %RDX_LP, %RDX_LP > +# else > + test %RDX_LP, %RDX_LP > +# endif > + jz L(zero_0) > + > + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a > + correct page cross check and 2) it correctly sets up end ptr to be > + subtract by lzcnt aligned. */ > + leaq -1(%rdi, %rdx), %rax > + vpbroadcastb %esi, %VECMATCH > + > + /* Check if we can load 1x VEC without cross a page. */ > + testl $(PAGE_SIZE - VEC_SIZE), %eax > + jz L(page_cross) > + > + /* Don't use rax for pointer here because EVEX has better encoding with > + offset % VEC_SIZE == 0. */ > + vpcmpb $0, -(VEC_SIZE)(%rdi, %rdx), %VECMATCH, %k0 > + kmovd %k0, %ecx > + > + /* Fall through for rdx (len) <= VEC_SIZE (expect small sizes). */ > + cmpq $VEC_SIZE, %rdx > + ja L(more_1x_vec) > +L(ret_vec_x0_test): > + > + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which > + will gurantee edx (len) is less than it. */ guarantee > + lzcntl %ecx, %ecx > + cmpl %ecx, %edx > + jle L(zero_0) > + subq %rcx, %rax > + ret > > -# define VMOVA vmovdqa64 > - > -# define YMMMATCH ymm16 > - > -# define VEC_SIZE 32 > - > - .section .text.evex,"ax",@progbits > -ENTRY (__memrchr_evex) > - /* Broadcast CHAR to YMMMATCH. */ > - vpbroadcastb %esi, %YMMMATCH > - > - sub $VEC_SIZE, %RDX_LP > - jbe L(last_vec_or_less) > - > - add %RDX_LP, %RDI_LP > - > - /* Check the last VEC_SIZE bytes. 
*/ > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > - kmovd %k1, %eax > - testl %eax, %eax > - jnz L(last_vec_x0) > - > - subq $(VEC_SIZE * 4), %rdi > - movl %edi, %ecx > - andl $(VEC_SIZE - 1), %ecx > - jz L(aligned_more) > - > - /* Align data for aligned loads in the loop. */ > - addq $VEC_SIZE, %rdi > - addq $VEC_SIZE, %rdx > - andq $-VEC_SIZE, %rdi > - subq %rcx, %rdx > - > - .p2align 4 > -L(aligned_more): > - subq $(VEC_SIZE * 4), %rdx > - jbe L(last_4x_vec_or_less) > - > - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time > - since data is only aligned to VEC_SIZE. */ > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > - kmovd %k1, %eax > - testl %eax, %eax > - jnz L(last_vec_x3) > - > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 > - kmovd %k2, %eax > - testl %eax, %eax > - jnz L(last_vec_x2) > - > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 > - kmovd %k3, %eax > - testl %eax, %eax > - jnz L(last_vec_x1) > - > - vpcmpb $0, (%rdi), %YMMMATCH, %k4 > - kmovd %k4, %eax > - testl %eax, %eax > - jnz L(last_vec_x0) > - > - /* Align data to 4 * VEC_SIZE for loop with fewer branches. > - There are some overlaps with above if data isn't aligned > - to 4 * VEC_SIZE. */ > - movl %edi, %ecx > - andl $(VEC_SIZE * 4 - 1), %ecx > - jz L(loop_4x_vec) > - > - addq $(VEC_SIZE * 4), %rdi > - addq $(VEC_SIZE * 4), %rdx > - andq $-(VEC_SIZE * 4), %rdi > - subq %rcx, %rdx > + /* Fits in aligning bytes of first cache line. */ > +L(zero_0): > + xorl %eax, %eax > + ret > > - .p2align 4 > -L(loop_4x_vec): > - /* Compare 4 * VEC at a time forward. */ > - subq $(VEC_SIZE * 4), %rdi > - subq $(VEC_SIZE * 4), %rdx > - jbe L(last_4x_vec_or_less) > - > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k2 > - kord %k1, %k2, %k5 > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k3 > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k4 > - > - kord %k3, %k4, %k6 > - kortestd %k5, %k6 > - jz L(loop_4x_vec) > - > - /* There is a match. */ > - kmovd %k4, %eax > - testl %eax, %eax > - jnz L(last_vec_x3) > - > - kmovd %k3, %eax > - testl %eax, %eax > - jnz L(last_vec_x2) > - > - kmovd %k2, %eax > - testl %eax, %eax > - jnz L(last_vec_x1) > - > - kmovd %k1, %eax > - bsrl %eax, %eax > - addq %rdi, %rax > + .p2align 4,, 9 > +L(ret_vec_x0_dec): > + decq %rax > +L(ret_vec_x0): > + lzcntl %ecx, %ecx > + subq %rcx, %rax > ret > > - .p2align 4 > -L(last_4x_vec_or_less): > - addl $(VEC_SIZE * 4), %edx > - cmpl $(VEC_SIZE * 2), %edx > - jbe L(last_2x_vec) > + .p2align 4,, 10 > +L(more_1x_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x0) > > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > - kmovd %k1, %eax > - testl %eax, %eax > - jnz L(last_vec_x3) > + /* Align rax (pointer to string). */ > + andq $-VEC_SIZE, %rax > > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 > - kmovd %k2, %eax > - testl %eax, %eax > - jnz L(last_vec_x2) > + /* Recompute length after aligning. */ > + movq %rax, %rdx > > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 > - kmovd %k3, %eax > - testl %eax, %eax > - jnz L(last_vec_x1_check) > - cmpl $(VEC_SIZE * 3), %edx > - jbe L(zero) > + /* Need no matter what. 
*/ > + vpcmpb $0, -(VEC_SIZE)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > > - vpcmpb $0, (%rdi), %YMMMATCH, %k4 > - kmovd %k4, %eax > - testl %eax, %eax > - jz L(zero) > - bsrl %eax, %eax > - subq $(VEC_SIZE * 4), %rdx > - addq %rax, %rdx > - jl L(zero) > - addq %rdi, %rax > - ret > + subq %rdi, %rdx > > - .p2align 4 > + cmpq $(VEC_SIZE * 2), %rdx > + ja L(more_2x_vec) > L(last_2x_vec): > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > - kmovd %k1, %eax > - testl %eax, %eax > - jnz L(last_vec_x3_check) > + > + /* Must dec rax because L(ret_vec_x0_test) expects it. */ > + decq %rax > cmpl $VEC_SIZE, %edx > - jbe L(zero) > - > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1 > - kmovd %k1, %eax > - testl %eax, %eax > - jz L(zero) > - bsrl %eax, %eax > - subq $(VEC_SIZE * 2), %rdx > - addq %rax, %rdx > - jl L(zero) > - addl $(VEC_SIZE * 2), %eax > - addq %rdi, %rax > + jbe L(ret_vec_x0_test) > + > + testl %ecx, %ecx > + jnz L(ret_vec_x0) > + > + /* Don't use rax for pointer here because EVEX has better encoding with > + offset % VEC_SIZE == 0. */ > + vpcmpb $0, -(VEC_SIZE * 2)(%rdi, %rdx), %VECMATCH, %k0 > + kmovd %k0, %ecx > + /* NB: 64-bit lzcnt. This will naturally add 32 to position. */ > + lzcntq %rcx, %rcx > + cmpl %ecx, %edx > + jle L(zero_0) > + subq %rcx, %rax > ret > > - .p2align 4 > -L(last_vec_x0): > - bsrl %eax, %eax > - addq %rdi, %rax > + /* Inexpensive place to put this regarding code size / target alignments > + / ICache NLP. Necessary for 2-byte encoding of jump to page cross > + case which inturn in necessray for hot path (len <= VEC_SIZE) to fit ^^^^^^^^^^^^^^^^^^^ Typo? > + in first cache line. */ > +L(page_cross): > + movq %rax, %rsi > + andq $-VEC_SIZE, %rsi > + vpcmpb $0, (%rsi), %VECMATCH, %k0 > + kmovd %k0, %r8d > + /* Shift out negative alignment (because we are starting from endptr and > + working backwards). */ > + movl %eax, %ecx > + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ > + notl %ecx > + shlxl %ecx, %r8d, %ecx > + cmpq %rdi, %rsi > + ja L(more_1x_vec) > + lzcntl %ecx, %ecx > + cmpl %ecx, %edx > + jle L(zero_1) > + subq %rcx, %rax > ret > > - .p2align 4 > -L(last_vec_x1): > - bsrl %eax, %eax > - addl $VEC_SIZE, %eax > - addq %rdi, %rax > + /* Continue creating zero labels that fit in aligning bytes and get > + 2-byte encoding / are in the same cache line as condition. */ > +L(zero_1): > + xorl %eax, %eax > ret > > - .p2align 4 > -L(last_vec_x2): > - bsrl %eax, %eax > - addl $(VEC_SIZE * 2), %eax > - addq %rdi, %rax > + .p2align 4,, 8 > +L(ret_vec_x1): > + /* This will naturally add 32 to position. */ > + bsrl %ecx, %ecx > + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax > ret > > - .p2align 4 > -L(last_vec_x3): > - bsrl %eax, %eax > - addl $(VEC_SIZE * 3), %eax > - addq %rdi, %rax > - ret > + .p2align 4,, 8 > +L(more_2x_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x0_dec) > > - .p2align 4 > -L(last_vec_x1_check): > - bsrl %eax, %eax > - subq $(VEC_SIZE * 3), %rdx > - addq %rax, %rdx > - jl L(zero) > - addl $VEC_SIZE, %eax > - addq %rdi, %rax > - ret > + vpcmpb $0, -(VEC_SIZE * 2)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x1) > > - .p2align 4 > -L(last_vec_x3_check): > - bsrl %eax, %eax > - subq $VEC_SIZE, %rdx > - addq %rax, %rdx > - jl L(zero) > - addl $(VEC_SIZE * 3), %eax > - addq %rdi, %rax > - ret > + /* Need no matter what. 
*/ > + vpcmpb $0, -(VEC_SIZE * 3)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > > - .p2align 4 > -L(zero): > - xorl %eax, %eax > + subq $(VEC_SIZE * 4), %rdx > + ja L(more_4x_vec) > + > + cmpl $(VEC_SIZE * -1), %edx > + jle L(ret_vec_x2_test) > +L(last_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x2) > + > + > + /* Need no matter what. */ > + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > + lzcntl %ecx, %ecx > + subq $(VEC_SIZE * 3 + 1), %rax > + subq %rcx, %rax > + cmpq %rax, %rdi > + ja L(zero_1) > ret > > - .p2align 4 > -L(last_vec_or_less_aligned): > - movl %edx, %ecx > - > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > - > - movl $1, %edx > - /* Support rdx << 32. */ > - salq %cl, %rdx > - subq $1, %rdx > - > - kmovd %k1, %eax > - > - /* Remove the trailing bytes. */ > - andl %edx, %eax > - testl %eax, %eax > - jz L(zero) > - > - bsrl %eax, %eax > - addq %rdi, %rax > + .p2align 4,, 8 > +L(ret_vec_x2_test): > + lzcntl %ecx, %ecx > + subq $(VEC_SIZE * 2 + 1), %rax > + subq %rcx, %rax > + cmpq %rax, %rdi > + ja L(zero_1) > ret > > - .p2align 4 > -L(last_vec_or_less): > - addl $VEC_SIZE, %edx > - > - /* Check for zero length. */ > - testl %edx, %edx > - jz L(zero) > - > - movl %edi, %ecx > - andl $(VEC_SIZE - 1), %ecx > - jz L(last_vec_or_less_aligned) > - > - movl %ecx, %esi > - movl %ecx, %r8d > - addl %edx, %esi > - andq $-VEC_SIZE, %rdi > + .p2align 4,, 8 > +L(ret_vec_x2): > + bsrl %ecx, %ecx > + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax > + ret > > - subl $VEC_SIZE, %esi > - ja L(last_vec_2x_aligned) > + .p2align 4,, 8 > +L(ret_vec_x3): > + bsrl %ecx, %ecx > + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax > + ret > > - /* Check the last VEC. */ > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > - kmovd %k1, %eax > + .p2align 4,, 8 > +L(more_4x_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x2) > > - /* Remove the leading and trailing bytes. */ > - sarl %cl, %eax > - movl %edx, %ecx > + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > > - movl $1, %edx > - sall %cl, %edx > - subl $1, %edx > + testl %ecx, %ecx > + jnz L(ret_vec_x3) > > - andl %edx, %eax > - testl %eax, %eax > - jz L(zero) > + /* Check if near end before re-aligning (otherwise might do an > + unnecissary loop iteration). */ unnecessary > + addq $-(VEC_SIZE * 4), %rax > + cmpq $(VEC_SIZE * 4), %rdx > + jbe L(last_4x_vec) > > - bsrl %eax, %eax > - addq %rdi, %rax > - addq %r8, %rax > - ret > + decq %rax > + andq $-(VEC_SIZE * 4), %rax > + movq %rdi, %rdx > + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because > + lengths that overflow can be valid and break the comparison. */ > + andq $-(VEC_SIZE * 4), %rdx > > .p2align 4 > -L(last_vec_2x_aligned): > - movl %esi, %ecx > - > - /* Check the last VEC. */ > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k1 > +L(loop_4x_vec): > + /* Store 1 were not-equals and 0 where equals in k1 (used to mask later > + on). */ > + vpcmpb $4, (VEC_SIZE * 3)(%rax), %VECMATCH, %k1 > + > + /* VEC(2/3) will have zero-byte where we found a CHAR. */ > + vpxorq (VEC_SIZE * 2)(%rax), %VECMATCH, %VEC(2) > + vpxorq (VEC_SIZE * 1)(%rax), %VECMATCH, %VEC(3) > + vpcmpb $0, (VEC_SIZE * 0)(%rax), %VECMATCH, %k4 > + > + /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit where > + CHAR is found and VEC(2/3) have zero-byte where CHAR is found. */ > + vpminub %VEC(2), %VEC(3), %VEC(3){%k1}{z} > + vptestnmb %VEC(3), %VEC(3), %k2 > + > + /* Any 1s and we found CHAR. 
*/ > + kortestd %k2, %k4 > + jnz L(loop_end) > + > + addq $-(VEC_SIZE * 4), %rax > + cmpq %rdx, %rax > + jne L(loop_4x_vec) > + > + /* Need to re-adjust rdx / rax for L(last_4x_vec). */ > + subq $-(VEC_SIZE * 4), %rdx > + movq %rdx, %rax > + subl %edi, %edx > +L(last_4x_vec): > + > + /* Used no matter what. */ > + vpcmpb $0, (VEC_SIZE * -1)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > > - movl $1, %edx > - sall %cl, %edx > - subl $1, %edx > + cmpl $(VEC_SIZE * 2), %edx > + jbe L(last_2x_vec) > > - kmovd %k1, %eax > + testl %ecx, %ecx > + jnz L(ret_vec_x0_dec) > > - /* Remove the trailing bytes. */ > - andl %edx, %eax > > - testl %eax, %eax > - jnz L(last_vec_x1) > + vpcmpb $0, (VEC_SIZE * -2)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > > - /* Check the second last VEC. */ > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > + testl %ecx, %ecx > + jnz L(ret_vec_x1) > > - movl %r8d, %ecx > + /* Used no matter what. */ > + vpcmpb $0, (VEC_SIZE * -3)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > > - kmovd %k1, %eax > + cmpl $(VEC_SIZE * 3), %edx > + ja L(last_vec) > > - /* Remove the leading bytes. Must use unsigned right shift for > - bsrl below. */ > - shrl %cl, %eax > - testl %eax, %eax > - jz L(zero) > + lzcntl %ecx, %ecx > + subq $(VEC_SIZE * 2 + 1), %rax > + subq %rcx, %rax > + cmpq %rax, %rdi > + jbe L(ret_1) > + xorl %eax, %eax > +L(ret_1): > + ret > > - bsrl %eax, %eax > - addq %rdi, %rax > - addq %r8, %rax > + .p2align 4,, 6 > +L(loop_end): > + kmovd %k1, %ecx > + notl %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x0_end) > + > + vptestnmb %VEC(2), %VEC(2), %k0 > + kmovd %k0, %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x1_end) > + > + kmovd %k2, %ecx > + kmovd %k4, %esi > + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) > + then it won't affect the result in esi (VEC4). If ecx is non-zero > + then CHAR in VEC3 and bsrq will use that position. */ > + salq $32, %rcx > + orq %rsi, %rcx > + bsrq %rcx, %rcx > + addq %rcx, %rax > + ret > + .p2align 4,, 4 > +L(ret_vec_x0_end): > + addq $(VEC_SIZE), %rax > +L(ret_vec_x1_end): > + bsrl %ecx, %ecx > + leaq (VEC_SIZE * 2)(%rax, %rcx), %rax > ret > -END (__memrchr_evex) > + > +END(MEMRCHR) > #endif > -- > 2.34.1 > -- H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
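The lzcnt return path reviewed above is the core of the rewrite, so a rough C model may help readers who do not live in assembly. This is an illustrative sketch, not the glibc code: the mask layout (bit 31 corresponding to the last byte of the buffer, as produced by the vpcmpb/kmovd pair on the load ending at s + len - 1) and the function name are assumptions for exposition, and __builtin_clz stands in for lzcntl.

    #include <stddef.h>
    #include <stdint.h>

    /* Model of L(ret_vec_x0_test) for len <= 32.  Bit i of `mask` is set
       when byte (s + len - 32 + i) equals the search char, so bit 31 is
       the last byte of the buffer.  */
    static const char *
    ret_vec_x0_test (const char *s, size_t len, uint32_t mask)
    {
      /* Hardware lzcnt returns 32 for mask == 0, which automatically
         fails the length check below -- no separate "no match" branch.
         __builtin_clz is undefined at 0, hence the ternary here.  */
      unsigned int lz = mask ? (unsigned int) __builtin_clz (mask) : 32;
      /* A match before the start of the buffer (from the under-reading
         32-byte load) also shows up as lz >= len and is rejected.  */
      if (len <= lz)
        return NULL;
      return s + len - 1 - lz;
    }

Compared with the old bsrl path, the match position comes from a single subtract off the end pointer, and the no-match and out-of-range cases share one compare; that is the "saves either a branch or multiple instructions" point from the commit message.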
* Re: [PATCH v4 5/8] x86: Optimize memrchr-evex.S 2022-06-07 2:41 ` H.J. Lu @ 2022-06-07 4:09 ` Noah Goldstein 2022-06-07 4:12 ` Noah Goldstein 0 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:09 UTC (permalink / raw) To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 7:41 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Mon, Jun 6, 2022 at 3:37 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > The new code: > > 1. prioritizes smaller user-arg lengths more. > > 2. optimizes target placement more carefully > > 3. reuses logic more > > 4. fixes up various inefficiencies in the logic. The biggest > > case here is the `lzcnt` logic for checking returns which > > saves either a branch or multiple instructions. > > > > The total code size saving is: 263 bytes > > Geometric Mean of all benchmarks New / Old: 0.755 > > > > Regressions: > > There are some regressions. Particularly where the length (user arg > > length) is large but the position of the match char is near the > > begining of the string (in first VEC). This case has roughly a > > beginning Fixed in V5. > > > 20% regression. > > > > This is because the new logic gives the hot path for immediate matches > > to shorter lengths (the more common input). This case has roughly > > a 35% speedup. > > > > Full xcheck passes on x86_64. > > --- > > sysdeps/x86_64/multiarch/memrchr-evex.S | 539 ++++++++++++------------ > > 1 file changed, 268 insertions(+), 271 deletions(-) > > > > diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S > > index 0b99709c6b..ad541c0e50 100644 > > --- a/sysdeps/x86_64/multiarch/memrchr-evex.S > > +++ b/sysdeps/x86_64/multiarch/memrchr-evex.S > > @@ -19,319 +19,316 @@ > > #if IS_IN (libc) > > > > # include <sysdep.h> > > +# include "evex256-vecs.h" > > +# if VEC_SIZE != 32 > > +# error "VEC_SIZE != 32 unimplemented" > > +# endif > > + > > +# ifndef MEMRCHR > > +# define MEMRCHR __memrchr_evex > > +# endif > > + > > +# define PAGE_SIZE 4096 > > +# define VECMATCH VEC(0) > > + > > + .section SECTION(.text), "ax", @progbits > > +ENTRY_P2ALIGN(MEMRCHR, 6) > > +# ifdef __ILP32__ > > + /* Clear upper bits. */ > > + and %RDX_LP, %RDX_LP > > +# else > > + test %RDX_LP, %RDX_LP > > +# endif > > + jz L(zero_0) > > + > > + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a > > + correct page cross check and 2) it correctly sets up end ptr to be > > + subtract by lzcnt aligned. */ > > + leaq -1(%rdi, %rdx), %rax > > + vpbroadcastb %esi, %VECMATCH > > + > > + /* Check if we can load 1x VEC without cross a page. */ > > + testl $(PAGE_SIZE - VEC_SIZE), %eax > > + jz L(page_cross) > > + > > + /* Don't use rax for pointer here because EVEX has better encoding with > > + offset % VEC_SIZE == 0. */ > > + vpcmpb $0, -(VEC_SIZE)(%rdi, %rdx), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > + > > + /* Fall through for rdx (len) <= VEC_SIZE (expect small sizes). */ > > + cmpq $VEC_SIZE, %rdx > > + ja L(more_1x_vec) > > +L(ret_vec_x0_test): > > + > > + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which > > + will gurantee edx (len) is less than it. */ > guarantee Fixed in V5. > > + lzcntl %ecx, %ecx > > + cmpl %ecx, %edx > > + jle L(zero_0) > > + subq %rcx, %rax > > + ret > > > > -# define VMOVA vmovdqa64 > > - > > -# define YMMMATCH ymm16 > > - > > -# define VEC_SIZE 32 > > - > > - .section .text.evex,"ax",@progbits > > -ENTRY (__memrchr_evex) > > - /* Broadcast CHAR to YMMMATCH. 
*/ > > - vpbroadcastb %esi, %YMMMATCH > > - > > - sub $VEC_SIZE, %RDX_LP > > - jbe L(last_vec_or_less) > > - > > - add %RDX_LP, %RDI_LP > > - > > - /* Check the last VEC_SIZE bytes. */ > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x0) > > - > > - subq $(VEC_SIZE * 4), %rdi > > - movl %edi, %ecx > > - andl $(VEC_SIZE - 1), %ecx > > - jz L(aligned_more) > > - > > - /* Align data for aligned loads in the loop. */ > > - addq $VEC_SIZE, %rdi > > - addq $VEC_SIZE, %rdx > > - andq $-VEC_SIZE, %rdi > > - subq %rcx, %rdx > > - > > - .p2align 4 > > -L(aligned_more): > > - subq $(VEC_SIZE * 4), %rdx > > - jbe L(last_4x_vec_or_less) > > - > > - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time > > - since data is only aligned to VEC_SIZE. */ > > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3) > > - > > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 > > - kmovd %k2, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x2) > > - > > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 > > - kmovd %k3, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x1) > > - > > - vpcmpb $0, (%rdi), %YMMMATCH, %k4 > > - kmovd %k4, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x0) > > - > > - /* Align data to 4 * VEC_SIZE for loop with fewer branches. > > - There are some overlaps with above if data isn't aligned > > - to 4 * VEC_SIZE. */ > > - movl %edi, %ecx > > - andl $(VEC_SIZE * 4 - 1), %ecx > > - jz L(loop_4x_vec) > > - > > - addq $(VEC_SIZE * 4), %rdi > > - addq $(VEC_SIZE * 4), %rdx > > - andq $-(VEC_SIZE * 4), %rdi > > - subq %rcx, %rdx > > + /* Fits in aligning bytes of first cache line. */ > > +L(zero_0): > > + xorl %eax, %eax > > + ret > > > > - .p2align 4 > > -L(loop_4x_vec): > > - /* Compare 4 * VEC at a time forward. */ > > - subq $(VEC_SIZE * 4), %rdi > > - subq $(VEC_SIZE * 4), %rdx > > - jbe L(last_4x_vec_or_less) > > - > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k2 > > - kord %k1, %k2, %k5 > > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k3 > > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k4 > > - > > - kord %k3, %k4, %k6 > > - kortestd %k5, %k6 > > - jz L(loop_4x_vec) > > - > > - /* There is a match. */ > > - kmovd %k4, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3) > > - > > - kmovd %k3, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x2) > > - > > - kmovd %k2, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x1) > > - > > - kmovd %k1, %eax > > - bsrl %eax, %eax > > - addq %rdi, %rax > > + .p2align 4,, 9 > > +L(ret_vec_x0_dec): > > + decq %rax > > +L(ret_vec_x0): > > + lzcntl %ecx, %ecx > > + subq %rcx, %rax > > ret > > > > - .p2align 4 > > -L(last_4x_vec_or_less): > > - addl $(VEC_SIZE * 4), %edx > > - cmpl $(VEC_SIZE * 2), %edx > > - jbe L(last_2x_vec) > > + .p2align 4,, 10 > > +L(more_1x_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0) > > > > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3) > > + /* Align rax (pointer to string). */ > > + andq $-VEC_SIZE, %rax > > > > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 > > - kmovd %k2, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x2) > > + /* Recompute length after aligning. 
*/ > > + movq %rax, %rdx > > > > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 > > - kmovd %k3, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x1_check) > > - cmpl $(VEC_SIZE * 3), %edx > > - jbe L(zero) > > + /* Need no matter what. */ > > + vpcmpb $0, -(VEC_SIZE)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > > > - vpcmpb $0, (%rdi), %YMMMATCH, %k4 > > - kmovd %k4, %eax > > - testl %eax, %eax > > - jz L(zero) > > - bsrl %eax, %eax > > - subq $(VEC_SIZE * 4), %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addq %rdi, %rax > > - ret > > + subq %rdi, %rdx > > > > - .p2align 4 > > + cmpq $(VEC_SIZE * 2), %rdx > > + ja L(more_2x_vec) > > L(last_2x_vec): > > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3_check) > > + > > + /* Must dec rax because L(ret_vec_x0_test) expects it. */ > > + decq %rax > > cmpl $VEC_SIZE, %edx > > - jbe L(zero) > > - > > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > - jz L(zero) > > - bsrl %eax, %eax > > - subq $(VEC_SIZE * 2), %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addl $(VEC_SIZE * 2), %eax > > - addq %rdi, %rax > > + jbe L(ret_vec_x0_test) > > + > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0) > > + > > + /* Don't use rax for pointer here because EVEX has better encoding with > > + offset % VEC_SIZE == 0. */ > > + vpcmpb $0, -(VEC_SIZE * 2)(%rdi, %rdx), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > + /* NB: 64-bit lzcnt. This will naturally add 32 to position. */ > > + lzcntq %rcx, %rcx > > + cmpl %ecx, %edx > > + jle L(zero_0) > > + subq %rcx, %rax > > ret > > > > - .p2align 4 > > -L(last_vec_x0): > > - bsrl %eax, %eax > > - addq %rdi, %rax > > + /* Inexpensive place to put this regarding code size / target alignments > > + / ICache NLP. Necessary for 2-byte encoding of jump to page cross > > + case which inturn in necessray for hot path (len <= VEC_SIZE) to fit > ^^^^^^^^^^^^^^^^^^^ Typo? Missed this in V5. Will fix in V6 (will wait for other feedback). > > + in first cache line. */ > > +L(page_cross): > > + movq %rax, %rsi > > + andq $-VEC_SIZE, %rsi > > + vpcmpb $0, (%rsi), %VECMATCH, %k0 > > + kmovd %k0, %r8d > > + /* Shift out negative alignment (because we are starting from endptr and > > + working backwards). */ > > + movl %eax, %ecx > > + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ > > + notl %ecx > > + shlxl %ecx, %r8d, %ecx > > + cmpq %rdi, %rsi > > + ja L(more_1x_vec) > > + lzcntl %ecx, %ecx > > + cmpl %ecx, %edx > > + jle L(zero_1) > > + subq %rcx, %rax > > ret > > > > - .p2align 4 > > -L(last_vec_x1): > > - bsrl %eax, %eax > > - addl $VEC_SIZE, %eax > > - addq %rdi, %rax > > + /* Continue creating zero labels that fit in aligning bytes and get > > + 2-byte encoding / are in the same cache line as condition. */ > > +L(zero_1): > > + xorl %eax, %eax > > ret > > > > - .p2align 4 > > -L(last_vec_x2): > > - bsrl %eax, %eax > > - addl $(VEC_SIZE * 2), %eax > > - addq %rdi, %rax > > + .p2align 4,, 8 > > +L(ret_vec_x1): > > + /* This will naturally add 32 to position. 
*/ > > + bsrl %ecx, %ecx > > + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax > > ret > > > > - .p2align 4 > > -L(last_vec_x3): > > - bsrl %eax, %eax > > - addl $(VEC_SIZE * 3), %eax > > - addq %rdi, %rax > > - ret > > + .p2align 4,, 8 > > +L(more_2x_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0_dec) > > > > - .p2align 4 > > -L(last_vec_x1_check): > > - bsrl %eax, %eax > > - subq $(VEC_SIZE * 3), %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addl $VEC_SIZE, %eax > > - addq %rdi, %rax > > - ret > > + vpcmpb $0, -(VEC_SIZE * 2)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1) > > > > - .p2align 4 > > -L(last_vec_x3_check): > > - bsrl %eax, %eax > > - subq $VEC_SIZE, %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addl $(VEC_SIZE * 3), %eax > > - addq %rdi, %rax > > - ret > > + /* Need no matter what. */ > > + vpcmpb $0, -(VEC_SIZE * 3)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > > > - .p2align 4 > > -L(zero): > > - xorl %eax, %eax > > + subq $(VEC_SIZE * 4), %rdx > > + ja L(more_4x_vec) > > + > > + cmpl $(VEC_SIZE * -1), %edx > > + jle L(ret_vec_x2_test) > > +L(last_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x2) > > + > > + > > + /* Need no matter what. */ > > + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > + lzcntl %ecx, %ecx > > + subq $(VEC_SIZE * 3 + 1), %rax > > + subq %rcx, %rax > > + cmpq %rax, %rdi > > + ja L(zero_1) > > ret > > > > - .p2align 4 > > -L(last_vec_or_less_aligned): > > - movl %edx, %ecx > > - > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > - > > - movl $1, %edx > > - /* Support rdx << 32. */ > > - salq %cl, %rdx > > - subq $1, %rdx > > - > > - kmovd %k1, %eax > > - > > - /* Remove the trailing bytes. */ > > - andl %edx, %eax > > - testl %eax, %eax > > - jz L(zero) > > - > > - bsrl %eax, %eax > > - addq %rdi, %rax > > + .p2align 4,, 8 > > +L(ret_vec_x2_test): > > + lzcntl %ecx, %ecx > > + subq $(VEC_SIZE * 2 + 1), %rax > > + subq %rcx, %rax > > + cmpq %rax, %rdi > > + ja L(zero_1) > > ret > > > > - .p2align 4 > > -L(last_vec_or_less): > > - addl $VEC_SIZE, %edx > > - > > - /* Check for zero length. */ > > - testl %edx, %edx > > - jz L(zero) > > - > > - movl %edi, %ecx > > - andl $(VEC_SIZE - 1), %ecx > > - jz L(last_vec_or_less_aligned) > > - > > - movl %ecx, %esi > > - movl %ecx, %r8d > > - addl %edx, %esi > > - andq $-VEC_SIZE, %rdi > > + .p2align 4,, 8 > > +L(ret_vec_x2): > > + bsrl %ecx, %ecx > > + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax > > + ret > > > > - subl $VEC_SIZE, %esi > > - ja L(last_vec_2x_aligned) > > + .p2align 4,, 8 > > +L(ret_vec_x3): > > + bsrl %ecx, %ecx > > + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax > > + ret > > > > - /* Check the last VEC. */ > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > + .p2align 4,, 8 > > +L(more_4x_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x2) > > > > - /* Remove the leading and trailing bytes. */ > > - sarl %cl, %eax > > - movl %edx, %ecx > > + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > > > - movl $1, %edx > > - sall %cl, %edx > > - subl $1, %edx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x3) > > > > - andl %edx, %eax > > - testl %eax, %eax > > - jz L(zero) > > + /* Check if near end before re-aligning (otherwise might do an > > + unnecissary loop iteration). 
*/ > unnecessary > > + addq $-(VEC_SIZE * 4), %rax > > + cmpq $(VEC_SIZE * 4), %rdx > > + jbe L(last_4x_vec) > > > > - bsrl %eax, %eax > > - addq %rdi, %rax > > - addq %r8, %rax > > - ret > > + decq %rax > > + andq $-(VEC_SIZE * 4), %rax > > + movq %rdi, %rdx > > + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because > > + lengths that overflow can be valid and break the comparison. */ > > + andq $-(VEC_SIZE * 4), %rdx > > > > .p2align 4 > > -L(last_vec_2x_aligned): > > - movl %esi, %ecx > > - > > - /* Check the last VEC. */ > > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k1 > > +L(loop_4x_vec): > > + /* Store 1 were not-equals and 0 where equals in k1 (used to mask later > > + on). */ > > + vpcmpb $4, (VEC_SIZE * 3)(%rax), %VECMATCH, %k1 > > + > > + /* VEC(2/3) will have zero-byte where we found a CHAR. */ > > + vpxorq (VEC_SIZE * 2)(%rax), %VECMATCH, %VEC(2) > > + vpxorq (VEC_SIZE * 1)(%rax), %VECMATCH, %VEC(3) > > + vpcmpb $0, (VEC_SIZE * 0)(%rax), %VECMATCH, %k4 > > + > > + /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit where > > + CHAR is found and VEC(2/3) have zero-byte where CHAR is found. */ > > + vpminub %VEC(2), %VEC(3), %VEC(3){%k1}{z} > > + vptestnmb %VEC(3), %VEC(3), %k2 > > + > > + /* Any 1s and we found CHAR. */ > > + kortestd %k2, %k4 > > + jnz L(loop_end) > > + > > + addq $-(VEC_SIZE * 4), %rax > > + cmpq %rdx, %rax > > + jne L(loop_4x_vec) > > + > > + /* Need to re-adjust rdx / rax for L(last_4x_vec). */ > > + subq $-(VEC_SIZE * 4), %rdx > > + movq %rdx, %rax > > + subl %edi, %edx > > +L(last_4x_vec): > > + > > + /* Used no matter what. */ > > + vpcmpb $0, (VEC_SIZE * -1)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > > > - movl $1, %edx > > - sall %cl, %edx > > - subl $1, %edx > > + cmpl $(VEC_SIZE * 2), %edx > > + jbe L(last_2x_vec) > > > > - kmovd %k1, %eax > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0_dec) > > > > - /* Remove the trailing bytes. */ > > - andl %edx, %eax > > > > - testl %eax, %eax > > - jnz L(last_vec_x1) > > + vpcmpb $0, (VEC_SIZE * -2)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > > > - /* Check the second last VEC. */ > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1) > > > > - movl %r8d, %ecx > > + /* Used no matter what. */ > > + vpcmpb $0, (VEC_SIZE * -3)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > > > - kmovd %k1, %eax > > + cmpl $(VEC_SIZE * 3), %edx > > + ja L(last_vec) > > > > - /* Remove the leading bytes. Must use unsigned right shift for > > - bsrl below. */ > > - shrl %cl, %eax > > - testl %eax, %eax > > - jz L(zero) > > + lzcntl %ecx, %ecx > > + subq $(VEC_SIZE * 2 + 1), %rax > > + subq %rcx, %rax > > + cmpq %rax, %rdi > > + jbe L(ret_1) > > + xorl %eax, %eax > > +L(ret_1): > > + ret > > > > - bsrl %eax, %eax > > - addq %rdi, %rax > > - addq %r8, %rax > > + .p2align 4,, 6 > > +L(loop_end): > > + kmovd %k1, %ecx > > + notl %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0_end) > > + > > + vptestnmb %VEC(2), %VEC(2), %k0 > > + kmovd %k0, %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1_end) > > + > > + kmovd %k2, %ecx > > + kmovd %k4, %esi > > + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) > > + then it won't affect the result in esi (VEC4). If ecx is non-zero > > + then CHAR in VEC3 and bsrq will use that position. 
*/ > > + salq $32, %rcx > > + orq %rsi, %rcx > > + bsrq %rcx, %rcx > > + addq %rcx, %rax > > + ret > > + .p2align 4,, 4 > > +L(ret_vec_x0_end): > > + addq $(VEC_SIZE), %rax > > +L(ret_vec_x1_end): > > + bsrl %ecx, %ecx > > + leaq (VEC_SIZE * 2)(%rax, %rcx), %rax > > ret > > -END (__memrchr_evex) > > + > > +END(MEMRCHR) > > #endif > > -- > > 2.34.1 > > > > > -- > H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
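The page-cross path that the flagged typo sits in is subtle enough to deserve a model as well. A hedged C sketch follows; vec_mask is a hypothetical scalar stand-in for vpcmpb + kmovd, not a glibc helper, and the constants mirror the macros in the patch.

    #include <stddef.h>
    #include <stdint.h>

    #define VEC_SIZE 32
    #define PAGE_SIZE 4096

    /* Scalar stand-in for vpcmpb + kmovd: bit i set iff base[i] == c.  */
    static uint32_t
    vec_mask (const unsigned char *base, int c)
    {
      uint32_t m = 0;
      for (int i = 0; i < VEC_SIZE; i++)
        m |= (uint32_t) (base[i] == (unsigned char) c) << i;
      return m;
    }

    /* Model of the entry check plus the shlx realignment.  `last` is
       s + len - 1, the byte the backwards search starts from.  */
    static uint32_t
    page_cross_mask (const unsigned char *last, int c)
    {
      /* Conservative cross test: fires whenever `last` lies in the first
         VEC_SIZE bytes of a page, i.e. whenever an unaligned VEC-sized
         load ending at `last` could start on the preceding page.  */
      if (((uintptr_t) last & (PAGE_SIZE - VEC_SIZE)) != 0)
        /* Safe: the whole load stays on `last`s page.  */
        return vec_mask (last - (VEC_SIZE - 1), c);

      /* Otherwise load the aligned vector containing `last` and shift
         the mask so the bit for `last` lands in bit 31; bytes past the
         end fall off the top.  The count is (-end) mod 32, and since
         `last` already is end - 1, the identity -x = ~(x - 1) makes the
         count just ~last (shlx masks it to 5 bits anyway).  */
      const unsigned char *base
        = (const unsigned char *) ((uintptr_t) last & -(uintptr_t) VEC_SIZE);
      uint32_t m = vec_mask (base, c);
      return m << ((~(uintptr_t) last) & (VEC_SIZE - 1));
    }

After the shift, lzcnt on the result still measures distance back from `last`, so the same return sequence serves both the straight-line and page-cross cases; the real code additionally falls back into L(more_1x_vec) when the aligned vector does not reach the start of the buffer, and matches before the start are caught by the same length compare as in the model above.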
* Re: [PATCH v4 5/8] x86: Optimize memrchr-evex.S 2022-06-07 4:09 ` Noah Goldstein @ 2022-06-07 4:12 ` Noah Goldstein 0 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:12 UTC (permalink / raw) To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 9:09 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > On Mon, Jun 6, 2022 at 7:41 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > On Mon, Jun 6, 2022 at 3:37 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > The new code: > > > 1. prioritizes smaller user-arg lengths more. > > > 2. optimizes target placement more carefully > > > 3. reuses logic more > > > 4. fixes up various inefficiencies in the logic. The biggest > > > case here is the `lzcnt` logic for checking returns which > > > saves either a branch or multiple instructions. > > > > > > The total code size saving is: 263 bytes > > > Geometric Mean of all benchmarks New / Old: 0.755 > > > > > > Regressions: > > > There are some regressions. Particularly where the length (user arg > > > length) is large but the position of the match char is near the > > > begining of the string (in first VEC). This case has roughly a > > > > beginning > > Fixed in V5. > > > > > 20% regression. > > > > > > This is because the new logic gives the hot path for immediate matches > > > to shorter lengths (the more common input). This case has roughly > > > a 35% speedup. > > > > > > Full xcheck passes on x86_64. > > > --- > > > sysdeps/x86_64/multiarch/memrchr-evex.S | 539 ++++++++++++------------ > > > 1 file changed, 268 insertions(+), 271 deletions(-) > > > > > > diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S > > > index 0b99709c6b..ad541c0e50 100644 > > > --- a/sysdeps/x86_64/multiarch/memrchr-evex.S > > > +++ b/sysdeps/x86_64/multiarch/memrchr-evex.S > > > @@ -19,319 +19,316 @@ > > > #if IS_IN (libc) > > > > > > # include <sysdep.h> > > > +# include "evex256-vecs.h" > > > +# if VEC_SIZE != 32 > > > +# error "VEC_SIZE != 32 unimplemented" > > > +# endif > > > + > > > +# ifndef MEMRCHR > > > +# define MEMRCHR __memrchr_evex > > > +# endif > > > + > > > +# define PAGE_SIZE 4096 > > > +# define VECMATCH VEC(0) > > > + > > > + .section SECTION(.text), "ax", @progbits > > > +ENTRY_P2ALIGN(MEMRCHR, 6) > > > +# ifdef __ILP32__ > > > + /* Clear upper bits. */ > > > + and %RDX_LP, %RDX_LP > > > +# else > > > + test %RDX_LP, %RDX_LP > > > +# endif > > > + jz L(zero_0) > > > + > > > + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a > > > + correct page cross check and 2) it correctly sets up end ptr to be > > > + subtract by lzcnt aligned. */ > > > + leaq -1(%rdi, %rdx), %rax > > > + vpbroadcastb %esi, %VECMATCH > > > + > > > + /* Check if we can load 1x VEC without cross a page. */ > > > + testl $(PAGE_SIZE - VEC_SIZE), %eax > > > + jz L(page_cross) > > > + > > > + /* Don't use rax for pointer here because EVEX has better encoding with > > > + offset % VEC_SIZE == 0. */ > > > + vpcmpb $0, -(VEC_SIZE)(%rdi, %rdx), %VECMATCH, %k0 > > > + kmovd %k0, %ecx > > > + > > > + /* Fall through for rdx (len) <= VEC_SIZE (expect small sizes). */ > > > + cmpq $VEC_SIZE, %rdx > > > + ja L(more_1x_vec) > > > +L(ret_vec_x0_test): > > > + > > > + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which > > > + will gurantee edx (len) is less than it. */ > > guarantee > > Fixed in V5. 
> > > > + lzcntl %ecx, %ecx > > > + cmpl %ecx, %edx > > > + jle L(zero_0) > > > + subq %rcx, %rax > > > + ret > > > > > > -# define VMOVA vmovdqa64 > > > - > > > -# define YMMMATCH ymm16 > > > - > > > -# define VEC_SIZE 32 > > > - > > > - .section .text.evex,"ax",@progbits > > > -ENTRY (__memrchr_evex) > > > - /* Broadcast CHAR to YMMMATCH. */ > > > - vpbroadcastb %esi, %YMMMATCH > > > - > > > - sub $VEC_SIZE, %RDX_LP > > > - jbe L(last_vec_or_less) > > > - > > > - add %RDX_LP, %RDI_LP > > > - > > > - /* Check the last VEC_SIZE bytes. */ > > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > > - kmovd %k1, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x0) > > > - > > > - subq $(VEC_SIZE * 4), %rdi > > > - movl %edi, %ecx > > > - andl $(VEC_SIZE - 1), %ecx > > > - jz L(aligned_more) > > > - > > > - /* Align data for aligned loads in the loop. */ > > > - addq $VEC_SIZE, %rdi > > > - addq $VEC_SIZE, %rdx > > > - andq $-VEC_SIZE, %rdi > > > - subq %rcx, %rdx > > > - > > > - .p2align 4 > > > -L(aligned_more): > > > - subq $(VEC_SIZE * 4), %rdx > > > - jbe L(last_4x_vec_or_less) > > > - > > > - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time > > > - since data is only aligned to VEC_SIZE. */ > > > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > > > - kmovd %k1, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x3) > > > - > > > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 > > > - kmovd %k2, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x2) > > > - > > > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 > > > - kmovd %k3, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x1) > > > - > > > - vpcmpb $0, (%rdi), %YMMMATCH, %k4 > > > - kmovd %k4, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x0) > > > - > > > - /* Align data to 4 * VEC_SIZE for loop with fewer branches. > > > - There are some overlaps with above if data isn't aligned > > > - to 4 * VEC_SIZE. */ > > > - movl %edi, %ecx > > > - andl $(VEC_SIZE * 4 - 1), %ecx > > > - jz L(loop_4x_vec) > > > - > > > - addq $(VEC_SIZE * 4), %rdi > > > - addq $(VEC_SIZE * 4), %rdx > > > - andq $-(VEC_SIZE * 4), %rdi > > > - subq %rcx, %rdx > > > + /* Fits in aligning bytes of first cache line. */ > > > +L(zero_0): > > > + xorl %eax, %eax > > > + ret > > > > > > - .p2align 4 > > > -L(loop_4x_vec): > > > - /* Compare 4 * VEC at a time forward. */ > > > - subq $(VEC_SIZE * 4), %rdi > > > - subq $(VEC_SIZE * 4), %rdx > > > - jbe L(last_4x_vec_or_less) > > > - > > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k2 > > > - kord %k1, %k2, %k5 > > > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k3 > > > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k4 > > > - > > > - kord %k3, %k4, %k6 > > > - kortestd %k5, %k6 > > > - jz L(loop_4x_vec) > > > - > > > - /* There is a match. 
*/ > > > - kmovd %k4, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x3) > > > - > > > - kmovd %k3, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x2) > > > - > > > - kmovd %k2, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x1) > > > - > > > - kmovd %k1, %eax > > > - bsrl %eax, %eax > > > - addq %rdi, %rax > > > + .p2align 4,, 9 > > > +L(ret_vec_x0_dec): > > > + decq %rax > > > +L(ret_vec_x0): > > > + lzcntl %ecx, %ecx > > > + subq %rcx, %rax > > > ret > > > > > > - .p2align 4 > > > -L(last_4x_vec_or_less): > > > - addl $(VEC_SIZE * 4), %edx > > > - cmpl $(VEC_SIZE * 2), %edx > > > - jbe L(last_2x_vec) > > > + .p2align 4,, 10 > > > +L(more_1x_vec): > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x0) > > > > > > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > > > - kmovd %k1, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x3) > > > + /* Align rax (pointer to string). */ > > > + andq $-VEC_SIZE, %rax > > > > > > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 > > > - kmovd %k2, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x2) > > > + /* Recompute length after aligning. */ > > > + movq %rax, %rdx > > > > > > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 > > > - kmovd %k3, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x1_check) > > > - cmpl $(VEC_SIZE * 3), %edx > > > - jbe L(zero) > > > + /* Need no matter what. */ > > > + vpcmpb $0, -(VEC_SIZE)(%rax), %VECMATCH, %k0 > > > + kmovd %k0, %ecx > > > > > > - vpcmpb $0, (%rdi), %YMMMATCH, %k4 > > > - kmovd %k4, %eax > > > - testl %eax, %eax > > > - jz L(zero) > > > - bsrl %eax, %eax > > > - subq $(VEC_SIZE * 4), %rdx > > > - addq %rax, %rdx > > > - jl L(zero) > > > - addq %rdi, %rax > > > - ret > > > + subq %rdi, %rdx > > > > > > - .p2align 4 > > > + cmpq $(VEC_SIZE * 2), %rdx > > > + ja L(more_2x_vec) > > > L(last_2x_vec): > > > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > > > - kmovd %k1, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x3_check) > > > + > > > + /* Must dec rax because L(ret_vec_x0_test) expects it. */ > > > + decq %rax > > > cmpl $VEC_SIZE, %edx > > > - jbe L(zero) > > > - > > > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1 > > > - kmovd %k1, %eax > > > - testl %eax, %eax > > > - jz L(zero) > > > - bsrl %eax, %eax > > > - subq $(VEC_SIZE * 2), %rdx > > > - addq %rax, %rdx > > > - jl L(zero) > > > - addl $(VEC_SIZE * 2), %eax > > > - addq %rdi, %rax > > > + jbe L(ret_vec_x0_test) > > > + > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x0) > > > + > > > + /* Don't use rax for pointer here because EVEX has better encoding with > > > + offset % VEC_SIZE == 0. */ > > > + vpcmpb $0, -(VEC_SIZE * 2)(%rdi, %rdx), %VECMATCH, %k0 > > > + kmovd %k0, %ecx > > > + /* NB: 64-bit lzcnt. This will naturally add 32 to position. */ > > > + lzcntq %rcx, %rcx > > > + cmpl %ecx, %edx > > > + jle L(zero_0) > > > + subq %rcx, %rax > > > ret > > > > > > - .p2align 4 > > > -L(last_vec_x0): > > > - bsrl %eax, %eax > > > - addq %rdi, %rax > > > + /* Inexpensive place to put this regarding code size / target alignments > > > + / ICache NLP. Necessary for 2-byte encoding of jump to page cross > > > + case which inturn in necessray for hot path (len <= VEC_SIZE) to fit > > ^^^^^^^^^^^^^^^^^^^ Typo? > > Missed this in V5. Will fix in V6 (will wait for other feedback). Fixed in v6 (in avx2 version as well). > > > + in first cache line. 
*/ > > > +L(page_cross): > > > + movq %rax, %rsi > > > + andq $-VEC_SIZE, %rsi > > > + vpcmpb $0, (%rsi), %VECMATCH, %k0 > > > + kmovd %k0, %r8d > > > + /* Shift out negative alignment (because we are starting from endptr and > > > + working backwards). */ > > > + movl %eax, %ecx > > > + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ > > > + notl %ecx > > > + shlxl %ecx, %r8d, %ecx > > > + cmpq %rdi, %rsi > > > + ja L(more_1x_vec) > > > + lzcntl %ecx, %ecx > > > + cmpl %ecx, %edx > > > + jle L(zero_1) > > > + subq %rcx, %rax > > > ret > > > > > > - .p2align 4 > > > -L(last_vec_x1): > > > - bsrl %eax, %eax > > > - addl $VEC_SIZE, %eax > > > - addq %rdi, %rax > > > + /* Continue creating zero labels that fit in aligning bytes and get > > > + 2-byte encoding / are in the same cache line as condition. */ > > > +L(zero_1): > > > + xorl %eax, %eax > > > ret > > > > > > - .p2align 4 > > > -L(last_vec_x2): > > > - bsrl %eax, %eax > > > - addl $(VEC_SIZE * 2), %eax > > > - addq %rdi, %rax > > > + .p2align 4,, 8 > > > +L(ret_vec_x1): > > > + /* This will naturally add 32 to position. */ > > > + bsrl %ecx, %ecx > > > + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax > > > ret > > > > > > - .p2align 4 > > > -L(last_vec_x3): > > > - bsrl %eax, %eax > > > - addl $(VEC_SIZE * 3), %eax > > > - addq %rdi, %rax > > > - ret > > > + .p2align 4,, 8 > > > +L(more_2x_vec): > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x0_dec) > > > > > > - .p2align 4 > > > -L(last_vec_x1_check): > > > - bsrl %eax, %eax > > > - subq $(VEC_SIZE * 3), %rdx > > > - addq %rax, %rdx > > > - jl L(zero) > > > - addl $VEC_SIZE, %eax > > > - addq %rdi, %rax > > > - ret > > > + vpcmpb $0, -(VEC_SIZE * 2)(%rax), %VECMATCH, %k0 > > > + kmovd %k0, %ecx > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x1) > > > > > > - .p2align 4 > > > -L(last_vec_x3_check): > > > - bsrl %eax, %eax > > > - subq $VEC_SIZE, %rdx > > > - addq %rax, %rdx > > > - jl L(zero) > > > - addl $(VEC_SIZE * 3), %eax > > > - addq %rdi, %rax > > > - ret > > > + /* Need no matter what. */ > > > + vpcmpb $0, -(VEC_SIZE * 3)(%rax), %VECMATCH, %k0 > > > + kmovd %k0, %ecx > > > > > > - .p2align 4 > > > -L(zero): > > > - xorl %eax, %eax > > > + subq $(VEC_SIZE * 4), %rdx > > > + ja L(more_4x_vec) > > > + > > > + cmpl $(VEC_SIZE * -1), %edx > > > + jle L(ret_vec_x2_test) > > > +L(last_vec): > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x2) > > > + > > > + > > > + /* Need no matter what. */ > > > + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 > > > + kmovd %k0, %ecx > > > + lzcntl %ecx, %ecx > > > + subq $(VEC_SIZE * 3 + 1), %rax > > > + subq %rcx, %rax > > > + cmpq %rax, %rdi > > > + ja L(zero_1) > > > ret > > > > > > - .p2align 4 > > > -L(last_vec_or_less_aligned): > > > - movl %edx, %ecx > > > - > > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > > - > > > - movl $1, %edx > > > - /* Support rdx << 32. */ > > > - salq %cl, %rdx > > > - subq $1, %rdx > > > - > > > - kmovd %k1, %eax > > > - > > > - /* Remove the trailing bytes. */ > > > - andl %edx, %eax > > > - testl %eax, %eax > > > - jz L(zero) > > > - > > > - bsrl %eax, %eax > > > - addq %rdi, %rax > > > + .p2align 4,, 8 > > > +L(ret_vec_x2_test): > > > + lzcntl %ecx, %ecx > > > + subq $(VEC_SIZE * 2 + 1), %rax > > > + subq %rcx, %rax > > > + cmpq %rax, %rdi > > > + ja L(zero_1) > > > ret > > > > > > - .p2align 4 > > > -L(last_vec_or_less): > > > - addl $VEC_SIZE, %edx > > > - > > > - /* Check for zero length. 
*/ > > > - testl %edx, %edx > > > - jz L(zero) > > > - > > > - movl %edi, %ecx > > > - andl $(VEC_SIZE - 1), %ecx > > > - jz L(last_vec_or_less_aligned) > > > - > > > - movl %ecx, %esi > > > - movl %ecx, %r8d > > > - addl %edx, %esi > > > - andq $-VEC_SIZE, %rdi > > > + .p2align 4,, 8 > > > +L(ret_vec_x2): > > > + bsrl %ecx, %ecx > > > + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax > > > + ret > > > > > > - subl $VEC_SIZE, %esi > > > - ja L(last_vec_2x_aligned) > > > + .p2align 4,, 8 > > > +L(ret_vec_x3): > > > + bsrl %ecx, %ecx > > > + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax > > > + ret > > > > > > - /* Check the last VEC. */ > > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > > - kmovd %k1, %eax > > > + .p2align 4,, 8 > > > +L(more_4x_vec): > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x2) > > > > > > - /* Remove the leading and trailing bytes. */ > > > - sarl %cl, %eax > > > - movl %edx, %ecx > > > + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 > > > + kmovd %k0, %ecx > > > > > > - movl $1, %edx > > > - sall %cl, %edx > > > - subl $1, %edx > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x3) > > > > > > - andl %edx, %eax > > > - testl %eax, %eax > > > - jz L(zero) > > > + /* Check if near end before re-aligning (otherwise might do an > > > + unnecissary loop iteration). */ > > unnecessary > > > + addq $-(VEC_SIZE * 4), %rax > > > + cmpq $(VEC_SIZE * 4), %rdx > > > + jbe L(last_4x_vec) > > > > > > - bsrl %eax, %eax > > > - addq %rdi, %rax > > > - addq %r8, %rax > > > - ret > > > + decq %rax > > > + andq $-(VEC_SIZE * 4), %rax > > > + movq %rdi, %rdx > > > + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because > > > + lengths that overflow can be valid and break the comparison. */ > > > + andq $-(VEC_SIZE * 4), %rdx > > > > > > .p2align 4 > > > -L(last_vec_2x_aligned): > > > - movl %esi, %ecx > > > - > > > - /* Check the last VEC. */ > > > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k1 > > > +L(loop_4x_vec): > > > + /* Store 1 were not-equals and 0 where equals in k1 (used to mask later > > > + on). */ > > > + vpcmpb $4, (VEC_SIZE * 3)(%rax), %VECMATCH, %k1 > > > + > > > + /* VEC(2/3) will have zero-byte where we found a CHAR. */ > > > + vpxorq (VEC_SIZE * 2)(%rax), %VECMATCH, %VEC(2) > > > + vpxorq (VEC_SIZE * 1)(%rax), %VECMATCH, %VEC(3) > > > + vpcmpb $0, (VEC_SIZE * 0)(%rax), %VECMATCH, %k4 > > > + > > > + /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit where > > > + CHAR is found and VEC(2/3) have zero-byte where CHAR is found. */ > > > + vpminub %VEC(2), %VEC(3), %VEC(3){%k1}{z} > > > + vptestnmb %VEC(3), %VEC(3), %k2 > > > + > > > + /* Any 1s and we found CHAR. */ > > > + kortestd %k2, %k4 > > > + jnz L(loop_end) > > > + > > > + addq $-(VEC_SIZE * 4), %rax > > > + cmpq %rdx, %rax > > > + jne L(loop_4x_vec) > > > + > > > + /* Need to re-adjust rdx / rax for L(last_4x_vec). */ > > > + subq $-(VEC_SIZE * 4), %rdx > > > + movq %rdx, %rax > > > + subl %edi, %edx > > > +L(last_4x_vec): > > > + > > > + /* Used no matter what. */ > > > + vpcmpb $0, (VEC_SIZE * -1)(%rax), %VECMATCH, %k0 > > > + kmovd %k0, %ecx > > > > > > - movl $1, %edx > > > - sall %cl, %edx > > > - subl $1, %edx > > > + cmpl $(VEC_SIZE * 2), %edx > > > + jbe L(last_2x_vec) > > > > > > - kmovd %k1, %eax > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x0_dec) > > > > > > - /* Remove the trailing bytes. 
*/ > > > - andl %edx, %eax > > > > > > - testl %eax, %eax > > > - jnz L(last_vec_x1) > > > + vpcmpb $0, (VEC_SIZE * -2)(%rax), %VECMATCH, %k0 > > > + kmovd %k0, %ecx > > > > > > - /* Check the second last VEC. */ > > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x1) > > > > > > - movl %r8d, %ecx > > > + /* Used no matter what. */ > > > + vpcmpb $0, (VEC_SIZE * -3)(%rax), %VECMATCH, %k0 > > > + kmovd %k0, %ecx > > > > > > - kmovd %k1, %eax > > > + cmpl $(VEC_SIZE * 3), %edx > > > + ja L(last_vec) > > > > > > - /* Remove the leading bytes. Must use unsigned right shift for > > > - bsrl below. */ > > > - shrl %cl, %eax > > > - testl %eax, %eax > > > - jz L(zero) > > > + lzcntl %ecx, %ecx > > > + subq $(VEC_SIZE * 2 + 1), %rax > > > + subq %rcx, %rax > > > + cmpq %rax, %rdi > > > + jbe L(ret_1) > > > + xorl %eax, %eax > > > +L(ret_1): > > > + ret > > > > > > - bsrl %eax, %eax > > > - addq %rdi, %rax > > > - addq %r8, %rax > > > + .p2align 4,, 6 > > > +L(loop_end): > > > + kmovd %k1, %ecx > > > + notl %ecx > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x0_end) > > > + > > > + vptestnmb %VEC(2), %VEC(2), %k0 > > > + kmovd %k0, %ecx > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x1_end) > > > + > > > + kmovd %k2, %ecx > > > + kmovd %k4, %esi > > > + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) > > > + then it won't affect the result in esi (VEC4). If ecx is non-zero > > > + then CHAR in VEC3 and bsrq will use that position. */ > > > + salq $32, %rcx > > > + orq %rsi, %rcx > > > + bsrq %rcx, %rcx > > > + addq %rcx, %rax > > > + ret > > > + .p2align 4,, 4 > > > +L(ret_vec_x0_end): > > > + addq $(VEC_SIZE), %rax > > > +L(ret_vec_x1_end): > > > + bsrl %ecx, %ecx > > > + leaq (VEC_SIZE * 2)(%rax, %rcx), %rax > > > ret > > > -END (__memrchr_evex) > > > + > > > +END(MEMRCHR) > > > #endif > > > -- > > > 2.34.1 > > > > > > > > > -- > > H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
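The mask-combining trick in L(loop_end), quoted just above, also reads more naturally in C. A sketch under the assumption that mask_hi is the match mask for the vector 32 bytes above the block base and mask_lo for the vector at the base itself (the roles of the assembly's k2 and k4 after kmovd; the offset bookkeeping around them is simplified away here):

    #include <stddef.h>
    #include <stdint.h>

    /* Model of the salq/orq/bsrq tail of L(loop_end): merge two 32-bit
       masks into one 64-bit word so a single bit-scan picks the highest
       match across both vectors, instead of test + branch + a second
       scan.  Caller guarantees a match exists (kortestd was non-zero),
       so the bit-scan input is non-zero and well defined.  */
    static size_t
    highest_match_offset (uint32_t mask_hi, uint32_t mask_lo)
    {
      uint64_t both = ((uint64_t) mask_hi << 32) | mask_lo;
      /* bsrq equivalent: index of the highest set bit.  Because the high
         mask occupies bits 32..63, the bit index is directly the byte
         offset from the lower vector's base -- no extra arithmetic.  */
      return 63 - (size_t) __builtin_clzll (both);
    }

The assembly then finishes with a single addq of this offset onto the block base. Note the property the patch comment calls out: if mask_hi is zero it cannot disturb the result derived from mask_lo, so no separate emptiness test is needed.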
* [PATCH v4 6/8] x86: Optimize memrchr-avx2.S 2022-06-06 22:37 ` [PATCH v4 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (3 preceding siblings ...) 2022-06-06 22:37 ` [PATCH v4 5/8] x86: Optimize memrchr-evex.S Noah Goldstein @ 2022-06-06 22:37 ` Noah Goldstein 2022-06-07 2:35 ` H.J. Lu 2022-06-06 22:37 ` [PATCH v4 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein 2022-06-06 22:37 ` [PATCH v4 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 6 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-06 22:37 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully 3. reuses logic more 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns which saves either a branch or multiple instructions. The total code size saving is: 306 bytes Geometric Mean of all benchmarks New / Old: 0.760 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the begining of the string (in first VEC). This case has roughly a 10-20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). This case has roughly a 15-45% speedup. Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S | 1 + sysdeps/x86_64/multiarch/memrchr-avx2.S | 538 ++++++++++---------- 2 files changed, 260 insertions(+), 279 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S index cea2d2a72d..5e9beeeef2 100644 --- a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S +++ b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S @@ -2,6 +2,7 @@ # define MEMRCHR __memrchr_avx2_rtm #endif +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2.S b/sysdeps/x86_64/multiarch/memrchr-avx2.S index ba2ce7cb03..6915e1c373 100644 --- a/sysdeps/x86_64/multiarch/memrchr-avx2.S +++ b/sysdeps/x86_64/multiarch/memrchr-avx2.S @@ -21,340 +21,320 @@ # include <sysdep.h> # ifndef MEMRCHR -# define MEMRCHR __memrchr_avx2 +# define MEMRCHR __memrchr_avx2 # endif # ifndef VZEROUPPER -# define VZEROUPPER vzeroupper +# define VZEROUPPER vzeroupper # endif +// abf-off # ifndef SECTION # define SECTION(p) p##.avx # endif +// abf-on + +# define VEC_SIZE 32 +# define PAGE_SIZE 4096 + .section SECTION(.text), "ax", @progbits +ENTRY(MEMRCHR) +# ifdef __ILP32__ + /* Clear upper bits. */ + and %RDX_LP, %RDX_LP +# else + test %RDX_LP, %RDX_LP +# endif + jz L(zero_0) -# define VEC_SIZE 32 - - .section SECTION(.text),"ax",@progbits -ENTRY (MEMRCHR) - /* Broadcast CHAR to YMM0. */ vmovd %esi, %xmm0 - vpbroadcastb %xmm0, %ymm0 - - sub $VEC_SIZE, %RDX_LP - jbe L(last_vec_or_less) - - add %RDX_LP, %RDI_LP + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a + correct page cross check and 2) it correctly sets up end ptr to be + subtract by lzcnt aligned. */ + leaq -1(%rdx, %rdi), %rax - /* Check the last VEC_SIZE bytes. */ - vpcmpeqb (%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - subq $(VEC_SIZE * 4), %rdi - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(aligned_more) + vpbroadcastb %xmm0, %ymm0 - /* Align data for aligned loads in the loop. 
*/ - addq $VEC_SIZE, %rdi - addq $VEC_SIZE, %rdx - andq $-VEC_SIZE, %rdi - subq %rcx, %rdx + /* Check if we can load 1x VEC without cross a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %eax + jz L(page_cross) + + vpcmpeqb -(VEC_SIZE - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + cmpq $VEC_SIZE, %rdx + ja L(more_1x_vec) + +L(ret_vec_x0_test): + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which + will gurantee edx (len) is less than it. */ + lzcntl %ecx, %ecx + + /* Hoist vzeroupper (not great for RTM) to save code size. This allows + all logic for edx (len) <= VEC_SIZE to fit in first cache line. */ + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret - .p2align 4 -L(aligned_more): - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpcmpeqb (%rdi), %ymm0, %ymm4 - vpmovmskb %ymm4, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - /* Align data to 4 * VEC_SIZE for loop with fewer branches. - There are some overlaps with above if data isn't aligned - to 4 * VEC_SIZE. */ - movl %edi, %ecx - andl $(VEC_SIZE * 4 - 1), %ecx - jz L(loop_4x_vec) - - addq $(VEC_SIZE * 4), %rdi - addq $(VEC_SIZE * 4), %rdx - andq $-(VEC_SIZE * 4), %rdi - subq %rcx, %rdx + /* Fits in aligning bytes of first cache line. */ +L(zero_0): + xorl %eax, %eax + ret - .p2align 4 -L(loop_4x_vec): - /* Compare 4 * VEC at a time forward. */ - subq $(VEC_SIZE * 4), %rdi - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - vmovdqa (%rdi), %ymm1 - vmovdqa VEC_SIZE(%rdi), %ymm2 - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm3 - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm4 - - vpcmpeqb %ymm1, %ymm0, %ymm1 - vpcmpeqb %ymm2, %ymm0, %ymm2 - vpcmpeqb %ymm3, %ymm0, %ymm3 - vpcmpeqb %ymm4, %ymm0, %ymm4 - - vpor %ymm1, %ymm2, %ymm5 - vpor %ymm3, %ymm4, %ymm6 - vpor %ymm5, %ymm6, %ymm5 - - vpmovmskb %ymm5, %eax - testl %eax, %eax - jz L(loop_4x_vec) - - /* There is a match. */ - vpmovmskb %ymm4, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpmovmskb %ymm1, %eax - bsrl %eax, %eax - addq %rdi, %rax + .p2align 4,, 9 +L(ret_vec_x0): + lzcntl %ecx, %ecx + subq %rcx, %rax L(return_vzeroupper): ZERO_UPPER_VEC_REGISTERS_RETURN - .p2align 4 -L(last_4x_vec_or_less): - addl $(VEC_SIZE * 4), %edx - cmpl $(VEC_SIZE * 2), %edx - jbe L(last_2x_vec) - - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x1_check) - cmpl $(VEC_SIZE * 3), %edx - jbe L(zero) - - vpcmpeqb (%rdi), %ymm0, %ymm4 - vpmovmskb %ymm4, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 4), %rdx - addq %rax, %rdx - jl L(zero) - addq %rdi, %rax - VZEROUPPER_RETURN - - .p2align 4 + .p2align 4,, 10 +L(more_1x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) + + /* Align rax (string pointer). 
*/ + andq $-VEC_SIZE, %rax + + /* Recompute remaining length after aligning. */ + movq %rax, %rdx + /* Need this comparison next no matter what. */ + vpcmpeqb -(VEC_SIZE)(%rax), %ymm0, %ymm1 + subq %rdi, %rdx + decq %rax + vpmovmskb %ymm1, %ecx + /* Fall through for short (hotter than length). */ + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) L(last_2x_vec): - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3_check) cmpl $VEC_SIZE, %edx - jbe L(zero) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 2), %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax - VZEROUPPER_RETURN - - .p2align 4 -L(last_vec_x0): - bsrl %eax, %eax - addq %rdi, %rax - VZEROUPPER_RETURN + jbe L(ret_vec_x0_test) + + testl %ecx, %ecx + jnz L(ret_vec_x0) + + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + /* 64-bit lzcnt. This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret - .p2align 4 -L(last_vec_x1): - bsrl %eax, %eax - addl $VEC_SIZE, %eax - addq %rdi, %rax - VZEROUPPER_RETURN - .p2align 4 -L(last_vec_x2): - bsrl %eax, %eax - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + /* Inexpensive place to put this regarding code size / target alignments + / ICache NLP. Necessary for 2-byte encoding of jump to page cross + case which inturn in necessray for hot path (len <= VEC_SIZE) to fit + in first cache line. */ +L(page_cross): + movq %rax, %rsi + andq $-VEC_SIZE, %rsi + vpcmpeqb (%rsi), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + movl %eax, %r8d + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ + notl %r8d + shlxl %r8d, %ecx, %ecx + cmpq %rdi, %rsi + ja L(more_1x_vec) + lzcntl %ecx, %ecx + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret + .p2align 4,, 11 +L(ret_vec_x1): + /* This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + subq %rcx, %rax VZEROUPPER_RETURN + .p2align 4,, 10 +L(more_2x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) - .p2align 4 -L(last_vec_x3): - bsrl %eax, %eax - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1) - .p2align 4 -L(last_vec_x1_check): - bsrl %eax, %eax - subq $(VEC_SIZE * 3), %rdx - addq %rax, %rdx - jl L(zero) - addl $VEC_SIZE, %eax - addq %rdi, %rax - VZEROUPPER_RETURN - .p2align 4 -L(last_vec_x3_check): - bsrl %eax, %eax - subq $VEC_SIZE, %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - VZEROUPPER_RETURN + /* Needed no matter what. */ + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - .p2align 4 -L(zero): - xorl %eax, %eax - VZEROUPPER_RETURN + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) + + cmpl $(VEC_SIZE * -1), %edx + jle L(ret_vec_x2_test) + +L(last_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) + + /* Needed no matter what. */ + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 3), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_2) + ret - .p2align 4 -L(null): + /* First in aligning bytes. 
*/ +L(zero_2): xorl %eax, %eax ret - .p2align 4 -L(last_vec_or_less_aligned): - movl %edx, %ecx + .p2align 4,, 4 +L(ret_vec_x2_test): + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_2) + ret - vpcmpeqb (%rdi), %ymm0, %ymm1 - movl $1, %edx - /* Support rdx << 32. */ - salq %cl, %rdx - subq $1, %rdx + .p2align 4,, 11 +L(ret_vec_x2): + /* ecx must be non-zero. */ + bsrl %ecx, %ecx + leaq (VEC_SIZE * -3 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - vpmovmskb %ymm1, %eax + .p2align 4,, 14 +L(ret_vec_x3): + /* ecx must be non-zero. */ + bsrl %ecx, %ecx + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - /* Remove the trailing bytes. */ - andl %edx, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - addq %rdi, %rax - VZEROUPPER_RETURN .p2align 4 -L(last_vec_or_less): - addl $VEC_SIZE, %edx +L(more_4x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) - /* Check for zero length. */ - testl %edx, %edx - jz L(null) + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(last_vec_or_less_aligned) + testl %ecx, %ecx + jnz L(ret_vec_x3) - movl %ecx, %esi - movl %ecx, %r8d - addl %edx, %esi - andq $-VEC_SIZE, %rdi + /* Check if near end before re-aligning (otherwise might do an + unnecissary loop iteration). */ + addq $-(VEC_SIZE * 4), %rax + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) - subl $VEC_SIZE, %esi - ja L(last_vec_2x_aligned) + /* Align rax to (VEC_SIZE - 1). */ + orq $(VEC_SIZE * 4 - 1), %rax + movq %rdi, %rdx + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because + lengths that overflow can be valid and break the comparison. */ + orq $(VEC_SIZE * 4 - 1), %rdx - /* Check the last VEC. */ - vpcmpeqb (%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - - /* Remove the leading and trailing bytes. */ - sarl %cl, %eax - movl %edx, %ecx + .p2align 4 +L(loop_4x_vec): + /* Need this comparison next no matter what. */ + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm2 + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm3 + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm4 - movl $1, %edx - sall %cl, %edx - subl $1, %edx + vpor %ymm1, %ymm2, %ymm2 + vpor %ymm3, %ymm4, %ymm4 + vpor %ymm2, %ymm4, %ymm4 + vpmovmskb %ymm4, %esi - andl %edx, %eax - testl %eax, %eax - jz L(zero) + testl %esi, %esi + jnz L(loop_end) - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax - VZEROUPPER_RETURN + addq $(VEC_SIZE * -4), %rax + cmpq %rdx, %rax + jne L(loop_4x_vec) - .p2align 4 -L(last_vec_2x_aligned): - movl %esi, %ecx + subl %edi, %edx + incl %edx - /* Check the last VEC. */ - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm1 +L(last_4x_vec): + /* Used no matter what. */ + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) - vpmovmskb %ymm1, %eax + testl %ecx, %ecx + jnz L(ret_vec_x0_end) - /* Remove the trailing bytes. */ - andl %edx, %eax + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) - testl %eax, %eax - jnz L(last_vec_x1) + /* Used no matter what. */ + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - /* Check the second last VEC. 
*/ - vpcmpeqb (%rdi), %ymm0, %ymm1 + cmpl $(VEC_SIZE * 3), %edx + ja L(last_vec) + + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + jbe L(ret0) + xorl %eax, %eax +L(ret0): + ret - movl %r8d, %ecx - vpmovmskb %ymm1, %eax + .p2align 4 +L(loop_end): + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x0_end) + + vpmovmskb %ymm2, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) + + vpmovmskb %ymm3, %ecx + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If ecx is non-zero + then CHAR in VEC3 and bsrq will use that position. */ + salq $32, %rcx + orq %rsi, %rcx + bsrq %rcx, %rcx + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - /* Remove the leading bytes. Must use unsigned right shift for - bsrl below. */ - shrl %cl, %eax - testl %eax, %eax - jz L(zero) + .p2align 4,, 4 +L(ret_vec_x1_end): + /* 64-bit version will automatically add 32 (VEC_SIZE). */ + lzcntq %rcx, %rcx + subq %rcx, %rax + VZEROUPPER_RETURN - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax + .p2align 4,, 4 +L(ret_vec_x0_end): + lzcntl %ecx, %ecx + subq %rcx, %rax VZEROUPPER_RETURN -END (MEMRCHR) + + /* 2 bytes until next cache line. */ +END(MEMRCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
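To make the `lzcnt` return idiom used throughout this rewrite concrete, here is a minimal C sketch (illustrative only; the helper name and standalone framing are mine, not glibc's). With a register holding the address of the last byte covered by a 32-byte compare window, one subtract of the mask's leading-zero count replaces the old per-label `bsrl`/`addl`/`addq` sequences:

```c
#include <stdint.h>

/* last_byte points at the final byte of a 32-byte window; mask is the
   vpmovmskb result for that window, so bit 31 corresponds to
   *last_byte.  */
static inline unsigned char *
ret_vec_sketch (unsigned char *last_byte, uint32_t mask)
{
  /* lzcnt defines lzcnt (0) == 32, while __builtin_clz (0) is
     undefined, so model that case explicitly.  Callers compare the
     count against the remaining length and return NULL when it is out
     of range, which is also what makes the mask == 0 case safe.  */
  unsigned lz = mask ? (unsigned) __builtin_clz (mask) : 32;
  return last_byte - lz;
}
```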
* Re: [PATCH v4 6/8] x86: Optimize memrchr-avx2.S 2022-06-06 22:37 ` [PATCH v4 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein @ 2022-06-07 2:35 ` H.J. Lu 2022-06-07 4:06 ` Noah Goldstein 0 siblings, 1 reply; 82+ messages in thread From: H.J. Lu @ 2022-06-07 2:35 UTC (permalink / raw) To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 3:37 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > The new code: > 1. prioritizes smaller user-arg lengths more. > 2. optimizes target placement more carefully > 3. reuses logic more > 4. fixes up various inefficiencies in the logic. The biggest > case here is the `lzcnt` logic for checking returns which > saves either a branch or multiple instructions. > > The total code size saving is: 306 bytes > Geometric Mean of all benchmarks New / Old: 0.760 > > Regressions: > There are some regressions. Particularly where the length (user arg > length) is large but the position of the match char is near the > begining of the string (in first VEC). This case has roughly a > 10-20% regression. > > This is because the new logic gives the hot path for immediate matches > to shorter lengths (the more common input). This case has roughly > a 15-45% speedup. > > Full xcheck passes on x86_64. > --- > sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S | 1 + > sysdeps/x86_64/multiarch/memrchr-avx2.S | 538 ++++++++++---------- > 2 files changed, 260 insertions(+), 279 deletions(-) > > diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > index cea2d2a72d..5e9beeeef2 100644 > --- a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > +++ b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > @@ -2,6 +2,7 @@ > # define MEMRCHR __memrchr_avx2_rtm > #endif > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2.S b/sysdeps/x86_64/multiarch/memrchr-avx2.S > index ba2ce7cb03..6915e1c373 100644 > --- a/sysdeps/x86_64/multiarch/memrchr-avx2.S > +++ b/sysdeps/x86_64/multiarch/memrchr-avx2.S > @@ -21,340 +21,320 @@ > # include <sysdep.h> > > # ifndef MEMRCHR > -# define MEMRCHR __memrchr_avx2 > +# define MEMRCHR __memrchr_avx2 > # endif > > # ifndef VZEROUPPER > -# define VZEROUPPER vzeroupper > +# define VZEROUPPER vzeroupper > # endif > > +// abf-off > # ifndef SECTION > # define SECTION(p) p##.avx > # endif > +// abf-on What are the above changes for? > +# define VEC_SIZE 32 > +# define PAGE_SIZE 4096 > + .section SECTION(.text), "ax", @progbits > +ENTRY(MEMRCHR) > +# ifdef __ILP32__ > + /* Clear upper bits. */ > + and %RDX_LP, %RDX_LP > +# else > + test %RDX_LP, %RDX_LP > +# endif > + jz L(zero_0) > > -# define VEC_SIZE 32 > - > - .section SECTION(.text),"ax",@progbits > -ENTRY (MEMRCHR) > - /* Broadcast CHAR to YMM0. */ > vmovd %esi, %xmm0 > - vpbroadcastb %xmm0, %ymm0 > - > - sub $VEC_SIZE, %RDX_LP > - jbe L(last_vec_or_less) > - > - add %RDX_LP, %RDI_LP > + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a > + correct page cross check and 2) it correctly sets up end ptr to be > + subtract by lzcnt aligned. */ > + leaq -1(%rdx, %rdi), %rax > > - /* Check the last VEC_SIZE bytes. 
*/ > - vpcmpeqb (%rdi), %ymm0, %ymm1 > - vpmovmskb %ymm1, %eax > - testl %eax, %eax > - jnz L(last_vec_x0) > - > - subq $(VEC_SIZE * 4), %rdi > - movl %edi, %ecx > - andl $(VEC_SIZE - 1), %ecx > - jz L(aligned_more) > + vpbroadcastb %xmm0, %ymm0 > > - /* Align data for aligned loads in the loop. */ > - addq $VEC_SIZE, %rdi > - addq $VEC_SIZE, %rdx > - andq $-VEC_SIZE, %rdi > - subq %rcx, %rdx > + /* Check if we can load 1x VEC without cross a page. */ > + testl $(PAGE_SIZE - VEC_SIZE), %eax > + jz L(page_cross) > + > + vpcmpeqb -(VEC_SIZE - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > + cmpq $VEC_SIZE, %rdx > + ja L(more_1x_vec) > + > +L(ret_vec_x0_test): > + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which > + will gurantee edx (len) is less than it. */ > + lzcntl %ecx, %ecx > + > + /* Hoist vzeroupper (not great for RTM) to save code size. This allows > + all logic for edx (len) <= VEC_SIZE to fit in first cache line. */ > + COND_VZEROUPPER > + cmpl %ecx, %edx > + jle L(zero_0) > + subq %rcx, %rax > + ret > > - .p2align 4 > -L(aligned_more): > - subq $(VEC_SIZE * 4), %rdx > - jbe L(last_4x_vec_or_less) > - > - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time > - since data is only aligned to VEC_SIZE. */ > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > - vpmovmskb %ymm1, %eax > - testl %eax, %eax > - jnz L(last_vec_x3) > - > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 > - vpmovmskb %ymm2, %eax > - testl %eax, %eax > - jnz L(last_vec_x2) > - > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 > - vpmovmskb %ymm3, %eax > - testl %eax, %eax > - jnz L(last_vec_x1) > - > - vpcmpeqb (%rdi), %ymm0, %ymm4 > - vpmovmskb %ymm4, %eax > - testl %eax, %eax > - jnz L(last_vec_x0) > - > - /* Align data to 4 * VEC_SIZE for loop with fewer branches. > - There are some overlaps with above if data isn't aligned > - to 4 * VEC_SIZE. */ > - movl %edi, %ecx > - andl $(VEC_SIZE * 4 - 1), %ecx > - jz L(loop_4x_vec) > - > - addq $(VEC_SIZE * 4), %rdi > - addq $(VEC_SIZE * 4), %rdx > - andq $-(VEC_SIZE * 4), %rdi > - subq %rcx, %rdx > + /* Fits in aligning bytes of first cache line. */ > +L(zero_0): > + xorl %eax, %eax > + ret > > - .p2align 4 > -L(loop_4x_vec): > - /* Compare 4 * VEC at a time forward. */ > - subq $(VEC_SIZE * 4), %rdi > - subq $(VEC_SIZE * 4), %rdx > - jbe L(last_4x_vec_or_less) > - > - vmovdqa (%rdi), %ymm1 > - vmovdqa VEC_SIZE(%rdi), %ymm2 > - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm3 > - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm4 > - > - vpcmpeqb %ymm1, %ymm0, %ymm1 > - vpcmpeqb %ymm2, %ymm0, %ymm2 > - vpcmpeqb %ymm3, %ymm0, %ymm3 > - vpcmpeqb %ymm4, %ymm0, %ymm4 > - > - vpor %ymm1, %ymm2, %ymm5 > - vpor %ymm3, %ymm4, %ymm6 > - vpor %ymm5, %ymm6, %ymm5 > - > - vpmovmskb %ymm5, %eax > - testl %eax, %eax > - jz L(loop_4x_vec) > - > - /* There is a match. 
*/ > - vpmovmskb %ymm4, %eax > - testl %eax, %eax > - jnz L(last_vec_x3) > - > - vpmovmskb %ymm3, %eax > - testl %eax, %eax > - jnz L(last_vec_x2) > - > - vpmovmskb %ymm2, %eax > - testl %eax, %eax > - jnz L(last_vec_x1) > - > - vpmovmskb %ymm1, %eax > - bsrl %eax, %eax > - addq %rdi, %rax > + .p2align 4,, 9 > +L(ret_vec_x0): > + lzcntl %ecx, %ecx > + subq %rcx, %rax > L(return_vzeroupper): > ZERO_UPPER_VEC_REGISTERS_RETURN > > - .p2align 4 > -L(last_4x_vec_or_less): > - addl $(VEC_SIZE * 4), %edx > - cmpl $(VEC_SIZE * 2), %edx > - jbe L(last_2x_vec) > - > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > - vpmovmskb %ymm1, %eax > - testl %eax, %eax > - jnz L(last_vec_x3) > - > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 > - vpmovmskb %ymm2, %eax > - testl %eax, %eax > - jnz L(last_vec_x2) > - > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 > - vpmovmskb %ymm3, %eax > - testl %eax, %eax > - jnz L(last_vec_x1_check) > - cmpl $(VEC_SIZE * 3), %edx > - jbe L(zero) > - > - vpcmpeqb (%rdi), %ymm0, %ymm4 > - vpmovmskb %ymm4, %eax > - testl %eax, %eax > - jz L(zero) > - bsrl %eax, %eax > - subq $(VEC_SIZE * 4), %rdx > - addq %rax, %rdx > - jl L(zero) > - addq %rdi, %rax > - VZEROUPPER_RETURN > - > - .p2align 4 > + .p2align 4,, 10 > +L(more_1x_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x0) > + > + /* Align rax (string pointer). */ > + andq $-VEC_SIZE, %rax > + > + /* Recompute remaining length after aligning. */ > + movq %rax, %rdx > + /* Need this comparison next no matter what. */ > + vpcmpeqb -(VEC_SIZE)(%rax), %ymm0, %ymm1 > + subq %rdi, %rdx > + decq %rax > + vpmovmskb %ymm1, %ecx > + /* Fall through for short (hotter than length). */ > + cmpq $(VEC_SIZE * 2), %rdx > + ja L(more_2x_vec) > L(last_2x_vec): > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > - vpmovmskb %ymm1, %eax > - testl %eax, %eax > - jnz L(last_vec_x3_check) > cmpl $VEC_SIZE, %edx > - jbe L(zero) > - > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1 > - vpmovmskb %ymm1, %eax > - testl %eax, %eax > - jz L(zero) > - bsrl %eax, %eax > - subq $(VEC_SIZE * 2), %rdx > - addq %rax, %rdx > - jl L(zero) > - addl $(VEC_SIZE * 2), %eax > - addq %rdi, %rax > - VZEROUPPER_RETURN > - > - .p2align 4 > -L(last_vec_x0): > - bsrl %eax, %eax > - addq %rdi, %rax > - VZEROUPPER_RETURN > + jbe L(ret_vec_x0_test) > + > + testl %ecx, %ecx > + jnz L(ret_vec_x0) > + > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > + /* 64-bit lzcnt. This will naturally add 32 to position. */ > + lzcntq %rcx, %rcx > + COND_VZEROUPPER > + cmpl %ecx, %edx > + jle L(zero_0) > + subq %rcx, %rax > + ret > > - .p2align 4 > -L(last_vec_x1): > - bsrl %eax, %eax > - addl $VEC_SIZE, %eax > - addq %rdi, %rax > - VZEROUPPER_RETURN > > - .p2align 4 > -L(last_vec_x2): > - bsrl %eax, %eax > - addl $(VEC_SIZE * 2), %eax > - addq %rdi, %rax > + /* Inexpensive place to put this regarding code size / target alignments > + / ICache NLP. Necessary for 2-byte encoding of jump to page cross > + case which inturn in necessray for hot path (len <= VEC_SIZE) to fit in turn > + in first cache line. */ > +L(page_cross): > + movq %rax, %rsi > + andq $-VEC_SIZE, %rsi > + vpcmpeqb (%rsi), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > + /* Shift out negative alignment (because we are starting from endptr and > + working backwards). */ > + movl %eax, %r8d > + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). 
*/ > + notl %r8d > + shlxl %r8d, %ecx, %ecx > + cmpq %rdi, %rsi > + ja L(more_1x_vec) > + lzcntl %ecx, %ecx > + COND_VZEROUPPER > + cmpl %ecx, %edx > + jle L(zero_0) > + subq %rcx, %rax > + ret > + .p2align 4,, 11 > +L(ret_vec_x1): > + /* This will naturally add 32 to position. */ > + lzcntq %rcx, %rcx > + subq %rcx, %rax > VZEROUPPER_RETURN > + .p2align 4,, 10 > +L(more_2x_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x0) > > - .p2align 4 > -L(last_vec_x3): > - bsrl %eax, %eax > - addl $(VEC_SIZE * 3), %eax > - addq %rdi, %rax > - ret > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x1) > > - .p2align 4 > -L(last_vec_x1_check): > - bsrl %eax, %eax > - subq $(VEC_SIZE * 3), %rdx > - addq %rax, %rdx > - jl L(zero) > - addl $VEC_SIZE, %eax > - addq %rdi, %rax > - VZEROUPPER_RETURN > > - .p2align 4 > -L(last_vec_x3_check): > - bsrl %eax, %eax > - subq $VEC_SIZE, %rdx > - addq %rax, %rdx > - jl L(zero) > - addl $(VEC_SIZE * 3), %eax > - addq %rdi, %rax > - VZEROUPPER_RETURN > + /* Needed no matter what. */ > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > > - .p2align 4 > -L(zero): > - xorl %eax, %eax > - VZEROUPPER_RETURN > + subq $(VEC_SIZE * 4), %rdx > + ja L(more_4x_vec) > + > + cmpl $(VEC_SIZE * -1), %edx > + jle L(ret_vec_x2_test) > + > +L(last_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x2) > + > + /* Needed no matter what. */ > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > + lzcntl %ecx, %ecx > + subq $(VEC_SIZE * 3), %rax > + COND_VZEROUPPER > + subq %rcx, %rax > + cmpq %rax, %rdi > + ja L(zero_2) > + ret > > - .p2align 4 > -L(null): > + /* First in aligning bytes. */ > +L(zero_2): > xorl %eax, %eax > ret > > - .p2align 4 > -L(last_vec_or_less_aligned): > - movl %edx, %ecx > + .p2align 4,, 4 > +L(ret_vec_x2_test): > + lzcntl %ecx, %ecx > + subq $(VEC_SIZE * 2), %rax > + COND_VZEROUPPER > + subq %rcx, %rax > + cmpq %rax, %rdi > + ja L(zero_2) > + ret > > - vpcmpeqb (%rdi), %ymm0, %ymm1 > > - movl $1, %edx > - /* Support rdx << 32. */ > - salq %cl, %rdx > - subq $1, %rdx > + .p2align 4,, 11 > +L(ret_vec_x2): > + /* ecx must be non-zero. */ > + bsrl %ecx, %ecx > + leaq (VEC_SIZE * -3 + 1)(%rcx, %rax), %rax > + VZEROUPPER_RETURN > > - vpmovmskb %ymm1, %eax > + .p2align 4,, 14 > +L(ret_vec_x3): > + /* ecx must be non-zero. */ > + bsrl %ecx, %ecx > + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax > + VZEROUPPER_RETURN > > - /* Remove the trailing bytes. */ > - andl %edx, %eax > - testl %eax, %eax > - jz L(zero) > > - bsrl %eax, %eax > - addq %rdi, %rax > - VZEROUPPER_RETURN > > .p2align 4 > -L(last_vec_or_less): > - addl $VEC_SIZE, %edx > +L(more_4x_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x2) > > - /* Check for zero length. */ > - testl %edx, %edx > - jz L(null) > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > > - movl %edi, %ecx > - andl $(VEC_SIZE - 1), %ecx > - jz L(last_vec_or_less_aligned) > + testl %ecx, %ecx > + jnz L(ret_vec_x3) > > - movl %ecx, %esi > - movl %ecx, %r8d > - addl %edx, %esi > - andq $-VEC_SIZE, %rdi > + /* Check if near end before re-aligning (otherwise might do an > + unnecissary loop iteration). */ > + addq $-(VEC_SIZE * 4), %rax > + cmpq $(VEC_SIZE * 4), %rdx > + jbe L(last_4x_vec) > > - subl $VEC_SIZE, %esi > - ja L(last_vec_2x_aligned) > + /* Align rax to (VEC_SIZE - 1). */ > + orq $(VEC_SIZE * 4 - 1), %rax > + movq %rdi, %rdx > + /* Get endptr for loop in rdx. 
NB: Can't just do while rax > rdi because > + lengths that overflow can be valid and break the comparison. */ > + orq $(VEC_SIZE * 4 - 1), %rdx > > - /* Check the last VEC. */ > - vpcmpeqb (%rdi), %ymm0, %ymm1 > - vpmovmskb %ymm1, %eax > - > - /* Remove the leading and trailing bytes. */ > - sarl %cl, %eax > - movl %edx, %ecx > + .p2align 4 > +L(loop_4x_vec): > + /* Need this comparison next no matter what. */ > + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm2 > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm3 > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm4 > > - movl $1, %edx > - sall %cl, %edx > - subl $1, %edx > + vpor %ymm1, %ymm2, %ymm2 > + vpor %ymm3, %ymm4, %ymm4 > + vpor %ymm2, %ymm4, %ymm4 > + vpmovmskb %ymm4, %esi > > - andl %edx, %eax > - testl %eax, %eax > - jz L(zero) > + testl %esi, %esi > + jnz L(loop_end) > > - bsrl %eax, %eax > - addq %rdi, %rax > - addq %r8, %rax > - VZEROUPPER_RETURN > + addq $(VEC_SIZE * -4), %rax > + cmpq %rdx, %rax > + jne L(loop_4x_vec) > > - .p2align 4 > -L(last_vec_2x_aligned): > - movl %esi, %ecx > + subl %edi, %edx > + incl %edx > > - /* Check the last VEC. */ > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm1 > +L(last_4x_vec): > + /* Used no matter what. */ > + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > > - movl $1, %edx > - sall %cl, %edx > - subl $1, %edx > + cmpl $(VEC_SIZE * 2), %edx > + jbe L(last_2x_vec) > > - vpmovmskb %ymm1, %eax > + testl %ecx, %ecx > + jnz L(ret_vec_x0_end) > > - /* Remove the trailing bytes. */ > - andl %edx, %eax > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x1_end) > > - testl %eax, %eax > - jnz L(last_vec_x1) > + /* Used no matter what. */ > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > > - /* Check the second last VEC. */ > - vpcmpeqb (%rdi), %ymm0, %ymm1 > + cmpl $(VEC_SIZE * 3), %edx > + ja L(last_vec) > + > + lzcntl %ecx, %ecx > + subq $(VEC_SIZE * 2), %rax > + COND_VZEROUPPER > + subq %rcx, %rax > + cmpq %rax, %rdi > + jbe L(ret0) > + xorl %eax, %eax > +L(ret0): > + ret > > - movl %r8d, %ecx > > - vpmovmskb %ymm1, %eax > + .p2align 4 > +L(loop_end): > + vpmovmskb %ymm1, %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x0_end) > + > + vpmovmskb %ymm2, %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x1_end) > + > + vpmovmskb %ymm3, %ecx > + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) > + then it won't affect the result in esi (VEC4). If ecx is non-zero > + then CHAR in VEC3 and bsrq will use that position. */ > + salq $32, %rcx > + orq %rsi, %rcx > + bsrq %rcx, %rcx > + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax > + VZEROUPPER_RETURN > > - /* Remove the leading bytes. Must use unsigned right shift for > - bsrl below. */ > - shrl %cl, %eax > - testl %eax, %eax > - jz L(zero) > + .p2align 4,, 4 > +L(ret_vec_x1_end): > + /* 64-bit version will automatically add 32 (VEC_SIZE). */ > + lzcntq %rcx, %rcx > + subq %rcx, %rax > + VZEROUPPER_RETURN > > - bsrl %eax, %eax > - addq %rdi, %rax > - addq %r8, %rax > + .p2align 4,, 4 > +L(ret_vec_x0_end): > + lzcntl %ecx, %ecx > + subq %rcx, %rax > VZEROUPPER_RETURN > -END (MEMRCHR) > + > + /* 2 bytes until next cache line. */ > +END(MEMRCHR) > #endif > -- > 2.34.1 > -- H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
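One point worth spelling out from the quoted page-cross code: the `notl`/`shlxl` pair computes the shift count 31 - ((end - 1) % 32) via the identity -x == ~(x - 1), so the last in-bounds byte lands in bit 31 and everything past the end is shifted out. A C rendering (a sketch; the helper name is hypothetical):

```c
#include <stdint.h>

/* mask is the vpmovmskb result for the aligned 32-byte window
   containing the last valid byte at address end_minus_1.  After the
   shift, bit 31 is that last byte and out-of-bounds bytes are gone, so
   a following lzcnt counts straight back from end_minus_1.  */
static inline uint32_t
page_cross_mask_sketch (uint32_t mask, uintptr_t end_minus_1)
{
  /* Mirrors notl + shlxl; the shift instruction only uses the count
     mod 32 for a 32-bit operand.  */
  unsigned shift = (unsigned) ~end_minus_1 & 31;
  return mask << shift;
}
```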
* Re: [PATCH v4 6/8] x86: Optimize memrchr-avx2.S 2022-06-07 2:35 ` H.J. Lu @ 2022-06-07 4:06 ` Noah Goldstein 0 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:06 UTC (permalink / raw) To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 7:35 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Mon, Jun 6, 2022 at 3:37 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > The new code: > > 1. prioritizes smaller user-arg lengths more. > > 2. optimizes target placement more carefully > > 3. reuses logic more > > 4. fixes up various inefficiencies in the logic. The biggest > > case here is the `lzcnt` logic for checking returns which > > saves either a branch or multiple instructions. > > > > The total code size saving is: 306 bytes > > Geometric Mean of all benchmarks New / Old: 0.760 > > > > Regressions: > > There are some regressions. Particularly where the length (user arg > > length) is large but the position of the match char is near the > > begining of the string (in first VEC). This case has roughly a Fixed this in V5 > > 10-20% regression. > > > > This is because the new logic gives the hot path for immediate matches > > to shorter lengths (the more common input). This case has roughly > > a 15-45% speedup. > > > > Full xcheck passes on x86_64. > > --- > > sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S | 1 + > > sysdeps/x86_64/multiarch/memrchr-avx2.S | 538 ++++++++++---------- > > 2 files changed, 260 insertions(+), 279 deletions(-) > > > > diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > > index cea2d2a72d..5e9beeeef2 100644 > > --- a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > > +++ b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > > @@ -2,6 +2,7 @@ > > # define MEMRCHR __memrchr_avx2_rtm > > #endif > > > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > > > diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2.S b/sysdeps/x86_64/multiarch/memrchr-avx2.S > > index ba2ce7cb03..6915e1c373 100644 > > --- a/sysdeps/x86_64/multiarch/memrchr-avx2.S > > +++ b/sysdeps/x86_64/multiarch/memrchr-avx2.S > > @@ -21,340 +21,320 @@ > > # include <sysdep.h> > > > > # ifndef MEMRCHR > > -# define MEMRCHR __memrchr_avx2 > > +# define MEMRCHR __memrchr_avx2 > > # endif > > > > # ifndef VZEROUPPER > > -# define VZEROUPPER vzeroupper > > +# define VZEROUPPER vzeroupper > > # endif > > > > +// abf-off > > # ifndef SECTION > > # define SECTION(p) p##.avx > > # endif > > +// abf-on > > What are the above changes for? Removed in V5 (directive for auto-formatter). > > > +# define VEC_SIZE 32 > > +# define PAGE_SIZE 4096 > > + .section SECTION(.text), "ax", @progbits > > +ENTRY(MEMRCHR) > > +# ifdef __ILP32__ > > + /* Clear upper bits. */ > > + and %RDX_LP, %RDX_LP > > +# else > > + test %RDX_LP, %RDX_LP > > +# endif > > + jz L(zero_0) > > > > -# define VEC_SIZE 32 > > - > > - .section SECTION(.text),"ax",@progbits > > -ENTRY (MEMRCHR) > > - /* Broadcast CHAR to YMM0. */ > > vmovd %esi, %xmm0 > > - vpbroadcastb %xmm0, %ymm0 > > - > > - sub $VEC_SIZE, %RDX_LP > > - jbe L(last_vec_or_less) > > - > > - add %RDX_LP, %RDI_LP > > + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a > > + correct page cross check and 2) it correctly sets up end ptr to be > > + subtract by lzcnt aligned. */ > > + leaq -1(%rdx, %rdi), %rax > > > > - /* Check the last VEC_SIZE bytes. 
*/ > > - vpcmpeqb (%rdi), %ymm0, %ymm1 > > - vpmovmskb %ymm1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x0) > > - > > - subq $(VEC_SIZE * 4), %rdi > > - movl %edi, %ecx > > - andl $(VEC_SIZE - 1), %ecx > > - jz L(aligned_more) > > + vpbroadcastb %xmm0, %ymm0 > > > > - /* Align data for aligned loads in the loop. */ > > - addq $VEC_SIZE, %rdi > > - addq $VEC_SIZE, %rdx > > - andq $-VEC_SIZE, %rdi > > - subq %rcx, %rdx > > + /* Check if we can load 1x VEC without cross a page. */ > > + testl $(PAGE_SIZE - VEC_SIZE), %eax > > + jz L(page_cross) > > + > > + vpcmpeqb -(VEC_SIZE - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > + cmpq $VEC_SIZE, %rdx > > + ja L(more_1x_vec) > > + > > +L(ret_vec_x0_test): > > + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which > > + will gurantee edx (len) is less than it. */ > > + lzcntl %ecx, %ecx > > + > > + /* Hoist vzeroupper (not great for RTM) to save code size. This allows > > + all logic for edx (len) <= VEC_SIZE to fit in first cache line. */ > > + COND_VZEROUPPER > > + cmpl %ecx, %edx > > + jle L(zero_0) > > + subq %rcx, %rax > > + ret > > > > - .p2align 4 > > -L(aligned_more): > > - subq $(VEC_SIZE * 4), %rdx > > - jbe L(last_4x_vec_or_less) > > - > > - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time > > - since data is only aligned to VEC_SIZE. */ > > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > > - vpmovmskb %ymm1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3) > > - > > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 > > - vpmovmskb %ymm2, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x2) > > - > > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 > > - vpmovmskb %ymm3, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x1) > > - > > - vpcmpeqb (%rdi), %ymm0, %ymm4 > > - vpmovmskb %ymm4, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x0) > > - > > - /* Align data to 4 * VEC_SIZE for loop with fewer branches. > > - There are some overlaps with above if data isn't aligned > > - to 4 * VEC_SIZE. */ > > - movl %edi, %ecx > > - andl $(VEC_SIZE * 4 - 1), %ecx > > - jz L(loop_4x_vec) > > - > > - addq $(VEC_SIZE * 4), %rdi > > - addq $(VEC_SIZE * 4), %rdx > > - andq $-(VEC_SIZE * 4), %rdi > > - subq %rcx, %rdx > > + /* Fits in aligning bytes of first cache line. */ > > +L(zero_0): > > + xorl %eax, %eax > > + ret > > > > - .p2align 4 > > -L(loop_4x_vec): > > - /* Compare 4 * VEC at a time forward. */ > > - subq $(VEC_SIZE * 4), %rdi > > - subq $(VEC_SIZE * 4), %rdx > > - jbe L(last_4x_vec_or_less) > > - > > - vmovdqa (%rdi), %ymm1 > > - vmovdqa VEC_SIZE(%rdi), %ymm2 > > - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm3 > > - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm4 > > - > > - vpcmpeqb %ymm1, %ymm0, %ymm1 > > - vpcmpeqb %ymm2, %ymm0, %ymm2 > > - vpcmpeqb %ymm3, %ymm0, %ymm3 > > - vpcmpeqb %ymm4, %ymm0, %ymm4 > > - > > - vpor %ymm1, %ymm2, %ymm5 > > - vpor %ymm3, %ymm4, %ymm6 > > - vpor %ymm5, %ymm6, %ymm5 > > - > > - vpmovmskb %ymm5, %eax > > - testl %eax, %eax > > - jz L(loop_4x_vec) > > - > > - /* There is a match. 
*/ > > - vpmovmskb %ymm4, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3) > > - > > - vpmovmskb %ymm3, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x2) > > - > > - vpmovmskb %ymm2, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x1) > > - > > - vpmovmskb %ymm1, %eax > > - bsrl %eax, %eax > > - addq %rdi, %rax > > + .p2align 4,, 9 > > +L(ret_vec_x0): > > + lzcntl %ecx, %ecx > > + subq %rcx, %rax > > L(return_vzeroupper): > > ZERO_UPPER_VEC_REGISTERS_RETURN > > > > - .p2align 4 > > -L(last_4x_vec_or_less): > > - addl $(VEC_SIZE * 4), %edx > > - cmpl $(VEC_SIZE * 2), %edx > > - jbe L(last_2x_vec) > > - > > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > > - vpmovmskb %ymm1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3) > > - > > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 > > - vpmovmskb %ymm2, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x2) > > - > > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 > > - vpmovmskb %ymm3, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x1_check) > > - cmpl $(VEC_SIZE * 3), %edx > > - jbe L(zero) > > - > > - vpcmpeqb (%rdi), %ymm0, %ymm4 > > - vpmovmskb %ymm4, %eax > > - testl %eax, %eax > > - jz L(zero) > > - bsrl %eax, %eax > > - subq $(VEC_SIZE * 4), %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addq %rdi, %rax > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > + .p2align 4,, 10 > > +L(more_1x_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0) > > + > > + /* Align rax (string pointer). */ > > + andq $-VEC_SIZE, %rax > > + > > + /* Recompute remaining length after aligning. */ > > + movq %rax, %rdx > > + /* Need this comparison next no matter what. */ > > + vpcmpeqb -(VEC_SIZE)(%rax), %ymm0, %ymm1 > > + subq %rdi, %rdx > > + decq %rax > > + vpmovmskb %ymm1, %ecx > > + /* Fall through for short (hotter than length). */ > > + cmpq $(VEC_SIZE * 2), %rdx > > + ja L(more_2x_vec) > > L(last_2x_vec): > > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > > - vpmovmskb %ymm1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3_check) > > cmpl $VEC_SIZE, %edx > > - jbe L(zero) > > - > > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1 > > - vpmovmskb %ymm1, %eax > > - testl %eax, %eax > > - jz L(zero) > > - bsrl %eax, %eax > > - subq $(VEC_SIZE * 2), %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addl $(VEC_SIZE * 2), %eax > > - addq %rdi, %rax > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(last_vec_x0): > > - bsrl %eax, %eax > > - addq %rdi, %rax > > - VZEROUPPER_RETURN > > + jbe L(ret_vec_x0_test) > > + > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0) > > + > > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > + /* 64-bit lzcnt. This will naturally add 32 to position. */ > > + lzcntq %rcx, %rcx > > + COND_VZEROUPPER > > + cmpl %ecx, %edx > > + jle L(zero_0) > > + subq %rcx, %rax > > + ret > > > > - .p2align 4 > > -L(last_vec_x1): > > - bsrl %eax, %eax > > - addl $VEC_SIZE, %eax > > - addq %rdi, %rax > > - VZEROUPPER_RETURN > > > > - .p2align 4 > > -L(last_vec_x2): > > - bsrl %eax, %eax > > - addl $(VEC_SIZE * 2), %eax > > - addq %rdi, %rax > > + /* Inexpensive place to put this regarding code size / target alignments > > + / ICache NLP. Necessary for 2-byte encoding of jump to page cross > > + case which inturn in necessray for hot path (len <= VEC_SIZE) to fit > in turn Fixed in V5. > > + in first cache line. 
*/ > > +L(page_cross): > > + movq %rax, %rsi > > + andq $-VEC_SIZE, %rsi > > + vpcmpeqb (%rsi), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > + /* Shift out negative alignment (because we are starting from endptr and > > + working backwards). */ > > + movl %eax, %r8d > > + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ > > + notl %r8d > > + shlxl %r8d, %ecx, %ecx > > + cmpq %rdi, %rsi > > + ja L(more_1x_vec) > > + lzcntl %ecx, %ecx > > + COND_VZEROUPPER > > + cmpl %ecx, %edx > > + jle L(zero_0) > > + subq %rcx, %rax > > + ret > > + .p2align 4,, 11 > > +L(ret_vec_x1): > > + /* This will naturally add 32 to position. */ > > + lzcntq %rcx, %rcx > > + subq %rcx, %rax > > VZEROUPPER_RETURN > > + .p2align 4,, 10 > > +L(more_2x_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0) > > > > - .p2align 4 > > -L(last_vec_x3): > > - bsrl %eax, %eax > > - addl $(VEC_SIZE * 3), %eax > > - addq %rdi, %rax > > - ret > > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1) > > > > - .p2align 4 > > -L(last_vec_x1_check): > > - bsrl %eax, %eax > > - subq $(VEC_SIZE * 3), %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addl $VEC_SIZE, %eax > > - addq %rdi, %rax > > - VZEROUPPER_RETURN > > > > - .p2align 4 > > -L(last_vec_x3_check): > > - bsrl %eax, %eax > > - subq $VEC_SIZE, %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addl $(VEC_SIZE * 3), %eax > > - addq %rdi, %rax > > - VZEROUPPER_RETURN > > + /* Needed no matter what. */ > > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > > > - .p2align 4 > > -L(zero): > > - xorl %eax, %eax > > - VZEROUPPER_RETURN > > + subq $(VEC_SIZE * 4), %rdx > > + ja L(more_4x_vec) > > + > > + cmpl $(VEC_SIZE * -1), %edx > > + jle L(ret_vec_x2_test) > > + > > +L(last_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x2) > > + > > + /* Needed no matter what. */ > > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > + lzcntl %ecx, %ecx > > + subq $(VEC_SIZE * 3), %rax > > + COND_VZEROUPPER > > + subq %rcx, %rax > > + cmpq %rax, %rdi > > + ja L(zero_2) > > + ret > > > > - .p2align 4 > > -L(null): > > + /* First in aligning bytes. */ > > +L(zero_2): > > xorl %eax, %eax > > ret > > > > - .p2align 4 > > -L(last_vec_or_less_aligned): > > - movl %edx, %ecx > > + .p2align 4,, 4 > > +L(ret_vec_x2_test): > > + lzcntl %ecx, %ecx > > + subq $(VEC_SIZE * 2), %rax > > + COND_VZEROUPPER > > + subq %rcx, %rax > > + cmpq %rax, %rdi > > + ja L(zero_2) > > + ret > > > > - vpcmpeqb (%rdi), %ymm0, %ymm1 > > > > - movl $1, %edx > > - /* Support rdx << 32. */ > > - salq %cl, %rdx > > - subq $1, %rdx > > + .p2align 4,, 11 > > +L(ret_vec_x2): > > + /* ecx must be non-zero. */ > > + bsrl %ecx, %ecx > > + leaq (VEC_SIZE * -3 + 1)(%rcx, %rax), %rax > > + VZEROUPPER_RETURN > > > > - vpmovmskb %ymm1, %eax > > + .p2align 4,, 14 > > +L(ret_vec_x3): > > + /* ecx must be non-zero. */ > > + bsrl %ecx, %ecx > > + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax > > + VZEROUPPER_RETURN > > > > - /* Remove the trailing bytes. */ > > - andl %edx, %eax > > - testl %eax, %eax > > - jz L(zero) > > > > - bsrl %eax, %eax > > - addq %rdi, %rax > > - VZEROUPPER_RETURN > > > > .p2align 4 > > -L(last_vec_or_less): > > - addl $VEC_SIZE, %edx > > +L(more_4x_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x2) > > > > - /* Check for zero length. 
*/ > > - testl %edx, %edx > > - jz L(null) > > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > > > - movl %edi, %ecx > > - andl $(VEC_SIZE - 1), %ecx > > - jz L(last_vec_or_less_aligned) > > + testl %ecx, %ecx > > + jnz L(ret_vec_x3) > > > > - movl %ecx, %esi > > - movl %ecx, %r8d > > - addl %edx, %esi > > - andq $-VEC_SIZE, %rdi > > + /* Check if near end before re-aligning (otherwise might do an > > + unnecissary loop iteration). */ > > + addq $-(VEC_SIZE * 4), %rax > > + cmpq $(VEC_SIZE * 4), %rdx > > + jbe L(last_4x_vec) > > > > - subl $VEC_SIZE, %esi > > - ja L(last_vec_2x_aligned) > > + /* Align rax to (VEC_SIZE - 1). */ > > + orq $(VEC_SIZE * 4 - 1), %rax > > + movq %rdi, %rdx > > + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because > > + lengths that overflow can be valid and break the comparison. */ > > + orq $(VEC_SIZE * 4 - 1), %rdx > > > > - /* Check the last VEC. */ > > - vpcmpeqb (%rdi), %ymm0, %ymm1 > > - vpmovmskb %ymm1, %eax > > - > > - /* Remove the leading and trailing bytes. */ > > - sarl %cl, %eax > > - movl %edx, %ecx > > + .p2align 4 > > +L(loop_4x_vec): > > + /* Need this comparison next no matter what. */ > > + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 > > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm2 > > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm3 > > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm4 > > > > - movl $1, %edx > > - sall %cl, %edx > > - subl $1, %edx > > + vpor %ymm1, %ymm2, %ymm2 > > + vpor %ymm3, %ymm4, %ymm4 > > + vpor %ymm2, %ymm4, %ymm4 > > + vpmovmskb %ymm4, %esi > > > > - andl %edx, %eax > > - testl %eax, %eax > > - jz L(zero) > > + testl %esi, %esi > > + jnz L(loop_end) > > > > - bsrl %eax, %eax > > - addq %rdi, %rax > > - addq %r8, %rax > > - VZEROUPPER_RETURN > > + addq $(VEC_SIZE * -4), %rax > > + cmpq %rdx, %rax > > + jne L(loop_4x_vec) > > > > - .p2align 4 > > -L(last_vec_2x_aligned): > > - movl %esi, %ecx > > + subl %edi, %edx > > + incl %edx > > > > - /* Check the last VEC. */ > > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm1 > > +L(last_4x_vec): > > + /* Used no matter what. */ > > + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > > > - movl $1, %edx > > - sall %cl, %edx > > - subl $1, %edx > > + cmpl $(VEC_SIZE * 2), %edx > > + jbe L(last_2x_vec) > > > > - vpmovmskb %ymm1, %eax > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0_end) > > > > - /* Remove the trailing bytes. */ > > - andl %edx, %eax > > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1_end) > > > > - testl %eax, %eax > > - jnz L(last_vec_x1) > > + /* Used no matter what. */ > > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > > > - /* Check the second last VEC. */ > > - vpcmpeqb (%rdi), %ymm0, %ymm1 > > + cmpl $(VEC_SIZE * 3), %edx > > + ja L(last_vec) > > + > > + lzcntl %ecx, %ecx > > + subq $(VEC_SIZE * 2), %rax > > + COND_VZEROUPPER > > + subq %rcx, %rax > > + cmpq %rax, %rdi > > + jbe L(ret0) > > + xorl %eax, %eax > > +L(ret0): > > + ret > > > > - movl %r8d, %ecx > > > > - vpmovmskb %ymm1, %eax > > + .p2align 4 > > +L(loop_end): > > + vpmovmskb %ymm1, %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0_end) > > + > > + vpmovmskb %ymm2, %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1_end) > > + > > + vpmovmskb %ymm3, %ecx > > + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) > > + then it won't affect the result in esi (VEC4). 
If ecx is non-zero > > + then CHAR in VEC3 and bsrq will use that position. */ > > + salq $32, %rcx > > + orq %rsi, %rcx > > + bsrq %rcx, %rcx > > + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax > > + VZEROUPPER_RETURN > > > > - /* Remove the leading bytes. Must use unsigned right shift for > > - bsrl below. */ > > - shrl %cl, %eax > > - testl %eax, %eax > > - jz L(zero) > > + .p2align 4,, 4 > > +L(ret_vec_x1_end): > > + /* 64-bit version will automatically add 32 (VEC_SIZE). */ > > + lzcntq %rcx, %rcx > > + subq %rcx, %rax > > + VZEROUPPER_RETURN > > > > - bsrl %eax, %eax > > - addq %rdi, %rax > > - addq %r8, %rax > > + .p2align 4,, 4 > > +L(ret_vec_x0_end): > > + lzcntl %ecx, %ecx > > + subq %rcx, %rax > > VZEROUPPER_RETURN > > -END (MEMRCHR) > > + > > + /* 2 bytes until next cache line. */ > > +END(MEMRCHR) > > #endif > > -- > > 2.34.1 > > > > > -- > H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
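The `salq $32`/`orq`/`bsrq` combine that this exchange quotes can be modeled as below (sketch only; in the real loop the low word is the OR of all four vectors' matches, which is safe because vectors 1 and 2 were already tested and the high word takes priority):

```c
#include <stdint.h>

/* mask3/mask4 are vpmovmskb results for the 3rd and 4th vectors of the
   4x loop; the scan runs backward, so mask3 covers higher addresses
   and goes in the high 32 bits.  One bsr-style scan then finds the
   last match across both vectors.  The caller guarantees at least one
   bit is set (the loop only exits on a match).  */
static inline unsigned
last_match_bit_sketch (uint32_t mask3, uint32_t mask4)
{
  uint64_t combined = ((uint64_t) mask3 << 32) | mask4;
  return 63 - (unsigned) __builtin_clzll (combined);
}
```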
* [PATCH v4 7/8] x86: Shrink code size of memchr-avx2.S 2022-06-06 22:37 ` [PATCH v4 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (4 preceding siblings ...) 2022-06-06 22:37 ` [PATCH v4 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein @ 2022-06-06 22:37 ` Noah Goldstein 2022-06-06 22:37 ` [PATCH v4 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-06 22:37 UTC (permalink / raw) To: libc-alpha This is not meant as a performance optimization. The previous code was far too liberal in aligning targets and wasted code size unnecessarily. The total code size saving is: 59 bytes There are no major changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 0.967 Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memchr-avx2-rtm.S | 1 + sysdeps/x86_64/multiarch/memchr-avx2.S | 109 +++++++++++---------- 2 files changed, 60 insertions(+), 50 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S index 87b076c7c4..c4d71938c5 100644 --- a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S +++ b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S @@ -2,6 +2,7 @@ # define MEMCHR __memchr_avx2_rtm #endif +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/multiarch/memchr-avx2.S b/sysdeps/x86_64/multiarch/memchr-avx2.S index 75bd7262e0..28a01280ec 100644 --- a/sysdeps/x86_64/multiarch/memchr-avx2.S +++ b/sysdeps/x86_64/multiarch/memchr-avx2.S @@ -57,7 +57,7 @@ # define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) .section SECTION(.text),"ax",@progbits -ENTRY (MEMCHR) +ENTRY_P2ALIGN (MEMCHR, 5) # ifndef USE_AS_RAWMEMCHR /* Check for zero length. */ # ifdef __ILP32__ @@ -87,12 +87,14 @@ ENTRY (MEMCHR) # endif testl %eax, %eax jz L(aligned_more) - tzcntl %eax, %eax + bsfl %eax, %eax addq %rdi, %rax - VZEROUPPER_RETURN +L(return_vzeroupper): + ZERO_UPPER_VEC_REGISTERS_RETURN + # ifndef USE_AS_RAWMEMCHR - .p2align 5 + .p2align 4 L(first_vec_x0): /* Check if first match was before length. */ tzcntl %eax, %eax @@ -100,58 +102,31 @@ L(first_vec_x0): /* NB: Multiply length by 4 to get byte count. */ sall $2, %edx # endif - xorl %ecx, %ecx + COND_VZEROUPPER + /* Use branch instead of cmovcc so L(first_vec_x0) fits in one fetch + block. branch here as opposed to cmovcc is not that costly. Common + usage of memchr is to check if the return was NULL (if string was + known to contain CHAR user would use rawmemchr). This branch will be + highly correlated with the user branch and can be used by most + modern branch predictors to predict the user branch. */ cmpl %eax, %edx - leaq (%rdi, %rax), %rax - cmovle %rcx, %rax - VZEROUPPER_RETURN - -L(null): - xorl %eax, %eax - ret -# endif - .p2align 4 -L(cross_page_boundary): - /* Save pointer before aligning as its original value is - necessary for computer return address if byte is found or - adjusting length if it is not and this is memchr. */ - movq %rdi, %rcx - /* Align data to VEC_SIZE - 1. ALGN_PTR_REG is rcx for memchr - and rdi for rawmemchr. */ - orq $(VEC_SIZE - 1), %ALGN_PTR_REG - VPCMPEQ -(VEC_SIZE - 1)(%ALGN_PTR_REG), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax -# ifndef USE_AS_RAWMEMCHR - /* Calculate length until end of page (length checked for a - match). */ - leaq 1(%ALGN_PTR_REG), %rsi - subq %RRAW_PTR_REG, %rsi -# ifdef USE_AS_WMEMCHR - /* NB: Divide bytes by 4 to get wchar_t count.
*/ - shrl $2, %esi -# endif -# endif - /* Remove the leading bytes. */ - sarxl %ERAW_PTR_REG, %eax, %eax -# ifndef USE_AS_RAWMEMCHR - /* Check the end of data. */ - cmpq %rsi, %rdx - jbe L(first_vec_x0) + jle L(null) + addq %rdi, %rax + ret # endif - testl %eax, %eax - jz L(cross_page_continue) - tzcntl %eax, %eax - addq %RRAW_PTR_REG, %rax -L(return_vzeroupper): - ZERO_UPPER_VEC_REGISTERS_RETURN - .p2align 4 + .p2align 4,, 10 L(first_vec_x1): - tzcntl %eax, %eax + bsfl %eax, %eax incq %rdi addq %rdi, %rax VZEROUPPER_RETURN - +# ifndef USE_AS_RAWMEMCHR + /* First in aligning bytes here. */ +L(null): + xorl %eax, %eax + ret +# endif .p2align 4 L(first_vec_x2): tzcntl %eax, %eax @@ -340,7 +315,7 @@ L(first_vec_x1_check): incq %rdi addq %rdi, %rax VZEROUPPER_RETURN - .p2align 4 + .p2align 4,, 6 L(set_zero_end): xorl %eax, %eax VZEROUPPER_RETURN @@ -428,5 +403,39 @@ L(last_vec_x3): VZEROUPPER_RETURN # endif + .p2align 4 +L(cross_page_boundary): + /* Save pointer before aligning as its original value is necessary for + computer return address if byte is found or adjusting length if it + is not and this is memchr. */ + movq %rdi, %rcx + /* Align data to VEC_SIZE. ALGN_PTR_REG is rcx for memchr and rdi for + rawmemchr. */ + andq $-VEC_SIZE, %ALGN_PTR_REG + VPCMPEQ (%ALGN_PTR_REG), %ymm0, %ymm1 + vpmovmskb %ymm1, %eax +# ifndef USE_AS_RAWMEMCHR + /* Calculate length until end of page (length checked for a match). */ + leal VEC_SIZE(%ALGN_PTR_REG), %esi + subl %ERAW_PTR_REG, %esi +# ifdef USE_AS_WMEMCHR + /* NB: Divide bytes by 4 to get wchar_t count. */ + shrl $2, %esi +# endif +# endif + /* Remove the leading bytes. */ + sarxl %ERAW_PTR_REG, %eax, %eax +# ifndef USE_AS_RAWMEMCHR + /* Check the end of data. */ + cmpq %rsi, %rdx + jbe L(first_vec_x0) +# endif + testl %eax, %eax + jz L(cross_page_continue) + bsfl %eax, %eax + addq %RRAW_PTR_REG, %rax + VZEROUPPER_RETURN + + END (MEMCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
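The cmov-to-branch change in L(first_vec_x0) above boils down to the following C shape (names are illustrative; signed-compare details glossed over): the branch costs little because it tends to predict together with the caller's own NULL check. The tzcnt-to-bsf swaps in this patch rely on the mask being known non-zero; bsf is tzcnt without the F3 prefix, saving a byte per use.

```c
#include <stddef.h>

/* match_idx is the bsf result within the first vector; len is the
   remaining length, already scaled to bytes for wmemchr.  */
static inline void *
memchr_tail_branch_sketch (unsigned char *base, unsigned match_idx,
			   unsigned len)
{
  if (len <= match_idx)		/* mirrors "cmpl %eax, %edx; jle L(null)" */
    return NULL;
  return base + match_idx;
}
```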
* [PATCH v4 8/8] x86: Shrink code size of memchr-evex.S 2022-06-06 22:37 ` [PATCH v4 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (5 preceding siblings ...) 2022-06-06 22:37 ` [PATCH v4 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein @ 2022-06-06 22:37 ` Noah Goldstein 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-06 22:37 UTC (permalink / raw) To: libc-alpha This is not meant as a performance optimization. The previous code was far too liberal in aligning targets and wasted code size unnecessarily. The total code size saving is: 64 bytes There are no non-negligible changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 1.000 Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memchr-evex.S | 46 ++++++++++++++------------ 1 file changed, 25 insertions(+), 21 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memchr-evex.S b/sysdeps/x86_64/multiarch/memchr-evex.S index cfaf02907d..0fd11b7632 100644 --- a/sysdeps/x86_64/multiarch/memchr-evex.S +++ b/sysdeps/x86_64/multiarch/memchr-evex.S @@ -88,7 +88,7 @@ # define PAGE_SIZE 4096 .section SECTION(.text),"ax",@progbits -ENTRY (MEMCHR) +ENTRY_P2ALIGN (MEMCHR, 6) # ifndef USE_AS_RAWMEMCHR /* Check for zero length. */ test %RDX_LP, %RDX_LP @@ -131,22 +131,24 @@ L(zero): xorl %eax, %eax ret - .p2align 5 + .p2align 4 L(first_vec_x0): - /* Check if first match was before length. */ - tzcntl %eax, %eax - xorl %ecx, %ecx - cmpl %eax, %edx - leaq (%rdi, %rax, CHAR_SIZE), %rax - cmovle %rcx, %rax + /* Check if first match was before length. NB: tzcnt has false data- + dependency on destination. eax already had a data-dependency on esi + so this should have no effect here. */ + tzcntl %eax, %esi +# ifdef USE_AS_WMEMCHR + leaq (%rdi, %rsi, CHAR_SIZE), %rdi +# else + addq %rsi, %rdi +# endif + xorl %eax, %eax + cmpl %esi, %edx + cmovg %rdi, %rax ret -# else - /* NB: first_vec_x0 is 17 bytes which will leave - cross_page_boundary (which is relatively cold) close enough - to ideal alignment. So only realign L(cross_page_boundary) if - rawmemchr. */ - .p2align 4 # endif + + .p2align 4 L(cross_page_boundary): /* Save pointer before aligning as its original value is necessary for computer return address if byte is found or @@ -400,10 +402,14 @@ L(last_2x_vec): L(zero_end): ret +L(set_zero_end): + xorl %eax, %eax + ret .p2align 4 L(first_vec_x1_check): - tzcntl %eax, %eax + /* eax must be non-zero. Use bsfl to save code size. */ + bsfl %eax, %eax /* Adjust length. */ subl $-(CHAR_PER_VEC * 4), %edx /* Check if match within remaining length. */ @@ -412,9 +418,6 @@ L(first_vec_x1_check): /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax ret -L(set_zero_end): - xorl %eax, %eax - ret .p2align 4 L(loop_4x_vec_end): @@ -464,7 +467,7 @@ L(loop_4x_vec_end): # endif ret - .p2align 4 + .p2align 4,, 10 L(last_vec_x1_return): tzcntl %eax, %eax # if defined USE_AS_WMEMCHR || RET_OFFSET != 0 @@ -496,6 +499,7 @@ L(last_vec_x3_return): # endif # ifndef USE_AS_RAWMEMCHR + .p2align 4,, 5 L(last_4x_vec_or_less_cmpeq): VPCMP $0, (VEC_SIZE * 5)(%rdi), %YMMMATCH, %k0 kmovd %k0, %eax @@ -546,7 +550,7 @@ L(last_4x_vec): # endif andl %ecx, %eax jz L(zero_end2) - tzcntl %eax, %eax + bsfl %eax, %eax leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax L(zero_end2): ret @@ -562,6 +566,6 @@ L(last_vec_x3): leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax ret # endif - + /* 7 bytes from next cache line.
*/ END (MEMCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
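For contrast with the avx2 version in 7/8, the EVEX L(first_vec_x0) above keeps the branchless select: %rax is zeroed and cmovg overwrites it only when the length exceeds the match index. Roughly (a sketch with assumed names; the ternary stands in for the cmov):

```c
#include <stddef.h>

static inline void *
memchr_tail_cmov_sketch (unsigned char *base, unsigned match_idx,
			 unsigned len)
{
  unsigned char *hit = base + match_idx;	/* leaq/addq above */
  return len > match_idx ? hit : NULL;		/* xorl + cmovg */
}
```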
* [PATCH v5 1/8] x86: Create header for VEC classes in x86 strings library 2022-06-03 4:42 ` [PATCH v1 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein ` (2 preceding siblings ...) 2022-06-06 22:37 ` [PATCH v4 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein @ 2022-06-07 4:05 ` Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein ` (6 more replies) 2022-06-07 4:11 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 4 siblings, 7 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:05 UTC (permalink / raw) To: libc-alpha This patch does not touch any existing code and is only meant to be a tool for future patches so that simple source files can more easily be maintained to target multiple VEC classes. There is no difference in the objdump of libc.so before and after this patch. --- sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 34 ++++++++ sysdeps/x86_64/multiarch/avx-vecs.h | 47 +++++++++++ sysdeps/x86_64/multiarch/evex-vecs-common.h | 39 +++++++++ sysdeps/x86_64/multiarch/evex256-vecs.h | 35 ++++++++ sysdeps/x86_64/multiarch/evex512-vecs.h | 35 ++++++++ sysdeps/x86_64/multiarch/sse2-vecs.h | 47 +++++++++++ sysdeps/x86_64/multiarch/vec-macros.h | 90 +++++++++++++++++++++ 7 files changed, 327 insertions(+) create mode 100644 sysdeps/x86_64/multiarch/avx-rtm-vecs.h create mode 100644 sysdeps/x86_64/multiarch/avx-vecs.h create mode 100644 sysdeps/x86_64/multiarch/evex-vecs-common.h create mode 100644 sysdeps/x86_64/multiarch/evex256-vecs.h create mode 100644 sysdeps/x86_64/multiarch/evex512-vecs.h create mode 100644 sysdeps/x86_64/multiarch/sse2-vecs.h create mode 100644 sysdeps/x86_64/multiarch/vec-macros.h diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h new file mode 100644 index 0000000000..3f531dd47f --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h @@ -0,0 +1,34 @@ +/* Common config for AVX-RTM VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX_RTM_VECS_H +#define _AVX_RTM_VECS_H 1 + +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST + +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) + +#define USE_WITH_RTM 1 +#include "avx-vecs.h" + +#undef SECTION +#define SECTION(p) p##.avx.rtm + +#endif diff --git a/sysdeps/x86_64/multiarch/avx-vecs.h b/sysdeps/x86_64/multiarch/avx-vecs.h new file mode 100644 index 0000000000..89680f5db8 --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx-vecs.h @@ -0,0 +1,47 @@ +/* Common config for AVX VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX_VECS_H +#define _AVX_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 32 +#include "vec-macros.h" + +#define USE_WITH_AVX 1 +#define SECTION(p) p##.avx + +/* 4-byte mov instructions with AVX2. */ +#define MOV_SIZE 4 +/* 1 (ret) + 3 (vzeroupper). */ +#define RET_SIZE 4 +#define VZEROUPPER vzeroupper + +#define VMOVU vmovdqu +#define VMOVA vmovdqa +#define VMOVNT vmovntdq + +/* Often need to access xmm portion. */ +#define VEC_xmm VEC_any_xmm +#define VEC VEC_any_ymm + +#endif diff --git a/sysdeps/x86_64/multiarch/evex-vecs-common.h b/sysdeps/x86_64/multiarch/evex-vecs-common.h new file mode 100644 index 0000000000..99806ebcd7 --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex-vecs-common.h @@ -0,0 +1,39 @@ +/* Common config for EVEX256 and EVEX512 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX_VECS_COMMON_H +#define _EVEX_VECS_COMMON_H 1 + +#include "vec-macros.h" + +/* 6-byte mov instructions with EVEX. */ +#define MOV_SIZE 6 +/* No vzeroupper needed. */ +#define RET_SIZE 1 +#define VZEROUPPER + +#define VMOVU vmovdqu64 +#define VMOVA vmovdqa64 +#define VMOVNT vmovntdq + +#define VEC_xmm VEC_hi_xmm +#define VEC_ymm VEC_hi_ymm +#define VEC_zmm VEC_hi_zmm + +#endif diff --git a/sysdeps/x86_64/multiarch/evex256-vecs.h b/sysdeps/x86_64/multiarch/evex256-vecs.h new file mode 100644 index 0000000000..222ba46dc7 --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex256-vecs.h @@ -0,0 +1,35 @@ +/* Common config for EVEX256 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX256_VECS_H +#define _EVEX256_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 32 +#include "evex-vecs-common.h" + +#define USE_WITH_EVEX256 1 +#define SECTION(p) p##.evex + +#define VEC VEC_ymm + +#endif diff --git a/sysdeps/x86_64/multiarch/evex512-vecs.h b/sysdeps/x86_64/multiarch/evex512-vecs.h new file mode 100644 index 0000000000..d1784d5368 --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex512-vecs.h @@ -0,0 +1,35 @@ +/* Common config for EVEX512 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX512_VECS_H +#define _EVEX512_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 64 +#include "evex-vecs-common.h" + +#define USE_WITH_EVEX512 1 +#define SECTION(p) p##.evex512 + +#define VEC VEC_zmm + +#endif diff --git a/sysdeps/x86_64/multiarch/sse2-vecs.h b/sysdeps/x86_64/multiarch/sse2-vecs.h new file mode 100644 index 0000000000..2b77a59d56 --- /dev/null +++ b/sysdeps/x86_64/multiarch/sse2-vecs.h @@ -0,0 +1,47 @@ +/* Common config for SSE2 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _SSE2_VECS_H +#define _SSE2_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 16 +#include "vec-macros.h" + +#define USE_WITH_SSE2 1 +#define SECTION(p) p + +/* 3-byte mov instructions with SSE2. */ +#define MOV_SIZE 3 +/* No vzeroupper needed. 
*/ +#define RET_SIZE 1 +#define VZEROUPPER + +#define VMOVU movups +#define VMOVA movaps +#define VMOVNT movntdq + +#define VEC_xmm VEC_any_xmm +#define VEC VEC_any_xmm + + +#endif diff --git a/sysdeps/x86_64/multiarch/vec-macros.h b/sysdeps/x86_64/multiarch/vec-macros.h new file mode 100644 index 0000000000..9f3ffecede --- /dev/null +++ b/sysdeps/x86_64/multiarch/vec-macros.h @@ -0,0 +1,90 @@ +/* Macro helpers for VEC_{type}({vec_num}) + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _VEC_MACROS_H +#define _VEC_MACROS_H 1 + +#ifndef VEC_SIZE +# error "Never include this file directly. Always include a vector config." +#endif + +/* Defines so we can use SSE2 / AVX2 / EVEX / EVEX512 encoding with same + VEC(N) values. */ +#define VEC_hi_xmm0 xmm16 +#define VEC_hi_xmm1 xmm17 +#define VEC_hi_xmm2 xmm18 +#define VEC_hi_xmm3 xmm19 +#define VEC_hi_xmm4 xmm20 +#define VEC_hi_xmm5 xmm21 +#define VEC_hi_xmm6 xmm22 +#define VEC_hi_xmm7 xmm23 +#define VEC_hi_xmm8 xmm24 +#define VEC_hi_xmm9 xmm25 +#define VEC_hi_xmm10 xmm26 +#define VEC_hi_xmm11 xmm27 +#define VEC_hi_xmm12 xmm28 +#define VEC_hi_xmm13 xmm29 +#define VEC_hi_xmm14 xmm30 +#define VEC_hi_xmm15 xmm31 + +#define VEC_hi_ymm0 ymm16 +#define VEC_hi_ymm1 ymm17 +#define VEC_hi_ymm2 ymm18 +#define VEC_hi_ymm3 ymm19 +#define VEC_hi_ymm4 ymm20 +#define VEC_hi_ymm5 ymm21 +#define VEC_hi_ymm6 ymm22 +#define VEC_hi_ymm7 ymm23 +#define VEC_hi_ymm8 ymm24 +#define VEC_hi_ymm9 ymm25 +#define VEC_hi_ymm10 ymm26 +#define VEC_hi_ymm11 ymm27 +#define VEC_hi_ymm12 ymm28 +#define VEC_hi_ymm13 ymm29 +#define VEC_hi_ymm14 ymm30 +#define VEC_hi_ymm15 ymm31 + +#define VEC_hi_zmm0 zmm16 +#define VEC_hi_zmm1 zmm17 +#define VEC_hi_zmm2 zmm18 +#define VEC_hi_zmm3 zmm19 +#define VEC_hi_zmm4 zmm20 +#define VEC_hi_zmm5 zmm21 +#define VEC_hi_zmm6 zmm22 +#define VEC_hi_zmm7 zmm23 +#define VEC_hi_zmm8 zmm24 +#define VEC_hi_zmm9 zmm25 +#define VEC_hi_zmm10 zmm26 +#define VEC_hi_zmm11 zmm27 +#define VEC_hi_zmm12 zmm28 +#define VEC_hi_zmm13 zmm29 +#define VEC_hi_zmm14 zmm30 +#define VEC_hi_zmm15 zmm31 + +#define PRIMITIVE_VEC(vec, num) vec##num + +#define VEC_any_xmm(i) PRIMITIVE_VEC(xmm, i) +#define VEC_any_ymm(i) PRIMITIVE_VEC(ymm, i) +#define VEC_any_zmm(i) PRIMITIVE_VEC(zmm, i) + +#define VEC_hi_xmm(i) PRIMITIVE_VEC(VEC_hi_xmm, i) +#define VEC_hi_ymm(i) PRIMITIVE_VEC(VEC_hi_ymm, i) +#define VEC_hi_zmm(i) PRIMITIVE_VEC(VEC_hi_zmm, i) + +#endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
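A quick way to see how the new headers compose (illustration only; assumes compilation from sysdeps/x86_64/multiarch so the include resolves): token pasting in PRIMITIVE_VEC plus the VEC_hi_* tables lets one source file spell VEC(1) and get xmm1, ymm1, or zmm17 depending on which config it includes.

```c
#include "evex512-vecs.h"

#define STR_(x) #x
#define STR(x) STR_(x)

/* Expands to "zmm17": VEC (1) -> VEC_zmm (1) -> VEC_hi_zmm (1)
   -> PRIMITIVE_VEC (VEC_hi_zmm, 1) -> VEC_hi_zmm1 -> zmm17.
   With sse2-vecs.h instead, the same VEC (1) would give "xmm1".  */
const char *vec1_name = STR (VEC (1));
```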
* [PATCH v5 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` 2022-06-07 4:05 ` [PATCH v5 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein @ 2022-06-07 4:05 ` Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein ` (5 subsequent siblings) 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:05 UTC (permalink / raw) To: libc-alpha The RTM vzeroupper mitigation has no way of replacing an inline vzeroupper that is not directly before a return. Being able to do so can be useful when hoisting a vzeroupper to save code size, for example: ``` L(foo): cmpl %eax, %edx jz L(bar) tzcntl %eax, %eax addq %rdi, %rax VZEROUPPER_RETURN L(bar): xorl %eax, %eax VZEROUPPER_RETURN ``` Can become: ``` L(foo): COND_VZEROUPPER cmpl %eax, %edx jz L(bar) tzcntl %eax, %eax addq %rdi, %rax ret L(bar): xorl %eax, %eax ret ``` This code does not change any existing functionality. There is no difference in the objdump of libc.so before and after this patch. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> --- sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 1 + sysdeps/x86_64/sysdep.h | 18 ++++++++++++++++++ 2 files changed, 19 insertions(+) diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h index 3f531dd47f..6ca9f5e6ba 100644 --- a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h @@ -20,6 +20,7 @@ #ifndef _AVX_RTM_VECS_H #define _AVX_RTM_VECS_H 1 +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h index f14d50786d..4f512d5566 100644 --- a/sysdeps/x86_64/sysdep.h +++ b/sysdeps/x86_64/sysdep.h @@ -106,6 +106,24 @@ lose: \ vzeroupper; \ ret +/* Can be used to replace vzeroupper that is not directly before a + return. This is useful when hoisting a vzeroupper from multiple + return paths to decrease the total number of vzerouppers and code + size. */ +#define COND_VZEROUPPER_XTEST \ + xtest; \ + jz 1f; \ + vzeroall; \ + jmp 2f; \ +1: \ + vzeroupper; \ +2: + +/* In RTM define this as COND_VZEROUPPER_XTEST. */ +#ifndef COND_VZEROUPPER +# define COND_VZEROUPPER vzeroupper +#endif + /* Zero upper vector registers and return. */ #ifndef ZERO_UPPER_VEC_REGISTERS_RETURN # define ZERO_UPPER_VEC_REGISTERS_RETURN \ -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
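For readers less familiar with the RTM wrinkle behind COND_VZEROUPPER_XTEST: on the affected processors, vzeroupper inside a transaction can abort it, so the XTEST variant zeroes the registers with vzeroall instead when executing transactionally. A rough C rendering of the macro's logic (an illustration only, not the real macro, which is of course assembly; needs -mavx -mrtm and an RTM-capable CPU, since xtest faults elsewhere):

```
#include <stdio.h>
#include <immintrin.h>

/* C model of COND_VZEROUPPER_XTEST: xtest sets ZF when *not* in a
   transaction, so the jz path runs vzeroupper and the transactional
   fall-through uses vzeroall instead.  */
static inline void cond_vzeroupper (void)
{
  if (_xtest ())          /* Currently in an RTM/HLE region.  */
    _mm256_zeroall ();    /* vzeroall: safe inside a transaction.  */
  else
    _mm256_zeroupper ();  /* vzeroupper: the cheap path otherwise.  */
}

int main (void)
{
  cond_vzeroupper ();     /* Outside a transaction: takes the
                             vzeroupper path.  */
  puts ("ok");
  return 0;
}
```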
* [PATCH v5 3/8] Benchtests: Improve memrchr benchmarks 2022-06-07 4:05 ` [PATCH v5 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein @ 2022-06-07 4:05 ` Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein ` (4 subsequent siblings) 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:05 UTC (permalink / raw) To: libc-alpha Add a second iteration for memrchr to set `pos` starting from the end of the buffer. Previously `pos` was only set relative to the beginning of the buffer, which isn't really useful for memrchr because its search space begins at the end of the buffer (buf + len). --- benchtests/bench-memchr.c | 110 ++++++++++++++++++++++---------------- 1 file changed, 65 insertions(+), 45 deletions(-) diff --git a/benchtests/bench-memchr.c b/benchtests/bench-memchr.c index 4d7212332f..0facda2fa0 100644 --- a/benchtests/bench-memchr.c +++ b/benchtests/bench-memchr.c @@ -76,7 +76,7 @@ do_one_test (json_ctx_t *json_ctx, impl_t *impl, const CHAR *s, int c, static void do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, - int seek_char) + int seek_char, int invert_pos) { size_t i; @@ -96,7 +96,10 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, if (pos < len) { - buf[align + pos] = seek_char; + if (invert_pos) + buf[align + len - pos] = seek_char; + else + buf[align + pos] = seek_char; buf[align + len] = -seek_char; } else @@ -109,6 +112,7 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, json_attr_uint (json_ctx, "pos", pos); json_attr_uint (json_ctx, "len", len); json_attr_uint (json_ctx, "seek_char", seek_char); + json_attr_uint (json_ctx, "invert_pos", invert_pos); json_array_begin (json_ctx, "timings"); @@ -123,6 +127,7 @@ int test_main (void) { size_t i; + int repeats; json_ctx_t json_ctx; test_init (); @@ -142,53 +147,68 @@ test_main (void) json_array_begin (&json_ctx, "results"); - for (i = 1; i < 8; ++i) + for (repeats = 0; repeats < 2; ++repeats) { - do_test (&json_ctx, 0, 16 << i, 2048, 23); - do_test (&json_ctx, i, 64, 256, 23); - do_test (&json_ctx, 0, 16 << i, 2048, 0); - do_test (&json_ctx, i, 64, 256, 0); - - do_test (&json_ctx, getpagesize () - 15, 64, 256, 0); + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, 0, 16 << i, 2048, 23, repeats); + do_test (&json_ctx, i, 64, 256, 23, repeats); + do_test (&json_ctx, 0, 16 << i, 2048, 0, repeats); + do_test (&json_ctx, i, 64, 256, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, 64, 256, 0, repeats); #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr. */ - do_test (&json_ctx, 0, i, 256, 23); - do_test (&json_ctx, 0, i, 256, 0); - do_test (&json_ctx, i, i, 256, 23); - do_test (&json_ctx, i, i, 256, 0); + /* Also test the position close to the beginning for memrchr. 
*/ + do_test (&json_ctx, 0, i, 256, 23, repeats); + do_test (&json_ctx, 0, i, 256, 0, repeats); + do_test (&json_ctx, i, i, 256, 23, repeats); + do_test (&json_ctx, i, i, 256, 0, repeats); #endif - } - for (i = 1; i < 8; ++i) - { - do_test (&json_ctx, i, i << 5, 192, 23); - do_test (&json_ctx, i, i << 5, 192, 0); - do_test (&json_ctx, i, i << 5, 256, 23); - do_test (&json_ctx, i, i << 5, 256, 0); - do_test (&json_ctx, i, i << 5, 512, 23); - do_test (&json_ctx, i, i << 5, 512, 0); - - do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23); - } - for (i = 1; i < 32; ++i) - { - do_test (&json_ctx, 0, i, i + 1, 23); - do_test (&json_ctx, 0, i, i + 1, 0); - do_test (&json_ctx, i, i, i + 1, 23); - do_test (&json_ctx, i, i, i + 1, 0); - do_test (&json_ctx, 0, i, i - 1, 23); - do_test (&json_ctx, 0, i, i - 1, 0); - do_test (&json_ctx, i, i, i - 1, 23); - do_test (&json_ctx, i, i, i - 1, 0); - - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23); - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0); - - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23); - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0); + } + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, i, i << 5, 192, 23, repeats); + do_test (&json_ctx, i, i << 5, 192, 0, repeats); + do_test (&json_ctx, i, i << 5, 256, 23, repeats); + do_test (&json_ctx, i, i << 5, 256, 0, repeats); + do_test (&json_ctx, i, i << 5, 512, 23, repeats); + do_test (&json_ctx, i, i << 5, 512, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23, repeats); + } + for (i = 1; i < 32; ++i) + { + do_test (&json_ctx, 0, i, i + 1, 23, repeats); + do_test (&json_ctx, 0, i, i + 1, 0, repeats); + do_test (&json_ctx, i, i, i + 1, 23, repeats); + do_test (&json_ctx, i, i, i + 1, 0, repeats); + do_test (&json_ctx, 0, i, i - 1, 23, repeats); + do_test (&json_ctx, 0, i, i - 1, 0, repeats); + do_test (&json_ctx, i, i, i - 1, 23, repeats); + do_test (&json_ctx, i, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0, repeats); + #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr. */ - do_test (&json_ctx, 0, 1, i + 1, 23); - do_test (&json_ctx, 0, 2, i + 1, 0); + do_test (&json_ctx, 0, 1, i + 1, 23, repeats); + do_test (&json_ctx, 0, 2, i + 1, 0, repeats); +#endif + } +#ifndef USE_AS_MEMRCHR + break; #endif } -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
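To make the new parameter concrete, here is a minimal standalone C program mirroring the placement logic of the patch (illustration only; it ignores `align` and uses a fixed length for brevity) that shows where the seek character lands with and without invert_pos:

```
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

int main (void)
{
  char buf[64];
  size_t len = sizeof buf, pos = 5;

  for (int invert_pos = 0; invert_pos < 2; ++invert_pos)
    {
      memset (buf, 'a', sizeof buf);
      /* Mirrors the patch: with invert_pos the match is placed pos bytes
         from the end, which is where memrchr's search actually starts.  */
      if (invert_pos)
        buf[len - pos] = 'X';
      else
        buf[pos] = 'X';
      char *hit = memrchr (buf, 'X', len);
      printf ("invert_pos=%d -> match at offset %td\n", invert_pos,
              hit - buf);
    }
  return 0;
}
```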
* [PATCH v5 4/8] x86: Optimize memrchr-sse2.S 2022-06-07 4:05 ` [PATCH v5 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein @ 2022-06-07 4:05 ` Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 5/8] x86: Optimize memrchr-evex.S Noah Goldstein ` (3 subsequent siblings) 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:05 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller lengths more. 2. optimizes target placement more carefully. 3. reuses logic more. 4. fixes up various inefficiencies in the logic. The total code size saving is: 394 bytes Geometric Mean of all benchmarks New / Old: 0.874 Regressions: 1. The page cross case is now colder, especially re-entry from the page cross case if a match is not found in the first VEC (roughly 50%). My general opinion is that this is acceptable given the "coldness" of this case (less than 4% of inputs) and the general performance improvement in the other, far more common cases. 2. There are some 5-15% regressions for medium/large user-arg lengths that have a match in the first VEC. This is because the logic was rewritten to optimize finds in the first VEC if the user-arg length is shorter (where we see roughly 20-50% performance improvements). This is not always a regression; my intuition is that some frontend quirk partially explains the data, although I haven't been able to find the root cause. Full xcheck passes on x86_64. --- sysdeps/x86_64/memrchr.S | 613 +++++++++++++++++++-------------------- 1 file changed, 292 insertions(+), 321 deletions(-) diff --git a/sysdeps/x86_64/memrchr.S b/sysdeps/x86_64/memrchr.S index d1a9f47911..b0dffd2ae2 100644 --- a/sysdeps/x86_64/memrchr.S +++ b/sysdeps/x86_64/memrchr.S @@ -18,362 +18,333 @@ <https://www.gnu.org/licenses/>. */ #include <sysdep.h> +#define VEC_SIZE 16 +#define PAGE_SIZE 4096 .text -ENTRY (__memrchr) - movd %esi, %xmm1 - - sub $16, %RDX_LP - jbe L(length_less16) - - punpcklbw %xmm1, %xmm1 - punpcklbw %xmm1, %xmm1 - - add %RDX_LP, %RDI_LP - pshufd $0, %xmm1, %xmm1 - - movdqu (%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - -/* Check if there is a match. 
*/ - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches0) - - sub $64, %rdi - mov %edi, %ecx - and $15, %ecx - jz L(loop_prolog) - - add $16, %rdi - add $16, %rdx - and $-16, %rdi - sub %rcx, %rdx - - .p2align 4 -L(loop_prolog): - sub $64, %rdx - jbe L(exit_loop) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16) - - movdqa (%rdi), %xmm4 - pcmpeqb %xmm1, %xmm4 - pmovmskb %xmm4, %eax - test %eax, %eax - jnz L(matches0) - - sub $64, %rdi - sub $64, %rdx - jbe L(exit_loop) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16) - - movdqa (%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches0) - - mov %edi, %ecx - and $63, %ecx - jz L(align64_loop) - - add $64, %rdi - add $64, %rdx - and $-64, %rdi - sub %rcx, %rdx - - .p2align 4 -L(align64_loop): - sub $64, %rdi - sub $64, %rdx - jbe L(exit_loop) - - movdqa (%rdi), %xmm0 - movdqa 16(%rdi), %xmm2 - movdqa 32(%rdi), %xmm3 - movdqa 48(%rdi), %xmm4 - - pcmpeqb %xmm1, %xmm0 - pcmpeqb %xmm1, %xmm2 - pcmpeqb %xmm1, %xmm3 - pcmpeqb %xmm1, %xmm4 - - pmaxub %xmm3, %xmm0 - pmaxub %xmm4, %xmm2 - pmaxub %xmm0, %xmm2 - pmovmskb %xmm2, %eax - - test %eax, %eax - jz L(align64_loop) - - pmovmskb %xmm4, %eax - test %eax, %eax - jnz L(matches48) - - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm2 - - pcmpeqb %xmm1, %xmm2 - pcmpeqb (%rdi), %xmm1 - - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches16) - - pmovmskb %xmm1, %eax - bsr %eax, %eax - - add %rdi, %rax +ENTRY_P2ALIGN(__memrchr, 6) +#ifdef __ILP32__ + /* Clear upper bits. */ + mov %RDX_LP, %RDX_LP +#endif + movd %esi, %xmm0 + + /* Get end pointer. */ + leaq (%rdx, %rdi), %rcx + + punpcklbw %xmm0, %xmm0 + punpcklwd %xmm0, %xmm0 + pshufd $0, %xmm0, %xmm0 + + /* Check if we can load 1x VEC without crossing a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %ecx + jz L(page_cross) + + /* NB: This load happens regardless of whether rdx (len) is zero. Since + it doesn't cross a page and the standard guarantees any valid pointer + has at least one valid byte, this load must be safe. For the entire + history of the x86 memrchr implementation this has been possible so + no code "should" be relying on a zero-length check before this load. + The zero-length check is moved to the page cross case because it is + 1) pretty cold and 2) including it pushes the hot case + (len <= VEC_SIZE) onto two cache lines. */ + movups -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subq $VEC_SIZE, %rdx + ja L(more_1x_vec) +L(ret_vec_x0_test): + /* Zero-flag set if eax (src) is zero. Destination unchanged if src is + zero. */ + bsrl %eax, %eax + jz L(ret_0) + /* Check if the CHAR match is in bounds. Need to truly zero `eax` here + if out of bounds. */ + addl %edx, %eax + jl L(zero_0) + /* Since we subtracted VEC_SIZE from rdx earlier we can just add to base + ptr. 
*/ + addq %rdi, %rax +L(ret_0): ret - .p2align 4 -L(exit_loop): - add $64, %edx - cmp $32, %edx - jbe L(exit_loop_32) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16_1) - cmp $48, %edx - jbe L(return_null) - - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax - test %eax, %eax - jnz L(matches0_1) - xor %eax, %eax + .p2align 4,, 5 +L(ret_vec_x0): + bsrl %eax, %eax + leaq -(VEC_SIZE)(%rcx, %rax), %rax ret - .p2align 4 -L(exit_loop_32): - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48_1) - cmp $16, %edx - jbe L(return_null) - - pcmpeqb 32(%rdi), %xmm1 - pmovmskb %xmm1, %eax - test %eax, %eax - jnz L(matches32_1) - xor %eax, %eax + .p2align 4,, 2 +L(zero_0): + xorl %eax, %eax ret - .p2align 4 -L(matches0): - bsr %eax, %eax - add %rdi, %rax - ret - - .p2align 4 -L(matches16): - bsr %eax, %eax - lea 16(%rax, %rdi), %rax - ret - .p2align 4 -L(matches32): - bsr %eax, %eax - lea 32(%rax, %rdi), %rax + .p2align 4,, 8 +L(more_1x_vec): + testl %eax, %eax + jnz L(ret_vec_x0) + + /* Align rcx (pointer to string). */ + decq %rcx + andq $-VEC_SIZE, %rcx + + movq %rcx, %rdx + /* NB: We could consistently save 1 byte in this pattern with `movaps + %xmm0, %xmm1; pcmpeq IMM8(r), %xmm1; ...`. The reason against it is + it adds more frontend uops (even if the moves can be eliminated) and + some percentage of the time actual backend uops. */ + movaps -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + subq %rdi, %rdx + pmovmskb %xmm1, %eax + + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) +L(last_2x_vec): + subl $VEC_SIZE, %edx + jbe L(ret_vec_x0_test) + + testl %eax, %eax + jnz L(ret_vec_x0) + + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subl $VEC_SIZE, %edx + bsrl %eax, %eax + jz L(ret_1) + addl %edx, %eax + jl L(zero_0) + addq %rdi, %rax +L(ret_1): ret + /* Don't align. Otherwise losing the 2-byte encoding of the jump to + L(page_cross) causes the hot path (length <= VEC_SIZE) to span + multiple cache lines. Naturally aligned % 16 to 8-bytes. */ +L(page_cross): + /* Zero length check. */ + testq %rdx, %rdx + jz L(zero_0) + + leaq -1(%rcx), %r8 + andq $-(VEC_SIZE), %r8 + + movaps (%r8), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %esi + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + negl %ecx + /* 32-bit shift but VEC_SIZE=16 so need to mask the shift count + explicitly. */ + andl $(VEC_SIZE - 1), %ecx + shl %cl, %esi + movzwl %si, %eax + leaq (%rdi, %rdx), %rcx + cmpq %rdi, %r8 + ja L(more_1x_vec) + subl $VEC_SIZE, %edx + bsrl %eax, %eax + jz L(ret_2) + addl %edx, %eax + jl L(zero_1) + addq %rdi, %rax +L(ret_2): ret - .p2align 4 -L(matches48): - bsr %eax, %eax - lea 48(%rax, %rdi), %rax + /* Fits in aligning bytes. 
*/ +L(zero_1): + xorl %eax, %eax ret - .p2align 4 -L(matches16_1): - bsr %eax, %eax - sub $48, %rdx - add %rax, %rdx - jl L(return_null) - lea 16(%rdi, %rax), %rax + .p2align 4,, 5 +L(ret_vec_x1): + bsrl %eax, %eax + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax ret - .p2align 4 -L(matches32_1): - bsr %eax, %eax - sub $32, %rdx - add %rax, %rdx - jl L(return_null) - lea 32(%rdi, %rax), %rax - ret + .p2align 4,, 8 +L(more_2x_vec): + testl %eax, %eax + jnz L(ret_vec_x0) - .p2align 4 -L(matches48_1): - bsr %eax, %eax - sub $16, %rdx - add %rax, %rdx - jl L(return_null) - lea 48(%rdi, %rax), %rax - ret + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + testl %eax, %eax + jnz L(ret_vec_x1) - .p2align 4 -L(return_null): - xor %eax, %eax - ret - .p2align 4 -L(length_less16_offset0): - test %edx, %edx - jz L(return_null) + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax - mov %dl, %cl - pcmpeqb (%rdi), %xmm1 + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) - mov $1, %edx - sal %cl, %edx - sub $1, %edx + addl $(VEC_SIZE), %edx + jle L(ret_vec_x2_test) - pmovmskb %xmm1, %eax +L(last_vec): + testl %eax, %eax + jnz L(ret_vec_x2) - and %edx, %eax - test %eax, %eax - jz L(return_null) + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax - bsr %eax, %eax - add %rdi, %rax + subl $(VEC_SIZE), %edx + bsrl %eax, %eax + jz L(ret_3) + addl %edx, %eax + jl L(zero_2) + addq %rdi, %rax +L(ret_3): ret - .p2align 4 -L(length_less16): - punpcklbw %xmm1, %xmm1 - punpcklbw %xmm1, %xmm1 - - add $16, %edx - - pshufd $0, %xmm1, %xmm1 - - mov %edi, %ecx - and $15, %ecx - jz L(length_less16_offset0) - - mov %cl, %dh - mov %ecx, %esi - add %dl, %dh - and $-16, %rdi - - sub $16, %dh - ja L(length_less16_part2) - - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax - - sar %cl, %eax - mov %dl, %cl - - mov $1, %edx - sal %cl, %edx - sub $1, %edx - - and %edx, %eax - test %eax, %eax - jz L(return_null) - - bsr %eax, %eax - add %rdi, %rax - add %rsi, %rax + .p2align 4,, 6 +L(ret_vec_x2_test): + bsrl %eax, %eax + jz L(zero_2) + addl %edx, %eax + jl L(zero_2) + addq %rdi, %rax ret - .p2align 4 -L(length_less16_part2): - movdqa 16(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - - mov %dh, %cl - mov $1, %edx - sal %cl, %edx - sub $1, %edx - - and %edx, %eax +L(zero_2): + xorl %eax, %eax + ret - test %eax, %eax - jnz L(length_less16_part2_return) - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax + .p2align 4,, 5 +L(ret_vec_x2): + bsrl %eax, %eax + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax + ret - mov %esi, %ecx - sar %cl, %eax - test %eax, %eax - jz L(return_null) + .p2align 4,, 5 +L(ret_vec_x3): + bsrl %eax, %eax + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax + ret - bsr %eax, %eax - add %rdi, %rax - add %rsi, %rax + .p2align 4,, 8 +L(more_4x_vec): + testl %eax, %eax + jnz L(ret_vec_x2) + + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + testl %eax, %eax + jnz L(ret_vec_x3) + + addq $-(VEC_SIZE * 4), %rcx + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) + + /* Offset everything by 4x VEC_SIZE here to save a few bytes at the end + keeping the code from spilling to the next cache line. 
*/ + addq $(VEC_SIZE * 4 - 1), %rcx + andq $-(VEC_SIZE * 4), %rcx + leaq (VEC_SIZE * 4)(%rdi), %rdx + andq $-(VEC_SIZE * 4), %rdx + + .p2align 4,, 11 +L(loop_4x_vec): + movaps (VEC_SIZE * -1)(%rcx), %xmm1 + movaps (VEC_SIZE * -2)(%rcx), %xmm2 + movaps (VEC_SIZE * -3)(%rcx), %xmm3 + movaps (VEC_SIZE * -4)(%rcx), %xmm4 + pcmpeqb %xmm0, %xmm1 + pcmpeqb %xmm0, %xmm2 + pcmpeqb %xmm0, %xmm3 + pcmpeqb %xmm0, %xmm4 + + por %xmm1, %xmm2 + por %xmm3, %xmm4 + por %xmm2, %xmm4 + + pmovmskb %xmm4, %esi + testl %esi, %esi + jnz L(loop_end) + + addq $-(VEC_SIZE * 4), %rcx + cmpq %rdx, %rcx + jne L(loop_4x_vec) + + subl %edi, %edx + + /* Ends up being 1-byte nop. */ + .p2align 4,, 2 +L(last_4x_vec): + movaps -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) + + testl %eax, %eax + jnz L(ret_vec_x0) + + + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + testl %eax, %eax + jnz L(ret_vec_end) + + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subl $(VEC_SIZE * 3), %edx + ja L(last_vec) + bsrl %eax, %eax + jz L(ret_4) + addl %edx, %eax + jl L(zero_3) + addq %rdi, %rax +L(ret_4): ret - .p2align 4 -L(length_less16_part2_return): - bsr %eax, %eax - lea 16(%rax, %rdi), %rax + /* Ends up being 1-byte nop. */ + .p2align 4,, 3 +L(loop_end): + pmovmskb %xmm1, %eax + sall $16, %eax + jnz L(ret_vec_end) + + pmovmskb %xmm2, %eax + testl %eax, %eax + jnz L(ret_vec_end) + + pmovmskb %xmm3, %eax + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If ecx is non-zero + then CHAR in VEC3 and bsrq will use that position. */ + sall $16, %eax + orl %esi, %eax + bsrl %eax, %eax + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax ret -END (__memrchr) +L(ret_vec_end): + bsrl %eax, %eax + leaq (VEC_SIZE * -2)(%rax, %rcx), %rax + ret + /* Use in L(last_4x_vec). In the same cache line. These are just spare + aligning bytes. */ +L(zero_3): + xorl %eax, %eax + ret + /* 2-bytes from next cache line. */ +END(__memrchr) weak_alias (__memrchr, memrchr) -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
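The recurring pattern in this rewrite -- bsr on the match mask, add the already-decremented length, branch on sign -- is easier to see in scalar form. Here is a C model of L(ret_vec_x0_test) (a sketch, not the implementation; mask stands in for the 16-bit pmovmskb result of the vector loaded at buf + len - 16):

```
#include <stddef.h>

/* Scalar model of L(ret_vec_x0_test): bit i of mask corresponds to byte
   buf[len - 16 + i].  After the earlier `subq $VEC_SIZE, %rdx`, edx holds
   len - 16, so bsr(mask) + (len - 16) going negative means the highest
   match lies before buf, i.e. outside the user's range.  */
static const char *
ret_vec_x0_test_model (const char *buf, size_t len, unsigned int mask)
{
  if (mask == 0)
    return NULL;                        /* bsr leaves its dest unchanged. */
  int bsr = 31 - __builtin_clz (mask);  /* Highest set bit of the mask.  */
  long off = (long) len - 16 + bsr;
  return off < 0 ? NULL : buf + off;
}
```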
* [PATCH v5 5/8] x86: Optimize memrchr-evex.S 2022-06-07 4:05 ` [PATCH v5 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (2 preceding siblings ...) 2022-06-07 4:05 ` [PATCH v5 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein @ 2022-06-07 4:05 ` Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein ` (2 subsequent siblings) 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:05 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully 3. reuses logic more 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns which saves either a branch or multiple instructions. The total code size saving is: 263 bytes Geometric Mean of all benchmarks New / Old: 0.755 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the beginning of the string (in first VEC). This case has roughly a 20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). The short-length case has roughly a 35% speedup. Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memrchr-evex.S | 539 ++++++++++++------------ 1 file changed, 268 insertions(+), 271 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S index 0b99709c6b..f0bc4f175a 100644 --- a/sysdeps/x86_64/multiarch/memrchr-evex.S +++ b/sysdeps/x86_64/multiarch/memrchr-evex.S @@ -19,319 +19,316 @@ #if IS_IN (libc) # include <sysdep.h> +# include "evex256-vecs.h" +# if VEC_SIZE != 32 +# error "VEC_SIZE != 32 unimplemented" +# endif + +# ifndef MEMRCHR +# define MEMRCHR __memrchr_evex +# endif + +# define PAGE_SIZE 4096 +# define VECMATCH VEC(0) + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN(MEMRCHR, 6) +# ifdef __ILP32__ + /* Clear upper bits. */ + and %RDX_LP, %RDX_LP +# else + test %RDX_LP, %RDX_LP +# endif + jz L(zero_0) + + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a + correct page cross check and 2) it correctly sets up the end ptr so + that the lzcnt result can be subtracted from it directly. */ + leaq -1(%rdi, %rdx), %rax + vpbroadcastb %esi, %VECMATCH + + /* Check if we can load 1x VEC without crossing a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %eax + jz L(page_cross) + + /* Don't use rax for pointer here because EVEX has better encoding with + offset % VEC_SIZE == 0. */ + vpcmpb $0, -(VEC_SIZE)(%rdi, %rdx), %VECMATCH, %k0 + kmovd %k0, %ecx + + /* Fall through for rdx (len) <= VEC_SIZE (expect small sizes). */ + cmpq $VEC_SIZE, %rdx + ja L(more_1x_vec) +L(ret_vec_x0_test): + + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which + will guarantee edx (len) is less than it. */ + lzcntl %ecx, %ecx + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret -# define VMOVA vmovdqa64 - -# define YMMMATCH ymm16 - -# define VEC_SIZE 32 - - .section .text.evex,"ax",@progbits -ENTRY (__memrchr_evex) - /* Broadcast CHAR to YMMMATCH. */ - vpbroadcastb %esi, %YMMMATCH - - sub $VEC_SIZE, %RDX_LP - jbe L(last_vec_or_less) - - add %RDX_LP, %RDI_LP - - /* Check the last VEC_SIZE bytes. 
*/ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - subq $(VEC_SIZE * 4), %rdi - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(aligned_more) - - /* Align data for aligned loads in the loop. */ - addq $VEC_SIZE, %rdi - addq $VEC_SIZE, %rdx - andq $-VEC_SIZE, %rdi - subq %rcx, %rdx - - .p2align 4 -L(aligned_more): - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpcmpb $0, (%rdi), %YMMMATCH, %k4 - kmovd %k4, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - /* Align data to 4 * VEC_SIZE for loop with fewer branches. - There are some overlaps with above if data isn't aligned - to 4 * VEC_SIZE. */ - movl %edi, %ecx - andl $(VEC_SIZE * 4 - 1), %ecx - jz L(loop_4x_vec) - - addq $(VEC_SIZE * 4), %rdi - addq $(VEC_SIZE * 4), %rdx - andq $-(VEC_SIZE * 4), %rdi - subq %rcx, %rdx + /* Fits in aligning bytes of first cache line. */ +L(zero_0): + xorl %eax, %eax + ret - .p2align 4 -L(loop_4x_vec): - /* Compare 4 * VEC at a time forward. */ - subq $(VEC_SIZE * 4), %rdi - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k2 - kord %k1, %k2, %k5 - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k3 - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k4 - - kord %k3, %k4, %k6 - kortestd %k5, %k6 - jz L(loop_4x_vec) - - /* There is a match. */ - kmovd %k4, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - kmovd %k1, %eax - bsrl %eax, %eax - addq %rdi, %rax + .p2align 4,, 9 +L(ret_vec_x0_dec): + decq %rax +L(ret_vec_x0): + lzcntl %ecx, %ecx + subq %rcx, %rax ret - .p2align 4 -L(last_4x_vec_or_less): - addl $(VEC_SIZE * 4), %edx - cmpl $(VEC_SIZE * 2), %edx - jbe L(last_2x_vec) + .p2align 4,, 10 +L(more_1x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3) + /* Align rax (pointer to string). */ + andq $-VEC_SIZE, %rax - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x2) + /* Recompute length after aligning. */ + movq %rax, %rdx - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x1_check) - cmpl $(VEC_SIZE * 3), %edx - jbe L(zero) + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - vpcmpb $0, (%rdi), %YMMMATCH, %k4 - kmovd %k4, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 4), %rdx - addq %rax, %rdx - jl L(zero) - addq %rdi, %rax - ret + subq %rdi, %rdx - .p2align 4 + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) L(last_2x_vec): - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3_check) + + /* Must dec rax because L(ret_vec_x0_test) expects it. 
*/ + decq %rax cmpl $VEC_SIZE, %edx - jbe L(zero) - - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 2), %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + jbe L(ret_vec_x0_test) + + testl %ecx, %ecx + jnz L(ret_vec_x0) + + /* Don't use rax for pointer here because EVEX has better encoding with + offset % VEC_SIZE == 0. */ + vpcmpb $0, -(VEC_SIZE * 2)(%rdi, %rdx), %VECMATCH, %k0 + kmovd %k0, %ecx + /* NB: 64-bit lzcnt. This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax ret - .p2align 4 -L(last_vec_x0): - bsrl %eax, %eax - addq %rdi, %rax + /* Inexpensive place to put this regarding code size / target alignments + / ICache NLP. Necessary for 2-byte encoding of jump to page cross + case which in turn is necessary for the hot path (len <= VEC_SIZE) to + fit in the first cache line. */ +L(page_cross): + movq %rax, %rsi + andq $-VEC_SIZE, %rsi + vpcmpb $0, (%rsi), %VECMATCH, %k0 + kmovd %k0, %r8d + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + movl %eax, %ecx + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ + notl %ecx + shlxl %ecx, %r8d, %ecx + cmpq %rdi, %rsi + ja L(more_1x_vec) + lzcntl %ecx, %ecx + cmpl %ecx, %edx + jle L(zero_1) + subq %rcx, %rax ret - .p2align 4 -L(last_vec_x1): - bsrl %eax, %eax - addl $VEC_SIZE, %eax - addq %rdi, %rax + /* Continue creating zero labels that fit in aligning bytes and get + 2-byte encoding / are in the same cache line as condition. */ +L(zero_1): + xorl %eax, %eax ret - .p2align 4 -L(last_vec_x2): - bsrl %eax, %eax - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + .p2align 4,, 8 +L(ret_vec_x1): + /* This will naturally add 32 to position. */ + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax ret - .p2align 4 -L(last_vec_x3): - bsrl %eax, %eax - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret + .p2align 4,, 8 +L(more_2x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0_dec) - .p2align 4 -L(last_vec_x1_check): - bsrl %eax, %eax - subq $(VEC_SIZE * 3), %rdx - addq %rax, %rdx - jl L(zero) - addl $VEC_SIZE, %eax - addq %rdi, %rax - ret + vpcmpb $0, -(VEC_SIZE * 2)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1) - .p2align 4 -L(last_vec_x3_check): - bsrl %eax, %eax - subq $VEC_SIZE, %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE * 3)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - .p2align 4 -L(zero): - xorl %eax, %eax + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) + + cmpl $(VEC_SIZE * -1), %edx + jle L(ret_vec_x2_test) +L(last_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) + + + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 3 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_1) ret - .p2align 4 -L(last_vec_or_less_aligned): - movl %edx, %ecx - - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - - movl $1, %edx - /* Support rdx << 32. */ - salq %cl, %rdx - subq $1, %rdx - - kmovd %k1, %eax - - /* Remove the trailing bytes. 
*/ - andl %edx, %eax - testl %eax, %eax - jz L(zero) - - bsrl %eax, %eax - addq %rdi, %rax + .p2align 4,, 8 +L(ret_vec_x2_test): + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_1) ret - .p2align 4 -L(last_vec_or_less): - addl $VEC_SIZE, %edx - - /* Check for zero length. */ - testl %edx, %edx - jz L(zero) - - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(last_vec_or_less_aligned) - - movl %ecx, %esi - movl %ecx, %r8d - addl %edx, %esi - andq $-VEC_SIZE, %rdi + .p2align 4,, 8 +L(ret_vec_x2): + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax + ret - subl $VEC_SIZE, %esi - ja L(last_vec_2x_aligned) + .p2align 4,, 8 +L(ret_vec_x3): + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax + ret - /* Check the last VEC. */ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax + .p2align 4,, 8 +L(more_4x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) - /* Remove the leading and trailing bytes. */ - sarl %cl, %eax - movl %edx, %ecx + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + testl %ecx, %ecx + jnz L(ret_vec_x3) - andl %edx, %eax - testl %eax, %eax - jz L(zero) + /* Check if near end before re-aligning (otherwise might do an + unnecessary loop iteration). */ + addq $-(VEC_SIZE * 4), %rax + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax - ret + decq %rax + andq $-(VEC_SIZE * 4), %rax + movq %rdi, %rdx + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because + lengths that overflow can be valid and break the comparison. */ + andq $-(VEC_SIZE * 4), %rdx .p2align 4 -L(last_vec_2x_aligned): - movl %esi, %ecx - - /* Check the last VEC. */ - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k1 +L(loop_4x_vec): + /* Store 1 where not-equal and 0 where equal in k1 (used to mask later + on). */ + vpcmpb $4, (VEC_SIZE * 3)(%rax), %VECMATCH, %k1 + + /* VEC(2/3) will have zero-byte where we found a CHAR. */ + vpxorq (VEC_SIZE * 2)(%rax), %VECMATCH, %VEC(2) + vpxorq (VEC_SIZE * 1)(%rax), %VECMATCH, %VEC(3) + vpcmpb $0, (VEC_SIZE * 0)(%rax), %VECMATCH, %k4 + + /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit where + CHAR is found and VEC(2/3) have zero-byte where CHAR is found). */ + vpminub %VEC(2), %VEC(3), %VEC(3){%k1}{z} + vptestnmb %VEC(3), %VEC(3), %k2 + + /* Any 1s and we found CHAR. */ + kortestd %k2, %k4 + jnz L(loop_end) + + addq $-(VEC_SIZE * 4), %rax + cmpq %rdx, %rax + jne L(loop_4x_vec) + + /* Need to re-adjust rdx / rax for L(last_4x_vec). */ + subq $-(VEC_SIZE * 4), %rdx + movq %rdx, %rax + subl %edi, %edx +L(last_4x_vec): + + /* Used no matter what. */ + vpcmpb $0, (VEC_SIZE * -1)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) - kmovd %k1, %eax + testl %ecx, %ecx + jnz L(ret_vec_x0_dec) - /* Remove the trailing bytes. */ - andl %edx, %eax - testl %eax, %eax - jnz L(last_vec_x1) + vpcmpb $0, (VEC_SIZE * -2)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - /* Check the second last VEC. */ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 + testl %ecx, %ecx + jnz L(ret_vec_x1) - movl %r8d, %ecx + /* Used no matter what. */ + vpcmpb $0, (VEC_SIZE * -3)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - kmovd %k1, %eax + cmpl $(VEC_SIZE * 3), %edx + ja L(last_vec) - /* Remove the leading bytes. Must use unsigned right shift for - bsrl below. 
*/ - shrl %cl, %eax - testl %eax, %eax - jz L(zero) + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + jbe L(ret_1) + xorl %eax, %eax +L(ret_1): + ret - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax + .p2align 4,, 6 +L(loop_end): + kmovd %k1, %ecx + notl %ecx + testl %ecx, %ecx + jnz L(ret_vec_x0_end) + + vptestnmb %VEC(2), %VEC(2), %k0 + kmovd %k0, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) + + kmovd %k2, %ecx + kmovd %k4, %esi + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If ecx is non-zero + then CHAR in VEC3 and bsrq will use that position. */ + salq $32, %rcx + orq %rsi, %rcx + bsrq %rcx, %rcx + addq %rcx, %rax + ret + .p2align 4,, 4 +L(ret_vec_x0_end): + addq $(VEC_SIZE), %rax +L(ret_vec_x1_end): + bsrl %ecx, %ecx + leaq (VEC_SIZE * 2)(%rax, %rcx), %rax ret -END (__memrchr_evex) + +END(MEMRCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
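The lzcnt trick called out in the commit message can likewise be modelled in C. For the len <= VEC_SIZE path, rax holds buf + len - 1 and mask is the 32-bit kmovd result for the vector ending at buf + len (a sketch of the idea, not the implementation):

```
#include <stddef.h>

/* Scalar model of L(ret_vec_x0_test) in memrchr-evex: the highest match
   sits lzcnt(mask) bytes below buf + len - 1.  Because lzcnt of 0 is
   defined as 32 (= VEC_SIZE), and len <= VEC_SIZE on this path, the
   no-match case falls out of the same len <= lzcnt comparison with no
   extra branch -- that is what saves a branch over bsr.  */
static const char *
ret_vec_x0_test_model (const char *buf, unsigned int len, unsigned int mask)
{
  unsigned int lz = mask ? __builtin_clz (mask) : 32;  /* lzcntl */
  if (len <= lz)                 /* cmpl %ecx, %edx; jle L(zero_0) */
    return NULL;
  return buf + len - 1 - lz;     /* subq %rcx, %rax */
}
```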
* [PATCH v5 6/8] x86: Optimize memrchr-avx2.S 2022-06-07 4:05 ` [PATCH v5 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (3 preceding siblings ...) 2022-06-07 4:05 ` [PATCH v5 5/8] x86: Optimize memrchr-evex.S Noah Goldstein @ 2022-06-07 4:05 ` Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:05 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully 3. reuses logic more 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns which saves either a branch or multiple instructions. The total code size saving is: 306 bytes Geometric Mean of all benchmarks New / Old: 0.760 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the beginning of the string (in first VEC). This case has roughly a 10-20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). The short-length case has roughly a 15-45% speedup. Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S | 1 + sysdeps/x86_64/multiarch/memrchr-avx2.S | 534 ++++++++++---------- 2 files changed, 257 insertions(+), 278 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S index cea2d2a72d..5e9beeeef2 100644 --- a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S +++ b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S @@ -2,6 +2,7 @@ # define MEMRCHR __memrchr_avx2_rtm #endif +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2.S b/sysdeps/x86_64/multiarch/memrchr-avx2.S index ba2ce7cb03..7d11a41618 100644 --- a/sysdeps/x86_64/multiarch/memrchr-avx2.S +++ b/sysdeps/x86_64/multiarch/memrchr-avx2.S @@ -21,340 +21,318 @@ # include <sysdep.h> # ifndef MEMRCHR -# define MEMRCHR __memrchr_avx2 +# define MEMRCHR __memrchr_avx2 # endif # ifndef VZEROUPPER -# define VZEROUPPER vzeroupper +# define VZEROUPPER vzeroupper # endif # ifndef SECTION # define SECTION(p) p##.avx # endif -# define VEC_SIZE 32 +# define VEC_SIZE 32 +# define PAGE_SIZE 4096 + .section SECTION(.text), "ax", @progbits +ENTRY(MEMRCHR) +# ifdef __ILP32__ + /* Clear upper bits. */ + and %RDX_LP, %RDX_LP +# else + test %RDX_LP, %RDX_LP +# endif + jz L(zero_0) - .section SECTION(.text),"ax",@progbits -ENTRY (MEMRCHR) - /* Broadcast CHAR to YMM0. */ vmovd %esi, %xmm0 - vpbroadcastb %xmm0, %ymm0 - - sub $VEC_SIZE, %RDX_LP - jbe L(last_vec_or_less) - - add %RDX_LP, %RDI_LP - - /* Check the last VEC_SIZE bytes. */ - vpcmpeqb (%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x0) + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a + correct page cross check and 2) it correctly sets up the end ptr so + that the lzcnt result can be subtracted from it directly. */ + leaq -1(%rdx, %rdi), %rax - subq $(VEC_SIZE * 4), %rdi - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(aligned_more) + vpbroadcastb %xmm0, %ymm0 - /* Align data for aligned loads in the loop. 
*/ - addq $VEC_SIZE, %rdi - addq $VEC_SIZE, %rdx - andq $-VEC_SIZE, %rdi - subq %rcx, %rdx + /* Check if we can load 1x VEC without crossing a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %eax + jz L(page_cross) + + vpcmpeqb -(VEC_SIZE - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + cmpq $VEC_SIZE, %rdx + ja L(more_1x_vec) + +L(ret_vec_x0_test): + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which + will guarantee edx (len) is less than it. */ + lzcntl %ecx, %ecx + + /* Hoist vzeroupper (not great for RTM) to save code size. This allows + all logic for edx (len) <= VEC_SIZE to fit in first cache line. */ + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret - .p2align 4 -L(aligned_more): - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpcmpeqb (%rdi), %ymm0, %ymm4 - vpmovmskb %ymm4, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - /* Align data to 4 * VEC_SIZE for loop with fewer branches. - There are some overlaps with above if data isn't aligned - to 4 * VEC_SIZE. */ - movl %edi, %ecx - andl $(VEC_SIZE * 4 - 1), %ecx - jz L(loop_4x_vec) - - addq $(VEC_SIZE * 4), %rdi - addq $(VEC_SIZE * 4), %rdx - andq $-(VEC_SIZE * 4), %rdi - subq %rcx, %rdx + /* Fits in aligning bytes of first cache line. */ +L(zero_0): + xorl %eax, %eax + ret - .p2align 4 -L(loop_4x_vec): - /* Compare 4 * VEC at a time forward. */ - subq $(VEC_SIZE * 4), %rdi - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - vmovdqa (%rdi), %ymm1 - vmovdqa VEC_SIZE(%rdi), %ymm2 - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm3 - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm4 - - vpcmpeqb %ymm1, %ymm0, %ymm1 - vpcmpeqb %ymm2, %ymm0, %ymm2 - vpcmpeqb %ymm3, %ymm0, %ymm3 - vpcmpeqb %ymm4, %ymm0, %ymm4 - - vpor %ymm1, %ymm2, %ymm5 - vpor %ymm3, %ymm4, %ymm6 - vpor %ymm5, %ymm6, %ymm5 - - vpmovmskb %ymm5, %eax - testl %eax, %eax - jz L(loop_4x_vec) - - /* There is a match. */ - vpmovmskb %ymm4, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpmovmskb %ymm1, %eax - bsrl %eax, %eax - addq %rdi, %rax + .p2align 4,, 9 +L(ret_vec_x0): + lzcntl %ecx, %ecx + subq %rcx, %rax L(return_vzeroupper): ZERO_UPPER_VEC_REGISTERS_RETURN - .p2align 4 -L(last_4x_vec_or_less): - addl $(VEC_SIZE * 4), %edx - cmpl $(VEC_SIZE * 2), %edx - jbe L(last_2x_vec) - - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x1_check) - cmpl $(VEC_SIZE * 3), %edx - jbe L(zero) - - vpcmpeqb (%rdi), %ymm0, %ymm4 - vpmovmskb %ymm4, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 4), %rdx - addq %rax, %rdx - jl L(zero) - addq %rdi, %rax - VZEROUPPER_RETURN - - .p2align 4 + .p2align 4,, 10 +L(more_1x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) + + /* Align rax (string pointer). 
*/ + andq $-VEC_SIZE, %rax + + /* Recompute remaining length after aligning. */ + movq %rax, %rdx + /* Need this comparison next no matter what. */ + vpcmpeqb -(VEC_SIZE)(%rax), %ymm0, %ymm1 + subq %rdi, %rdx + decq %rax + vpmovmskb %ymm1, %ecx + /* Fall through for short (hotter than length). */ + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) L(last_2x_vec): - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3_check) cmpl $VEC_SIZE, %edx - jbe L(zero) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 2), %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax - VZEROUPPER_RETURN - - .p2align 4 -L(last_vec_x0): - bsrl %eax, %eax - addq %rdi, %rax - VZEROUPPER_RETURN + jbe L(ret_vec_x0_test) + + testl %ecx, %ecx + jnz L(ret_vec_x0) + + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + /* 64-bit lzcnt. This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret - .p2align 4 -L(last_vec_x1): - bsrl %eax, %eax - addl $VEC_SIZE, %eax - addq %rdi, %rax - VZEROUPPER_RETURN - .p2align 4 -L(last_vec_x2): - bsrl %eax, %eax - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + /* Inexpensive place to put this regarding code size / target alignments + / ICache NLP. Necessary for 2-byte encoding of jump to page cross + case which in turn is necessary for the hot path (len <= VEC_SIZE) to + fit in the first cache line. */ +L(page_cross): + movq %rax, %rsi + andq $-VEC_SIZE, %rsi + vpcmpeqb (%rsi), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + movl %eax, %r8d + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ + notl %r8d + shlxl %r8d, %ecx, %ecx + cmpq %rdi, %rsi + ja L(more_1x_vec) + lzcntl %ecx, %ecx + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret + .p2align 4,, 11 +L(ret_vec_x1): + /* This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + subq %rcx, %rax + VZEROUPPER_RETURN + .p2align 4,, 10 +L(more_2x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) - .p2align 4 -L(last_vec_x3): - bsrl %eax, %eax - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1) - .p2align 4 -L(last_vec_x1_check): - bsrl %eax, %eax - subq $(VEC_SIZE * 3), %rdx - addq %rax, %rdx - jl L(zero) - addl $VEC_SIZE, %eax - addq %rdi, %rax - VZEROUPPER_RETURN - .p2align 4 -L(last_vec_x3_check): - bsrl %eax, %eax - subq $VEC_SIZE, %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - VZEROUPPER_RETURN + /* Needed no matter what. */ + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - .p2align 4 -L(zero): - xorl %eax, %eax - VZEROUPPER_RETURN + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) + + cmpl $(VEC_SIZE * -1), %edx + jle L(ret_vec_x2_test) + +L(last_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) + + /* Needed no matter what. */ + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 3), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_2) + ret - .p2align 4 -L(null): + /* First in aligning bytes. 
*/ +L(zero_2): xorl %eax, %eax ret - .p2align 4 -L(last_vec_or_less_aligned): - movl %edx, %ecx + .p2align 4,, 4 +L(ret_vec_x2_test): + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_2) + ret - vpcmpeqb (%rdi), %ymm0, %ymm1 - movl $1, %edx - /* Support rdx << 32. */ - salq %cl, %rdx - subq $1, %rdx + .p2align 4,, 11 +L(ret_vec_x2): + /* ecx must be non-zero. */ + bsrl %ecx, %ecx + leaq (VEC_SIZE * -3 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - vpmovmskb %ymm1, %eax + .p2align 4,, 14 +L(ret_vec_x3): + /* ecx must be non-zero. */ + bsrl %ecx, %ecx + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - /* Remove the trailing bytes. */ - andl %edx, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - addq %rdi, %rax - VZEROUPPER_RETURN .p2align 4 -L(last_vec_or_less): - addl $VEC_SIZE, %edx +L(more_4x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) - /* Check for zero length. */ - testl %edx, %edx - jz L(null) + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(last_vec_or_less_aligned) + testl %ecx, %ecx + jnz L(ret_vec_x3) - movl %ecx, %esi - movl %ecx, %r8d - addl %edx, %esi - andq $-VEC_SIZE, %rdi + /* Check if near end before re-aligning (otherwise might do an + unnecessary loop iteration). */ + addq $-(VEC_SIZE * 4), %rax + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) - subl $VEC_SIZE, %esi - ja L(last_vec_2x_aligned) + /* Align rax to (VEC_SIZE * 4 - 1). */ + orq $(VEC_SIZE * 4 - 1), %rax + movq %rdi, %rdx + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because + lengths that overflow can be valid and break the comparison. */ + orq $(VEC_SIZE * 4 - 1), %rdx - /* Check the last VEC. */ - vpcmpeqb (%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - - /* Remove the leading and trailing bytes. */ - sarl %cl, %eax - movl %edx, %ecx + .p2align 4 +L(loop_4x_vec): + /* Need this comparison next no matter what. */ + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm2 + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm3 + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm4 - movl $1, %edx - sall %cl, %edx - subl $1, %edx + vpor %ymm1, %ymm2, %ymm2 + vpor %ymm3, %ymm4, %ymm4 + vpor %ymm2, %ymm4, %ymm4 + vpmovmskb %ymm4, %esi - andl %edx, %eax - testl %eax, %eax - jz L(zero) + testl %esi, %esi + jnz L(loop_end) - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax - VZEROUPPER_RETURN + addq $(VEC_SIZE * -4), %rax + cmpq %rdx, %rax + jne L(loop_4x_vec) - .p2align 4 -L(last_vec_2x_aligned): - movl %esi, %ecx + subl %edi, %edx + incl %edx - /* Check the last VEC. */ - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm1 +L(last_4x_vec): + /* Used no matter what. */ + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) - vpmovmskb %ymm1, %eax + testl %ecx, %ecx + jnz L(ret_vec_x0_end) - /* Remove the trailing bytes. */ - andl %edx, %eax + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) - testl %eax, %eax - jnz L(last_vec_x1) + /* Used no matter what. */ + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - /* Check the second last VEC. 
*/ - vpcmpeqb (%rdi), %ymm0, %ymm1 + cmpl $(VEC_SIZE * 3), %edx + ja L(last_vec) + + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + jbe L(ret0) + xorl %eax, %eax +L(ret0): + ret - movl %r8d, %ecx - vpmovmskb %ymm1, %eax + .p2align 4 +L(loop_end): + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x0_end) + + vpmovmskb %ymm2, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) + + vpmovmskb %ymm3, %ecx + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If ecx is non-zero + then CHAR in VEC3 and bsrq will use that position. */ + salq $32, %rcx + orq %rsi, %rcx + bsrq %rcx, %rcx + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - /* Remove the leading bytes. Must use unsigned right shift for - bsrl below. */ - shrl %cl, %eax - testl %eax, %eax - jz L(zero) + .p2align 4,, 4 +L(ret_vec_x1_end): + /* 64-bit version will automatically add 32 (VEC_SIZE). */ + lzcntq %rcx, %rcx + subq %rcx, %rax + VZEROUPPER_RETURN - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax + .p2align 4,, 4 +L(ret_vec_x0_end): + lzcntl %ecx, %ecx + subq %rcx, %rax VZEROUPPER_RETURN -END (MEMRCHR) + + /* 2 bytes until next cache line. */ +END(MEMRCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
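One more trick from L(loop_end) worth spelling out: the masks of the last two vectors are merged so that a single 64-bit bsr picks the right-most match across both. A C model of the idea (illustration only; hi_mask is the mask of the vector covering the higher 32 bytes, and at least one bit must be set):

```
#include <stdint.h>

/* Model of `salq $32, %rcx; orq %rsi, %rcx; bsrq %rcx, %rcx`: putting
   the higher-address vector's mask in the top half means bsr
   automatically prefers it, which is exactly memrchr's "last
   occurrence" rule.  Undefined if both masks are zero (as is bsr).  */
static int
combined_bsr (uint32_t hi_mask, uint32_t lo_mask)
{
  uint64_t both = ((uint64_t) hi_mask << 32) | lo_mask;
  return 63 - __builtin_clzll (both);   /* bsrq equivalent.  */
}
```

The returned bit index is an offset from the base of the lower vector, matching the final `leaq (VEC_SIZE * -4 + 1)(%rcx, %rax)` address computation in the patch.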
* [PATCH v5 7/8] x86: Shrink code size of memchr-avx2.S 2022-06-07 4:05 ` [PATCH v5 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (4 preceding siblings ...) 2022-06-07 4:05 ` [PATCH v5 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein @ 2022-06-07 4:05 ` Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 6 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:05 UTC (permalink / raw) To: libc-alpha This is not meant as a performance optimization. The previous code was far too liberal in aligning targets and wasted code size unnecessarily. The total code size saving is: 59 bytes There are no major changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 0.967 Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memchr-avx2-rtm.S | 1 + sysdeps/x86_64/multiarch/memchr-avx2.S | 109 +++++++++++---------- 2 files changed, 60 insertions(+), 50 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S index 87b076c7c4..c4d71938c5 100644 --- a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S +++ b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S @@ -2,6 +2,7 @@ # define MEMCHR __memchr_avx2_rtm #endif +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/multiarch/memchr-avx2.S b/sysdeps/x86_64/multiarch/memchr-avx2.S index 75bd7262e0..28a01280ec 100644 --- a/sysdeps/x86_64/multiarch/memchr-avx2.S +++ b/sysdeps/x86_64/multiarch/memchr-avx2.S @@ -57,7 +57,7 @@ # define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) .section SECTION(.text),"ax",@progbits -ENTRY (MEMCHR) +ENTRY_P2ALIGN (MEMCHR, 5) # ifndef USE_AS_RAWMEMCHR /* Check for zero length. */ # ifdef __ILP32__ @@ -87,12 +87,14 @@ ENTRY (MEMCHR) # endif testl %eax, %eax jz L(aligned_more) - tzcntl %eax, %eax + bsfl %eax, %eax addq %rdi, %rax - VZEROUPPER_RETURN +L(return_vzeroupper): + ZERO_UPPER_VEC_REGISTERS_RETURN + # ifndef USE_AS_RAWMEMCHR - .p2align 5 + .p2align 4 L(first_vec_x0): /* Check if first match was before length. */ tzcntl %eax, %eax @@ -100,58 +102,31 @@ L(first_vec_x0): /* NB: Multiply length by 4 to get byte count. */ sall $2, %edx # endif - xorl %ecx, %ecx + COND_VZEROUPPER + /* Use branch instead of cmovcc so L(first_vec_x0) fits in one fetch + block. Branching here as opposed to using cmovcc is not that costly. + Common usage of memchr is to check if the return was NULL (if the + string was known to contain CHAR the user would use rawmemchr). This + branch will be highly correlated with the user branch and can be used + by most modern branch predictors to predict the user branch. */ cmpl %eax, %edx - leaq (%rdi, %rax), %rax - cmovle %rcx, %rax - VZEROUPPER_RETURN - -L(null): - xorl %eax, %eax - ret -# endif - .p2align 4 -L(cross_page_boundary): - /* Save pointer before aligning as its original value is - necessary for computer return address if byte is found or - adjusting length if it is not and this is memchr. */ - movq %rdi, %rcx - /* Align data to VEC_SIZE - 1. ALGN_PTR_REG is rcx for memchr - and rdi for rawmemchr. */ - orq $(VEC_SIZE - 1), %ALGN_PTR_REG - VPCMPEQ -(VEC_SIZE - 1)(%ALGN_PTR_REG), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax -# ifndef USE_AS_RAWMEMCHR - /* Calculate length until end of page (length checked for a - match). */ - leaq 1(%ALGN_PTR_REG), %rsi - subq %RRAW_PTR_REG, %rsi - # ifdef USE_AS_WMEMCHR - /* NB: Divide bytes by 4 to get wchar_t count. 
*/ - shrl $2, %esi -# endif -# endif - /* Remove the leading bytes. */ - sarxl %ERAW_PTR_REG, %eax, %eax -# ifndef USE_AS_RAWMEMCHR - /* Check the end of data. */ - cmpq %rsi, %rdx - jbe L(first_vec_x0) + jle L(null) + addq %rdi, %rax + ret # endif - testl %eax, %eax - jz L(cross_page_continue) - tzcntl %eax, %eax - addq %RRAW_PTR_REG, %rax -L(return_vzeroupper): - ZERO_UPPER_VEC_REGISTERS_RETURN - .p2align 4 + .p2align 4,, 10 L(first_vec_x1): - tzcntl %eax, %eax + bsfl %eax, %eax incq %rdi addq %rdi, %rax VZEROUPPER_RETURN - +# ifndef USE_AS_RAWMEMCHR + /* First in aligning bytes here. */ +L(null): + xorl %eax, %eax + ret +# endif .p2align 4 L(first_vec_x2): tzcntl %eax, %eax @@ -340,7 +315,7 @@ L(first_vec_x1_check): incq %rdi addq %rdi, %rax VZEROUPPER_RETURN - .p2align 4 + .p2align 4,, 6 L(set_zero_end): xorl %eax, %eax VZEROUPPER_RETURN @@ -428,5 +403,39 @@ L(last_vec_x3): VZEROUPPER_RETURN # endif + .p2align 4 +L(cross_page_boundary): + /* Save pointer before aligning as its original value is necessary for + computing the return address if the byte is found, or adjusting the + length if it is not and this is memchr. */ + movq %rdi, %rcx + /* Align data to VEC_SIZE. ALGN_PTR_REG is rcx for memchr and rdi for + rawmemchr. */ + andq $-VEC_SIZE, %ALGN_PTR_REG + VPCMPEQ (%ALGN_PTR_REG), %ymm0, %ymm1 + vpmovmskb %ymm1, %eax +# ifndef USE_AS_RAWMEMCHR + /* Calculate length until end of page (length checked for a match). */ + leal VEC_SIZE(%ALGN_PTR_REG), %esi + subl %ERAW_PTR_REG, %esi + # ifdef USE_AS_WMEMCHR + /* NB: Divide bytes by 4 to get wchar_t count. */ + shrl $2, %esi + # endif +# endif + /* Remove the leading bytes. */ + sarxl %ERAW_PTR_REG, %eax, %eax +# ifndef USE_AS_RAWMEMCHR + /* Check the end of data. */ + cmpq %rsi, %rdx + jbe L(first_vec_x0) +# endif + testl %eax, %eax + jz L(cross_page_continue) + bsfl %eax, %eax + addq %RRAW_PTR_REG, %rax + VZEROUPPER_RETURN + + END (MEMCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
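The cmovcc-to-branch change in L(first_vec_x0) above is subtle enough to deserve a scalar restatement. A C model of the new return path (a sketch only; mask stands in for the non-zero pmovmskb result of the first vector):

```
#include <stddef.h>

/* Scalar model of the rewritten L(first_vec_x0): the bounds check is
   now a branch rather than cmovle.  Since most memchr callers branch on
   the NULL-ness of the result themselves, this branch tends to be
   predicted together with the caller's, whereas cmov always pays its
   data dependency.  */
static const char *
first_vec_x0_model (const char *buf, unsigned int len, unsigned int mask)
{
  unsigned int pos = __builtin_ctz (mask);  /* tzcntl / bsfl */
  if (pos >= len)       /* cmpl %eax, %edx; jle L(null) */
    return NULL;
  return buf + pos;     /* addq %rdi, %rax */
}
```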
* [PATCH v5 8/8] x86: Shrink code size of memchr-evex.S
  2022-06-07 4:05 ` [PATCH v5 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein
  ` (5 preceding siblings ...)
  2022-06-07 4:05 ` [PATCH v5 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein
@ 2022-06-07 4:05 ` Noah Goldstein
  6 siblings, 0 replies; 82+ messages in thread
From: Noah Goldstein @ 2022-06-07 4:05 UTC (permalink / raw)
To: libc-alpha

This is not meant as a performance optimization. The previous code was
far too liberal in aligning targets and wasted code size unnecessarily.

The total code size saving is: 64 bytes

There are no non-negligible changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 1.000

Full xcheck passes on x86_64.
---
 sysdeps/x86_64/multiarch/memchr-evex.S | 46 ++++++++++++++------------
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memchr-evex.S b/sysdeps/x86_64/multiarch/memchr-evex.S
index cfaf02907d..0fd11b7632 100644
--- a/sysdeps/x86_64/multiarch/memchr-evex.S
+++ b/sysdeps/x86_64/multiarch/memchr-evex.S
@@ -88,7 +88,7 @@
 # define PAGE_SIZE 4096

 	.section SECTION(.text),"ax",@progbits
-ENTRY (MEMCHR)
+ENTRY_P2ALIGN (MEMCHR, 6)
 # ifndef USE_AS_RAWMEMCHR
 	/* Check for zero length.  */
 	test	%RDX_LP, %RDX_LP
@@ -131,22 +131,24 @@ L(zero):
 	xorl	%eax, %eax
 	ret

-	.p2align 5
+	.p2align 4
 L(first_vec_x0):
-	/* Check if first match was before length.  */
-	tzcntl	%eax, %eax
-	xorl	%ecx, %ecx
-	cmpl	%eax, %edx
-	leaq	(%rdi, %rax, CHAR_SIZE), %rax
-	cmovle	%rcx, %rax
+	/* Check if first match was before length. NB: tzcnt has false data-
+	   dependency on destination. eax already had a data-dependency on esi
+	   so this should have no effect here.  */
+	tzcntl	%eax, %esi
+# ifdef USE_AS_WMEMCHR
+	leaq	(%rdi, %rsi, CHAR_SIZE), %rdi
+# else
+	addq	%rsi, %rdi
+# endif
+	xorl	%eax, %eax
+	cmpl	%esi, %edx
+	cmovg	%rdi, %rax
 	ret
-# else
-	/* NB: first_vec_x0 is 17 bytes which will leave
-	   cross_page_boundary (which is relatively cold) close enough
-	   to ideal alignment. So only realign L(cross_page_boundary) if
-	   rawmemchr.  */
-	.p2align 4
 # endif
+
+	.p2align 4
 L(cross_page_boundary):
 	/* Save pointer before aligning as its original value is
 	   necessary for computer return address if byte is found or
@@ -400,10 +402,14 @@ L(last_2x_vec):
 L(zero_end):
 	ret

+L(set_zero_end):
+	xorl	%eax, %eax
+	ret

 	.p2align 4
 L(first_vec_x1_check):
-	tzcntl	%eax, %eax
+	/* eax must be non-zero. Use bsfl to save code size.  */
+	bsfl	%eax, %eax
 	/* Adjust length.  */
 	subl	$-(CHAR_PER_VEC * 4), %edx
 	/* Check if match within remaining length.  */
@@ -412,9 +418,6 @@ L(first_vec_x1_check):
 	/* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count.  */
 	leaq	VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax
 	ret
-L(set_zero_end):
-	xorl	%eax, %eax
-	ret

 	.p2align 4
 L(loop_4x_vec_end):
@@ -464,7 +467,7 @@ L(loop_4x_vec_end):
 # endif
 	ret

-	.p2align 4
+	.p2align 4,, 10
 L(last_vec_x1_return):
 	tzcntl	%eax, %eax
 # if defined USE_AS_WMEMCHR || RET_OFFSET != 0
@@ -496,6 +499,7 @@ L(last_vec_x3_return):
 # endif

 # ifndef USE_AS_RAWMEMCHR
+	.p2align 4,, 5
 L(last_4x_vec_or_less_cmpeq):
 	VPCMP	$0, (VEC_SIZE * 5)(%rdi), %YMMMATCH, %k0
 	kmovd	%k0, %eax
@@ -546,7 +550,7 @@ L(last_4x_vec):
 # endif
 	andl	%ecx, %eax
 	jz	L(zero_end2)
-	tzcntl	%eax, %eax
+	bsfl	%eax, %eax
 	leaq	(VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax
 L(zero_end2):
 	ret
@@ -562,6 +566,6 @@ L(last_vec_x3):
 	leaq	(VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax
 	ret
 # endif
-
+	/* 7 bytes from next cache line.  */
 END (MEMCHR)
 #endif
--
2.34.1

^ permalink raw reply	[flat|nested] 82+ messages in thread
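Both memchr patches above substitute `bsf` for `tzcnt` on paths where the mask is already known to be non-zero. The two instructions agree for non-zero input, but `tzcnt` is encoded as `rep bsf` (one byte longer) and `bsf` leaves its destination undefined for a zero source, which is why the substitution is only done behind a zero check. A hedged C sketch of that contract, using a compiler builtin with the same "undefined on zero" restriction (illustrative only, not glibc code):

```c
/* bsf may replace tzcnt only where mask != 0 has been established,
   e.g. by a preceding testl/jz as in the assembly above.  */
#include <assert.h>

static int
first_match_index (unsigned int mask)
{
  assert (mask != 0);		/* mirrors the jz guard in the asm */
  return __builtin_ctz (mask);	/* lowest set bit, like bsfl */
}

int main (void)
{
  assert (first_match_index (0x8) == 3);
  assert (first_match_index (0x30) == 4);
  return 0;
}
```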
* [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library 2022-06-03 4:42 ` [PATCH v1 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein ` (3 preceding siblings ...) 2022-06-07 4:05 ` [PATCH v5 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein @ 2022-06-07 4:11 ` Noah Goldstein 2022-06-07 4:11 ` [PATCH v6 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein ` (7 more replies) 4 siblings, 8 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:11 UTC (permalink / raw) To: libc-alpha This patch does not touch any existing code and is only meant to be a tool for future patches so that simple source files can more easily be maintained to target multiple VEC classes. There is no difference in the objdump of libc.so before and after this patch. --- sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 34 ++++++++ sysdeps/x86_64/multiarch/avx-vecs.h | 47 +++++++++++ sysdeps/x86_64/multiarch/evex-vecs-common.h | 39 +++++++++ sysdeps/x86_64/multiarch/evex256-vecs.h | 35 ++++++++ sysdeps/x86_64/multiarch/evex512-vecs.h | 35 ++++++++ sysdeps/x86_64/multiarch/sse2-vecs.h | 47 +++++++++++ sysdeps/x86_64/multiarch/vec-macros.h | 90 +++++++++++++++++++++ 7 files changed, 327 insertions(+) create mode 100644 sysdeps/x86_64/multiarch/avx-rtm-vecs.h create mode 100644 sysdeps/x86_64/multiarch/avx-vecs.h create mode 100644 sysdeps/x86_64/multiarch/evex-vecs-common.h create mode 100644 sysdeps/x86_64/multiarch/evex256-vecs.h create mode 100644 sysdeps/x86_64/multiarch/evex512-vecs.h create mode 100644 sysdeps/x86_64/multiarch/sse2-vecs.h create mode 100644 sysdeps/x86_64/multiarch/vec-macros.h diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h new file mode 100644 index 0000000000..3f531dd47f --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h @@ -0,0 +1,34 @@ +/* Common config for AVX-RTM VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX_RTM_VECS_H +#define _AVX_RTM_VECS_H 1 + +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST + +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) + +#define USE_WITH_RTM 1 +#include "avx-vecs.h" + +#undef SECTION +#define SECTION(p) p##.avx.rtm + +#endif diff --git a/sysdeps/x86_64/multiarch/avx-vecs.h b/sysdeps/x86_64/multiarch/avx-vecs.h new file mode 100644 index 0000000000..89680f5db8 --- /dev/null +++ b/sysdeps/x86_64/multiarch/avx-vecs.h @@ -0,0 +1,47 @@ +/* Common config for AVX VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _AVX_VECS_H +#define _AVX_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 32 +#include "vec-macros.h" + +#define USE_WITH_AVX 1 +#define SECTION(p) p##.avx + +/* 4-byte mov instructions with AVX2. */ +#define MOV_SIZE 4 +/* 1 (ret) + 3 (vzeroupper). */ +#define RET_SIZE 4 +#define VZEROUPPER vzeroupper + +#define VMOVU vmovdqu +#define VMOVA vmovdqa +#define VMOVNT vmovntdq + +/* Often need to access xmm portion. */ +#define VEC_xmm VEC_any_xmm +#define VEC VEC_any_ymm + +#endif diff --git a/sysdeps/x86_64/multiarch/evex-vecs-common.h b/sysdeps/x86_64/multiarch/evex-vecs-common.h new file mode 100644 index 0000000000..99806ebcd7 --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex-vecs-common.h @@ -0,0 +1,39 @@ +/* Common config for EVEX256 and EVEX512 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX_VECS_COMMON_H +#define _EVEX_VECS_COMMON_H 1 + +#include "vec-macros.h" + +/* 6-byte mov instructions with EVEX. */ +#define MOV_SIZE 6 +/* No vzeroupper needed. */ +#define RET_SIZE 1 +#define VZEROUPPER + +#define VMOVU vmovdqu64 +#define VMOVA vmovdqa64 +#define VMOVNT vmovntdq + +#define VEC_xmm VEC_hi_xmm +#define VEC_ymm VEC_hi_ymm +#define VEC_zmm VEC_hi_zmm + +#endif diff --git a/sysdeps/x86_64/multiarch/evex256-vecs.h b/sysdeps/x86_64/multiarch/evex256-vecs.h new file mode 100644 index 0000000000..222ba46dc7 --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex256-vecs.h @@ -0,0 +1,35 @@ +/* Common config for EVEX256 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX256_VECS_H +#define _EVEX256_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 32 +#include "evex-vecs-common.h" + +#define USE_WITH_EVEX256 1 +#define SECTION(p) p##.evex + +#define VEC VEC_ymm + +#endif diff --git a/sysdeps/x86_64/multiarch/evex512-vecs.h b/sysdeps/x86_64/multiarch/evex512-vecs.h new file mode 100644 index 0000000000..d1784d5368 --- /dev/null +++ b/sysdeps/x86_64/multiarch/evex512-vecs.h @@ -0,0 +1,35 @@ +/* Common config for EVEX512 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _EVEX512_VECS_H +#define _EVEX512_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 64 +#include "evex-vecs-common.h" + +#define USE_WITH_EVEX512 1 +#define SECTION(p) p##.evex512 + +#define VEC VEC_zmm + +#endif diff --git a/sysdeps/x86_64/multiarch/sse2-vecs.h b/sysdeps/x86_64/multiarch/sse2-vecs.h new file mode 100644 index 0000000000..2b77a59d56 --- /dev/null +++ b/sysdeps/x86_64/multiarch/sse2-vecs.h @@ -0,0 +1,47 @@ +/* Common config for SSE2 VECs + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _SSE2_VECS_H +#define _SSE2_VECS_H 1 + +#ifdef VEC_SIZE +# error "Multiple VEC configs included!" +#endif + +#define VEC_SIZE 16 +#include "vec-macros.h" + +#define USE_WITH_SSE2 1 +#define SECTION(p) p + +/* 3-byte mov instructions with SSE2. */ +#define MOV_SIZE 3 +/* No vzeroupper needed. 
*/ +#define RET_SIZE 1 +#define VZEROUPPER + +#define VMOVU movups +#define VMOVA movaps +#define VMOVNT movntdq + +#define VEC_xmm VEC_any_xmm +#define VEC VEC_any_xmm + + +#endif diff --git a/sysdeps/x86_64/multiarch/vec-macros.h b/sysdeps/x86_64/multiarch/vec-macros.h new file mode 100644 index 0000000000..9f3ffecede --- /dev/null +++ b/sysdeps/x86_64/multiarch/vec-macros.h @@ -0,0 +1,90 @@ +/* Macro helpers for VEC_{type}({vec_num}) + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _VEC_MACROS_H +#define _VEC_MACROS_H 1 + +#ifndef VEC_SIZE +# error "Never include this file directly. Always include a vector config." +#endif + +/* Defines so we can use SSE2 / AVX2 / EVEX / EVEX512 encoding with same + VEC(N) values. */ +#define VEC_hi_xmm0 xmm16 +#define VEC_hi_xmm1 xmm17 +#define VEC_hi_xmm2 xmm18 +#define VEC_hi_xmm3 xmm19 +#define VEC_hi_xmm4 xmm20 +#define VEC_hi_xmm5 xmm21 +#define VEC_hi_xmm6 xmm22 +#define VEC_hi_xmm7 xmm23 +#define VEC_hi_xmm8 xmm24 +#define VEC_hi_xmm9 xmm25 +#define VEC_hi_xmm10 xmm26 +#define VEC_hi_xmm11 xmm27 +#define VEC_hi_xmm12 xmm28 +#define VEC_hi_xmm13 xmm29 +#define VEC_hi_xmm14 xmm30 +#define VEC_hi_xmm15 xmm31 + +#define VEC_hi_ymm0 ymm16 +#define VEC_hi_ymm1 ymm17 +#define VEC_hi_ymm2 ymm18 +#define VEC_hi_ymm3 ymm19 +#define VEC_hi_ymm4 ymm20 +#define VEC_hi_ymm5 ymm21 +#define VEC_hi_ymm6 ymm22 +#define VEC_hi_ymm7 ymm23 +#define VEC_hi_ymm8 ymm24 +#define VEC_hi_ymm9 ymm25 +#define VEC_hi_ymm10 ymm26 +#define VEC_hi_ymm11 ymm27 +#define VEC_hi_ymm12 ymm28 +#define VEC_hi_ymm13 ymm29 +#define VEC_hi_ymm14 ymm30 +#define VEC_hi_ymm15 ymm31 + +#define VEC_hi_zmm0 zmm16 +#define VEC_hi_zmm1 zmm17 +#define VEC_hi_zmm2 zmm18 +#define VEC_hi_zmm3 zmm19 +#define VEC_hi_zmm4 zmm20 +#define VEC_hi_zmm5 zmm21 +#define VEC_hi_zmm6 zmm22 +#define VEC_hi_zmm7 zmm23 +#define VEC_hi_zmm8 zmm24 +#define VEC_hi_zmm9 zmm25 +#define VEC_hi_zmm10 zmm26 +#define VEC_hi_zmm11 zmm27 +#define VEC_hi_zmm12 zmm28 +#define VEC_hi_zmm13 zmm29 +#define VEC_hi_zmm14 zmm30 +#define VEC_hi_zmm15 zmm31 + +#define PRIMITIVE_VEC(vec, num) vec##num + +#define VEC_any_xmm(i) PRIMITIVE_VEC(xmm, i) +#define VEC_any_ymm(i) PRIMITIVE_VEC(ymm, i) +#define VEC_any_zmm(i) PRIMITIVE_VEC(zmm, i) + +#define VEC_hi_xmm(i) PRIMITIVE_VEC(VEC_hi_xmm, i) +#define VEC_hi_ymm(i) PRIMITIVE_VEC(VEC_hi_ymm, i) +#define VEC_hi_zmm(i) PRIMITIVE_VEC(VEC_hi_zmm, i) + +#endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
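The token-pasting in vec-macros.h is easiest to see by expanding it by hand. The standalone program below copies just enough of the definitions from vec-macros.h and the config headers to print what a `VEC(N)` use becomes under two configs; it is a demonstration only, since real code includes one of the *-vecs.h headers rather than redefining these:

```c
#include <stdio.h>

/* Copied from vec-macros.h / the config headers, trimmed to one
   register for illustration.  */
#define PRIMITIVE_VEC(vec, num) vec##num
#define VEC_any_ymm(i) PRIMITIVE_VEC(ymm, i)
#define VEC_hi_ymm1 ymm17
#define VEC_hi_ymm(i) PRIMITIVE_VEC(VEC_hi_ymm, i)

#define STR_(x) #x
#define STR(x) STR_(x)

int main (void)
{
  /* avx-vecs.h sets VEC to VEC_any_ymm: plain VEX registers.  */
  puts (STR (VEC_any_ymm (1)));	/* prints "ymm1" */
  /* evex-vecs-common.h sets VEC_ymm to VEC_hi_ymm: the EVEX-only
     high registers, so no vzeroupper is needed.  */
  puts (STR (VEC_hi_ymm (1)));	/* prints "ymm17" */
  return 0;
}
```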
* [PATCH v6 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret`
  2022-06-07 4:11 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein
@ 2022-06-07 4:11 ` Noah Goldstein
  2022-06-07 4:11 ` [PATCH v6 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein
  ` (6 subsequent siblings)
  7 siblings, 0 replies; 82+ messages in thread
From: Noah Goldstein @ 2022-06-07 4:11 UTC (permalink / raw)
To: libc-alpha

The RTM vzeroupper mitigation has no way of replacing an inline
vzeroupper that is not directly before a return.

The new COND_VZEROUPPER can be useful when hoisting a vzeroupper to
save code size, for example:

```
L(foo):
	cmpl	%eax, %edx
	jz	L(bar)
	tzcntl	%eax, %eax
	addq	%rdi, %rax
	VZEROUPPER_RETURN

L(bar):
	xorl	%eax, %eax
	VZEROUPPER_RETURN
```

Can become:

```
L(foo):
	COND_VZEROUPPER
	cmpl	%eax, %edx
	jz	L(bar)
	tzcntl	%eax, %eax
	addq	%rdi, %rax
	ret

L(bar):
	xorl	%eax, %eax
	ret
```

This code does not change any existing functionality.

There is no difference in the objdump of libc.so before and after this
patch.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
---
 sysdeps/x86_64/multiarch/avx-rtm-vecs.h |  1 +
 sysdeps/x86_64/sysdep.h                 | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h
index 3f531dd47f..6ca9f5e6ba 100644
--- a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h
+++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h
@@ -20,6 +20,7 @@
 #ifndef _AVX_RTM_VECS_H
 #define _AVX_RTM_VECS_H 1

+#define COND_VZEROUPPER COND_VZEROUPPER_XTEST
 #define ZERO_UPPER_VEC_REGISTERS_RETURN \
 	ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST

diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h
index f14d50786d..4f512d5566 100644
--- a/sysdeps/x86_64/sysdep.h
+++ b/sysdeps/x86_64/sysdep.h
@@ -106,6 +106,24 @@ lose: \
 	vzeroupper; \
 	ret

+/* Can be used to replace vzeroupper that is not directly before a
+   return.  This is useful when hoisting a vzeroupper from multiple
+   return paths to decrease the total number of vzerouppers and code
+   size.  */
+#define COND_VZEROUPPER_XTEST \
+    xtest; \
+    jz 1f; \
+    vzeroall; \
+    jmp 2f; \
+1: \
+    vzeroupper; \
+2:
+
+/* In RTM define this as COND_VZEROUPPER_XTEST.  */
+#ifndef COND_VZEROUPPER
+# define COND_VZEROUPPER vzeroupper
+#endif
+
 /* Zero upper vector registers and return.  */
 #ifndef ZERO_UPPER_VEC_REGISTERS_RETURN
 # define ZERO_UPPER_VEC_REGISTERS_RETURN \
--
2.34.1

^ permalink raw reply	[flat|nested] 82+ messages in thread
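COND_VZEROUPPER_XTEST is defined in assembly, but its control flow can be rendered as a C sketch with the RTM/AVX intrinsics. This is illustrative only, not glibc code: it assumes the GCC/Clang intrinsics `_xtest`, `_mm256_zeroall` and `_mm256_zeroupper` (compile with -mavx -mrtm), and actually executing `xtest` requires RTM-capable hardware:

```c
#include <immintrin.h>

/* C rendition of COND_VZEROUPPER_XTEST: inside a transaction use
   vzeroall (as the existing *_XTEST return macros already do),
   otherwise the ordinary vzeroupper.  */
static inline void
cond_vzeroupper_xtest (void)
{
  if (_xtest ())		/* xtest; ZF clear in a transaction */
    _mm256_zeroall ();		/* vzeroall */
  else
    _mm256_zeroupper ();	/* 1: vzeroupper */
}
```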
* [PATCH v6 3/8] Benchtests: Improve memrchr benchmarks 2022-06-07 4:11 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-07 4:11 ` [PATCH v6 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein @ 2022-06-07 4:11 ` Noah Goldstein 2022-06-07 18:03 ` H.J. Lu 2022-06-07 4:11 ` [PATCH v6 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein ` (5 subsequent siblings) 7 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:11 UTC (permalink / raw) To: libc-alpha Add a second iteration for memrchr to set `pos` starting from the end of the buffer. Previously `pos` was only set relative to the begining of the buffer. This isn't really useful for memrchr because the begining of the search space is (buf + len). --- benchtests/bench-memchr.c | 110 ++++++++++++++++++++++---------------- 1 file changed, 65 insertions(+), 45 deletions(-) diff --git a/benchtests/bench-memchr.c b/benchtests/bench-memchr.c index 4d7212332f..0facda2fa0 100644 --- a/benchtests/bench-memchr.c +++ b/benchtests/bench-memchr.c @@ -76,7 +76,7 @@ do_one_test (json_ctx_t *json_ctx, impl_t *impl, const CHAR *s, int c, static void do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, - int seek_char) + int seek_char, int invert_pos) { size_t i; @@ -96,7 +96,10 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, if (pos < len) { - buf[align + pos] = seek_char; + if (invert_pos) + buf[align + len - pos] = seek_char; + else + buf[align + pos] = seek_char; buf[align + len] = -seek_char; } else @@ -109,6 +112,7 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, json_attr_uint (json_ctx, "pos", pos); json_attr_uint (json_ctx, "len", len); json_attr_uint (json_ctx, "seek_char", seek_char); + json_attr_uint (json_ctx, "invert_pos", invert_pos); json_array_begin (json_ctx, "timings"); @@ -123,6 +127,7 @@ int test_main (void) { size_t i; + int repeats; json_ctx_t json_ctx; test_init (); @@ -142,53 +147,68 @@ test_main (void) json_array_begin (&json_ctx, "results"); - for (i = 1; i < 8; ++i) + for (repeats = 0; repeats < 2; ++repeats) { - do_test (&json_ctx, 0, 16 << i, 2048, 23); - do_test (&json_ctx, i, 64, 256, 23); - do_test (&json_ctx, 0, 16 << i, 2048, 0); - do_test (&json_ctx, i, 64, 256, 0); - - do_test (&json_ctx, getpagesize () - 15, 64, 256, 0); + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, 0, 16 << i, 2048, 23, repeats); + do_test (&json_ctx, i, 64, 256, 23, repeats); + do_test (&json_ctx, 0, 16 << i, 2048, 0, repeats); + do_test (&json_ctx, i, 64, 256, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, 64, 256, 0, repeats); #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr. */ - do_test (&json_ctx, 0, i, 256, 23); - do_test (&json_ctx, 0, i, 256, 0); - do_test (&json_ctx, i, i, 256, 23); - do_test (&json_ctx, i, i, 256, 0); + /* Also test the position close to the beginning for memrchr. 
*/ + do_test (&json_ctx, 0, i, 256, 23, repeats); + do_test (&json_ctx, 0, i, 256, 0, repeats); + do_test (&json_ctx, i, i, 256, 23, repeats); + do_test (&json_ctx, i, i, 256, 0, repeats); #endif - } - for (i = 1; i < 8; ++i) - { - do_test (&json_ctx, i, i << 5, 192, 23); - do_test (&json_ctx, i, i << 5, 192, 0); - do_test (&json_ctx, i, i << 5, 256, 23); - do_test (&json_ctx, i, i << 5, 256, 0); - do_test (&json_ctx, i, i << 5, 512, 23); - do_test (&json_ctx, i, i << 5, 512, 0); - - do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23); - } - for (i = 1; i < 32; ++i) - { - do_test (&json_ctx, 0, i, i + 1, 23); - do_test (&json_ctx, 0, i, i + 1, 0); - do_test (&json_ctx, i, i, i + 1, 23); - do_test (&json_ctx, i, i, i + 1, 0); - do_test (&json_ctx, 0, i, i - 1, 23); - do_test (&json_ctx, 0, i, i - 1, 0); - do_test (&json_ctx, i, i, i - 1, 23); - do_test (&json_ctx, i, i, i - 1, 0); - - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23); - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0); - - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23); - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0); + } + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, i, i << 5, 192, 23, repeats); + do_test (&json_ctx, i, i << 5, 192, 0, repeats); + do_test (&json_ctx, i, i << 5, 256, 23, repeats); + do_test (&json_ctx, i, i << 5, 256, 0, repeats); + do_test (&json_ctx, i, i << 5, 512, 23, repeats); + do_test (&json_ctx, i, i << 5, 512, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23, repeats); + } + for (i = 1; i < 32; ++i) + { + do_test (&json_ctx, 0, i, i + 1, 23, repeats); + do_test (&json_ctx, 0, i, i + 1, 0, repeats); + do_test (&json_ctx, i, i, i + 1, 23, repeats); + do_test (&json_ctx, i, i, i + 1, 0, repeats); + do_test (&json_ctx, 0, i, i - 1, 23, repeats); + do_test (&json_ctx, 0, i, i - 1, 0, repeats); + do_test (&json_ctx, i, i, i - 1, 23, repeats); + do_test (&json_ctx, i, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0, repeats); + #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr. */ - do_test (&json_ctx, 0, 1, i + 1, 23); - do_test (&json_ctx, 0, 2, i + 1, 0); + do_test (&json_ctx, 0, 1, i + 1, 23, repeats); + do_test (&json_ctx, 0, 2, i + 1, 0, repeats); +#endif + } +#ifndef USE_AS_MEMRCHR + break; #endif } -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
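As a sanity check of what the new `invert_pos` parameter measures, here is a minimal standalone model of the benchmark's buffer setup. The buffer length, position and seek character are arbitrary values chosen for the demo, not the harness's parameters:

```c
/* invert_pos == 0 places the match `pos' bytes from the start;
   invert_pos == 1 places it `pos' bytes from the end, which is where
   memrchr's search actually begins.  memrchr is a GNU extension.  */
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>

int main (void)
{
  char buf[256];
  size_t len = sizeof buf, pos = 5;
  int seek_char = 23;

  memset (buf, ~seek_char, len);
  buf[pos] = seek_char;			/* invert_pos == 0 */
  assert (memrchr (buf, seek_char, len) == buf + pos);

  buf[pos] = ~seek_char;
  buf[len - pos] = seek_char;		/* invert_pos == 1 */
  assert (memrchr (buf, seek_char, len) == buf + len - pos);
  return 0;
}
```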
* Re: [PATCH v6 3/8] Benchtests: Improve memrchr benchmarks 2022-06-07 4:11 ` [PATCH v6 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein @ 2022-06-07 18:03 ` H.J. Lu 0 siblings, 0 replies; 82+ messages in thread From: H.J. Lu @ 2022-06-07 18:03 UTC (permalink / raw) To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > Add a second iteration for memrchr to set `pos` starting from the end > of the buffer. > > Previously `pos` was only set relative to the begining of the > buffer. This isn't really useful for memrchr because the begining > of the search space is (buf + len). > --- > benchtests/bench-memchr.c | 110 ++++++++++++++++++++++---------------- > 1 file changed, 65 insertions(+), 45 deletions(-) > > diff --git a/benchtests/bench-memchr.c b/benchtests/bench-memchr.c > index 4d7212332f..0facda2fa0 100644 > --- a/benchtests/bench-memchr.c > +++ b/benchtests/bench-memchr.c > @@ -76,7 +76,7 @@ do_one_test (json_ctx_t *json_ctx, impl_t *impl, const CHAR *s, int c, > > static void > do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, > - int seek_char) > + int seek_char, int invert_pos) > { > size_t i; > > @@ -96,7 +96,10 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, > > if (pos < len) > { > - buf[align + pos] = seek_char; > + if (invert_pos) > + buf[align + len - pos] = seek_char; > + else > + buf[align + pos] = seek_char; > buf[align + len] = -seek_char; > } > else > @@ -109,6 +112,7 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, > json_attr_uint (json_ctx, "pos", pos); > json_attr_uint (json_ctx, "len", len); > json_attr_uint (json_ctx, "seek_char", seek_char); > + json_attr_uint (json_ctx, "invert_pos", invert_pos); > > json_array_begin (json_ctx, "timings"); > > @@ -123,6 +127,7 @@ int > test_main (void) > { > size_t i; > + int repeats; > json_ctx_t json_ctx; > test_init (); > > @@ -142,53 +147,68 @@ test_main (void) > > json_array_begin (&json_ctx, "results"); > > - for (i = 1; i < 8; ++i) > + for (repeats = 0; repeats < 2; ++repeats) > { > - do_test (&json_ctx, 0, 16 << i, 2048, 23); > - do_test (&json_ctx, i, 64, 256, 23); > - do_test (&json_ctx, 0, 16 << i, 2048, 0); > - do_test (&json_ctx, i, 64, 256, 0); > - > - do_test (&json_ctx, getpagesize () - 15, 64, 256, 0); > + for (i = 1; i < 8; ++i) > + { > + do_test (&json_ctx, 0, 16 << i, 2048, 23, repeats); > + do_test (&json_ctx, i, 64, 256, 23, repeats); > + do_test (&json_ctx, 0, 16 << i, 2048, 0, repeats); > + do_test (&json_ctx, i, 64, 256, 0, repeats); > + > + do_test (&json_ctx, getpagesize () - 15, 64, 256, 0, repeats); > #ifdef USE_AS_MEMRCHR > - /* Also test the position close to the beginning for memrchr. */ > - do_test (&json_ctx, 0, i, 256, 23); > - do_test (&json_ctx, 0, i, 256, 0); > - do_test (&json_ctx, i, i, 256, 23); > - do_test (&json_ctx, i, i, 256, 0); > + /* Also test the position close to the beginning for memrchr. 
*/ > + do_test (&json_ctx, 0, i, 256, 23, repeats); > + do_test (&json_ctx, 0, i, 256, 0, repeats); > + do_test (&json_ctx, i, i, 256, 23, repeats); > + do_test (&json_ctx, i, i, 256, 0, repeats); > #endif > - } > - for (i = 1; i < 8; ++i) > - { > - do_test (&json_ctx, i, i << 5, 192, 23); > - do_test (&json_ctx, i, i << 5, 192, 0); > - do_test (&json_ctx, i, i << 5, 256, 23); > - do_test (&json_ctx, i, i << 5, 256, 0); > - do_test (&json_ctx, i, i << 5, 512, 23); > - do_test (&json_ctx, i, i << 5, 512, 0); > - > - do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23); > - } > - for (i = 1; i < 32; ++i) > - { > - do_test (&json_ctx, 0, i, i + 1, 23); > - do_test (&json_ctx, 0, i, i + 1, 0); > - do_test (&json_ctx, i, i, i + 1, 23); > - do_test (&json_ctx, i, i, i + 1, 0); > - do_test (&json_ctx, 0, i, i - 1, 23); > - do_test (&json_ctx, 0, i, i - 1, 0); > - do_test (&json_ctx, i, i, i - 1, 23); > - do_test (&json_ctx, i, i, i - 1, 0); > - > - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23); > - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0); > - > - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23); > - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0); > + } > + for (i = 1; i < 8; ++i) > + { > + do_test (&json_ctx, i, i << 5, 192, 23, repeats); > + do_test (&json_ctx, i, i << 5, 192, 0, repeats); > + do_test (&json_ctx, i, i << 5, 256, 23, repeats); > + do_test (&json_ctx, i, i << 5, 256, 0, repeats); > + do_test (&json_ctx, i, i << 5, 512, 23, repeats); > + do_test (&json_ctx, i, i << 5, 512, 0, repeats); > + > + do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23, repeats); > + } > + for (i = 1; i < 32; ++i) > + { > + do_test (&json_ctx, 0, i, i + 1, 23, repeats); > + do_test (&json_ctx, 0, i, i + 1, 0, repeats); > + do_test (&json_ctx, i, i, i + 1, 23, repeats); > + do_test (&json_ctx, i, i, i + 1, 0, repeats); > + do_test (&json_ctx, 0, i, i - 1, 23, repeats); > + do_test (&json_ctx, 0, i, i - 1, 0, repeats); > + do_test (&json_ctx, i, i, i - 1, 23, repeats); > + do_test (&json_ctx, i, i, i - 1, 0, repeats); > + > + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 23, repeats); > + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 0, repeats); > + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 23, repeats); > + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 0, repeats); > + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 23, repeats); > + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 0, repeats); > + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 23, repeats); > + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 0, repeats); > + > + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23, repeats); > + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0, repeats); > + > + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23, repeats); > + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0, repeats); > + > #ifdef USE_AS_MEMRCHR > - /* Also test the position close to the beginning for memrchr. */ > - do_test (&json_ctx, 0, 1, i + 1, 23); > - do_test (&json_ctx, 0, 2, i + 1, 0); > + do_test (&json_ctx, 0, 1, i + 1, 23, repeats); > + do_test (&json_ctx, 0, 2, i + 1, 0, repeats); > +#endif > + } > +#ifndef USE_AS_MEMRCHR > + break; > #endif > } > > -- > 2.34.1 > Please change begining to beginning in commit log. Otherwise, it is OK. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Thanks. -- H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH v6 4/8] x86: Optimize memrchr-sse2.S 2022-06-07 4:11 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-07 4:11 ` [PATCH v6 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein 2022-06-07 4:11 ` [PATCH v6 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein @ 2022-06-07 4:11 ` Noah Goldstein 2022-06-07 18:04 ` H.J. Lu 2022-06-07 4:11 ` [PATCH v6 5/8] x86: Optimize memrchr-evex.S Noah Goldstein ` (4 subsequent siblings) 7 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:11 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller lengths more. 2. optimizes target placement more carefully. 3. reuses logic more. 4. fixes up various inefficiencies in the logic. The total code size saving is: 394 bytes Geometric Mean of all benchmarks New / Old: 0.874 Regressions: 1. The page cross case is now colder, especially re-entry from the page cross case if a match is not found in the first VEC (roughly 50%). My general opinion with this patch is this is acceptable given the "coldness" of this case (less than 4%) and generally performance improvement in the other far more common cases. 2. There are some regressions 5-15% for medium/large user-arg lengths that have a match in the first VEC. This is because the logic was rewritten to optimize finds in the first VEC if the user-arg length is shorter (where we see roughly 20-50% performance improvements). It is not always the case this is a regression. My intuition is some frontend quirk is partially explaining the data although I haven't been able to find the root cause. Full xcheck passes on x86_64. --- sysdeps/x86_64/memrchr.S | 613 +++++++++++++++++++-------------------- 1 file changed, 292 insertions(+), 321 deletions(-) diff --git a/sysdeps/x86_64/memrchr.S b/sysdeps/x86_64/memrchr.S index d1a9f47911..b0dffd2ae2 100644 --- a/sysdeps/x86_64/memrchr.S +++ b/sysdeps/x86_64/memrchr.S @@ -18,362 +18,333 @@ <https://www.gnu.org/licenses/>. */ #include <sysdep.h> +#define VEC_SIZE 16 +#define PAGE_SIZE 4096 .text -ENTRY (__memrchr) - movd %esi, %xmm1 - - sub $16, %RDX_LP - jbe L(length_less16) - - punpcklbw %xmm1, %xmm1 - punpcklbw %xmm1, %xmm1 - - add %RDX_LP, %RDI_LP - pshufd $0, %xmm1, %xmm1 - - movdqu (%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - -/* Check if there is a match. 
*/ - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches0) - - sub $64, %rdi - mov %edi, %ecx - and $15, %ecx - jz L(loop_prolog) - - add $16, %rdi - add $16, %rdx - and $-16, %rdi - sub %rcx, %rdx - - .p2align 4 -L(loop_prolog): - sub $64, %rdx - jbe L(exit_loop) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16) - - movdqa (%rdi), %xmm4 - pcmpeqb %xmm1, %xmm4 - pmovmskb %xmm4, %eax - test %eax, %eax - jnz L(matches0) - - sub $64, %rdi - sub $64, %rdx - jbe L(exit_loop) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16) - - movdqa (%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches0) - - mov %edi, %ecx - and $63, %ecx - jz L(align64_loop) - - add $64, %rdi - add $64, %rdx - and $-64, %rdi - sub %rcx, %rdx - - .p2align 4 -L(align64_loop): - sub $64, %rdi - sub $64, %rdx - jbe L(exit_loop) - - movdqa (%rdi), %xmm0 - movdqa 16(%rdi), %xmm2 - movdqa 32(%rdi), %xmm3 - movdqa 48(%rdi), %xmm4 - - pcmpeqb %xmm1, %xmm0 - pcmpeqb %xmm1, %xmm2 - pcmpeqb %xmm1, %xmm3 - pcmpeqb %xmm1, %xmm4 - - pmaxub %xmm3, %xmm0 - pmaxub %xmm4, %xmm2 - pmaxub %xmm0, %xmm2 - pmovmskb %xmm2, %eax - - test %eax, %eax - jz L(align64_loop) - - pmovmskb %xmm4, %eax - test %eax, %eax - jnz L(matches48) - - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm2 - - pcmpeqb %xmm1, %xmm2 - pcmpeqb (%rdi), %xmm1 - - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches16) - - pmovmskb %xmm1, %eax - bsr %eax, %eax - - add %rdi, %rax +ENTRY_P2ALIGN(__memrchr, 6) +#ifdef __ILP32__ + /* Clear upper bits. */ + mov %RDX_LP, %RDX_LP +#endif + movd %esi, %xmm0 + + /* Get end pointer. */ + leaq (%rdx, %rdi), %rcx + + punpcklbw %xmm0, %xmm0 + punpcklwd %xmm0, %xmm0 + pshufd $0, %xmm0, %xmm0 + + /* Check if we can load 1x VEC without cross a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %ecx + jz L(page_cross) + + /* NB: This load happens regardless of whether rdx (len) is zero. Since + it doesn't cross a page and the standard gurantees any pointer have + at least one-valid byte this load must be safe. For the entire + history of the x86 memrchr implementation this has been possible so + no code "should" be relying on a zero-length check before this load. + The zero-length check is moved to the page cross case because it is + 1) pretty cold and including it pushes the hot case len <= VEC_SIZE + into 2-cache lines. */ + movups -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subq $VEC_SIZE, %rdx + ja L(more_1x_vec) +L(ret_vec_x0_test): + /* Zero-flag set if eax (src) is zero. Destination unchanged if src is + zero. */ + bsrl %eax, %eax + jz L(ret_0) + /* Check if the CHAR match is in bounds. Need to truly zero `eax` here + if out of bounds. */ + addl %edx, %eax + jl L(zero_0) + /* Since we subtracted VEC_SIZE from rdx earlier we can just add to base + ptr. 
*/ + addq %rdi, %rax +L(ret_0): ret - .p2align 4 -L(exit_loop): - add $64, %edx - cmp $32, %edx - jbe L(exit_loop_32) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16_1) - cmp $48, %edx - jbe L(return_null) - - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax - test %eax, %eax - jnz L(matches0_1) - xor %eax, %eax + .p2align 4,, 5 +L(ret_vec_x0): + bsrl %eax, %eax + leaq -(VEC_SIZE)(%rcx, %rax), %rax ret - .p2align 4 -L(exit_loop_32): - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48_1) - cmp $16, %edx - jbe L(return_null) - - pcmpeqb 32(%rdi), %xmm1 - pmovmskb %xmm1, %eax - test %eax, %eax - jnz L(matches32_1) - xor %eax, %eax + .p2align 4,, 2 +L(zero_0): + xorl %eax, %eax ret - .p2align 4 -L(matches0): - bsr %eax, %eax - add %rdi, %rax - ret - - .p2align 4 -L(matches16): - bsr %eax, %eax - lea 16(%rax, %rdi), %rax - ret - .p2align 4 -L(matches32): - bsr %eax, %eax - lea 32(%rax, %rdi), %rax + .p2align 4,, 8 +L(more_1x_vec): + testl %eax, %eax + jnz L(ret_vec_x0) + + /* Align rcx (pointer to string). */ + decq %rcx + andq $-VEC_SIZE, %rcx + + movq %rcx, %rdx + /* NB: We could consistenyl save 1-byte in this pattern with `movaps + %xmm0, %xmm1; pcmpeq IMM8(r), %xmm1; ...`. The reason against it is + it adds more frontend uops (even if the moves can be eliminated) and + some percentage of the time actual backend uops. */ + movaps -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + subq %rdi, %rdx + pmovmskb %xmm1, %eax + + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) +L(last_2x_vec): + subl $VEC_SIZE, %edx + jbe L(ret_vec_x0_test) + + testl %eax, %eax + jnz L(ret_vec_x0) + + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subl $VEC_SIZE, %edx + bsrl %eax, %eax + jz L(ret_1) + addl %edx, %eax + jl L(zero_0) + addq %rdi, %rax +L(ret_1): ret - .p2align 4 -L(matches48): - bsr %eax, %eax - lea 48(%rax, %rdi), %rax + /* Don't align. Otherwise lose 2-byte encoding in jump to L(page_cross) + causes the hot pause (length <= VEC_SIZE) to span multiple cache + lines. Naturally aligned % 16 to 8-bytes. */ +L(page_cross): + /* Zero length check. */ + testq %rdx, %rdx + jz L(zero_0) + + leaq -1(%rcx), %r8 + andq $-(VEC_SIZE), %r8 + + movaps (%r8), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %esi + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + negl %ecx + /* 32-bit shift but VEC_SIZE=16 so need to mask the shift count + explicitly. */ + andl $(VEC_SIZE - 1), %ecx + shl %cl, %esi + movzwl %si, %eax + leaq (%rdi, %rdx), %rcx + cmpq %rdi, %r8 + ja L(more_1x_vec) + subl $VEC_SIZE, %edx + bsrl %eax, %eax + jz L(ret_2) + addl %edx, %eax + jl L(zero_1) + addq %rdi, %rax +L(ret_2): ret - .p2align 4 -L(matches0_1): - bsr %eax, %eax - sub $64, %rdx - add %rax, %rdx - jl L(return_null) - add %rdi, %rax + /* Fits in aliging bytes. 
*/ +L(zero_1): + xorl %eax, %eax ret - .p2align 4 -L(matches16_1): - bsr %eax, %eax - sub $48, %rdx - add %rax, %rdx - jl L(return_null) - lea 16(%rdi, %rax), %rax + .p2align 4,, 5 +L(ret_vec_x1): + bsrl %eax, %eax + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax ret - .p2align 4 -L(matches32_1): - bsr %eax, %eax - sub $32, %rdx - add %rax, %rdx - jl L(return_null) - lea 32(%rdi, %rax), %rax - ret + .p2align 4,, 8 +L(more_2x_vec): + testl %eax, %eax + jnz L(ret_vec_x0) - .p2align 4 -L(matches48_1): - bsr %eax, %eax - sub $16, %rdx - add %rax, %rdx - jl L(return_null) - lea 48(%rdi, %rax), %rax - ret + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + testl %eax, %eax + jnz L(ret_vec_x1) - .p2align 4 -L(return_null): - xor %eax, %eax - ret - .p2align 4 -L(length_less16_offset0): - test %edx, %edx - jz L(return_null) + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax - mov %dl, %cl - pcmpeqb (%rdi), %xmm1 + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) - mov $1, %edx - sal %cl, %edx - sub $1, %edx + addl $(VEC_SIZE), %edx + jle L(ret_vec_x2_test) - pmovmskb %xmm1, %eax +L(last_vec): + testl %eax, %eax + jnz L(ret_vec_x2) - and %edx, %eax - test %eax, %eax - jz L(return_null) + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax - bsr %eax, %eax - add %rdi, %rax + subl $(VEC_SIZE), %edx + bsrl %eax, %eax + jz L(ret_3) + addl %edx, %eax + jl L(zero_2) + addq %rdi, %rax +L(ret_3): ret - .p2align 4 -L(length_less16): - punpcklbw %xmm1, %xmm1 - punpcklbw %xmm1, %xmm1 - - add $16, %edx - - pshufd $0, %xmm1, %xmm1 - - mov %edi, %ecx - and $15, %ecx - jz L(length_less16_offset0) - - mov %cl, %dh - mov %ecx, %esi - add %dl, %dh - and $-16, %rdi - - sub $16, %dh - ja L(length_less16_part2) - - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax - - sar %cl, %eax - mov %dl, %cl - - mov $1, %edx - sal %cl, %edx - sub $1, %edx - - and %edx, %eax - test %eax, %eax - jz L(return_null) - - bsr %eax, %eax - add %rdi, %rax - add %rsi, %rax + .p2align 4,, 6 +L(ret_vec_x2_test): + bsrl %eax, %eax + jz L(zero_2) + addl %edx, %eax + jl L(zero_2) + addq %rdi, %rax ret - .p2align 4 -L(length_less16_part2): - movdqa 16(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - - mov %dh, %cl - mov $1, %edx - sal %cl, %edx - sub $1, %edx - - and %edx, %eax +L(zero_2): + xorl %eax, %eax + ret - test %eax, %eax - jnz L(length_less16_part2_return) - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax + .p2align 4,, 5 +L(ret_vec_x2): + bsrl %eax, %eax + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax + ret - mov %esi, %ecx - sar %cl, %eax - test %eax, %eax - jz L(return_null) + .p2align 4,, 5 +L(ret_vec_x3): + bsrl %eax, %eax + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax + ret - bsr %eax, %eax - add %rdi, %rax - add %rsi, %rax + .p2align 4,, 8 +L(more_4x_vec): + testl %eax, %eax + jnz L(ret_vec_x2) + + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + testl %eax, %eax + jnz L(ret_vec_x3) + + addq $-(VEC_SIZE * 4), %rcx + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) + + /* Offset everything by 4x VEC_SIZE here to save a few bytes at the end + keeping the code from spilling to the next cache line. 
*/ + addq $(VEC_SIZE * 4 - 1), %rcx + andq $-(VEC_SIZE * 4), %rcx + leaq (VEC_SIZE * 4)(%rdi), %rdx + andq $-(VEC_SIZE * 4), %rdx + + .p2align 4,, 11 +L(loop_4x_vec): + movaps (VEC_SIZE * -1)(%rcx), %xmm1 + movaps (VEC_SIZE * -2)(%rcx), %xmm2 + movaps (VEC_SIZE * -3)(%rcx), %xmm3 + movaps (VEC_SIZE * -4)(%rcx), %xmm4 + pcmpeqb %xmm0, %xmm1 + pcmpeqb %xmm0, %xmm2 + pcmpeqb %xmm0, %xmm3 + pcmpeqb %xmm0, %xmm4 + + por %xmm1, %xmm2 + por %xmm3, %xmm4 + por %xmm2, %xmm4 + + pmovmskb %xmm4, %esi + testl %esi, %esi + jnz L(loop_end) + + addq $-(VEC_SIZE * 4), %rcx + cmpq %rdx, %rcx + jne L(loop_4x_vec) + + subl %edi, %edx + + /* Ends up being 1-byte nop. */ + .p2align 4,, 2 +L(last_4x_vec): + movaps -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) + + testl %eax, %eax + jnz L(ret_vec_x0) + + + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + testl %eax, %eax + jnz L(ret_vec_end) + + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subl $(VEC_SIZE * 3), %edx + ja L(last_vec) + bsrl %eax, %eax + jz L(ret_4) + addl %edx, %eax + jl L(zero_3) + addq %rdi, %rax +L(ret_4): ret - .p2align 4 -L(length_less16_part2_return): - bsr %eax, %eax - lea 16(%rax, %rdi), %rax + /* Ends up being 1-byte nop. */ + .p2align 4,, 3 +L(loop_end): + pmovmskb %xmm1, %eax + sall $16, %eax + jnz L(ret_vec_end) + + pmovmskb %xmm2, %eax + testl %eax, %eax + jnz L(ret_vec_end) + + pmovmskb %xmm3, %eax + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If ecx is non-zero + then CHAR in VEC3 and bsrq will use that position. */ + sall $16, %eax + orl %esi, %eax + bsrl %eax, %eax + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax ret -END (__memrchr) +L(ret_vec_end): + bsrl %eax, %eax + leaq (VEC_SIZE * -2)(%rax, %rcx), %rax + ret + /* Use in L(last_4x_vec). In the same cache line. This is just a spare + aligning bytes. */ +L(zero_3): + xorl %eax, %eax + ret + /* 2-bytes from next cache line. */ +END(__memrchr) weak_alias (__memrchr, memrchr) -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
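The subtlest arithmetic in the memrchr-sse2 rewrite is the tail bounds check shared by L(ret_vec_x0_test) and friends: after `subq $VEC_SIZE, %rdx`, adding the `bsrl` result to the adjusted length either yields the match's offset from the start of the buffer, or goes negative when the "match" sits in bytes that the 16-byte load read from before the buffer. A hedged C model follows; the function and parameter names are invented, and it assumes VEC_SIZE == 16:

```c
/* `mask' is the pcmpeqb/pmovmskb result for the 16 bytes ending at
   buf + len, so bit i corresponds to buf[len - 16 + i].  */
#include <assert.h>
#include <stddef.h>

static char *
ret_vec_x0_test_model (char *buf, size_t len, unsigned int mask)
{
  if (mask == 0)			/* bsrl sets ZF; jz L(ret_0) */
    return NULL;
  int bit = 31 - __builtin_clz (mask);	/* bsrl: highest match bit */
  long off = (long) len - 16 + bit;	/* addl %edx, %eax */
  if (off < 0)				/* jl L(zero_0): before buf[0] */
    return NULL;
  return buf + off;			/* addq %rdi, %rax */
}

int main (void)
{
  char buf[32];
  /* len == 8: the load covered buf[-8..7]; bit 3 is out of bounds,
     bit 12 is buf[4].  */
  assert (ret_vec_x0_test_model (buf, 8, 1u << 3) == NULL);
  assert (ret_vec_x0_test_model (buf, 8, 1u << 12) == buf + 4);
  assert (ret_vec_x0_test_model (buf, 32, 1u << 15) == buf + 31);
  return 0;
}
```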
* Re: [PATCH v6 4/8] x86: Optimize memrchr-sse2.S 2022-06-07 4:11 ` [PATCH v6 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein @ 2022-06-07 18:04 ` H.J. Lu 2022-07-14 2:19 ` Sunil Pandey 0 siblings, 1 reply; 82+ messages in thread From: H.J. Lu @ 2022-06-07 18:04 UTC (permalink / raw) To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > The new code: > 1. prioritizes smaller lengths more. > 2. optimizes target placement more carefully. > 3. reuses logic more. > 4. fixes up various inefficiencies in the logic. > > The total code size saving is: 394 bytes > Geometric Mean of all benchmarks New / Old: 0.874 > > Regressions: > 1. The page cross case is now colder, especially re-entry from the > page cross case if a match is not found in the first VEC > (roughly 50%). My general opinion with this patch is this is > acceptable given the "coldness" of this case (less than 4%) and > generally performance improvement in the other far more common > cases. > > 2. There are some regressions 5-15% for medium/large user-arg > lengths that have a match in the first VEC. This is because the > logic was rewritten to optimize finds in the first VEC if the > user-arg length is shorter (where we see roughly 20-50% > performance improvements). It is not always the case this is a > regression. My intuition is some frontend quirk is partially > explaining the data although I haven't been able to find the > root cause. > > Full xcheck passes on x86_64. > --- > sysdeps/x86_64/memrchr.S | 613 +++++++++++++++++++-------------------- > 1 file changed, 292 insertions(+), 321 deletions(-) > > diff --git a/sysdeps/x86_64/memrchr.S b/sysdeps/x86_64/memrchr.S > index d1a9f47911..b0dffd2ae2 100644 > --- a/sysdeps/x86_64/memrchr.S > +++ b/sysdeps/x86_64/memrchr.S > @@ -18,362 +18,333 @@ > <https://www.gnu.org/licenses/>. */ > > #include <sysdep.h> > +#define VEC_SIZE 16 > +#define PAGE_SIZE 4096 > > .text > -ENTRY (__memrchr) > - movd %esi, %xmm1 > - > - sub $16, %RDX_LP > - jbe L(length_less16) > - > - punpcklbw %xmm1, %xmm1 > - punpcklbw %xmm1, %xmm1 > - > - add %RDX_LP, %RDI_LP > - pshufd $0, %xmm1, %xmm1 > - > - movdqu (%rdi), %xmm0 > - pcmpeqb %xmm1, %xmm0 > - > -/* Check if there is a match. 
*/ > - pmovmskb %xmm0, %eax > - test %eax, %eax > - jnz L(matches0) > - > - sub $64, %rdi > - mov %edi, %ecx > - and $15, %ecx > - jz L(loop_prolog) > - > - add $16, %rdi > - add $16, %rdx > - and $-16, %rdi > - sub %rcx, %rdx > - > - .p2align 4 > -L(loop_prolog): > - sub $64, %rdx > - jbe L(exit_loop) > - > - movdqa 48(%rdi), %xmm0 > - pcmpeqb %xmm1, %xmm0 > - pmovmskb %xmm0, %eax > - test %eax, %eax > - jnz L(matches48) > - > - movdqa 32(%rdi), %xmm2 > - pcmpeqb %xmm1, %xmm2 > - pmovmskb %xmm2, %eax > - test %eax, %eax > - jnz L(matches32) > - > - movdqa 16(%rdi), %xmm3 > - pcmpeqb %xmm1, %xmm3 > - pmovmskb %xmm3, %eax > - test %eax, %eax > - jnz L(matches16) > - > - movdqa (%rdi), %xmm4 > - pcmpeqb %xmm1, %xmm4 > - pmovmskb %xmm4, %eax > - test %eax, %eax > - jnz L(matches0) > - > - sub $64, %rdi > - sub $64, %rdx > - jbe L(exit_loop) > - > - movdqa 48(%rdi), %xmm0 > - pcmpeqb %xmm1, %xmm0 > - pmovmskb %xmm0, %eax > - test %eax, %eax > - jnz L(matches48) > - > - movdqa 32(%rdi), %xmm2 > - pcmpeqb %xmm1, %xmm2 > - pmovmskb %xmm2, %eax > - test %eax, %eax > - jnz L(matches32) > - > - movdqa 16(%rdi), %xmm3 > - pcmpeqb %xmm1, %xmm3 > - pmovmskb %xmm3, %eax > - test %eax, %eax > - jnz L(matches16) > - > - movdqa (%rdi), %xmm3 > - pcmpeqb %xmm1, %xmm3 > - pmovmskb %xmm3, %eax > - test %eax, %eax > - jnz L(matches0) > - > - mov %edi, %ecx > - and $63, %ecx > - jz L(align64_loop) > - > - add $64, %rdi > - add $64, %rdx > - and $-64, %rdi > - sub %rcx, %rdx > - > - .p2align 4 > -L(align64_loop): > - sub $64, %rdi > - sub $64, %rdx > - jbe L(exit_loop) > - > - movdqa (%rdi), %xmm0 > - movdqa 16(%rdi), %xmm2 > - movdqa 32(%rdi), %xmm3 > - movdqa 48(%rdi), %xmm4 > - > - pcmpeqb %xmm1, %xmm0 > - pcmpeqb %xmm1, %xmm2 > - pcmpeqb %xmm1, %xmm3 > - pcmpeqb %xmm1, %xmm4 > - > - pmaxub %xmm3, %xmm0 > - pmaxub %xmm4, %xmm2 > - pmaxub %xmm0, %xmm2 > - pmovmskb %xmm2, %eax > - > - test %eax, %eax > - jz L(align64_loop) > - > - pmovmskb %xmm4, %eax > - test %eax, %eax > - jnz L(matches48) > - > - pmovmskb %xmm3, %eax > - test %eax, %eax > - jnz L(matches32) > - > - movdqa 16(%rdi), %xmm2 > - > - pcmpeqb %xmm1, %xmm2 > - pcmpeqb (%rdi), %xmm1 > - > - pmovmskb %xmm2, %eax > - test %eax, %eax > - jnz L(matches16) > - > - pmovmskb %xmm1, %eax > - bsr %eax, %eax > - > - add %rdi, %rax > +ENTRY_P2ALIGN(__memrchr, 6) > +#ifdef __ILP32__ > + /* Clear upper bits. */ > + mov %RDX_LP, %RDX_LP > +#endif > + movd %esi, %xmm0 > + > + /* Get end pointer. */ > + leaq (%rdx, %rdi), %rcx > + > + punpcklbw %xmm0, %xmm0 > + punpcklwd %xmm0, %xmm0 > + pshufd $0, %xmm0, %xmm0 > + > + /* Check if we can load 1x VEC without cross a page. */ > + testl $(PAGE_SIZE - VEC_SIZE), %ecx > + jz L(page_cross) > + > + /* NB: This load happens regardless of whether rdx (len) is zero. Since > + it doesn't cross a page and the standard gurantees any pointer have > + at least one-valid byte this load must be safe. For the entire > + history of the x86 memrchr implementation this has been possible so > + no code "should" be relying on a zero-length check before this load. > + The zero-length check is moved to the page cross case because it is > + 1) pretty cold and including it pushes the hot case len <= VEC_SIZE > + into 2-cache lines. */ > + movups -(VEC_SIZE)(%rcx), %xmm1 > + pcmpeqb %xmm0, %xmm1 > + pmovmskb %xmm1, %eax > + > + subq $VEC_SIZE, %rdx > + ja L(more_1x_vec) > +L(ret_vec_x0_test): > + /* Zero-flag set if eax (src) is zero. Destination unchanged if src is > + zero. 
*/ > + bsrl %eax, %eax > + jz L(ret_0) > + /* Check if the CHAR match is in bounds. Need to truly zero `eax` here > + if out of bounds. */ > + addl %edx, %eax > + jl L(zero_0) > + /* Since we subtracted VEC_SIZE from rdx earlier we can just add to base > + ptr. */ > + addq %rdi, %rax > +L(ret_0): > ret > > - .p2align 4 > -L(exit_loop): > - add $64, %edx > - cmp $32, %edx > - jbe L(exit_loop_32) > - > - movdqa 48(%rdi), %xmm0 > - pcmpeqb %xmm1, %xmm0 > - pmovmskb %xmm0, %eax > - test %eax, %eax > - jnz L(matches48) > - > - movdqa 32(%rdi), %xmm2 > - pcmpeqb %xmm1, %xmm2 > - pmovmskb %xmm2, %eax > - test %eax, %eax > - jnz L(matches32) > - > - movdqa 16(%rdi), %xmm3 > - pcmpeqb %xmm1, %xmm3 > - pmovmskb %xmm3, %eax > - test %eax, %eax > - jnz L(matches16_1) > - cmp $48, %edx > - jbe L(return_null) > - > - pcmpeqb (%rdi), %xmm1 > - pmovmskb %xmm1, %eax > - test %eax, %eax > - jnz L(matches0_1) > - xor %eax, %eax > + .p2align 4,, 5 > +L(ret_vec_x0): > + bsrl %eax, %eax > + leaq -(VEC_SIZE)(%rcx, %rax), %rax > ret > > - .p2align 4 > -L(exit_loop_32): > - movdqa 48(%rdi), %xmm0 > - pcmpeqb %xmm1, %xmm0 > - pmovmskb %xmm0, %eax > - test %eax, %eax > - jnz L(matches48_1) > - cmp $16, %edx > - jbe L(return_null) > - > - pcmpeqb 32(%rdi), %xmm1 > - pmovmskb %xmm1, %eax > - test %eax, %eax > - jnz L(matches32_1) > - xor %eax, %eax > + .p2align 4,, 2 > +L(zero_0): > + xorl %eax, %eax > ret > > - .p2align 4 > -L(matches0): > - bsr %eax, %eax > - add %rdi, %rax > - ret > - > - .p2align 4 > -L(matches16): > - bsr %eax, %eax > - lea 16(%rax, %rdi), %rax > - ret > > - .p2align 4 > -L(matches32): > - bsr %eax, %eax > - lea 32(%rax, %rdi), %rax > + .p2align 4,, 8 > +L(more_1x_vec): > + testl %eax, %eax > + jnz L(ret_vec_x0) > + > + /* Align rcx (pointer to string). */ > + decq %rcx > + andq $-VEC_SIZE, %rcx > + > + movq %rcx, %rdx > + /* NB: We could consistently save 1-byte in this pattern with `movaps > + %xmm0, %xmm1; pcmpeq IMM8(r), %xmm1; ...`. The reason against it is > + it adds more frontend uops (even if the moves can be eliminated) and > + some percentage of the time actual backend uops. */ > + movaps -(VEC_SIZE)(%rcx), %xmm1 > + pcmpeqb %xmm0, %xmm1 > + subq %rdi, %rdx > + pmovmskb %xmm1, %eax > + > + cmpq $(VEC_SIZE * 2), %rdx > + ja L(more_2x_vec) > +L(last_2x_vec): > + subl $VEC_SIZE, %edx > + jbe L(ret_vec_x0_test) > + > + testl %eax, %eax > + jnz L(ret_vec_x0) > + > + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 > + pcmpeqb %xmm0, %xmm1 > + pmovmskb %xmm1, %eax > + > + subl $VEC_SIZE, %edx > + bsrl %eax, %eax > + jz L(ret_1) > + addl %edx, %eax > + jl L(zero_0) > + addq %rdi, %rax > +L(ret_1): > ret > > - .p2align 4 > -L(matches48): > - bsr %eax, %eax > - lea 48(%rax, %rdi), %rax > + /* Don't align. Otherwise losing the 2-byte encoding of the jump to > + L(page_cross) causes the hot path (length <= VEC_SIZE) to span multiple > + cache lines. Naturally aligned % 16 to 8-bytes. */ > +L(page_cross): > + /* Zero length check. */ > + testq %rdx, %rdx > + jz L(zero_0) > + > + leaq -1(%rcx), %r8 > + andq $-(VEC_SIZE), %r8 > + > + movaps (%r8), %xmm1 > + pcmpeqb %xmm0, %xmm1 > + pmovmskb %xmm1, %esi > + /* Shift out negative alignment (because we are starting from endptr and > + working backwards). */ > + negl %ecx > + /* 32-bit shift but VEC_SIZE=16 so need to mask the shift count > + explicitly. 
*/ > + andl $(VEC_SIZE - 1), %ecx > + shl %cl, %esi > + movzwl %si, %eax > + leaq (%rdi, %rdx), %rcx > + cmpq %rdi, %r8 > + ja L(more_1x_vec) > + subl $VEC_SIZE, %edx > + bsrl %eax, %eax > + jz L(ret_2) > + addl %edx, %eax > + jl L(zero_1) > + addq %rdi, %rax > +L(ret_2): > ret > > - .p2align 4 > -L(matches0_1): > - bsr %eax, %eax > - sub $64, %rdx > - add %rax, %rdx > - jl L(return_null) > - add %rdi, %rax > + /* Fits in aligning bytes. */ > +L(zero_1): > + xorl %eax, %eax > ret > > - .p2align 4 > -L(matches16_1): > - bsr %eax, %eax > - sub $48, %rdx > - add %rax, %rdx > - jl L(return_null) > - lea 16(%rdi, %rax), %rax > + .p2align 4,, 5 > +L(ret_vec_x1): > + bsrl %eax, %eax > + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax > ret > > - .p2align 4 > -L(matches32_1): > - bsr %eax, %eax > - sub $32, %rdx > - add %rax, %rdx > - jl L(return_null) > - lea 32(%rdi, %rax), %rax > - ret > + .p2align 4,, 8 > +L(more_2x_vec): > + testl %eax, %eax > + jnz L(ret_vec_x0) > > - .p2align 4 > -L(matches48_1): > - bsr %eax, %eax > - sub $16, %rdx > - add %rax, %rdx > - jl L(return_null) > - lea 48(%rdi, %rax), %rax > - ret > + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 > + pcmpeqb %xmm0, %xmm1 > + pmovmskb %xmm1, %eax > + testl %eax, %eax > + jnz L(ret_vec_x1) > > - .p2align 4 > -L(return_null): > - xor %eax, %eax > - ret > > - .p2align 4 > -L(length_less16_offset0): > - test %edx, %edx > - jz L(return_null) > + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 > + pcmpeqb %xmm0, %xmm1 > + pmovmskb %xmm1, %eax > > - mov %dl, %cl > - pcmpeqb (%rdi), %xmm1 > + subq $(VEC_SIZE * 4), %rdx > + ja L(more_4x_vec) > > - mov $1, %edx > - sal %cl, %edx > - sub $1, %edx > + addl $(VEC_SIZE), %edx > + jle L(ret_vec_x2_test) > > - pmovmskb %xmm1, %eax > +L(last_vec): > + testl %eax, %eax > + jnz L(ret_vec_x2) > > - and %edx, %eax > - test %eax, %eax > - jz L(return_null) > + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 > + pcmpeqb %xmm0, %xmm1 > + pmovmskb %xmm1, %eax > > - bsr %eax, %eax > - add %rdi, %rax > + subl $(VEC_SIZE), %edx > + bsrl %eax, %eax > + jz L(ret_3) > + addl %edx, %eax > + jl L(zero_2) > + addq %rdi, %rax > +L(ret_3): > ret > > - .p2align 4 > -L(length_less16): > - punpcklbw %xmm1, %xmm1 > - punpcklbw %xmm1, %xmm1 > - > - add $16, %edx > - > - pshufd $0, %xmm1, %xmm1 > - > - mov %edi, %ecx > - and $15, %ecx > - jz L(length_less16_offset0) > - > - mov %cl, %dh > - mov %ecx, %esi > - add %dl, %dh > - and $-16, %rdi > - > - sub $16, %dh > - ja L(length_less16_part2) > - > - pcmpeqb (%rdi), %xmm1 > - pmovmskb %xmm1, %eax > - > - sar %cl, %eax > - mov %dl, %cl > - > - mov $1, %edx > - sal %cl, %edx > - sub $1, %edx > - > - and %edx, %eax > - test %eax, %eax > - jz L(return_null) > - > - bsr %eax, %eax > - add %rdi, %rax > - add %rsi, %rax > + .p2align 4,, 6 > +L(ret_vec_x2_test): > + bsrl %eax, %eax > + jz L(zero_2) > + addl %edx, %eax > + jl L(zero_2) > + addq %rdi, %rax > ret > > - .p2align 4 > -L(length_less16_part2): > - movdqa 16(%rdi), %xmm2 > - pcmpeqb %xmm1, %xmm2 > - pmovmskb %xmm2, %eax > - > - mov %dh, %cl > - mov $1, %edx > - sal %cl, %edx > - sub $1, %edx > - > - and %edx, %eax > +L(zero_2): > + xorl %eax, %eax > + ret > > - test %eax, %eax > - jnz L(length_less16_part2_return) > > - pcmpeqb (%rdi), %xmm1 > - pmovmskb %xmm1, %eax > + .p2align 4,, 5 > +L(ret_vec_x2): > + bsrl %eax, %eax > + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax > + ret > > - mov %esi, %ecx > - sar %cl, %eax > - test %eax, %eax > - jz L(return_null) > + .p2align 4,, 5 > +L(ret_vec_x3): > + bsrl %eax, %eax > + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax > + ret > > - 
bsr %eax, %eax > - add %rdi, %rax > - add %rsi, %rax > + .p2align 4,, 8 > +L(more_4x_vec): > + testl %eax, %eax > + jnz L(ret_vec_x2) > + > + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 > + pcmpeqb %xmm0, %xmm1 > + pmovmskb %xmm1, %eax > + > + testl %eax, %eax > + jnz L(ret_vec_x3) > + > + addq $-(VEC_SIZE * 4), %rcx > + cmpq $(VEC_SIZE * 4), %rdx > + jbe L(last_4x_vec) > + > + /* Offset everything by 4x VEC_SIZE here to save a few bytes at the end > + keeping the code from spilling to the next cache line. */ > + addq $(VEC_SIZE * 4 - 1), %rcx > + andq $-(VEC_SIZE * 4), %rcx > + leaq (VEC_SIZE * 4)(%rdi), %rdx > + andq $-(VEC_SIZE * 4), %rdx > + > + .p2align 4,, 11 > +L(loop_4x_vec): > + movaps (VEC_SIZE * -1)(%rcx), %xmm1 > + movaps (VEC_SIZE * -2)(%rcx), %xmm2 > + movaps (VEC_SIZE * -3)(%rcx), %xmm3 > + movaps (VEC_SIZE * -4)(%rcx), %xmm4 > + pcmpeqb %xmm0, %xmm1 > + pcmpeqb %xmm0, %xmm2 > + pcmpeqb %xmm0, %xmm3 > + pcmpeqb %xmm0, %xmm4 > + > + por %xmm1, %xmm2 > + por %xmm3, %xmm4 > + por %xmm2, %xmm4 > + > + pmovmskb %xmm4, %esi > + testl %esi, %esi > + jnz L(loop_end) > + > + addq $-(VEC_SIZE * 4), %rcx > + cmpq %rdx, %rcx > + jne L(loop_4x_vec) > + > + subl %edi, %edx > + > + /* Ends up being 1-byte nop. */ > + .p2align 4,, 2 > +L(last_4x_vec): > + movaps -(VEC_SIZE)(%rcx), %xmm1 > + pcmpeqb %xmm0, %xmm1 > + pmovmskb %xmm1, %eax > + > + cmpl $(VEC_SIZE * 2), %edx > + jbe L(last_2x_vec) > + > + testl %eax, %eax > + jnz L(ret_vec_x0) > + > + > + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 > + pcmpeqb %xmm0, %xmm1 > + pmovmskb %xmm1, %eax > + > + testl %eax, %eax > + jnz L(ret_vec_end) > + > + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 > + pcmpeqb %xmm0, %xmm1 > + pmovmskb %xmm1, %eax > + > + subl $(VEC_SIZE * 3), %edx > + ja L(last_vec) > + bsrl %eax, %eax > + jz L(ret_4) > + addl %edx, %eax > + jl L(zero_3) > + addq %rdi, %rax > +L(ret_4): > ret > > - .p2align 4 > -L(length_less16_part2_return): > - bsr %eax, %eax > - lea 16(%rax, %rdi), %rax > + /* Ends up being 1-byte nop. */ > + .p2align 4,, 3 > +L(loop_end): > + pmovmskb %xmm1, %eax > + sall $16, %eax > + jnz L(ret_vec_end) > + > + pmovmskb %xmm2, %eax > + testl %eax, %eax > + jnz L(ret_vec_end) > + > + pmovmskb %xmm3, %eax > + /* Combine last 2 VEC matches. If eax (VEC3) is zero (no CHAR in VEC3) > + then it won't affect the result in esi (VEC4). If eax is non-zero > + then CHAR is in VEC3 and bsrl will use that position. */ > + sall $16, %eax > + orl %esi, %eax > + bsrl %eax, %eax > + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax > ret > > -END (__memrchr) > +L(ret_vec_end): > + bsrl %eax, %eax > + leaq (VEC_SIZE * -2)(%rax, %rcx), %rax > + ret > + /* Used in L(last_4x_vec). In the same cache line. These are just spare > + aligning bytes. */ > +L(zero_3): > + xorl %eax, %eax > + ret > + /* 2-bytes from next cache line. */ > +END(__memrchr) > weak_alias (__memrchr, memrchr) > -- > 2.34.1 > LGTM. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Thanks. -- H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
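The pattern both the old and new SSE2 code are built on is pcmpeqb/pmovmskb: compare a 16-byte vector against the broadcast CHAR, collapse the comparison result to a 16-bit mask, and let bsr pick the highest set bit, i.e. the last matching byte of that chunk. A minimal C sketch of that idea using SSE2 intrinsics -- this is not the glibc implementation, memrchr_sketch is an illustrative name, and all of the patch's alignment and page-cross handling is omitted:

  #include <emmintrin.h>	/* SSE2 intrinsics.  */
  #include <stddef.h>

  static void *
  memrchr_sketch (const void *s, int c, size_t n)
  {
    const unsigned char *p = (const unsigned char *) s + n;
    const __m128i match = _mm_set1_epi8 ((char) c);	/* Broadcast CHAR.  */

    while (n >= 16)
      {
        p -= 16;
        n -= 16;
        __m128i cmp
          = _mm_cmpeq_epi8 (_mm_loadu_si128 ((const __m128i *) p), match);
        unsigned int mask = (unsigned int) _mm_movemask_epi8 (cmp);
        if (mask != 0)
          /* Highest set bit == last matching byte of this chunk; the
             assembly computes this with bsrl.  */
          return (void *) (p + (31 - __builtin_clz (mask)));
      }
    /* Scalar tail for the remaining n < 16 bytes.  */
    while (n--)
      if (*--p == (unsigned char) c)
        return (void *) p;
    return NULL;
  }

Everything the sketch leaves out is where the patch earns its speedup: the single potentially-unaligned load for len <= VEC_SIZE, the aligned backwards 4x-VEC loop, and the careful cache-line placement of the hot exit paths.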
* Re: [PATCH v6 4/8] x86: Optimize memrchr-sse2.S 2022-06-07 18:04 ` H.J. Lu @ 2022-07-14 2:19 ` Sunil Pandey 0 siblings, 0 replies; 82+ messages in thread From: Sunil Pandey @ 2022-07-14 2:19 UTC (permalink / raw) To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library On Tue, Jun 7, 2022 at 11:07 AM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > The new code: > > 1. prioritizes smaller lengths more. > > 2. optimizes target placement more carefully. > > 3. reuses logic more. > > 4. fixes up various inefficiencies in the logic. > > > > The total code size saving is: 394 bytes > > Geometric Mean of all benchmarks New / Old: 0.874 > > > > Regressions: > > 1. The page cross case is now colder, especially re-entry from the > > page cross case if a match is not found in the first VEC > > (roughly 50%). My general opinion with this patch is this is > > acceptable given the "coldness" of this case (less than 4%) and > > generally performance improvement in the other far more common > > cases. > > > > 2. There are some regressions 5-15% for medium/large user-arg > > lengths that have a match in the first VEC. This is because the > > logic was rewritten to optimize finds in the first VEC if the > > user-arg length is shorter (where we see roughly 20-50% > > performance improvements). It is not always the case this is a > > regression. My intuition is some frontend quirk is partially > > explaining the data although I haven't been able to find the > > root cause. > > > > Full xcheck passes on x86_64. > > --- > > sysdeps/x86_64/memrchr.S | 613 +++++++++++++++++++-------------------- > > 1 file changed, 292 insertions(+), 321 deletions(-) > > > > diff --git a/sysdeps/x86_64/memrchr.S b/sysdeps/x86_64/memrchr.S > > index d1a9f47911..b0dffd2ae2 100644 > > --- a/sysdeps/x86_64/memrchr.S > > +++ b/sysdeps/x86_64/memrchr.S > > @@ -18,362 +18,333 @@ > > <https://www.gnu.org/licenses/>. */ > > > > #include <sysdep.h> > > +#define VEC_SIZE 16 > > +#define PAGE_SIZE 4096 > > > > .text > > -ENTRY (__memrchr) > > - movd %esi, %xmm1 > > - > > - sub $16, %RDX_LP > > - jbe L(length_less16) > > - > > - punpcklbw %xmm1, %xmm1 > > - punpcklbw %xmm1, %xmm1 > > - > > - add %RDX_LP, %RDI_LP > > - pshufd $0, %xmm1, %xmm1 > > - > > - movdqu (%rdi), %xmm0 > > - pcmpeqb %xmm1, %xmm0 > > - > > -/* Check if there is a match. 
*/ > > - pmovmskb %xmm0, %eax > > - test %eax, %eax > > - jnz L(matches0) > > - > > - sub $64, %rdi > > - mov %edi, %ecx > > - and $15, %ecx > > - jz L(loop_prolog) > > - > > - add $16, %rdi > > - add $16, %rdx > > - and $-16, %rdi > > - sub %rcx, %rdx > > - > > - .p2align 4 > > -L(loop_prolog): > > - sub $64, %rdx > > - jbe L(exit_loop) > > - > > - movdqa 48(%rdi), %xmm0 > > - pcmpeqb %xmm1, %xmm0 > > - pmovmskb %xmm0, %eax > > - test %eax, %eax > > - jnz L(matches48) > > - > > - movdqa 32(%rdi), %xmm2 > > - pcmpeqb %xmm1, %xmm2 > > - pmovmskb %xmm2, %eax > > - test %eax, %eax > > - jnz L(matches32) > > - > > - movdqa 16(%rdi), %xmm3 > > - pcmpeqb %xmm1, %xmm3 > > - pmovmskb %xmm3, %eax > > - test %eax, %eax > > - jnz L(matches16) > > - > > - movdqa (%rdi), %xmm4 > > - pcmpeqb %xmm1, %xmm4 > > - pmovmskb %xmm4, %eax > > - test %eax, %eax > > - jnz L(matches0) > > - > > - sub $64, %rdi > > - sub $64, %rdx > > - jbe L(exit_loop) > > - > > - movdqa 48(%rdi), %xmm0 > > - pcmpeqb %xmm1, %xmm0 > > - pmovmskb %xmm0, %eax > > - test %eax, %eax > > - jnz L(matches48) > > - > > - movdqa 32(%rdi), %xmm2 > > - pcmpeqb %xmm1, %xmm2 > > - pmovmskb %xmm2, %eax > > - test %eax, %eax > > - jnz L(matches32) > > - > > - movdqa 16(%rdi), %xmm3 > > - pcmpeqb %xmm1, %xmm3 > > - pmovmskb %xmm3, %eax > > - test %eax, %eax > > - jnz L(matches16) > > - > > - movdqa (%rdi), %xmm3 > > - pcmpeqb %xmm1, %xmm3 > > - pmovmskb %xmm3, %eax > > - test %eax, %eax > > - jnz L(matches0) > > - > > - mov %edi, %ecx > > - and $63, %ecx > > - jz L(align64_loop) > > - > > - add $64, %rdi > > - add $64, %rdx > > - and $-64, %rdi > > - sub %rcx, %rdx > > - > > - .p2align 4 > > -L(align64_loop): > > - sub $64, %rdi > > - sub $64, %rdx > > - jbe L(exit_loop) > > - > > - movdqa (%rdi), %xmm0 > > - movdqa 16(%rdi), %xmm2 > > - movdqa 32(%rdi), %xmm3 > > - movdqa 48(%rdi), %xmm4 > > - > > - pcmpeqb %xmm1, %xmm0 > > - pcmpeqb %xmm1, %xmm2 > > - pcmpeqb %xmm1, %xmm3 > > - pcmpeqb %xmm1, %xmm4 > > - > > - pmaxub %xmm3, %xmm0 > > - pmaxub %xmm4, %xmm2 > > - pmaxub %xmm0, %xmm2 > > - pmovmskb %xmm2, %eax > > - > > - test %eax, %eax > > - jz L(align64_loop) > > - > > - pmovmskb %xmm4, %eax > > - test %eax, %eax > > - jnz L(matches48) > > - > > - pmovmskb %xmm3, %eax > > - test %eax, %eax > > - jnz L(matches32) > > - > > - movdqa 16(%rdi), %xmm2 > > - > > - pcmpeqb %xmm1, %xmm2 > > - pcmpeqb (%rdi), %xmm1 > > - > > - pmovmskb %xmm2, %eax > > - test %eax, %eax > > - jnz L(matches16) > > - > > - pmovmskb %xmm1, %eax > > - bsr %eax, %eax > > - > > - add %rdi, %rax > > +ENTRY_P2ALIGN(__memrchr, 6) > > +#ifdef __ILP32__ > > + /* Clear upper bits. */ > > + mov %RDX_LP, %RDX_LP > > +#endif > > + movd %esi, %xmm0 > > + > > + /* Get end pointer. */ > > + leaq (%rdx, %rdi), %rcx > > + > > + punpcklbw %xmm0, %xmm0 > > + punpcklwd %xmm0, %xmm0 > > + pshufd $0, %xmm0, %xmm0 > > + > > + /* Check if we can load 1x VEC without cross a page. */ > > + testl $(PAGE_SIZE - VEC_SIZE), %ecx > > + jz L(page_cross) > > + > > + /* NB: This load happens regardless of whether rdx (len) is zero. Since > > + it doesn't cross a page and the standard gurantees any pointer have > > + at least one-valid byte this load must be safe. For the entire > > + history of the x86 memrchr implementation this has been possible so > > + no code "should" be relying on a zero-length check before this load. > > + The zero-length check is moved to the page cross case because it is > > + 1) pretty cold and including it pushes the hot case len <= VEC_SIZE > > + into 2-cache lines. 
*/ > > + movups -(VEC_SIZE)(%rcx), %xmm1 > > + pcmpeqb %xmm0, %xmm1 > > + pmovmskb %xmm1, %eax > > + > > + subq $VEC_SIZE, %rdx > > + ja L(more_1x_vec) > > +L(ret_vec_x0_test): > > + /* Zero-flag set if eax (src) is zero. Destination unchanged if src is > > + zero. */ > > + bsrl %eax, %eax > > + jz L(ret_0) > > + /* Check if the CHAR match is in bounds. Need to truly zero `eax` here > > + if out of bounds. */ > > + addl %edx, %eax > > + jl L(zero_0) > > + /* Since we subtracted VEC_SIZE from rdx earlier we can just add to base > > + ptr. */ > > + addq %rdi, %rax > > +L(ret_0): > > ret > > > > - .p2align 4 > > -L(exit_loop): > > - add $64, %edx > > - cmp $32, %edx > > - jbe L(exit_loop_32) > > - > > - movdqa 48(%rdi), %xmm0 > > - pcmpeqb %xmm1, %xmm0 > > - pmovmskb %xmm0, %eax > > - test %eax, %eax > > - jnz L(matches48) > > - > > - movdqa 32(%rdi), %xmm2 > > - pcmpeqb %xmm1, %xmm2 > > - pmovmskb %xmm2, %eax > > - test %eax, %eax > > - jnz L(matches32) > > - > > - movdqa 16(%rdi), %xmm3 > > - pcmpeqb %xmm1, %xmm3 > > - pmovmskb %xmm3, %eax > > - test %eax, %eax > > - jnz L(matches16_1) > > - cmp $48, %edx > > - jbe L(return_null) > > - > > - pcmpeqb (%rdi), %xmm1 > > - pmovmskb %xmm1, %eax > > - test %eax, %eax > > - jnz L(matches0_1) > > - xor %eax, %eax > > + .p2align 4,, 5 > > +L(ret_vec_x0): > > + bsrl %eax, %eax > > + leaq -(VEC_SIZE)(%rcx, %rax), %rax > > ret > > > > - .p2align 4 > > -L(exit_loop_32): > > - movdqa 48(%rdi), %xmm0 > > - pcmpeqb %xmm1, %xmm0 > > - pmovmskb %xmm0, %eax > > - test %eax, %eax > > - jnz L(matches48_1) > > - cmp $16, %edx > > - jbe L(return_null) > > - > > - pcmpeqb 32(%rdi), %xmm1 > > - pmovmskb %xmm1, %eax > > - test %eax, %eax > > - jnz L(matches32_1) > > - xor %eax, %eax > > + .p2align 4,, 2 > > +L(zero_0): > > + xorl %eax, %eax > > ret > > > > - .p2align 4 > > -L(matches0): > > - bsr %eax, %eax > > - add %rdi, %rax > > - ret > > - > > - .p2align 4 > > -L(matches16): > > - bsr %eax, %eax > > - lea 16(%rax, %rdi), %rax > > - ret > > > > - .p2align 4 > > -L(matches32): > > - bsr %eax, %eax > > - lea 32(%rax, %rdi), %rax > > + .p2align 4,, 8 > > +L(more_1x_vec): > > + testl %eax, %eax > > + jnz L(ret_vec_x0) > > + > > + /* Align rcx (pointer to string). */ > > + decq %rcx > > + andq $-VEC_SIZE, %rcx > > + > > + movq %rcx, %rdx > > + /* NB: We could consistenyl save 1-byte in this pattern with `movaps > > + %xmm0, %xmm1; pcmpeq IMM8(r), %xmm1; ...`. The reason against it is > > + it adds more frontend uops (even if the moves can be eliminated) and > > + some percentage of the time actual backend uops. */ > > + movaps -(VEC_SIZE)(%rcx), %xmm1 > > + pcmpeqb %xmm0, %xmm1 > > + subq %rdi, %rdx > > + pmovmskb %xmm1, %eax > > + > > + cmpq $(VEC_SIZE * 2), %rdx > > + ja L(more_2x_vec) > > +L(last_2x_vec): > > + subl $VEC_SIZE, %edx > > + jbe L(ret_vec_x0_test) > > + > > + testl %eax, %eax > > + jnz L(ret_vec_x0) > > + > > + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 > > + pcmpeqb %xmm0, %xmm1 > > + pmovmskb %xmm1, %eax > > + > > + subl $VEC_SIZE, %edx > > + bsrl %eax, %eax > > + jz L(ret_1) > > + addl %edx, %eax > > + jl L(zero_0) > > + addq %rdi, %rax > > +L(ret_1): > > ret > > > > - .p2align 4 > > -L(matches48): > > - bsr %eax, %eax > > - lea 48(%rax, %rdi), %rax > > + /* Don't align. Otherwise lose 2-byte encoding in jump to L(page_cross) > > + causes the hot pause (length <= VEC_SIZE) to span multiple cache > > + lines. Naturally aligned % 16 to 8-bytes. */ > > +L(page_cross): > > + /* Zero length check. 
*/ > > + testq %rdx, %rdx > > + jz L(zero_0) > > + > > + leaq -1(%rcx), %r8 > > + andq $-(VEC_SIZE), %r8 > > + > > + movaps (%r8), %xmm1 > > + pcmpeqb %xmm0, %xmm1 > > + pmovmskb %xmm1, %esi > > + /* Shift out negative alignment (because we are starting from endptr and > > + working backwards). */ > > + negl %ecx > > + /* 32-bit shift but VEC_SIZE=16 so need to mask the shift count > > + explicitly. */ > > + andl $(VEC_SIZE - 1), %ecx > > + shl %cl, %esi > > + movzwl %si, %eax > > + leaq (%rdi, %rdx), %rcx > > + cmpq %rdi, %r8 > > + ja L(more_1x_vec) > > + subl $VEC_SIZE, %edx > > + bsrl %eax, %eax > > + jz L(ret_2) > > + addl %edx, %eax > > + jl L(zero_1) > > + addq %rdi, %rax > > +L(ret_2): > > ret > > > > - .p2align 4 > > -L(matches0_1): > > - bsr %eax, %eax > > - sub $64, %rdx > > - add %rax, %rdx > > - jl L(return_null) > > - add %rdi, %rax > > + /* Fits in aliging bytes. */ > > +L(zero_1): > > + xorl %eax, %eax > > ret > > > > - .p2align 4 > > -L(matches16_1): > > - bsr %eax, %eax > > - sub $48, %rdx > > - add %rax, %rdx > > - jl L(return_null) > > - lea 16(%rdi, %rax), %rax > > + .p2align 4,, 5 > > +L(ret_vec_x1): > > + bsrl %eax, %eax > > + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax > > ret > > > > - .p2align 4 > > -L(matches32_1): > > - bsr %eax, %eax > > - sub $32, %rdx > > - add %rax, %rdx > > - jl L(return_null) > > - lea 32(%rdi, %rax), %rax > > - ret > > + .p2align 4,, 8 > > +L(more_2x_vec): > > + testl %eax, %eax > > + jnz L(ret_vec_x0) > > > > - .p2align 4 > > -L(matches48_1): > > - bsr %eax, %eax > > - sub $16, %rdx > > - add %rax, %rdx > > - jl L(return_null) > > - lea 48(%rdi, %rax), %rax > > - ret > > + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 > > + pcmpeqb %xmm0, %xmm1 > > + pmovmskb %xmm1, %eax > > + testl %eax, %eax > > + jnz L(ret_vec_x1) > > > > - .p2align 4 > > -L(return_null): > > - xor %eax, %eax > > - ret > > > > - .p2align 4 > > -L(length_less16_offset0): > > - test %edx, %edx > > - jz L(return_null) > > + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 > > + pcmpeqb %xmm0, %xmm1 > > + pmovmskb %xmm1, %eax > > > > - mov %dl, %cl > > - pcmpeqb (%rdi), %xmm1 > > + subq $(VEC_SIZE * 4), %rdx > > + ja L(more_4x_vec) > > > > - mov $1, %edx > > - sal %cl, %edx > > - sub $1, %edx > > + addl $(VEC_SIZE), %edx > > + jle L(ret_vec_x2_test) > > > > - pmovmskb %xmm1, %eax > > +L(last_vec): > > + testl %eax, %eax > > + jnz L(ret_vec_x2) > > > > - and %edx, %eax > > - test %eax, %eax > > - jz L(return_null) > > + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 > > + pcmpeqb %xmm0, %xmm1 > > + pmovmskb %xmm1, %eax > > > > - bsr %eax, %eax > > - add %rdi, %rax > > + subl $(VEC_SIZE), %edx > > + bsrl %eax, %eax > > + jz L(ret_3) > > + addl %edx, %eax > > + jl L(zero_2) > > + addq %rdi, %rax > > +L(ret_3): > > ret > > > > - .p2align 4 > > -L(length_less16): > > - punpcklbw %xmm1, %xmm1 > > - punpcklbw %xmm1, %xmm1 > > - > > - add $16, %edx > > - > > - pshufd $0, %xmm1, %xmm1 > > - > > - mov %edi, %ecx > > - and $15, %ecx > > - jz L(length_less16_offset0) > > - > > - mov %cl, %dh > > - mov %ecx, %esi > > - add %dl, %dh > > - and $-16, %rdi > > - > > - sub $16, %dh > > - ja L(length_less16_part2) > > - > > - pcmpeqb (%rdi), %xmm1 > > - pmovmskb %xmm1, %eax > > - > > - sar %cl, %eax > > - mov %dl, %cl > > - > > - mov $1, %edx > > - sal %cl, %edx > > - sub $1, %edx > > - > > - and %edx, %eax > > - test %eax, %eax > > - jz L(return_null) > > - > > - bsr %eax, %eax > > - add %rdi, %rax > > - add %rsi, %rax > > + .p2align 4,, 6 > > +L(ret_vec_x2_test): > > + bsrl %eax, %eax > > + jz L(zero_2) > > + addl %edx, %eax > 
> + jl L(zero_2) > > + addq %rdi, %rax > > ret > > > > - .p2align 4 > > -L(length_less16_part2): > > - movdqa 16(%rdi), %xmm2 > > - pcmpeqb %xmm1, %xmm2 > > - pmovmskb %xmm2, %eax > > - > > - mov %dh, %cl > > - mov $1, %edx > > - sal %cl, %edx > > - sub $1, %edx > > - > > - and %edx, %eax > > +L(zero_2): > > + xorl %eax, %eax > > + ret > > > > - test %eax, %eax > > - jnz L(length_less16_part2_return) > > > > - pcmpeqb (%rdi), %xmm1 > > - pmovmskb %xmm1, %eax > > + .p2align 4,, 5 > > +L(ret_vec_x2): > > + bsrl %eax, %eax > > + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax > > + ret > > > > - mov %esi, %ecx > > - sar %cl, %eax > > - test %eax, %eax > > - jz L(return_null) > > + .p2align 4,, 5 > > +L(ret_vec_x3): > > + bsrl %eax, %eax > > + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax > > + ret > > > > - bsr %eax, %eax > > - add %rdi, %rax > > - add %rsi, %rax > > + .p2align 4,, 8 > > +L(more_4x_vec): > > + testl %eax, %eax > > + jnz L(ret_vec_x2) > > + > > + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 > > + pcmpeqb %xmm0, %xmm1 > > + pmovmskb %xmm1, %eax > > + > > + testl %eax, %eax > > + jnz L(ret_vec_x3) > > + > > + addq $-(VEC_SIZE * 4), %rcx > > + cmpq $(VEC_SIZE * 4), %rdx > > + jbe L(last_4x_vec) > > + > > + /* Offset everything by 4x VEC_SIZE here to save a few bytes at the end > > + keeping the code from spilling to the next cache line. */ > > + addq $(VEC_SIZE * 4 - 1), %rcx > > + andq $-(VEC_SIZE * 4), %rcx > > + leaq (VEC_SIZE * 4)(%rdi), %rdx > > + andq $-(VEC_SIZE * 4), %rdx > > + > > + .p2align 4,, 11 > > +L(loop_4x_vec): > > + movaps (VEC_SIZE * -1)(%rcx), %xmm1 > > + movaps (VEC_SIZE * -2)(%rcx), %xmm2 > > + movaps (VEC_SIZE * -3)(%rcx), %xmm3 > > + movaps (VEC_SIZE * -4)(%rcx), %xmm4 > > + pcmpeqb %xmm0, %xmm1 > > + pcmpeqb %xmm0, %xmm2 > > + pcmpeqb %xmm0, %xmm3 > > + pcmpeqb %xmm0, %xmm4 > > + > > + por %xmm1, %xmm2 > > + por %xmm3, %xmm4 > > + por %xmm2, %xmm4 > > + > > + pmovmskb %xmm4, %esi > > + testl %esi, %esi > > + jnz L(loop_end) > > + > > + addq $-(VEC_SIZE * 4), %rcx > > + cmpq %rdx, %rcx > > + jne L(loop_4x_vec) > > + > > + subl %edi, %edx > > + > > + /* Ends up being 1-byte nop. */ > > + .p2align 4,, 2 > > +L(last_4x_vec): > > + movaps -(VEC_SIZE)(%rcx), %xmm1 > > + pcmpeqb %xmm0, %xmm1 > > + pmovmskb %xmm1, %eax > > + > > + cmpl $(VEC_SIZE * 2), %edx > > + jbe L(last_2x_vec) > > + > > + testl %eax, %eax > > + jnz L(ret_vec_x0) > > + > > + > > + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 > > + pcmpeqb %xmm0, %xmm1 > > + pmovmskb %xmm1, %eax > > + > > + testl %eax, %eax > > + jnz L(ret_vec_end) > > + > > + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 > > + pcmpeqb %xmm0, %xmm1 > > + pmovmskb %xmm1, %eax > > + > > + subl $(VEC_SIZE * 3), %edx > > + ja L(last_vec) > > + bsrl %eax, %eax > > + jz L(ret_4) > > + addl %edx, %eax > > + jl L(zero_3) > > + addq %rdi, %rax > > +L(ret_4): > > ret > > > > - .p2align 4 > > -L(length_less16_part2_return): > > - bsr %eax, %eax > > - lea 16(%rax, %rdi), %rax > > + /* Ends up being 1-byte nop. */ > > + .p2align 4,, 3 > > +L(loop_end): > > + pmovmskb %xmm1, %eax > > + sall $16, %eax > > + jnz L(ret_vec_end) > > + > > + pmovmskb %xmm2, %eax > > + testl %eax, %eax > > + jnz L(ret_vec_end) > > + > > + pmovmskb %xmm3, %eax > > + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) > > + then it won't affect the result in esi (VEC4). If ecx is non-zero > > + then CHAR in VEC3 and bsrq will use that position. 
*/ > > + sall $16, %eax > > + orl %esi, %eax > > + bsrl %eax, %eax > > + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax > > ret > > > > -END (__memrchr) > > +L(ret_vec_end): > > + bsrl %eax, %eax > > + leaq (VEC_SIZE * -2)(%rax, %rcx), %rax > > + ret > > + /* Use in L(last_4x_vec). In the same cache line. This is just a spare > > + aligning bytes. */ > > +L(zero_3): > > + xorl %eax, %eax > > + ret > > + /* 2-bytes from next cache line. */ > > +END(__memrchr) > > weak_alias (__memrchr, memrchr) > > -- > > 2.34.1 > > > > LGTM. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 82+ messages in thread
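For anyone reviewing the backport, the page-cross gate the patch introduces (`testl $(PAGE_SIZE - VEC_SIZE), %ecx; jz L(page_cross)`, with rcx holding the end pointer) can be read as the following C predicate. This is a hedged sketch only; may_cross_page is a made-up name, and the two defines mirror the patch's values:

  #include <stdbool.h>
  #include <stdint.h>

  #define PAGE_SIZE 4096
  #define VEC_SIZE 16

  static bool
  may_cross_page (const void *end)
  {
    /* Zero exactly when end % PAGE_SIZE < VEC_SIZE, i.e. when the
       VEC_SIZE bytes ending at `end` might straddle a page boundary.  */
    return ((uintptr_t) end & (PAGE_SIZE - VEC_SIZE)) == 0;
  }

The test is deliberately conservative: end % PAGE_SIZE == 0 also takes the cold path even though that load sits entirely inside the previous page, which keeps the hot-path check to a single test-and-branch.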
* [PATCH v6 5/8] x86: Optimize memrchr-evex.S 2022-06-07 4:11 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (2 preceding siblings ...) 2022-06-07 4:11 ` [PATCH v6 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein @ 2022-06-07 4:11 ` Noah Goldstein 2022-06-07 18:21 ` H.J. Lu 2022-06-07 4:11 ` [PATCH v6 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein ` (3 subsequent siblings) 7 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:11 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully. 3. reuses logic more. 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns, which saves either a branch or multiple instructions. The total code size saving is: 263 bytes Geometric Mean of all benchmarks New / Old: 0.755 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the beginning of the string (in first VEC). This case has roughly a 20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). That shorter-length case sees roughly a 35% speedup. Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memrchr-evex.S | 539 ++++++++++++------------ 1 file changed, 268 insertions(+), 271 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S index 0b99709c6b..2d7da06dfc 100644 --- a/sysdeps/x86_64/multiarch/memrchr-evex.S +++ b/sysdeps/x86_64/multiarch/memrchr-evex.S @@ -19,319 +19,316 @@ #if IS_IN (libc) # include <sysdep.h> +# include "evex256-vecs.h" +# if VEC_SIZE != 32 +# error "VEC_SIZE != 32 unimplemented" +# endif + +# ifndef MEMRCHR +# define MEMRCHR __memrchr_evex +# endif + +# define PAGE_SIZE 4096 +# define VECMATCH VEC(0) + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN(MEMRCHR, 6) +# ifdef __ILP32__ + /* Clear upper bits. */ + and %RDX_LP, %RDX_LP +# else + test %RDX_LP, %RDX_LP +# endif + jz L(zero_0) + + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a + correct page cross check and 2) it correctly sets up the end ptr so + the lzcnt result can be subtracted from it directly. */ + leaq -1(%rdi, %rdx), %rax + vpbroadcastb %esi, %VECMATCH + + /* Check if we can load 1x VEC without crossing a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %eax + jz L(page_cross) + + /* Don't use rax for pointer here because EVEX has better encoding with + offset % VEC_SIZE == 0. */ + vpcmpb $0, -(VEC_SIZE)(%rdi, %rdx), %VECMATCH, %k0 + kmovd %k0, %ecx + + /* Fall through for rdx (len) <= VEC_SIZE (expect small sizes). */ + cmpq $VEC_SIZE, %rdx + ja L(more_1x_vec) +L(ret_vec_x0_test): + + /* If ecx is zero (no matches) lzcnt will set it to 32 (VEC_SIZE), which + guarantees edx (len) is not greater than it. */ + lzcntl %ecx, %ecx + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret -# define VMOVA vmovdqa64 - -# define YMMMATCH ymm16 - -# define VEC_SIZE 32 - - .section .text.evex,"ax",@progbits -ENTRY (__memrchr_evex) - /* Broadcast CHAR to YMMMATCH. */ - vpbroadcastb %esi, %YMMMATCH - - sub $VEC_SIZE, %RDX_LP - jbe L(last_vec_or_less) - - add %RDX_LP, %RDI_LP - - /* Check the last VEC_SIZE bytes. 
*/ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - subq $(VEC_SIZE * 4), %rdi - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(aligned_more) - - /* Align data for aligned loads in the loop. */ - addq $VEC_SIZE, %rdi - addq $VEC_SIZE, %rdx - andq $-VEC_SIZE, %rdi - subq %rcx, %rdx - - .p2align 4 -L(aligned_more): - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpcmpb $0, (%rdi), %YMMMATCH, %k4 - kmovd %k4, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - /* Align data to 4 * VEC_SIZE for loop with fewer branches. - There are some overlaps with above if data isn't aligned - to 4 * VEC_SIZE. */ - movl %edi, %ecx - andl $(VEC_SIZE * 4 - 1), %ecx - jz L(loop_4x_vec) - - addq $(VEC_SIZE * 4), %rdi - addq $(VEC_SIZE * 4), %rdx - andq $-(VEC_SIZE * 4), %rdi - subq %rcx, %rdx + /* Fits in aligning bytes of first cache line. */ +L(zero_0): + xorl %eax, %eax + ret - .p2align 4 -L(loop_4x_vec): - /* Compare 4 * VEC at a time forward. */ - subq $(VEC_SIZE * 4), %rdi - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k2 - kord %k1, %k2, %k5 - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k3 - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k4 - - kord %k3, %k4, %k6 - kortestd %k5, %k6 - jz L(loop_4x_vec) - - /* There is a match. */ - kmovd %k4, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - kmovd %k1, %eax - bsrl %eax, %eax - addq %rdi, %rax + .p2align 4,, 9 +L(ret_vec_x0_dec): + decq %rax +L(ret_vec_x0): + lzcntl %ecx, %ecx + subq %rcx, %rax ret - .p2align 4 -L(last_4x_vec_or_less): - addl $(VEC_SIZE * 4), %edx - cmpl $(VEC_SIZE * 2), %edx - jbe L(last_2x_vec) + .p2align 4,, 10 +L(more_1x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3) + /* Align rax (pointer to string). */ + andq $-VEC_SIZE, %rax - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x2) + /* Recompute length after aligning. */ + movq %rax, %rdx - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 - kmovd %k3, %eax - testl %eax, %eax - jnz L(last_vec_x1_check) - cmpl $(VEC_SIZE * 3), %edx - jbe L(zero) + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - vpcmpb $0, (%rdi), %YMMMATCH, %k4 - kmovd %k4, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 4), %rdx - addq %rax, %rdx - jl L(zero) - addq %rdi, %rax - ret + subq %rdi, %rdx - .p2align 4 + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) L(last_2x_vec): - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(last_vec_x3_check) + + /* Must dec rax because L(ret_vec_x0_test) expects it. 
*/ + decq %rax cmpl $VEC_SIZE, %edx - jbe L(zero) - - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 2), %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + jbe L(ret_vec_x0_test) + + testl %ecx, %ecx + jnz L(ret_vec_x0) + + /* Don't use rax for pointer here because EVEX has better encoding with + offset % VEC_SIZE == 0. */ + vpcmpb $0, -(VEC_SIZE * 2)(%rdi, %rdx), %VECMATCH, %k0 + kmovd %k0, %ecx + /* NB: 64-bit lzcnt. This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax ret - .p2align 4 -L(last_vec_x0): - bsrl %eax, %eax - addq %rdi, %rax + /* Inexpensive place to put this regarding code size / target alignments + / ICache NLP. Necessary for 2-byte encoding of jump to page cross + case which in turn in necessary for hot path (len <= VEC_SIZE) to fit + in first cache line. */ +L(page_cross): + movq %rax, %rsi + andq $-VEC_SIZE, %rsi + vpcmpb $0, (%rsi), %VECMATCH, %k0 + kmovd %k0, %r8d + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + movl %eax, %ecx + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ + notl %ecx + shlxl %ecx, %r8d, %ecx + cmpq %rdi, %rsi + ja L(more_1x_vec) + lzcntl %ecx, %ecx + cmpl %ecx, %edx + jle L(zero_1) + subq %rcx, %rax ret - .p2align 4 -L(last_vec_x1): - bsrl %eax, %eax - addl $VEC_SIZE, %eax - addq %rdi, %rax + /* Continue creating zero labels that fit in aligning bytes and get + 2-byte encoding / are in the same cache line as condition. */ +L(zero_1): + xorl %eax, %eax ret - .p2align 4 -L(last_vec_x2): - bsrl %eax, %eax - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + .p2align 4,, 8 +L(ret_vec_x1): + /* This will naturally add 32 to position. */ + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax ret - .p2align 4 -L(last_vec_x3): - bsrl %eax, %eax - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret + .p2align 4,, 8 +L(more_2x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0_dec) - .p2align 4 -L(last_vec_x1_check): - bsrl %eax, %eax - subq $(VEC_SIZE * 3), %rdx - addq %rax, %rdx - jl L(zero) - addl $VEC_SIZE, %eax - addq %rdi, %rax - ret + vpcmpb $0, -(VEC_SIZE * 2)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1) - .p2align 4 -L(last_vec_x3_check): - bsrl %eax, %eax - subq $VEC_SIZE, %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE * 3)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - .p2align 4 -L(zero): - xorl %eax, %eax + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) + + cmpl $(VEC_SIZE * -1), %edx + jle L(ret_vec_x2_test) +L(last_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) + + + /* Need no matter what. */ + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 3 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_1) ret - .p2align 4 -L(last_vec_or_less_aligned): - movl %edx, %ecx - - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - - movl $1, %edx - /* Support rdx << 32. */ - salq %cl, %rdx - subq $1, %rdx - - kmovd %k1, %eax - - /* Remove the trailing bytes. 
*/ - andl %edx, %eax - testl %eax, %eax - jz L(zero) - - bsrl %eax, %eax - addq %rdi, %rax + .p2align 4,, 8 +L(ret_vec_x2_test): + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_1) ret - .p2align 4 -L(last_vec_or_less): - addl $VEC_SIZE, %edx - - /* Check for zero length. */ - testl %edx, %edx - jz L(zero) - - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(last_vec_or_less_aligned) - - movl %ecx, %esi - movl %ecx, %r8d - addl %edx, %esi - andq $-VEC_SIZE, %rdi + .p2align 4,, 8 +L(ret_vec_x2): + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax + ret - subl $VEC_SIZE, %esi - ja L(last_vec_2x_aligned) + .p2align 4,, 8 +L(ret_vec_x3): + bsrl %ecx, %ecx + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax + ret - /* Check the last VEC. */ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 - kmovd %k1, %eax + .p2align 4,, 8 +L(more_4x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) - /* Remove the leading and trailing bytes. */ - sarl %cl, %eax - movl %edx, %ecx + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + testl %ecx, %ecx + jnz L(ret_vec_x3) - andl %edx, %eax - testl %eax, %eax - jz L(zero) + /* Check if near end before re-aligning (otherwise might do an + unnecessary loop iteration). */ + addq $-(VEC_SIZE * 4), %rax + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax - ret + decq %rax + andq $-(VEC_SIZE * 4), %rax + movq %rdi, %rdx + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because + lengths that overflow can be valid and break the comparison. */ + andq $-(VEC_SIZE * 4), %rdx .p2align 4 -L(last_vec_2x_aligned): - movl %esi, %ecx - - /* Check the last VEC. */ - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k1 +L(loop_4x_vec): + /* Store 1 where not-equals and 0 where equals in k1 (used to mask later + on). */ + vpcmpb $4, (VEC_SIZE * 3)(%rax), %VECMATCH, %k1 + + /* VEC(2/3) will have zero-byte where we found a CHAR. */ + vpxorq (VEC_SIZE * 2)(%rax), %VECMATCH, %VEC(2) + vpxorq (VEC_SIZE * 1)(%rax), %VECMATCH, %VEC(3) + vpcmpb $0, (VEC_SIZE * 0)(%rax), %VECMATCH, %k4 + + /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit where + CHAR is found and VEC(2/3) have zero-byte where CHAR is found). */ + vpminub %VEC(2), %VEC(3), %VEC(3){%k1}{z} + vptestnmb %VEC(3), %VEC(3), %k2 + + /* Any 1s and we found CHAR. */ + kortestd %k2, %k4 + jnz L(loop_end) + + addq $-(VEC_SIZE * 4), %rax + cmpq %rdx, %rax + jne L(loop_4x_vec) + + /* Need to re-adjust rdx / rax for L(last_4x_vec). */ + subq $-(VEC_SIZE * 4), %rdx + movq %rdx, %rax + subl %edi, %edx +L(last_4x_vec): + + /* Used no matter what. */ + vpcmpb $0, (VEC_SIZE * -1)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) - kmovd %k1, %eax + testl %ecx, %ecx + jnz L(ret_vec_x0_dec) - /* Remove the trailing bytes. */ - andl %edx, %eax - - testl %eax, %eax - jnz L(last_vec_x1) + vpcmpb $0, (VEC_SIZE * -2)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - /* Check the second last VEC. */ - vpcmpb $0, (%rdi), %YMMMATCH, %k1 + testl %ecx, %ecx + jnz L(ret_vec_x1) - movl %r8d, %ecx + /* Used no matter what. */ + vpcmpb $0, (VEC_SIZE * -3)(%rax), %VECMATCH, %k0 + kmovd %k0, %ecx - kmovd %k1, %eax + cmpl $(VEC_SIZE * 3), %edx + ja L(last_vec) - /* Remove the leading bytes. Must use unsigned right shift for - bsrl below. 
*/ - shrl %cl, %eax - testl %eax, %eax - jz L(zero) + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2 + 1), %rax + subq %rcx, %rax + cmpq %rax, %rdi + jbe L(ret_1) + xorl %eax, %eax +L(ret_1): + ret - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax + .p2align 4,, 6 +L(loop_end): + kmovd %k1, %ecx + notl %ecx + testl %ecx, %ecx + jnz L(ret_vec_x0_end) + + vptestnmb %VEC(2), %VEC(2), %k0 + kmovd %k0, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) + + kmovd %k2, %ecx + kmovd %k4, %esi + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If ecx is non-zero + then CHAR in VEC3 and bsrq will use that position. */ + salq $32, %rcx + orq %rsi, %rcx + bsrq %rcx, %rcx + addq %rcx, %rax + ret + .p2align 4,, 4 +L(ret_vec_x0_end): + addq $(VEC_SIZE), %rax +L(ret_vec_x1_end): + bsrl %ecx, %ecx + leaq (VEC_SIZE * 2)(%rax, %rcx), %rax ret -END (__memrchr_evex) + +END(MEMRCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
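The `lzcnt` return logic the commit message highlights falls out of setting rax to end - 1 at entry: if bit i of the match mask covers byte end - VEC_SIZE + i, then the last match sits at end - 1 - lzcnt(mask), and a single comparison rejects both "no match at all" (lzcnt == 32) and "match lies before the buffer start". A hedged C sketch of L(ret_vec_x0_test) -- the C function and its signature are illustrative, not glibc code:

  #include <stddef.h>
  #include <stdint.h>

  #define VEC_SIZE 32

  /* mask: bit i set iff byte end[i - VEC_SIZE] == CHAR; the caller
     guarantees len <= VEC_SIZE on this path.  */
  static const unsigned char *
  ret_vec_x0_test (const unsigned char *end, size_t len, uint32_t mask)
  {
    /* lzcnt yields 32 for a zero input; __builtin_clz (0) is undefined,
       so spell that case out.  */
    unsigned int lz = mask != 0 ? (unsigned int) __builtin_clz (mask) : 32;
    if (len <= lz)		/* No match, or match before the start.  */
      return NULL;		/* L(zero_0).  */
    return end - 1 - lz;	/* subq %rcx, %rax with rax == end - 1.  */
  }

The design point is that one lzcnt plus one compare replaces the old code's separate bsr, bounds arithmetic, and zero test.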
* Re: [PATCH v6 5/8] x86: Optimize memrchr-evex.S 2022-06-07 4:11 ` [PATCH v6 5/8] x86: Optimize memrchr-evex.S Noah Goldstein @ 2022-06-07 18:21 ` H.J. Lu 2022-07-14 2:21 ` Sunil Pandey 0 siblings, 1 reply; 82+ messages in thread From: H.J. Lu @ 2022-06-07 18:21 UTC (permalink / raw) To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > The new code: > 1. prioritizes smaller user-arg lengths more. > 2. optimizes target placement more carefully > 3. reuses logic more > 4. fixes up various inefficiencies in the logic. The biggest > case here is the `lzcnt` logic for checking returns which > saves either a branch or multiple instructions. > > The total code size saving is: 263 bytes > Geometric Mean of all benchmarks New / Old: 0.755 > > Regressions: > There are some regressions. Particularly where the length (user arg > length) is large but the position of the match char is near the > beginning of the string (in first VEC). This case has roughly a > 20% regression. > > This is because the new logic gives the hot path for immediate matches > to shorter lengths (the more common input). This case has roughly > a 35% speedup. > > Full xcheck passes on x86_64. > --- > sysdeps/x86_64/multiarch/memrchr-evex.S | 539 ++++++++++++------------ > 1 file changed, 268 insertions(+), 271 deletions(-) > > diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S > index 0b99709c6b..2d7da06dfc 100644 > --- a/sysdeps/x86_64/multiarch/memrchr-evex.S > +++ b/sysdeps/x86_64/multiarch/memrchr-evex.S > @@ -19,319 +19,316 @@ > #if IS_IN (libc) > > # include <sysdep.h> > +# include "evex256-vecs.h" > +# if VEC_SIZE != 32 > +# error "VEC_SIZE != 32 unimplemented" > +# endif > + > +# ifndef MEMRCHR > +# define MEMRCHR __memrchr_evex > +# endif > + > +# define PAGE_SIZE 4096 > +# define VECMATCH VEC(0) > + > + .section SECTION(.text), "ax", @progbits > +ENTRY_P2ALIGN(MEMRCHR, 6) > +# ifdef __ILP32__ > + /* Clear upper bits. */ > + and %RDX_LP, %RDX_LP > +# else > + test %RDX_LP, %RDX_LP > +# endif > + jz L(zero_0) > + > + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a > + correct page cross check and 2) it correctly sets up end ptr to be > + subtract by lzcnt aligned. */ > + leaq -1(%rdi, %rdx), %rax > + vpbroadcastb %esi, %VECMATCH > + > + /* Check if we can load 1x VEC without cross a page. */ > + testl $(PAGE_SIZE - VEC_SIZE), %eax > + jz L(page_cross) > + > + /* Don't use rax for pointer here because EVEX has better encoding with > + offset % VEC_SIZE == 0. */ > + vpcmpb $0, -(VEC_SIZE)(%rdi, %rdx), %VECMATCH, %k0 > + kmovd %k0, %ecx > + > + /* Fall through for rdx (len) <= VEC_SIZE (expect small sizes). */ > + cmpq $VEC_SIZE, %rdx > + ja L(more_1x_vec) > +L(ret_vec_x0_test): > + > + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which > + will guarantee edx (len) is less than it. */ > + lzcntl %ecx, %ecx > + cmpl %ecx, %edx > + jle L(zero_0) > + subq %rcx, %rax > + ret > > -# define VMOVA vmovdqa64 > - > -# define YMMMATCH ymm16 > - > -# define VEC_SIZE 32 > - > - .section .text.evex,"ax",@progbits > -ENTRY (__memrchr_evex) > - /* Broadcast CHAR to YMMMATCH. */ > - vpbroadcastb %esi, %YMMMATCH > - > - sub $VEC_SIZE, %RDX_LP > - jbe L(last_vec_or_less) > - > - add %RDX_LP, %RDI_LP > - > - /* Check the last VEC_SIZE bytes. 
*/ > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > - kmovd %k1, %eax > - testl %eax, %eax > - jnz L(last_vec_x0) > - > - subq $(VEC_SIZE * 4), %rdi > - movl %edi, %ecx > - andl $(VEC_SIZE - 1), %ecx > - jz L(aligned_more) > - > - /* Align data for aligned loads in the loop. */ > - addq $VEC_SIZE, %rdi > - addq $VEC_SIZE, %rdx > - andq $-VEC_SIZE, %rdi > - subq %rcx, %rdx > - > - .p2align 4 > -L(aligned_more): > - subq $(VEC_SIZE * 4), %rdx > - jbe L(last_4x_vec_or_less) > - > - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time > - since data is only aligned to VEC_SIZE. */ > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > - kmovd %k1, %eax > - testl %eax, %eax > - jnz L(last_vec_x3) > - > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 > - kmovd %k2, %eax > - testl %eax, %eax > - jnz L(last_vec_x2) > - > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 > - kmovd %k3, %eax > - testl %eax, %eax > - jnz L(last_vec_x1) > - > - vpcmpb $0, (%rdi), %YMMMATCH, %k4 > - kmovd %k4, %eax > - testl %eax, %eax > - jnz L(last_vec_x0) > - > - /* Align data to 4 * VEC_SIZE for loop with fewer branches. > - There are some overlaps with above if data isn't aligned > - to 4 * VEC_SIZE. */ > - movl %edi, %ecx > - andl $(VEC_SIZE * 4 - 1), %ecx > - jz L(loop_4x_vec) > - > - addq $(VEC_SIZE * 4), %rdi > - addq $(VEC_SIZE * 4), %rdx > - andq $-(VEC_SIZE * 4), %rdi > - subq %rcx, %rdx > + /* Fits in aligning bytes of first cache line. */ > +L(zero_0): > + xorl %eax, %eax > + ret > > - .p2align 4 > -L(loop_4x_vec): > - /* Compare 4 * VEC at a time forward. */ > - subq $(VEC_SIZE * 4), %rdi > - subq $(VEC_SIZE * 4), %rdx > - jbe L(last_4x_vec_or_less) > - > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k2 > - kord %k1, %k2, %k5 > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k3 > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k4 > - > - kord %k3, %k4, %k6 > - kortestd %k5, %k6 > - jz L(loop_4x_vec) > - > - /* There is a match. */ > - kmovd %k4, %eax > - testl %eax, %eax > - jnz L(last_vec_x3) > - > - kmovd %k3, %eax > - testl %eax, %eax > - jnz L(last_vec_x2) > - > - kmovd %k2, %eax > - testl %eax, %eax > - jnz L(last_vec_x1) > - > - kmovd %k1, %eax > - bsrl %eax, %eax > - addq %rdi, %rax > + .p2align 4,, 9 > +L(ret_vec_x0_dec): > + decq %rax > +L(ret_vec_x0): > + lzcntl %ecx, %ecx > + subq %rcx, %rax > ret > > - .p2align 4 > -L(last_4x_vec_or_less): > - addl $(VEC_SIZE * 4), %edx > - cmpl $(VEC_SIZE * 2), %edx > - jbe L(last_2x_vec) > + .p2align 4,, 10 > +L(more_1x_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x0) > > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > - kmovd %k1, %eax > - testl %eax, %eax > - jnz L(last_vec_x3) > + /* Align rax (pointer to string). */ > + andq $-VEC_SIZE, %rax > > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 > - kmovd %k2, %eax > - testl %eax, %eax > - jnz L(last_vec_x2) > + /* Recompute length after aligning. */ > + movq %rax, %rdx > > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 > - kmovd %k3, %eax > - testl %eax, %eax > - jnz L(last_vec_x1_check) > - cmpl $(VEC_SIZE * 3), %edx > - jbe L(zero) > + /* Need no matter what. 
*/ > + vpcmpb $0, -(VEC_SIZE)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > > - vpcmpb $0, (%rdi), %YMMMATCH, %k4 > - kmovd %k4, %eax > - testl %eax, %eax > - jz L(zero) > - bsrl %eax, %eax > - subq $(VEC_SIZE * 4), %rdx > - addq %rax, %rdx > - jl L(zero) > - addq %rdi, %rax > - ret > + subq %rdi, %rdx > > - .p2align 4 > + cmpq $(VEC_SIZE * 2), %rdx > + ja L(more_2x_vec) > L(last_2x_vec): > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > - kmovd %k1, %eax > - testl %eax, %eax > - jnz L(last_vec_x3_check) > + > + /* Must dec rax because L(ret_vec_x0_test) expects it. */ > + decq %rax > cmpl $VEC_SIZE, %edx > - jbe L(zero) > - > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1 > - kmovd %k1, %eax > - testl %eax, %eax > - jz L(zero) > - bsrl %eax, %eax > - subq $(VEC_SIZE * 2), %rdx > - addq %rax, %rdx > - jl L(zero) > - addl $(VEC_SIZE * 2), %eax > - addq %rdi, %rax > + jbe L(ret_vec_x0_test) > + > + testl %ecx, %ecx > + jnz L(ret_vec_x0) > + > + /* Don't use rax for pointer here because EVEX has better encoding with > + offset % VEC_SIZE == 0. */ > + vpcmpb $0, -(VEC_SIZE * 2)(%rdi, %rdx), %VECMATCH, %k0 > + kmovd %k0, %ecx > + /* NB: 64-bit lzcnt. This will naturally add 32 to position. */ > + lzcntq %rcx, %rcx > + cmpl %ecx, %edx > + jle L(zero_0) > + subq %rcx, %rax > ret > > - .p2align 4 > -L(last_vec_x0): > - bsrl %eax, %eax > - addq %rdi, %rax > + /* Inexpensive place to put this regarding code size / target alignments > + / ICache NLP. Necessary for 2-byte encoding of jump to page cross > + case which in turn in necessary for hot path (len <= VEC_SIZE) to fit is necessary? > + in first cache line. */ > +L(page_cross): > + movq %rax, %rsi > + andq $-VEC_SIZE, %rsi > + vpcmpb $0, (%rsi), %VECMATCH, %k0 > + kmovd %k0, %r8d > + /* Shift out negative alignment (because we are starting from endptr and > + working backwards). */ > + movl %eax, %ecx > + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ > + notl %ecx > + shlxl %ecx, %r8d, %ecx > + cmpq %rdi, %rsi > + ja L(more_1x_vec) > + lzcntl %ecx, %ecx > + cmpl %ecx, %edx > + jle L(zero_1) > + subq %rcx, %rax > ret > > - .p2align 4 > -L(last_vec_x1): > - bsrl %eax, %eax > - addl $VEC_SIZE, %eax > - addq %rdi, %rax > + /* Continue creating zero labels that fit in aligning bytes and get > + 2-byte encoding / are in the same cache line as condition. */ > +L(zero_1): > + xorl %eax, %eax > ret > > - .p2align 4 > -L(last_vec_x2): > - bsrl %eax, %eax > - addl $(VEC_SIZE * 2), %eax > - addq %rdi, %rax > + .p2align 4,, 8 > +L(ret_vec_x1): > + /* This will naturally add 32 to position. */ > + bsrl %ecx, %ecx > + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax > ret > > - .p2align 4 > -L(last_vec_x3): > - bsrl %eax, %eax > - addl $(VEC_SIZE * 3), %eax > - addq %rdi, %rax > - ret > + .p2align 4,, 8 > +L(more_2x_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x0_dec) > > - .p2align 4 > -L(last_vec_x1_check): > - bsrl %eax, %eax > - subq $(VEC_SIZE * 3), %rdx > - addq %rax, %rdx > - jl L(zero) > - addl $VEC_SIZE, %eax > - addq %rdi, %rax > - ret > + vpcmpb $0, -(VEC_SIZE * 2)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x1) > > - .p2align 4 > -L(last_vec_x3_check): > - bsrl %eax, %eax > - subq $VEC_SIZE, %rdx > - addq %rax, %rdx > - jl L(zero) > - addl $(VEC_SIZE * 3), %eax > - addq %rdi, %rax > - ret > + /* Need no matter what. 
*/ > + vpcmpb $0, -(VEC_SIZE * 3)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > > - .p2align 4 > -L(zero): > - xorl %eax, %eax > + subq $(VEC_SIZE * 4), %rdx > + ja L(more_4x_vec) > + > + cmpl $(VEC_SIZE * -1), %edx > + jle L(ret_vec_x2_test) > +L(last_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x2) > + > + > + /* Need no matter what. */ > + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > + lzcntl %ecx, %ecx > + subq $(VEC_SIZE * 3 + 1), %rax > + subq %rcx, %rax > + cmpq %rax, %rdi > + ja L(zero_1) > ret > > - .p2align 4 > -L(last_vec_or_less_aligned): > - movl %edx, %ecx > - > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > - > - movl $1, %edx > - /* Support rdx << 32. */ > - salq %cl, %rdx > - subq $1, %rdx > - > - kmovd %k1, %eax > - > - /* Remove the trailing bytes. */ > - andl %edx, %eax > - testl %eax, %eax > - jz L(zero) > - > - bsrl %eax, %eax > - addq %rdi, %rax > + .p2align 4,, 8 > +L(ret_vec_x2_test): > + lzcntl %ecx, %ecx > + subq $(VEC_SIZE * 2 + 1), %rax > + subq %rcx, %rax > + cmpq %rax, %rdi > + ja L(zero_1) > ret > > - .p2align 4 > -L(last_vec_or_less): > - addl $VEC_SIZE, %edx > - > - /* Check for zero length. */ > - testl %edx, %edx > - jz L(zero) > - > - movl %edi, %ecx > - andl $(VEC_SIZE - 1), %ecx > - jz L(last_vec_or_less_aligned) > - > - movl %ecx, %esi > - movl %ecx, %r8d > - addl %edx, %esi > - andq $-VEC_SIZE, %rdi > + .p2align 4,, 8 > +L(ret_vec_x2): > + bsrl %ecx, %ecx > + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax > + ret > > - subl $VEC_SIZE, %esi > - ja L(last_vec_2x_aligned) > + .p2align 4,, 8 > +L(ret_vec_x3): > + bsrl %ecx, %ecx > + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax > + ret > > - /* Check the last VEC. */ > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > - kmovd %k1, %eax > + .p2align 4,, 8 > +L(more_4x_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x2) > > - /* Remove the leading and trailing bytes. */ > - sarl %cl, %eax > - movl %edx, %ecx > + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > > - movl $1, %edx > - sall %cl, %edx > - subl $1, %edx > + testl %ecx, %ecx > + jnz L(ret_vec_x3) > > - andl %edx, %eax > - testl %eax, %eax > - jz L(zero) > + /* Check if near end before re-aligning (otherwise might do an > + unnecessary loop iteration). */ > + addq $-(VEC_SIZE * 4), %rax > + cmpq $(VEC_SIZE * 4), %rdx > + jbe L(last_4x_vec) > > - bsrl %eax, %eax > - addq %rdi, %rax > - addq %r8, %rax > - ret > + decq %rax > + andq $-(VEC_SIZE * 4), %rax > + movq %rdi, %rdx > + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because > + lengths that overflow can be valid and break the comparison. */ > + andq $-(VEC_SIZE * 4), %rdx > > .p2align 4 > -L(last_vec_2x_aligned): > - movl %esi, %ecx > - > - /* Check the last VEC. */ > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k1 > +L(loop_4x_vec): > + /* Store 1 were not-equals and 0 where equals in k1 (used to mask later > + on). */ > + vpcmpb $4, (VEC_SIZE * 3)(%rax), %VECMATCH, %k1 > + > + /* VEC(2/3) will have zero-byte where we found a CHAR. */ > + vpxorq (VEC_SIZE * 2)(%rax), %VECMATCH, %VEC(2) > + vpxorq (VEC_SIZE * 1)(%rax), %VECMATCH, %VEC(3) > + vpcmpb $0, (VEC_SIZE * 0)(%rax), %VECMATCH, %k4 > + > + /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit where > + CHAR is found and VEC(2/3) have zero-byte where CHAR is found. */ > + vpminub %VEC(2), %VEC(3), %VEC(3){%k1}{z} > + vptestnmb %VEC(3), %VEC(3), %k2 > + > + /* Any 1s and we found CHAR. 
*/ > + kortestd %k2, %k4 > + jnz L(loop_end) > + > + addq $-(VEC_SIZE * 4), %rax > + cmpq %rdx, %rax > + jne L(loop_4x_vec) > + > + /* Need to re-adjust rdx / rax for L(last_4x_vec). */ > + subq $-(VEC_SIZE * 4), %rdx > + movq %rdx, %rax > + subl %edi, %edx > +L(last_4x_vec): > + > + /* Used no matter what. */ > + vpcmpb $0, (VEC_SIZE * -1)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > > - movl $1, %edx > - sall %cl, %edx > - subl $1, %edx > + cmpl $(VEC_SIZE * 2), %edx > + jbe L(last_2x_vec) > > - kmovd %k1, %eax > + testl %ecx, %ecx > + jnz L(ret_vec_x0_dec) > > - /* Remove the trailing bytes. */ > - andl %edx, %eax > > - testl %eax, %eax > - jnz L(last_vec_x1) > + vpcmpb $0, (VEC_SIZE * -2)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > > - /* Check the second last VEC. */ > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > + testl %ecx, %ecx > + jnz L(ret_vec_x1) > > - movl %r8d, %ecx > + /* Used no matter what. */ > + vpcmpb $0, (VEC_SIZE * -3)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > > - kmovd %k1, %eax > + cmpl $(VEC_SIZE * 3), %edx > + ja L(last_vec) > > - /* Remove the leading bytes. Must use unsigned right shift for > - bsrl below. */ > - shrl %cl, %eax > - testl %eax, %eax > - jz L(zero) > + lzcntl %ecx, %ecx > + subq $(VEC_SIZE * 2 + 1), %rax > + subq %rcx, %rax > + cmpq %rax, %rdi > + jbe L(ret_1) > + xorl %eax, %eax > +L(ret_1): > + ret > > - bsrl %eax, %eax > - addq %rdi, %rax > - addq %r8, %rax > + .p2align 4,, 6 > +L(loop_end): > + kmovd %k1, %ecx > + notl %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x0_end) > + > + vptestnmb %VEC(2), %VEC(2), %k0 > + kmovd %k0, %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x1_end) > + > + kmovd %k2, %ecx > + kmovd %k4, %esi > + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) > + then it won't affect the result in esi (VEC4). If ecx is non-zero > + then CHAR in VEC3 and bsrq will use that position. */ > + salq $32, %rcx > + orq %rsi, %rcx > + bsrq %rcx, %rcx > + addq %rcx, %rax > + ret > + .p2align 4,, 4 > +L(ret_vec_x0_end): > + addq $(VEC_SIZE), %rax > +L(ret_vec_x1_end): > + bsrl %ecx, %ecx > + leaq (VEC_SIZE * 2)(%rax, %rcx), %rax > ret > -END (__memrchr_evex) > + > +END(MEMRCHR) > #endif > -- > 2.34.1 > OK with the updated comments. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Thanks. -- H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
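The L(loop_end) combine used here (salq $32; orq; bsrq in the EVEX code; the SSE2 version does the same with 16-bit masks and sall $16) packs the match masks of the last two vectors into one word so a single bit-scan-reverse locates the last match across both without an extra branch. A small C sketch of just that step; combine_last_two_vecs is a hypothetical helper, not glibc's:

  #include <stdint.h>

  /* base: lowest address covered by `low`; `low` covers base[0..31] and
     `high` covers base[32..63].  The caller guarantees at least one bit
     is set across the two masks (bsr/clz on zero is undefined).  */
  static const unsigned char *
  combine_last_two_vecs (const unsigned char *base, uint32_t low,
                         uint32_t high)
  {
    uint64_t both = ((uint64_t) high << 32) | low;	/* salq $32; orq.  */
    return base + (63 - __builtin_clzll (both));	/* bsrq; addq.  */
  }

If `high` is zero it cannot disturb the result in `low`, which is exactly the property the in-code comment ("If ecx (VEC3) is zero ... it won't affect the result in esi (VEC4)") relies on.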
* Re: [PATCH v6 5/8] x86: Optimize memrchr-evex.S 2022-06-07 18:21 ` H.J. Lu @ 2022-07-14 2:21 ` Sunil Pandey 0 siblings, 0 replies; 82+ messages in thread From: Sunil Pandey @ 2022-07-14 2:21 UTC (permalink / raw) To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library On Tue, Jun 7, 2022 at 11:23 AM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > The new code: > > 1. prioritizes smaller user-arg lengths more. > > 2. optimizes target placement more carefully > > 3. reuses logic more > > 4. fixes up various inefficiencies in the logic. The biggest > > case here is the `lzcnt` logic for checking returns which > > saves either a branch or multiple instructions. > > > > The total code size saving is: 263 bytes > > Geometric Mean of all benchmarks New / Old: 0.755 > > > > Regressions: > > There are some regressions. Particularly where the length (user arg > > length) is large but the position of the match char is near the > > beginning of the string (in first VEC). This case has roughly a > > 20% regression. > > > > This is because the new logic gives the hot path for immediate matches > > to shorter lengths (the more common input). This case has roughly > > a 35% speedup. > > > > Full xcheck passes on x86_64. > > --- > > sysdeps/x86_64/multiarch/memrchr-evex.S | 539 ++++++++++++------------ > > 1 file changed, 268 insertions(+), 271 deletions(-) > > > > diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S > > index 0b99709c6b..2d7da06dfc 100644 > > --- a/sysdeps/x86_64/multiarch/memrchr-evex.S > > +++ b/sysdeps/x86_64/multiarch/memrchr-evex.S > > @@ -19,319 +19,316 @@ > > #if IS_IN (libc) > > > > # include <sysdep.h> > > +# include "evex256-vecs.h" > > +# if VEC_SIZE != 32 > > +# error "VEC_SIZE != 32 unimplemented" > > +# endif > > + > > +# ifndef MEMRCHR > > +# define MEMRCHR __memrchr_evex > > +# endif > > + > > +# define PAGE_SIZE 4096 > > +# define VECMATCH VEC(0) > > + > > + .section SECTION(.text), "ax", @progbits > > +ENTRY_P2ALIGN(MEMRCHR, 6) > > +# ifdef __ILP32__ > > + /* Clear upper bits. */ > > + and %RDX_LP, %RDX_LP > > +# else > > + test %RDX_LP, %RDX_LP > > +# endif > > + jz L(zero_0) > > + > > + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a > > + correct page cross check and 2) it correctly sets up end ptr to be > > + subtract by lzcnt aligned. */ > > + leaq -1(%rdi, %rdx), %rax > > + vpbroadcastb %esi, %VECMATCH > > + > > + /* Check if we can load 1x VEC without cross a page. */ > > + testl $(PAGE_SIZE - VEC_SIZE), %eax > > + jz L(page_cross) > > + > > + /* Don't use rax for pointer here because EVEX has better encoding with > > + offset % VEC_SIZE == 0. */ > > + vpcmpb $0, -(VEC_SIZE)(%rdi, %rdx), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > + > > + /* Fall through for rdx (len) <= VEC_SIZE (expect small sizes). */ > > + cmpq $VEC_SIZE, %rdx > > + ja L(more_1x_vec) > > +L(ret_vec_x0_test): > > + > > + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which > > + will guarantee edx (len) is less than it. */ > > + lzcntl %ecx, %ecx > > + cmpl %ecx, %edx > > + jle L(zero_0) > > + subq %rcx, %rax > > + ret > > > > -# define VMOVA vmovdqa64 > > - > > -# define YMMMATCH ymm16 > > - > > -# define VEC_SIZE 32 > > - > > - .section .text.evex,"ax",@progbits > > -ENTRY (__memrchr_evex) > > - /* Broadcast CHAR to YMMMATCH. 
*/ > > - vpbroadcastb %esi, %YMMMATCH > > - > > - sub $VEC_SIZE, %RDX_LP > > - jbe L(last_vec_or_less) > > - > > - add %RDX_LP, %RDI_LP > > - > > - /* Check the last VEC_SIZE bytes. */ > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x0) > > - > > - subq $(VEC_SIZE * 4), %rdi > > - movl %edi, %ecx > > - andl $(VEC_SIZE - 1), %ecx > > - jz L(aligned_more) > > - > > - /* Align data for aligned loads in the loop. */ > > - addq $VEC_SIZE, %rdi > > - addq $VEC_SIZE, %rdx > > - andq $-VEC_SIZE, %rdi > > - subq %rcx, %rdx > > - > > - .p2align 4 > > -L(aligned_more): > > - subq $(VEC_SIZE * 4), %rdx > > - jbe L(last_4x_vec_or_less) > > - > > - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time > > - since data is only aligned to VEC_SIZE. */ > > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3) > > - > > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 > > - kmovd %k2, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x2) > > - > > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 > > - kmovd %k3, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x1) > > - > > - vpcmpb $0, (%rdi), %YMMMATCH, %k4 > > - kmovd %k4, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x0) > > - > > - /* Align data to 4 * VEC_SIZE for loop with fewer branches. > > - There are some overlaps with above if data isn't aligned > > - to 4 * VEC_SIZE. */ > > - movl %edi, %ecx > > - andl $(VEC_SIZE * 4 - 1), %ecx > > - jz L(loop_4x_vec) > > - > > - addq $(VEC_SIZE * 4), %rdi > > - addq $(VEC_SIZE * 4), %rdx > > - andq $-(VEC_SIZE * 4), %rdi > > - subq %rcx, %rdx > > + /* Fits in aligning bytes of first cache line. */ > > +L(zero_0): > > + xorl %eax, %eax > > + ret > > > > - .p2align 4 > > -L(loop_4x_vec): > > - /* Compare 4 * VEC at a time forward. */ > > - subq $(VEC_SIZE * 4), %rdi > > - subq $(VEC_SIZE * 4), %rdx > > - jbe L(last_4x_vec_or_less) > > - > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k2 > > - kord %k1, %k2, %k5 > > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k3 > > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k4 > > - > > - kord %k3, %k4, %k6 > > - kortestd %k5, %k6 > > - jz L(loop_4x_vec) > > - > > - /* There is a match. */ > > - kmovd %k4, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3) > > - > > - kmovd %k3, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x2) > > - > > - kmovd %k2, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x1) > > - > > - kmovd %k1, %eax > > - bsrl %eax, %eax > > - addq %rdi, %rax > > + .p2align 4,, 9 > > +L(ret_vec_x0_dec): > > + decq %rax > > +L(ret_vec_x0): > > + lzcntl %ecx, %ecx > > + subq %rcx, %rax > > ret > > > > - .p2align 4 > > -L(last_4x_vec_or_less): > > - addl $(VEC_SIZE * 4), %edx > > - cmpl $(VEC_SIZE * 2), %edx > > - jbe L(last_2x_vec) > > + .p2align 4,, 10 > > +L(more_1x_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0) > > > > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3) > > + /* Align rax (pointer to string). */ > > + andq $-VEC_SIZE, %rax > > > > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2 > > - kmovd %k2, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x2) > > + /* Recompute length after aligning. 
*/ > > + movq %rax, %rdx > > > > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k3 > > - kmovd %k3, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x1_check) > > - cmpl $(VEC_SIZE * 3), %edx > > - jbe L(zero) > > + /* Need no matter what. */ > > + vpcmpb $0, -(VEC_SIZE)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > > > - vpcmpb $0, (%rdi), %YMMMATCH, %k4 > > - kmovd %k4, %eax > > - testl %eax, %eax > > - jz L(zero) > > - bsrl %eax, %eax > > - subq $(VEC_SIZE * 4), %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addq %rdi, %rax > > - ret > > + subq %rdi, %rdx > > > > - .p2align 4 > > + cmpq $(VEC_SIZE * 2), %rdx > > + ja L(more_2x_vec) > > L(last_2x_vec): > > - vpcmpb $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3_check) > > + > > + /* Must dec rax because L(ret_vec_x0_test) expects it. */ > > + decq %rax > > cmpl $VEC_SIZE, %edx > > - jbe L(zero) > > - > > - vpcmpb $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > - jz L(zero) > > - bsrl %eax, %eax > > - subq $(VEC_SIZE * 2), %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addl $(VEC_SIZE * 2), %eax > > - addq %rdi, %rax > > + jbe L(ret_vec_x0_test) > > + > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0) > > + > > + /* Don't use rax for pointer here because EVEX has better encoding with > > + offset % VEC_SIZE == 0. */ > > + vpcmpb $0, -(VEC_SIZE * 2)(%rdi, %rdx), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > + /* NB: 64-bit lzcnt. This will naturally add 32 to position. */ > > + lzcntq %rcx, %rcx > > + cmpl %ecx, %edx > > + jle L(zero_0) > > + subq %rcx, %rax > > ret > > > > - .p2align 4 > > -L(last_vec_x0): > > - bsrl %eax, %eax > > - addq %rdi, %rax > > + /* Inexpensive place to put this regarding code size / target alignments > > + / ICache NLP. Necessary for 2-byte encoding of jump to page cross > > + case which in turn in necessary for hot path (len <= VEC_SIZE) to fit > is necessary? > > > + in first cache line. */ > > +L(page_cross): > > + movq %rax, %rsi > > + andq $-VEC_SIZE, %rsi > > + vpcmpb $0, (%rsi), %VECMATCH, %k0 > > + kmovd %k0, %r8d > > + /* Shift out negative alignment (because we are starting from endptr and > > + working backwards). */ > > + movl %eax, %ecx > > + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ > > + notl %ecx > > + shlxl %ecx, %r8d, %ecx > > + cmpq %rdi, %rsi > > + ja L(more_1x_vec) > > + lzcntl %ecx, %ecx > > + cmpl %ecx, %edx > > + jle L(zero_1) > > + subq %rcx, %rax > > ret > > > > - .p2align 4 > > -L(last_vec_x1): > > - bsrl %eax, %eax > > - addl $VEC_SIZE, %eax > > - addq %rdi, %rax > > + /* Continue creating zero labels that fit in aligning bytes and get > > + 2-byte encoding / are in the same cache line as condition. */ > > +L(zero_1): > > + xorl %eax, %eax > > ret > > > > - .p2align 4 > > -L(last_vec_x2): > > - bsrl %eax, %eax > > - addl $(VEC_SIZE * 2), %eax > > - addq %rdi, %rax > > + .p2align 4,, 8 > > +L(ret_vec_x1): > > + /* This will naturally add 32 to position. 
*/ > > + bsrl %ecx, %ecx > > + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax > > ret > > > > - .p2align 4 > > -L(last_vec_x3): > > - bsrl %eax, %eax > > - addl $(VEC_SIZE * 3), %eax > > - addq %rdi, %rax > > - ret > > + .p2align 4,, 8 > > +L(more_2x_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0_dec) > > > > - .p2align 4 > > -L(last_vec_x1_check): > > - bsrl %eax, %eax > > - subq $(VEC_SIZE * 3), %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addl $VEC_SIZE, %eax > > - addq %rdi, %rax > > - ret > > + vpcmpb $0, -(VEC_SIZE * 2)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1) > > > > - .p2align 4 > > -L(last_vec_x3_check): > > - bsrl %eax, %eax > > - subq $VEC_SIZE, %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addl $(VEC_SIZE * 3), %eax > > - addq %rdi, %rax > > - ret > > + /* Need no matter what. */ > > + vpcmpb $0, -(VEC_SIZE * 3)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > > > - .p2align 4 > > -L(zero): > > - xorl %eax, %eax > > + subq $(VEC_SIZE * 4), %rdx > > + ja L(more_4x_vec) > > + > > + cmpl $(VEC_SIZE * -1), %edx > > + jle L(ret_vec_x2_test) > > +L(last_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x2) > > + > > + > > + /* Need no matter what. */ > > + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > + lzcntl %ecx, %ecx > > + subq $(VEC_SIZE * 3 + 1), %rax > > + subq %rcx, %rax > > + cmpq %rax, %rdi > > + ja L(zero_1) > > ret > > > > - .p2align 4 > > -L(last_vec_or_less_aligned): > > - movl %edx, %ecx > > - > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > - > > - movl $1, %edx > > - /* Support rdx << 32. */ > > - salq %cl, %rdx > > - subq $1, %rdx > > - > > - kmovd %k1, %eax > > - > > - /* Remove the trailing bytes. */ > > - andl %edx, %eax > > - testl %eax, %eax > > - jz L(zero) > > - > > - bsrl %eax, %eax > > - addq %rdi, %rax > > + .p2align 4,, 8 > > +L(ret_vec_x2_test): > > + lzcntl %ecx, %ecx > > + subq $(VEC_SIZE * 2 + 1), %rax > > + subq %rcx, %rax > > + cmpq %rax, %rdi > > + ja L(zero_1) > > ret > > > > - .p2align 4 > > -L(last_vec_or_less): > > - addl $VEC_SIZE, %edx > > - > > - /* Check for zero length. */ > > - testl %edx, %edx > > - jz L(zero) > > - > > - movl %edi, %ecx > > - andl $(VEC_SIZE - 1), %ecx > > - jz L(last_vec_or_less_aligned) > > - > > - movl %ecx, %esi > > - movl %ecx, %r8d > > - addl %edx, %esi > > - andq $-VEC_SIZE, %rdi > > + .p2align 4,, 8 > > +L(ret_vec_x2): > > + bsrl %ecx, %ecx > > + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax > > + ret > > > > - subl $VEC_SIZE, %esi > > - ja L(last_vec_2x_aligned) > > + .p2align 4,, 8 > > +L(ret_vec_x3): > > + bsrl %ecx, %ecx > > + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax > > + ret > > > > - /* Check the last VEC. */ > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > + .p2align 4,, 8 > > +L(more_4x_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x2) > > > > - /* Remove the leading and trailing bytes. */ > > - sarl %cl, %eax > > - movl %edx, %ecx > > + vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > > > - movl $1, %edx > > - sall %cl, %edx > > - subl $1, %edx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x3) > > > > - andl %edx, %eax > > - testl %eax, %eax > > - jz L(zero) > > + /* Check if near end before re-aligning (otherwise might do an > > + unnecessary loop iteration). 
*/ > > + addq $-(VEC_SIZE * 4), %rax > > + cmpq $(VEC_SIZE * 4), %rdx > > + jbe L(last_4x_vec) > > > > - bsrl %eax, %eax > > - addq %rdi, %rax > > - addq %r8, %rax > > - ret > > + decq %rax > > + andq $-(VEC_SIZE * 4), %rax > > + movq %rdi, %rdx > > + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because > > + lengths that overflow can be valid and break the comparison. */ > > + andq $-(VEC_SIZE * 4), %rdx > > > > .p2align 4 > > -L(last_vec_2x_aligned): > > - movl %esi, %ecx > > - > > - /* Check the last VEC. */ > > - vpcmpb $0, VEC_SIZE(%rdi), %YMMMATCH, %k1 > > +L(loop_4x_vec): > > + /* Store 1 were not-equals and 0 where equals in k1 (used to mask later > > + on). */ > > + vpcmpb $4, (VEC_SIZE * 3)(%rax), %VECMATCH, %k1 > > + > > + /* VEC(2/3) will have zero-byte where we found a CHAR. */ > > + vpxorq (VEC_SIZE * 2)(%rax), %VECMATCH, %VEC(2) > > + vpxorq (VEC_SIZE * 1)(%rax), %VECMATCH, %VEC(3) > > + vpcmpb $0, (VEC_SIZE * 0)(%rax), %VECMATCH, %k4 > > + > > + /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit where > > + CHAR is found and VEC(2/3) have zero-byte where CHAR is found. */ > > + vpminub %VEC(2), %VEC(3), %VEC(3){%k1}{z} > > + vptestnmb %VEC(3), %VEC(3), %k2 > > + > > + /* Any 1s and we found CHAR. */ > > + kortestd %k2, %k4 > > + jnz L(loop_end) > > + > > + addq $-(VEC_SIZE * 4), %rax > > + cmpq %rdx, %rax > > + jne L(loop_4x_vec) > > + > > + /* Need to re-adjust rdx / rax for L(last_4x_vec). */ > > + subq $-(VEC_SIZE * 4), %rdx > > + movq %rdx, %rax > > + subl %edi, %edx > > +L(last_4x_vec): > > + > > + /* Used no matter what. */ > > + vpcmpb $0, (VEC_SIZE * -1)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > > > - movl $1, %edx > > - sall %cl, %edx > > - subl $1, %edx > > + cmpl $(VEC_SIZE * 2), %edx > > + jbe L(last_2x_vec) > > > > - kmovd %k1, %eax > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0_dec) > > > > - /* Remove the trailing bytes. */ > > - andl %edx, %eax > > > > - testl %eax, %eax > > - jnz L(last_vec_x1) > > + vpcmpb $0, (VEC_SIZE * -2)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > > > - /* Check the second last VEC. */ > > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1) > > > > - movl %r8d, %ecx > > + /* Used no matter what. */ > > + vpcmpb $0, (VEC_SIZE * -3)(%rax), %VECMATCH, %k0 > > + kmovd %k0, %ecx > > > > - kmovd %k1, %eax > > + cmpl $(VEC_SIZE * 3), %edx > > + ja L(last_vec) > > > > - /* Remove the leading bytes. Must use unsigned right shift for > > - bsrl below. */ > > - shrl %cl, %eax > > - testl %eax, %eax > > - jz L(zero) > > + lzcntl %ecx, %ecx > > + subq $(VEC_SIZE * 2 + 1), %rax > > + subq %rcx, %rax > > + cmpq %rax, %rdi > > + jbe L(ret_1) > > + xorl %eax, %eax > > +L(ret_1): > > + ret > > > > - bsrl %eax, %eax > > - addq %rdi, %rax > > - addq %r8, %rax > > + .p2align 4,, 6 > > +L(loop_end): > > + kmovd %k1, %ecx > > + notl %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0_end) > > + > > + vptestnmb %VEC(2), %VEC(2), %k0 > > + kmovd %k0, %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1_end) > > + > > + kmovd %k2, %ecx > > + kmovd %k4, %esi > > + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) > > + then it won't affect the result in esi (VEC4). If ecx is non-zero > > + then CHAR in VEC3 and bsrq will use that position. 
*/ > > + salq $32, %rcx > > + orq %rsi, %rcx > > + bsrq %rcx, %rcx > > + addq %rcx, %rax > > + ret > > + .p2align 4,, 4 > > +L(ret_vec_x0_end): > > + addq $(VEC_SIZE), %rax > > +L(ret_vec_x1_end): > > + bsrl %ecx, %ecx > > + leaq (VEC_SIZE * 2)(%rax, %rcx), %rax > > ret > > -END (__memrchr_evex) > > + > > +END(MEMRCHR) > > #endif > > -- > > 2.34.1 > > > > OK with the updated comments. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 82+ messages in thread
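[A rough C model of the lzcnt return trick discussed in this thread; it is a sketch with illustrative names, not the glibc source. Because the end pointer is pre-decremented by one, the address of the highest match in a 32-bit compare mask is simply (end - 1) - lzcnt(mask), and because lzcnt of a zero mask is 32, a single length comparison also covers the no-match case:

    #include <stddef.h>
    #include <stdint.h>

    /* lzcnt semantics: the leading-zero count of 0 is the operand
       width, unlike bsr, which leaves its destination undefined.  */
    static inline int
    lzcnt32 (uint32_t x)
    {
      return x ? __builtin_clz (x) : 32;
    }

    /* Sketch of L(ret_vec_x0_test) for len <= 32.  Bit i of mask is
       assumed set iff the byte at s + len - 32 + i equals CHAR (the
       vector compare covers the 32 bytes ending at s + len, so low
       bits may describe bytes before s).  */
    static const char *
    ret_vec_x0_test (const char *s, size_t len, uint32_t mask)
    {
      const char *end_m1 = s + len - 1;
      int lz = lzcnt32 (mask);
      /* lz == 32 (no match) and lz >= len (match before s) both land
         here, mirroring the single cmpl/jle pair in the patch.  */
      if ((int) len <= lz)
        return NULL;
      return end_m1 - lz;
    }

This is why the patch can drop the old bsrl/addq pairs: one lzcnt plus one subtraction handles the position, the bounds check, and the zero case together.]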
* [PATCH v6 6/8] x86: Optimize memrchr-avx2.S 2022-06-07 4:11 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (3 preceding siblings ...) 2022-06-07 4:11 ` [PATCH v6 5/8] x86: Optimize memrchr-evex.S Noah Goldstein @ 2022-06-07 4:11 ` Noah Goldstein 2022-06-07 18:17 ` H.J. Lu 2022-06-07 4:11 ` [PATCH v6 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein ` (2 subsequent siblings) 7 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:11 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully 3. reuses logic more 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns which saves either a branch or multiple instructions. The total code size saving is: 306 bytes Geometric Mean of all benchmarks New / Old: 0.760 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the beginning of the string (in first VEC). This case has roughly a 10-20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). This case has roughly a 15-45% speedup. Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S | 1 + sysdeps/x86_64/multiarch/memrchr-avx2.S | 534 ++++++++++---------- 2 files changed, 257 insertions(+), 278 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S index cea2d2a72d..5e9beeeef2 100644 --- a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S +++ b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S @@ -2,6 +2,7 @@ # define MEMRCHR __memrchr_avx2_rtm #endif +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2.S b/sysdeps/x86_64/multiarch/memrchr-avx2.S index ba2ce7cb03..bea4528068 100644 --- a/sysdeps/x86_64/multiarch/memrchr-avx2.S +++ b/sysdeps/x86_64/multiarch/memrchr-avx2.S @@ -21,340 +21,318 @@ # include <sysdep.h> # ifndef MEMRCHR -# define MEMRCHR __memrchr_avx2 +# define MEMRCHR __memrchr_avx2 # endif # ifndef VZEROUPPER -# define VZEROUPPER vzeroupper +# define VZEROUPPER vzeroupper # endif # ifndef SECTION # define SECTION(p) p##.avx # endif -# define VEC_SIZE 32 +# define VEC_SIZE 32 +# define PAGE_SIZE 4096 + .section SECTION(.text), "ax", @progbits +ENTRY(MEMRCHR) +# ifdef __ILP32__ + /* Clear upper bits. */ + and %RDX_LP, %RDX_LP +# else + test %RDX_LP, %RDX_LP +# endif + jz L(zero_0) - .section SECTION(.text),"ax",@progbits -ENTRY (MEMRCHR) - /* Broadcast CHAR to YMM0. */ vmovd %esi, %xmm0 - vpbroadcastb %xmm0, %ymm0 - - sub $VEC_SIZE, %RDX_LP - jbe L(last_vec_or_less) - - add %RDX_LP, %RDI_LP - - /* Check the last VEC_SIZE bytes. */ - vpcmpeqb (%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x0) + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a + correct page cross check and 2) it correctly sets up end ptr to be + subtract by lzcnt aligned. */ + leaq -1(%rdx, %rdi), %rax - subq $(VEC_SIZE * 4), %rdi - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(aligned_more) + vpbroadcastb %xmm0, %ymm0 - /* Align data for aligned loads in the loop. 
*/ - addq $VEC_SIZE, %rdi - addq $VEC_SIZE, %rdx - andq $-VEC_SIZE, %rdi - subq %rcx, %rdx + /* Check if we can load 1x VEC without cross a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %eax + jz L(page_cross) + + vpcmpeqb -(VEC_SIZE - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + cmpq $VEC_SIZE, %rdx + ja L(more_1x_vec) + +L(ret_vec_x0_test): + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which + will gurantee edx (len) is less than it. */ + lzcntl %ecx, %ecx + + /* Hoist vzeroupper (not great for RTM) to save code size. This allows + all logic for edx (len) <= VEC_SIZE to fit in first cache line. */ + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret - .p2align 4 -L(aligned_more): - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpcmpeqb (%rdi), %ymm0, %ymm4 - vpmovmskb %ymm4, %eax - testl %eax, %eax - jnz L(last_vec_x0) - - /* Align data to 4 * VEC_SIZE for loop with fewer branches. - There are some overlaps with above if data isn't aligned - to 4 * VEC_SIZE. */ - movl %edi, %ecx - andl $(VEC_SIZE * 4 - 1), %ecx - jz L(loop_4x_vec) - - addq $(VEC_SIZE * 4), %rdi - addq $(VEC_SIZE * 4), %rdx - andq $-(VEC_SIZE * 4), %rdi - subq %rcx, %rdx + /* Fits in aligning bytes of first cache line. */ +L(zero_0): + xorl %eax, %eax + ret - .p2align 4 -L(loop_4x_vec): - /* Compare 4 * VEC at a time forward. */ - subq $(VEC_SIZE * 4), %rdi - subq $(VEC_SIZE * 4), %rdx - jbe L(last_4x_vec_or_less) - - vmovdqa (%rdi), %ymm1 - vmovdqa VEC_SIZE(%rdi), %ymm2 - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm3 - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm4 - - vpcmpeqb %ymm1, %ymm0, %ymm1 - vpcmpeqb %ymm2, %ymm0, %ymm2 - vpcmpeqb %ymm3, %ymm0, %ymm3 - vpcmpeqb %ymm4, %ymm0, %ymm4 - - vpor %ymm1, %ymm2, %ymm5 - vpor %ymm3, %ymm4, %ymm6 - vpor %ymm5, %ymm6, %ymm5 - - vpmovmskb %ymm5, %eax - testl %eax, %eax - jz L(loop_4x_vec) - - /* There is a match. */ - vpmovmskb %ymm4, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x1) - - vpmovmskb %ymm1, %eax - bsrl %eax, %eax - addq %rdi, %rax + .p2align 4,, 9 +L(ret_vec_x0): + lzcntl %ecx, %ecx + subq %rcx, %rax L(return_vzeroupper): ZERO_UPPER_VEC_REGISTERS_RETURN - .p2align 4 -L(last_4x_vec_or_less): - addl $(VEC_SIZE * 4), %edx - cmpl $(VEC_SIZE * 2), %edx - jbe L(last_2x_vec) - - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(last_vec_x1_check) - cmpl $(VEC_SIZE * 3), %edx - jbe L(zero) - - vpcmpeqb (%rdi), %ymm0, %ymm4 - vpmovmskb %ymm4, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 4), %rdx - addq %rax, %rdx - jl L(zero) - addq %rdi, %rax - VZEROUPPER_RETURN - - .p2align 4 + .p2align 4,, 10 +L(more_1x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) + + /* Align rax (string pointer). 
*/ + andq $-VEC_SIZE, %rax + + /* Recompute remaining length after aligning. */ + movq %rax, %rdx + /* Need this comparison next no matter what. */ + vpcmpeqb -(VEC_SIZE)(%rax), %ymm0, %ymm1 + subq %rdi, %rdx + decq %rax + vpmovmskb %ymm1, %ecx + /* Fall through for short (hotter than length). */ + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) L(last_2x_vec): - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(last_vec_x3_check) cmpl $VEC_SIZE, %edx - jbe L(zero) - - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - subq $(VEC_SIZE * 2), %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax - VZEROUPPER_RETURN - - .p2align 4 -L(last_vec_x0): - bsrl %eax, %eax - addq %rdi, %rax - VZEROUPPER_RETURN + jbe L(ret_vec_x0_test) + + testl %ecx, %ecx + jnz L(ret_vec_x0) + + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + /* 64-bit lzcnt. This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret - .p2align 4 -L(last_vec_x1): - bsrl %eax, %eax - addl $VEC_SIZE, %eax - addq %rdi, %rax - VZEROUPPER_RETURN - .p2align 4 -L(last_vec_x2): - bsrl %eax, %eax - addl $(VEC_SIZE * 2), %eax - addq %rdi, %rax + /* Inexpensive place to put this regarding code size / target alignments + / ICache NLP. Necessary for 2-byte encoding of jump to page cross + case which in turn in necessary for hot path (len <= VEC_SIZE) to fit + in first cache line. */ +L(page_cross): + movq %rax, %rsi + andq $-VEC_SIZE, %rsi + vpcmpeqb (%rsi), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + movl %eax, %r8d + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ + notl %r8d + shlxl %r8d, %ecx, %ecx + cmpq %rdi, %rsi + ja L(more_1x_vec) + lzcntl %ecx, %ecx + COND_VZEROUPPER + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret + .p2align 4,, 11 +L(ret_vec_x1): + /* This will naturally add 32 to position. */ + lzcntq %rcx, %rcx + subq %rcx, %rax VZEROUPPER_RETURN + .p2align 4,, 10 +L(more_2x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x0) - .p2align 4 -L(last_vec_x3): - bsrl %eax, %eax - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - ret + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1) - .p2align 4 -L(last_vec_x1_check): - bsrl %eax, %eax - subq $(VEC_SIZE * 3), %rdx - addq %rax, %rdx - jl L(zero) - addl $VEC_SIZE, %eax - addq %rdi, %rax - VZEROUPPER_RETURN - .p2align 4 -L(last_vec_x3_check): - bsrl %eax, %eax - subq $VEC_SIZE, %rdx - addq %rax, %rdx - jl L(zero) - addl $(VEC_SIZE * 3), %eax - addq %rdi, %rax - VZEROUPPER_RETURN + /* Needed no matter what. */ + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - .p2align 4 -L(zero): - xorl %eax, %eax - VZEROUPPER_RETURN + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) + + cmpl $(VEC_SIZE * -1), %edx + jle L(ret_vec_x2_test) + +L(last_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) + + /* Needed no matter what. */ + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 3), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_2) + ret - .p2align 4 -L(null): + /* First in aligning bytes. 
*/ +L(zero_2): xorl %eax, %eax ret - .p2align 4 -L(last_vec_or_less_aligned): - movl %edx, %ecx + .p2align 4,, 4 +L(ret_vec_x2_test): + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + ja L(zero_2) + ret - vpcmpeqb (%rdi), %ymm0, %ymm1 - movl $1, %edx - /* Support rdx << 32. */ - salq %cl, %rdx - subq $1, %rdx + .p2align 4,, 11 +L(ret_vec_x2): + /* ecx must be non-zero. */ + bsrl %ecx, %ecx + leaq (VEC_SIZE * -3 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - vpmovmskb %ymm1, %eax + .p2align 4,, 14 +L(ret_vec_x3): + /* ecx must be non-zero. */ + bsrl %ecx, %ecx + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - /* Remove the trailing bytes. */ - andl %edx, %eax - testl %eax, %eax - jz L(zero) - bsrl %eax, %eax - addq %rdi, %rax - VZEROUPPER_RETURN .p2align 4 -L(last_vec_or_less): - addl $VEC_SIZE, %edx +L(more_4x_vec): + testl %ecx, %ecx + jnz L(ret_vec_x2) - /* Check for zero length. */ - testl %edx, %edx - jz L(null) + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - movl %edi, %ecx - andl $(VEC_SIZE - 1), %ecx - jz L(last_vec_or_less_aligned) + testl %ecx, %ecx + jnz L(ret_vec_x3) - movl %ecx, %esi - movl %ecx, %r8d - addl %edx, %esi - andq $-VEC_SIZE, %rdi + /* Check if near end before re-aligning (otherwise might do an + unnecissary loop iteration). */ + addq $-(VEC_SIZE * 4), %rax + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) - subl $VEC_SIZE, %esi - ja L(last_vec_2x_aligned) + /* Align rax to (VEC_SIZE - 1). */ + orq $(VEC_SIZE * 4 - 1), %rax + movq %rdi, %rdx + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because + lengths that overflow can be valid and break the comparison. */ + orq $(VEC_SIZE * 4 - 1), %rdx - /* Check the last VEC. */ - vpcmpeqb (%rdi), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax - - /* Remove the leading and trailing bytes. */ - sarl %cl, %eax - movl %edx, %ecx + .p2align 4 +L(loop_4x_vec): + /* Need this comparison next no matter what. */ + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm2 + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm3 + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm4 - movl $1, %edx - sall %cl, %edx - subl $1, %edx + vpor %ymm1, %ymm2, %ymm2 + vpor %ymm3, %ymm4, %ymm4 + vpor %ymm2, %ymm4, %ymm4 + vpmovmskb %ymm4, %esi - andl %edx, %eax - testl %eax, %eax - jz L(zero) + testl %esi, %esi + jnz L(loop_end) - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax - VZEROUPPER_RETURN + addq $(VEC_SIZE * -4), %rax + cmpq %rdx, %rax + jne L(loop_4x_vec) - .p2align 4 -L(last_vec_2x_aligned): - movl %esi, %ecx + subl %edi, %edx + incl %edx - /* Check the last VEC. */ - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm1 +L(last_4x_vec): + /* Used no matter what. */ + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - movl $1, %edx - sall %cl, %edx - subl $1, %edx + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) - vpmovmskb %ymm1, %eax + testl %ecx, %ecx + jnz L(ret_vec_x0_end) - /* Remove the trailing bytes. */ - andl %edx, %eax + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) - testl %eax, %eax - jnz L(last_vec_x1) + /* Used no matter what. */ + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 + vpmovmskb %ymm1, %ecx - /* Check the second last VEC. 
*/ - vpcmpeqb (%rdi), %ymm0, %ymm1 + cmpl $(VEC_SIZE * 3), %edx + ja L(last_vec) + + lzcntl %ecx, %ecx + subq $(VEC_SIZE * 2), %rax + COND_VZEROUPPER + subq %rcx, %rax + cmpq %rax, %rdi + jbe L(ret0) + xorl %eax, %eax +L(ret0): + ret - movl %r8d, %ecx - vpmovmskb %ymm1, %eax + .p2align 4 +L(loop_end): + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x0_end) + + vpmovmskb %ymm2, %ecx + testl %ecx, %ecx + jnz L(ret_vec_x1_end) + + vpmovmskb %ymm3, %ecx + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If ecx is non-zero + then CHAR in VEC3 and bsrq will use that position. */ + salq $32, %rcx + orq %rsi, %rcx + bsrq %rcx, %rcx + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax + VZEROUPPER_RETURN - /* Remove the leading bytes. Must use unsigned right shift for - bsrl below. */ - shrl %cl, %eax - testl %eax, %eax - jz L(zero) + .p2align 4,, 4 +L(ret_vec_x1_end): + /* 64-bit version will automatically add 32 (VEC_SIZE). */ + lzcntq %rcx, %rcx + subq %rcx, %rax + VZEROUPPER_RETURN - bsrl %eax, %eax - addq %rdi, %rax - addq %r8, %rax + .p2align 4,, 4 +L(ret_vec_x0_end): + lzcntl %ecx, %ecx + subq %rcx, %rax VZEROUPPER_RETURN -END (MEMRCHR) + + /* 2 bytes until next cache line. */ +END(MEMRCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
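[The page-cross path in the patch above leans on two identities that are easier to see in C: -x == ~(x - 1), so with endptr - 1 already in a register a single not produces -endptr, and shlx masks its shift count to the operand width, so only the low five bits matter. A minimal sketch, assuming the aligned block containing end - 1 also covers the start of the string (the asm branches back to L(more_1x_vec) otherwise); names are illustrative:

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of L(page_cross): load the aligned 32-byte block holding
       end - 1, then shift the match mask left so the byte at end - 1
       lands in bit 31 and lzcnt again counts backwards from the end.  */
    static const char *
    page_cross_tail (const char *s, size_t len, int ch)
    {
      const char *end_m1 = s + len - 1;
      const unsigned char *base =
        (const unsigned char *) ((uintptr_t) end_m1 & ~(uintptr_t) 31);

      uint32_t mask = 0;        /* stand-in for vpcmpeqb + vpmovmskb */
      for (int i = 0; i < 32; i++)
        mask |= (uint32_t) (base[i] == (unsigned char) ch) << i;

      /* ~end_m1 & 31 == -end & 31: shift out the bytes at end and
         beyond (the "negative alignment" in the patch comment).  */
      mask <<= (unsigned) (~(uintptr_t) end_m1 & 31);

      int lz = mask ? __builtin_clz (mask) : 32;
      if ((int) len <= lz)      /* also rejects matches before s */
        return NULL;
      return end_m1 - lz;
    }]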
* Re: [PATCH v6 6/8] x86: Optimize memrchr-avx2.S 2022-06-07 4:11 ` [PATCH v6 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein @ 2022-06-07 18:17 ` H.J. Lu 2022-07-14 2:26 ` Sunil Pandey 0 siblings, 1 reply; 82+ messages in thread From: H.J. Lu @ 2022-06-07 18:17 UTC (permalink / raw) To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > The new code: > 1. prioritizes smaller user-arg lengths more. > 2. optimizes target placement more carefully > 3. reuses logic more > 4. fixes up various inefficiencies in the logic. The biggest > case here is the `lzcnt` logic for checking returns which > saves either a branch or multiple instructions. > > The total code size saving is: 306 bytes > Geometric Mean of all benchmarks New / Old: 0.760 > > Regressions: > There are some regressions. Particularly where the length (user arg > length) is large but the position of the match char is near the > beginning of the string (in first VEC). This case has roughly a > 10-20% regression. > > This is because the new logic gives the hot path for immediate matches > to shorter lengths (the more common input). This case has roughly > a 15-45% speedup. > > Full xcheck passes on x86_64. > --- > sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S | 1 + > sysdeps/x86_64/multiarch/memrchr-avx2.S | 534 ++++++++++---------- > 2 files changed, 257 insertions(+), 278 deletions(-) > > diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > index cea2d2a72d..5e9beeeef2 100644 > --- a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > +++ b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > @@ -2,6 +2,7 @@ > # define MEMRCHR __memrchr_avx2_rtm > #endif > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2.S b/sysdeps/x86_64/multiarch/memrchr-avx2.S > index ba2ce7cb03..bea4528068 100644 > --- a/sysdeps/x86_64/multiarch/memrchr-avx2.S > +++ b/sysdeps/x86_64/multiarch/memrchr-avx2.S > @@ -21,340 +21,318 @@ > # include <sysdep.h> > > # ifndef MEMRCHR > -# define MEMRCHR __memrchr_avx2 > +# define MEMRCHR __memrchr_avx2 > # endif > > # ifndef VZEROUPPER > -# define VZEROUPPER vzeroupper > +# define VZEROUPPER vzeroupper > # endif > > # ifndef SECTION > # define SECTION(p) p##.avx > # endif > > -# define VEC_SIZE 32 > +# define VEC_SIZE 32 > +# define PAGE_SIZE 4096 > + .section SECTION(.text), "ax", @progbits > +ENTRY(MEMRCHR) > +# ifdef __ILP32__ > + /* Clear upper bits. */ > + and %RDX_LP, %RDX_LP > +# else > + test %RDX_LP, %RDX_LP > +# endif > + jz L(zero_0) > > - .section SECTION(.text),"ax",@progbits > -ENTRY (MEMRCHR) > - /* Broadcast CHAR to YMM0. */ > vmovd %esi, %xmm0 > - vpbroadcastb %xmm0, %ymm0 > - > - sub $VEC_SIZE, %RDX_LP > - jbe L(last_vec_or_less) > - > - add %RDX_LP, %RDI_LP > - > - /* Check the last VEC_SIZE bytes. */ > - vpcmpeqb (%rdi), %ymm0, %ymm1 > - vpmovmskb %ymm1, %eax > - testl %eax, %eax > - jnz L(last_vec_x0) > + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a > + correct page cross check and 2) it correctly sets up end ptr to be > + subtract by lzcnt aligned. */ > + leaq -1(%rdx, %rdi), %rax > > - subq $(VEC_SIZE * 4), %rdi > - movl %edi, %ecx > - andl $(VEC_SIZE - 1), %ecx > - jz L(aligned_more) > + vpbroadcastb %xmm0, %ymm0 > > - /* Align data for aligned loads in the loop. 
*/ > - addq $VEC_SIZE, %rdi > - addq $VEC_SIZE, %rdx > - andq $-VEC_SIZE, %rdi > - subq %rcx, %rdx > + /* Check if we can load 1x VEC without cross a page. */ > + testl $(PAGE_SIZE - VEC_SIZE), %eax > + jz L(page_cross) > + > + vpcmpeqb -(VEC_SIZE - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > + cmpq $VEC_SIZE, %rdx > + ja L(more_1x_vec) > + > +L(ret_vec_x0_test): > + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which > + will gurantee edx (len) is less than it. */ > + lzcntl %ecx, %ecx > + > + /* Hoist vzeroupper (not great for RTM) to save code size. This allows > + all logic for edx (len) <= VEC_SIZE to fit in first cache line. */ > + COND_VZEROUPPER > + cmpl %ecx, %edx > + jle L(zero_0) > + subq %rcx, %rax > + ret > > - .p2align 4 > -L(aligned_more): > - subq $(VEC_SIZE * 4), %rdx > - jbe L(last_4x_vec_or_less) > - > - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time > - since data is only aligned to VEC_SIZE. */ > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > - vpmovmskb %ymm1, %eax > - testl %eax, %eax > - jnz L(last_vec_x3) > - > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 > - vpmovmskb %ymm2, %eax > - testl %eax, %eax > - jnz L(last_vec_x2) > - > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 > - vpmovmskb %ymm3, %eax > - testl %eax, %eax > - jnz L(last_vec_x1) > - > - vpcmpeqb (%rdi), %ymm0, %ymm4 > - vpmovmskb %ymm4, %eax > - testl %eax, %eax > - jnz L(last_vec_x0) > - > - /* Align data to 4 * VEC_SIZE for loop with fewer branches. > - There are some overlaps with above if data isn't aligned > - to 4 * VEC_SIZE. */ > - movl %edi, %ecx > - andl $(VEC_SIZE * 4 - 1), %ecx > - jz L(loop_4x_vec) > - > - addq $(VEC_SIZE * 4), %rdi > - addq $(VEC_SIZE * 4), %rdx > - andq $-(VEC_SIZE * 4), %rdi > - subq %rcx, %rdx > + /* Fits in aligning bytes of first cache line. */ > +L(zero_0): > + xorl %eax, %eax > + ret > > - .p2align 4 > -L(loop_4x_vec): > - /* Compare 4 * VEC at a time forward. */ > - subq $(VEC_SIZE * 4), %rdi > - subq $(VEC_SIZE * 4), %rdx > - jbe L(last_4x_vec_or_less) > - > - vmovdqa (%rdi), %ymm1 > - vmovdqa VEC_SIZE(%rdi), %ymm2 > - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm3 > - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm4 > - > - vpcmpeqb %ymm1, %ymm0, %ymm1 > - vpcmpeqb %ymm2, %ymm0, %ymm2 > - vpcmpeqb %ymm3, %ymm0, %ymm3 > - vpcmpeqb %ymm4, %ymm0, %ymm4 > - > - vpor %ymm1, %ymm2, %ymm5 > - vpor %ymm3, %ymm4, %ymm6 > - vpor %ymm5, %ymm6, %ymm5 > - > - vpmovmskb %ymm5, %eax > - testl %eax, %eax > - jz L(loop_4x_vec) > - > - /* There is a match. 
*/ > - vpmovmskb %ymm4, %eax > - testl %eax, %eax > - jnz L(last_vec_x3) > - > - vpmovmskb %ymm3, %eax > - testl %eax, %eax > - jnz L(last_vec_x2) > - > - vpmovmskb %ymm2, %eax > - testl %eax, %eax > - jnz L(last_vec_x1) > - > - vpmovmskb %ymm1, %eax > - bsrl %eax, %eax > - addq %rdi, %rax > + .p2align 4,, 9 > +L(ret_vec_x0): > + lzcntl %ecx, %ecx > + subq %rcx, %rax > L(return_vzeroupper): > ZERO_UPPER_VEC_REGISTERS_RETURN > > - .p2align 4 > -L(last_4x_vec_or_less): > - addl $(VEC_SIZE * 4), %edx > - cmpl $(VEC_SIZE * 2), %edx > - jbe L(last_2x_vec) > - > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > - vpmovmskb %ymm1, %eax > - testl %eax, %eax > - jnz L(last_vec_x3) > - > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 > - vpmovmskb %ymm2, %eax > - testl %eax, %eax > - jnz L(last_vec_x2) > - > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 > - vpmovmskb %ymm3, %eax > - testl %eax, %eax > - jnz L(last_vec_x1_check) > - cmpl $(VEC_SIZE * 3), %edx > - jbe L(zero) > - > - vpcmpeqb (%rdi), %ymm0, %ymm4 > - vpmovmskb %ymm4, %eax > - testl %eax, %eax > - jz L(zero) > - bsrl %eax, %eax > - subq $(VEC_SIZE * 4), %rdx > - addq %rax, %rdx > - jl L(zero) > - addq %rdi, %rax > - VZEROUPPER_RETURN > - > - .p2align 4 > + .p2align 4,, 10 > +L(more_1x_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x0) > + > + /* Align rax (string pointer). */ > + andq $-VEC_SIZE, %rax > + > + /* Recompute remaining length after aligning. */ > + movq %rax, %rdx > + /* Need this comparison next no matter what. */ > + vpcmpeqb -(VEC_SIZE)(%rax), %ymm0, %ymm1 > + subq %rdi, %rdx > + decq %rax > + vpmovmskb %ymm1, %ecx > + /* Fall through for short (hotter than length). */ > + cmpq $(VEC_SIZE * 2), %rdx > + ja L(more_2x_vec) > L(last_2x_vec): > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > - vpmovmskb %ymm1, %eax > - testl %eax, %eax > - jnz L(last_vec_x3_check) > cmpl $VEC_SIZE, %edx > - jbe L(zero) > - > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1 > - vpmovmskb %ymm1, %eax > - testl %eax, %eax > - jz L(zero) > - bsrl %eax, %eax > - subq $(VEC_SIZE * 2), %rdx > - addq %rax, %rdx > - jl L(zero) > - addl $(VEC_SIZE * 2), %eax > - addq %rdi, %rax > - VZEROUPPER_RETURN > - > - .p2align 4 > -L(last_vec_x0): > - bsrl %eax, %eax > - addq %rdi, %rax > - VZEROUPPER_RETURN > + jbe L(ret_vec_x0_test) > + > + testl %ecx, %ecx > + jnz L(ret_vec_x0) > + > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > + /* 64-bit lzcnt. This will naturally add 32 to position. */ > + lzcntq %rcx, %rcx > + COND_VZEROUPPER > + cmpl %ecx, %edx > + jle L(zero_0) > + subq %rcx, %rax > + ret > > - .p2align 4 > -L(last_vec_x1): > - bsrl %eax, %eax > - addl $VEC_SIZE, %eax > - addq %rdi, %rax > - VZEROUPPER_RETURN > > - .p2align 4 > -L(last_vec_x2): > - bsrl %eax, %eax > - addl $(VEC_SIZE * 2), %eax > - addq %rdi, %rax > + /* Inexpensive place to put this regarding code size / target alignments > + / ICache NLP. Necessary for 2-byte encoding of jump to page cross > + case which in turn in necessary for hot path (len <= VEC_SIZE) to fit is necessary? > + in first cache line. */ > +L(page_cross): > + movq %rax, %rsi > + andq $-VEC_SIZE, %rsi > + vpcmpeqb (%rsi), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > + /* Shift out negative alignment (because we are starting from endptr and > + working backwards). */ > + movl %eax, %r8d > + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). 
*/ > + notl %r8d > + shlxl %r8d, %ecx, %ecx > + cmpq %rdi, %rsi > + ja L(more_1x_vec) > + lzcntl %ecx, %ecx > + COND_VZEROUPPER > + cmpl %ecx, %edx > + jle L(zero_0) > + subq %rcx, %rax > + ret > + .p2align 4,, 11 > +L(ret_vec_x1): > + /* This will naturally add 32 to position. */ > + lzcntq %rcx, %rcx > + subq %rcx, %rax > VZEROUPPER_RETURN > + .p2align 4,, 10 > +L(more_2x_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x0) > > - .p2align 4 > -L(last_vec_x3): > - bsrl %eax, %eax > - addl $(VEC_SIZE * 3), %eax > - addq %rdi, %rax > - ret > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x1) > > - .p2align 4 > -L(last_vec_x1_check): > - bsrl %eax, %eax > - subq $(VEC_SIZE * 3), %rdx > - addq %rax, %rdx > - jl L(zero) > - addl $VEC_SIZE, %eax > - addq %rdi, %rax > - VZEROUPPER_RETURN > > - .p2align 4 > -L(last_vec_x3_check): > - bsrl %eax, %eax > - subq $VEC_SIZE, %rdx > - addq %rax, %rdx > - jl L(zero) > - addl $(VEC_SIZE * 3), %eax > - addq %rdi, %rax > - VZEROUPPER_RETURN > + /* Needed no matter what. */ > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > > - .p2align 4 > -L(zero): > - xorl %eax, %eax > - VZEROUPPER_RETURN > + subq $(VEC_SIZE * 4), %rdx > + ja L(more_4x_vec) > + > + cmpl $(VEC_SIZE * -1), %edx > + jle L(ret_vec_x2_test) > + > +L(last_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x2) > + > + /* Needed no matter what. */ > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > + lzcntl %ecx, %ecx > + subq $(VEC_SIZE * 3), %rax > + COND_VZEROUPPER > + subq %rcx, %rax > + cmpq %rax, %rdi > + ja L(zero_2) > + ret > > - .p2align 4 > -L(null): > + /* First in aligning bytes. */ > +L(zero_2): > xorl %eax, %eax > ret > > - .p2align 4 > -L(last_vec_or_less_aligned): > - movl %edx, %ecx > + .p2align 4,, 4 > +L(ret_vec_x2_test): > + lzcntl %ecx, %ecx > + subq $(VEC_SIZE * 2), %rax > + COND_VZEROUPPER > + subq %rcx, %rax > + cmpq %rax, %rdi > + ja L(zero_2) > + ret > > - vpcmpeqb (%rdi), %ymm0, %ymm1 > > - movl $1, %edx > - /* Support rdx << 32. */ > - salq %cl, %rdx > - subq $1, %rdx > + .p2align 4,, 11 > +L(ret_vec_x2): > + /* ecx must be non-zero. */ > + bsrl %ecx, %ecx > + leaq (VEC_SIZE * -3 + 1)(%rcx, %rax), %rax > + VZEROUPPER_RETURN > > - vpmovmskb %ymm1, %eax > + .p2align 4,, 14 > +L(ret_vec_x3): > + /* ecx must be non-zero. */ > + bsrl %ecx, %ecx > + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax > + VZEROUPPER_RETURN > > - /* Remove the trailing bytes. */ > - andl %edx, %eax > - testl %eax, %eax > - jz L(zero) > > - bsrl %eax, %eax > - addq %rdi, %rax > - VZEROUPPER_RETURN > > .p2align 4 > -L(last_vec_or_less): > - addl $VEC_SIZE, %edx > +L(more_4x_vec): > + testl %ecx, %ecx > + jnz L(ret_vec_x2) > > - /* Check for zero length. */ > - testl %edx, %edx > - jz L(null) > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > > - movl %edi, %ecx > - andl $(VEC_SIZE - 1), %ecx > - jz L(last_vec_or_less_aligned) > + testl %ecx, %ecx > + jnz L(ret_vec_x3) > > - movl %ecx, %esi > - movl %ecx, %r8d > - addl %edx, %esi > - andq $-VEC_SIZE, %rdi > + /* Check if near end before re-aligning (otherwise might do an > + unnecissary loop iteration). */ > + addq $-(VEC_SIZE * 4), %rax > + cmpq $(VEC_SIZE * 4), %rdx > + jbe L(last_4x_vec) > > - subl $VEC_SIZE, %esi > - ja L(last_vec_2x_aligned) > + /* Align rax to (VEC_SIZE - 1). */ > + orq $(VEC_SIZE * 4 - 1), %rax > + movq %rdi, %rdx > + /* Get endptr for loop in rdx. 
NB: Can't just do while rax > rdi because > + lengths that overflow can be valid and break the comparison. */ > + orq $(VEC_SIZE * 4 - 1), %rdx > > - /* Check the last VEC. */ > - vpcmpeqb (%rdi), %ymm0, %ymm1 > - vpmovmskb %ymm1, %eax > - > - /* Remove the leading and trailing bytes. */ > - sarl %cl, %eax > - movl %edx, %ecx > + .p2align 4 > +L(loop_4x_vec): > + /* Need this comparison next no matter what. */ > + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm2 > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm3 > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm4 > > - movl $1, %edx > - sall %cl, %edx > - subl $1, %edx > + vpor %ymm1, %ymm2, %ymm2 > + vpor %ymm3, %ymm4, %ymm4 > + vpor %ymm2, %ymm4, %ymm4 > + vpmovmskb %ymm4, %esi > > - andl %edx, %eax > - testl %eax, %eax > - jz L(zero) > + testl %esi, %esi > + jnz L(loop_end) > > - bsrl %eax, %eax > - addq %rdi, %rax > - addq %r8, %rax > - VZEROUPPER_RETURN > + addq $(VEC_SIZE * -4), %rax > + cmpq %rdx, %rax > + jne L(loop_4x_vec) > > - .p2align 4 > -L(last_vec_2x_aligned): > - movl %esi, %ecx > + subl %edi, %edx > + incl %edx > > - /* Check the last VEC. */ > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm1 > +L(last_4x_vec): > + /* Used no matter what. */ > + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > > - movl $1, %edx > - sall %cl, %edx > - subl $1, %edx > + cmpl $(VEC_SIZE * 2), %edx > + jbe L(last_2x_vec) > > - vpmovmskb %ymm1, %eax > + testl %ecx, %ecx > + jnz L(ret_vec_x0_end) > > - /* Remove the trailing bytes. */ > - andl %edx, %eax > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x1_end) > > - testl %eax, %eax > - jnz L(last_vec_x1) > + /* Used no matter what. */ > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 > + vpmovmskb %ymm1, %ecx > > - /* Check the second last VEC. */ > - vpcmpeqb (%rdi), %ymm0, %ymm1 > + cmpl $(VEC_SIZE * 3), %edx > + ja L(last_vec) > + > + lzcntl %ecx, %ecx > + subq $(VEC_SIZE * 2), %rax > + COND_VZEROUPPER > + subq %rcx, %rax > + cmpq %rax, %rdi > + jbe L(ret0) > + xorl %eax, %eax > +L(ret0): > + ret > > - movl %r8d, %ecx > > - vpmovmskb %ymm1, %eax > + .p2align 4 > +L(loop_end): > + vpmovmskb %ymm1, %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x0_end) > + > + vpmovmskb %ymm2, %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x1_end) > + > + vpmovmskb %ymm3, %ecx > + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) > + then it won't affect the result in esi (VEC4). If ecx is non-zero > + then CHAR in VEC3 and bsrq will use that position. */ > + salq $32, %rcx > + orq %rsi, %rcx > + bsrq %rcx, %rcx > + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax > + VZEROUPPER_RETURN > > - /* Remove the leading bytes. Must use unsigned right shift for > - bsrl below. */ > - shrl %cl, %eax > - testl %eax, %eax > - jz L(zero) > + .p2align 4,, 4 > +L(ret_vec_x1_end): > + /* 64-bit version will automatically add 32 (VEC_SIZE). */ > + lzcntq %rcx, %rcx > + subq %rcx, %rax > + VZEROUPPER_RETURN > > - bsrl %eax, %eax > - addq %rdi, %rax > - addq %r8, %rax > + .p2align 4,, 4 > +L(ret_vec_x0_end): > + lzcntl %ecx, %ecx > + subq %rcx, %rax > VZEROUPPER_RETURN > -END (MEMRCHR) > + > + /* 2 bytes until next cache line. */ > +END(MEMRCHR) > #endif > -- > 2.34.1 > OK with the updated comments. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Thanks. -- H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
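[The salq/orq/bsrq sequence at L(loop_end) packs the last two match masks into one 64-bit scan. A short C sketch of the idea (illustrative names; the constants are the AVX2 ones, where the loop's vectors are loaded at rax - (VEC_SIZE * n - 1)):

    #include <stdint.h>

    /* VEC3's mask fills the high half and VEC4's the low half, so a
       single bsr finds the highest match across both vectors; if VEC3
       had no match, its half is zero and bsr falls through to VEC4's
       bits.  Reached only after the loop's test proved a match exists,
       so `both' is non-zero here.  */
    static const char *
    combine_last_two (const char *rax, uint32_t mask3, uint32_t mask4)
    {
      uint64_t both = ((uint64_t) mask3 << 32) | mask4;
      int pos = 63 - __builtin_clzll (both);   /* bsrq */
      /* Byte `pos' of the combined mask sits at rax - (4 * 32 - 1) + pos,
         matching the final leaq in the patch.  */
      return rax + (1 - 4 * 32) + pos;
    }]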
* Re: [PATCH v6 6/8] x86: Optimize memrchr-avx2.S 2022-06-07 18:17 ` H.J. Lu @ 2022-07-14 2:26 ` Sunil Pandey 2022-07-14 2:43 ` Noah Goldstein 0 siblings, 1 reply; 82+ messages in thread From: Sunil Pandey @ 2022-07-14 2:26 UTC (permalink / raw) To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library On Tue, Jun 7, 2022 at 11:18 AM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > The new code: > > 1. prioritizes smaller user-arg lengths more. > > 2. optimizes target placement more carefully > > 3. reuses logic more > > 4. fixes up various inefficiencies in the logic. The biggest > > case here is the `lzcnt` logic for checking returns which > > saves either a branch or multiple instructions. > > > > The total code size saving is: 306 bytes > > Geometric Mean of all benchmarks New / Old: 0.760 > > > > Regressions: > > There are some regressions. Particularly where the length (user arg > > length) is large but the position of the match char is near the > > beginning of the string (in first VEC). This case has roughly a > > 10-20% regression. > > > > This is because the new logic gives the hot path for immediate matches > > to shorter lengths (the more common input). This case has roughly > > a 15-45% speedup. > > > > Full xcheck passes on x86_64. > > --- > > sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S | 1 + > > sysdeps/x86_64/multiarch/memrchr-avx2.S | 534 ++++++++++---------- > > 2 files changed, 257 insertions(+), 278 deletions(-) > > > > diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > > index cea2d2a72d..5e9beeeef2 100644 > > --- a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > > +++ b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > > @@ -2,6 +2,7 @@ > > # define MEMRCHR __memrchr_avx2_rtm > > #endif > > > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > > > diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2.S b/sysdeps/x86_64/multiarch/memrchr-avx2.S > > index ba2ce7cb03..bea4528068 100644 > > --- a/sysdeps/x86_64/multiarch/memrchr-avx2.S > > +++ b/sysdeps/x86_64/multiarch/memrchr-avx2.S > > @@ -21,340 +21,318 @@ > > # include <sysdep.h> > > > > # ifndef MEMRCHR > > -# define MEMRCHR __memrchr_avx2 > > +# define MEMRCHR __memrchr_avx2 > > # endif > > > > # ifndef VZEROUPPER > > -# define VZEROUPPER vzeroupper > > +# define VZEROUPPER vzeroupper > > # endif > > > > # ifndef SECTION > > # define SECTION(p) p##.avx > > # endif > > > > -# define VEC_SIZE 32 > > +# define VEC_SIZE 32 > > +# define PAGE_SIZE 4096 > > + .section SECTION(.text), "ax", @progbits > > +ENTRY(MEMRCHR) > > +# ifdef __ILP32__ > > + /* Clear upper bits. */ > > + and %RDX_LP, %RDX_LP > > +# else > > + test %RDX_LP, %RDX_LP > > +# endif > > + jz L(zero_0) > > > > - .section SECTION(.text),"ax",@progbits > > -ENTRY (MEMRCHR) > > - /* Broadcast CHAR to YMM0. */ > > vmovd %esi, %xmm0 > > - vpbroadcastb %xmm0, %ymm0 > > - > > - sub $VEC_SIZE, %RDX_LP > > - jbe L(last_vec_or_less) > > - > > - add %RDX_LP, %RDI_LP > > - > > - /* Check the last VEC_SIZE bytes. */ > > - vpcmpeqb (%rdi), %ymm0, %ymm1 > > - vpmovmskb %ymm1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x0) > > + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a > > + correct page cross check and 2) it correctly sets up end ptr to be > > + subtract by lzcnt aligned. 
*/ > > + leaq -1(%rdx, %rdi), %rax > > > > - subq $(VEC_SIZE * 4), %rdi > > - movl %edi, %ecx > > - andl $(VEC_SIZE - 1), %ecx > > - jz L(aligned_more) > > + vpbroadcastb %xmm0, %ymm0 > > > > - /* Align data for aligned loads in the loop. */ > > - addq $VEC_SIZE, %rdi > > - addq $VEC_SIZE, %rdx > > - andq $-VEC_SIZE, %rdi > > - subq %rcx, %rdx > > + /* Check if we can load 1x VEC without cross a page. */ > > + testl $(PAGE_SIZE - VEC_SIZE), %eax > > + jz L(page_cross) > > + > > + vpcmpeqb -(VEC_SIZE - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > + cmpq $VEC_SIZE, %rdx > > + ja L(more_1x_vec) > > + > > +L(ret_vec_x0_test): > > + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which > > + will gurantee edx (len) is less than it. */ > > + lzcntl %ecx, %ecx > > + > > + /* Hoist vzeroupper (not great for RTM) to save code size. This allows > > + all logic for edx (len) <= VEC_SIZE to fit in first cache line. */ > > + COND_VZEROUPPER > > + cmpl %ecx, %edx > > + jle L(zero_0) > > + subq %rcx, %rax > > + ret > > > > - .p2align 4 > > -L(aligned_more): > > - subq $(VEC_SIZE * 4), %rdx > > - jbe L(last_4x_vec_or_less) > > - > > - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time > > - since data is only aligned to VEC_SIZE. */ > > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > > - vpmovmskb %ymm1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3) > > - > > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 > > - vpmovmskb %ymm2, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x2) > > - > > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 > > - vpmovmskb %ymm3, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x1) > > - > > - vpcmpeqb (%rdi), %ymm0, %ymm4 > > - vpmovmskb %ymm4, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x0) > > - > > - /* Align data to 4 * VEC_SIZE for loop with fewer branches. > > - There are some overlaps with above if data isn't aligned > > - to 4 * VEC_SIZE. */ > > - movl %edi, %ecx > > - andl $(VEC_SIZE * 4 - 1), %ecx > > - jz L(loop_4x_vec) > > - > > - addq $(VEC_SIZE * 4), %rdi > > - addq $(VEC_SIZE * 4), %rdx > > - andq $-(VEC_SIZE * 4), %rdi > > - subq %rcx, %rdx > > + /* Fits in aligning bytes of first cache line. */ > > +L(zero_0): > > + xorl %eax, %eax > > + ret > > > > - .p2align 4 > > -L(loop_4x_vec): > > - /* Compare 4 * VEC at a time forward. */ > > - subq $(VEC_SIZE * 4), %rdi > > - subq $(VEC_SIZE * 4), %rdx > > - jbe L(last_4x_vec_or_less) > > - > > - vmovdqa (%rdi), %ymm1 > > - vmovdqa VEC_SIZE(%rdi), %ymm2 > > - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm3 > > - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm4 > > - > > - vpcmpeqb %ymm1, %ymm0, %ymm1 > > - vpcmpeqb %ymm2, %ymm0, %ymm2 > > - vpcmpeqb %ymm3, %ymm0, %ymm3 > > - vpcmpeqb %ymm4, %ymm0, %ymm4 > > - > > - vpor %ymm1, %ymm2, %ymm5 > > - vpor %ymm3, %ymm4, %ymm6 > > - vpor %ymm5, %ymm6, %ymm5 > > - > > - vpmovmskb %ymm5, %eax > > - testl %eax, %eax > > - jz L(loop_4x_vec) > > - > > - /* There is a match. 
*/ > > - vpmovmskb %ymm4, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3) > > - > > - vpmovmskb %ymm3, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x2) > > - > > - vpmovmskb %ymm2, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x1) > > - > > - vpmovmskb %ymm1, %eax > > - bsrl %eax, %eax > > - addq %rdi, %rax > > + .p2align 4,, 9 > > +L(ret_vec_x0): > > + lzcntl %ecx, %ecx > > + subq %rcx, %rax > > L(return_vzeroupper): > > ZERO_UPPER_VEC_REGISTERS_RETURN > > > > - .p2align 4 > > -L(last_4x_vec_or_less): > > - addl $(VEC_SIZE * 4), %edx > > - cmpl $(VEC_SIZE * 2), %edx > > - jbe L(last_2x_vec) > > - > > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > > - vpmovmskb %ymm1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3) > > - > > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 > > - vpmovmskb %ymm2, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x2) > > - > > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 > > - vpmovmskb %ymm3, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x1_check) > > - cmpl $(VEC_SIZE * 3), %edx > > - jbe L(zero) > > - > > - vpcmpeqb (%rdi), %ymm0, %ymm4 > > - vpmovmskb %ymm4, %eax > > - testl %eax, %eax > > - jz L(zero) > > - bsrl %eax, %eax > > - subq $(VEC_SIZE * 4), %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addq %rdi, %rax > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > + .p2align 4,, 10 > > +L(more_1x_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0) > > + > > + /* Align rax (string pointer). */ > > + andq $-VEC_SIZE, %rax > > + > > + /* Recompute remaining length after aligning. */ > > + movq %rax, %rdx > > + /* Need this comparison next no matter what. */ > > + vpcmpeqb -(VEC_SIZE)(%rax), %ymm0, %ymm1 > > + subq %rdi, %rdx > > + decq %rax > > + vpmovmskb %ymm1, %ecx > > + /* Fall through for short (hotter than length). */ > > + cmpq $(VEC_SIZE * 2), %rdx > > + ja L(more_2x_vec) > > L(last_2x_vec): > > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > > - vpmovmskb %ymm1, %eax > > - testl %eax, %eax > > - jnz L(last_vec_x3_check) > > cmpl $VEC_SIZE, %edx > > - jbe L(zero) > > - > > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1 > > - vpmovmskb %ymm1, %eax > > - testl %eax, %eax > > - jz L(zero) > > - bsrl %eax, %eax > > - subq $(VEC_SIZE * 2), %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addl $(VEC_SIZE * 2), %eax > > - addq %rdi, %rax > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(last_vec_x0): > > - bsrl %eax, %eax > > - addq %rdi, %rax > > - VZEROUPPER_RETURN > > + jbe L(ret_vec_x0_test) > > + > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0) > > + > > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > + /* 64-bit lzcnt. This will naturally add 32 to position. */ > > + lzcntq %rcx, %rcx > > + COND_VZEROUPPER > > + cmpl %ecx, %edx > > + jle L(zero_0) > > + subq %rcx, %rax > > + ret > > > > - .p2align 4 > > -L(last_vec_x1): > > - bsrl %eax, %eax > > - addl $VEC_SIZE, %eax > > - addq %rdi, %rax > > - VZEROUPPER_RETURN > > > > - .p2align 4 > > -L(last_vec_x2): > > - bsrl %eax, %eax > > - addl $(VEC_SIZE * 2), %eax > > - addq %rdi, %rax > > + /* Inexpensive place to put this regarding code size / target alignments > > + / ICache NLP. Necessary for 2-byte encoding of jump to page cross > > + case which in turn in necessary for hot path (len <= VEC_SIZE) to fit > is necessary? > > + in first cache line. 
*/ > > +L(page_cross): > > + movq %rax, %rsi > > + andq $-VEC_SIZE, %rsi > > + vpcmpeqb (%rsi), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > + /* Shift out negative alignment (because we are starting from endptr and > > + working backwards). */ > > + movl %eax, %r8d > > + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ > > + notl %r8d > > + shlxl %r8d, %ecx, %ecx > > + cmpq %rdi, %rsi > > + ja L(more_1x_vec) > > + lzcntl %ecx, %ecx > > + COND_VZEROUPPER > > + cmpl %ecx, %edx > > + jle L(zero_0) > > + subq %rcx, %rax > > + ret > > + .p2align 4,, 11 > > +L(ret_vec_x1): > > + /* This will naturally add 32 to position. */ > > + lzcntq %rcx, %rcx > > + subq %rcx, %rax > > VZEROUPPER_RETURN > > + .p2align 4,, 10 > > +L(more_2x_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0) > > > > - .p2align 4 > > -L(last_vec_x3): > > - bsrl %eax, %eax > > - addl $(VEC_SIZE * 3), %eax > > - addq %rdi, %rax > > - ret > > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1) > > > > - .p2align 4 > > -L(last_vec_x1_check): > > - bsrl %eax, %eax > > - subq $(VEC_SIZE * 3), %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addl $VEC_SIZE, %eax > > - addq %rdi, %rax > > - VZEROUPPER_RETURN > > > > - .p2align 4 > > -L(last_vec_x3_check): > > - bsrl %eax, %eax > > - subq $VEC_SIZE, %rdx > > - addq %rax, %rdx > > - jl L(zero) > > - addl $(VEC_SIZE * 3), %eax > > - addq %rdi, %rax > > - VZEROUPPER_RETURN > > + /* Needed no matter what. */ > > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > > > - .p2align 4 > > -L(zero): > > - xorl %eax, %eax > > - VZEROUPPER_RETURN > > + subq $(VEC_SIZE * 4), %rdx > > + ja L(more_4x_vec) > > + > > + cmpl $(VEC_SIZE * -1), %edx > > + jle L(ret_vec_x2_test) > > + > > +L(last_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x2) > > + > > + /* Needed no matter what. */ > > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > + lzcntl %ecx, %ecx > > + subq $(VEC_SIZE * 3), %rax > > + COND_VZEROUPPER > > + subq %rcx, %rax > > + cmpq %rax, %rdi > > + ja L(zero_2) > > + ret > > > > - .p2align 4 > > -L(null): > > + /* First in aligning bytes. */ > > +L(zero_2): > > xorl %eax, %eax > > ret > > > > - .p2align 4 > > -L(last_vec_or_less_aligned): > > - movl %edx, %ecx > > + .p2align 4,, 4 > > +L(ret_vec_x2_test): > > + lzcntl %ecx, %ecx > > + subq $(VEC_SIZE * 2), %rax > > + COND_VZEROUPPER > > + subq %rcx, %rax > > + cmpq %rax, %rdi > > + ja L(zero_2) > > + ret > > > > - vpcmpeqb (%rdi), %ymm0, %ymm1 > > > > - movl $1, %edx > > - /* Support rdx << 32. */ > > - salq %cl, %rdx > > - subq $1, %rdx > > + .p2align 4,, 11 > > +L(ret_vec_x2): > > + /* ecx must be non-zero. */ > > + bsrl %ecx, %ecx > > + leaq (VEC_SIZE * -3 + 1)(%rcx, %rax), %rax > > + VZEROUPPER_RETURN > > > > - vpmovmskb %ymm1, %eax > > + .p2align 4,, 14 > > +L(ret_vec_x3): > > + /* ecx must be non-zero. */ > > + bsrl %ecx, %ecx > > + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax > > + VZEROUPPER_RETURN > > > > - /* Remove the trailing bytes. */ > > - andl %edx, %eax > > - testl %eax, %eax > > - jz L(zero) > > > > - bsrl %eax, %eax > > - addq %rdi, %rax > > - VZEROUPPER_RETURN > > > > .p2align 4 > > -L(last_vec_or_less): > > - addl $VEC_SIZE, %edx > > +L(more_4x_vec): > > + testl %ecx, %ecx > > + jnz L(ret_vec_x2) > > > > - /* Check for zero length. 
*/ > > - testl %edx, %edx > > - jz L(null) > > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > > > - movl %edi, %ecx > > - andl $(VEC_SIZE - 1), %ecx > > - jz L(last_vec_or_less_aligned) > > + testl %ecx, %ecx > > + jnz L(ret_vec_x3) > > > > - movl %ecx, %esi > > - movl %ecx, %r8d > > - addl %edx, %esi > > - andq $-VEC_SIZE, %rdi > > + /* Check if near end before re-aligning (otherwise might do an > > + unnecissary loop iteration). */ > > + addq $-(VEC_SIZE * 4), %rax > > + cmpq $(VEC_SIZE * 4), %rdx > > + jbe L(last_4x_vec) > > > > - subl $VEC_SIZE, %esi > > - ja L(last_vec_2x_aligned) > > + /* Align rax to (VEC_SIZE - 1). */ > > + orq $(VEC_SIZE * 4 - 1), %rax > > + movq %rdi, %rdx > > + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because > > + lengths that overflow can be valid and break the comparison. */ > > + orq $(VEC_SIZE * 4 - 1), %rdx > > > > - /* Check the last VEC. */ > > - vpcmpeqb (%rdi), %ymm0, %ymm1 > > - vpmovmskb %ymm1, %eax > > - > > - /* Remove the leading and trailing bytes. */ > > - sarl %cl, %eax > > - movl %edx, %ecx > > + .p2align 4 > > +L(loop_4x_vec): > > + /* Need this comparison next no matter what. */ > > + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 > > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm2 > > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm3 > > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm4 > > > > - movl $1, %edx > > - sall %cl, %edx > > - subl $1, %edx > > + vpor %ymm1, %ymm2, %ymm2 > > + vpor %ymm3, %ymm4, %ymm4 > > + vpor %ymm2, %ymm4, %ymm4 > > + vpmovmskb %ymm4, %esi > > > > - andl %edx, %eax > > - testl %eax, %eax > > - jz L(zero) > > + testl %esi, %esi > > + jnz L(loop_end) > > > > - bsrl %eax, %eax > > - addq %rdi, %rax > > - addq %r8, %rax > > - VZEROUPPER_RETURN > > + addq $(VEC_SIZE * -4), %rax > > + cmpq %rdx, %rax > > + jne L(loop_4x_vec) > > > > - .p2align 4 > > -L(last_vec_2x_aligned): > > - movl %esi, %ecx > > + subl %edi, %edx > > + incl %edx > > > > - /* Check the last VEC. */ > > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm1 > > +L(last_4x_vec): > > + /* Used no matter what. */ > > + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > > > - movl $1, %edx > > - sall %cl, %edx > > - subl $1, %edx > > + cmpl $(VEC_SIZE * 2), %edx > > + jbe L(last_2x_vec) > > > > - vpmovmskb %ymm1, %eax > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0_end) > > > > - /* Remove the trailing bytes. */ > > - andl %edx, %eax > > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1_end) > > > > - testl %eax, %eax > > - jnz L(last_vec_x1) > > + /* Used no matter what. */ > > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %ecx > > > > - /* Check the second last VEC. */ > > - vpcmpeqb (%rdi), %ymm0, %ymm1 > > + cmpl $(VEC_SIZE * 3), %edx > > + ja L(last_vec) > > + > > + lzcntl %ecx, %ecx > > + subq $(VEC_SIZE * 2), %rax > > + COND_VZEROUPPER > > + subq %rcx, %rax > > + cmpq %rax, %rdi > > + jbe L(ret0) > > + xorl %eax, %eax > > +L(ret0): > > + ret > > > > - movl %r8d, %ecx > > > > - vpmovmskb %ymm1, %eax > > + .p2align 4 > > +L(loop_end): > > + vpmovmskb %ymm1, %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x0_end) > > + > > + vpmovmskb %ymm2, %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1_end) > > + > > + vpmovmskb %ymm3, %ecx > > + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) > > + then it won't affect the result in esi (VEC4). 
If ecx is non-zero > > + then CHAR in VEC3 and bsrq will use that position. */ > > + salq $32, %rcx > > + orq %rsi, %rcx > > + bsrq %rcx, %rcx > > + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax > > + VZEROUPPER_RETURN > > > > - /* Remove the leading bytes. Must use unsigned right shift for > > - bsrl below. */ > > - shrl %cl, %eax > > - testl %eax, %eax > > - jz L(zero) > > + .p2align 4,, 4 > > +L(ret_vec_x1_end): > > + /* 64-bit version will automatically add 32 (VEC_SIZE). */ > > + lzcntq %rcx, %rcx > > + subq %rcx, %rax > > + VZEROUPPER_RETURN > > > > - bsrl %eax, %eax > > - addq %rdi, %rax > > - addq %r8, %rax > > + .p2align 4,, 4 > > +L(ret_vec_x0_end): > > + lzcntl %ecx, %ecx > > + subq %rcx, %rax > > VZEROUPPER_RETURN > > -END (MEMRCHR) > > + > > + /* 2 bytes until next cache line. */ > > +END(MEMRCHR) > > #endif > > -- > > 2.34.1 > > > > OK with the updated comments. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 82+ messages in thread
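The lzcnt trick this rewrite is built around can be stated compactly in C. A minimal sketch follows, assuming an in-bounds 32-byte window; the helper name and the scalar mask loop are illustrative stand-ins for the vpcmpeqb/vpmovmskb pair, not glibc code, and the real routine handles page crossing separately:

#include <stdint.h>
#include <stddef.h>

/* Bit i of the mask corresponds to end[i - 31], so bit 31 is *end
   itself -- the layout vpcmpeqb/vpmovmskb produce for the 32 bytes
   ending at `end'.  */
static uint32_t
match_mask_ending_at (const unsigned char *end, unsigned char c)
{
  uint32_t mask = 0;
  for (int i = 0; i < 32; i++)
    mask |= (uint32_t) (end[i - 31] == c) << i;
  return mask;
}

/* memrchr limited to the 32 bytes ending at `end': lzcnt of the mask
   is the distance of the last match below `end'.  lzcnt of 0 is 32,
   which the assembly exploits by comparing against the remaining
   length (cmpl %ecx, %edx; jle) instead of testing the mask for
   zero.  */
static const unsigned char *
last_match (const unsigned char *end, unsigned char c, uint32_t len)
{
  uint32_t mask = match_mask_ending_at (end, c);
  uint32_t dist = mask ? (uint32_t) __builtin_clz (mask) : 32; /* lzcntl */
  if (len <= dist)
    return NULL;
  return end - dist;                             /* subq %rcx, %rax */
}

This is also why the end pointer is set up as s + len - 1 at entry: subtracting the leading-zero count then lands directly on the match with no further adjustment.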
* Re: [PATCH v6 6/8] x86: Optimize memrchr-avx2.S 2022-07-14 2:26 ` Sunil Pandey @ 2022-07-14 2:43 ` Noah Goldstein 0 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-07-14 2:43 UTC (permalink / raw) To: Sunil Pandey; +Cc: H.J. Lu, GNU C Library On Wed, Jul 13, 2022 at 7:26 PM Sunil Pandey <skpgkp2@gmail.com> wrote: > > On Tue, Jun 7, 2022 at 11:18 AM H.J. Lu via Libc-alpha > <libc-alpha@sourceware.org> wrote: > > > > On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > The new code: > > > 1. prioritizes smaller user-arg lengths more. > > > 2. optimizes target placement more carefully > > > 3. reuses logic more > > > 4. fixes up various inefficiencies in the logic. The biggest > > > case here is the `lzcnt` logic for checking returns which > > > saves either a branch or multiple instructions. > > > > > > The total code size saving is: 306 bytes > > > Geometric Mean of all benchmarks New / Old: 0.760 > > > > > > Regressions: > > > There are some regressions. Particularly where the length (user arg > > > length) is large but the position of the match char is near the > > > beginning of the string (in first VEC). This case has roughly a > > > 10-20% regression. > > > > > > This is because the new logic gives the hot path for immediate matches > > > to shorter lengths (the more common input). This case has roughly > > > a 15-45% speedup. > > > > > > Full xcheck passes on x86_64. > > > --- > > > sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S | 1 + > > > sysdeps/x86_64/multiarch/memrchr-avx2.S | 534 ++++++++++---------- > > > 2 files changed, 257 insertions(+), 278 deletions(-) > > > > > > diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > > > index cea2d2a72d..5e9beeeef2 100644 > > > --- a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > > > +++ b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S > > > @@ -2,6 +2,7 @@ > > > # define MEMRCHR __memrchr_avx2_rtm > > > #endif > > > > > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > > > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > > > > > diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2.S b/sysdeps/x86_64/multiarch/memrchr-avx2.S > > > index ba2ce7cb03..bea4528068 100644 > > > --- a/sysdeps/x86_64/multiarch/memrchr-avx2.S > > > +++ b/sysdeps/x86_64/multiarch/memrchr-avx2.S > > > @@ -21,340 +21,318 @@ > > > # include <sysdep.h> > > > > > > # ifndef MEMRCHR > > > -# define MEMRCHR __memrchr_avx2 > > > +# define MEMRCHR __memrchr_avx2 > > > # endif > > > > > > # ifndef VZEROUPPER > > > -# define VZEROUPPER vzeroupper > > > +# define VZEROUPPER vzeroupper > > > # endif > > > > > > # ifndef SECTION > > > # define SECTION(p) p##.avx > > > # endif > > > > > > -# define VEC_SIZE 32 > > > +# define VEC_SIZE 32 > > > +# define PAGE_SIZE 4096 > > > + .section SECTION(.text), "ax", @progbits > > > +ENTRY(MEMRCHR) > > > +# ifdef __ILP32__ > > > + /* Clear upper bits. */ > > > + and %RDX_LP, %RDX_LP > > > +# else > > > + test %RDX_LP, %RDX_LP > > > +# endif > > > + jz L(zero_0) > > > > > > - .section SECTION(.text),"ax",@progbits > > > -ENTRY (MEMRCHR) > > > - /* Broadcast CHAR to YMM0. */ > > > vmovd %esi, %xmm0 > > > - vpbroadcastb %xmm0, %ymm0 > > > - > > > - sub $VEC_SIZE, %RDX_LP > > > - jbe L(last_vec_or_less) > > > - > > > - add %RDX_LP, %RDI_LP > > > - > > > - /* Check the last VEC_SIZE bytes. 
*/ > > > - vpcmpeqb (%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x0) > > > + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a > > > + correct page cross check and 2) it correctly sets up end ptr to be > > > + subtract by lzcnt aligned. */ > > > + leaq -1(%rdx, %rdi), %rax > > > > > > - subq $(VEC_SIZE * 4), %rdi > > > - movl %edi, %ecx > > > - andl $(VEC_SIZE - 1), %ecx > > > - jz L(aligned_more) > > > + vpbroadcastb %xmm0, %ymm0 > > > > > > - /* Align data for aligned loads in the loop. */ > > > - addq $VEC_SIZE, %rdi > > > - addq $VEC_SIZE, %rdx > > > - andq $-VEC_SIZE, %rdi > > > - subq %rcx, %rdx > > > + /* Check if we can load 1x VEC without cross a page. */ > > > + testl $(PAGE_SIZE - VEC_SIZE), %eax > > > + jz L(page_cross) > > > + > > > + vpcmpeqb -(VEC_SIZE - 1)(%rax), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %ecx > > > + cmpq $VEC_SIZE, %rdx > > > + ja L(more_1x_vec) > > > + > > > +L(ret_vec_x0_test): > > > + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which > > > + will gurantee edx (len) is less than it. */ > > > + lzcntl %ecx, %ecx > > > + > > > + /* Hoist vzeroupper (not great for RTM) to save code size. This allows > > > + all logic for edx (len) <= VEC_SIZE to fit in first cache line. */ > > > + COND_VZEROUPPER > > > + cmpl %ecx, %edx > > > + jle L(zero_0) > > > + subq %rcx, %rax > > > + ret > > > > > > - .p2align 4 > > > -L(aligned_more): > > > - subq $(VEC_SIZE * 4), %rdx > > > - jbe L(last_4x_vec_or_less) > > > - > > > - /* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time > > > - since data is only aligned to VEC_SIZE. */ > > > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x3) > > > - > > > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 > > > - vpmovmskb %ymm2, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x2) > > > - > > > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 > > > - vpmovmskb %ymm3, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x1) > > > - > > > - vpcmpeqb (%rdi), %ymm0, %ymm4 > > > - vpmovmskb %ymm4, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x0) > > > - > > > - /* Align data to 4 * VEC_SIZE for loop with fewer branches. > > > - There are some overlaps with above if data isn't aligned > > > - to 4 * VEC_SIZE. */ > > > - movl %edi, %ecx > > > - andl $(VEC_SIZE * 4 - 1), %ecx > > > - jz L(loop_4x_vec) > > > - > > > - addq $(VEC_SIZE * 4), %rdi > > > - addq $(VEC_SIZE * 4), %rdx > > > - andq $-(VEC_SIZE * 4), %rdi > > > - subq %rcx, %rdx > > > + /* Fits in aligning bytes of first cache line. */ > > > +L(zero_0): > > > + xorl %eax, %eax > > > + ret > > > > > > - .p2align 4 > > > -L(loop_4x_vec): > > > - /* Compare 4 * VEC at a time forward. */ > > > - subq $(VEC_SIZE * 4), %rdi > > > - subq $(VEC_SIZE * 4), %rdx > > > - jbe L(last_4x_vec_or_less) > > > - > > > - vmovdqa (%rdi), %ymm1 > > > - vmovdqa VEC_SIZE(%rdi), %ymm2 > > > - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm3 > > > - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm4 > > > - > > > - vpcmpeqb %ymm1, %ymm0, %ymm1 > > > - vpcmpeqb %ymm2, %ymm0, %ymm2 > > > - vpcmpeqb %ymm3, %ymm0, %ymm3 > > > - vpcmpeqb %ymm4, %ymm0, %ymm4 > > > - > > > - vpor %ymm1, %ymm2, %ymm5 > > > - vpor %ymm3, %ymm4, %ymm6 > > > - vpor %ymm5, %ymm6, %ymm5 > > > - > > > - vpmovmskb %ymm5, %eax > > > - testl %eax, %eax > > > - jz L(loop_4x_vec) > > > - > > > - /* There is a match. 
*/ > > > - vpmovmskb %ymm4, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x3) > > > - > > > - vpmovmskb %ymm3, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x2) > > > - > > > - vpmovmskb %ymm2, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x1) > > > - > > > - vpmovmskb %ymm1, %eax > > > - bsrl %eax, %eax > > > - addq %rdi, %rax > > > + .p2align 4,, 9 > > > +L(ret_vec_x0): > > > + lzcntl %ecx, %ecx > > > + subq %rcx, %rax > > > L(return_vzeroupper): > > > ZERO_UPPER_VEC_REGISTERS_RETURN > > > > > > - .p2align 4 > > > -L(last_4x_vec_or_less): > > > - addl $(VEC_SIZE * 4), %edx > > > - cmpl $(VEC_SIZE * 2), %edx > > > - jbe L(last_2x_vec) > > > - > > > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x3) > > > - > > > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2 > > > - vpmovmskb %ymm2, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x2) > > > - > > > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3 > > > - vpmovmskb %ymm3, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x1_check) > > > - cmpl $(VEC_SIZE * 3), %edx > > > - jbe L(zero) > > > - > > > - vpcmpeqb (%rdi), %ymm0, %ymm4 > > > - vpmovmskb %ymm4, %eax > > > - testl %eax, %eax > > > - jz L(zero) > > > - bsrl %eax, %eax > > > - subq $(VEC_SIZE * 4), %rdx > > > - addq %rax, %rdx > > > - jl L(zero) > > > - addq %rdi, %rax > > > - VZEROUPPER_RETURN > > > - > > > - .p2align 4 > > > + .p2align 4,, 10 > > > +L(more_1x_vec): > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x0) > > > + > > > + /* Align rax (string pointer). */ > > > + andq $-VEC_SIZE, %rax > > > + > > > + /* Recompute remaining length after aligning. */ > > > + movq %rax, %rdx > > > + /* Need this comparison next no matter what. */ > > > + vpcmpeqb -(VEC_SIZE)(%rax), %ymm0, %ymm1 > > > + subq %rdi, %rdx > > > + decq %rax > > > + vpmovmskb %ymm1, %ecx > > > + /* Fall through for short (hotter than length). */ > > > + cmpq $(VEC_SIZE * 2), %rdx > > > + ja L(more_2x_vec) > > > L(last_2x_vec): > > > - vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > - jnz L(last_vec_x3_check) > > > cmpl $VEC_SIZE, %edx > > > - jbe L(zero) > > > - > > > - vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > - jz L(zero) > > > - bsrl %eax, %eax > > > - subq $(VEC_SIZE * 2), %rdx > > > - addq %rax, %rdx > > > - jl L(zero) > > > - addl $(VEC_SIZE * 2), %eax > > > - addq %rdi, %rax > > > - VZEROUPPER_RETURN > > > - > > > - .p2align 4 > > > -L(last_vec_x0): > > > - bsrl %eax, %eax > > > - addq %rdi, %rax > > > - VZEROUPPER_RETURN > > > + jbe L(ret_vec_x0_test) > > > + > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x0) > > > + > > > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %ecx > > > + /* 64-bit lzcnt. This will naturally add 32 to position. */ > > > + lzcntq %rcx, %rcx > > > + COND_VZEROUPPER > > > + cmpl %ecx, %edx > > > + jle L(zero_0) > > > + subq %rcx, %rax > > > + ret > > > > > > - .p2align 4 > > > -L(last_vec_x1): > > > - bsrl %eax, %eax > > > - addl $VEC_SIZE, %eax > > > - addq %rdi, %rax > > > - VZEROUPPER_RETURN > > > > > > - .p2align 4 > > > -L(last_vec_x2): > > > - bsrl %eax, %eax > > > - addl $(VEC_SIZE * 2), %eax > > > - addq %rdi, %rax > > > + /* Inexpensive place to put this regarding code size / target alignments > > > + / ICache NLP. 
Necessary for 2-byte encoding of jump to page cross > > > + case which in turn in necessary for hot path (len <= VEC_SIZE) to fit > > is necessary? > > > + in first cache line. */ > > > +L(page_cross): > > > + movq %rax, %rsi > > > + andq $-VEC_SIZE, %rsi > > > + vpcmpeqb (%rsi), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %ecx > > > + /* Shift out negative alignment (because we are starting from endptr and > > > + working backwards). */ > > > + movl %eax, %r8d > > > + /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ > > > + notl %r8d > > > + shlxl %r8d, %ecx, %ecx > > > + cmpq %rdi, %rsi > > > + ja L(more_1x_vec) > > > + lzcntl %ecx, %ecx > > > + COND_VZEROUPPER > > > + cmpl %ecx, %edx > > > + jle L(zero_0) > > > + subq %rcx, %rax > > > + ret > > > + .p2align 4,, 11 > > > +L(ret_vec_x1): > > > + /* This will naturally add 32 to position. */ > > > + lzcntq %rcx, %rcx > > > + subq %rcx, %rax > > > VZEROUPPER_RETURN > > > + .p2align 4,, 10 > > > +L(more_2x_vec): > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x0) > > > > > > - .p2align 4 > > > -L(last_vec_x3): > > > - bsrl %eax, %eax > > > - addl $(VEC_SIZE * 3), %eax > > > - addq %rdi, %rax > > > - ret > > > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %ecx > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x1) > > > > > > - .p2align 4 > > > -L(last_vec_x1_check): > > > - bsrl %eax, %eax > > > - subq $(VEC_SIZE * 3), %rdx > > > - addq %rax, %rdx > > > - jl L(zero) > > > - addl $VEC_SIZE, %eax > > > - addq %rdi, %rax > > > - VZEROUPPER_RETURN > > > > > > - .p2align 4 > > > -L(last_vec_x3_check): > > > - bsrl %eax, %eax > > > - subq $VEC_SIZE, %rdx > > > - addq %rax, %rdx > > > - jl L(zero) > > > - addl $(VEC_SIZE * 3), %eax > > > - addq %rdi, %rax > > > - VZEROUPPER_RETURN > > > + /* Needed no matter what. */ > > > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %ecx > > > > > > - .p2align 4 > > > -L(zero): > > > - xorl %eax, %eax > > > - VZEROUPPER_RETURN > > > + subq $(VEC_SIZE * 4), %rdx > > > + ja L(more_4x_vec) > > > + > > > + cmpl $(VEC_SIZE * -1), %edx > > > + jle L(ret_vec_x2_test) > > > + > > > +L(last_vec): > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x2) > > > + > > > + /* Needed no matter what. */ > > > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %ecx > > > + lzcntl %ecx, %ecx > > > + subq $(VEC_SIZE * 3), %rax > > > + COND_VZEROUPPER > > > + subq %rcx, %rax > > > + cmpq %rax, %rdi > > > + ja L(zero_2) > > > + ret > > > > > > - .p2align 4 > > > -L(null): > > > + /* First in aligning bytes. */ > > > +L(zero_2): > > > xorl %eax, %eax > > > ret > > > > > > - .p2align 4 > > > -L(last_vec_or_less_aligned): > > > - movl %edx, %ecx > > > + .p2align 4,, 4 > > > +L(ret_vec_x2_test): > > > + lzcntl %ecx, %ecx > > > + subq $(VEC_SIZE * 2), %rax > > > + COND_VZEROUPPER > > > + subq %rcx, %rax > > > + cmpq %rax, %rdi > > > + ja L(zero_2) > > > + ret > > > > > > - vpcmpeqb (%rdi), %ymm0, %ymm1 > > > > > > - movl $1, %edx > > > - /* Support rdx << 32. */ > > > - salq %cl, %rdx > > > - subq $1, %rdx > > > + .p2align 4,, 11 > > > +L(ret_vec_x2): > > > + /* ecx must be non-zero. */ > > > + bsrl %ecx, %ecx > > > + leaq (VEC_SIZE * -3 + 1)(%rcx, %rax), %rax > > > + VZEROUPPER_RETURN > > > > > > - vpmovmskb %ymm1, %eax > > > + .p2align 4,, 14 > > > +L(ret_vec_x3): > > > + /* ecx must be non-zero. */ > > > + bsrl %ecx, %ecx > > > + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax > > > + VZEROUPPER_RETURN > > > > > > - /* Remove the trailing bytes. 
*/ > > > - andl %edx, %eax > > > - testl %eax, %eax > > > - jz L(zero) > > > > > > - bsrl %eax, %eax > > > - addq %rdi, %rax > > > - VZEROUPPER_RETURN > > > > > > .p2align 4 > > > -L(last_vec_or_less): > > > - addl $VEC_SIZE, %edx > > > +L(more_4x_vec): > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x2) > > > > > > - /* Check for zero length. */ > > > - testl %edx, %edx > > > - jz L(null) > > > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %ecx > > > > > > - movl %edi, %ecx > > > - andl $(VEC_SIZE - 1), %ecx > > > - jz L(last_vec_or_less_aligned) > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x3) > > > > > > - movl %ecx, %esi > > > - movl %ecx, %r8d > > > - addl %edx, %esi > > > - andq $-VEC_SIZE, %rdi > > > + /* Check if near end before re-aligning (otherwise might do an > > > + unnecissary loop iteration). */ > > > + addq $-(VEC_SIZE * 4), %rax > > > + cmpq $(VEC_SIZE * 4), %rdx > > > + jbe L(last_4x_vec) > > > > > > - subl $VEC_SIZE, %esi > > > - ja L(last_vec_2x_aligned) > > > + /* Align rax to (VEC_SIZE - 1). */ > > > + orq $(VEC_SIZE * 4 - 1), %rax > > > + movq %rdi, %rdx > > > + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because > > > + lengths that overflow can be valid and break the comparison. */ > > > + orq $(VEC_SIZE * 4 - 1), %rdx > > > > > > - /* Check the last VEC. */ > > > - vpcmpeqb (%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - > > > - /* Remove the leading and trailing bytes. */ > > > - sarl %cl, %eax > > > - movl %edx, %ecx > > > + .p2align 4 > > > +L(loop_4x_vec): > > > + /* Need this comparison next no matter what. */ > > > + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 > > > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm2 > > > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm3 > > > + vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm4 > > > > > > - movl $1, %edx > > > - sall %cl, %edx > > > - subl $1, %edx > > > + vpor %ymm1, %ymm2, %ymm2 > > > + vpor %ymm3, %ymm4, %ymm4 > > > + vpor %ymm2, %ymm4, %ymm4 > > > + vpmovmskb %ymm4, %esi > > > > > > - andl %edx, %eax > > > - testl %eax, %eax > > > - jz L(zero) > > > + testl %esi, %esi > > > + jnz L(loop_end) > > > > > > - bsrl %eax, %eax > > > - addq %rdi, %rax > > > - addq %r8, %rax > > > - VZEROUPPER_RETURN > > > + addq $(VEC_SIZE * -4), %rax > > > + cmpq %rdx, %rax > > > + jne L(loop_4x_vec) > > > > > > - .p2align 4 > > > -L(last_vec_2x_aligned): > > > - movl %esi, %ecx > > > + subl %edi, %edx > > > + incl %edx > > > > > > - /* Check the last VEC. */ > > > - vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm1 > > > +L(last_4x_vec): > > > + /* Used no matter what. */ > > > + vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %ecx > > > > > > - movl $1, %edx > > > - sall %cl, %edx > > > - subl $1, %edx > > > + cmpl $(VEC_SIZE * 2), %edx > > > + jbe L(last_2x_vec) > > > > > > - vpmovmskb %ymm1, %eax > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x0_end) > > > > > > - /* Remove the trailing bytes. */ > > > - andl %edx, %eax > > > + vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %ecx > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x1_end) > > > > > > - testl %eax, %eax > > > - jnz L(last_vec_x1) > > > + /* Used no matter what. */ > > > + vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %ecx > > > > > > - /* Check the second last VEC. 
*/ > > > - vpcmpeqb (%rdi), %ymm0, %ymm1 > > > + cmpl $(VEC_SIZE * 3), %edx > > > + ja L(last_vec) > > > + > > > + lzcntl %ecx, %ecx > > > + subq $(VEC_SIZE * 2), %rax > > > + COND_VZEROUPPER > > > + subq %rcx, %rax > > > + cmpq %rax, %rdi > > > + jbe L(ret0) > > > + xorl %eax, %eax > > > +L(ret0): > > > + ret > > > > > > - movl %r8d, %ecx > > > > > > - vpmovmskb %ymm1, %eax > > > + .p2align 4 > > > +L(loop_end): > > > + vpmovmskb %ymm1, %ecx > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x0_end) > > > + > > > + vpmovmskb %ymm2, %ecx > > > + testl %ecx, %ecx > > > + jnz L(ret_vec_x1_end) > > > + > > > + vpmovmskb %ymm3, %ecx > > > + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) > > > + then it won't affect the result in esi (VEC4). If ecx is non-zero > > > + then CHAR in VEC3 and bsrq will use that position. */ > > > + salq $32, %rcx > > > + orq %rsi, %rcx > > > + bsrq %rcx, %rcx > > > + leaq (VEC_SIZE * -4 + 1)(%rcx, %rax), %rax > > > + VZEROUPPER_RETURN > > > > > > - /* Remove the leading bytes. Must use unsigned right shift for > > > - bsrl below. */ > > > - shrl %cl, %eax > > > - testl %eax, %eax > > > - jz L(zero) > > > + .p2align 4,, 4 > > > +L(ret_vec_x1_end): > > > + /* 64-bit version will automatically add 32 (VEC_SIZE). */ > > > + lzcntq %rcx, %rcx > > > + subq %rcx, %rax > > > + VZEROUPPER_RETURN > > > > > > - bsrl %eax, %eax > > > - addq %rdi, %rax > > > - addq %r8, %rax > > > + .p2align 4,, 4 > > > +L(ret_vec_x0_end): > > > + lzcntl %ecx, %ecx > > > + subq %rcx, %rax > > > VZEROUPPER_RETURN > > > -END (MEMRCHR) > > > + > > > + /* 2 bytes until next cache line. */ > > > +END(MEMRCHR) > > > #endif > > > -- > > > 2.34.1 > > > > > > > OK with the updated comments. > > > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > > > Thanks. > > > > -- > > H.J. > > I would like to backport this patch to release branches. > Any comments or objections? Probably should also do: https://sourceware.org/git/?p=glibc.git;a=commit;h=227afaa67213efcdce6a870ef5086200f1076438 > > --Sunil ^ permalink raw reply [flat|nested] 82+ messages in thread
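The L(loop_end) merge quoted in the exchange above is the one place the patch widens to 64 bits, and it reads more easily in C. Sketch only; mask4 and mask3 stand for the vpmovmskb results for the lower and upper 32-byte chunks of a 64-byte window starting at `base':

#include <stdint.h>
#include <stddef.h>

/* mask4 covers base[0..31] and mask3 covers base[32..63].  Shifting
   mask3 into the high half lets a single bsr pick the last match over
   all 64 bytes; if mask3 is zero, the or leaves the mask4 result
   untouched, exactly as the comment in the patch says.  */
static const unsigned char *
last_match_2x32 (const unsigned char *base, uint32_t mask4, uint32_t mask3)
{
  uint64_t combined = ((uint64_t) mask3 << 32) | mask4; /* salq $32; orq */
  if (combined == 0)
    return NULL;
  return base + (63 - __builtin_clzll (combined));      /* bsrq */
}

In the assembly, base is rax - (VEC_SIZE * 4 - 1), which is why the final lea uses the displacement (VEC_SIZE * -4 + 1).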
* [PATCH v6 7/8] x86: Shrink code size of memchr-avx2.S 2022-06-07 4:11 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (4 preceding siblings ...) 2022-06-07 4:11 ` [PATCH v6 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein @ 2022-06-07 4:11 ` Noah Goldstein 2022-06-07 18:18 ` H.J. Lu 2022-06-07 4:11 ` [PATCH v6 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 2022-06-07 18:04 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library H.J. Lu 7 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:11 UTC (permalink / raw) To: libc-alpha This is not meant as a performance optimization. The previous code was far to liberal in aligning targets and wasted code size unnecissarily. The total code size saving is: 59 bytes There are no major changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 0.967 Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memchr-avx2-rtm.S | 1 + sysdeps/x86_64/multiarch/memchr-avx2.S | 109 +++++++++++---------- 2 files changed, 60 insertions(+), 50 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S index 87b076c7c4..c4d71938c5 100644 --- a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S +++ b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S @@ -2,6 +2,7 @@ # define MEMCHR __memchr_avx2_rtm #endif +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/multiarch/memchr-avx2.S b/sysdeps/x86_64/multiarch/memchr-avx2.S index 75bd7262e0..28a01280ec 100644 --- a/sysdeps/x86_64/multiarch/memchr-avx2.S +++ b/sysdeps/x86_64/multiarch/memchr-avx2.S @@ -57,7 +57,7 @@ # define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) .section SECTION(.text),"ax",@progbits -ENTRY (MEMCHR) +ENTRY_P2ALIGN (MEMCHR, 5) # ifndef USE_AS_RAWMEMCHR /* Check for zero length. */ # ifdef __ILP32__ @@ -87,12 +87,14 @@ ENTRY (MEMCHR) # endif testl %eax, %eax jz L(aligned_more) - tzcntl %eax, %eax + bsfl %eax, %eax addq %rdi, %rax - VZEROUPPER_RETURN +L(return_vzeroupper): + ZERO_UPPER_VEC_REGISTERS_RETURN + # ifndef USE_AS_RAWMEMCHR - .p2align 5 + .p2align 4 L(first_vec_x0): /* Check if first match was before length. */ tzcntl %eax, %eax @@ -100,58 +102,31 @@ L(first_vec_x0): /* NB: Multiply length by 4 to get byte count. */ sall $2, %edx # endif - xorl %ecx, %ecx + COND_VZEROUPPER + /* Use branch instead of cmovcc so L(first_vec_x0) fits in one fetch + block. branch here as opposed to cmovcc is not that costly. Common + usage of memchr is to check if the return was NULL (if string was + known to contain CHAR user would use rawmemchr). This branch will be + highly correlated with the user branch and can be used by most + modern branch predictors to predict the user branch. */ cmpl %eax, %edx - leaq (%rdi, %rax), %rax - cmovle %rcx, %rax - VZEROUPPER_RETURN - -L(null): - xorl %eax, %eax - ret -# endif - .p2align 4 -L(cross_page_boundary): - /* Save pointer before aligning as its original value is - necessary for computer return address if byte is found or - adjusting length if it is not and this is memchr. */ - movq %rdi, %rcx - /* Align data to VEC_SIZE - 1. ALGN_PTR_REG is rcx for memchr - and rdi for rawmemchr. */ - orq $(VEC_SIZE - 1), %ALGN_PTR_REG - VPCMPEQ -(VEC_SIZE - 1)(%ALGN_PTR_REG), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax -# ifndef USE_AS_RAWMEMCHR - /* Calculate length until end of page (length checked for a - match). 
*/ - leaq 1(%ALGN_PTR_REG), %rsi - subq %RRAW_PTR_REG, %rsi -# ifdef USE_AS_WMEMCHR - /* NB: Divide bytes by 4 to get wchar_t count. */ - shrl $2, %esi -# endif -# endif - /* Remove the leading bytes. */ - sarxl %ERAW_PTR_REG, %eax, %eax -# ifndef USE_AS_RAWMEMCHR - /* Check the end of data. */ - cmpq %rsi, %rdx - jbe L(first_vec_x0) + jle L(null) + addq %rdi, %rax + ret # endif - testl %eax, %eax - jz L(cross_page_continue) - tzcntl %eax, %eax - addq %RRAW_PTR_REG, %rax -L(return_vzeroupper): - ZERO_UPPER_VEC_REGISTERS_RETURN - .p2align 4 + .p2align 4,, 10 L(first_vec_x1): - tzcntl %eax, %eax + bsfl %eax, %eax incq %rdi addq %rdi, %rax VZEROUPPER_RETURN - +# ifndef USE_AS_RAWMEMCHR + /* First in aligning bytes here. */ +L(null): + xorl %eax, %eax + ret +# endif .p2align 4 L(first_vec_x2): tzcntl %eax, %eax @@ -340,7 +315,7 @@ L(first_vec_x1_check): incq %rdi addq %rdi, %rax VZEROUPPER_RETURN - .p2align 4 + .p2align 4,, 6 L(set_zero_end): xorl %eax, %eax VZEROUPPER_RETURN @@ -428,5 +403,39 @@ L(last_vec_x3): VZEROUPPER_RETURN # endif + .p2align 4 +L(cross_page_boundary): + /* Save pointer before aligning as its original value is necessary for + computer return address if byte is found or adjusting length if it + is not and this is memchr. */ + movq %rdi, %rcx + /* Align data to VEC_SIZE. ALGN_PTR_REG is rcx for memchr and rdi for + rawmemchr. */ + andq $-VEC_SIZE, %ALGN_PTR_REG + VPCMPEQ (%ALGN_PTR_REG), %ymm0, %ymm1 + vpmovmskb %ymm1, %eax +# ifndef USE_AS_RAWMEMCHR + /* Calculate length until end of page (length checked for a match). */ + leal VEC_SIZE(%ALGN_PTR_REG), %esi + subl %ERAW_PTR_REG, %esi +# ifdef USE_AS_WMEMCHR + /* NB: Divide bytes by 4 to get wchar_t count. */ + shrl $2, %esi +# endif +# endif + /* Remove the leading bytes. */ + sarxl %ERAW_PTR_REG, %eax, %eax +# ifndef USE_AS_RAWMEMCHR + /* Check the end of data. */ + cmpq %rsi, %rdx + jbe L(first_vec_x0) +# endif + testl %eax, %eax + jz L(cross_page_continue) + bsfl %eax, %eax + addq %RRAW_PTR_REG, %rax + VZEROUPPER_RETURN + + END (MEMCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
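The rewritten L(cross_page_boundary) is the main structural move in this patch and is easy to mis-read in diff form. A C sketch of the idea, with mask32_at standing in (as an assumption of this sketch) for the VPCMPEQ/vpmovmskb pair; a logical shift is used where the assembly uses sarxl, which behaves the same for every bit the caller consumes:

#include <stdint.h>
#include <stddef.h>

static const unsigned char *
first_vec_page_safe (const unsigned char *p, unsigned char c,
                     uint32_t (*mask32_at) (const unsigned char *,
                                            unsigned char))
{
  /* One aligned 32-byte load can never fault past the page, so load
     from the aligned base (andq $-VEC_SIZE)...  */
  const unsigned char *base
    = (const unsigned char *) ((uintptr_t) p & ~(uintptr_t) 31);
  uint32_t mask = mask32_at (base, c);
  /* ... then discard the bits for bytes below p (sarxl).  */
  mask >>= (uintptr_t) p & 31;
  if (mask == 0)
    return NULL;        /* the real code jumps to L(cross_page_continue) */
  return p + __builtin_ctz (mask);                 /* bsfl */
}

The length handling (leal VEC_SIZE(...); subl) is omitted here; it computes how many of the surviving bytes actually lie within the caller's buffer before the mask is trusted.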
* Re: [PATCH v6 7/8] x86: Shrink code size of memchr-avx2.S 2022-06-07 4:11 ` [PATCH v6 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein @ 2022-06-07 18:18 ` H.J. Lu 2022-07-14 2:31 ` Sunil Pandey 0 siblings, 1 reply; 82+ messages in thread From: H.J. Lu @ 2022-06-07 18:18 UTC (permalink / raw) To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > This is not meant as a performance optimization. The previous code was > far to liberal in aligning targets and wasted code size unnecissarily. > > The total code size saving is: 59 bytes > > There are no major changes in the benchmarks. > Geometric Mean of all benchmarks New / Old: 0.967 > > Full xcheck passes on x86_64. > --- > sysdeps/x86_64/multiarch/memchr-avx2-rtm.S | 1 + > sysdeps/x86_64/multiarch/memchr-avx2.S | 109 +++++++++++---------- > 2 files changed, 60 insertions(+), 50 deletions(-) > > diff --git a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S > index 87b076c7c4..c4d71938c5 100644 > --- a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S > +++ b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S > @@ -2,6 +2,7 @@ > # define MEMCHR __memchr_avx2_rtm > #endif > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > diff --git a/sysdeps/x86_64/multiarch/memchr-avx2.S b/sysdeps/x86_64/multiarch/memchr-avx2.S > index 75bd7262e0..28a01280ec 100644 > --- a/sysdeps/x86_64/multiarch/memchr-avx2.S > +++ b/sysdeps/x86_64/multiarch/memchr-avx2.S > @@ -57,7 +57,7 @@ > # define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) > > .section SECTION(.text),"ax",@progbits > -ENTRY (MEMCHR) > +ENTRY_P2ALIGN (MEMCHR, 5) > # ifndef USE_AS_RAWMEMCHR > /* Check for zero length. */ > # ifdef __ILP32__ > @@ -87,12 +87,14 @@ ENTRY (MEMCHR) > # endif > testl %eax, %eax > jz L(aligned_more) > - tzcntl %eax, %eax > + bsfl %eax, %eax > addq %rdi, %rax > - VZEROUPPER_RETURN > +L(return_vzeroupper): > + ZERO_UPPER_VEC_REGISTERS_RETURN > + > > # ifndef USE_AS_RAWMEMCHR > - .p2align 5 > + .p2align 4 > L(first_vec_x0): > /* Check if first match was before length. */ > tzcntl %eax, %eax > @@ -100,58 +102,31 @@ L(first_vec_x0): > /* NB: Multiply length by 4 to get byte count. */ > sall $2, %edx > # endif > - xorl %ecx, %ecx > + COND_VZEROUPPER > + /* Use branch instead of cmovcc so L(first_vec_x0) fits in one fetch > + block. branch here as opposed to cmovcc is not that costly. Common > + usage of memchr is to check if the return was NULL (if string was > + known to contain CHAR user would use rawmemchr). This branch will be > + highly correlated with the user branch and can be used by most > + modern branch predictors to predict the user branch. */ > cmpl %eax, %edx > - leaq (%rdi, %rax), %rax > - cmovle %rcx, %rax > - VZEROUPPER_RETURN > - > -L(null): > - xorl %eax, %eax > - ret > -# endif > - .p2align 4 > -L(cross_page_boundary): > - /* Save pointer before aligning as its original value is > - necessary for computer return address if byte is found or > - adjusting length if it is not and this is memchr. */ > - movq %rdi, %rcx > - /* Align data to VEC_SIZE - 1. ALGN_PTR_REG is rcx for memchr > - and rdi for rawmemchr. */ > - orq $(VEC_SIZE - 1), %ALGN_PTR_REG > - VPCMPEQ -(VEC_SIZE - 1)(%ALGN_PTR_REG), %ymm0, %ymm1 > - vpmovmskb %ymm1, %eax > -# ifndef USE_AS_RAWMEMCHR > - /* Calculate length until end of page (length checked for a > - match). 
*/ > - leaq 1(%ALGN_PTR_REG), %rsi > - subq %RRAW_PTR_REG, %rsi > -# ifdef USE_AS_WMEMCHR > - /* NB: Divide bytes by 4 to get wchar_t count. */ > - shrl $2, %esi > -# endif > -# endif > - /* Remove the leading bytes. */ > - sarxl %ERAW_PTR_REG, %eax, %eax > -# ifndef USE_AS_RAWMEMCHR > - /* Check the end of data. */ > - cmpq %rsi, %rdx > - jbe L(first_vec_x0) > + jle L(null) > + addq %rdi, %rax > + ret > # endif > - testl %eax, %eax > - jz L(cross_page_continue) > - tzcntl %eax, %eax > - addq %RRAW_PTR_REG, %rax > -L(return_vzeroupper): > - ZERO_UPPER_VEC_REGISTERS_RETURN > > - .p2align 4 > + .p2align 4,, 10 > L(first_vec_x1): > - tzcntl %eax, %eax > + bsfl %eax, %eax > incq %rdi > addq %rdi, %rax > VZEROUPPER_RETURN > - > +# ifndef USE_AS_RAWMEMCHR > + /* First in aligning bytes here. */ > +L(null): > + xorl %eax, %eax > + ret > +# endif > .p2align 4 > L(first_vec_x2): > tzcntl %eax, %eax > @@ -340,7 +315,7 @@ L(first_vec_x1_check): > incq %rdi > addq %rdi, %rax > VZEROUPPER_RETURN > - .p2align 4 > + .p2align 4,, 6 > L(set_zero_end): > xorl %eax, %eax > VZEROUPPER_RETURN > @@ -428,5 +403,39 @@ L(last_vec_x3): > VZEROUPPER_RETURN > # endif > > + .p2align 4 > +L(cross_page_boundary): > + /* Save pointer before aligning as its original value is necessary for > + computer return address if byte is found or adjusting length if it > + is not and this is memchr. */ > + movq %rdi, %rcx > + /* Align data to VEC_SIZE. ALGN_PTR_REG is rcx for memchr and rdi for > + rawmemchr. */ > + andq $-VEC_SIZE, %ALGN_PTR_REG > + VPCMPEQ (%ALGN_PTR_REG), %ymm0, %ymm1 > + vpmovmskb %ymm1, %eax > +# ifndef USE_AS_RAWMEMCHR > + /* Calculate length until end of page (length checked for a match). */ > + leal VEC_SIZE(%ALGN_PTR_REG), %esi > + subl %ERAW_PTR_REG, %esi > +# ifdef USE_AS_WMEMCHR > + /* NB: Divide bytes by 4 to get wchar_t count. */ > + shrl $2, %esi > +# endif > +# endif > + /* Remove the leading bytes. */ > + sarxl %ERAW_PTR_REG, %eax, %eax > +# ifndef USE_AS_RAWMEMCHR > + /* Check the end of data. */ > + cmpq %rsi, %rdx > + jbe L(first_vec_x0) > +# endif > + testl %eax, %eax > + jz L(cross_page_continue) > + bsfl %eax, %eax > + addq %RRAW_PTR_REG, %rax > + VZEROUPPER_RETURN > + > + > END (MEMCHR) > #endif > -- > 2.34.1 > LGTM. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Thanks. -- H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
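One remark on the hunk H.J. approves above: the cmovcc-to-branch change in L(first_vec_x0) does not alter the result, only how it is selected. In C terms both the old and new code implement the function below; the comments show the two instruction shapes. Sketch only -- a compiler is free to pick either form:

#include <stdint.h>
#include <stddef.h>

static const unsigned char *
first_match_in_bounds (const unsigned char *p, uint32_t idx, uint32_t len)
{
  /* Old shape: xorl %ecx,%ecx; cmpl %eax,%edx; leaq (%rdi,%rax),%rax;
     cmovle %rcx,%rax -- branch-free, but the return value always
     carries a data dependency on the compare.
     New shape: cmpl %eax,%edx; jle L(null); addq %rdi,%rax -- a branch
     that tends to be correlated with the caller's own NULL test, as
     the comment in the patch argues.  */
  if ((int) idx >= (int) len)
    return NULL;
  return p + idx;
}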
* Re: [PATCH v6 7/8] x86: Shrink code size of memchr-avx2.S 2022-06-07 18:18 ` H.J. Lu @ 2022-07-14 2:31 ` Sunil Pandey 2022-07-14 2:41 ` Noah Goldstein 0 siblings, 1 reply; 82+ messages in thread From: Sunil Pandey @ 2022-07-14 2:31 UTC (permalink / raw) To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library On Tue, Jun 7, 2022 at 11:19 AM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > This is not meant as a performance optimization. The previous code was > > far to liberal in aligning targets and wasted code size unnecissarily. > > > > The total code size saving is: 59 bytes > > > > There are no major changes in the benchmarks. > > Geometric Mean of all benchmarks New / Old: 0.967 > > > > Full xcheck passes on x86_64. > > --- > > sysdeps/x86_64/multiarch/memchr-avx2-rtm.S | 1 + > > sysdeps/x86_64/multiarch/memchr-avx2.S | 109 +++++++++++---------- > > 2 files changed, 60 insertions(+), 50 deletions(-) > > > > diff --git a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S > > index 87b076c7c4..c4d71938c5 100644 > > --- a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S > > +++ b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S > > @@ -2,6 +2,7 @@ > > # define MEMCHR __memchr_avx2_rtm > > #endif > > > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > > > diff --git a/sysdeps/x86_64/multiarch/memchr-avx2.S b/sysdeps/x86_64/multiarch/memchr-avx2.S > > index 75bd7262e0..28a01280ec 100644 > > --- a/sysdeps/x86_64/multiarch/memchr-avx2.S > > +++ b/sysdeps/x86_64/multiarch/memchr-avx2.S > > @@ -57,7 +57,7 @@ > > # define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) > > > > .section SECTION(.text),"ax",@progbits > > -ENTRY (MEMCHR) > > +ENTRY_P2ALIGN (MEMCHR, 5) > > # ifndef USE_AS_RAWMEMCHR > > /* Check for zero length. */ > > # ifdef __ILP32__ > > @@ -87,12 +87,14 @@ ENTRY (MEMCHR) > > # endif > > testl %eax, %eax > > jz L(aligned_more) > > - tzcntl %eax, %eax > > + bsfl %eax, %eax > > addq %rdi, %rax > > - VZEROUPPER_RETURN > > +L(return_vzeroupper): > > + ZERO_UPPER_VEC_REGISTERS_RETURN > > + > > > > # ifndef USE_AS_RAWMEMCHR > > - .p2align 5 > > + .p2align 4 > > L(first_vec_x0): > > /* Check if first match was before length. */ > > tzcntl %eax, %eax > > @@ -100,58 +102,31 @@ L(first_vec_x0): > > /* NB: Multiply length by 4 to get byte count. */ > > sall $2, %edx > > # endif > > - xorl %ecx, %ecx > > + COND_VZEROUPPER > > + /* Use branch instead of cmovcc so L(first_vec_x0) fits in one fetch > > + block. branch here as opposed to cmovcc is not that costly. Common > > + usage of memchr is to check if the return was NULL (if string was > > + known to contain CHAR user would use rawmemchr). This branch will be > > + highly correlated with the user branch and can be used by most > > + modern branch predictors to predict the user branch. */ > > cmpl %eax, %edx > > - leaq (%rdi, %rax), %rax > > - cmovle %rcx, %rax > > - VZEROUPPER_RETURN > > - > > -L(null): > > - xorl %eax, %eax > > - ret > > -# endif > > - .p2align 4 > > -L(cross_page_boundary): > > - /* Save pointer before aligning as its original value is > > - necessary for computer return address if byte is found or > > - adjusting length if it is not and this is memchr. */ > > - movq %rdi, %rcx > > - /* Align data to VEC_SIZE - 1. ALGN_PTR_REG is rcx for memchr > > - and rdi for rawmemchr. 
*/ > > - orq $(VEC_SIZE - 1), %ALGN_PTR_REG > > - VPCMPEQ -(VEC_SIZE - 1)(%ALGN_PTR_REG), %ymm0, %ymm1 > > - vpmovmskb %ymm1, %eax > > -# ifndef USE_AS_RAWMEMCHR > > - /* Calculate length until end of page (length checked for a > > - match). */ > > - leaq 1(%ALGN_PTR_REG), %rsi > > - subq %RRAW_PTR_REG, %rsi > > -# ifdef USE_AS_WMEMCHR > > - /* NB: Divide bytes by 4 to get wchar_t count. */ > > - shrl $2, %esi > > -# endif > > -# endif > > - /* Remove the leading bytes. */ > > - sarxl %ERAW_PTR_REG, %eax, %eax > > -# ifndef USE_AS_RAWMEMCHR > > - /* Check the end of data. */ > > - cmpq %rsi, %rdx > > - jbe L(first_vec_x0) > > + jle L(null) > > + addq %rdi, %rax > > + ret > > # endif > > - testl %eax, %eax > > - jz L(cross_page_continue) > > - tzcntl %eax, %eax > > - addq %RRAW_PTR_REG, %rax > > -L(return_vzeroupper): > > - ZERO_UPPER_VEC_REGISTERS_RETURN > > > > - .p2align 4 > > + .p2align 4,, 10 > > L(first_vec_x1): > > - tzcntl %eax, %eax > > + bsfl %eax, %eax > > incq %rdi > > addq %rdi, %rax > > VZEROUPPER_RETURN > > - > > +# ifndef USE_AS_RAWMEMCHR > > + /* First in aligning bytes here. */ > > +L(null): > > + xorl %eax, %eax > > + ret > > +# endif > > .p2align 4 > > L(first_vec_x2): > > tzcntl %eax, %eax > > @@ -340,7 +315,7 @@ L(first_vec_x1_check): > > incq %rdi > > addq %rdi, %rax > > VZEROUPPER_RETURN > > - .p2align 4 > > + .p2align 4,, 6 > > L(set_zero_end): > > xorl %eax, %eax > > VZEROUPPER_RETURN > > @@ -428,5 +403,39 @@ L(last_vec_x3): > > VZEROUPPER_RETURN > > # endif > > > > + .p2align 4 > > +L(cross_page_boundary): > > + /* Save pointer before aligning as its original value is necessary for > > + computer return address if byte is found or adjusting length if it > > + is not and this is memchr. */ > > + movq %rdi, %rcx > > + /* Align data to VEC_SIZE. ALGN_PTR_REG is rcx for memchr and rdi for > > + rawmemchr. */ > > + andq $-VEC_SIZE, %ALGN_PTR_REG > > + VPCMPEQ (%ALGN_PTR_REG), %ymm0, %ymm1 > > + vpmovmskb %ymm1, %eax > > +# ifndef USE_AS_RAWMEMCHR > > + /* Calculate length until end of page (length checked for a match). */ > > + leal VEC_SIZE(%ALGN_PTR_REG), %esi > > + subl %ERAW_PTR_REG, %esi > > +# ifdef USE_AS_WMEMCHR > > + /* NB: Divide bytes by 4 to get wchar_t count. */ > > + shrl $2, %esi > > +# endif > > +# endif > > + /* Remove the leading bytes. */ > > + sarxl %ERAW_PTR_REG, %eax, %eax > > +# ifndef USE_AS_RAWMEMCHR > > + /* Check the end of data. */ > > + cmpq %rsi, %rdx > > + jbe L(first_vec_x0) > > +# endif > > + testl %eax, %eax > > + jz L(cross_page_continue) > > + bsfl %eax, %eax > > + addq %RRAW_PTR_REG, %rax > > + VZEROUPPER_RETURN > > + > > + > > END (MEMCHR) > > #endif > > -- > > 2.34.1 > > > > LGTM. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v6 7/8] x86: Shrink code size of memchr-avx2.S 2022-07-14 2:31 ` Sunil Pandey @ 2022-07-14 2:41 ` Noah Goldstein 0 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-07-14 2:41 UTC (permalink / raw) To: Sunil Pandey; +Cc: H.J. Lu, GNU C Library On Wed, Jul 13, 2022 at 7:32 PM Sunil Pandey <skpgkp2@gmail.com> wrote: > > On Tue, Jun 7, 2022 at 11:19 AM H.J. Lu via Libc-alpha > <libc-alpha@sourceware.org> wrote: > > > > On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > This is not meant as a performance optimization. The previous code was > > > far to liberal in aligning targets and wasted code size unnecissarily. > > > > > > The total code size saving is: 59 bytes > > > > > > There are no major changes in the benchmarks. > > > Geometric Mean of all benchmarks New / Old: 0.967 > > > > > > Full xcheck passes on x86_64. > > > --- > > > sysdeps/x86_64/multiarch/memchr-avx2-rtm.S | 1 + > > > sysdeps/x86_64/multiarch/memchr-avx2.S | 109 +++++++++++---------- > > > 2 files changed, 60 insertions(+), 50 deletions(-) > > > > > > diff --git a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S > > > index 87b076c7c4..c4d71938c5 100644 > > > --- a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S > > > +++ b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S > > > @@ -2,6 +2,7 @@ > > > # define MEMCHR __memchr_avx2_rtm > > > #endif > > > > > > +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST > > > #define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > > ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > > > > > diff --git a/sysdeps/x86_64/multiarch/memchr-avx2.S b/sysdeps/x86_64/multiarch/memchr-avx2.S > > > index 75bd7262e0..28a01280ec 100644 > > > --- a/sysdeps/x86_64/multiarch/memchr-avx2.S > > > +++ b/sysdeps/x86_64/multiarch/memchr-avx2.S > > > @@ -57,7 +57,7 @@ > > > # define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) > > > > > > .section SECTION(.text),"ax",@progbits > > > -ENTRY (MEMCHR) > > > +ENTRY_P2ALIGN (MEMCHR, 5) > > > # ifndef USE_AS_RAWMEMCHR > > > /* Check for zero length. */ > > > # ifdef __ILP32__ > > > @@ -87,12 +87,14 @@ ENTRY (MEMCHR) > > > # endif > > > testl %eax, %eax > > > jz L(aligned_more) > > > - tzcntl %eax, %eax > > > + bsfl %eax, %eax > > > addq %rdi, %rax > > > - VZEROUPPER_RETURN > > > +L(return_vzeroupper): > > > + ZERO_UPPER_VEC_REGISTERS_RETURN > > > + > > > > > > # ifndef USE_AS_RAWMEMCHR > > > - .p2align 5 > > > + .p2align 4 > > > L(first_vec_x0): > > > /* Check if first match was before length. */ > > > tzcntl %eax, %eax > > > @@ -100,58 +102,31 @@ L(first_vec_x0): > > > /* NB: Multiply length by 4 to get byte count. */ > > > sall $2, %edx > > > # endif > > > - xorl %ecx, %ecx > > > + COND_VZEROUPPER > > > + /* Use branch instead of cmovcc so L(first_vec_x0) fits in one fetch > > > + block. branch here as opposed to cmovcc is not that costly. Common > > > + usage of memchr is to check if the return was NULL (if string was > > > + known to contain CHAR user would use rawmemchr). This branch will be > > > + highly correlated with the user branch and can be used by most > > > + modern branch predictors to predict the user branch. 
*/ > > > cmpl %eax, %edx > > > - leaq (%rdi, %rax), %rax > > > - cmovle %rcx, %rax > > > - VZEROUPPER_RETURN > > > - > > > -L(null): > > > - xorl %eax, %eax > > > - ret > > > -# endif > > > - .p2align 4 > > > -L(cross_page_boundary): > > > - /* Save pointer before aligning as its original value is > > > - necessary for computer return address if byte is found or > > > - adjusting length if it is not and this is memchr. */ > > > - movq %rdi, %rcx > > > - /* Align data to VEC_SIZE - 1. ALGN_PTR_REG is rcx for memchr > > > - and rdi for rawmemchr. */ > > > - orq $(VEC_SIZE - 1), %ALGN_PTR_REG > > > - VPCMPEQ -(VEC_SIZE - 1)(%ALGN_PTR_REG), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > -# ifndef USE_AS_RAWMEMCHR > > > - /* Calculate length until end of page (length checked for a > > > - match). */ > > > - leaq 1(%ALGN_PTR_REG), %rsi > > > - subq %RRAW_PTR_REG, %rsi > > > -# ifdef USE_AS_WMEMCHR > > > - /* NB: Divide bytes by 4 to get wchar_t count. */ > > > - shrl $2, %esi > > > -# endif > > > -# endif > > > - /* Remove the leading bytes. */ > > > - sarxl %ERAW_PTR_REG, %eax, %eax > > > -# ifndef USE_AS_RAWMEMCHR > > > - /* Check the end of data. */ > > > - cmpq %rsi, %rdx > > > - jbe L(first_vec_x0) > > > + jle L(null) > > > + addq %rdi, %rax > > > + ret > > > # endif > > > - testl %eax, %eax > > > - jz L(cross_page_continue) > > > - tzcntl %eax, %eax > > > - addq %RRAW_PTR_REG, %rax > > > -L(return_vzeroupper): > > > - ZERO_UPPER_VEC_REGISTERS_RETURN > > > > > > - .p2align 4 > > > + .p2align 4,, 10 > > > L(first_vec_x1): > > > - tzcntl %eax, %eax > > > + bsfl %eax, %eax > > > incq %rdi > > > addq %rdi, %rax > > > VZEROUPPER_RETURN > > > - > > > +# ifndef USE_AS_RAWMEMCHR > > > + /* First in aligning bytes here. */ > > > +L(null): > > > + xorl %eax, %eax > > > + ret > > > +# endif > > > .p2align 4 > > > L(first_vec_x2): > > > tzcntl %eax, %eax > > > @@ -340,7 +315,7 @@ L(first_vec_x1_check): > > > incq %rdi > > > addq %rdi, %rax > > > VZEROUPPER_RETURN > > > - .p2align 4 > > > + .p2align 4,, 6 > > > L(set_zero_end): > > > xorl %eax, %eax > > > VZEROUPPER_RETURN > > > @@ -428,5 +403,39 @@ L(last_vec_x3): > > > VZEROUPPER_RETURN > > > # endif > > > > > > + .p2align 4 > > > +L(cross_page_boundary): > > > + /* Save pointer before aligning as its original value is necessary for > > > + computer return address if byte is found or adjusting length if it > > > + is not and this is memchr. */ > > > + movq %rdi, %rcx > > > + /* Align data to VEC_SIZE. ALGN_PTR_REG is rcx for memchr and rdi for > > > + rawmemchr. */ > > > + andq $-VEC_SIZE, %ALGN_PTR_REG > > > + VPCMPEQ (%ALGN_PTR_REG), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %eax > > > +# ifndef USE_AS_RAWMEMCHR > > > + /* Calculate length until end of page (length checked for a match). */ > > > + leal VEC_SIZE(%ALGN_PTR_REG), %esi > > > + subl %ERAW_PTR_REG, %esi > > > +# ifdef USE_AS_WMEMCHR > > > + /* NB: Divide bytes by 4 to get wchar_t count. */ > > > + shrl $2, %esi > > > +# endif > > > +# endif > > > + /* Remove the leading bytes. */ > > > + sarxl %ERAW_PTR_REG, %eax, %eax > > > +# ifndef USE_AS_RAWMEMCHR > > > + /* Check the end of data. */ > > > + cmpq %rsi, %rdx > > > + jbe L(first_vec_x0) > > > +# endif > > > + testl %eax, %eax > > > + jz L(cross_page_continue) > > > + bsfl %eax, %eax > > > + addq %RRAW_PTR_REG, %rax > > > + VZEROUPPER_RETURN > > > + > > > + > > > END (MEMCHR) > > > #endif > > > -- > > > 2.34.1 > > > > > > > LGTM. > > > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > > > Thanks. > > > > -- > > H.J. 
> > I would like to backport this patch to release branches. > Any comments or objections? Probably best to squash with: https://sourceware.org/git/?p=glibc.git;a=commit;h=2c9af8421d2b4a7fcce163e7bc81a118d22fd346 > > --Sunil ^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH v6 8/8] x86: Shrink code size of memchr-evex.S 2022-06-07 4:11 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (5 preceding siblings ...) 2022-06-07 4:11 ` [PATCH v6 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein @ 2022-06-07 4:11 ` Noah Goldstein 2022-06-07 18:19 ` H.J. Lu 2022-06-07 18:04 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library H.J. Lu 7 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-07 4:11 UTC (permalink / raw) To: libc-alpha This is not meant as a performance optimization. The previous code was far to liberal in aligning targets and wasted code size unnecissarily. The total code size saving is: 64 bytes There are no non-negligible changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 1.000 Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memchr-evex.S | 46 ++++++++++++++------------ 1 file changed, 25 insertions(+), 21 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memchr-evex.S b/sysdeps/x86_64/multiarch/memchr-evex.S index cfaf02907d..0fd11b7632 100644 --- a/sysdeps/x86_64/multiarch/memchr-evex.S +++ b/sysdeps/x86_64/multiarch/memchr-evex.S @@ -88,7 +88,7 @@ # define PAGE_SIZE 4096 .section SECTION(.text),"ax",@progbits -ENTRY (MEMCHR) +ENTRY_P2ALIGN (MEMCHR, 6) # ifndef USE_AS_RAWMEMCHR /* Check for zero length. */ test %RDX_LP, %RDX_LP @@ -131,22 +131,24 @@ L(zero): xorl %eax, %eax ret - .p2align 5 + .p2align 4 L(first_vec_x0): - /* Check if first match was before length. */ - tzcntl %eax, %eax - xorl %ecx, %ecx - cmpl %eax, %edx - leaq (%rdi, %rax, CHAR_SIZE), %rax - cmovle %rcx, %rax + /* Check if first match was before length. NB: tzcnt has false data- + dependency on destination. eax already had a data-dependency on esi + so this should have no affect here. */ + tzcntl %eax, %esi +# ifdef USE_AS_WMEMCHR + leaq (%rdi, %rsi, CHAR_SIZE), %rdi +# else + addq %rsi, %rdi +# endif + xorl %eax, %eax + cmpl %esi, %edx + cmovg %rdi, %rax ret -# else - /* NB: first_vec_x0 is 17 bytes which will leave - cross_page_boundary (which is relatively cold) close enough - to ideal alignment. So only realign L(cross_page_boundary) if - rawmemchr. */ - .p2align 4 # endif + + .p2align 4 L(cross_page_boundary): /* Save pointer before aligning as its original value is necessary for computer return address if byte is found or @@ -400,10 +402,14 @@ L(last_2x_vec): L(zero_end): ret +L(set_zero_end): + xorl %eax, %eax + ret .p2align 4 L(first_vec_x1_check): - tzcntl %eax, %eax + /* eax must be non-zero. Use bsfl to save code size. */ + bsfl %eax, %eax /* Adjust length. */ subl $-(CHAR_PER_VEC * 4), %edx /* Check if match within remaining length. */ @@ -412,9 +418,6 @@ L(first_vec_x1_check): /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. 
*/ leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax ret -L(set_zero_end): - xorl %eax, %eax - ret .p2align 4 L(loop_4x_vec_end): @@ -464,7 +467,7 @@ L(loop_4x_vec_end): # endif ret - .p2align 4 + .p2align 4,, 10 L(last_vec_x1_return): tzcntl %eax, %eax # if defined USE_AS_WMEMCHR || RET_OFFSET != 0 @@ -496,6 +499,7 @@ L(last_vec_x3_return): # endif # ifndef USE_AS_RAWMEMCHR + .p2align 4,, 5 L(last_4x_vec_or_less_cmpeq): VPCMP $0, (VEC_SIZE * 5)(%rdi), %YMMMATCH, %k0 kmovd %k0, %eax @@ -546,7 +550,7 @@ L(last_4x_vec): # endif andl %ecx, %eax jz L(zero_end2) - tzcntl %eax, %eax + bsfl %eax, %eax leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax L(zero_end2): ret @@ -562,6 +566,6 @@ L(last_vec_x3): leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax ret # endif - + /* 7 bytes from next cache line. */ END (MEMCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
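The reworked L(first_vec_x0) in this patch packs two ideas into five instructions: counting into a scratch register (esi) to sidestep tzcnt's false output dependency, and selecting between the candidate pointer and NULL with cmovg. Restated as C, as a sketch only; for wmemchr the index is additionally scaled by CHAR_SIZE:

#include <stdint.h>
#include <stddef.h>

static const unsigned char *
first_vec_x0 (const unsigned char *p, uint32_t mask, uint32_t len)
{
  /* Hardware tzcnt defines the zero case as the operand width, so a
     zero mask yields idx == 32, which the length compare then rejects.
     __builtin_ctz leaves zero undefined, hence the explicit guard in
     this sketch.  */
  uint32_t idx = mask ? (uint32_t) __builtin_ctz (mask) : 32; /* tzcntl */
  const unsigned char *cand = p + idx;         /* addq %rsi, %rdi */
  /* xorl %eax,%eax; cmpl %esi,%edx; cmovg %rdi,%rax  */
  return (int32_t) len > (int32_t) idx ? cand : NULL;
}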
* Re: [PATCH v6 8/8] x86: Shrink code size of memchr-evex.S 2022-06-07 4:11 ` [PATCH v6 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein @ 2022-06-07 18:19 ` H.J. Lu 2022-07-14 2:32 ` Sunil Pandey 0 siblings, 1 reply; 82+ messages in thread From: H.J. Lu @ 2022-06-07 18:19 UTC (permalink / raw) To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > This is not meant as a performance optimization. The previous code was > far to liberal in aligning targets and wasted code size unnecissarily. > > The total code size saving is: 64 bytes > > There are no non-negligible changes in the benchmarks. > Geometric Mean of all benchmarks New / Old: 1.000 > > Full xcheck passes on x86_64. > --- > sysdeps/x86_64/multiarch/memchr-evex.S | 46 ++++++++++++++------------ > 1 file changed, 25 insertions(+), 21 deletions(-) > > diff --git a/sysdeps/x86_64/multiarch/memchr-evex.S b/sysdeps/x86_64/multiarch/memchr-evex.S > index cfaf02907d..0fd11b7632 100644 > --- a/sysdeps/x86_64/multiarch/memchr-evex.S > +++ b/sysdeps/x86_64/multiarch/memchr-evex.S > @@ -88,7 +88,7 @@ > # define PAGE_SIZE 4096 > > .section SECTION(.text),"ax",@progbits > -ENTRY (MEMCHR) > +ENTRY_P2ALIGN (MEMCHR, 6) > # ifndef USE_AS_RAWMEMCHR > /* Check for zero length. */ > test %RDX_LP, %RDX_LP > @@ -131,22 +131,24 @@ L(zero): > xorl %eax, %eax > ret > > - .p2align 5 > + .p2align 4 > L(first_vec_x0): > - /* Check if first match was before length. */ > - tzcntl %eax, %eax > - xorl %ecx, %ecx > - cmpl %eax, %edx > - leaq (%rdi, %rax, CHAR_SIZE), %rax > - cmovle %rcx, %rax > + /* Check if first match was before length. NB: tzcnt has false data- > + dependency on destination. eax already had a data-dependency on esi > + so this should have no affect here. */ > + tzcntl %eax, %esi > +# ifdef USE_AS_WMEMCHR > + leaq (%rdi, %rsi, CHAR_SIZE), %rdi > +# else > + addq %rsi, %rdi > +# endif > + xorl %eax, %eax > + cmpl %esi, %edx > + cmovg %rdi, %rax > ret > -# else > - /* NB: first_vec_x0 is 17 bytes which will leave > - cross_page_boundary (which is relatively cold) close enough > - to ideal alignment. So only realign L(cross_page_boundary) if > - rawmemchr. */ > - .p2align 4 > # endif > + > + .p2align 4 > L(cross_page_boundary): > /* Save pointer before aligning as its original value is > necessary for computer return address if byte is found or > @@ -400,10 +402,14 @@ L(last_2x_vec): > L(zero_end): > ret > > +L(set_zero_end): > + xorl %eax, %eax > + ret > > .p2align 4 > L(first_vec_x1_check): > - tzcntl %eax, %eax > + /* eax must be non-zero. Use bsfl to save code size. */ > + bsfl %eax, %eax > /* Adjust length. */ > subl $-(CHAR_PER_VEC * 4), %edx > /* Check if match within remaining length. */ > @@ -412,9 +418,6 @@ L(first_vec_x1_check): > /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. 
*/ > leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax > ret > -L(set_zero_end): > - xorl %eax, %eax > - ret > > .p2align 4 > L(loop_4x_vec_end): > @@ -464,7 +467,7 @@ L(loop_4x_vec_end): > # endif > ret > > - .p2align 4 > + .p2align 4,, 10 > L(last_vec_x1_return): > tzcntl %eax, %eax > # if defined USE_AS_WMEMCHR || RET_OFFSET != 0 > @@ -496,6 +499,7 @@ L(last_vec_x3_return): > # endif > > # ifndef USE_AS_RAWMEMCHR > + .p2align 4,, 5 > L(last_4x_vec_or_less_cmpeq): > VPCMP $0, (VEC_SIZE * 5)(%rdi), %YMMMATCH, %k0 > kmovd %k0, %eax > @@ -546,7 +550,7 @@ L(last_4x_vec): > # endif > andl %ecx, %eax > jz L(zero_end2) > - tzcntl %eax, %eax > + bsfl %eax, %eax > leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax > L(zero_end2): > ret > @@ -562,6 +566,6 @@ L(last_vec_x3): > leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax > ret > # endif > - > + /* 7 bytes from next cache line. */ > END (MEMCHR) > #endif > -- > 2.34.1 > LGTM. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Thanks. -- H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
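A footnote on the bsfl-for-tzcntl substitutions that this patch and 7/8 both make: the two instructions agree for any non-zero source; only tzcnt defines the zero case (it returns the operand width), while bsf has a one-byte-shorter encoding (no F3 prefix). The swap is therefore valid exactly on the paths where the mask is proven non-zero, which is what the "eax must be non-zero" comments assert. Modeled in C as a sketch; __builtin_ctz mirrors bsf in leaving zero undefined:

#include <stdint.h>
#include <assert.h>

/* bsf side: caller must guarantee a non-zero mask.  */
static uint32_t
lowest_set_bit (uint32_t x)
{
  assert (x != 0);                   /* mirrors "eax must be non-zero" */
  return (uint32_t) __builtin_ctz (x);           /* bsfl */
}

/* tzcnt side: zero is defined as the operand width.  */
static uint32_t
tzcnt32 (uint32_t x)
{
  return x ? (uint32_t) __builtin_ctz (x) : 32;  /* tzcntl */
}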
* Re: [PATCH v6 8/8] x86: Shrink code size of memchr-evex.S 2022-06-07 18:19 ` H.J. Lu @ 2022-07-14 2:32 ` Sunil Pandey 0 siblings, 0 replies; 82+ messages in thread From: Sunil Pandey @ 2022-07-14 2:32 UTC (permalink / raw) To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library On Tue, Jun 7, 2022 at 11:20 AM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > This is not meant as a performance optimization. The previous code was > > far to liberal in aligning targets and wasted code size unnecissarily. > > > > The total code size saving is: 64 bytes > > > > There are no non-negligible changes in the benchmarks. > > Geometric Mean of all benchmarks New / Old: 1.000 > > > > Full xcheck passes on x86_64. > > --- > > sysdeps/x86_64/multiarch/memchr-evex.S | 46 ++++++++++++++------------ > > 1 file changed, 25 insertions(+), 21 deletions(-) > > > > diff --git a/sysdeps/x86_64/multiarch/memchr-evex.S b/sysdeps/x86_64/multiarch/memchr-evex.S > > index cfaf02907d..0fd11b7632 100644 > > --- a/sysdeps/x86_64/multiarch/memchr-evex.S > > +++ b/sysdeps/x86_64/multiarch/memchr-evex.S > > @@ -88,7 +88,7 @@ > > # define PAGE_SIZE 4096 > > > > .section SECTION(.text),"ax",@progbits > > -ENTRY (MEMCHR) > > +ENTRY_P2ALIGN (MEMCHR, 6) > > # ifndef USE_AS_RAWMEMCHR > > /* Check for zero length. */ > > test %RDX_LP, %RDX_LP > > @@ -131,22 +131,24 @@ L(zero): > > xorl %eax, %eax > > ret > > > > - .p2align 5 > > + .p2align 4 > > L(first_vec_x0): > > - /* Check if first match was before length. */ > > - tzcntl %eax, %eax > > - xorl %ecx, %ecx > > - cmpl %eax, %edx > > - leaq (%rdi, %rax, CHAR_SIZE), %rax > > - cmovle %rcx, %rax > > + /* Check if first match was before length. NB: tzcnt has false data- > > + dependency on destination. eax already had a data-dependency on esi > > + so this should have no affect here. */ > > + tzcntl %eax, %esi > > +# ifdef USE_AS_WMEMCHR > > + leaq (%rdi, %rsi, CHAR_SIZE), %rdi > > +# else > > + addq %rsi, %rdi > > +# endif > > + xorl %eax, %eax > > + cmpl %esi, %edx > > + cmovg %rdi, %rax > > ret > > -# else > > - /* NB: first_vec_x0 is 17 bytes which will leave > > - cross_page_boundary (which is relatively cold) close enough > > - to ideal alignment. So only realign L(cross_page_boundary) if > > - rawmemchr. */ > > - .p2align 4 > > # endif > > + > > + .p2align 4 > > L(cross_page_boundary): > > /* Save pointer before aligning as its original value is > > necessary for computer return address if byte is found or > > @@ -400,10 +402,14 @@ L(last_2x_vec): > > L(zero_end): > > ret > > > > +L(set_zero_end): > > + xorl %eax, %eax > > + ret > > > > .p2align 4 > > L(first_vec_x1_check): > > - tzcntl %eax, %eax > > + /* eax must be non-zero. Use bsfl to save code size. */ > > + bsfl %eax, %eax > > /* Adjust length. */ > > subl $-(CHAR_PER_VEC * 4), %edx > > /* Check if match within remaining length. */ > > @@ -412,9 +418,6 @@ L(first_vec_x1_check): > > /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. 
*/ > > leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax > > ret > > -L(set_zero_end): > > - xorl %eax, %eax > > - ret > > > > .p2align 4 > > L(loop_4x_vec_end): > > @@ -464,7 +467,7 @@ L(loop_4x_vec_end): > > # endif > > ret > > > > - .p2align 4 > > + .p2align 4,, 10 > > L(last_vec_x1_return): > > tzcntl %eax, %eax > > # if defined USE_AS_WMEMCHR || RET_OFFSET != 0 > > @@ -496,6 +499,7 @@ L(last_vec_x3_return): > > # endif > > > > # ifndef USE_AS_RAWMEMCHR > > + .p2align 4,, 5 > > L(last_4x_vec_or_less_cmpeq): > > VPCMP $0, (VEC_SIZE * 5)(%rdi), %YMMMATCH, %k0 > > kmovd %k0, %eax > > @@ -546,7 +550,7 @@ L(last_4x_vec): > > # endif > > andl %ecx, %eax > > jz L(zero_end2) > > - tzcntl %eax, %eax > > + bsfl %eax, %eax > > leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax > > L(zero_end2): > > ret > > @@ -562,6 +566,6 @@ L(last_vec_x3): > > leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax > > ret > > # endif > > - > > + /* 7 bytes from next cache line. */ > > END (MEMCHR) > > #endif > > -- > > 2.34.1 > > > > LGTM. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library 2022-06-07 4:11 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (6 preceding siblings ...) 2022-06-07 4:11 ` [PATCH v6 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein @ 2022-06-07 18:04 ` H.J. Lu 2022-07-14 2:07 ` Sunil Pandey 7 siblings, 1 reply; 82+ messages in thread From: H.J. Lu @ 2022-06-07 18:04 UTC (permalink / raw) To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > This patch does not touch any existing code and is only meant to be a > tool for future patches so that simple source files can more easily be > maintained to target multiple VEC classes. > > There is no difference in the objdump of libc.so before and after this > patch. > --- > sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 34 ++++++++ > sysdeps/x86_64/multiarch/avx-vecs.h | 47 +++++++++++ > sysdeps/x86_64/multiarch/evex-vecs-common.h | 39 +++++++++ > sysdeps/x86_64/multiarch/evex256-vecs.h | 35 ++++++++ > sysdeps/x86_64/multiarch/evex512-vecs.h | 35 ++++++++ > sysdeps/x86_64/multiarch/sse2-vecs.h | 47 +++++++++++ > sysdeps/x86_64/multiarch/vec-macros.h | 90 +++++++++++++++++++++ > 7 files changed, 327 insertions(+) > create mode 100644 sysdeps/x86_64/multiarch/avx-rtm-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/avx-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/evex-vecs-common.h > create mode 100644 sysdeps/x86_64/multiarch/evex256-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/evex512-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/sse2-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/vec-macros.h > > diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > new file mode 100644 > index 0000000000..3f531dd47f > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > @@ -0,0 +1,34 @@ > +/* Common config for AVX-RTM VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _AVX_RTM_VECS_H > +#define _AVX_RTM_VECS_H 1 > + > +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ > + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > + > +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) > + > +#define USE_WITH_RTM 1 > +#include "avx-vecs.h" > + > +#undef SECTION > +#define SECTION(p) p##.avx.rtm > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/avx-vecs.h b/sysdeps/x86_64/multiarch/avx-vecs.h > new file mode 100644 > index 0000000000..89680f5db8 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/avx-vecs.h > @@ -0,0 +1,47 @@ > +/* Common config for AVX VECs > + All versions must be listed in ifunc-impl-list.c. 
> + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _AVX_VECS_H > +#define _AVX_VECS_H 1 > + > +#ifdef VEC_SIZE > +# error "Multiple VEC configs included!" > +#endif > + > +#define VEC_SIZE 32 > +#include "vec-macros.h" > + > +#define USE_WITH_AVX 1 > +#define SECTION(p) p##.avx > + > +/* 4-byte mov instructions with AVX2. */ > +#define MOV_SIZE 4 > +/* 1 (ret) + 3 (vzeroupper). */ > +#define RET_SIZE 4 > +#define VZEROUPPER vzeroupper > + > +#define VMOVU vmovdqu > +#define VMOVA vmovdqa > +#define VMOVNT vmovntdq > + > +/* Often need to access xmm portion. */ > +#define VEC_xmm VEC_any_xmm > +#define VEC VEC_any_ymm > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/evex-vecs-common.h b/sysdeps/x86_64/multiarch/evex-vecs-common.h > new file mode 100644 > index 0000000000..99806ebcd7 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/evex-vecs-common.h > @@ -0,0 +1,39 @@ > +/* Common config for EVEX256 and EVEX512 VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _EVEX_VECS_COMMON_H > +#define _EVEX_VECS_COMMON_H 1 > + > +#include "vec-macros.h" > + > +/* 6-byte mov instructions with EVEX. */ > +#define MOV_SIZE 6 > +/* No vzeroupper needed. */ > +#define RET_SIZE 1 > +#define VZEROUPPER > + > +#define VMOVU vmovdqu64 > +#define VMOVA vmovdqa64 > +#define VMOVNT vmovntdq > + > +#define VEC_xmm VEC_hi_xmm > +#define VEC_ymm VEC_hi_ymm > +#define VEC_zmm VEC_hi_zmm > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/evex256-vecs.h b/sysdeps/x86_64/multiarch/evex256-vecs.h > new file mode 100644 > index 0000000000..222ba46dc7 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/evex256-vecs.h > @@ -0,0 +1,35 @@ > +/* Common config for EVEX256 VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. 
> + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _EVEX256_VECS_H > +#define _EVEX256_VECS_H 1 > + > +#ifdef VEC_SIZE > +# error "Multiple VEC configs included!" > +#endif > + > +#define VEC_SIZE 32 > +#include "evex-vecs-common.h" > + > +#define USE_WITH_EVEX256 1 > +#define SECTION(p) p##.evex > + > +#define VEC VEC_ymm > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/evex512-vecs.h b/sysdeps/x86_64/multiarch/evex512-vecs.h > new file mode 100644 > index 0000000000..d1784d5368 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/evex512-vecs.h > @@ -0,0 +1,35 @@ > +/* Common config for EVEX512 VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _EVEX512_VECS_H > +#define _EVEX512_VECS_H 1 > + > +#ifdef VEC_SIZE > +# error "Multiple VEC configs included!" > +#endif > + > +#define VEC_SIZE 64 > +#include "evex-vecs-common.h" > + > +#define USE_WITH_EVEX512 1 > +#define SECTION(p) p##.evex512 > + > +#define VEC VEC_zmm > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/sse2-vecs.h b/sysdeps/x86_64/multiarch/sse2-vecs.h > new file mode 100644 > index 0000000000..2b77a59d56 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/sse2-vecs.h > @@ -0,0 +1,47 @@ > +/* Common config for SSE2 VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. 
> + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _SSE2_VECS_H > +#define _SSE2_VECS_H 1 > + > +#ifdef VEC_SIZE > +# error "Multiple VEC configs included!" > +#endif > + > +#define VEC_SIZE 16 > +#include "vec-macros.h" > + > +#define USE_WITH_SSE2 1 > +#define SECTION(p) p > + > +/* 3-byte mov instructions with SSE2. */ > +#define MOV_SIZE 3 > +/* No vzeroupper needed. */ > +#define RET_SIZE 1 > +#define VZEROUPPER > + > +#define VMOVU movups > +#define VMOVA movaps > +#define VMOVNT movntdq > + > +#define VEC_xmm VEC_any_xmm > +#define VEC VEC_any_xmm > + > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/vec-macros.h b/sysdeps/x86_64/multiarch/vec-macros.h > new file mode 100644 > index 0000000000..9f3ffecede > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/vec-macros.h > @@ -0,0 +1,90 @@ > +/* Macro helpers for VEC_{type}({vec_num}) > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _VEC_MACROS_H > +#define _VEC_MACROS_H 1 > + > +#ifndef VEC_SIZE > +# error "Never include this file directly. Always include a vector config." > +#endif > + > +/* Defines so we can use SSE2 / AVX2 / EVEX / EVEX512 encoding with same > + VEC(N) values. 
*/ > +#define VEC_hi_xmm0 xmm16 > +#define VEC_hi_xmm1 xmm17 > +#define VEC_hi_xmm2 xmm18 > +#define VEC_hi_xmm3 xmm19 > +#define VEC_hi_xmm4 xmm20 > +#define VEC_hi_xmm5 xmm21 > +#define VEC_hi_xmm6 xmm22 > +#define VEC_hi_xmm7 xmm23 > +#define VEC_hi_xmm8 xmm24 > +#define VEC_hi_xmm9 xmm25 > +#define VEC_hi_xmm10 xmm26 > +#define VEC_hi_xmm11 xmm27 > +#define VEC_hi_xmm12 xmm28 > +#define VEC_hi_xmm13 xmm29 > +#define VEC_hi_xmm14 xmm30 > +#define VEC_hi_xmm15 xmm31 > + > +#define VEC_hi_ymm0 ymm16 > +#define VEC_hi_ymm1 ymm17 > +#define VEC_hi_ymm2 ymm18 > +#define VEC_hi_ymm3 ymm19 > +#define VEC_hi_ymm4 ymm20 > +#define VEC_hi_ymm5 ymm21 > +#define VEC_hi_ymm6 ymm22 > +#define VEC_hi_ymm7 ymm23 > +#define VEC_hi_ymm8 ymm24 > +#define VEC_hi_ymm9 ymm25 > +#define VEC_hi_ymm10 ymm26 > +#define VEC_hi_ymm11 ymm27 > +#define VEC_hi_ymm12 ymm28 > +#define VEC_hi_ymm13 ymm29 > +#define VEC_hi_ymm14 ymm30 > +#define VEC_hi_ymm15 ymm31 > + > +#define VEC_hi_zmm0 zmm16 > +#define VEC_hi_zmm1 zmm17 > +#define VEC_hi_zmm2 zmm18 > +#define VEC_hi_zmm3 zmm19 > +#define VEC_hi_zmm4 zmm20 > +#define VEC_hi_zmm5 zmm21 > +#define VEC_hi_zmm6 zmm22 > +#define VEC_hi_zmm7 zmm23 > +#define VEC_hi_zmm8 zmm24 > +#define VEC_hi_zmm9 zmm25 > +#define VEC_hi_zmm10 zmm26 > +#define VEC_hi_zmm11 zmm27 > +#define VEC_hi_zmm12 zmm28 > +#define VEC_hi_zmm13 zmm29 > +#define VEC_hi_zmm14 zmm30 > +#define VEC_hi_zmm15 zmm31 > + > +#define PRIMITIVE_VEC(vec, num) vec##num > + > +#define VEC_any_xmm(i) PRIMITIVE_VEC(xmm, i) > +#define VEC_any_ymm(i) PRIMITIVE_VEC(ymm, i) > +#define VEC_any_zmm(i) PRIMITIVE_VEC(zmm, i) > + > +#define VEC_hi_xmm(i) PRIMITIVE_VEC(VEC_hi_xmm, i) > +#define VEC_hi_ymm(i) PRIMITIVE_VEC(VEC_hi_ymm, i) > +#define VEC_hi_zmm(i) PRIMITIVE_VEC(VEC_hi_zmm, i) > + > +#endif > -- > 2.34.1 > LGTM. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Thanks. -- H.J. ^ permalink raw reply [flat|nested] 82+ messages in thread
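For anyone reading vec-macros.h for the first time: everything above is plain C-preprocessor token pasting, so a single VEC(N) spelling in shared source can resolve to xmmN/ymmN for the SSE2/AVX configs or to the EVEX-only high registers (xmm16 and up) for the EVEX configs. A reduced, compilable sketch of the two-step expansion (the macro subset is copied from the patch; the stringize-and-check harness is just a hypothetical way to demonstrate it):

    #define PRIMITIVE_VEC(vec, num) vec##num

    #define VEC_any_ymm(i) PRIMITIVE_VEC(ymm, i)
    #define VEC_hi_zmm5 zmm21
    #define VEC_hi_zmm(i) PRIMITIVE_VEC(VEC_hi_zmm, i)

    #define STR_(x) #x
    #define STR(x) STR_(x)

    /* VEC_any_ymm(3) pastes directly to the token ymm3.  VEC_hi_zmm(5)
       first pastes to the macro name VEC_hi_zmm5, which then rescans and
       expands to zmm21, an EVEX-only register.  */
    _Static_assert (sizeof (STR (VEC_any_ymm (3))) == sizeof ("ymm3"),
                    "direct paste");
    _Static_assert (sizeof (STR (VEC_hi_zmm (5))) == sizeof ("zmm21"),
                    "two-step paste");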
* Re: [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library 2022-06-07 18:04 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library H.J. Lu @ 2022-07-14 2:07 ` Sunil Pandey 0 siblings, 0 replies; 82+ messages in thread From: Sunil Pandey @ 2022-07-14 2:07 UTC (permalink / raw) To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library On Tue, Jun 7, 2022 at 11:05 AM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Mon, Jun 6, 2022 at 9:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > This patch does not touch any existing code and is only meant to be a > > tool for future patches so that simple source files can more easily be > > maintained to target multiple VEC classes. > > > > There is no difference in the objdump of libc.so before and after this > > patch. > > --- > > sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 34 ++++++++ > > sysdeps/x86_64/multiarch/avx-vecs.h | 47 +++++++++++ > > sysdeps/x86_64/multiarch/evex-vecs-common.h | 39 +++++++++ > > sysdeps/x86_64/multiarch/evex256-vecs.h | 35 ++++++++ > > sysdeps/x86_64/multiarch/evex512-vecs.h | 35 ++++++++ > > sysdeps/x86_64/multiarch/sse2-vecs.h | 47 +++++++++++ > > sysdeps/x86_64/multiarch/vec-macros.h | 90 +++++++++++++++++++++ > > 7 files changed, 327 insertions(+) > > create mode 100644 sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > create mode 100644 sysdeps/x86_64/multiarch/avx-vecs.h > > create mode 100644 sysdeps/x86_64/multiarch/evex-vecs-common.h > > create mode 100644 sysdeps/x86_64/multiarch/evex256-vecs.h > > create mode 100644 sysdeps/x86_64/multiarch/evex512-vecs.h > > create mode 100644 sysdeps/x86_64/multiarch/sse2-vecs.h > > create mode 100644 sysdeps/x86_64/multiarch/vec-macros.h > > > > diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > new file mode 100644 > > index 0000000000..3f531dd47f > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > > @@ -0,0 +1,34 @@ > > +/* Common config for AVX-RTM VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. 
*/ > > + > > +#ifndef _AVX_RTM_VECS_H > > +#define _AVX_RTM_VECS_H 1 > > + > > +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > + > > +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) > > + > > +#define USE_WITH_RTM 1 > > +#include "avx-vecs.h" > > + > > +#undef SECTION > > +#define SECTION(p) p##.avx.rtm > > + > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/avx-vecs.h b/sysdeps/x86_64/multiarch/avx-vecs.h > > new file mode 100644 > > index 0000000000..89680f5db8 > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/avx-vecs.h > > @@ -0,0 +1,47 @@ > > +/* Common config for AVX VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. */ > > + > > +#ifndef _AVX_VECS_H > > +#define _AVX_VECS_H 1 > > + > > +#ifdef VEC_SIZE > > +# error "Multiple VEC configs included!" > > +#endif > > + > > +#define VEC_SIZE 32 > > +#include "vec-macros.h" > > + > > +#define USE_WITH_AVX 1 > > +#define SECTION(p) p##.avx > > + > > +/* 4-byte mov instructions with AVX2. */ > > +#define MOV_SIZE 4 > > +/* 1 (ret) + 3 (vzeroupper). */ > > +#define RET_SIZE 4 > > +#define VZEROUPPER vzeroupper > > + > > +#define VMOVU vmovdqu > > +#define VMOVA vmovdqa > > +#define VMOVNT vmovntdq > > + > > +/* Often need to access xmm portion. */ > > +#define VEC_xmm VEC_any_xmm > > +#define VEC VEC_any_ymm > > + > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/evex-vecs-common.h b/sysdeps/x86_64/multiarch/evex-vecs-common.h > > new file mode 100644 > > index 0000000000..99806ebcd7 > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/evex-vecs-common.h > > @@ -0,0 +1,39 @@ > > +/* Common config for EVEX256 and EVEX512 VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. */ > > + > > +#ifndef _EVEX_VECS_COMMON_H > > +#define _EVEX_VECS_COMMON_H 1 > > + > > +#include "vec-macros.h" > > + > > +/* 6-byte mov instructions with EVEX. 
*/ > > +#define MOV_SIZE 6 > > +/* No vzeroupper needed. */ > > +#define RET_SIZE 1 > > +#define VZEROUPPER > > + > > +#define VMOVU vmovdqu64 > > +#define VMOVA vmovdqa64 > > +#define VMOVNT vmovntdq > > + > > +#define VEC_xmm VEC_hi_xmm > > +#define VEC_ymm VEC_hi_ymm > > +#define VEC_zmm VEC_hi_zmm > > + > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/evex256-vecs.h b/sysdeps/x86_64/multiarch/evex256-vecs.h > > new file mode 100644 > > index 0000000000..222ba46dc7 > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/evex256-vecs.h > > @@ -0,0 +1,35 @@ > > +/* Common config for EVEX256 VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. */ > > + > > +#ifndef _EVEX256_VECS_H > > +#define _EVEX256_VECS_H 1 > > + > > +#ifdef VEC_SIZE > > +# error "Multiple VEC configs included!" > > +#endif > > + > > +#define VEC_SIZE 32 > > +#include "evex-vecs-common.h" > > + > > +#define USE_WITH_EVEX256 1 > > +#define SECTION(p) p##.evex > > + > > +#define VEC VEC_ymm > > + > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/evex512-vecs.h b/sysdeps/x86_64/multiarch/evex512-vecs.h > > new file mode 100644 > > index 0000000000..d1784d5368 > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/evex512-vecs.h > > @@ -0,0 +1,35 @@ > > +/* Common config for EVEX512 VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. */ > > + > > +#ifndef _EVEX512_VECS_H > > +#define _EVEX512_VECS_H 1 > > + > > +#ifdef VEC_SIZE > > +# error "Multiple VEC configs included!" 
> > +#endif > > + > > +#define VEC_SIZE 64 > > +#include "evex-vecs-common.h" > > + > > +#define USE_WITH_EVEX512 1 > > +#define SECTION(p) p##.evex512 > > + > > +#define VEC VEC_zmm > > + > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/sse2-vecs.h b/sysdeps/x86_64/multiarch/sse2-vecs.h > > new file mode 100644 > > index 0000000000..2b77a59d56 > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/sse2-vecs.h > > @@ -0,0 +1,47 @@ > > +/* Common config for SSE2 VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. */ > > + > > +#ifndef _SSE2_VECS_H > > +#define _SSE2_VECS_H 1 > > + > > +#ifdef VEC_SIZE > > +# error "Multiple VEC configs included!" > > +#endif > > + > > +#define VEC_SIZE 16 > > +#include "vec-macros.h" > > + > > +#define USE_WITH_SSE2 1 > > +#define SECTION(p) p > > + > > +/* 3-byte mov instructions with SSE2. */ > > +#define MOV_SIZE 3 > > +/* No vzeroupper needed. */ > > +#define RET_SIZE 1 > > +#define VZEROUPPER > > + > > +#define VMOVU movups > > +#define VMOVA movaps > > +#define VMOVNT movntdq > > + > > +#define VEC_xmm VEC_any_xmm > > +#define VEC VEC_any_xmm > > + > > + > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/vec-macros.h b/sysdeps/x86_64/multiarch/vec-macros.h > > new file mode 100644 > > index 0000000000..9f3ffecede > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/vec-macros.h > > @@ -0,0 +1,90 @@ > > +/* Macro helpers for VEC_{type}({vec_num}) > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <https://www.gnu.org/licenses/>. */ > > + > > +#ifndef _VEC_MACROS_H > > +#define _VEC_MACROS_H 1 > > + > > +#ifndef VEC_SIZE > > +# error "Never include this file directly. Always include a vector config." > > +#endif > > + > > +/* Defines so we can use SSE2 / AVX2 / EVEX / EVEX512 encoding with same > > + VEC(N) values. 
*/ > > +#define VEC_hi_xmm0 xmm16 > > +#define VEC_hi_xmm1 xmm17 > > +#define VEC_hi_xmm2 xmm18 > > +#define VEC_hi_xmm3 xmm19 > > +#define VEC_hi_xmm4 xmm20 > > +#define VEC_hi_xmm5 xmm21 > > +#define VEC_hi_xmm6 xmm22 > > +#define VEC_hi_xmm7 xmm23 > > +#define VEC_hi_xmm8 xmm24 > > +#define VEC_hi_xmm9 xmm25 > > +#define VEC_hi_xmm10 xmm26 > > +#define VEC_hi_xmm11 xmm27 > > +#define VEC_hi_xmm12 xmm28 > > +#define VEC_hi_xmm13 xmm29 > > +#define VEC_hi_xmm14 xmm30 > > +#define VEC_hi_xmm15 xmm31 > > + > > +#define VEC_hi_ymm0 ymm16 > > +#define VEC_hi_ymm1 ymm17 > > +#define VEC_hi_ymm2 ymm18 > > +#define VEC_hi_ymm3 ymm19 > > +#define VEC_hi_ymm4 ymm20 > > +#define VEC_hi_ymm5 ymm21 > > +#define VEC_hi_ymm6 ymm22 > > +#define VEC_hi_ymm7 ymm23 > > +#define VEC_hi_ymm8 ymm24 > > +#define VEC_hi_ymm9 ymm25 > > +#define VEC_hi_ymm10 ymm26 > > +#define VEC_hi_ymm11 ymm27 > > +#define VEC_hi_ymm12 ymm28 > > +#define VEC_hi_ymm13 ymm29 > > +#define VEC_hi_ymm14 ymm30 > > +#define VEC_hi_ymm15 ymm31 > > + > > +#define VEC_hi_zmm0 zmm16 > > +#define VEC_hi_zmm1 zmm17 > > +#define VEC_hi_zmm2 zmm18 > > +#define VEC_hi_zmm3 zmm19 > > +#define VEC_hi_zmm4 zmm20 > > +#define VEC_hi_zmm5 zmm21 > > +#define VEC_hi_zmm6 zmm22 > > +#define VEC_hi_zmm7 zmm23 > > +#define VEC_hi_zmm8 zmm24 > > +#define VEC_hi_zmm9 zmm25 > > +#define VEC_hi_zmm10 zmm26 > > +#define VEC_hi_zmm11 zmm27 > > +#define VEC_hi_zmm12 zmm28 > > +#define VEC_hi_zmm13 zmm29 > > +#define VEC_hi_zmm14 zmm30 > > +#define VEC_hi_zmm15 zmm31 > > + > > +#define PRIMITIVE_VEC(vec, num) vec##num > > + > > +#define VEC_any_xmm(i) PRIMITIVE_VEC(xmm, i) > > +#define VEC_any_ymm(i) PRIMITIVE_VEC(ymm, i) > > +#define VEC_any_zmm(i) PRIMITIVE_VEC(zmm, i) > > + > > +#define VEC_hi_xmm(i) PRIMITIVE_VEC(VEC_hi_xmm, i) > > +#define VEC_hi_ymm(i) PRIMITIVE_VEC(VEC_hi_ymm, i) > > +#define VEC_hi_zmm(i) PRIMITIVE_VEC(VEC_hi_zmm, i) > > + > > +#endif > > -- > > 2.34.1 > > > > LGTM. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 82+ messages in thread
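One more orientation note, since the same headers are quoted twice in this subthread: the "#ifdef VEC_SIZE / #error" guard enforces that each translation unit picks exactly one config, after which shared code is written once against VEC_SIZE and friends. A reduced C analogue (hypothetical file, not part of glibc) of how a consumer is expected to look:

    /* Pretend this file is the evex512 flavor of some routine; the
       define below stands in for #include "evex512-vecs.h".  A second
       config include would now trip the #error at preprocess time.  */
    #ifdef VEC_SIZE
    # error "Multiple VEC configs included!"
    #endif
    #define VEC_SIZE 64

    #include <stddef.h>

    /* Shared logic written once against the config, the way the asm
       bodies round with `andq $-(VEC_SIZE * 4), ...`.  */
    static size_t
    round_down_to_4x_vec (size_t len)
    {
      return len & ~(size_t) (4 * VEC_SIZE - 1);
    }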
* [PATCH v1 3/8] Benchtests: Improve memrchr benchmarks 2022-06-03 4:42 [PATCH v1 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-03 4:42 ` [PATCH v1 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein @ 2022-06-03 4:42 ` Noah Goldstein 2022-06-03 4:42 ` [PATCH v1 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein ` (5 subsequent siblings) 7 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 4:42 UTC (permalink / raw) To: libc-alpha Add a second iteration for memrchr to set `pos` starting from the end of the buffer. Previously `pos` was only set relative to the beginning of the buffer. This isn't really useful for memrchr because the beginning of the search space is (buf + len). --- benchtests/bench-memchr.c | 110 ++++++++++++++++++++++---------------- 1 file changed, 65 insertions(+), 45 deletions(-) diff --git a/benchtests/bench-memchr.c b/benchtests/bench-memchr.c index 4d7212332f..0facda2fa0 100644 --- a/benchtests/bench-memchr.c +++ b/benchtests/bench-memchr.c @@ -76,7 +76,7 @@ do_one_test (json_ctx_t *json_ctx, impl_t *impl, const CHAR *s, int c, static void do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, - int seek_char) + int seek_char, int invert_pos) { size_t i; @@ -96,7 +96,10 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, if (pos < len) { - buf[align + pos] = seek_char; + if (invert_pos) + buf[align + len - pos] = seek_char; + else + buf[align + pos] = seek_char; buf[align + len] = -seek_char; } else @@ -109,6 +112,7 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, json_attr_uint (json_ctx, "pos", pos); json_attr_uint (json_ctx, "len", len); json_attr_uint (json_ctx, "seek_char", seek_char); + json_attr_uint (json_ctx, "invert_pos", invert_pos); json_array_begin (json_ctx, "timings"); @@ -123,6 +127,7 @@ int test_main (void) { size_t i; + int repeats; json_ctx_t json_ctx; test_init (); @@ -142,53 +147,68 @@ test_main (void) json_array_begin (&json_ctx, "results"); - for (i = 1; i < 8; ++i) + for (repeats = 0; repeats < 2; ++repeats) { - do_test (&json_ctx, 0, 16 << i, 2048, 23); - do_test (&json_ctx, i, 64, 256, 23); - do_test (&json_ctx, 0, 16 << i, 2048, 0); - do_test (&json_ctx, i, 64, 256, 0); - - do_test (&json_ctx, getpagesize () - 15, 64, 256, 0); + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, 0, 16 << i, 2048, 23, repeats); + do_test (&json_ctx, i, 64, 256, 23, repeats); + do_test (&json_ctx, 0, 16 << i, 2048, 0, repeats); + do_test (&json_ctx, i, 64, 256, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, 64, 256, 0, repeats); #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr. */ - do_test (&json_ctx, 0, i, 256, 23); - do_test (&json_ctx, 0, i, 256, 0); - do_test (&json_ctx, i, i, 256, 23); - do_test (&json_ctx, i, i, 256, 0); + /* Also test the position close to the beginning for memrchr.
*/ + do_test (&json_ctx, 0, i, 256, 23, repeats); + do_test (&json_ctx, 0, i, 256, 0, repeats); + do_test (&json_ctx, i, i, 256, 23, repeats); + do_test (&json_ctx, i, i, 256, 0, repeats); #endif - } - for (i = 1; i < 8; ++i) - { - do_test (&json_ctx, i, i << 5, 192, 23); - do_test (&json_ctx, i, i << 5, 192, 0); - do_test (&json_ctx, i, i << 5, 256, 23); - do_test (&json_ctx, i, i << 5, 256, 0); - do_test (&json_ctx, i, i << 5, 512, 23); - do_test (&json_ctx, i, i << 5, 512, 0); - - do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23); - } - for (i = 1; i < 32; ++i) - { - do_test (&json_ctx, 0, i, i + 1, 23); - do_test (&json_ctx, 0, i, i + 1, 0); - do_test (&json_ctx, i, i, i + 1, 23); - do_test (&json_ctx, i, i, i + 1, 0); - do_test (&json_ctx, 0, i, i - 1, 23); - do_test (&json_ctx, 0, i, i - 1, 0); - do_test (&json_ctx, i, i, i - 1, 23); - do_test (&json_ctx, i, i, i - 1, 0); - - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23); - do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0); - - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23); - do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0); + } + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, i, i << 5, 192, 23, repeats); + do_test (&json_ctx, i, i << 5, 192, 0, repeats); + do_test (&json_ctx, i, i << 5, 256, 23, repeats); + do_test (&json_ctx, i, i << 5, 256, 0, repeats); + do_test (&json_ctx, i, i << 5, 512, 23, repeats); + do_test (&json_ctx, i, i << 5, 512, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23, repeats); + } + for (i = 1; i < 32; ++i) + { + do_test (&json_ctx, 0, i, i + 1, 23, repeats); + do_test (&json_ctx, 0, i, i + 1, 0, repeats); + do_test (&json_ctx, i, i, i + 1, 23, repeats); + do_test (&json_ctx, i, i, i + 1, 0, repeats); + do_test (&json_ctx, 0, i, i - 1, 23, repeats); + do_test (&json_ctx, 0, i, i - 1, 0, repeats); + do_test (&json_ctx, i, i, i - 1, 23, repeats); + do_test (&json_ctx, i, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i + 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i + 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2, i, i - 1, 0, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () / 2 + i, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 23, repeats); + do_test (&json_ctx, getpagesize () - 15, i, i - 1, 0, repeats); + + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 23, repeats); + do_test (&json_ctx, getpagesize () - 15, i, i + 1, 0, repeats); + #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr. */ - do_test (&json_ctx, 0, 1, i + 1, 23); - do_test (&json_ctx, 0, 2, i + 1, 0); + do_test (&json_ctx, 0, 1, i + 1, 23, repeats); + do_test (&json_ctx, 0, 2, i + 1, 0, repeats); +#endif + } +#ifndef USE_AS_MEMRCHR + break; #endif } -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
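To make the new invert_pos knob concrete: it flips the placement of the match byte so that `pos` is measured from the end of the buffer rather than the start, which is the distance that actually matters for memrchr since it scans backwards from buf + len. A simplified C sketch of the setup (variable names follow the patch; the real harness also plants a -seek_char sentinel at buf[align + len]):

    #include <string.h>

    static void
    place_match (unsigned char *buf, size_t align, size_t len, size_t pos,
                 unsigned char seek_char, int invert_pos)
    {
      memset (buf + align, 1, len);   /* non-matching filler */
      if (pos < len)
        buf[align + (invert_pos ? len - pos : pos)] = seek_char;
    }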
* [PATCH v1 4/8] x86: Optimize memrchr-sse2.S 2022-06-03 4:42 [PATCH v1 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-03 4:42 ` [PATCH v1 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein 2022-06-03 4:42 ` [PATCH v1 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein @ 2022-06-03 4:42 ` Noah Goldstein 2022-06-03 4:47 ` Noah Goldstein 2022-06-03 4:42 ` [PATCH v1 5/8] x86: Optimize memrchr-evex.S Noah Goldstein ` (4 subsequent siblings) 7 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 4:42 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller lengths more. 2. optimizes target placement more carefully. 3. reuses logic more. 4. fixes up various inefficiencies in the logic. The total code size saving is: 394 bytes Geometric Mean of all benchmarks New / Old: 0.874 Regressions: 1. The page cross case is now colder, especially re-entry from the page cross case if a match is not found in the first VEC (roughly 50%). My general opinion with this patch is this is acceptable given the "coldness" of this case (less than 4%) and generally performance improvement in the other far more common cases. 2. There are some regressions 5-15% for medium/large user-arg lengths that have a match in the first VEC. This is because the logic was rewritten to optimize finds in the first VEC if the user-arg length is shorter (where we see roughly 20-50% performance improvements). It is not always the case this is a regression. My intuition is some frontend quirk is partially explaining the data although I haven't been able to find the root cause. Full xcheck passes on x86_64. --- sysdeps/x86_64/memrchr.S | 613 +++++++++++++++++++-------------------- 1 file changed, 292 insertions(+), 321 deletions(-) diff --git a/sysdeps/x86_64/memrchr.S b/sysdeps/x86_64/memrchr.S index d1a9f47911..b0dffd2ae2 100644 --- a/sysdeps/x86_64/memrchr.S +++ b/sysdeps/x86_64/memrchr.S @@ -18,362 +18,333 @@ <https://www.gnu.org/licenses/>. */ #include <sysdep.h> +#define VEC_SIZE 16 +#define PAGE_SIZE 4096 .text -ENTRY (__memrchr) - movd %esi, %xmm1 - - sub $16, %RDX_LP - jbe L(length_less16) - - punpcklbw %xmm1, %xmm1 - punpcklbw %xmm1, %xmm1 - - add %RDX_LP, %RDI_LP - pshufd $0, %xmm1, %xmm1 - - movdqu (%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - -/* Check if there is a match. 
*/ - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches0) - - sub $64, %rdi - mov %edi, %ecx - and $15, %ecx - jz L(loop_prolog) - - add $16, %rdi - add $16, %rdx - and $-16, %rdi - sub %rcx, %rdx - - .p2align 4 -L(loop_prolog): - sub $64, %rdx - jbe L(exit_loop) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16) - - movdqa (%rdi), %xmm4 - pcmpeqb %xmm1, %xmm4 - pmovmskb %xmm4, %eax - test %eax, %eax - jnz L(matches0) - - sub $64, %rdi - sub $64, %rdx - jbe L(exit_loop) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16) - - movdqa (%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches0) - - mov %edi, %ecx - and $63, %ecx - jz L(align64_loop) - - add $64, %rdi - add $64, %rdx - and $-64, %rdi - sub %rcx, %rdx - - .p2align 4 -L(align64_loop): - sub $64, %rdi - sub $64, %rdx - jbe L(exit_loop) - - movdqa (%rdi), %xmm0 - movdqa 16(%rdi), %xmm2 - movdqa 32(%rdi), %xmm3 - movdqa 48(%rdi), %xmm4 - - pcmpeqb %xmm1, %xmm0 - pcmpeqb %xmm1, %xmm2 - pcmpeqb %xmm1, %xmm3 - pcmpeqb %xmm1, %xmm4 - - pmaxub %xmm3, %xmm0 - pmaxub %xmm4, %xmm2 - pmaxub %xmm0, %xmm2 - pmovmskb %xmm2, %eax - - test %eax, %eax - jz L(align64_loop) - - pmovmskb %xmm4, %eax - test %eax, %eax - jnz L(matches48) - - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm2 - - pcmpeqb %xmm1, %xmm2 - pcmpeqb (%rdi), %xmm1 - - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches16) - - pmovmskb %xmm1, %eax - bsr %eax, %eax - - add %rdi, %rax +ENTRY_P2ALIGN(__memrchr, 6) +#ifdef __ILP32__ + /* Clear upper bits. */ + mov %RDX_LP, %RDX_LP +#endif + movd %esi, %xmm0 + + /* Get end pointer. */ + leaq (%rdx, %rdi), %rcx + + punpcklbw %xmm0, %xmm0 + punpcklwd %xmm0, %xmm0 + pshufd $0, %xmm0, %xmm0 + + /* Check if we can load 1x VEC without crossing a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %ecx + jz L(page_cross) + + /* NB: This load happens regardless of whether rdx (len) is zero. Since + it doesn't cross a page and the standard guarantees any pointer has + at least one valid byte this load must be safe. For the entire + history of the x86 memrchr implementation this has been possible so + no code "should" be relying on a zero-length check before this load. + The zero-length check is moved to the page cross case because it is + 1) pretty cold and including it pushes the hot case len <= VEC_SIZE + into 2-cache lines. */ + movups -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subq $VEC_SIZE, %rdx + ja L(more_1x_vec) +L(ret_vec_x0_test): + /* Zero-flag set if eax (src) is zero. Destination unchanged if src is + zero. */ + bsrl %eax, %eax + jz L(ret_0) + /* Check if the CHAR match is in bounds. Need to truly zero `eax` here + if out of bounds. */ + addl %edx, %eax + jl L(zero_0) + /* Since we subtracted VEC_SIZE from rdx earlier we can just add to base + ptr.
*/ + addq %rdi, %rax +L(ret_0): ret - .p2align 4 -L(exit_loop): - add $64, %edx - cmp $32, %edx - jbe L(exit_loop_32) - - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48) - - movdqa 32(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - test %eax, %eax - jnz L(matches32) - - movdqa 16(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm3 - pmovmskb %xmm3, %eax - test %eax, %eax - jnz L(matches16_1) - cmp $48, %edx - jbe L(return_null) - - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax - test %eax, %eax - jnz L(matches0_1) - xor %eax, %eax + .p2align 4,, 5 +L(ret_vec_x0): + bsrl %eax, %eax + leaq -(VEC_SIZE)(%rcx, %rax), %rax ret - .p2align 4 -L(exit_loop_32): - movdqa 48(%rdi), %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - jnz L(matches48_1) - cmp $16, %edx - jbe L(return_null) - - pcmpeqb 32(%rdi), %xmm1 - pmovmskb %xmm1, %eax - test %eax, %eax - jnz L(matches32_1) - xor %eax, %eax + .p2align 4,, 2 +L(zero_0): + xorl %eax, %eax ret - .p2align 4 -L(matches0): - bsr %eax, %eax - add %rdi, %rax - ret - - .p2align 4 -L(matches16): - bsr %eax, %eax - lea 16(%rax, %rdi), %rax - ret - .p2align 4 -L(matches32): - bsr %eax, %eax - lea 32(%rax, %rdi), %rax + .p2align 4,, 8 +L(more_1x_vec): + testl %eax, %eax + jnz L(ret_vec_x0) + + /* Align rcx (pointer to string). */ + decq %rcx + andq $-VEC_SIZE, %rcx + + movq %rcx, %rdx + /* NB: We could consistently save 1-byte in this pattern with `movaps + %xmm0, %xmm1; pcmpeq IMM8(r), %xmm1; ...`. The reason against it is + it adds more frontend uops (even if the moves can be eliminated) and + some percentage of the time actual backend uops. */ + movaps -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + subq %rdi, %rdx + pmovmskb %xmm1, %eax + + cmpq $(VEC_SIZE * 2), %rdx + ja L(more_2x_vec) +L(last_2x_vec): + subl $VEC_SIZE, %edx + jbe L(ret_vec_x0_test) + + testl %eax, %eax + jnz L(ret_vec_x0) + + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subl $VEC_SIZE, %edx + bsrl %eax, %eax + jz L(ret_1) + addl %edx, %eax + jl L(zero_0) + addq %rdi, %rax +L(ret_1): ret + /* Don't align. Otherwise lose 2-byte encoding in jump to L(page_cross) + causes the hot path (length <= VEC_SIZE) to span multiple cache + lines. Naturally aligned % 16 to 8-bytes. */ +L(page_cross): + /* Zero length check. */ + testq %rdx, %rdx + jz L(zero_0) + + leaq -1(%rcx), %r8 + andq $-(VEC_SIZE), %r8 + + movaps (%r8), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %esi + /* Shift out negative alignment (because we are starting from endptr and + working backwards). */ + negl %ecx + /* 32-bit shift but VEC_SIZE=16 so need to mask the shift count + explicitly. */ + andl $(VEC_SIZE - 1), %ecx + shl %cl, %esi + movzwl %si, %eax + leaq (%rdi, %rdx), %rcx + cmpq %rdi, %r8 + ja L(more_1x_vec) + subl $VEC_SIZE, %edx + bsrl %eax, %eax + jz L(ret_2) + addl %edx, %eax + jl L(zero_1) + addq %rdi, %rax +L(ret_2): ret + /* Fits in aligning bytes.
*/ +L(zero_1): + xorl %eax, %eax ret - .p2align 4 -L(matches16_1): - bsr %eax, %eax - sub $48, %rdx - add %rax, %rdx - jl L(return_null) - lea 16(%rdi, %rax), %rax + .p2align 4,, 5 +L(ret_vec_x1): + bsrl %eax, %eax + leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax ret - .p2align 4 -L(matches32_1): - bsr %eax, %eax - sub $32, %rdx - add %rax, %rdx - jl L(return_null) - lea 32(%rdi, %rax), %rax - ret + .p2align 4,, 8 +L(more_2x_vec): + testl %eax, %eax + jnz L(ret_vec_x0) - .p2align 4 -L(matches48_1): - bsr %eax, %eax - sub $16, %rdx - add %rax, %rdx - jl L(return_null) - lea 48(%rdi, %rax), %rax - ret + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + testl %eax, %eax + jnz L(ret_vec_x1) - .p2align 4 -L(return_null): - xor %eax, %eax - ret - .p2align 4 -L(length_less16_offset0): - test %edx, %edx - jz L(return_null) + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax - mov %dl, %cl - pcmpeqb (%rdi), %xmm1 + subq $(VEC_SIZE * 4), %rdx + ja L(more_4x_vec) - mov $1, %edx - sal %cl, %edx - sub $1, %edx + addl $(VEC_SIZE), %edx + jle L(ret_vec_x2_test) - pmovmskb %xmm1, %eax +L(last_vec): + testl %eax, %eax + jnz L(ret_vec_x2) - and %edx, %eax - test %eax, %eax - jz L(return_null) + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax - bsr %eax, %eax - add %rdi, %rax + subl $(VEC_SIZE), %edx + bsrl %eax, %eax + jz L(ret_3) + addl %edx, %eax + jl L(zero_2) + addq %rdi, %rax +L(ret_3): ret - .p2align 4 -L(length_less16): - punpcklbw %xmm1, %xmm1 - punpcklbw %xmm1, %xmm1 - - add $16, %edx - - pshufd $0, %xmm1, %xmm1 - - mov %edi, %ecx - and $15, %ecx - jz L(length_less16_offset0) - - mov %cl, %dh - mov %ecx, %esi - add %dl, %dh - and $-16, %rdi - - sub $16, %dh - ja L(length_less16_part2) - - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax - - sar %cl, %eax - mov %dl, %cl - - mov $1, %edx - sal %cl, %edx - sub $1, %edx - - and %edx, %eax - test %eax, %eax - jz L(return_null) - - bsr %eax, %eax - add %rdi, %rax - add %rsi, %rax + .p2align 4,, 6 +L(ret_vec_x2_test): + bsrl %eax, %eax + jz L(zero_2) + addl %edx, %eax + jl L(zero_2) + addq %rdi, %rax ret - .p2align 4 -L(length_less16_part2): - movdqa 16(%rdi), %xmm2 - pcmpeqb %xmm1, %xmm2 - pmovmskb %xmm2, %eax - - mov %dh, %cl - mov $1, %edx - sal %cl, %edx - sub $1, %edx - - and %edx, %eax +L(zero_2): + xorl %eax, %eax + ret - test %eax, %eax - jnz L(length_less16_part2_return) - pcmpeqb (%rdi), %xmm1 - pmovmskb %xmm1, %eax + .p2align 4,, 5 +L(ret_vec_x2): + bsrl %eax, %eax + leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax + ret - mov %esi, %ecx - sar %cl, %eax - test %eax, %eax - jz L(return_null) + .p2align 4,, 5 +L(ret_vec_x3): + bsrl %eax, %eax + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax + ret - bsr %eax, %eax - add %rdi, %rax - add %rsi, %rax + .p2align 4,, 8 +L(more_4x_vec): + testl %eax, %eax + jnz L(ret_vec_x2) + + movaps -(VEC_SIZE * 4)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + testl %eax, %eax + jnz L(ret_vec_x3) + + addq $-(VEC_SIZE * 4), %rcx + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_4x_vec) + + /* Offset everything by 4x VEC_SIZE here to save a few bytes at the end + keeping the code from spilling to the next cache line. 
*/ + addq $(VEC_SIZE * 4 - 1), %rcx + andq $-(VEC_SIZE * 4), %rcx + leaq (VEC_SIZE * 4)(%rdi), %rdx + andq $-(VEC_SIZE * 4), %rdx + + .p2align 4,, 11 +L(loop_4x_vec): + movaps (VEC_SIZE * -1)(%rcx), %xmm1 + movaps (VEC_SIZE * -2)(%rcx), %xmm2 + movaps (VEC_SIZE * -3)(%rcx), %xmm3 + movaps (VEC_SIZE * -4)(%rcx), %xmm4 + pcmpeqb %xmm0, %xmm1 + pcmpeqb %xmm0, %xmm2 + pcmpeqb %xmm0, %xmm3 + pcmpeqb %xmm0, %xmm4 + + por %xmm1, %xmm2 + por %xmm3, %xmm4 + por %xmm2, %xmm4 + + pmovmskb %xmm4, %esi + testl %esi, %esi + jnz L(loop_end) + + addq $-(VEC_SIZE * 4), %rcx + cmpq %rdx, %rcx + jne L(loop_4x_vec) + + subl %edi, %edx + + /* Ends up being 1-byte nop. */ + .p2align 4,, 2 +L(last_4x_vec): + movaps -(VEC_SIZE)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + cmpl $(VEC_SIZE * 2), %edx + jbe L(last_2x_vec) + + testl %eax, %eax + jnz L(ret_vec_x0) + + + movaps -(VEC_SIZE * 2)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + testl %eax, %eax + jnz L(ret_vec_end) + + movaps -(VEC_SIZE * 3)(%rcx), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %eax + + subl $(VEC_SIZE * 3), %edx + ja L(last_vec) + bsrl %eax, %eax + jz L(ret_4) + addl %edx, %eax + jl L(zero_3) + addq %rdi, %rax +L(ret_4): ret + /* Ends up being 1-byte nop. */ + .p2align 4,, 3 +L(loop_end): + pmovmskb %xmm1, %eax + sall $16, %eax + jnz L(ret_vec_end) + + pmovmskb %xmm2, %eax + testl %eax, %eax + jnz L(ret_vec_end) + + pmovmskb %xmm3, %eax + /* Combine last 2 VEC matches. If eax (VEC3) is zero (no CHAR in VEC3) + then it won't affect the result in esi (VEC4). If eax is non-zero + then CHAR is in VEC3 and bsrq will use that position. */ + sall $16, %eax + orl %esi, %eax + bsrl %eax, %eax + leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax ret -END (__memrchr) +L(ret_vec_end): + bsrl %eax, %eax + leaq (VEC_SIZE * -2)(%rax, %rcx), %rax + ret + /* Used in L(last_4x_vec). In the same cache line. These are just spare + aligning bytes. */ +L(zero_3): + xorl %eax, %eax + ret + /* 2-bytes from next cache line. */ +END(__memrchr) weak_alias (__memrchr, memrchr) -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
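For orientation, the shape of the rewritten fast path is easier to see in C with intrinsics than in the flattened diff. A rough sketch of just the first check (not the asm itself; it ignores the page-cross entry and the short-length handling the real code wraps around this):

    #include <emmintrin.h>
    #include <stddef.h>

    /* Compare the last 16 bytes ending at buf + len against the broadcast
       char.  The HIGHEST set bit of the pcmpeqb/pmovmskb mask -- bsr in
       the asm, 31 - clz here -- selects the last match, which is what
       memrchr wants.  Assumes len >= 16 so the unaligned load stays in
       bounds.  */
    static const void *
    last_vec_check (const void *buf, int c, size_t len)
    {
      const unsigned char *end = (const unsigned char *) buf + len;
      __m128i needle = _mm_set1_epi8 ((char) c);
      __m128i chunk = _mm_loadu_si128 ((const __m128i *) (end - 16));
      unsigned int mask
        = (unsigned int) _mm_movemask_epi8 (_mm_cmpeq_epi8 (chunk, needle));
      if (mask == 0)
        return NULL;    /* the real code falls through to more vectors */
      return end - 16 + (31 - __builtin_clz (mask));
    }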
* Re: [PATCH v1 4/8] x86: Optimize memrchr-sse2.S 2022-06-03 4:42 ` [PATCH v1 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein @ 2022-06-03 4:47 ` Noah Goldstein 0 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 4:47 UTC (permalink / raw) To: GNU C Library On Thu, Jun 2, 2022 at 11:42 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > The new code: > 1. prioritizes smaller lengths more. > 2. optimizes target placement more carefully. > 3. reuses logic more. > 4. fixes up various inefficiencies in the logic. > > The total code size saving is: 394 bytes > Geometric Mean of all benchmarks New / Old: 0.874 > > Regressions: > 1. The page cross case is now colder, especially re-entry from the > page cross case if a match is not found in the first VEC > (roughly 50%). My general opinion with this patch is this is > acceptable given the "coldness" of this case (less than 4%) and > generally performance improvement in the other far more common > cases. > > 2. There are some regressions 5-15% for medium/large user-arg > lengths that have a match in the first VEC. This is because the > logic was rewritten to optimize finds in the first VEC if the > user-arg length is shorter (where we see roughly 20-50% > performance improvements). It is not always the case this is a > regression. My intuition is some frontend quirk is partially > explaining the data although I haven't been able to find the > root cause. > > Full xcheck passes on x86_64. > --- Least confident with numbers in this patch. Geometric mean of N = 30 runs. Benchmarked on Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i71165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html Aggregate Geometric Mean of New / Old: 0.8743388468654057 Results For: memrchr len, align, pos, seek_char, invert_pos, New / Old 2048, 0, 32, 23, 0, 0.993 256, 1, 64, 23, 0, 0.903 2048, 0, 32, 0, 0, 0.89 256, 1, 64, 0, 0, 0.904 256, 4081, 64, 0, 0, 0.907 256, 0, 1, 23, 0, 0.95 256, 0, 1, 0, 0, 0.95 256, 1, 1, 23, 0, 0.885 256, 1, 1, 0, 0, 0.883 2048, 0, 64, 23, 0, 0.8 256, 2, 64, 23, 0, 0.905 2048, 0, 64, 0, 0, 0.795 256, 2, 64, 0, 0, 0.905 256, 0, 2, 23, 0, 0.949 256, 0, 2, 0, 0, 0.949 256, 2, 2, 23, 0, 0.885 256, 2, 2, 0, 0, 0.886 2048, 0, 128, 23, 0, 0.781 256, 3, 64, 23, 0, 0.904 2048, 0, 128, 0, 0, 0.804 256, 3, 64, 0, 0, 0.904 256, 0, 3, 23, 0, 0.948 256, 0, 3, 0, 0, 0.948 256, 3, 3, 23, 0, 0.886 256, 3, 3, 0, 0, 0.881 2048, 0, 256, 23, 0, 0.715 256, 4, 64, 23, 0, 0.896 2048, 0, 256, 0, 0, 0.747 256, 4, 64, 0, 0, 0.897 256, 0, 4, 23, 0, 0.948 256, 0, 4, 0, 0, 0.95 256, 4, 4, 23, 0, 0.884 256, 4, 4, 0, 0, 0.885 2048, 0, 512, 23, 0, 0.66 256, 5, 64, 23, 0, 0.905 2048, 0, 512, 0, 0, 0.674 256, 5, 64, 0, 0, 0.905 256, 0, 5, 23, 0, 0.951 256, 0, 5, 0, 0, 0.95 256, 5, 5, 23, 0, 0.885 256, 5, 5, 0, 0, 0.883 2048, 0, 1024, 23, 0, 0.952 256, 6, 64, 23, 0, 0.905 2048, 0, 1024, 0, 0, 0.952 256, 6, 64, 0, 0, 0.904 256, 0, 6, 23, 0, 0.95 256, 0, 6, 0, 0, 0.95 256, 6, 6, 23, 0, 0.884 256, 6, 6, 0, 0, 0.884 2048, 0, 2048, 23, 0, 0.843 256, 7, 64, 23, 0, 0.904 2048, 0, 2048, 0, 0, 0.839 256, 7, 64, 0, 0, 0.906 256, 0, 7, 23, 0, 0.951 256, 0, 7, 0, 0, 0.951 256, 7, 7, 23, 0, 0.887 256, 7, 7, 0, 0, 0.885 192, 1, 32, 23, 0, 0.867 192, 1, 32, 0, 0, 0.866 256, 1, 32, 23, 0, 0.888 256, 1, 32, 0, 0, 0.888 512, 1, 32, 23, 0, 1.103 512, 1, 32, 0, 0, 1.102 256, 4081, 32, 23, 0, 0.924 192, 2, 64, 23, 0, 1.081 192, 2, 64, 0, 0, 1.081 512, 2, 64, 23, 0, 1.131 512, 2, 64, 0, 0, 1.129 256, 4081, 64, 23, 0, 0.905 192, 3, 96, 
  [... several hundred more rows of raw memrchr benchmark output, in the
   same format as above -- each row appears to be: len, align, pos,
   seek_char, invert_pos, New Time / Old Time ...]
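The summary statistics quoted in this series (e.g. "Geometric Mean of
all benchmarks New / Old") can be recomputed from rows like the ones
above.  A minimal sketch, assuming the six-field row layout inferred
from the data (len, align, pos, seek_char, invert_pos, New/Old ratio);
this is illustrative, not a tool shipped with glibc:

    /* Recompute the geometric mean of the New/Old ratio column from
       rows piped in on stdin.  Build with: cc geomean.c -lm  */
    #include <math.h>
    #include <stdio.h>

    int
    main (void)
    {
      double v[6], log_sum = 0.0;
      size_t n = 0;

      /* Each row: len, align, pos, seek_char, invert_pos, ratio.  */
      while (scanf ("%lf, %lf, %lf, %lf, %lf, %lf",
                    &v[0], &v[1], &v[2], &v[3], &v[4], &v[5]) == 6)
        {
          log_sum += log (v[5]);   /* Sum logs; avoids product overflow.  */
          n++;
        }
      if (n)
        printf ("geomean New/Old: %.3f over %zu rows\n",
                exp (log_sum / n), n);
      return 0;
    }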
^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH v1 5/8] x86: Optimize memrchr-evex.S 2022-06-03 4:42 [PATCH v1 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (2 preceding siblings ...) 2022-06-03 4:42 ` [PATCH v1 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein @ 2022-06-03 4:42 ` Noah Goldstein 2022-06-03 4:49 ` Noah Goldstein 2022-06-03 4:42 ` [PATCH v1 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein ` (3 subsequent siblings) 7 siblings, 1 reply; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 4:42 UTC (permalink / raw) To: libc-alpha The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully 3. reuses logic more 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns which saves either a branch or multiple instructions. The total code size saving is: 263 bytes Geometric Mean of all benchmarks New / Old: 0.755 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the begining of the string (in first VEC). This case has roughly a 20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). This case has roughly a 35% speedup. Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memrchr-evex.S | 539 ++++++++++++------------ 1 file changed, 268 insertions(+), 271 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S index 0b99709c6b..ad541c0e50 100644 --- a/sysdeps/x86_64/multiarch/memrchr-evex.S +++ b/sysdeps/x86_64/multiarch/memrchr-evex.S @@ -19,319 +19,316 @@ #if IS_IN (libc) # include <sysdep.h> +# include "evex256-vecs.h" +# if VEC_SIZE != 32 +# error "VEC_SIZE != 32 unimplemented" +# endif + +# ifndef MEMRCHR +# define MEMRCHR __memrchr_evex +# endif + +# define PAGE_SIZE 4096 +# define VECMATCH VEC(0) + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN(MEMRCHR, 6) +# ifdef __ILP32__ + /* Clear upper bits. */ + and %RDX_LP, %RDX_LP +# else + test %RDX_LP, %RDX_LP +# endif + jz L(zero_0) + + /* Get end pointer. Minus one for two reasons. 1) It is necessary for a + correct page cross check and 2) it correctly sets up end ptr to be + subtract by lzcnt aligned. */ + leaq -1(%rdi, %rdx), %rax + vpbroadcastb %esi, %VECMATCH + + /* Check if we can load 1x VEC without cross a page. */ + testl $(PAGE_SIZE - VEC_SIZE), %eax + jz L(page_cross) + + /* Don't use rax for pointer here because EVEX has better encoding with + offset % VEC_SIZE == 0. */ + vpcmpb $0, -(VEC_SIZE)(%rdi, %rdx), %VECMATCH, %k0 + kmovd %k0, %ecx + + /* Fall through for rdx (len) <= VEC_SIZE (expect small sizes). */ + cmpq $VEC_SIZE, %rdx + ja L(more_1x_vec) +L(ret_vec_x0_test): + + /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which + will gurantee edx (len) is less than it. */ + lzcntl %ecx, %ecx + cmpl %ecx, %edx + jle L(zero_0) + subq %rcx, %rax + ret -# define VMOVA vmovdqa64 - -# define YMMMATCH ymm16 - -# define VEC_SIZE 32 - - .section .text.evex,"ax",@progbits -ENTRY (__memrchr_evex) - /* Broadcast CHAR to YMMMATCH. */ - vpbroadcastb %esi, %YMMMATCH - - sub $VEC_SIZE, %RDX_LP - jbe L(last_vec_or_less) - - add %RDX_LP, %RDI_LP - - /* Check the last VEC_SIZE bytes. 
*/
-	vpcmpb	$0, (%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x0)
-
-	subq	$(VEC_SIZE * 4), %rdi
-	movl	%edi, %ecx
-	andl	$(VEC_SIZE - 1), %ecx
-	jz	L(aligned_more)
-
-	/* Align data for aligned loads in the loop.  */
-	addq	$VEC_SIZE, %rdi
-	addq	$VEC_SIZE, %rdx
-	andq	$-VEC_SIZE, %rdi
-	subq	%rcx, %rdx
-
-	.p2align 4
-L(aligned_more):
-	subq	$(VEC_SIZE * 4), %rdx
-	jbe	L(last_4x_vec_or_less)
-
-	/* Check the last 4 * VEC_SIZE.  Only one VEC_SIZE at a time
-	   since data is only aligned to VEC_SIZE.  */
-	vpcmpb	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x3)
-
-	vpcmpb	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2
-	kmovd	%k2, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x2)
-
-	vpcmpb	$0, VEC_SIZE(%rdi), %YMMMATCH, %k3
-	kmovd	%k3, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x1)
-
-	vpcmpb	$0, (%rdi), %YMMMATCH, %k4
-	kmovd	%k4, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x0)
-
-	/* Align data to 4 * VEC_SIZE for loop with fewer branches.
-	   There are some overlaps with above if data isn't aligned
-	   to 4 * VEC_SIZE.  */
-	movl	%edi, %ecx
-	andl	$(VEC_SIZE * 4 - 1), %ecx
-	jz	L(loop_4x_vec)
-
-	addq	$(VEC_SIZE * 4), %rdi
-	addq	$(VEC_SIZE * 4), %rdx
-	andq	$-(VEC_SIZE * 4), %rdi
-	subq	%rcx, %rdx
+	/* Fits in aligning bytes of first cache line.  */
+L(zero_0):
+	xorl	%eax, %eax
+	ret

-	.p2align 4
-L(loop_4x_vec):
-	/* Compare 4 * VEC at a time forward.  */
-	subq	$(VEC_SIZE * 4), %rdi
-	subq	$(VEC_SIZE * 4), %rdx
-	jbe	L(last_4x_vec_or_less)
-
-	vpcmpb	$0, (%rdi), %YMMMATCH, %k1
-	vpcmpb	$0, VEC_SIZE(%rdi), %YMMMATCH, %k2
-	kord	%k1, %k2, %k5
-	vpcmpb	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k3
-	vpcmpb	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k4
-
-	kord	%k3, %k4, %k6
-	kortestd %k5, %k6
-	jz	L(loop_4x_vec)
-
-	/* There is a match.  */
-	kmovd	%k4, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x3)
-
-	kmovd	%k3, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x2)
-
-	kmovd	%k2, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x1)
-
-	kmovd	%k1, %eax
-	bsrl	%eax, %eax
-	addq	%rdi, %rax
+	.p2align 4,, 9
+L(ret_vec_x0_dec):
+	decq	%rax
+L(ret_vec_x0):
+	lzcntl	%ecx, %ecx
+	subq	%rcx, %rax
 	ret

-	.p2align 4
-L(last_4x_vec_or_less):
-	addl	$(VEC_SIZE * 4), %edx
-	cmpl	$(VEC_SIZE * 2), %edx
-	jbe	L(last_2x_vec)
+	.p2align 4,, 10
+L(more_1x_vec):
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x0)

-	vpcmpb	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x3)
+	/* Align rax (pointer to string).  */
+	andq	$-VEC_SIZE, %rax

-	vpcmpb	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2
-	kmovd	%k2, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x2)
+	/* Recompute length after aligning.  */
+	movq	%rax, %rdx

-	vpcmpb	$0, VEC_SIZE(%rdi), %YMMMATCH, %k3
-	kmovd	%k3, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x1_check)
-	cmpl	$(VEC_SIZE * 3), %edx
-	jbe	L(zero)
+	/* Need no matter what.  */
+	vpcmpb	$0, -(VEC_SIZE)(%rax), %VECMATCH, %k0
+	kmovd	%k0, %ecx

-	vpcmpb	$0, (%rdi), %YMMMATCH, %k4
-	kmovd	%k4, %eax
-	testl	%eax, %eax
-	jz	L(zero)
-	bsrl	%eax, %eax
-	subq	$(VEC_SIZE * 4), %rdx
-	addq	%rax, %rdx
-	jl	L(zero)
-	addq	%rdi, %rax
-	ret
+	subq	%rdi, %rdx

-	.p2align 4
+	cmpq	$(VEC_SIZE * 2), %rdx
+	ja	L(more_2x_vec)
 L(last_2x_vec):
-	vpcmpb	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x3_check)
+
+	/* Must dec rax because L(ret_vec_x0_test) expects it.  */
+	decq	%rax
 	cmpl	$VEC_SIZE, %edx
-	jbe	L(zero)
-
-	vpcmpb	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
-	testl	%eax, %eax
-	jz	L(zero)
-	bsrl	%eax, %eax
-	subq	$(VEC_SIZE * 2), %rdx
-	addq	%rax, %rdx
-	jl	L(zero)
-	addl	$(VEC_SIZE * 2), %eax
-	addq	%rdi, %rax
+	jbe	L(ret_vec_x0_test)
+
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x0)
+
+	/* Don't use rax for pointer here because EVEX has better encoding with
+	   offset % VEC_SIZE == 0.  */
+	vpcmpb	$0, -(VEC_SIZE * 2)(%rdi, %rdx), %VECMATCH, %k0
+	kmovd	%k0, %ecx
+	/* NB: 64-bit lzcnt.  This will naturally add 32 to position.  */
+	lzcntq	%rcx, %rcx
+	cmpl	%ecx, %edx
+	jle	L(zero_0)
+	subq	%rcx, %rax
 	ret

-	.p2align 4
-L(last_vec_x0):
-	bsrl	%eax, %eax
-	addq	%rdi, %rax
+	/* Inexpensive place to put this regarding code size / target alignments
+	   / ICache NLP.  Necessary for 2-byte encoding of jump to page cross
+	   case which in turn is necessary for hot path (len <= VEC_SIZE) to fit
+	   in first cache line.  */
+L(page_cross):
+	movq	%rax, %rsi
+	andq	$-VEC_SIZE, %rsi
+	vpcmpb	$0, (%rsi), %VECMATCH, %k0
+	kmovd	%k0, %r8d
+	/* Shift out negative alignment (because we are starting from endptr and
+	   working backwards).  */
+	movl	%eax, %ecx
+	/* notl because eax already has endptr - 1.  (-x = ~(x - 1)).  */
+	notl	%ecx
+	shlxl	%ecx, %r8d, %ecx
+	cmpq	%rdi, %rsi
+	ja	L(more_1x_vec)
+	lzcntl	%ecx, %ecx
+	cmpl	%ecx, %edx
+	jle	L(zero_1)
+	subq	%rcx, %rax
 	ret

-	.p2align 4
-L(last_vec_x1):
-	bsrl	%eax, %eax
-	addl	$VEC_SIZE, %eax
-	addq	%rdi, %rax
+	/* Continue creating zero labels that fit in aligning bytes and get
+	   2-byte encoding / are in the same cache line as condition.  */
+L(zero_1):
+	xorl	%eax, %eax
 	ret

-	.p2align 4
-L(last_vec_x2):
-	bsrl	%eax, %eax
-	addl	$(VEC_SIZE * 2), %eax
-	addq	%rdi, %rax
+	.p2align 4,, 8
+L(ret_vec_x1):
+	/* This will naturally add 32 to position.  */
+	bsrl	%ecx, %ecx
+	leaq	-(VEC_SIZE * 2)(%rcx, %rax), %rax
 	ret

-	.p2align 4
-L(last_vec_x3):
-	bsrl	%eax, %eax
-	addl	$(VEC_SIZE * 3), %eax
-	addq	%rdi, %rax
-	ret
+	.p2align 4,, 8
+L(more_2x_vec):
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x0_dec)

-	.p2align 4
-L(last_vec_x1_check):
-	bsrl	%eax, %eax
-	subq	$(VEC_SIZE * 3), %rdx
-	addq	%rax, %rdx
-	jl	L(zero)
-	addl	$VEC_SIZE, %eax
-	addq	%rdi, %rax
-	ret
+	vpcmpb	$0, -(VEC_SIZE * 2)(%rax), %VECMATCH, %k0
+	kmovd	%k0, %ecx
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x1)

-	.p2align 4
-L(last_vec_x3_check):
-	bsrl	%eax, %eax
-	subq	$VEC_SIZE, %rdx
-	addq	%rax, %rdx
-	jl	L(zero)
-	addl	$(VEC_SIZE * 3), %eax
-	addq	%rdi, %rax
-	ret
+	/* Need no matter what.  */
+	vpcmpb	$0, -(VEC_SIZE * 3)(%rax), %VECMATCH, %k0
+	kmovd	%k0, %ecx

-	.p2align 4
-L(zero):
-	xorl	%eax, %eax
+	subq	$(VEC_SIZE * 4), %rdx
+	ja	L(more_4x_vec)
+
+	cmpl	$(VEC_SIZE * -1), %edx
+	jle	L(ret_vec_x2_test)
+L(last_vec):
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x2)
+
+
+	/* Need no matter what.  */
+	vpcmpb	$0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0
+	kmovd	%k0, %ecx
+	lzcntl	%ecx, %ecx
+	subq	$(VEC_SIZE * 3 + 1), %rax
+	subq	%rcx, %rax
+	cmpq	%rax, %rdi
+	ja	L(zero_1)
 	ret

-	.p2align 4
-L(last_vec_or_less_aligned):
-	movl	%edx, %ecx
-
-	vpcmpb	$0, (%rdi), %YMMMATCH, %k1
-
-	movl	$1, %edx
-	/* Support rdx << 32.  */
-	salq	%cl, %rdx
-	subq	$1, %rdx
-
-	kmovd	%k1, %eax
-
-	/* Remove the trailing bytes.  */
-	andl	%edx, %eax
-	testl	%eax, %eax
-	jz	L(zero)
-
-	bsrl	%eax, %eax
-	addq	%rdi, %rax
+	.p2align 4,, 8
+L(ret_vec_x2_test):
+	lzcntl	%ecx, %ecx
+	subq	$(VEC_SIZE * 2 + 1), %rax
+	subq	%rcx, %rax
+	cmpq	%rax, %rdi
+	ja	L(zero_1)
 	ret

-	.p2align 4
-L(last_vec_or_less):
-	addl	$VEC_SIZE, %edx
-
-	/* Check for zero length.  */
-	testl	%edx, %edx
-	jz	L(zero)
-
-	movl	%edi, %ecx
-	andl	$(VEC_SIZE - 1), %ecx
-	jz	L(last_vec_or_less_aligned)
-
-	movl	%ecx, %esi
-	movl	%ecx, %r8d
-	addl	%edx, %esi
-	andq	$-VEC_SIZE, %rdi
+	.p2align 4,, 8
+L(ret_vec_x2):
+	bsrl	%ecx, %ecx
+	leaq	-(VEC_SIZE * 3)(%rcx, %rax), %rax
+	ret

-	subl	$VEC_SIZE, %esi
-	ja	L(last_vec_2x_aligned)
+	.p2align 4,, 8
+L(ret_vec_x3):
+	bsrl	%ecx, %ecx
+	leaq	-(VEC_SIZE * 4)(%rcx, %rax), %rax
+	ret

-	/* Check the last VEC.  */
-	vpcmpb	$0, (%rdi), %YMMMATCH, %k1
-	kmovd	%k1, %eax
+	.p2align 4,, 8
+L(more_4x_vec):
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x2)

-	/* Remove the leading and trailing bytes.  */
-	sarl	%cl, %eax
-	movl	%edx, %ecx
+	vpcmpb	$0, -(VEC_SIZE * 4)(%rax), %VECMATCH, %k0
+	kmovd	%k0, %ecx

-	movl	$1, %edx
-	sall	%cl, %edx
-	subl	$1, %edx
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x3)

-	andl	%edx, %eax
-	testl	%eax, %eax
-	jz	L(zero)
+	/* Check if near end before re-aligning (otherwise might do an
+	   unnecessary loop iteration).  */
+	addq	$-(VEC_SIZE * 4), %rax
+	cmpq	$(VEC_SIZE * 4), %rdx
+	jbe	L(last_4x_vec)

-	bsrl	%eax, %eax
-	addq	%rdi, %rax
-	addq	%r8, %rax
-	ret
+	decq	%rax
+	andq	$-(VEC_SIZE * 4), %rax
+	movq	%rdi, %rdx
+	/* Get endptr for loop in rdx.  NB: Can't just do while rax > rdi because
+	   lengths that overflow can be valid and break the comparison.  */
+	andq	$-(VEC_SIZE * 4), %rdx

 	.p2align 4
-L(last_vec_2x_aligned):
-	movl	%esi, %ecx
-
-	/* Check the last VEC.  */
-	vpcmpb	$0, VEC_SIZE(%rdi), %YMMMATCH, %k1
+L(loop_4x_vec):
+	/* Store 1 where not-equal and 0 where equal in k1 (used to mask later
+	   on).  */
+	vpcmpb	$4, (VEC_SIZE * 3)(%rax), %VECMATCH, %k1
+
+	/* VEC(2/3) will have zero-byte where we found a CHAR.  */
+	vpxorq	(VEC_SIZE * 2)(%rax), %VECMATCH, %VEC(2)
+	vpxorq	(VEC_SIZE * 1)(%rax), %VECMATCH, %VEC(3)
+	vpcmpb	$0, (VEC_SIZE * 0)(%rax), %VECMATCH, %k4
+
+	/* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit where
+	   CHAR is found and VEC(2/3) have zero-byte where CHAR is found).  */
+	vpminub	%VEC(2), %VEC(3), %VEC(3){%k1}{z}
+	vptestnmb %VEC(3), %VEC(3), %k2
+
+	/* Any 1s and we found CHAR.  */
+	kortestd %k2, %k4
+	jnz	L(loop_end)
+
+	addq	$-(VEC_SIZE * 4), %rax
+	cmpq	%rdx, %rax
+	jne	L(loop_4x_vec)
+
+	/* Need to re-adjust rdx / rax for L(last_4x_vec).  */
+	subq	$-(VEC_SIZE * 4), %rdx
+	movq	%rdx, %rax
+	subl	%edi, %edx
+L(last_4x_vec):
+
+	/* Used no matter what.  */
+	vpcmpb	$0, (VEC_SIZE * -1)(%rax), %VECMATCH, %k0
+	kmovd	%k0, %ecx

-	movl	$1, %edx
-	sall	%cl, %edx
-	subl	$1, %edx
+	cmpl	$(VEC_SIZE * 2), %edx
+	jbe	L(last_2x_vec)

-	kmovd	%k1, %eax
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x0_dec)

-	/* Remove the trailing bytes.  */
-	andl	%edx, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x1)
+	vpcmpb	$0, (VEC_SIZE * -2)(%rax), %VECMATCH, %k0
+	kmovd	%k0, %ecx

-	/* Check the second last VEC.  */
-	vpcmpb	$0, (%rdi), %YMMMATCH, %k1
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x1)

-	movl	%r8d, %ecx
+	/* Used no matter what.  */
+	vpcmpb	$0, (VEC_SIZE * -3)(%rax), %VECMATCH, %k0
+	kmovd	%k0, %ecx

-	kmovd	%k1, %eax
+	cmpl	$(VEC_SIZE * 3), %edx
+	ja	L(last_vec)

-	/* Remove the leading bytes.  Must use unsigned right shift for
-	   bsrl below.  */
-	shrl	%cl, %eax
-	testl	%eax, %eax
-	jz	L(zero)
+	lzcntl	%ecx, %ecx
+	subq	$(VEC_SIZE * 2 + 1), %rax
+	subq	%rcx, %rax
+	cmpq	%rax, %rdi
+	jbe	L(ret_1)
+	xorl	%eax, %eax
+L(ret_1):
+	ret

-	bsrl	%eax, %eax
-	addq	%rdi, %rax
-	addq	%r8, %rax
+	.p2align 4,, 6
+L(loop_end):
+	kmovd	%k1, %ecx
+	notl	%ecx
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x0_end)
+
+	vptestnmb %VEC(2), %VEC(2), %k0
+	kmovd	%k0, %ecx
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x1_end)
+
+	kmovd	%k2, %ecx
+	kmovd	%k4, %esi
+	/* Combine last 2 VEC matches.  If ecx (VEC3) is zero (no CHAR in VEC3)
+	   then it won't affect the result in esi (VEC4).  If ecx is non-zero
+	   then CHAR is in VEC3 and bsrq will use that position.  */
+	salq	$32, %rcx
+	orq	%rsi, %rcx
+	bsrq	%rcx, %rcx
+	addq	%rcx, %rax
+	ret
+	.p2align 4,, 4
+L(ret_vec_x0_end):
+	addq	$(VEC_SIZE), %rax
+L(ret_vec_x1_end):
+	bsrl	%ecx, %ecx
+	leaq	(VEC_SIZE * 2)(%rax, %rcx), %rax
 	ret
-END (__memrchr_evex)
+
+END(MEMRCHR)
 #endif
-- 
2.34.1

^ permalink raw reply	[flat|nested] 82+ messages in thread
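The lzcnt return trick named in the commit message above is the heart of the rewrite, and it is easier to follow in C. The sketch below is a hypothetical model of the L(ret_vec_x0_test) path only (function and variable names are invented for illustration, not taken from the patch): `end` points at the last byte of the buffer (s + len - 1), and bit i of `mask` is set when byte end - 31 + i equals the search char, which is exactly what `kmovd` leaves in ecx after the `vpcmpb`.

#include <stddef.h>
#include <stdint.h>

static const char *
ret_vec_x0_model (const char *end, uint32_t mask, size_t len)
{
  /* lzcnt of the mask is the distance from `end' back to the last
     matching byte.  The hardware instruction returns 32 for an all-zero
     mask; __builtin_clz is undefined at zero, so model that explicitly.  */
  unsigned int dist = mask ? (unsigned int) __builtin_clz (mask) : 32;
  /* Mirrors `cmpl %ecx, %edx; jle L(zero_0)': a zero mask gives
     dist == 32, which no length on this path (len <= 32) can beat.  */
  if (len <= dist)
    return NULL;
  /* Mirrors `subq %rcx, %rax'.  */
  return end - dist;
}

Setting the end pointer to s + len - 1 at entry is what makes this work: a single subtract produces the final return value, with no separate bit-index-to-pointer conversion and no extra branch for the not-found case.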
* Re: [PATCH v1 5/8] x86: Optimize memrchr-evex.S
  2022-06-03 4:42 ` [PATCH v1 5/8] x86: Optimize memrchr-evex.S Noah Goldstein
@ 2022-06-03 4:49 ` Noah Goldstein
  0 siblings, 0 replies; 82+ messages in thread
From: Noah Goldstein @ 2022-06-03 4:49 UTC (permalink / raw)
To: GNU C Library

On Thu, Jun 2, 2022 at 11:42 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> The new code:
>     1. prioritizes smaller user-arg lengths more.
>     2. optimizes target placement more carefully
>     3. reuses logic more
>     4. fixes up various inefficiencies in the logic. The biggest
>        case here is the `lzcnt` logic for checking returns which
>        saves either a branch or multiple instructions.
>
> The total code size saving is: 263 bytes
> Geometric Mean of all benchmarks New / Old: 0.755
>
> Regressions:
> There are some regressions. Particularly where the length (user arg
> length) is large but the position of the match char is near the
> beginning of the string (in first VEC). This case has roughly a
> 20% regression.
>
> This is because the new logic gives the hot path for immediate matches
> to shorter lengths (the more common input). This case has roughly
> a 35% speedup.
>
> Full xcheck passes on x86_64.

Geometric mean of N = 30 runs.

Benchmarked on Tigerlake:
https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i71165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

Aggregate Geometric Mean of New / Old: 0.7552309594785345

Results For: memrchr
len, align, pos, seek_char, invert_pos, New / Old
2048, 0, 32, 23, 0, 0.897 256, 1, 64, 23, 0, 0.773 2048, 0, 32, 0, 0, 0.897 256, 1, 64, 0, 0, 0.772 256, 4081, 64, 0, 0, 0.647 256, 0, 1, 23, 0, 0.773 256, 0, 1, 0, 0, 0.773 256, 1, 1, 23, 0, 0.799 256, 1, 1, 0, 0, 0.8 2048, 0, 64, 23, 0, 0.905 256, 2, 64, 23, 0, 0.773 2048, 0, 64, 0, 0, 0.904 256, 2, 64, 0, 0, 0.772 256, 0, 2, 23, 0, 0.772 256, 0, 2, 0, 0, 0.773 256, 2, 2, 23, 0, 0.799 256, 2, 2, 0, 0, 0.796 2048, 0, 128, 23, 0, 0.926 256, 3, 64, 23, 0, 0.772 2048, 0, 128, 0, 0, 0.925 256, 3, 64, 0, 0, 0.772 256, 0, 3, 23, 0, 0.772 256, 0, 3, 0, 0, 0.772 256, 3, 3, 23, 0, 0.796 256, 3, 3, 0, 0, 0.797 2048, 0, 256, 23, 0, 0.927 256, 4, 64, 23, 0, 0.764 2048, 0, 256, 0, 0, 0.929 256, 4, 64, 0, 0, 0.767 256, 0, 4, 23, 0, 0.772 256, 0, 4, 0, 0, 0.773 256, 4, 4, 23, 0, 0.8 256, 4, 4, 0, 0, 0.798 2048, 0, 512, 23, 0, 0.957 256, 5, 64, 23, 0, 0.773 2048, 0, 512, 0, 0, 0.956 256, 5, 64, 0, 0, 0.773 256, 0, 5, 23, 0, 0.774 256, 0, 5, 0, 0, 0.773 256, 5, 5, 23, 0, 0.798 256, 5, 5, 0, 0, 0.797 2048, 0, 1024, 23, 0, 1.034 256, 6, 64, 23, 0, 0.773 2048, 0, 1024, 0, 0, 1.034 256, 6, 64, 0, 0, 0.773 256, 0, 6, 23, 0, 0.774 256, 0, 6, 0, 0, 0.773 256, 6, 6, 23, 0, 0.799 256, 6, 6, 0, 0, 0.798 2048, 0, 2048, 23, 0, 0.902 256, 7, 64, 23, 0, 0.773 2048, 0, 2048, 0, 0, 0.901 256, 7, 64, 0, 0, 0.773 256, 0, 7, 23, 0, 0.774 256, 0, 7, 0, 0, 0.774 256, 7, 7, 23, 0, 0.802 256, 7, 7, 0, 0, 0.798 192, 1, 32, 23, 0, 0.62 192, 1, 32, 0, 0, 0.62 256, 1, 32, 23, 0, 0.787 256, 1, 32, 0, 0, 0.786 512, 1, 32, 23, 0, 0.819 512, 1, 32, 0, 0, 0.822 256, 4081, 32, 23, 0, 0.731 192, 2, 64, 23, 0, 0.852 192, 2, 64, 0, 0, 0.852 512, 2, 64, 23, 0, 0.883 512, 2, 64, 0, 0, 0.883 256, 4081, 64, 23, 0, 0.646 192, 3, 96, 23, 0, 0.847 192, 3, 96, 0, 0, 0.847 256, 3, 96, 23, 0, 0.782 256, 3, 96, 0, 0, 0.782 512, 3, 96, 23, 0, 0.933 512, 3, 96, 0, 0, 0.932 256, 4081, 96, 23, 0, 0.615 192, 4, 128, 23, 0, 0.836 192, 4, 128, 0, 0, 0.836 256, 4, 128, 23, 0, 0.852 256, 4, 128, 0, 0, 0.853 512, 4, 128, 23, 0, 0.96 512, 4, 128, 0, 0, 0.961 256, 4081, 128,
23, 0, 0.863 192, 5, 160, 23, 0, 1.166 192, 5, 160, 0, 0, 1.167 256, 5, 160, 23, 0, 0.847 256, 5, 160, 0, 0, 0.847 512, 5, 160, 23, 0, 0.949 512, 5, 160, 0, 0, 0.95 256, 4081, 160, 23, 0, 0.879 192, 6, 192, 23, 0, 0.696 192, 6, 192, 0, 0, 0.695 256, 6, 192, 23, 0, 0.836 256, 6, 192, 0, 0, 0.836 512, 6, 192, 23, 0, 0.936 512, 6, 192, 0, 0, 0.935 256, 4081, 192, 23, 0, 0.874 192, 7, 224, 23, 0, 0.697 192, 7, 224, 0, 0, 0.696 256, 7, 224, 23, 0, 1.167 256, 7, 224, 0, 0, 1.167 512, 7, 224, 23, 0, 0.95 512, 7, 224, 0, 0, 0.952 256, 4081, 224, 23, 0, 1.167 2, 0, 1, 23, 0, 0.874 2, 0, 1, 0, 0, 0.875 2, 1, 1, 23, 0, 0.796 2, 1, 1, 0, 0, 0.796 0, 0, 1, 23, 0, 0.857 0, 0, 1, 0, 0, 0.857 0, 1, 1, 23, 0, 0.857 0, 1, 1, 0, 0, 0.857 2, 2048, 1, 23, 0, 0.64 2, 2048, 1, 0, 0, 0.64 2, 2049, 1, 23, 0, 0.582 2, 2049, 1, 0, 0, 0.582 0, 2048, 1, 23, 0, 0.856 0, 2048, 1, 0, 0, 0.856 0, 2049, 1, 23, 0, 0.857 0, 2049, 1, 0, 0, 0.857 0, 4081, 1, 23, 0, 0.857 0, 4081, 1, 0, 0, 0.857 2, 4081, 1, 23, 0, 0.568 2, 4081, 1, 0, 0, 0.568 2, 0, 2, 0, 0, 0.874 3, 0, 2, 23, 0, 0.875 3, 0, 2, 0, 0, 0.875 3, 2, 2, 23, 0, 0.796 3, 2, 2, 0, 0, 0.796 1, 0, 2, 23, 0, 0.875 1, 0, 2, 0, 0, 0.875 1, 2, 2, 23, 0, 0.875 1, 2, 2, 0, 0, 0.875 3, 2048, 2, 23, 0, 0.64 3, 2048, 2, 0, 0, 0.64 3, 2050, 2, 23, 0, 0.582 3, 2050, 2, 0, 0, 0.582 1, 2048, 2, 23, 0, 0.75 1, 2048, 2, 0, 0, 0.751 1, 2050, 2, 23, 0, 0.688 1, 2050, 2, 0, 0, 0.71 1, 4081, 2, 23, 0, 0.714 1, 4081, 2, 0, 0, 0.726 3, 4081, 2, 23, 0, 0.567 3, 4081, 2, 0, 0, 0.567 3, 0, 1, 23, 0, 0.874 4, 0, 3, 23, 0, 0.875 4, 0, 3, 0, 0, 0.875 4, 3, 3, 23, 0, 0.793 4, 3, 3, 0, 0, 0.795 2, 0, 3, 23, 0, 0.875 2, 0, 3, 0, 0, 0.875 2, 3, 3, 23, 0, 0.779 2, 3, 3, 0, 0, 0.822 4, 2048, 3, 23, 0, 0.641 4, 2048, 3, 0, 0, 0.639 4, 2051, 3, 23, 0, 0.58 4, 2051, 3, 0, 0, 0.581 2, 2048, 3, 23, 0, 0.753 2, 2048, 3, 0, 0, 0.752 2, 2051, 3, 23, 0, 0.693 2, 2051, 3, 0, 0, 0.72 2, 4081, 3, 23, 0, 0.715 2, 4081, 3, 0, 0, 0.731 4, 4081, 3, 23, 0, 0.565 4, 4081, 3, 0, 0, 0.565 4, 0, 1, 23, 0, 0.878 4, 0, 2, 0, 0, 0.876 5, 0, 4, 23, 0, 0.877 5, 0, 4, 0, 0, 0.88 5, 4, 4, 23, 0, 0.794 5, 4, 4, 0, 0, 0.796 3, 0, 4, 23, 0, 0.877 3, 0, 4, 0, 0, 0.876 3, 4, 4, 23, 0, 0.785 3, 4, 4, 0, 0, 0.796 5, 2048, 4, 23, 0, 0.639 5, 2048, 4, 0, 0, 0.641 5, 2052, 4, 23, 0, 0.582 5, 2052, 4, 0, 0, 0.579 3, 2048, 4, 23, 0, 0.749 3, 2048, 4, 0, 0, 0.751 3, 2052, 4, 23, 0, 0.689 3, 2052, 4, 0, 0, 0.706 3, 4081, 4, 23, 0, 0.732 3, 4081, 4, 0, 0, 0.724 5, 4081, 4, 23, 0, 0.566 5, 4081, 4, 0, 0, 0.566 5, 0, 1, 23, 0, 0.876 5, 0, 2, 0, 0, 0.876 6, 0, 5, 23, 0, 0.877 6, 0, 5, 0, 0, 0.881 6, 5, 5, 23, 0, 0.797 6, 5, 5, 0, 0, 0.795 4, 0, 5, 23, 0, 0.876 4, 0, 5, 0, 0, 0.877 4, 5, 5, 23, 0, 0.769 4, 5, 5, 0, 0, 0.787 6, 2048, 5, 23, 0, 0.642 6, 2048, 5, 0, 0, 0.641 6, 2053, 5, 23, 0, 0.579 6, 2053, 5, 0, 0, 0.578 4, 2048, 5, 23, 0, 0.75 4, 2048, 5, 0, 0, 0.75 4, 2053, 5, 23, 0, 0.684 4, 2053, 5, 0, 0, 0.703 4, 4081, 5, 23, 0, 0.725 4, 4081, 5, 0, 0, 0.733 6, 4081, 5, 23, 0, 0.565 6, 4081, 5, 0, 0, 0.566 6, 0, 1, 23, 0, 0.876 6, 0, 2, 0, 0, 0.877 7, 0, 6, 23, 0, 0.876 7, 0, 6, 0, 0, 0.88 7, 6, 6, 23, 0, 0.792 7, 6, 6, 0, 0, 0.79 5, 0, 6, 23, 0, 0.875 5, 0, 6, 0, 0, 0.875 5, 6, 6, 23, 0, 0.806 5, 6, 6, 0, 0, 0.833 7, 2048, 6, 23, 0, 0.64 7, 2048, 6, 0, 0, 0.638 7, 2054, 6, 23, 0, 0.578 7, 2054, 6, 0, 0, 0.579 5, 2048, 6, 23, 0, 0.75 5, 2048, 6, 0, 0, 0.75 5, 2054, 6, 23, 0, 0.68 5, 2054, 6, 0, 0, 0.706 5, 4081, 6, 23, 0, 0.71 5, 4081, 6, 0, 0, 0.708 7, 4081, 6, 23, 0, 0.565 7, 4081, 6, 0, 0, 0.562 7, 0, 1, 23, 0, 0.872 7, 0, 2, 0, 0, 0.875 8, 
0, 7, 23, 0, 0.875 8, 0, 7, 0, 0, 0.877 8, 7, 7, 23, 0, 0.79 8, 7, 7, 0, 0, 0.791 6, 0, 7, 23, 0, 0.875 6, 0, 7, 0, 0, 0.88 6, 7, 7, 23, 0, 0.77 6, 7, 7, 0, 0, 0.79 8, 2048, 7, 23, 0, 0.642 8, 2048, 7, 0, 0, 0.643 8, 2055, 7, 23, 0, 0.577 8, 2055, 7, 0, 0, 0.578 6, 2048, 7, 23, 0, 0.75 6, 2048, 7, 0, 0, 0.753 6, 2055, 7, 23, 0, 0.668 6, 2055, 7, 0, 0, 0.674 6, 4081, 7, 23, 0, 0.724 6, 4081, 7, 0, 0, 0.714 8, 4081, 7, 23, 0, 0.565 8, 4081, 7, 0, 0, 0.567 8, 0, 1, 23, 0, 0.876 8, 0, 2, 0, 0, 0.877 9, 0, 8, 23, 0, 0.875 9, 0, 8, 0, 0, 0.877 9, 8, 8, 23, 0, 0.792 9, 8, 8, 0, 0, 0.79 7, 0, 8, 23, 0, 0.875 7, 0, 8, 0, 0, 0.875 7, 8, 8, 23, 0, 0.788 7, 8, 8, 0, 0, 0.795 9, 2048, 8, 23, 0, 0.639 9, 2048, 8, 0, 0, 0.641 9, 2056, 8, 23, 0, 0.58 9, 2056, 8, 0, 0, 0.581 7, 2048, 8, 23, 0, 0.751 7, 2048, 8, 0, 0, 0.754 7, 2056, 8, 23, 0, 0.668 7, 2056, 8, 0, 0, 0.682 7, 4081, 8, 23, 0, 0.691 7, 4081, 8, 0, 0, 0.684 9, 4081, 8, 23, 0, 0.562 9, 4081, 8, 0, 0, 0.564 9, 0, 1, 23, 0, 0.874 9, 0, 2, 0, 0, 0.875 10, 0, 9, 23, 0, 0.878 10, 0, 9, 0, 0, 0.878 10, 9, 9, 23, 0, 0.793 10, 9, 9, 0, 0, 0.795 8, 0, 9, 23, 0, 0.875 8, 0, 9, 0, 0, 0.876 8, 9, 9, 23, 0, 0.788 8, 9, 9, 0, 0, 0.792 10, 2048, 9, 23, 0, 0.641 10, 2048, 9, 0, 0, 0.639 10, 2057, 9, 23, 0, 0.579 10, 2057, 9, 0, 0, 0.582 8, 2048, 9, 23, 0, 0.75 8, 2048, 9, 0, 0, 0.751 8, 2057, 9, 23, 0, 0.693 8, 2057, 9, 0, 0, 0.701 8, 4081, 9, 23, 0, 0.726 8, 4081, 9, 0, 0, 0.725 10, 4081, 9, 23, 0, 0.564 10, 4081, 9, 0, 0, 0.567 10, 0, 1, 23, 0, 0.874 10, 0, 2, 0, 0, 0.877 11, 0, 10, 23, 0, 0.877 11, 0, 10, 0, 0, 0.876 11, 10, 10, 23, 0, 0.795 11, 10, 10, 0, 0, 0.796 9, 0, 10, 23, 0, 0.876 9, 0, 10, 0, 0, 0.875 9, 10, 10, 23, 0, 0.792 9, 10, 10, 0, 0, 0.827 11, 2048, 10, 23, 0, 0.64 11, 2048, 10, 0, 0, 0.639 11, 2058, 10, 23, 0, 0.581 11, 2058, 10, 0, 0, 0.577 9, 2048, 10, 23, 0, 0.751 9, 2048, 10, 0, 0, 0.749 9, 2058, 10, 23, 0, 0.703 9, 2058, 10, 0, 0, 0.71 9, 4081, 10, 23, 0, 0.709 9, 4081, 10, 0, 0, 0.725 11, 4081, 10, 23, 0, 0.565 11, 4081, 10, 0, 0, 0.566 11, 0, 1, 23, 0, 0.875 11, 0, 2, 0, 0, 0.876 12, 0, 11, 23, 0, 0.874 12, 0, 11, 0, 0, 0.875 12, 11, 11, 23, 0, 0.793 12, 11, 11, 0, 0, 0.791 10, 0, 11, 23, 0, 0.876 10, 0, 11, 0, 0, 0.875 10, 11, 11, 23, 0, 0.789 10, 11, 11, 0, 0, 0.792 12, 2048, 11, 23, 0, 0.639 12, 2048, 11, 0, 0, 0.639 12, 2059, 11, 23, 0, 0.578 12, 2059, 11, 0, 0, 0.58 10, 2048, 11, 23, 0, 0.75 10, 2048, 11, 0, 0, 0.75 10, 2059, 11, 23, 0, 0.685 10, 2059, 11, 0, 0, 0.707 10, 4081, 11, 23, 0, 0.7 10, 4081, 11, 0, 0, 0.716 12, 4081, 11, 23, 0, 0.565 12, 4081, 11, 0, 0, 0.562 12, 0, 1, 23, 0, 0.875 12, 0, 2, 0, 0, 0.876 13, 0, 12, 23, 0, 0.875 13, 0, 12, 0, 0, 0.875 13, 12, 12, 23, 0, 0.794 13, 12, 12, 0, 0, 0.794 11, 0, 12, 23, 0, 0.875 11, 0, 12, 0, 0, 0.874 11, 12, 12, 23, 0, 0.81 11, 12, 12, 0, 0, 0.801 13, 2048, 12, 23, 0, 0.64 13, 2048, 12, 0, 0, 0.639 13, 2060, 12, 23, 0, 0.578 13, 2060, 12, 0, 0, 0.578 11, 2048, 12, 23, 0, 0.75 11, 2048, 12, 0, 0, 0.75 11, 2060, 12, 23, 0, 0.694 11, 2060, 12, 0, 0, 0.701 11, 4081, 12, 23, 0, 0.702 11, 4081, 12, 0, 0, 0.714 13, 4081, 12, 23, 0, 0.563 13, 4081, 12, 0, 0, 0.566 13, 0, 1, 23, 0, 0.875 13, 0, 2, 0, 0, 0.874 14, 0, 13, 23, 0, 0.874 14, 0, 13, 0, 0, 0.876 14, 13, 13, 23, 0, 0.794 14, 13, 13, 0, 0, 0.792 12, 0, 13, 23, 0, 0.875 12, 0, 13, 0, 0, 0.875 12, 13, 13, 23, 0, 0.801 12, 13, 13, 0, 0, 0.817 14, 2048, 13, 23, 0, 0.639 14, 2048, 13, 0, 0, 0.639 14, 2061, 13, 23, 0, 0.579 14, 2061, 13, 0, 0, 0.577 12, 2048, 13, 23, 0, 0.75 12, 2048, 13, 0, 0, 0.751 12, 2061, 13, 23, 0, 0.663 12, 
2061, 13, 0, 0, 0.677 12, 4081, 13, 23, 0, 0.703 12, 4081, 13, 0, 0, 0.724 14, 4081, 13, 23, 0, 0.565 14, 4081, 13, 0, 0, 0.564 14, 0, 1, 23, 0, 0.876 14, 0, 2, 0, 0, 0.876 15, 0, 14, 23, 0, 0.875 15, 0, 14, 0, 0, 0.875 15, 14, 14, 23, 0, 0.789 15, 14, 14, 0, 0, 0.792 13, 0, 14, 23, 0, 0.876 13, 0, 14, 0, 0, 0.876 13, 14, 14, 23, 0, 0.768 13, 14, 14, 0, 0, 0.789 15, 2048, 14, 23, 0, 0.64 15, 2048, 14, 0, 0, 0.64 15, 2062, 14, 23, 0, 0.579 15, 2062, 14, 0, 0, 0.582 13, 2048, 14, 23, 0, 0.75 13, 2048, 14, 0, 0, 0.75 13, 2062, 14, 23, 0, 0.668 13, 2062, 14, 0, 0, 0.681 13, 4081, 14, 23, 0, 0.694 13, 4081, 14, 0, 0, 0.699 15, 4081, 14, 23, 0, 0.565 15, 4081, 14, 0, 0, 0.565 15, 0, 1, 23, 0, 0.875 15, 0, 2, 0, 0, 0.875 16, 0, 15, 23, 0, 0.875 16, 0, 15, 0, 0, 0.875 16, 15, 15, 23, 0, 0.793 16, 15, 15, 0, 0, 0.79 14, 0, 15, 23, 0, 0.875 14, 0, 15, 0, 0, 0.875 14, 15, 15, 23, 0, 0.758 14, 15, 15, 0, 0, 0.776 16, 2048, 15, 23, 0, 0.64 16, 2048, 15, 0, 0, 0.64 16, 2063, 15, 23, 0, 0.578 16, 2063, 15, 0, 0, 0.579 14, 2048, 15, 23, 0, 0.75 14, 2048, 15, 0, 0, 0.75 14, 2063, 15, 23, 0, 0.661 14, 2063, 15, 0, 0, 0.678 14, 4081, 15, 23, 0, 0.688 14, 4081, 15, 0, 0, 0.706 16, 4081, 15, 23, 0, 0.887 16, 4081, 15, 0, 0, 0.888 16, 0, 1, 23, 0, 0.874 16, 0, 2, 0, 0, 0.874 17, 0, 16, 23, 0, 0.875 17, 0, 16, 0, 0, 0.874 17, 16, 16, 23, 0, 0.556 17, 16, 16, 0, 0, 0.556 15, 0, 16, 23, 0, 0.875 15, 0, 16, 0, 0, 0.875 15, 16, 16, 23, 0, 0.811 15, 16, 16, 0, 0, 0.813 17, 2048, 16, 23, 0, 0.64 17, 2048, 16, 0, 0, 0.64 17, 2064, 16, 23, 0, 0.556 17, 2064, 16, 0, 0, 0.556 15, 2048, 16, 23, 0, 0.75 15, 2048, 16, 0, 0, 0.75 15, 2064, 16, 23, 0, 0.693 15, 2064, 16, 0, 0, 0.694 15, 4081, 16, 23, 0, 0.709 15, 4081, 16, 0, 0, 0.709 17, 4081, 16, 23, 0, 0.889 17, 4081, 16, 0, 0, 0.889 17, 0, 1, 23, 0, 0.875 17, 0, 2, 0, 0, 0.875 18, 0, 17, 23, 0, 0.875 18, 0, 17, 0, 0, 0.875 18, 17, 17, 23, 0, 0.556 18, 17, 17, 0, 0, 0.556 16, 0, 17, 23, 0, 0.875 16, 0, 17, 0, 0, 0.875 16, 17, 17, 23, 0, 0.666 16, 17, 17, 0, 0, 0.666 18, 2048, 17, 23, 0, 0.64 18, 2048, 17, 0, 0, 0.64 18, 2065, 17, 23, 0, 0.556 18, 2065, 17, 0, 0, 0.556 16, 2048, 17, 23, 0, 0.75 16, 2048, 17, 0, 0, 0.75 16, 2065, 17, 23, 0, 0.666 16, 2065, 17, 0, 0, 0.659 16, 4081, 17, 23, 0, 0.999 16, 4081, 17, 0, 0, 0.999 18, 4081, 17, 23, 0, 0.889 18, 4081, 17, 0, 0, 0.889 18, 0, 1, 23, 0, 0.875 18, 0, 2, 0, 0, 0.875 19, 0, 18, 23, 0, 0.875 19, 0, 18, 0, 0, 0.875 19, 18, 18, 23, 0, 0.556 19, 18, 18, 0, 0, 0.556 17, 0, 18, 23, 0, 0.875 17, 0, 18, 0, 0, 0.875 17, 18, 18, 23, 0, 0.662 17, 18, 18, 0, 0, 0.666 19, 2048, 18, 23, 0, 0.64 19, 2048, 18, 0, 0, 0.64 19, 2066, 18, 23, 0, 0.556 19, 2066, 18, 0, 0, 0.556 17, 2048, 18, 23, 0, 0.75 17, 2048, 18, 0, 0, 0.75 17, 2066, 18, 23, 0, 0.656 17, 2066, 18, 0, 0, 0.666 17, 4081, 18, 23, 0, 0.973 17, 4081, 18, 0, 0, 0.999 19, 4081, 18, 23, 0, 0.889 19, 4081, 18, 0, 0, 0.889 19, 0, 1, 23, 0, 0.875 19, 0, 2, 0, 0, 0.875 20, 0, 19, 23, 0, 0.875 20, 0, 19, 0, 0, 0.875 20, 19, 19, 23, 0, 0.556 20, 19, 19, 0, 0, 0.556 18, 0, 19, 23, 0, 0.875 18, 0, 19, 0, 0, 0.875 18, 19, 19, 23, 0, 0.666 18, 19, 19, 0, 0, 0.666 20, 2048, 19, 23, 0, 0.64 20, 2048, 19, 0, 0, 0.64 20, 2067, 19, 23, 0, 0.556 20, 2067, 19, 0, 0, 0.556 18, 2048, 19, 23, 0, 0.75 18, 2048, 19, 0, 0, 0.75 18, 2067, 19, 23, 0, 0.666 18, 2067, 19, 0, 0, 0.666 18, 4081, 19, 23, 0, 0.999 18, 4081, 19, 0, 0, 0.999 20, 4081, 19, 23, 0, 0.889 20, 4081, 19, 0, 0, 0.889 20, 0, 1, 23, 0, 0.875 20, 0, 2, 0, 0, 0.875 21, 0, 20, 23, 0, 0.875 21, 0, 20, 0, 0, 0.875 21, 20, 20, 23, 0, 0.556 21, 
20, 20, 0, 0, 0.556 19, 0, 20, 23, 0, 0.875 19, 0, 20, 0, 0, 0.875 19, 20, 20, 23, 0, 0.657 19, 20, 20, 0, 0, 0.666 21, 2048, 20, 23, 0, 0.64 21, 2048, 20, 0, 0, 0.64 21, 2068, 20, 23, 0, 0.556 21, 2068, 20, 0, 0, 0.556 19, 2048, 20, 23, 0, 0.75 19, 2048, 20, 0, 0, 0.75 19, 2068, 20, 23, 0, 0.666 19, 2068, 20, 0, 0, 0.659 19, 4081, 20, 23, 0, 0.999 19, 4081, 20, 0, 0, 0.989 21, 4081, 20, 23, 0, 0.889 21, 4081, 20, 0, 0, 0.889 21, 0, 1, 23, 0, 0.875 21, 0, 2, 0, 0, 0.875 22, 0, 21, 23, 0, 0.875 22, 0, 21, 0, 0, 0.875 22, 21, 21, 23, 0, 0.556 22, 21, 21, 0, 0, 0.556 20, 0, 21, 23, 0, 0.914 20, 0, 21, 0, 0, 0.903 20, 21, 21, 23, 0, 0.666 20, 21, 21, 0, 0, 0.666 22, 2048, 21, 23, 0, 0.64 22, 2048, 21, 0, 0, 0.64 22, 2069, 21, 23, 0, 0.556 22, 2069, 21, 0, 0, 0.556 20, 2048, 21, 23, 0, 0.75 20, 2048, 21, 0, 0, 0.75 20, 2069, 21, 23, 0, 0.666 20, 2069, 21, 0, 0, 0.666 20, 4081, 21, 23, 0, 0.974 20, 4081, 21, 0, 0, 0.983 22, 4081, 21, 23, 0, 0.889 22, 4081, 21, 0, 0, 0.889 22, 0, 1, 23, 0, 0.875 22, 0, 2, 0, 0, 0.875 23, 0, 22, 23, 0, 0.875 23, 0, 22, 0, 0, 0.875 23, 22, 22, 23, 0, 0.556 23, 22, 22, 0, 0, 0.556 21, 0, 22, 23, 0, 0.932 21, 0, 22, 0, 0, 0.93 21, 22, 22, 23, 0, 0.666 21, 22, 22, 0, 0, 0.666 23, 2048, 22, 23, 0, 0.64 23, 2048, 22, 0, 0, 0.64 23, 2070, 22, 23, 0, 0.556 23, 2070, 22, 0, 0, 0.556 21, 2048, 22, 23, 0, 0.75 21, 2048, 22, 0, 0, 0.75 21, 2070, 22, 23, 0, 0.666 21, 2070, 22, 0, 0, 0.652 21, 4081, 22, 23, 0, 0.999 21, 4081, 22, 0, 0, 0.999 23, 4081, 22, 23, 0, 0.889 23, 4081, 22, 0, 0, 0.889 23, 0, 1, 23, 0, 0.875 23, 0, 2, 0, 0, 0.875 24, 0, 23, 23, 0, 0.875 24, 0, 23, 0, 0, 0.875 24, 23, 23, 23, 0, 0.556 24, 23, 23, 0, 0, 0.556 22, 0, 23, 23, 0, 0.92 22, 0, 23, 0, 0, 0.92 22, 23, 23, 23, 0, 0.66 22, 23, 23, 0, 0, 0.662 24, 2048, 23, 23, 0, 0.64 24, 2048, 23, 0, 0, 0.64 24, 2071, 23, 23, 0, 0.556 24, 2071, 23, 0, 0, 0.556 22, 2048, 23, 23, 0, 0.75 22, 2048, 23, 0, 0, 0.75 22, 2071, 23, 23, 0, 0.654 22, 2071, 23, 0, 0, 0.666 22, 4081, 23, 23, 0, 0.979 22, 4081, 23, 0, 0, 0.994 24, 4081, 23, 23, 0, 0.889 24, 4081, 23, 0, 0, 0.889 24, 0, 1, 23, 0, 0.875 24, 0, 2, 0, 0, 0.875 25, 0, 24, 23, 0, 0.875 25, 0, 24, 0, 0, 0.875 25, 24, 24, 23, 0, 0.556 25, 24, 24, 0, 0, 0.556 23, 0, 24, 23, 0, 0.921 23, 0, 24, 0, 0, 0.909 23, 24, 24, 23, 0, 0.663 23, 24, 24, 0, 0, 0.666 25, 2048, 24, 23, 0, 0.64 25, 2048, 24, 0, 0, 0.64 25, 2072, 24, 23, 0, 0.556 25, 2072, 24, 0, 0, 0.556 23, 2048, 24, 23, 0, 0.75 23, 2048, 24, 0, 0, 0.75 23, 2072, 24, 23, 0, 0.658 23, 2072, 24, 0, 0, 0.666 23, 4081, 24, 23, 0, 0.999 23, 4081, 24, 0, 0, 0.999 25, 4081, 24, 23, 0, 0.889 25, 4081, 24, 0, 0, 0.889 25, 0, 1, 23, 0, 0.875 25, 0, 2, 0, 0, 0.875 26, 0, 25, 23, 0, 0.875 26, 0, 25, 0, 0, 0.875 26, 25, 25, 23, 0, 0.556 26, 25, 25, 0, 0, 0.556 24, 0, 25, 23, 0, 0.92 24, 0, 25, 0, 0, 0.92 24, 25, 25, 23, 0, 0.666 24, 25, 25, 0, 0, 0.666 26, 2048, 25, 23, 0, 0.64 26, 2048, 25, 0, 0, 0.64 26, 2073, 25, 23, 0, 0.556 26, 2073, 25, 0, 0, 0.556 24, 2048, 25, 23, 0, 0.75 24, 2048, 25, 0, 0, 0.75 24, 2073, 25, 23, 0, 0.666 24, 2073, 25, 0, 0, 0.666 24, 4081, 25, 23, 0, 0.999 24, 4081, 25, 0, 0, 0.999 26, 4081, 25, 23, 0, 0.889 26, 4081, 25, 0, 0, 0.889 26, 0, 1, 23, 0, 0.875 26, 0, 2, 0, 0, 0.875 27, 0, 26, 23, 0, 0.875 27, 0, 26, 0, 0, 0.875 27, 26, 26, 23, 0, 0.556 27, 26, 26, 0, 0, 0.556 25, 0, 26, 23, 0, 0.992 25, 0, 26, 0, 0, 0.992 25, 26, 26, 23, 0, 0.664 25, 26, 26, 0, 0, 0.663 27, 2048, 26, 23, 0, 0.64 27, 2048, 26, 0, 0, 0.64 27, 2074, 26, 23, 0, 0.556 27, 2074, 26, 0, 0, 0.556 25, 2048, 26, 23, 0, 0.75 25, 
2048, 26, 0, 0, 0.75 25, 2074, 26, 23, 0, 0.651 25, 2074, 26, 0, 0, 0.666 25, 4081, 26, 23, 0, 0.994 25, 4081, 26, 0, 0, 0.999 27, 4081, 26, 23, 0, 0.889 27, 4081, 26, 0, 0, 0.889 27, 0, 1, 23, 0, 0.875 27, 0, 2, 0, 0, 0.875 28, 0, 27, 23, 0, 0.875 28, 0, 27, 0, 0, 0.875 28, 27, 27, 23, 0, 0.556 28, 27, 27, 0, 0, 0.556 26, 0, 27, 23, 0, 0.98 26, 0, 27, 0, 0, 0.98 26, 27, 27, 23, 0, 0.645 26, 27, 27, 0, 0, 0.656 28, 2048, 27, 23, 0, 0.64 28, 2048, 27, 0, 0, 0.64 28, 2075, 27, 23, 0, 0.556 28, 2075, 27, 0, 0, 0.556 26, 2048, 27, 23, 0, 0.75 26, 2048, 27, 0, 0, 0.75 26, 2075, 27, 23, 0, 0.665 26, 2075, 27, 0, 0, 0.666 26, 4081, 27, 23, 0, 0.996 26, 4081, 27, 0, 0, 0.977 28, 4081, 27, 23, 0, 0.889 28, 4081, 27, 0, 0, 0.889 28, 0, 1, 23, 0, 0.875 28, 0, 2, 0, 0, 0.875 29, 0, 28, 23, 0, 0.875 29, 0, 28, 0, 0, 0.875 29, 28, 28, 23, 0, 0.556 29, 28, 28, 0, 0, 0.556 27, 0, 28, 23, 0, 0.99 27, 0, 28, 0, 0, 0.975 27, 28, 28, 23, 0, 0.657 27, 28, 28, 0, 0, 0.663 29, 2048, 28, 23, 0, 0.64 29, 2048, 28, 0, 0, 0.64 29, 2076, 28, 23, 0, 0.556 29, 2076, 28, 0, 0, 0.556 27, 2048, 28, 23, 0, 0.75 27, 2048, 28, 0, 0, 0.75 27, 2076, 28, 23, 0, 0.653 27, 2076, 28, 0, 0, 0.656 27, 4081, 28, 23, 0, 0.992 27, 4081, 28, 0, 0, 0.984 29, 4081, 28, 23, 0, 0.889 29, 4081, 28, 0, 0, 0.889 29, 0, 1, 23, 0, 0.875 29, 0, 2, 0, 0, 0.875 30, 0, 29, 23, 0, 0.875 30, 0, 29, 0, 0, 0.875 30, 29, 29, 23, 0, 0.556 30, 29, 29, 0, 0, 0.556 28, 0, 29, 23, 0, 0.98 28, 0, 29, 0, 0, 0.951 28, 29, 29, 23, 0, 0.656 28, 29, 29, 0, 0, 0.65 30, 2048, 29, 23, 0, 0.64 30, 2048, 29, 0, 0, 0.64 30, 2077, 29, 23, 0, 0.556 30, 2077, 29, 0, 0, 0.555 28, 2048, 29, 23, 0, 0.749 28, 2048, 29, 0, 0, 0.749 28, 2077, 29, 23, 0, 0.656 28, 2077, 29, 0, 0, 0.657 28, 4081, 29, 23, 0, 0.986 28, 4081, 29, 0, 0, 0.978 30, 4081, 29, 23, 0, 0.886 30, 4081, 29, 0, 0, 0.887 30, 0, 1, 23, 0, 0.873 30, 0, 2, 0, 0, 0.873 31, 0, 30, 23, 0, 0.873 31, 0, 30, 0, 0, 0.871 31, 30, 30, 23, 0, 0.554 31, 30, 30, 0, 0, 0.554 29, 0, 30, 23, 0, 0.932 29, 0, 30, 0, 0, 0.927 29, 30, 30, 23, 0, 0.655 29, 30, 30, 0, 0, 0.659 31, 2048, 30, 23, 0, 0.637 31, 2048, 30, 0, 0, 0.638 31, 2078, 30, 23, 0, 0.554 31, 2078, 30, 0, 0, 0.553 29, 2048, 30, 23, 0, 0.746 29, 2048, 30, 0, 0, 0.746 29, 2078, 30, 23, 0, 0.649 29, 2078, 30, 0, 0, 0.658 29, 4081, 30, 23, 0, 0.98 29, 4081, 30, 0, 0, 0.984 31, 4081, 30, 23, 0, 0.883 31, 4081, 30, 0, 0, 0.884 31, 0, 1, 23, 0, 0.87 31, 0, 2, 0, 0, 0.87 32, 0, 31, 23, 0, 0.87 32, 0, 31, 0, 0, 0.869 32, 31, 31, 23, 0, 0.553 32, 31, 31, 0, 0, 0.553 30, 0, 31, 23, 0, 0.977 30, 0, 31, 0, 0, 0.975 30, 31, 31, 23, 0, 0.66 30, 31, 31, 0, 0, 0.658 32, 2048, 31, 23, 0, 0.622 32, 2048, 31, 0, 0, 0.622 32, 2079, 31, 23, 0, 0.553 32, 2079, 31, 0, 0, 0.552 30, 2048, 31, 23, 0, 0.745 30, 2048, 31, 0, 0, 0.744 30, 2079, 31, 23, 0, 0.659 30, 2079, 31, 0, 0, 0.66 30, 4081, 31, 23, 0, 0.972 30, 4081, 31, 0, 0, 0.972 32, 4081, 31, 23, 0, 0.881 32, 4081, 31, 0, 0, 0.881 32, 0, 1, 23, 0, 0.868 32, 0, 2, 0, 0, 0.868 2048, 0, 32, 23, 1, 1.158 256, 1, 64, 23, 1, 0.83 2048, 0, 32, 0, 1, 1.158 256, 1, 64, 0, 1, 0.83 256, 4081, 64, 0, 1, 0.873 256, 0, 1, 23, 1, 1.158 256, 0, 1, 0, 1, 1.157 256, 1, 1, 23, 1, 1.158 256, 1, 1, 0, 1, 1.158 2048, 0, 64, 23, 1, 1.133 256, 2, 64, 23, 1, 0.835 2048, 0, 64, 0, 1, 1.132 256, 2, 64, 0, 1, 0.835 256, 0, 2, 23, 1, 1.16 256, 0, 2, 0, 1, 1.161 256, 2, 2, 23, 1, 1.161 256, 2, 2, 0, 1, 1.161 2048, 0, 128, 23, 1, 1.023 256, 3, 64, 23, 1, 0.833 2048, 0, 128, 0, 1, 1.025 256, 3, 64, 0, 1, 0.835 256, 0, 3, 23, 1, 1.167 256, 0, 3, 0, 1, 1.167 256, 3, 3, 23, 
1, 1.167 256, 3, 3, 0, 1, 1.167 2048, 0, 256, 23, 1, 0.993 256, 4, 64, 23, 1, 0.836 2048, 0, 256, 0, 1, 0.994 256, 4, 64, 0, 1, 0.836 256, 0, 4, 23, 1, 1.167 256, 0, 4, 0, 1, 1.167 256, 4, 4, 23, 1, 1.167 256, 4, 4, 0, 1, 1.167 2048, 0, 512, 23, 1, 1.065 256, 5, 64, 23, 1, 0.836 2048, 0, 512, 0, 1, 1.057 256, 5, 64, 0, 1, 0.836 256, 0, 5, 23, 1, 1.167 256, 0, 5, 0, 1, 1.167 256, 5, 5, 23, 1, 1.167 256, 5, 5, 0, 1, 1.167 2048, 0, 1024, 23, 1, 1.034 256, 6, 64, 23, 1, 0.836 2048, 0, 1024, 0, 1, 1.032 256, 6, 64, 0, 1, 0.836 256, 0, 6, 23, 1, 1.167 256, 0, 6, 0, 1, 1.167 256, 6, 6, 23, 1, 1.167 256, 6, 6, 0, 1, 1.167 2048, 0, 2048, 23, 1, 0.901 256, 7, 64, 23, 1, 0.836 2048, 0, 2048, 0, 1, 0.901 256, 7, 64, 0, 1, 0.835 256, 0, 7, 23, 1, 1.165 256, 0, 7, 0, 1, 1.165 256, 7, 7, 23, 1, 1.165 256, 7, 7, 0, 1, 1.165 192, 1, 32, 23, 1, 1.165 192, 1, 32, 0, 1, 1.165 256, 1, 32, 23, 1, 1.165 256, 1, 32, 0, 1, 1.165 512, 1, 32, 23, 1, 1.165 512, 1, 32, 0, 1, 1.165 256, 4081, 32, 23, 1, 1.165 192, 2, 64, 23, 1, 0.835 192, 2, 64, 0, 1, 0.835 512, 2, 64, 23, 1, 0.836 512, 2, 64, 0, 1, 0.836 256, 4081, 64, 23, 1, 0.874 192, 3, 96, 23, 1, 0.847 192, 3, 96, 0, 1, 0.847 256, 3, 96, 23, 1, 0.847 256, 3, 96, 0, 1, 0.847 512, 3, 96, 23, 1, 0.847 512, 3, 96, 0, 1, 0.847 256, 4081, 96, 23, 1, 0.879 192, 4, 128, 23, 1, 0.851 192, 4, 128, 0, 1, 0.851 256, 4, 128, 23, 1, 0.852 256, 4, 128, 0, 1, 0.852 512, 4, 128, 23, 1, 0.851 512, 4, 128, 0, 1, 0.851 256, 4081, 128, 23, 1, 0.862 192, 5, 160, 23, 1, 0.619 192, 5, 160, 0, 1, 0.618 256, 5, 160, 23, 1, 0.781 256, 5, 160, 0, 1, 0.779 512, 5, 160, 23, 1, 0.936 512, 5, 160, 0, 1, 0.937 256, 4081, 160, 23, 1, 0.616 192, 6, 192, 23, 1, 0.695 192, 6, 192, 0, 1, 0.695 256, 6, 192, 23, 1, 0.77 256, 6, 192, 0, 1, 0.771 512, 6, 192, 23, 1, 0.94 512, 6, 192, 0, 1, 0.942 256, 4081, 192, 23, 1, 0.643 192, 7, 224, 23, 1, 0.693 192, 7, 224, 0, 1, 0.694 256, 7, 224, 23, 1, 0.783 256, 7, 224, 0, 1, 0.782 512, 7, 224, 23, 1, 0.945 512, 7, 224, 0, 1, 0.946 256, 4081, 224, 23, 1, 0.728 2, 0, 1, 23, 1, 0.87 2, 0, 1, 0, 1, 0.872 2, 1, 1, 23, 1, 0.793 2, 1, 1, 0, 1, 0.792 0, 0, 1, 23, 1, 0.854 0, 0, 1, 0, 1, 0.854 0, 1, 1, 23, 1, 0.855 0, 1, 1, 0, 1, 0.854 2, 2048, 1, 23, 1, 0.639 2, 2048, 1, 0, 1, 0.638 2, 2049, 1, 23, 1, 0.581 2, 2049, 1, 0, 1, 0.58 0, 2048, 1, 23, 1, 0.854 0, 2048, 1, 0, 1, 0.854 0, 2049, 1, 23, 1, 0.854 0, 2049, 1, 0, 1, 0.854 0, 4081, 1, 23, 1, 0.854 0, 4081, 1, 0, 1, 0.854 2, 4081, 1, 23, 1, 0.567 2, 4081, 1, 0, 1, 0.567 2, 0, 2, 0, 1, 0.922 3, 0, 2, 23, 1, 0.872 3, 0, 2, 0, 1, 0.87 3, 2, 2, 23, 1, 0.793 3, 2, 2, 0, 1, 0.794 1, 0, 2, 23, 1, 0.874 1, 0, 2, 0, 1, 0.873 1, 2, 2, 23, 1, 0.829 1, 2, 2, 0, 1, 0.848 3, 2048, 2, 23, 1, 0.638 3, 2048, 2, 0, 1, 0.638 3, 2050, 2, 23, 1, 0.58 3, 2050, 2, 0, 1, 0.58 1, 2048, 2, 23, 1, 0.747 1, 2048, 2, 0, 1, 0.747 1, 2050, 2, 23, 1, 0.687 1, 2050, 2, 0, 1, 0.691 1, 4081, 2, 23, 1, 0.707 1, 4081, 2, 0, 1, 0.723 3, 4081, 2, 23, 1, 0.565 3, 4081, 2, 0, 1, 0.566 3, 0, 1, 23, 1, 0.873 4, 0, 3, 23, 1, 0.873 4, 0, 3, 0, 1, 0.873 4, 3, 3, 23, 1, 0.794 4, 3, 3, 0, 1, 0.793 2, 0, 3, 23, 1, 0.874 2, 0, 3, 0, 1, 0.874 2, 3, 3, 23, 1, 0.821 2, 3, 3, 0, 1, 0.828 4, 2048, 3, 23, 1, 0.638 4, 2048, 3, 0, 1, 0.638 4, 2051, 3, 23, 1, 0.581 4, 2051, 3, 0, 1, 0.581 2, 2048, 3, 23, 1, 0.747 2, 2048, 3, 0, 1, 0.747 2, 2051, 3, 23, 1, 0.686 2, 2051, 3, 0, 1, 0.689 2, 4081, 3, 23, 1, 0.702 2, 4081, 3, 0, 1, 0.702 4, 4081, 3, 23, 1, 0.567 4, 4081, 3, 0, 1, 0.568 4, 0, 1, 23, 1, 0.874 4, 0, 2, 0, 1, 0.875 5, 0, 4, 23, 1, 0.875 5, 0, 4, 0, 1, 0.875 5, 4, 4, 
23, 1, 0.796 5, 4, 4, 0, 1, 0.794 3, 0, 4, 23, 1, 0.874 3, 0, 4, 0, 1, 0.875 3, 4, 4, 23, 1, 0.783 3, 4, 4, 0, 1, 0.795 5, 2048, 4, 23, 1, 0.639 5, 2048, 4, 0, 1, 0.64 5, 2052, 4, 23, 1, 0.581 5, 2052, 4, 0, 1, 0.581 3, 2048, 4, 23, 1, 0.749 3, 2048, 4, 0, 1, 0.748 3, 2052, 4, 23, 1, 0.694 3, 2052, 4, 0, 1, 0.702 3, 4081, 4, 23, 1, 0.701 3, 4081, 4, 0, 1, 0.701 5, 4081, 4, 23, 1, 0.566 5, 4081, 4, 0, 1, 0.567 5, 0, 1, 23, 1, 0.873 5, 0, 2, 0, 1, 0.874 6, 0, 5, 23, 1, 0.874 6, 0, 5, 0, 1, 0.875 6, 5, 5, 23, 1, 0.792 6, 5, 5, 0, 1, 0.795 4, 0, 5, 23, 1, 0.875 4, 0, 5, 0, 1, 0.875 4, 5, 5, 23, 1, 0.804 4, 5, 5, 0, 1, 0.804 6, 2048, 5, 23, 1, 0.64 6, 2048, 5, 0, 1, 0.64 6, 2053, 5, 23, 1, 0.581 6, 2053, 5, 0, 1, 0.581 4, 2048, 5, 23, 1, 0.75 4, 2048, 5, 0, 1, 0.75 4, 2053, 5, 23, 1, 0.709 4, 2053, 5, 0, 1, 0.701 4, 4081, 5, 23, 1, 0.693 4, 4081, 5, 0, 1, 0.7 6, 4081, 5, 23, 1, 0.566 6, 4081, 5, 0, 1, 0.566 6, 0, 1, 23, 1, 0.874 6, 0, 2, 0, 1, 0.874 7, 0, 6, 23, 1, 0.874 7, 0, 6, 0, 1, 0.874 7, 6, 6, 23, 1, 0.793 7, 6, 6, 0, 1, 0.793 5, 0, 6, 23, 1, 0.874 5, 0, 6, 0, 1, 0.874 5, 6, 6, 23, 1, 0.803 5, 6, 6, 0, 1, 0.821 7, 2048, 6, 23, 1, 0.639 7, 2048, 6, 0, 1, 0.639 7, 2054, 6, 23, 1, 0.579 7, 2054, 6, 0, 1, 0.581 5, 2048, 6, 23, 1, 0.749 5, 2048, 6, 0, 1, 0.749 5, 2054, 6, 23, 1, 0.685 5, 2054, 6, 0, 1, 0.693 5, 4081, 6, 23, 1, 0.701 5, 4081, 6, 0, 1, 0.708 7, 4081, 6, 23, 1, 0.566 7, 4081, 6, 0, 1, 0.565 7, 0, 1, 23, 1, 0.874 7, 0, 2, 0, 1, 0.874 8, 0, 7, 23, 1, 0.874 8, 0, 7, 0, 1, 0.874 8, 7, 7, 23, 1, 0.796 8, 7, 7, 0, 1, 0.793 6, 0, 7, 23, 1, 0.875 6, 0, 7, 0, 1, 0.875 6, 7, 7, 23, 1, 0.795 6, 7, 7, 0, 1, 0.786 8, 2048, 7, 23, 1, 0.64 8, 2048, 7, 0, 1, 0.64 8, 2055, 7, 23, 1, 0.579 8, 2055, 7, 0, 1, 0.581 6, 2048, 7, 23, 1, 0.75 6, 2048, 7, 0, 1, 0.75 6, 2055, 7, 23, 1, 0.701 6, 2055, 7, 0, 1, 0.699 6, 4081, 7, 23, 1, 0.701 6, 4081, 7, 0, 1, 0.695 8, 4081, 7, 23, 1, 0.566 8, 4081, 7, 0, 1, 0.567 8, 0, 1, 23, 1, 0.875 8, 0, 2, 0, 1, 0.875 9, 0, 8, 23, 1, 0.875 9, 0, 8, 0, 1, 0.875 9, 8, 8, 23, 1, 0.795 9, 8, 8, 0, 1, 0.794 7, 0, 8, 23, 1, 0.875 7, 0, 8, 0, 1, 0.875 7, 8, 8, 23, 1, 0.773 7, 8, 8, 0, 1, 0.79 9, 2048, 8, 23, 1, 0.64 9, 2048, 8, 0, 1, 0.64 9, 2056, 8, 23, 1, 0.581 9, 2056, 8, 0, 1, 0.581 7, 2048, 8, 23, 1, 0.75 7, 2048, 8, 0, 1, 0.75 7, 2056, 8, 23, 1, 0.701 7, 2056, 8, 0, 1, 0.701 7, 4081, 8, 23, 1, 0.691 7, 4081, 8, 0, 1, 0.701 9, 4081, 8, 23, 1, 0.567 9, 4081, 8, 0, 1, 0.567 9, 0, 1, 23, 1, 0.875 9, 0, 2, 0, 1, 0.875 10, 0, 9, 23, 1, 0.875 10, 0, 9, 0, 1, 0.875 10, 9, 9, 23, 1, 0.794 10, 9, 9, 0, 1, 0.796 8, 0, 9, 23, 1, 0.875 8, 0, 9, 0, 1, 0.875 8, 9, 9, 23, 1, 0.804 8, 9, 9, 0, 1, 0.804 10, 2048, 9, 23, 1, 0.64 10, 2048, 9, 0, 1, 0.64 10, 2057, 9, 23, 1, 0.58 10, 2057, 9, 0, 1, 0.581 8, 2048, 9, 23, 1, 0.75 8, 2048, 9, 0, 1, 0.75 8, 2057, 9, 23, 1, 0.694 8, 2057, 9, 0, 1, 0.709 8, 4081, 9, 23, 1, 0.693 8, 4081, 9, 0, 1, 0.712 10, 4081, 9, 23, 1, 0.565 10, 4081, 9, 0, 1, 0.567 10, 0, 1, 23, 1, 0.875 10, 0, 2, 0, 1, 0.875 11, 0, 10, 23, 1, 0.875 11, 0, 10, 0, 1, 0.875 11, 10, 10, 23, 1, 0.795 11, 10, 10, 0, 1, 0.794 9, 0, 10, 23, 1, 0.875 9, 0, 10, 0, 1, 0.875 9, 10, 10, 23, 1, 0.804 9, 10, 10, 0, 1, 0.804 11, 2048, 10, 23, 1, 0.64 11, 2048, 10, 0, 1, 0.64 11, 2058, 10, 23, 1, 0.58 11, 2058, 10, 0, 1, 0.581 9, 2048, 10, 23, 1, 0.75 9, 2048, 10, 0, 1, 0.75 9, 2058, 10, 23, 1, 0.671 9, 2058, 10, 0, 1, 0.671 9, 4081, 10, 23, 1, 0.678 9, 4081, 10, 0, 1, 0.693 11, 4081, 10, 23, 1, 0.567 11, 4081, 10, 0, 1, 0.565 11, 0, 1, 23, 1, 0.875 11, 0, 2, 0, 1, 0.875 12, 0, 11, 23, 1, 
0.875 12, 0, 11, 0, 1, 0.875 12, 11, 11, 23, 1, 0.795 12, 11, 11, 0, 1, 0.794 10, 0, 11, 23, 1, 0.875 10, 0, 11, 0, 1, 0.875 10, 11, 11, 23, 1, 0.791 10, 11, 11, 0, 1, 0.791 12, 2048, 11, 23, 1, 0.64 12, 2048, 11, 0, 1, 0.64 12, 2059, 11, 23, 1, 0.58 12, 2059, 11, 0, 1, 0.581 10, 2048, 11, 23, 1, 0.75 10, 2048, 11, 0, 1, 0.75 10, 2059, 11, 23, 1, 0.689 10, 2059, 11, 0, 1, 0.686 10, 4081, 11, 23, 1, 0.686 10, 4081, 11, 0, 1, 0.709 12, 4081, 11, 23, 1, 0.566 12, 4081, 11, 0, 1, 0.567 12, 0, 1, 23, 1, 0.875 12, 0, 2, 0, 1, 0.875 13, 0, 12, 23, 1, 0.875 13, 0, 12, 0, 1, 0.875 13, 12, 12, 23, 1, 0.794 13, 12, 12, 0, 1, 0.794 11, 0, 12, 23, 1, 0.875 11, 0, 12, 0, 1, 0.875 11, 12, 12, 23, 1, 0.804 11, 12, 12, 0, 1, 0.795 13, 2048, 12, 23, 1, 0.64 13, 2048, 12, 0, 1, 0.64 13, 2060, 12, 23, 1, 0.58 13, 2060, 12, 0, 1, 0.58 11, 2048, 12, 23, 1, 0.75 11, 2048, 12, 0, 1, 0.75 11, 2060, 12, 23, 1, 0.693 11, 2060, 12, 0, 1, 0.701 11, 4081, 12, 23, 1, 0.717 11, 4081, 12, 0, 1, 0.725 13, 4081, 12, 23, 1, 0.566 13, 4081, 12, 0, 1, 0.567 13, 0, 1, 23, 1, 0.875 13, 0, 2, 0, 1, 0.875 14, 0, 13, 23, 1, 0.875 14, 0, 13, 0, 1, 0.875 14, 13, 13, 23, 1, 0.792 14, 13, 13, 0, 1, 0.792 12, 0, 13, 23, 1, 0.875 12, 0, 13, 0, 1, 0.875 12, 13, 13, 23, 1, 0.782 12, 13, 13, 0, 1, 0.804 14, 2048, 13, 23, 1, 0.64 14, 2048, 13, 0, 1, 0.64 14, 2061, 13, 23, 1, 0.579 14, 2061, 13, 0, 1, 0.579 12, 2048, 13, 23, 1, 0.75 12, 2048, 13, 0, 1, 0.75 12, 2061, 13, 23, 1, 0.701 12, 2061, 13, 0, 1, 0.701 12, 4081, 13, 23, 1, 0.705 12, 4081, 13, 0, 1, 0.733 14, 4081, 13, 23, 1, 0.565 14, 4081, 13, 0, 1, 0.565 14, 0, 1, 23, 1, 0.875 14, 0, 2, 0, 1, 0.875 15, 0, 14, 23, 1, 0.875 15, 0, 14, 0, 1, 0.875 15, 14, 14, 23, 1, 0.795 15, 14, 14, 0, 1, 0.794 13, 0, 14, 23, 1, 0.875 13, 0, 14, 0, 1, 0.875 13, 14, 14, 23, 1, 0.804 13, 14, 14, 0, 1, 0.813 15, 2048, 14, 23, 1, 0.64 15, 2048, 14, 0, 1, 0.64 15, 2062, 14, 23, 1, 0.579 15, 2062, 14, 0, 1, 0.58 13, 2048, 14, 23, 1, 0.75 13, 2048, 14, 0, 1, 0.75 13, 2062, 14, 23, 1, 0.705 13, 2062, 14, 0, 1, 0.701 13, 4081, 14, 23, 1, 0.705 13, 4081, 14, 0, 1, 0.733 15, 4081, 14, 23, 1, 0.565 15, 4081, 14, 0, 1, 0.568 15, 0, 1, 23, 1, 0.875 15, 0, 2, 0, 1, 0.875 16, 0, 15, 23, 1, 0.875 16, 0, 15, 0, 1, 0.875 16, 15, 15, 23, 1, 0.795 16, 15, 15, 0, 1, 0.796 14, 0, 15, 23, 1, 0.875 14, 0, 15, 0, 1, 0.875 14, 15, 15, 23, 1, 0.807 14, 15, 15, 0, 1, 0.821 16, 2048, 15, 23, 1, 0.64 16, 2048, 15, 0, 1, 0.64 16, 2063, 15, 23, 1, 0.581 16, 2063, 15, 0, 1, 0.581 14, 2048, 15, 23, 1, 0.75 14, 2048, 15, 0, 1, 0.75 14, 2063, 15, 23, 1, 0.693 14, 2063, 15, 0, 1, 0.685 14, 4081, 15, 23, 1, 0.693 14, 4081, 15, 0, 1, 0.716 16, 4081, 15, 23, 1, 0.862 16, 4081, 15, 0, 1, 0.855 16, 0, 1, 23, 1, 0.875 16, 0, 2, 0, 1, 0.875 17, 0, 16, 23, 1, 0.875 17, 0, 16, 0, 1, 0.875 17, 16, 16, 23, 1, 0.492 17, 16, 16, 0, 1, 0.492 15, 0, 16, 23, 1, 0.875 15, 0, 16, 0, 1, 0.876 15, 16, 16, 23, 1, 0.83 15, 16, 16, 0, 1, 0.841 17, 2048, 16, 23, 1, 0.64 17, 2048, 16, 0, 1, 0.64 17, 2064, 16, 23, 1, 0.492 17, 2064, 16, 0, 1, 0.492 15, 2048, 16, 23, 1, 0.75 15, 2048, 16, 0, 1, 0.75 15, 2064, 16, 23, 1, 0.716 15, 2064, 16, 0, 1, 0.715 15, 4081, 16, 23, 1, 0.716 15, 4081, 16, 0, 1, 0.723 17, 4081, 16, 23, 1, 0.857 17, 4081, 16, 0, 1, 0.856 17, 0, 1, 23, 1, 0.875 17, 0, 2, 0, 1, 0.875 18, 0, 17, 23, 1, 0.875 18, 0, 17, 0, 1, 0.875 18, 17, 17, 23, 1, 0.492 18, 17, 17, 0, 1, 0.492 16, 0, 17, 23, 1, 0.881 16, 0, 17, 0, 1, 0.88 16, 17, 17, 23, 1, 0.661 16, 17, 17, 0, 1, 0.666 18, 2048, 17, 23, 1, 0.64 18, 2048, 17, 0, 1, 0.64 18, 2065, 17, 23, 1, 0.492 
18, 2065, 17, 0, 1, 0.492 16, 2048, 17, 23, 1, 0.75 16, 2048, 17, 0, 1, 0.75 16, 2065, 17, 23, 1, 0.655 16, 2065, 17, 0, 1, 0.664 16, 4081, 17, 23, 1, 0.99 16, 4081, 17, 0, 1, 1.0 18, 4081, 17, 23, 1, 0.863 18, 4081, 17, 0, 1, 0.857 18, 0, 1, 23, 1, 0.879 18, 0, 2, 0, 1, 0.879 19, 0, 18, 23, 1, 0.882 19, 0, 18, 0, 1, 0.881 19, 18, 18, 23, 1, 0.495 19, 18, 18, 0, 1, 0.495 17, 0, 18, 23, 1, 0.885 17, 0, 18, 0, 1, 0.886 17, 18, 18, 23, 1, 0.668 17, 18, 18, 0, 1, 0.66 19, 2048, 18, 23, 1, 0.645 19, 2048, 18, 0, 1, 0.644 19, 2066, 18, 23, 1, 0.496 19, 2066, 18, 0, 1, 0.496 17, 2048, 18, 23, 1, 0.755 17, 2048, 18, 0, 1, 0.756 17, 2066, 18, 23, 1, 0.663 17, 2066, 18, 0, 1, 0.667 17, 4081, 18, 23, 1, 0.997 17, 4081, 18, 0, 1, 0.998 19, 4081, 18, 23, 1, 0.858 19, 4081, 18, 0, 1, 0.858 19, 0, 1, 23, 1, 0.877 19, 0, 2, 0, 1, 0.877 20, 0, 19, 23, 1, 0.876 20, 0, 19, 0, 1, 0.876 20, 19, 19, 23, 1, 0.493 20, 19, 19, 0, 1, 0.493 18, 0, 19, 23, 1, 0.882 18, 0, 19, 0, 1, 0.881 18, 19, 19, 23, 1, 0.659 18, 19, 19, 0, 1, 0.665 20, 2048, 19, 23, 1, 0.64 20, 2048, 19, 0, 1, 0.641 20, 2067, 19, 23, 1, 0.492 20, 2067, 19, 0, 1, 0.492 18, 2048, 19, 23, 1, 0.75 18, 2048, 19, 0, 1, 0.751 18, 2067, 19, 23, 1, 0.648 18, 2067, 19, 0, 1, 0.651 18, 4081, 19, 23, 1, 0.979 18, 4081, 19, 0, 1, 0.993 20, 4081, 19, 23, 1, 0.857 20, 4081, 19, 0, 1, 0.857 20, 0, 1, 23, 1, 0.876 20, 0, 2, 0, 1, 0.876 21, 0, 20, 23, 1, 0.876 21, 0, 20, 0, 1, 0.875 21, 20, 20, 23, 1, 0.492 21, 20, 20, 0, 1, 0.492 19, 0, 20, 23, 1, 0.889 19, 0, 20, 0, 1, 0.89 19, 20, 20, 23, 1, 0.66 19, 20, 20, 0, 1, 0.665 21, 2048, 20, 23, 1, 0.64 21, 2048, 20, 0, 1, 0.64 21, 2068, 20, 23, 1, 0.492 21, 2068, 20, 0, 1, 0.492 19, 2048, 20, 23, 1, 0.75 19, 2048, 20, 0, 1, 0.75 19, 2068, 20, 23, 1, 0.655 19, 2068, 20, 0, 1, 0.649 19, 4081, 20, 23, 1, 0.981 19, 4081, 20, 0, 1, 1.0 21, 4081, 20, 23, 1, 0.858 21, 4081, 20, 0, 1, 0.856 21, 0, 1, 23, 1, 0.877 21, 0, 2, 0, 1, 0.877 22, 0, 21, 23, 1, 0.877 22, 0, 21, 0, 1, 0.876 22, 21, 21, 23, 1, 0.493 22, 21, 21, 0, 1, 0.492 20, 0, 21, 23, 1, 0.878 20, 0, 21, 0, 1, 0.879 20, 21, 21, 23, 1, 0.66 20, 21, 21, 0, 1, 0.66 22, 2048, 21, 23, 1, 0.64 22, 2048, 21, 0, 1, 0.64 22, 2069, 21, 23, 1, 0.493 22, 2069, 21, 0, 1, 0.492 20, 2048, 21, 23, 1, 0.75 20, 2048, 21, 0, 1, 0.75 20, 2069, 21, 23, 1, 0.665 20, 2069, 21, 0, 1, 0.666 20, 4081, 21, 23, 1, 0.985 20, 4081, 21, 0, 1, 0.985 22, 4081, 21, 23, 1, 0.858 22, 4081, 21, 0, 1, 0.856 22, 0, 1, 23, 1, 0.876 22, 0, 2, 0, 1, 0.875 23, 0, 22, 23, 1, 0.875 23, 0, 22, 0, 1, 0.876 23, 22, 22, 23, 1, 0.492 23, 22, 22, 0, 1, 0.492 21, 0, 22, 23, 1, 0.897 21, 0, 22, 0, 1, 0.892 21, 22, 22, 23, 1, 0.659 21, 22, 22, 0, 1, 0.66 23, 2048, 22, 23, 1, 0.639 23, 2048, 22, 0, 1, 0.639 23, 2070, 22, 23, 1, 0.492 23, 2070, 22, 0, 1, 0.492 21, 2048, 22, 23, 1, 0.748 21, 2048, 22, 0, 1, 0.748 21, 2070, 22, 23, 1, 0.65 21, 2070, 22, 0, 1, 0.664 21, 4081, 22, 23, 1, 0.996 21, 4081, 22, 0, 1, 0.995 23, 4081, 22, 23, 1, 0.854 23, 4081, 22, 0, 1, 0.855 23, 0, 1, 23, 1, 0.873 23, 0, 2, 0, 1, 0.873 24, 0, 23, 23, 1, 0.873 24, 0, 23, 0, 1, 0.873 24, 23, 23, 23, 1, 0.491 24, 23, 23, 0, 1, 0.491 22, 0, 23, 23, 1, 0.884 22, 0, 23, 0, 1, 0.884 22, 23, 23, 23, 1, 0.664 22, 23, 23, 0, 1, 0.665 24, 2048, 23, 23, 1, 0.638 24, 2048, 23, 0, 1, 0.639 24, 2071, 23, 23, 1, 0.491 24, 2071, 23, 0, 1, 0.491 22, 2048, 23, 23, 1, 0.748 22, 2048, 23, 0, 1, 0.748 22, 2071, 23, 23, 1, 0.66 22, 2071, 23, 0, 1, 0.66 22, 4081, 23, 23, 1, 0.991 22, 4081, 23, 0, 1, 0.99 24, 4081, 23, 23, 1, 0.855 24, 4081, 23, 0, 1, 0.853 24, 0, 1, 23, 
1, 0.873 24, 0, 2, 0, 1, 0.873 25, 0, 24, 23, 1, 0.873 25, 0, 24, 0, 1, 0.872 25, 24, 24, 23, 1, 0.491 25, 24, 24, 0, 1, 0.491 23, 0, 24, 23, 1, 0.917 23, 0, 24, 0, 1, 0.917 23, 24, 24, 23, 1, 0.66 23, 24, 24, 0, 1, 0.659 25, 2048, 24, 23, 1, 0.638 25, 2048, 24, 0, 1, 0.638 25, 2072, 24, 23, 1, 0.491 25, 2072, 24, 0, 1, 0.491 23, 2048, 24, 23, 1, 0.747 23, 2048, 24, 0, 1, 0.747 23, 2072, 24, 23, 1, 0.648 23, 2072, 24, 0, 1, 0.663 23, 4081, 24, 23, 1, 0.99 23, 4081, 24, 0, 1, 0.996 25, 4081, 24, 23, 1, 0.858 25, 4081, 24, 0, 1, 0.852 25, 0, 1, 23, 1, 0.872 25, 0, 2, 0, 1, 0.872 26, 0, 25, 23, 1, 0.872 26, 0, 25, 0, 1, 0.872 26, 25, 25, 23, 1, 0.491 26, 25, 25, 0, 1, 0.491 24, 0, 25, 23, 1, 0.906 24, 0, 25, 0, 1, 0.897 24, 25, 25, 23, 1, 0.653 24, 25, 25, 0, 1, 0.664 26, 2048, 25, 23, 1, 0.638 26, 2048, 25, 0, 1, 0.638 26, 2073, 25, 23, 1, 0.491 26, 2073, 25, 0, 1, 0.491 24, 2048, 25, 23, 1, 0.747 24, 2048, 25, 0, 1, 0.748 24, 2073, 25, 23, 1, 0.663 24, 2073, 25, 0, 1, 0.657 24, 4081, 25, 23, 1, 0.991 24, 4081, 25, 0, 1, 0.995 26, 4081, 25, 23, 1, 0.853 26, 4081, 25, 0, 1, 0.852 26, 0, 1, 23, 1, 0.872 26, 0, 2, 0, 1, 0.872 27, 0, 26, 23, 1, 0.873 27, 0, 26, 0, 1, 0.873 27, 26, 26, 23, 1, 0.492 27, 26, 26, 0, 1, 0.492 25, 0, 26, 23, 1, 0.919 25, 0, 26, 0, 1, 0.92 25, 26, 26, 23, 1, 0.661 25, 26, 26, 0, 1, 0.656 27, 2048, 26, 23, 1, 0.639 27, 2048, 26, 0, 1, 0.639 27, 2074, 26, 23, 1, 0.492 27, 2074, 26, 0, 1, 0.492 25, 2048, 26, 23, 1, 0.749 25, 2048, 26, 0, 1, 0.749 25, 2074, 26, 23, 1, 0.665 25, 2074, 26, 0, 1, 0.662 25, 4081, 26, 23, 1, 0.988 25, 4081, 26, 0, 1, 0.998 27, 4081, 26, 23, 1, 0.855 27, 4081, 26, 0, 1, 0.854 27, 0, 1, 23, 1, 0.874 27, 0, 2, 0, 1, 0.874 28, 0, 27, 23, 1, 0.874 28, 0, 27, 0, 1, 0.874 28, 27, 27, 23, 1, 0.492 28, 27, 27, 0, 1, 0.492 26, 0, 27, 23, 1, 0.908 26, 0, 27, 0, 1, 0.908 26, 27, 27, 23, 1, 0.658 26, 27, 27, 0, 1, 0.665 28, 2048, 27, 23, 1, 0.639 28, 2048, 27, 0, 1, 0.639 28, 2075, 27, 23, 1, 0.492 28, 2075, 27, 0, 1, 0.492 26, 2048, 27, 23, 1, 0.749 26, 2048, 27, 0, 1, 0.749 26, 2075, 27, 23, 1, 0.664 26, 2075, 27, 0, 1, 0.665 26, 4081, 27, 23, 1, 0.999 26, 4081, 27, 0, 1, 0.998 28, 4081, 27, 23, 1, 0.855 28, 4081, 27, 0, 1, 0.855 28, 0, 1, 23, 1, 0.874 28, 0, 2, 0, 1, 0.874 29, 0, 28, 23, 1, 0.874 29, 0, 28, 0, 1, 0.874 29, 28, 28, 23, 1, 0.492 29, 28, 28, 0, 1, 0.492 27, 0, 28, 23, 1, 0.919 27, 0, 28, 0, 1, 0.919 27, 28, 28, 23, 1, 0.665 27, 28, 28, 0, 1, 0.655 29, 2048, 28, 23, 1, 0.64 29, 2048, 28, 0, 1, 0.64 29, 2076, 28, 23, 1, 0.492 29, 2076, 28, 0, 1, 0.492 27, 2048, 28, 23, 1, 0.75 27, 2048, 28, 0, 1, 0.752 27, 2076, 28, 23, 1, 0.657 27, 2076, 28, 0, 1, 0.667 27, 4081, 28, 23, 1, 0.981 27, 4081, 28, 0, 1, 0.998 29, 4081, 28, 23, 1, 0.859 29, 4081, 28, 0, 1, 0.858 29, 0, 1, 23, 1, 0.876 29, 0, 2, 0, 1, 0.876 30, 0, 29, 23, 1, 0.876 30, 0, 29, 0, 1, 0.875 30, 29, 29, 23, 1, 0.493 30, 29, 29, 0, 1, 0.494 28, 0, 29, 23, 1, 0.919 28, 0, 29, 0, 1, 0.913 28, 29, 29, 23, 1, 0.668 28, 29, 29, 0, 1, 0.669 30, 2048, 29, 23, 1, 0.642 30, 2048, 29, 0, 1, 0.643 30, 2077, 29, 23, 1, 0.495 30, 2077, 29, 0, 1, 0.495 28, 2048, 29, 23, 1, 0.754 28, 2048, 29, 0, 1, 0.753 28, 2077, 29, 23, 1, 0.663 28, 2077, 29, 0, 1, 0.664 28, 4081, 29, 23, 1, 0.998 28, 4081, 29, 0, 1, 0.98 30, 4081, 29, 23, 1, 0.854 30, 4081, 29, 0, 1, 0.852 30, 0, 1, 23, 1, 0.872 30, 0, 2, 0, 1, 0.871 31, 0, 30, 23, 1, 0.872 31, 0, 30, 0, 1, 0.87 31, 30, 30, 23, 1, 0.49 31, 30, 30, 0, 1, 0.49 29, 0, 30, 23, 1, 0.921 29, 0, 30, 0, 1, 0.924 29, 30, 30, 23, 1, 0.659 29, 30, 30, 0, 1, 0.664 31, 2048, 
30, 23, 1, 0.639 31, 2048, 30, 0, 1, 0.64 31, 2078, 30, 23, 1, 0.492 31, 2078, 30, 0, 1, 0.492 29, 2048, 30, 23, 1, 0.75 29, 2048, 30, 0, 1, 0.75 29, 2078, 30, 23, 1, 0.661 29, 2078, 30, 0, 1, 0.665 29, 4081, 30, 23, 1, 0.98 29, 4081, 30, 0, 1, 0.987 31, 4081, 30, 23, 1, 0.861 31, 4081, 30, 0, 1, 0.855 31, 0, 1, 23, 1, 0.875 31, 0, 2, 0, 1, 0.875 32, 0, 31, 23, 1, 0.875 32, 0, 31, 0, 1, 0.875 32, 31, 31, 23, 1, 0.556 32, 31, 31, 0, 1, 0.556 30, 0, 31, 23, 1, 0.93 30, 0, 31, 0, 1, 0.92 30, 31, 31, 23, 1, 0.666 30, 31, 31, 0, 1, 0.666 32, 2048, 31, 23, 1, 0.625 32, 2048, 31, 0, 1, 0.625 32, 2079, 31, 23, 1, 0.556 32, 2079, 31, 0, 1, 0.556 30, 2048, 31, 23, 1, 0.75 30, 2048, 31, 0, 1, 0.75 30, 2079, 31, 23, 1, 0.666 30, 2079, 31, 0, 1, 0.655 30, 4081, 31, 23, 1, 0.993 30, 4081, 31, 0, 1, 0.999 32, 4081, 31, 23, 1, 0.857 32, 4081, 31, 0, 1, 0.855 32, 0, 1, 23, 1, 0.875 32, 0, 2, 0, 1, 0.875

> ---
>  sysdeps/x86_64/multiarch/memrchr-evex.S | 539 ++++++++++++------------
>  1 file changed, 268 insertions(+), 271 deletions(-)
*/ > - vpcmpb $0, (%rdi), %YMMMATCH, %k1 > + testl %ecx, %ecx > + jnz L(ret_vec_x1) > > - movl %r8d, %ecx > + /* Used no matter what. */ > + vpcmpb $0, (VEC_SIZE * -3)(%rax), %VECMATCH, %k0 > + kmovd %k0, %ecx > > - kmovd %k1, %eax > + cmpl $(VEC_SIZE * 3), %edx > + ja L(last_vec) > > - /* Remove the leading bytes. Must use unsigned right shift for > - bsrl below. */ > - shrl %cl, %eax > - testl %eax, %eax > - jz L(zero) > + lzcntl %ecx, %ecx > + subq $(VEC_SIZE * 2 + 1), %rax > + subq %rcx, %rax > + cmpq %rax, %rdi > + jbe L(ret_1) > + xorl %eax, %eax > +L(ret_1): > + ret > > - bsrl %eax, %eax > - addq %rdi, %rax > - addq %r8, %rax > + .p2align 4,, 6 > +L(loop_end): > + kmovd %k1, %ecx > + notl %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x0_end) > + > + vptestnmb %VEC(2), %VEC(2), %k0 > + kmovd %k0, %ecx > + testl %ecx, %ecx > + jnz L(ret_vec_x1_end) > + > + kmovd %k2, %ecx > + kmovd %k4, %esi > + /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) > + then it won't affect the result in esi (VEC4). If ecx is non-zero > + then CHAR in VEC3 and bsrq will use that position. */ > + salq $32, %rcx > + orq %rsi, %rcx > + bsrq %rcx, %rcx > + addq %rcx, %rax > + ret > + .p2align 4,, 4 > +L(ret_vec_x0_end): > + addq $(VEC_SIZE), %rax > +L(ret_vec_x1_end): > + bsrl %ecx, %ecx > + leaq (VEC_SIZE * 2)(%rax, %rcx), %rax > ret > -END (__memrchr_evex) > + > +END(MEMRCHR) > #endif > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH v1 6/8] x86: Optimize memrchr-avx2.S
  2022-06-03  4:42 [PATCH v1 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein
                   ` (3 preceding siblings ...)
  2022-06-03  4:42 ` [PATCH v1 5/8] x86: Optimize memrchr-evex.S Noah Goldstein
@ 2022-06-03  4:42 ` Noah Goldstein
  2022-06-03  4:50   ` Noah Goldstein
  2022-06-03  4:42 ` [PATCH v1 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 82+ messages in thread
From: Noah Goldstein @ 2022-06-03  4:42 UTC (permalink / raw)
  To: libc-alpha

The new code:
    1. prioritizes smaller user-arg lengths more.
    2. optimizes target placement more carefully.
    3. reuses logic more.
    4. fixes up various inefficiencies in the logic. The biggest
       case here is the `lzcnt` logic for checking returns, which
       saves either a branch or multiple instructions.

The total code size saving is: 306 bytes
Geometric Mean of all benchmarks New / Old: 0.760

Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in the first VEC). This case has roughly a
10-20% regression.

This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). That case has roughly
a 15-45% speedup.

Full xcheck passes on x86_64.
---
 sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S |   1 +
 sysdeps/x86_64/multiarch/memrchr-avx2.S     | 538 ++++++++++----------
 2 files changed, 260 insertions(+), 279 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S
index cea2d2a72d..5e9beeeef2 100644
--- a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S
+++ b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S
@@ -2,6 +2,7 @@
 # define MEMRCHR __memrchr_avx2_rtm
 #endif
 
+#define COND_VZEROUPPER COND_VZEROUPPER_XTEST
 #define ZERO_UPPER_VEC_REGISTERS_RETURN \
 	ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
 
diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2.S b/sysdeps/x86_64/multiarch/memrchr-avx2.S
index ba2ce7cb03..6915e1c373 100644
--- a/sysdeps/x86_64/multiarch/memrchr-avx2.S
+++ b/sysdeps/x86_64/multiarch/memrchr-avx2.S
@@ -21,340 +21,320 @@
 # include <sysdep.h>
 
 # ifndef MEMRCHR
-# define MEMRCHR __memrchr_avx2
+# define MEMRCHR	__memrchr_avx2
 # endif
 
 # ifndef VZEROUPPER
-# define VZEROUPPER vzeroupper
+# define VZEROUPPER	vzeroupper
 # endif
 
+// abf-off
 # ifndef SECTION
 #  define SECTION(p) p##.avx
 # endif
+// abf-on
+
+# define VEC_SIZE 32
+# define PAGE_SIZE 4096
+	.section SECTION(.text), "ax", @progbits
+ENTRY(MEMRCHR)
+# ifdef __ILP32__
+	/* Clear upper bits.  */
+	and	%RDX_LP, %RDX_LP
+# else
+	test	%RDX_LP, %RDX_LP
+# endif
+	jz	L(zero_0)
 
-# define VEC_SIZE 32
-
-	.section SECTION(.text),"ax",@progbits
-ENTRY (MEMRCHR)
-	/* Broadcast CHAR to YMM0.  */
 	vmovd	%esi, %xmm0
-	vpbroadcastb %xmm0, %ymm0
-
-	sub	$VEC_SIZE, %RDX_LP
-	jbe	L(last_vec_or_less)
-
-	add	%RDX_LP, %RDI_LP
+	/* Get end pointer. Minus one for two reasons. 1) It is necessary for a
+	   correct page cross check and 2) it correctly sets up the end pointer
+	   so that lzcnt can be subtracted from it directly.  */
+	leaq	-1(%rdx, %rdi), %rax
 
-	/* Check the last VEC_SIZE bytes.  */
-	vpcmpeqb (%rdi), %ymm0, %ymm1
-	vpmovmskb %ymm1, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x0)
-
-	subq	$(VEC_SIZE * 4), %rdi
-	movl	%edi, %ecx
-	andl	$(VEC_SIZE - 1), %ecx
-	jz	L(aligned_more)
+	vpbroadcastb %xmm0, %ymm0
 
-	/* Align data for aligned loads in the loop.  */
-	addq	$VEC_SIZE, %rdi
-	addq	$VEC_SIZE, %rdx
-	andq	$-VEC_SIZE, %rdi
-	subq	%rcx, %rdx
+	/* Check if we can load 1x VEC without crossing a page.  */
+	testl	$(PAGE_SIZE - VEC_SIZE), %eax
+	jz	L(page_cross)
+
+	vpcmpeqb -(VEC_SIZE - 1)(%rax), %ymm0, %ymm1
+	vpmovmskb %ymm1, %ecx
+	cmpq	$VEC_SIZE, %rdx
+	ja	L(more_1x_vec)
+
+L(ret_vec_x0_test):
+	/* If ecx is zero (no matches) lzcnt will set it to 32 (VEC_SIZE),
+	   which guarantees edx (len) is less than it.  */
+	lzcntl	%ecx, %ecx
+
+	/* Hoist vzeroupper (not great for RTM) to save code size. This allows
+	   all logic for edx (len) <= VEC_SIZE to fit in first cache line.  */
+	COND_VZEROUPPER
+	cmpl	%ecx, %edx
+	jle	L(zero_0)
+	subq	%rcx, %rax
+	ret
 
-	.p2align 4
-L(aligned_more):
-	subq	$(VEC_SIZE * 4), %rdx
-	jbe	L(last_4x_vec_or_less)
-
-	/* Check the last 4 * VEC_SIZE. Only one VEC_SIZE at a time
-	   since data is only aligned to VEC_SIZE.  */
-	vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1
-	vpmovmskb %ymm1, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x3)
-
-	vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2
-	vpmovmskb %ymm2, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x2)
-
-	vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3
-	vpmovmskb %ymm3, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x1)
-
-	vpcmpeqb (%rdi), %ymm0, %ymm4
-	vpmovmskb %ymm4, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x0)
-
-	/* Align data to 4 * VEC_SIZE for loop with fewer branches.
-	   There are some overlaps with above if data isn't aligned
-	   to 4 * VEC_SIZE.  */
-	movl	%edi, %ecx
-	andl	$(VEC_SIZE * 4 - 1), %ecx
-	jz	L(loop_4x_vec)
-
-	addq	$(VEC_SIZE * 4), %rdi
-	addq	$(VEC_SIZE * 4), %rdx
-	andq	$-(VEC_SIZE * 4), %rdi
-	subq	%rcx, %rdx
+	/* Fits in aligning bytes of first cache line.  */
+L(zero_0):
+	xorl	%eax, %eax
+	ret
 
-	.p2align 4
-L(loop_4x_vec):
-	/* Compare 4 * VEC at a time forward.  */
-	subq	$(VEC_SIZE * 4), %rdi
-	subq	$(VEC_SIZE * 4), %rdx
-	jbe	L(last_4x_vec_or_less)
-
-	vmovdqa	(%rdi), %ymm1
-	vmovdqa	VEC_SIZE(%rdi), %ymm2
-	vmovdqa	(VEC_SIZE * 2)(%rdi), %ymm3
-	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm4
-
-	vpcmpeqb %ymm1, %ymm0, %ymm1
-	vpcmpeqb %ymm2, %ymm0, %ymm2
-	vpcmpeqb %ymm3, %ymm0, %ymm3
-	vpcmpeqb %ymm4, %ymm0, %ymm4
-
-	vpor	%ymm1, %ymm2, %ymm5
-	vpor	%ymm3, %ymm4, %ymm6
-	vpor	%ymm5, %ymm6, %ymm5
-
-	vpmovmskb %ymm5, %eax
-	testl	%eax, %eax
-	jz	L(loop_4x_vec)
-
-	/* There is a match.  */
-	vpmovmskb %ymm4, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x3)
-
-	vpmovmskb %ymm3, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x2)
-
-	vpmovmskb %ymm2, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x1)
-
-	vpmovmskb %ymm1, %eax
-	bsrl	%eax, %eax
-	addq	%rdi, %rax
+	.p2align 4,, 9
+L(ret_vec_x0):
+	lzcntl	%ecx, %ecx
+	subq	%rcx, %rax
 L(return_vzeroupper):
 	ZERO_UPPER_VEC_REGISTERS_RETURN
 
-	.p2align 4
-L(last_4x_vec_or_less):
-	addl	$(VEC_SIZE * 4), %edx
-	cmpl	$(VEC_SIZE * 2), %edx
-	jbe	L(last_2x_vec)
-
-	vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1
-	vpmovmskb %ymm1, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x3)
-
-	vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm2
-	vpmovmskb %ymm2, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x2)
-
-	vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm3
-	vpmovmskb %ymm3, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x1_check)
-	cmpl	$(VEC_SIZE * 3), %edx
-	jbe	L(zero)
-
-	vpcmpeqb (%rdi), %ymm0, %ymm4
-	vpmovmskb %ymm4, %eax
-	testl	%eax, %eax
-	jz	L(zero)
-	bsrl	%eax, %eax
-	subq	$(VEC_SIZE * 4), %rdx
-	addq	%rax, %rdx
-	jl	L(zero)
-	addq	%rdi, %rax
-	VZEROUPPER_RETURN
-
-	.p2align 4
+	.p2align 4,, 10
+L(more_1x_vec):
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x0)
+
+	/* Align rax (string pointer).  */
+	andq	$-VEC_SIZE, %rax
+
+	/* Recompute remaining length after aligning.  */
+	movq	%rax, %rdx
+	/* Need this comparison next no matter what.  */
+	vpcmpeqb -(VEC_SIZE)(%rax), %ymm0, %ymm1
+	subq	%rdi, %rdx
+	decq	%rax
+	vpmovmskb %ymm1, %ecx
+	/* Fall through for short (hotter than longer lengths).  */
+	cmpq	$(VEC_SIZE * 2), %rdx
+	ja	L(more_2x_vec)
 L(last_2x_vec):
-	vpcmpeqb (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1
-	vpmovmskb %ymm1, %eax
-	testl	%eax, %eax
-	jnz	L(last_vec_x3_check)
 	cmpl	$VEC_SIZE, %edx
-	jbe	L(zero)
-
-	vpcmpeqb (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1
-	vpmovmskb %ymm1, %eax
-	testl	%eax, %eax
-	jz	L(zero)
-	bsrl	%eax, %eax
-	subq	$(VEC_SIZE * 2), %rdx
-	addq	%rax, %rdx
-	jl	L(zero)
-	addl	$(VEC_SIZE * 2), %eax
-	addq	%rdi, %rax
-	VZEROUPPER_RETURN
-
-	.p2align 4
-L(last_vec_x0):
-	bsrl	%eax, %eax
-	addq	%rdi, %rax
-	VZEROUPPER_RETURN
+	jbe	L(ret_vec_x0_test)
+
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x0)
+
+	vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1
+	vpmovmskb %ymm1, %ecx
+	/* 64-bit lzcnt. This will naturally add 32 to position.  */
+	lzcntq	%rcx, %rcx
+	COND_VZEROUPPER
+	cmpl	%ecx, %edx
+	jle	L(zero_0)
+	subq	%rcx, %rax
+	ret
 
-	.p2align 4
-L(last_vec_x1):
-	bsrl	%eax, %eax
-	addl	$VEC_SIZE, %eax
-	addq	%rdi, %rax
-	VZEROUPPER_RETURN
 
-	.p2align 4
-L(last_vec_x2):
-	bsrl	%eax, %eax
-	addl	$(VEC_SIZE * 2), %eax
-	addq	%rdi, %rax
+	/* Inexpensive place to put this regarding code size / target alignments
+	   / ICache NLP. Necessary for 2-byte encoding of jump to page cross
+	   case, which in turn is necessary for the hot path (len <= VEC_SIZE)
+	   to fit in the first cache line.  */
+L(page_cross):
+	movq	%rax, %rsi
+	andq	$-VEC_SIZE, %rsi
+	vpcmpeqb (%rsi), %ymm0, %ymm1
+	vpmovmskb %ymm1, %ecx
+	/* Shift out negative alignment (because we are starting from endptr and
+	   working backwards).  */
+	movl	%eax, %r8d
+	/* notl because eax already has endptr - 1. (-x = ~(x - 1)).  */
+	notl	%r8d
+	shlxl	%r8d, %ecx, %ecx
+	cmpq	%rdi, %rsi
+	ja	L(more_1x_vec)
+	lzcntl	%ecx, %ecx
+	COND_VZEROUPPER
+	cmpl	%ecx, %edx
+	jle	L(zero_0)
+	subq	%rcx, %rax
+	ret
+	.p2align 4,, 11
+L(ret_vec_x1):
+	/* This will naturally add 32 to position.  */
+	lzcntq	%rcx, %rcx
+	subq	%rcx, %rax
 	VZEROUPPER_RETURN
+	.p2align 4,, 10
+L(more_2x_vec):
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x0)
 
-	.p2align 4
-L(last_vec_x3):
-	bsrl	%eax, %eax
-	addl	$(VEC_SIZE * 3), %eax
-	addq	%rdi, %rax
-	ret
+	vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1
+	vpmovmskb %ymm1, %ecx
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x1)
 
-	.p2align 4
-L(last_vec_x1_check):
-	bsrl	%eax, %eax
-	subq	$(VEC_SIZE * 3), %rdx
-	addq	%rax, %rdx
-	jl	L(zero)
-	addl	$VEC_SIZE, %eax
-	addq	%rdi, %rax
-	VZEROUPPER_RETURN
 
-	.p2align 4
-L(last_vec_x3_check):
-	bsrl	%eax, %eax
-	subq	$VEC_SIZE, %rdx
-	addq	%rax, %rdx
-	jl	L(zero)
-	addl	$(VEC_SIZE * 3), %eax
-	addq	%rdi, %rax
-	VZEROUPPER_RETURN
+	/* Needed no matter what.  */
+	vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1
+	vpmovmskb %ymm1, %ecx
 
-	.p2align 4
-L(zero):
-	xorl	%eax, %eax
-	VZEROUPPER_RETURN
+	subq	$(VEC_SIZE * 4), %rdx
+	ja	L(more_4x_vec)
+
+	cmpl	$(VEC_SIZE * -1), %edx
+	jle	L(ret_vec_x2_test)
+
+L(last_vec):
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x2)
+
+	/* Needed no matter what.  */
+	vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1
+	vpmovmskb %ymm1, %ecx
+	lzcntl	%ecx, %ecx
+	subq	$(VEC_SIZE * 3), %rax
+	COND_VZEROUPPER
+	subq	%rcx, %rax
+	cmpq	%rax, %rdi
+	ja	L(zero_2)
+	ret
 
-	.p2align 4
-L(null):
+	/* First in aligning bytes.  */
+L(zero_2):
 	xorl	%eax, %eax
 	ret
 
-	.p2align 4
-L(last_vec_or_less_aligned):
-	movl	%edx, %ecx
+	.p2align 4,, 4
+L(ret_vec_x2_test):
+	lzcntl	%ecx, %ecx
+	subq	$(VEC_SIZE * 2), %rax
+	COND_VZEROUPPER
+	subq	%rcx, %rax
+	cmpq	%rax, %rdi
+	ja	L(zero_2)
+	ret
 
-	vpcmpeqb (%rdi), %ymm0, %ymm1
 
-	movl	$1, %edx
-	/* Support rdx << 32.  */
-	salq	%cl, %rdx
-	subq	$1, %rdx
+	.p2align 4,, 11
+L(ret_vec_x2):
+	/* ecx must be non-zero.  */
+	bsrl	%ecx, %ecx
+	leaq	(VEC_SIZE * -3 + 1)(%rcx, %rax), %rax
+	VZEROUPPER_RETURN
 
-	vpmovmskb %ymm1, %eax
+	.p2align 4,, 14
+L(ret_vec_x3):
+	/* ecx must be non-zero.  */
+	bsrl	%ecx, %ecx
+	leaq	(VEC_SIZE * -4 + 1)(%rcx, %rax), %rax
+	VZEROUPPER_RETURN
 
-	/* Remove the trailing bytes.  */
-	andl	%edx, %eax
-	testl	%eax, %eax
-	jz	L(zero)
 
-	bsrl	%eax, %eax
-	addq	%rdi, %rax
-	VZEROUPPER_RETURN
 
 	.p2align 4
-L(last_vec_or_less):
-	addl	$VEC_SIZE, %edx
+L(more_4x_vec):
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x2)
 
-	/* Check for zero length.  */
-	testl	%edx, %edx
-	jz	L(null)
+	vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm1
+	vpmovmskb %ymm1, %ecx
 
-	movl	%edi, %ecx
-	andl	$(VEC_SIZE - 1), %ecx
-	jz	L(last_vec_or_less_aligned)
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x3)
 
-	movl	%ecx, %esi
-	movl	%ecx, %r8d
-	addl	%edx, %esi
-	andq	$-VEC_SIZE, %rdi
+	/* Check if near end before re-aligning (otherwise might do an
+	   unnecessary loop iteration).  */
+	addq	$-(VEC_SIZE * 4), %rax
+	cmpq	$(VEC_SIZE * 4), %rdx
+	jbe	L(last_4x_vec)
 
-	subl	$VEC_SIZE, %esi
-	ja	L(last_vec_2x_aligned)
+	/* Align rax to (VEC_SIZE - 1).  */
+	orq	$(VEC_SIZE * 4 - 1), %rax
+	movq	%rdi, %rdx
+	/* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because
+	   lengths that overflow can be valid and break the comparison.  */
+	orq	$(VEC_SIZE * 4 - 1), %rdx
 
-	/* Check the last VEC.  */
-	vpcmpeqb (%rdi), %ymm0, %ymm1
-	vpmovmskb %ymm1, %eax
-
-	/* Remove the leading and trailing bytes.  */
-	sarl	%cl, %eax
-	movl	%edx, %ecx
 	.p2align 4
+L(loop_4x_vec):
+	/* Need this comparison next no matter what.  */
+	vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1
+	vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm2
+	vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm3
+	vpcmpeqb -(VEC_SIZE * 4 - 1)(%rax), %ymm0, %ymm4
 
-	movl	$1, %edx
-	sall	%cl, %edx
-	subl	$1, %edx
+	vpor	%ymm1, %ymm2, %ymm2
+	vpor	%ymm3, %ymm4, %ymm4
+	vpor	%ymm2, %ymm4, %ymm4
+	vpmovmskb %ymm4, %esi
 
-	andl	%edx, %eax
-	testl	%eax, %eax
-	jz	L(zero)
+	testl	%esi, %esi
+	jnz	L(loop_end)
 
-	bsrl	%eax, %eax
-	addq	%rdi, %rax
-	addq	%r8, %rax
-	VZEROUPPER_RETURN
+	addq	$(VEC_SIZE * -4), %rax
+	cmpq	%rdx, %rax
+	jne	L(loop_4x_vec)
 
-	.p2align 4
-L(last_vec_2x_aligned):
-	movl	%esi, %ecx
+	subl	%edi, %edx
+	incl	%edx
 
-	/* Check the last VEC.  */
-	vpcmpeqb VEC_SIZE(%rdi), %ymm0, %ymm1
+L(last_4x_vec):
+	/* Used no matter what.  */
+	vpcmpeqb -(VEC_SIZE * 1 - 1)(%rax), %ymm0, %ymm1
+	vpmovmskb %ymm1, %ecx
 
-	movl	$1, %edx
-	sall	%cl, %edx
-	subl	$1, %edx
+	cmpl	$(VEC_SIZE * 2), %edx
+	jbe	L(last_2x_vec)
 
-	vpmovmskb %ymm1, %eax
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x0_end)
 
-	/* Remove the trailing bytes.  */
-	andl	%edx, %eax
+	vpcmpeqb -(VEC_SIZE * 2 - 1)(%rax), %ymm0, %ymm1
+	vpmovmskb %ymm1, %ecx
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x1_end)
 
-	testl	%eax, %eax
-	jnz	L(last_vec_x1)
+	/* Used no matter what.  */
+	vpcmpeqb -(VEC_SIZE * 3 - 1)(%rax), %ymm0, %ymm1
+	vpmovmskb %ymm1, %ecx
 
-	/* Check the second last VEC.  */
-	vpcmpeqb (%rdi), %ymm0, %ymm1
+	cmpl	$(VEC_SIZE * 3), %edx
+	ja	L(last_vec)
+
+	lzcntl	%ecx, %ecx
+	subq	$(VEC_SIZE * 2), %rax
+	COND_VZEROUPPER
+	subq	%rcx, %rax
+	cmpq	%rax, %rdi
+	jbe	L(ret0)
+	xorl	%eax, %eax
+L(ret0):
+	ret
 
-	movl	%r8d, %ecx
 
-	vpmovmskb %ymm1, %eax
 	.p2align 4
+L(loop_end):
+	vpmovmskb %ymm1, %ecx
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x0_end)
+
+	vpmovmskb %ymm2, %ecx
+	testl	%ecx, %ecx
+	jnz	L(ret_vec_x1_end)
+
+	vpmovmskb %ymm3, %ecx
+	/* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3)
+	   then it won't affect the result in esi (VEC4). If ecx is non-zero
+	   then CHAR is in VEC3 and bsrq will use that position.  */
+	salq	$32, %rcx
+	orq	%rsi, %rcx
+	bsrq	%rcx, %rcx
+	leaq	(VEC_SIZE * -4 + 1)(%rcx, %rax), %rax
+	VZEROUPPER_RETURN
 
-	/* Remove the leading bytes. Must use unsigned right shift for
-	   bsrl below.  */
-	shrl	%cl, %eax
-	testl	%eax, %eax
-	jz	L(zero)
+	.p2align 4,, 4
+L(ret_vec_x1_end):
+	/* 64-bit version will automatically add 32 (VEC_SIZE).  */
+	lzcntq	%rcx, %rcx
+	subq	%rcx, %rax
+	VZEROUPPER_RETURN
 
-	bsrl	%eax, %eax
-	addq	%rdi, %rax
-	addq	%r8, %rax
+	.p2align 4,, 4
+L(ret_vec_x0_end):
+	lzcntl	%ecx, %ecx
+	subq	%rcx, %rax
 	VZEROUPPER_RETURN
-END (MEMRCHR)
+
+	/* 2 bytes until next cache line.  */
+END(MEMRCHR)
 #endif
-- 
2.34.1

^ permalink raw reply	[flat|nested] 82+ messages in thread
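The lzcnt return logic the commit message credits with the savings relies on the end pointer being set up as s + len - 1: subtracting the leading-zero count of an end-anchored match mask gives the address of the last match directly, and lzcnt of zero is defined as the operand width, so one length comparison rejects the no-match case and out-of-range matches together. A minimal C sketch of the two return computations (same scalar mask convention as the sketch above; hypothetical helper names, not the patch's code):

#include <stddef.h>
#include <stdint.h>

/* L(ret_vec_x0_test) shape: end1 = s + len - 1, len <= 32, and bit 31
   of mask corresponds to the byte at end1.  Returns NULL if there is
   no match inside the buffer.  */
static const unsigned char *
last_match_in_vec (const unsigned char *end1, uint32_t mask, size_t len)
{
  /* x86 lzcnt of 0 is 32 for a 32-bit operand; __builtin_clz (0) is
     undefined, hence the explicit check.  Both "no match" and "match
     before the start" fail the single comparison below -- the saved
     branch.  */
  unsigned int lz = mask == 0 ? 32 : (unsigned int) __builtin_clz (mask);
  if (lz >= len)
    return NULL;
  return end1 - lz;
}

/* L(loop_end) shape: merge two adjacent 32-byte masks (hi covers the
   higher-addressed bytes) so one bsr finds the last match across 64
   bytes.  Returns the bit index relative to the lower block's base,
   or -1 if neither block matched.  */
static int
last_match_bit64 (uint32_t lo, uint32_t hi)
{
  uint64_t m = ((uint64_t) hi << 32) | lo;
  return m == 0 ? -1 : 63 - __builtin_clzll (m);
}

The same widening explains the "naturally add 32" comments in both diffs: a 32-bit mask left in a 64-bit register has 32 extra leading zeros, so a 64-bit lzcnt automatically accounts for the vector sitting one VEC_SIZE further from the end.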
* Re: [PATCH v1 6/8] x86: Optimize memrchr-avx2.S
  2022-06-03  4:42 ` [PATCH v1 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein
@ 2022-06-03  4:50   ` Noah Goldstein
  0 siblings, 0 replies; 82+ messages in thread
From: Noah Goldstein @ 2022-06-03  4:50 UTC (permalink / raw)
  To: GNU C Library

[-- Attachment #1: Type: text/plain, Size: 19793 bytes --]

On Thu, Jun 2, 2022 at 11:42 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> The new code:
>    1. prioritizes smaller user-arg lengths more.
>    2. optimizes target placement more carefully.
>    3. reuses logic more.
>    4. fixes up various inefficiencies in the logic. The biggest
>       case here is the `lzcnt` logic for checking returns, which
>       saves either a branch or multiple instructions.
>
> The total code size saving is: 306 bytes
> Geometric Mean of all benchmarks New / Old: 0.760
>
> Regressions:
> There are some regressions. Particularly where the length (user arg
> length) is large but the position of the match char is near the
> beginning of the string (in the first VEC). This case has roughly a
> 10-20% regression.
>
> This is because the new logic gives the hot path for immediate matches
> to shorter lengths (the more common input). That case has roughly
> a 15-45% speedup.
>
> Full xcheck passes on x86_64.
> ---
>  sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S |   1 +
>  sysdeps/x86_64/multiarch/memrchr-avx2.S     | 538 ++++++++++----------
>  2 files changed, 260 insertions(+), 279 deletions(-)

[-- Attachment #2: tgl-memrchr-avx2.txt --]
[-- Type: text/plain, Size: 83424 bytes --]

Geometric mean of N = 30 runs.
Benchmarked on Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i71165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html Aggregate Geometric Mean of New / Old: 0.7606704539854201 Results For: memrchr len, align, pos, seek_char, invert_pos, New / Old 2048, 0, 32, 23, 0, 0.938 256, 1, 64, 23, 0, 0.813 2048, 0, 32, 0, 0, 0.948 256, 1, 64, 0, 0, 0.811 256, 4081, 64, 0, 0, 0.642 256, 0, 1, 23, 0, 0.77 256, 0, 1, 0, 0, 0.771 256, 1, 1, 23, 0, 0.788 256, 1, 1, 0, 0, 0.788 2048, 0, 64, 23, 0, 0.95 256, 2, 64, 23, 0, 0.817 2048, 0, 64, 0, 0, 0.96 256, 2, 64, 0, 0, 0.811 256, 0, 2, 23, 0, 0.772 256, 0, 2, 0, 0, 0.772 256, 2, 2, 23, 0, 0.789 256, 2, 2, 0, 0, 0.789 2048, 0, 128, 23, 0, 0.979 256, 3, 64, 23, 0, 0.811 2048, 0, 128, 0, 0, 0.971 256, 3, 64, 0, 0, 0.816 256, 0, 3, 23, 0, 0.772 256, 0, 3, 0, 0, 0.772 256, 3, 3, 23, 0, 0.788 256, 3, 3, 0, 0, 0.787 2048, 0, 256, 23, 0, 0.949 256, 4, 64, 23, 0, 0.808 2048, 0, 256, 0, 0, 0.952 256, 4, 64, 0, 0, 0.811 256, 0, 4, 23, 0, 0.778 256, 0, 4, 0, 0, 0.78 256, 4, 4, 23, 0, 0.79 256, 4, 4, 0, 0, 0.79 2048, 0, 512, 23, 0, 0.971 256, 5, 64, 23, 0, 0.817 2048, 0, 512, 0, 0, 0.969 256, 5, 64, 0, 0, 0.822 256, 0, 5, 23, 0, 0.78 256, 0, 5, 0, 0, 0.77 256, 5, 5, 23, 0, 0.791 256, 5, 5, 0, 0, 0.791 2048, 0, 1024, 23, 0, 1.007 256, 6, 64, 23, 0, 0.812 2048, 0, 1024, 0, 0, 1.002 256, 6, 64, 0, 0, 0.811 256, 0, 6, 23, 0, 0.773 256, 0, 6, 0, 0, 0.774 256, 6, 6, 23, 0, 0.791 256, 6, 6, 0, 0, 0.791 2048, 0, 2048, 23, 0, 0.969 256, 7, 64, 23, 0, 0.818 2048, 0, 2048, 0, 0, 0.959 256, 7, 64, 0, 0, 0.816 256, 0, 7, 23, 0, 0.773 256, 0, 7, 0, 0, 0.776 256, 7, 7, 23, 0, 0.791 256, 7, 7, 0, 0, 0.798 192, 1, 32, 23, 0, 0.647 192, 1, 32, 0, 0, 0.643 256, 1, 32, 23, 0, 0.803 256, 1, 32, 0, 0, 0.808 512, 1, 32, 23, 0, 0.838 512, 1, 32, 0, 0, 0.834 256, 4081, 32, 23, 0, 0.725 192, 2, 64, 23, 0, 0.919 192, 2, 64, 0, 0, 0.919 512, 2, 64, 23, 0, 0.878 512, 2, 64, 0, 0, 0.875 256, 4081, 64, 23, 0, 0.645 192, 3, 96, 23, 0, 0.919 192, 3, 96, 0, 0, 0.934 256, 3, 96, 23, 0, 0.829 256, 3, 96, 0, 0, 0.83 512, 3, 96, 23, 0, 0.883 512, 3, 96, 0, 0, 0.886 256, 4081, 96, 23, 0, 0.649 192, 4, 128, 23, 0, 0.877 192, 4, 128, 0, 0, 0.862 256, 4, 128, 23, 0, 0.915 256, 4, 128, 0, 0, 0.918 512, 4, 128, 23, 0, 0.892 512, 4, 128, 0, 0, 0.89 256, 4081, 128, 23, 0, 0.927 192, 5, 160, 23, 0, 1.164 192, 5, 160, 0, 0, 1.157 256, 5, 160, 23, 0, 0.928 256, 5, 160, 0, 0, 0.93 512, 5, 160, 23, 0, 0.857 512, 5, 160, 0, 0, 0.864 256, 4081, 160, 23, 0, 0.94 192, 6, 192, 23, 0, 0.699 192, 6, 192, 0, 0, 0.701 256, 6, 192, 23, 0, 0.865 256, 6, 192, 0, 0, 0.873 512, 6, 192, 23, 0, 0.836 512, 6, 192, 0, 0, 0.835 256, 4081, 192, 23, 0, 0.896 192, 7, 224, 23, 0, 0.702 192, 7, 224, 0, 0, 0.701 256, 7, 224, 23, 0, 1.16 256, 7, 224, 0, 0, 1.161 512, 7, 224, 23, 0, 0.854 512, 7, 224, 0, 0, 0.862 256, 4081, 224, 23, 0, 1.155 2, 0, 1, 23, 0, 0.812 2, 0, 1, 0, 0, 0.836 2, 1, 1, 23, 0, 0.72 2, 1, 1, 0, 0, 0.726 0, 0, 1, 23, 0, 0.857 0, 0, 1, 0, 0, 0.857 0, 1, 1, 23, 0, 0.857 0, 1, 1, 0, 0, 0.857 2, 2048, 1, 23, 0, 0.694 2, 2048, 1, 0, 0, 0.71 2, 2049, 1, 23, 0, 0.621 2, 2049, 1, 0, 0, 0.629 0, 2048, 1, 23, 0, 0.857 0, 2048, 1, 0, 0, 0.856 0, 2049, 1, 23, 0, 0.857 0, 2049, 1, 0, 0, 0.857 0, 4081, 1, 23, 0, 0.857 0, 4081, 1, 0, 0, 0.857 2, 4081, 1, 23, 0, 0.602 2, 4081, 1, 0, 0, 0.621 2, 0, 2, 0, 0, 0.777 3, 0, 2, 23, 0, 0.827 3, 0, 2, 0, 0, 0.819 3, 2, 2, 23, 0, 0.737 3, 2, 2, 0, 0, 0.731 1, 0, 2, 23, 0, 0.778 1, 0, 2, 0, 0, 0.778 1, 2, 2, 23, 0, 0.868 1, 2, 2, 0, 0, 0.868 3, 2048, 2, 23, 0, 
0.716 3, 2048, 2, 0, 0, 0.719 3, 2050, 2, 23, 0, 0.625 3, 2050, 2, 0, 0, 0.632 1, 2048, 2, 23, 0, 0.667 1, 2048, 2, 0, 0, 0.667 1, 2050, 2, 23, 0, 0.749 1, 2050, 2, 0, 0, 0.752 1, 4081, 2, 23, 0, 0.743 1, 4081, 2, 0, 0, 0.745 3, 4081, 2, 23, 0, 0.601 3, 4081, 2, 0, 0, 0.613 3, 0, 1, 23, 0, 0.832 4, 0, 3, 23, 0, 0.834 4, 0, 3, 0, 0, 0.838 4, 3, 3, 23, 0, 0.728 4, 3, 3, 0, 0, 0.718 2, 0, 3, 23, 0, 0.778 2, 0, 3, 0, 0, 0.777 2, 3, 3, 23, 0, 0.868 2, 3, 3, 0, 0, 0.87 4, 2048, 3, 23, 0, 0.711 4, 2048, 3, 0, 0, 0.711 4, 2051, 3, 23, 0, 0.621 4, 2051, 3, 0, 0, 0.622 2, 2048, 3, 23, 0, 0.669 2, 2048, 3, 0, 0, 0.669 2, 2051, 3, 23, 0, 0.746 2, 2051, 3, 0, 0, 0.746 2, 4081, 3, 23, 0, 0.745 2, 4081, 3, 0, 0, 0.744 4, 4081, 3, 23, 0, 0.587 4, 4081, 3, 0, 0, 0.585 4, 0, 1, 23, 0, 0.846 4, 0, 2, 0, 0, 0.834 5, 0, 4, 23, 0, 0.84 5, 0, 4, 0, 0, 0.838 5, 4, 4, 23, 0, 0.736 5, 4, 4, 0, 0, 0.733 3, 0, 4, 23, 0, 0.779 3, 0, 4, 0, 0, 0.779 3, 4, 4, 23, 0, 0.869 3, 4, 4, 0, 0, 0.869 5, 2048, 4, 23, 0, 0.693 5, 2048, 4, 0, 0, 0.713 5, 2052, 4, 23, 0, 0.615 5, 2052, 4, 0, 0, 0.609 3, 2048, 4, 23, 0, 0.666 3, 2048, 4, 0, 0, 0.668 3, 2052, 4, 23, 0, 0.748 3, 2052, 4, 0, 0, 0.745 3, 4081, 4, 23, 0, 0.745 3, 4081, 4, 0, 0, 0.745 5, 4081, 4, 23, 0, 0.566 5, 4081, 4, 0, 0, 0.589 5, 0, 1, 23, 0, 0.841 5, 0, 2, 0, 0, 0.832 6, 0, 5, 23, 0, 0.832 6, 0, 5, 0, 0, 0.836 6, 5, 5, 23, 0, 0.736 6, 5, 5, 0, 0, 0.732 4, 0, 5, 23, 0, 0.778 4, 0, 5, 0, 0, 0.779 4, 5, 5, 23, 0, 0.871 4, 5, 5, 0, 0, 0.87 6, 2048, 5, 23, 0, 0.713 6, 2048, 5, 0, 0, 0.702 6, 2053, 5, 23, 0, 0.611 6, 2053, 5, 0, 0, 0.632 4, 2048, 5, 23, 0, 0.667 4, 2048, 5, 0, 0, 0.667 4, 2053, 5, 23, 0, 0.746 4, 2053, 5, 0, 0, 0.745 4, 4081, 5, 23, 0, 0.746 4, 4081, 5, 0, 0, 0.744 6, 4081, 5, 23, 0, 0.584 6, 4081, 5, 0, 0, 0.602 6, 0, 1, 23, 0, 0.842 6, 0, 2, 0, 0, 0.848 7, 0, 6, 23, 0, 0.837 7, 0, 6, 0, 0, 0.838 7, 6, 6, 23, 0, 0.739 7, 6, 6, 0, 0, 0.721 5, 0, 6, 23, 0, 0.778 5, 0, 6, 0, 0, 0.777 5, 6, 6, 23, 0, 0.869 5, 6, 6, 0, 0, 0.869 7, 2048, 6, 23, 0, 0.719 7, 2048, 6, 0, 0, 0.699 7, 2054, 6, 23, 0, 0.618 7, 2054, 6, 0, 0, 0.621 5, 2048, 6, 23, 0, 0.667 5, 2048, 6, 0, 0, 0.667 5, 2054, 6, 23, 0, 0.747 5, 2054, 6, 0, 0, 0.75 5, 4081, 6, 23, 0, 0.745 5, 4081, 6, 0, 0, 0.743 7, 4081, 6, 23, 0, 0.586 7, 4081, 6, 0, 0, 0.572 7, 0, 1, 23, 0, 0.836 7, 0, 2, 0, 0, 0.848 8, 0, 7, 23, 0, 0.843 8, 0, 7, 0, 0, 0.827 8, 7, 7, 23, 0, 0.733 8, 7, 7, 0, 0, 0.744 6, 0, 7, 23, 0, 0.778 6, 0, 7, 0, 0, 0.781 6, 7, 7, 23, 0, 0.869 6, 7, 7, 0, 0, 0.872 8, 2048, 7, 23, 0, 0.717 8, 2048, 7, 0, 0, 0.703 8, 2055, 7, 23, 0, 0.608 8, 2055, 7, 0, 0, 0.608 6, 2048, 7, 23, 0, 0.667 6, 2048, 7, 0, 0, 0.669 6, 2055, 7, 23, 0, 0.749 6, 2055, 7, 0, 0, 0.748 6, 4081, 7, 23, 0, 0.742 6, 4081, 7, 0, 0, 0.745 8, 4081, 7, 23, 0, 0.602 8, 4081, 7, 0, 0, 0.59 8, 0, 1, 23, 0, 0.833 8, 0, 2, 0, 0, 0.837 9, 0, 8, 23, 0, 0.822 9, 0, 8, 0, 0, 0.824 9, 8, 8, 23, 0, 0.725 9, 8, 8, 0, 0, 0.734 7, 0, 8, 23, 0, 0.778 7, 0, 8, 0, 0, 0.778 7, 8, 8, 23, 0, 0.871 7, 8, 8, 0, 0, 0.873 9, 2048, 8, 23, 0, 0.709 9, 2048, 8, 0, 0, 0.713 9, 2056, 8, 23, 0, 0.616 9, 2056, 8, 0, 0, 0.616 7, 2048, 8, 23, 0, 0.667 7, 2048, 8, 0, 0, 0.668 7, 2056, 8, 23, 0, 0.749 7, 2056, 8, 0, 0, 0.747 7, 4081, 8, 23, 0, 0.744 7, 4081, 8, 0, 0, 0.742 9, 4081, 8, 23, 0, 0.581 9, 4081, 8, 0, 0, 0.591 9, 0, 1, 23, 0, 0.842 9, 0, 2, 0, 0, 0.83 10, 0, 9, 23, 0, 0.834 10, 0, 9, 0, 0, 0.851 10, 9, 9, 23, 0, 0.728 10, 9, 9, 0, 0, 0.726 8, 0, 9, 23, 0, 0.778 8, 0, 9, 0, 0, 0.776 8, 9, 9, 23, 0, 0.872 8, 9, 9, 0, 0, 0.868 10, 2048, 9, 23, 0, 0.697 10, 
2048, 9, 0, 0, 0.712 10, 2057, 9, 23, 0, 0.605 10, 2057, 9, 0, 0, 0.61 8, 2048, 9, 23, 0, 0.667 8, 2048, 9, 0, 0, 0.667 8, 2057, 9, 23, 0, 0.744 8, 2057, 9, 0, 0, 0.745 8, 4081, 9, 23, 0, 0.745 8, 4081, 9, 0, 0, 0.743 10, 4081, 9, 23, 0, 0.57 10, 4081, 9, 0, 0, 0.593 10, 0, 1, 23, 0, 0.841 10, 0, 2, 0, 0, 0.841 11, 0, 10, 23, 0, 0.815 11, 0, 10, 0, 0, 0.834 11, 10, 10, 23, 0, 0.735 11, 10, 10, 0, 0, 0.73 9, 0, 10, 23, 0, 0.778 9, 0, 10, 0, 0, 0.779 9, 10, 10, 23, 0, 0.867 9, 10, 10, 0, 0, 0.87 11, 2048, 10, 23, 0, 0.707 11, 2048, 10, 0, 0, 0.697 11, 2058, 10, 23, 0, 0.614 11, 2058, 10, 0, 0, 0.616 9, 2048, 10, 23, 0, 0.667 9, 2048, 10, 0, 0, 0.666 9, 2058, 10, 23, 0, 0.744 9, 2058, 10, 0, 0, 0.744 9, 4081, 10, 23, 0, 0.745 9, 4081, 10, 0, 0, 0.744 11, 4081, 10, 23, 0, 0.59 11, 4081, 10, 0, 0, 0.588 11, 0, 1, 23, 0, 0.84 11, 0, 2, 0, 0, 0.837 12, 0, 11, 23, 0, 0.831 12, 0, 11, 0, 0, 0.831 12, 11, 11, 23, 0, 0.725 12, 11, 11, 0, 0, 0.737 10, 0, 11, 23, 0, 0.78 10, 0, 11, 0, 0, 0.778 10, 11, 11, 23, 0, 0.867 10, 11, 11, 0, 0, 0.87 12, 2048, 11, 23, 0, 0.715 12, 2048, 11, 0, 0, 0.723 12, 2059, 11, 23, 0, 0.623 12, 2059, 11, 0, 0, 0.613 10, 2048, 11, 23, 0, 0.668 10, 2048, 11, 0, 0, 0.667 10, 2059, 11, 23, 0, 0.743 10, 2059, 11, 0, 0, 0.743 10, 4081, 11, 23, 0, 0.744 10, 4081, 11, 0, 0, 0.743 12, 4081, 11, 23, 0, 0.605 12, 4081, 11, 0, 0, 0.586 12, 0, 1, 23, 0, 0.843 12, 0, 2, 0, 0, 0.846 13, 0, 12, 23, 0, 0.829 13, 0, 12, 0, 0, 0.832 13, 12, 12, 23, 0, 0.731 13, 12, 12, 0, 0, 0.727 11, 0, 12, 23, 0, 0.778 11, 0, 12, 0, 0, 0.777 11, 12, 12, 23, 0, 0.87 11, 12, 12, 0, 0, 0.87 13, 2048, 12, 23, 0, 0.714 13, 2048, 12, 0, 0, 0.713 13, 2060, 12, 23, 0, 0.618 13, 2060, 12, 0, 0, 0.614 11, 2048, 12, 23, 0, 0.667 11, 2048, 12, 0, 0, 0.667 11, 2060, 12, 23, 0, 0.744 11, 2060, 12, 0, 0, 0.744 11, 4081, 12, 23, 0, 0.744 11, 4081, 12, 0, 0, 0.743 13, 4081, 12, 23, 0, 0.586 13, 4081, 12, 0, 0, 0.589 13, 0, 1, 23, 0, 0.838 13, 0, 2, 0, 0, 0.83 14, 0, 13, 23, 0, 0.838 14, 0, 13, 0, 0, 0.843 14, 13, 13, 23, 0, 0.739 14, 13, 13, 0, 0, 0.728 12, 0, 13, 23, 0, 0.778 12, 0, 13, 0, 0, 0.778 12, 13, 13, 23, 0, 0.868 12, 13, 13, 0, 0, 0.866 14, 2048, 13, 23, 0, 0.706 14, 2048, 13, 0, 0, 0.719 14, 2061, 13, 23, 0, 0.626 14, 2061, 13, 0, 0, 0.626 12, 2048, 13, 23, 0, 0.667 12, 2048, 13, 0, 0, 0.667 12, 2061, 13, 23, 0, 0.744 12, 2061, 13, 0, 0, 0.742 12, 4081, 13, 23, 0, 0.745 12, 4081, 13, 0, 0, 0.743 14, 4081, 13, 23, 0, 0.601 14, 4081, 13, 0, 0, 0.582 14, 0, 1, 23, 0, 0.851 14, 0, 2, 0, 0, 0.839 15, 0, 14, 23, 0, 0.833 15, 0, 14, 0, 0, 0.815 15, 14, 14, 23, 0, 0.723 15, 14, 14, 0, 0, 0.719 13, 0, 14, 23, 0, 0.777 13, 0, 14, 0, 0, 0.779 13, 14, 14, 23, 0, 0.867 13, 14, 14, 0, 0, 0.867 15, 2048, 14, 23, 0, 0.701 15, 2048, 14, 0, 0, 0.718 15, 2062, 14, 23, 0, 0.628 15, 2062, 14, 0, 0, 0.622 13, 2048, 14, 23, 0, 0.667 13, 2048, 14, 0, 0, 0.667 13, 2062, 14, 23, 0, 0.743 13, 2062, 14, 0, 0, 0.743 13, 4081, 14, 23, 0, 0.744 13, 4081, 14, 0, 0, 0.741 15, 4081, 14, 23, 0, 0.568 15, 4081, 14, 0, 0, 0.562 15, 0, 1, 23, 0, 0.842 15, 0, 2, 0, 0, 0.841 16, 0, 15, 23, 0, 0.834 16, 0, 15, 0, 0, 0.831 16, 15, 15, 23, 0, 0.737 16, 15, 15, 0, 0, 0.715 14, 0, 15, 23, 0, 0.793 14, 0, 15, 0, 0, 0.792 14, 15, 15, 23, 0, 0.878 14, 15, 15, 0, 0, 0.876 16, 2048, 15, 23, 0, 0.702 16, 2048, 15, 0, 0, 0.697 16, 2063, 15, 23, 0, 0.615 16, 2063, 15, 0, 0, 0.622 14, 2048, 15, 23, 0, 0.689 14, 2048, 15, 0, 0, 0.688 14, 2063, 15, 23, 0, 0.76 14, 2063, 15, 0, 0, 0.759 14, 4081, 15, 23, 0, 0.756 14, 4081, 15, 0, 0, 0.763 16, 4081, 15, 23, 0, 0.887 
16, 4081, 15, 0, 0, 0.888 16, 0, 1, 23, 0, 0.84 16, 0, 2, 0, 0, 0.848 17, 0, 16, 23, 0, 0.833 17, 0, 16, 0, 0, 0.845 17, 16, 16, 23, 0, 0.616 17, 16, 16, 0, 0, 0.603 15, 0, 16, 23, 0, 0.829 15, 0, 16, 0, 0, 0.829 15, 16, 16, 23, 0, 0.907 15, 16, 16, 0, 0, 0.909 17, 2048, 16, 23, 0, 0.71 17, 2048, 16, 0, 0, 0.69 17, 2064, 16, 23, 0, 0.615 17, 2064, 16, 0, 0, 0.588 15, 2048, 16, 23, 0, 0.686 15, 2048, 16, 0, 0, 0.687 15, 2064, 16, 23, 0, 0.755 15, 2064, 16, 0, 0, 0.756 15, 4081, 16, 23, 0, 0.76 15, 4081, 16, 0, 0, 0.755 17, 4081, 16, 23, 0, 0.889 17, 4081, 16, 0, 0, 0.889 17, 0, 1, 23, 0, 0.849 17, 0, 2, 0, 0, 0.855 18, 0, 17, 23, 0, 0.83 18, 0, 17, 0, 0, 0.826 18, 17, 17, 23, 0, 0.612 18, 17, 17, 0, 0, 0.597 16, 0, 17, 23, 0, 0.8 16, 0, 17, 0, 0, 0.805 16, 17, 17, 23, 0, 0.669 16, 17, 17, 0, 0, 0.669 18, 2048, 17, 23, 0, 0.707 18, 2048, 17, 0, 0, 0.71 18, 2065, 17, 23, 0, 0.607 18, 2065, 17, 0, 0, 0.588 16, 2048, 17, 23, 0, 0.687 16, 2048, 17, 0, 0, 0.686 16, 2065, 17, 23, 0, 0.669 16, 2065, 17, 0, 0, 0.67 16, 4081, 17, 23, 0, 0.986 16, 4081, 17, 0, 0, 0.982 18, 4081, 17, 23, 0, 0.889 18, 4081, 17, 0, 0, 0.889 18, 0, 1, 23, 0, 0.857 18, 0, 2, 0, 0, 0.853 19, 0, 18, 23, 0, 0.842 19, 0, 18, 0, 0, 0.817 19, 18, 18, 23, 0, 0.599 19, 18, 18, 0, 0, 0.593 17, 0, 18, 23, 0, 0.795 17, 0, 18, 0, 0, 0.8 17, 18, 18, 23, 0, 0.67 17, 18, 18, 0, 0, 0.669 19, 2048, 18, 23, 0, 0.707 19, 2048, 18, 0, 0, 0.704 19, 2066, 18, 23, 0, 0.588 19, 2066, 18, 0, 0, 0.611 17, 2048, 18, 23, 0, 0.687 17, 2048, 18, 0, 0, 0.686 17, 2066, 18, 23, 0, 0.67 17, 2066, 18, 0, 0, 0.671 17, 4081, 18, 23, 0, 0.982 17, 4081, 18, 0, 0, 0.98 19, 4081, 18, 23, 0, 0.889 19, 4081, 18, 0, 0, 0.889 19, 0, 1, 23, 0, 0.844 19, 0, 2, 0, 0, 0.847 20, 0, 19, 23, 0, 0.83 20, 0, 19, 0, 0, 0.836 20, 19, 19, 23, 0, 0.588 20, 19, 19, 0, 0, 0.61 18, 0, 19, 23, 0, 0.829 18, 0, 19, 0, 0, 0.835 18, 19, 19, 23, 0, 0.669 18, 19, 19, 0, 0, 0.67 20, 2048, 19, 23, 0, 0.691 20, 2048, 19, 0, 0, 0.707 20, 2067, 19, 23, 0, 0.626 20, 2067, 19, 0, 0, 0.611 18, 2048, 19, 23, 0, 0.686 18, 2048, 19, 0, 0, 0.687 18, 2067, 19, 23, 0, 0.669 18, 2067, 19, 0, 0, 0.669 18, 4081, 19, 23, 0, 0.982 18, 4081, 19, 0, 0, 0.98 20, 4081, 19, 23, 0, 0.889 20, 4081, 19, 0, 0, 0.889 20, 0, 1, 23, 0, 0.85 20, 0, 2, 0, 0, 0.838 21, 0, 20, 23, 0, 0.839 21, 0, 20, 0, 0, 0.824 21, 20, 20, 23, 0, 0.593 21, 20, 20, 0, 0, 0.612 19, 0, 20, 23, 0, 0.833 19, 0, 20, 0, 0, 0.83 19, 20, 20, 23, 0, 0.669 19, 20, 20, 0, 0, 0.669 21, 2048, 20, 23, 0, 0.7 21, 2048, 20, 0, 0, 0.72 21, 2068, 20, 23, 0, 0.611 21, 2068, 20, 0, 0, 0.597 19, 2048, 20, 23, 0, 0.687 19, 2048, 20, 0, 0, 0.687 19, 2068, 20, 23, 0, 0.669 19, 2068, 20, 0, 0, 0.668 19, 4081, 20, 23, 0, 0.98 19, 4081, 20, 0, 0, 0.98 21, 4081, 20, 23, 0, 0.889 21, 4081, 20, 0, 0, 0.889 21, 0, 1, 23, 0, 0.856 21, 0, 2, 0, 0, 0.845 22, 0, 21, 23, 0, 0.833 22, 0, 21, 0, 0, 0.83 22, 21, 21, 23, 0, 0.607 22, 21, 21, 0, 0, 0.602 20, 0, 21, 23, 0, 0.807 20, 0, 21, 0, 0, 0.807 20, 21, 21, 23, 0, 0.666 20, 21, 21, 0, 0, 0.669 22, 2048, 21, 23, 0, 0.71 22, 2048, 21, 0, 0, 0.723 22, 2069, 21, 23, 0, 0.602 22, 2069, 21, 0, 0, 0.597 20, 2048, 21, 23, 0, 0.688 20, 2048, 21, 0, 0, 0.689 20, 2069, 21, 23, 0, 0.67 20, 2069, 21, 0, 0, 0.668 20, 4081, 21, 23, 0, 0.982 20, 4081, 21, 0, 0, 0.983 22, 4081, 21, 23, 0, 0.889 22, 4081, 21, 0, 0, 0.889 22, 0, 1, 23, 0, 0.851 22, 0, 2, 0, 0, 0.837 23, 0, 22, 23, 0, 0.833 23, 0, 22, 0, 0, 0.834 23, 22, 22, 23, 0, 0.626 23, 22, 22, 0, 0, 0.603 21, 0, 22, 23, 0, 0.828 21, 0, 22, 0, 0, 0.823 21, 22, 22, 23, 0, 0.67 21, 22, 22, 0, 
0, 0.669 23, 2048, 22, 23, 0, 0.71 23, 2048, 22, 0, 0, 0.713 23, 2070, 22, 23, 0, 0.611 23, 2070, 22, 0, 0, 0.607 21, 2048, 22, 23, 0, 0.687 21, 2048, 22, 0, 0, 0.687 21, 2070, 22, 23, 0, 0.67 21, 2070, 22, 0, 0, 0.67 21, 4081, 22, 23, 0, 0.981 21, 4081, 22, 0, 0, 0.981 23, 4081, 22, 23, 0, 0.889 23, 4081, 22, 0, 0, 0.889 23, 0, 1, 23, 0, 0.852 23, 0, 2, 0, 0, 0.856 24, 0, 23, 23, 0, 0.83 24, 0, 23, 0, 0, 0.852 24, 23, 23, 23, 0, 0.595 24, 23, 23, 0, 0, 0.597 22, 0, 23, 23, 0, 0.846 22, 0, 23, 0, 0, 0.847 22, 23, 23, 23, 0, 0.673 22, 23, 23, 0, 0, 0.673 24, 2048, 23, 23, 0, 0.691 24, 2048, 23, 0, 0, 0.694 24, 2071, 23, 23, 0, 0.611 24, 2071, 23, 0, 0, 0.593 22, 2048, 23, 23, 0, 0.688 22, 2048, 23, 0, 0, 0.692 22, 2071, 23, 23, 0, 0.675 22, 2071, 23, 0, 0, 0.673 22, 4081, 23, 23, 0, 0.982 22, 4081, 23, 0, 0, 0.981 24, 4081, 23, 23, 0, 0.889 24, 4081, 23, 0, 0, 0.889 24, 0, 1, 23, 0, 0.84 24, 0, 2, 0, 0, 0.853 25, 0, 24, 23, 0, 0.823 25, 0, 24, 0, 0, 0.83 25, 24, 24, 23, 0, 0.593 25, 24, 24, 0, 0, 0.597 23, 0, 24, 23, 0, 0.815 23, 0, 24, 0, 0, 0.815 23, 24, 24, 23, 0, 0.669 23, 24, 24, 0, 0, 0.672 25, 2048, 24, 23, 0, 0.694 25, 2048, 24, 0, 0, 0.716 25, 2072, 24, 23, 0, 0.621 25, 2072, 24, 0, 0, 0.597 23, 2048, 24, 23, 0, 0.689 23, 2048, 24, 0, 0, 0.689 23, 2072, 24, 23, 0, 0.67 23, 2072, 24, 0, 0, 0.675 23, 4081, 24, 23, 0, 0.98 23, 4081, 24, 0, 0, 0.983 25, 4081, 24, 23, 0, 0.889 25, 4081, 24, 0, 0, 0.889 25, 0, 1, 23, 0, 0.847 25, 0, 2, 0, 0, 0.851 26, 0, 25, 23, 0, 0.825 26, 0, 25, 0, 0, 0.842 26, 25, 25, 23, 0, 0.616 26, 25, 25, 0, 0, 0.626 24, 0, 25, 23, 0, 0.817 24, 0, 25, 0, 0, 0.814 24, 25, 25, 23, 0, 0.676 24, 25, 25, 0, 0, 0.673 26, 2048, 25, 23, 0, 0.707 26, 2048, 25, 0, 0, 0.707 26, 2073, 25, 23, 0, 0.607 26, 2073, 25, 0, 0, 0.593 24, 2048, 25, 23, 0, 0.686 24, 2048, 25, 0, 0, 0.691 24, 2073, 25, 23, 0, 0.672 24, 2073, 25, 0, 0, 0.673 24, 4081, 25, 23, 0, 0.981 24, 4081, 25, 0, 0, 0.977 26, 4081, 25, 23, 0, 0.889 26, 4081, 25, 0, 0, 0.889 26, 0, 1, 23, 0, 0.842 26, 0, 2, 0, 0, 0.85 27, 0, 26, 23, 0, 0.83 27, 0, 26, 0, 0, 0.848 27, 26, 26, 23, 0, 0.607 27, 26, 26, 0, 0, 0.612 25, 0, 26, 23, 0, 0.828 25, 0, 26, 0, 0, 0.826 25, 26, 26, 23, 0, 0.675 25, 26, 26, 0, 0, 0.672 27, 2048, 26, 23, 0, 0.7 27, 2048, 26, 0, 0, 0.7 27, 2074, 26, 23, 0, 0.616 27, 2074, 26, 0, 0, 0.599 25, 2048, 26, 23, 0, 0.691 25, 2048, 26, 0, 0, 0.694 25, 2074, 26, 23, 0, 0.67 25, 2074, 26, 0, 0, 0.672 25, 4081, 26, 23, 0, 0.979 25, 4081, 26, 0, 0, 0.985 27, 4081, 26, 23, 0, 0.889 27, 4081, 26, 0, 0, 0.889 27, 0, 1, 23, 0, 0.854 27, 0, 2, 0, 0, 0.853 28, 0, 27, 23, 0, 0.827 28, 0, 27, 0, 0, 0.845 28, 27, 27, 23, 0, 0.583 28, 27, 27, 0, 0, 0.585 26, 0, 27, 23, 0, 0.844 26, 0, 27, 0, 0, 0.829 26, 27, 27, 23, 0, 0.673 26, 27, 27, 0, 0, 0.671 28, 2048, 27, 23, 0, 0.697 28, 2048, 27, 0, 0, 0.713 28, 2075, 27, 23, 0, 0.602 28, 2075, 27, 0, 0, 0.602 26, 2048, 27, 23, 0, 0.688 26, 2048, 27, 0, 0, 0.692 26, 2075, 27, 23, 0, 0.673 26, 2075, 27, 0, 0, 0.67 26, 4081, 27, 23, 0, 0.98 26, 4081, 27, 0, 0, 0.977 28, 4081, 27, 23, 0, 0.889 28, 4081, 27, 0, 0, 0.889 28, 0, 1, 23, 0, 0.837 28, 0, 2, 0, 0, 0.839 29, 0, 28, 23, 0, 0.811 29, 0, 28, 0, 0, 0.843 29, 28, 28, 23, 0, 0.618 29, 28, 28, 0, 0, 0.626 27, 0, 28, 23, 0, 0.839 27, 0, 28, 0, 0, 0.832 27, 28, 28, 23, 0, 0.674 27, 28, 28, 0, 0, 0.671 29, 2048, 28, 23, 0, 0.694 29, 2048, 28, 0, 0, 0.7 29, 2076, 28, 23, 0, 0.583 29, 2076, 28, 0, 0, 0.618 27, 2048, 28, 23, 0, 0.689 27, 2048, 28, 0, 0, 0.692 27, 2076, 28, 23, 0, 0.67 27, 2076, 28, 0, 0, 0.678 27, 4081, 28, 23, 0, 
[The remaining several hundred rows of raw memrchr benchmark output are elided for readability. Each row appears to give the benchmark parameters (length, alignment, position, seek character, invert_pos) followed by the New/Old timing ratio; the ratios run roughly from 0.53 to 1.17 and are summarized by the geometric means quoted in the commit messages.] ^ permalink raw reply [flat|nested] 82+ messages in thread
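The summary statistic quoted in these commit messages ("Geometric Mean of all benchmarks New / Old") is the geometric mean of per-configuration timing ratios like the ones tabulated above. A minimal C sketch of that computation follows; the sample ratios are copied from the table, but the program itself is illustrative and is not glibc's benchtest harness.

    #include <math.h>
    #include <stdio.h>

    int
    main (void)
    {
      /* A handful of New/Old timing ratios copied from the table above.  */
      double ratios[] = { 0.978, 0.889, 0.849, 1.142, 0.857 };
      size_t n = sizeof ratios / sizeof ratios[0];
      double log_sum = 0.0;
      for (size_t i = 0; i < n; i++)
        log_sum += log (ratios[i]);
      /* Geometric mean = exp of the arithmetic mean of the logs.  */
      printf ("geomean New/Old: %.3f\n", exp (log_sum / n));
      return 0;
    }

Ratios below 1.0 mean the new implementation is faster for that configuration; the geometric mean weights each configuration equally regardless of its absolute runtime.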
* [PATCH v1 7/8] x86: Shrink code size of memchr-avx2.S 2022-06-03 4:42 [PATCH v1 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (4 preceding siblings ...) 2022-06-03 4:42 ` [PATCH v1 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein @ 2022-06-03 4:42 ` Noah Goldstein 2022-06-03 4:42 ` [PATCH v1 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 2022-06-03 4:51 ` [PATCH v1 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 7 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 4:42 UTC (permalink / raw) To: libc-alpha This is not meant as a performance optimization. The previous code was far too liberal in aligning targets and wasted code size unnecessarily. The total code size saving is: 59 bytes There are no major changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 0.967 Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memchr-avx2-rtm.S | 1 + sysdeps/x86_64/multiarch/memchr-avx2.S | 109 +++++++++++---------- 2 files changed, 60 insertions(+), 50 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S index 87b076c7c4..c4d71938c5 100644 --- a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S +++ b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S @@ -2,6 +2,7 @@ # define MEMCHR __memchr_avx2_rtm #endif +#define COND_VZEROUPPER COND_VZEROUPPER_XTEST #define ZERO_UPPER_VEC_REGISTERS_RETURN \ ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST diff --git a/sysdeps/x86_64/multiarch/memchr-avx2.S b/sysdeps/x86_64/multiarch/memchr-avx2.S index 75bd7262e0..28a01280ec 100644 --- a/sysdeps/x86_64/multiarch/memchr-avx2.S +++ b/sysdeps/x86_64/multiarch/memchr-avx2.S @@ -57,7 +57,7 @@ # define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) .section SECTION(.text),"ax",@progbits -ENTRY (MEMCHR) +ENTRY_P2ALIGN (MEMCHR, 5) # ifndef USE_AS_RAWMEMCHR /* Check for zero length. */ # ifdef __ILP32__ @@ -87,12 +87,14 @@ ENTRY (MEMCHR) # endif testl %eax, %eax jz L(aligned_more) - tzcntl %eax, %eax + bsfl %eax, %eax addq %rdi, %rax - VZEROUPPER_RETURN +L(return_vzeroupper): + ZERO_UPPER_VEC_REGISTERS_RETURN + # ifndef USE_AS_RAWMEMCHR - .p2align 5 + .p2align 4 L(first_vec_x0): /* Check if first match was before length. */ tzcntl %eax, %eax @@ -100,58 +102,31 @@ L(first_vec_x0): /* NB: Multiply length by 4 to get byte count. */ sall $2, %edx # endif - xorl %ecx, %ecx + COND_VZEROUPPER + /* Use branch instead of cmovcc so L(first_vec_x0) fits in one fetch + block. branch here as opposed to cmovcc is not that costly. Common + usage of memchr is to check if the return was NULL (if string was + known to contain CHAR user would use rawmemchr). This branch will be + highly correlated with the user branch and can be used by most + modern branch predictors to predict the user branch. */ cmpl %eax, %edx - leaq (%rdi, %rax), %rax - cmovle %rcx, %rax - VZEROUPPER_RETURN - -L(null): - xorl %eax, %eax - ret -# endif - .p2align 4 -L(cross_page_boundary): - /* Save pointer before aligning as its original value is - necessary for computer return address if byte is found or - adjusting length if it is not and this is memchr. */ - movq %rdi, %rcx - /* Align data to VEC_SIZE - 1. ALGN_PTR_REG is rcx for memchr - and rdi for rawmemchr. */ - orq $(VEC_SIZE - 1), %ALGN_PTR_REG - VPCMPEQ -(VEC_SIZE - 1)(%ALGN_PTR_REG), %ymm0, %ymm1 - vpmovmskb %ymm1, %eax # ifndef USE_AS_RAWMEMCHR - /* Calculate length until end of page (length checked for a - match).
*/ - leaq 1(%ALGN_PTR_REG), %rsi - subq %RRAW_PTR_REG, %rsi -# ifdef USE_AS_WMEMCHR - /* NB: Divide bytes by 4 to get wchar_t count. */ - shrl $2, %esi -# endif -# endif - /* Remove the leading bytes. */ - sarxl %ERAW_PTR_REG, %eax, %eax -# ifndef USE_AS_RAWMEMCHR - /* Check the end of data. */ - cmpq %rsi, %rdx - jbe L(first_vec_x0) + jle L(null) + addq %rdi, %rax + ret # endif - testl %eax, %eax - jz L(cross_page_continue) - tzcntl %eax, %eax - addq %RRAW_PTR_REG, %rax -L(return_vzeroupper): - ZERO_UPPER_VEC_REGISTERS_RETURN - .p2align 4 + .p2align 4,, 10 L(first_vec_x1): - tzcntl %eax, %eax + bsfl %eax, %eax incq %rdi addq %rdi, %rax VZEROUPPER_RETURN - +# ifndef USE_AS_RAWMEMCHR + /* First in aligning bytes here. */ +L(null): + xorl %eax, %eax + ret +# endif .p2align 4 L(first_vec_x2): tzcntl %eax, %eax @@ -340,7 +315,7 @@ L(first_vec_x1_check): incq %rdi addq %rdi, %rax VZEROUPPER_RETURN - .p2align 4 + .p2align 4,, 6 L(set_zero_end): xorl %eax, %eax VZEROUPPER_RETURN @@ -428,5 +403,39 @@ L(last_vec_x3): VZEROUPPER_RETURN # endif + .p2align 4 +L(cross_page_boundary): + /* Save pointer before aligning as its original value is necessary for + computer return address if byte is found or adjusting length if it + is not and this is memchr. */ + movq %rdi, %rcx + /* Align data to VEC_SIZE. ALGN_PTR_REG is rcx for memchr and rdi for + rawmemchr. */ + andq $-VEC_SIZE, %ALGN_PTR_REG + VPCMPEQ (%ALGN_PTR_REG), %ymm0, %ymm1 + vpmovmskb %ymm1, %eax +# ifndef USE_AS_RAWMEMCHR + /* Calculate length until end of page (length checked for a match). */ + leal VEC_SIZE(%ALGN_PTR_REG), %esi + subl %ERAW_PTR_REG, %esi +# ifdef USE_AS_WMEMCHR + /* NB: Divide bytes by 4 to get wchar_t count. */ + shrl $2, %esi +# endif +# endif + /* Remove the leading bytes. */ + sarxl %ERAW_PTR_REG, %eax, %eax +# ifndef USE_AS_RAWMEMCHR + /* Check the end of data. */ + cmpq %rsi, %rdx + jbe L(first_vec_x0) +# endif + testl %eax, %eax + jz L(cross_page_continue) + bsfl %eax, %eax + addq %RRAW_PTR_REG, %rax + VZEROUPPER_RETURN + + END (MEMCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
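The comment in the patch above about using a branch instead of cmovcc in L(first_vec_x0) is easiest to see from the caller's side. Below is a minimal C sketch of the usage pattern the comment describes; handle_found and handle_missing are hypothetical functions, and the ':' seek byte is arbitrary — nothing here is glibc code.

    #include <stddef.h>
    #include <string.h>

    extern void handle_found (const char *p);
    extern void handle_missing (void);

    void
    scan (const char *buf, size_t len)
    {
      const char *p = memchr (buf, ':', len);
      /* Nearly every memchr caller branches on a NULL result like this;
         the library's internal "match past length" branch resolves the
         same way, so modern branch predictors learn the pair together.  */
      if (p == NULL)
        handle_missing ();
      else
        handle_found (p);
    }

A caller that knew the byte was present would use rawmemchr instead, which is why the NULL-check branch is assumed to dominate.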
* [PATCH v1 8/8] x86: Shrink code size of memchr-evex.S 2022-06-03 4:42 [PATCH v1 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (5 preceding siblings ...) 2022-06-03 4:42 ` [PATCH v1 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein @ 2022-06-03 4:42 ` Noah Goldstein 2022-06-03 4:51 ` [PATCH v1 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 7 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 4:42 UTC (permalink / raw) To: libc-alpha This is not meant as a performance optimization. The previous code was far too liberal in aligning targets and wasted code size unnecessarily. The total code size saving is: 32 bytes There are no significant changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 1.000 Full xcheck passes on x86_64. --- sysdeps/x86_64/multiarch/memchr-evex.S | 32 ++++++++++++++------------ 1 file changed, 17 insertions(+), 15 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memchr-evex.S b/sysdeps/x86_64/multiarch/memchr-evex.S index cfaf02907d..ac705d66cb 100644 --- a/sysdeps/x86_64/multiarch/memchr-evex.S +++ b/sysdeps/x86_64/multiarch/memchr-evex.S @@ -88,7 +88,7 @@ # define PAGE_SIZE 4096 .section SECTION(.text),"ax",@progbits -ENTRY (MEMCHR) +ENTRY_P2ALIGN (MEMCHR, 6) # ifndef USE_AS_RAWMEMCHR /* Check for zero length. */ test %RDX_LP, %RDX_LP @@ -131,22 +131,24 @@ L(zero): xorl %eax, %eax ret - .p2align 5 + .p2align 4 L(first_vec_x0): - /* Check if first match was before length. */ - tzcntl %eax, %eax - xorl %ecx, %ecx - cmpl %eax, %edx - leaq (%rdi, %rax, CHAR_SIZE), %rax - cmovle %rcx, %rax + /* Check if first match was before length. NB: tzcnt has false data- + dependency on destination. eax already had a data-dependency on esi + so this should have no affect here. */ + tzcntl %eax, %esi +# ifdef USE_AS_WMEMCHR + leaq (%rdi, %rsi, CHAR_SIZE), %rdi +# else + addq %rsi, %rdi +# endif + xorl %eax, %eax + cmpl %esi, %edx + cmovg %rdi, %rax ret -# else - /* NB: first_vec_x0 is 17 bytes which will leave - cross_page_boundary (which is relatively cold) close enough - to ideal alignment. So only realign L(cross_page_boundary) if - rawmemchr. */ - .p2align 4 # endif + + .p2align 4 L(cross_page_boundary): /* Save pointer before aligning as its original value is necessary for computer return address if byte is found or @@ -562,6 +564,6 @@ L(last_vec_x3): leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax ret # endif - + /* 7 bytes from next cache line. */ END (MEMCHR) #endif -- 2.34.1 ^ permalink raw reply [flat|nested] 82+ messages in thread
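The rewritten L(first_vec_x0) tail in the patch above is dense; the following C rendering of the non-wmemchr path may help. This is only a sketch: __builtin_ctz stands in for tzcnt (and assumes a nonzero mask, unlike the hardware instruction, which yields the operand width for zero), and the names base, mask, and len_left are illustrative, not glibc's.

    #include <stddef.h>

    /* C sketch of the new L(first_vec_x0) return path: write the bit-scan
       result into a register already in the dependency chain, then select
       between the match pointer and NULL with a single cmovg.  */
    const char *
    first_vec_x0 (const char *base, unsigned int mask, size_t len_left)
    {
      unsigned int i = __builtin_ctz (mask);   /* tzcntl %eax, %esi */
      const char *hit = base + i;              /* addq   %rsi, %rdi */
      /* xorl %eax, %eax; cmpl %esi, %edx; cmovg %rdi, %rax */
      return len_left > i ? hit : NULL;
    }

Writing tzcnt's result to %esi rather than back to %eax sidesteps the instruction's false output dependency, since %esi is already consumed by the compare on the critical path.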
* Re: [PATCH v1 1/8] x86: Create header for VEC classes in x86 strings library 2022-06-03 4:42 [PATCH v1 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein ` (6 preceding siblings ...) 2022-06-03 4:42 ` [PATCH v1 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein @ 2022-06-03 4:51 ` Noah Goldstein 7 siblings, 0 replies; 82+ messages in thread From: Noah Goldstein @ 2022-06-03 4:51 UTC (permalink / raw) To: GNU C Library Ignore this patchset. There is an issue with it. On Thu, Jun 2, 2022 at 11:42 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > This patch does not touch any existing code and is only meant to be a > tool for future patches so that simple source files can more easily be > maintained to target multiple VEC classes. > > There is no difference in the objdump of libc.so before and after this > patch. > --- > sysdeps/x86_64/multiarch/avx-rtm-vecs.h | 33 +++++++++ > sysdeps/x86_64/multiarch/avx-vecs.h | 53 ++++++++++++++ > sysdeps/x86_64/multiarch/avx2-rtm-vecs.h | 33 +++++++++ > sysdeps/x86_64/multiarch/avx2-vecs.h | 30 ++++++++ > sysdeps/x86_64/multiarch/evex256-vecs.h | 50 +++++++++++++ > sysdeps/x86_64/multiarch/evex512-vecs.h | 49 +++++++++++++ > sysdeps/x86_64/multiarch/sse2-vecs.h | 48 +++++++++++++ > sysdeps/x86_64/multiarch/vec-macros.h | 90 ++++++++++++++++++++++++ > 8 files changed, 386 insertions(+) > create mode 100644 sysdeps/x86_64/multiarch/avx-rtm-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/avx-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/avx2-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/evex256-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/evex512-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/sse2-vecs.h > create mode 100644 sysdeps/x86_64/multiarch/vec-macros.h > > diff --git a/sysdeps/x86_64/multiarch/avx-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > new file mode 100644 > index 0000000000..c00b83ea0e > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/avx-rtm-vecs.h > @@ -0,0 +1,33 @@ > +/* Common config for AVX-RTM VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. 
*/ > + > +#ifndef _AVX_RTM_VECS_H > +#define _AVX_RTM_VECS_H 1 > + > +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ > + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > + > +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) > + > +#define SECTION(p) p##.avx.rtm > + > +#define USE_WITH_RTM 1 > +#include "avx-vecs.h" > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/avx-vecs.h b/sysdeps/x86_64/multiarch/avx-vecs.h > new file mode 100644 > index 0000000000..3b84d7e8b2 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/avx-vecs.h > @@ -0,0 +1,53 @@ > +/* Common config for AVX VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _AVX_VECS_H > +#define _AVX_VECS_H 1 > + > +#ifdef HAS_VEC > +# error "Multiple VEC configs included!" > +#endif > + > +#define HAS_VEC 1 > +#include "vec-macros.h" > + > +#ifndef USE_WITH_AVX2 > +# define USE_WITH_AVX 1 > +#endif > +/* Included by RTM version. */ > +#ifndef SECTION > +# define SECTION(p) p##.avx > +#endif > + > +#define VEC_SIZE 32 > +/* 4-byte mov instructions with AVX2. */ > +#define MOV_SIZE 4 > +/* 1 (ret) + 3 (vzeroupper). */ > +#define RET_SIZE 4 > +#define VZEROUPPER vzeroupper > + > +#define VMOVU vmovdqu > +#define VMOVA vmovdqa > +#define VMOVNT vmovntdq > + > +/* Often need to access xmm portion. */ > +#define VEC_xmm VEC_any_xmm > +#define VEC VEC_any_ymm > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > new file mode 100644 > index 0000000000..a5d46e8c66 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/avx2-rtm-vecs.h > @@ -0,0 +1,33 @@ > +/* Common config for AVX2-RTM VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. 
*/ > + > +#ifndef _AVX2_RTM_VECS_H > +#define _AVX2_RTM_VECS_H 1 > + > +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ > + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > + > +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) > + > +#define SECTION(p) p##.avx.rtm > + > +#define USE_WITH_RTM 1 > +#include "avx2-vecs.h" > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/avx2-vecs.h b/sysdeps/x86_64/multiarch/avx2-vecs.h > new file mode 100644 > index 0000000000..4c029b4621 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/avx2-vecs.h > @@ -0,0 +1,30 @@ > +/* Common config for AVX2 VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _AVX2_VECS_H > +#define _AVX2_VECS_H 1 > + > +#define USE_WITH_AVX2 1 > +/* Included by RTM version. */ > +#ifndef SECTION > +# define SECTION(p) p##.avx > +#endif > +#include "avx-vecs.h" > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/evex256-vecs.h b/sysdeps/x86_64/multiarch/evex256-vecs.h > new file mode 100644 > index 0000000000..ed7a32b0ec > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/evex256-vecs.h > @@ -0,0 +1,50 @@ > +/* Common config for EVEX256 VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _EVEX256_VECS_H > +#define _EVEX256_VECS_H 1 > + > +#ifdef HAS_VEC > +# error "Multiple VEC configs included!" > +#endif > + > +#define HAS_VEC 1 > +#include "vec-macros.h" > + > +#define USE_WITH_EVEX256 1 > +#ifndef SECTION > +# define SECTION(p) p##.evex > +#endif > + > +#define VEC_SIZE 32 > +/* 6-byte mov instructions with EVEX. */ > +#define MOV_SIZE 6 > +/* No vzeroupper needed. */ > +#define RET_SIZE 1 > +#define VZEROUPPER > + > +#define VMOVU vmovdqu64 > +#define VMOVA vmovdqa64 > +#define VMOVNT vmovntdq > + > +/* Often need to access xmm portion. 
*/ > +#define VEC_xmm VEC_hi_xmm > +#define VEC VEC_hi_ymm > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/evex512-vecs.h b/sysdeps/x86_64/multiarch/evex512-vecs.h > new file mode 100644 > index 0000000000..53597734fc > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/evex512-vecs.h > @@ -0,0 +1,49 @@ > +/* Common config for EVEX512 VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _EVEX512_VECS_H > +#define _EVEX512_VECS_H 1 > + > +#ifdef HAS_VEC > +# error "Multiple VEC configs included!" > +#endif > + > +#define HAS_VEC 1 > +#include "vec-macros.h" > + > +#define USE_WITH_EVEX512 1 > +#define SECTION(p) p##.evex512 > + > +#define VEC_SIZE 64 > +/* 6-byte mov instructions with EVEX. */ > +#define MOV_SIZE 6 > +/* No vzeroupper needed. */ > +#define RET_SIZE 1 > +#define VZEROUPPER > + > +#define VMOVU vmovdqu64 > +#define VMOVA vmovdqa64 > +#define VMOVNT vmovntdq > + > +/* Often need to access xmm/ymm portion. */ > +#define VEC_xmm VEC_hi_xmm > +#define VEC_ymm VEC_hi_ymm > +#define VEC VEC_hi_zmm > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/sse2-vecs.h b/sysdeps/x86_64/multiarch/sse2-vecs.h > new file mode 100644 > index 0000000000..b645b93e3d > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/sse2-vecs.h > @@ -0,0 +1,48 @@ > +/* Common config for SSE2 VECs > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _SSE2_VECS_H > +#define _SSE2_VECS_H 1 > + > +#ifdef HAS_VEC > +# error "Multiple VEC configs included!" > +#endif > + > +#define HAS_VEC 1 > +#include "vec-macros.h" > + > +#define USE_WITH_SSE2 1 > +#define SECTION(p) p > + > +#define VEC_SIZE 16 > +/* 3-byte mov instructions with SSE2. */ > +#define MOV_SIZE 3 > +/* No vzeroupper needed. 
*/ > +#define RET_SIZE 1 > + > +#define VMOVU movups > +#define VMOVA movaps > +#define VMOVNT movntdq > +#define VZEROUPPER > + > +#define VEC_xmm VEC_any_xmm > +#define VEC VEC_any_xmm > + > + > +#endif > diff --git a/sysdeps/x86_64/multiarch/vec-macros.h b/sysdeps/x86_64/multiarch/vec-macros.h > new file mode 100644 > index 0000000000..4dae4503c8 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/vec-macros.h > @@ -0,0 +1,90 @@ > +/* Macro helpers for VEC_{type}({vec_num}) > + All versions must be listed in ifunc-impl-list.c. > + Copyright (C) 2022 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _VEC_MACROS_H > +# define _VEC_MACROS_H 1 > + > +# ifndef HAS_VEC > +# error "Never include this file directly. Always include a vector config." > +# endif > + > +/* Defines so we can use SSE2 / AVX2 / EVEX / EVEX512 encoding with same > + VEC(N) values. */ > +#define VEC_hi_xmm0 xmm16 > +#define VEC_hi_xmm1 xmm17 > +#define VEC_hi_xmm2 xmm18 > +#define VEC_hi_xmm3 xmm19 > +#define VEC_hi_xmm4 xmm20 > +#define VEC_hi_xmm5 xmm21 > +#define VEC_hi_xmm6 xmm22 > +#define VEC_hi_xmm7 xmm23 > +#define VEC_hi_xmm8 xmm24 > +#define VEC_hi_xmm9 xmm25 > +#define VEC_hi_xmm10 xmm26 > +#define VEC_hi_xmm11 xmm27 > +#define VEC_hi_xmm12 xmm28 > +#define VEC_hi_xmm13 xmm29 > +#define VEC_hi_xmm14 xmm30 > +#define VEC_hi_xmm15 xmm31 > + > +#define VEC_hi_ymm0 ymm16 > +#define VEC_hi_ymm1 ymm17 > +#define VEC_hi_ymm2 ymm18 > +#define VEC_hi_ymm3 ymm19 > +#define VEC_hi_ymm4 ymm20 > +#define VEC_hi_ymm5 ymm21 > +#define VEC_hi_ymm6 ymm22 > +#define VEC_hi_ymm7 ymm23 > +#define VEC_hi_ymm8 ymm24 > +#define VEC_hi_ymm9 ymm25 > +#define VEC_hi_ymm10 ymm26 > +#define VEC_hi_ymm11 ymm27 > +#define VEC_hi_ymm12 ymm28 > +#define VEC_hi_ymm13 ymm29 > +#define VEC_hi_ymm14 ymm30 > +#define VEC_hi_ymm15 ymm31 > + > +#define VEC_hi_zmm0 zmm16 > +#define VEC_hi_zmm1 zmm17 > +#define VEC_hi_zmm2 zmm18 > +#define VEC_hi_zmm3 zmm19 > +#define VEC_hi_zmm4 zmm20 > +#define VEC_hi_zmm5 zmm21 > +#define VEC_hi_zmm6 zmm22 > +#define VEC_hi_zmm7 zmm23 > +#define VEC_hi_zmm8 zmm24 > +#define VEC_hi_zmm9 zmm25 > +#define VEC_hi_zmm10 zmm26 > +#define VEC_hi_zmm11 zmm27 > +#define VEC_hi_zmm12 zmm28 > +#define VEC_hi_zmm13 zmm29 > +#define VEC_hi_zmm14 zmm30 > +#define VEC_hi_zmm15 zmm31 > + > +# define PRIMITIVE_VEC(vec, num) vec##num > + > +# define VEC_any_xmm(i) PRIMITIVE_VEC(xmm, i) > +# define VEC_any_ymm(i) PRIMITIVE_VEC(ymm, i) > +# define VEC_any_zmm(i) PRIMITIVE_VEC(zmm, i) > + > +# define VEC_hi_xmm(i) PRIMITIVE_VEC(VEC_hi_xmm, i) > +# define VEC_hi_ymm(i) PRIMITIVE_VEC(VEC_hi_ymm, i) > +# define VEC_hi_zmm(i) PRIMITIVE_VEC(VEC_hi_zmm, i) > + > +#endif > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 82+ messages in thread
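To see how the vec-macros.h plumbing quoted above resolves, here is a minimal compilable sketch. The first four macro definitions are copied from the quoted patch (vec-macros.h and evex256-vecs.h); the STR stringization helpers and main are added only for the demo.

    #include <stdio.h>

    /* From vec-macros.h / evex256-vecs.h in the quoted patch.  */
    #define VEC_hi_ymm1 ymm17
    #define PRIMITIVE_VEC(vec, num) vec##num
    #define VEC_hi_ymm(i) PRIMITIVE_VEC(VEC_hi_ymm, i)
    #define VEC VEC_hi_ymm

    /* Demo-only stringization helpers.  */
    #define STR_(x) #x
    #define STR(x) STR_(x)

    int
    main (void)
    {
      /* VEC(1) -> VEC_hi_ymm(1) -> VEC_hi_ymm1 -> ymm17.  */
      puts (STR (VEC (1)));   /* prints "ymm17" */
      return 0;
    }

The token-pasting indirection lets the same assembly source name VEC(1) and have it land on xmm1/ymm1 for SSE2/AVX2 builds but on the EVEX-only upper bank (ymm17 here) for EVEX builds, which avoids the need for vzeroupper on return.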
end of thread, other threads:[~2022-07-14 2:43 UTC | newest] Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-06-03 4:42 [PATCH v1 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-03 4:42 ` [PATCH v1 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein 2022-06-03 23:12 ` H.J. Lu 2022-06-03 23:33 ` Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 5/8] x86: Optimize memrchr-evex.S Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein 2022-06-03 20:04 ` [PATCH v2 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 2022-06-03 23:09 ` [PATCH v2 1/8] x86: Create header for VEC classes in x86 strings library H.J. Lu 2022-06-03 23:49 ` Noah Goldstein 2022-06-03 23:49 ` [PATCH v3 " Noah Goldstein 2022-06-03 23:49 ` [PATCH v3 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein 2022-06-06 21:30 ` H.J. Lu 2022-06-06 22:38 ` Noah Goldstein 2022-06-03 23:49 ` [PATCH v3 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein 2022-06-03 23:49 ` [PATCH v3 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein 2022-06-03 23:49 ` [PATCH v3 5/8] x86: Optimize memrchr-evex.S Noah Goldstein 2022-06-03 23:49 ` [PATCH v3 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein 2022-06-03 23:50 ` [PATCH v3 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein 2022-06-03 23:50 ` [PATCH v3 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 2022-06-06 22:37 ` [PATCH v4 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-06 22:37 ` [PATCH v4 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein 2022-06-07 2:45 ` H.J. Lu 2022-07-14 2:12 ` Sunil Pandey 2022-06-06 22:37 ` [PATCH v4 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein 2022-06-07 2:44 ` H.J. Lu 2022-06-07 4:10 ` Noah Goldstein 2022-06-06 22:37 ` [PATCH v4 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein 2022-06-06 22:37 ` [PATCH v4 5/8] x86: Optimize memrchr-evex.S Noah Goldstein 2022-06-07 2:41 ` H.J. Lu 2022-06-07 4:09 ` Noah Goldstein 2022-06-07 4:12 ` Noah Goldstein 2022-06-06 22:37 ` [PATCH v4 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein 2022-06-07 2:35 ` H.J. 
Lu 2022-06-07 4:06 ` Noah Goldstein 2022-06-06 22:37 ` [PATCH v4 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein 2022-06-06 22:37 ` [PATCH v4 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 5/8] x86: Optimize memrchr-evex.S Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein 2022-06-07 4:05 ` [PATCH v5 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 2022-06-07 4:11 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein 2022-06-07 4:11 ` [PATCH v6 2/8] x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` Noah Goldstein 2022-06-07 4:11 ` [PATCH v6 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein 2022-06-07 18:03 ` H.J. Lu 2022-06-07 4:11 ` [PATCH v6 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein 2022-06-07 18:04 ` H.J. Lu 2022-07-14 2:19 ` Sunil Pandey 2022-06-07 4:11 ` [PATCH v6 5/8] x86: Optimize memrchr-evex.S Noah Goldstein 2022-06-07 18:21 ` H.J. Lu 2022-07-14 2:21 ` Sunil Pandey 2022-06-07 4:11 ` [PATCH v6 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein 2022-06-07 18:17 ` H.J. Lu 2022-07-14 2:26 ` Sunil Pandey 2022-07-14 2:43 ` Noah Goldstein 2022-06-07 4:11 ` [PATCH v6 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein 2022-06-07 18:18 ` H.J. Lu 2022-07-14 2:31 ` Sunil Pandey 2022-07-14 2:41 ` Noah Goldstein 2022-06-07 4:11 ` [PATCH v6 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 2022-06-07 18:19 ` H.J. Lu 2022-07-14 2:32 ` Sunil Pandey 2022-06-07 18:04 ` [PATCH v6 1/8] x86: Create header for VEC classes in x86 strings library H.J. Lu 2022-07-14 2:07 ` Sunil Pandey 2022-06-03 4:42 ` [PATCH v1 3/8] Benchtests: Improve memrchr benchmarks Noah Goldstein 2022-06-03 4:42 ` [PATCH v1 4/8] x86: Optimize memrchr-sse2.S Noah Goldstein 2022-06-03 4:47 ` Noah Goldstein 2022-06-03 4:42 ` [PATCH v1 5/8] x86: Optimize memrchr-evex.S Noah Goldstein 2022-06-03 4:49 ` Noah Goldstein 2022-06-03 4:42 ` [PATCH v1 6/8] x86: Optimize memrchr-avx2.S Noah Goldstein 2022-06-03 4:50 ` Noah Goldstein 2022-06-03 4:42 ` [PATCH v1 7/8] x86: Shrink code size of memchr-avx2.S Noah Goldstein 2022-06-03 4:42 ` [PATCH v1 8/8] x86: Shrink code size of memchr-evex.S Noah Goldstein 2022-06-03 4:51 ` [PATCH v1 1/8] x86: Create header for VEC classes in x86 strings library Noah Goldstein