From: Andrea Corallo <andrea.corallo@arm.com>
To: libc-alpha@sourceware.org
Cc: Szabolcs.Nagy@arm.com, Wilco.Dijkstra@arm.com, nd@arm.com
Subject: [PATCH] aarch64: MTE compatible strchr
Date: Wed, 03 Jun 2020 11:49:04 +0200
Message-ID: <gkrsgfc8oof.fsf@arm.com>
Hi all,
I'd like to submit this patch, which introduces an Arm MTE-compatible strchr
implementation.
Below is a performance comparison from the strchr benchmark, run on
Cortex-A72, Cortex-A53 and Neoverse N1 (a perf-uplift above 1.00x means the
new implementation is faster).
| length | alignment | perf-uplift A72 | perf-uplift A53 | perf-uplift N1 |
|--------+-----------+-----------------+-----------------+----------------|
| 32 | 0 | 1.91x | 1.10x | 1.33x |
| 32 | 1 | 2.06x | 1.22x | 1.41x |
| 64 | 0 | 1.61x | 1.00x | 1.18x |
| 64 | 2 | 1.69x | 1.08x | 1.15x |
| 128 | 0 | 1.51x | 0.85x | 1.06x |
| 128 | 3 | 1.57x | 0.90x | 1.15x |
| 256 | 0 | 1.37x | 0.84x | 1.09x |
| 256 | 4 | 1.41x | 0.83x | 1.15x |
| 512 | 0 | 1.18x | 0.80x | 1.09x |
| 512 | 5 | 1.19x | 0.82x | 1.14x |
| 1024 | 0 | 1.15x | 0.78x | 1.09x |
| 1024 | 6 | 1.05x | 0.79x | 1.09x |
| 2048 | 0 | 1.15x | 0.76x | 1.08x |
| 2048 | 7 | 1.13x | 0.77x | 1.08x |
| 64 | 1 | 1.28x | 1.08x | 1.33x |
| 64 | 1 | 1.28x | 1.08x | 1.31x |
| 64 | 2 | 1.28x | 1.08x | 1.31x |
| 64 | 2 | 1.28x | 1.08x | 1.15x |
| 64 | 3 | 1.28x | 1.08x | 1.15x |
| 64 | 3 | 1.28x | 1.08x | 1.31x |
| 64 | 4 | 1.28x | 1.08x | 1.31x |
| 64 | 4 | 1.28x | 1.08x | 1.31x |
| 64 | 5 | 1.28x | 1.08x | 1.31x |
| 64 | 5 | 1.28x | 1.08x | 1.31x |
| 64 | 6 | 1.28x | 1.08x | 1.31x |
| 64 | 6 | 1.28x | 1.08x | 1.31x |
| 64 | 7 | 1.28x | 1.08x | 1.31x |
| 64 | 7 | 1.28x | 1.08x | 1.31x |
| 0 | 0 | 1.32x | 1.63x | 1.53x |
| 0 | 0 | 1.32x | 1.63x | 1.53x |
| 1 | 0 | 1.31x | 1.64x | 1.53x |
| 1 | 0 | 1.32x | 1.67x | 1.53x |
| 2 | 0 | 1.32x | 1.63x | 1.52x |
| 2 | 0 | 1.32x | 1.69x | 1.52x |
| 3 | 0 | 1.32x | 1.67x | 1.51x |
| 3 | 0 | 1.32x | 1.66x | 1.52x |
| 4 | 0 | 1.32x | 1.69x | 1.52x |
| 4 | 0 | 1.32x | 1.69x | 1.52x |
| 5 | 0 | 1.32x | 1.69x | 1.26x |
| 5 | 0 | 1.32x | 1.69x | 1.26x |
| 6 | 0 | 1.32x | 1.69x | 1.26x |
| 6 | 0 | 1.32x | 1.68x | 1.51x |
| 7 | 0 | 1.32x | 1.63x | 1.54x |
| 7 | 0 | 1.32x | 1.63x | 1.52x |
| 8 | 0 | 1.32x | 1.69x | 1.53x |
| 8 | 0 | 1.32x | 1.65x | 1.53x |
| 9 | 0 | 1.32x | 1.63x | 1.54x |
| 9 | 0 | 1.32x | 1.68x | 1.52x |
| 10 | 0 | 1.32x | 1.63x | 1.52x |
| 10 | 0 | 1.32x | 1.69x | 1.51x |
| 11 | 0 | 1.32x | 1.64x | 1.52x |
| 11 | 0 | 1.32x | 1.63x | 1.52x |
| 12 | 0 | 1.32x | 1.64x | 1.52x |
| 12 | 0 | 1.32x | 1.68x | 1.54x |
| 13 | 0 | 1.32x | 1.63x | 1.53x |
| 13 | 0 | 1.32x | 1.67x | 1.52x |
| 14 | 0 | 1.32x | 1.65x | 1.53x |
| 14 | 0 | 1.32x | 1.63x | 1.52x |
| 15 | 0 | 1.32x | 1.67x | 1.52x |
| 15 | 0 | 1.32x | 1.65x | 1.26x |
| 16 | 0 | 1.08x | 1.00x | 1.03x |
| 16 | 0 | 1.08x | 1.00x | 1.03x |
| 17 | 0 | 1.09x | 1.00x | 1.03x |
| 17 | 0 | 1.09x | 1.00x | 1.03x |
| 18 | 0 | 1.09x | 1.00x | 1.03x |
| 18 | 0 | 1.08x | 1.00x | 1.03x |
| 19 | 0 | 1.08x | 1.00x | 1.03x |
| 19 | 0 | 1.08x | 1.00x | 1.03x |
| 20 | 0 | 1.08x | 1.00x | 1.03x |
| 20 | 0 | 1.09x | 1.00x | 1.03x |
| 21 | 0 | 1.08x | 1.00x | 1.03x |
| 21 | 0 | 1.08x | 1.00x | 1.08x |
| 22 | 0 | 1.09x | 1.00x | 1.09x |
| 22 | 0 | 1.08x | 1.00x | 1.09x |
| 23 | 0 | 1.08x | 1.00x | 1.08x |
| 23 | 0 | 1.08x | 1.00x | 1.08x |
| 24 | 0 | 1.08x | 1.00x | 1.08x |
| 24 | 0 | 1.08x | 1.00x | 1.09x |
| 25 | 0 | 1.08x | 1.00x | 1.10x |
| 25 | 0 | 1.08x | 1.00x | 1.09x |
| 26 | 0 | 1.08x | 1.00x | 1.08x |
| 26 | 0 | 1.08x | 1.00x | 1.08x |
| 27 | 0 | 1.09x | 1.00x | 1.08x |
| 27 | 0 | 1.08x | 1.00x | 1.08x |
| 28 | 0 | 1.08x | 1.00x | 1.08x |
| 28 | 0 | 1.08x | 1.00x | 1.08x |
| 29 | 0 | 1.08x | 1.00x | 1.09x |
| 29 | 0 | 1.08x | 1.00x | 1.08x |
| 30 | 0 | 1.08x | 1.00x | 1.08x |
| 30 | 0 | 1.08x | 1.00x | 1.08x |
| 31 | 0 | 1.09x | 1.00x | 1.08x |
| 31 | 0 | 1.08x | 1.00x | 1.08x |
| 32 | 0 | 1.27x | 1.10x | 1.25x |
| 32 | 1 | 1.38x | 1.21x | 1.38x |
| 64 | 0 | 1.17x | 1.00x | 1.20x |
| 64 | 2 | 1.28x | 1.08x | 1.33x |
| 128 | 0 | 1.17x | 0.85x | 1.17x |
| 128 | 3 | 1.23x | 0.90x | 1.29x |
| 256 | 0 | 1.17x | 0.84x | 1.15x |
| 256 | 4 | 1.21x | 0.83x | 1.21x |
| 512 | 0 | 1.16x | 0.80x | 1.08x |
| 512 | 5 | 1.19x | 0.82x | 1.14x |
| 1024 | 0 | 1.15x | 0.78x | 1.09x |
| 1024 | 6 | 1.05x | 0.79x | 1.09x |
| 2048 | 0 | 1.15x | 0.76x | 1.08x |
| 2048 | 7 | 1.14x | 0.77x | 1.08x |
| 64 | 1 | 1.20x | 1.08x | 1.33x |
| 64 | 1 | 1.28x | 1.08x | 1.33x |
| 64 | 2 | 1.28x | 1.08x | 1.35x |
| 64 | 2 | 1.28x | 1.08x | 1.35x |
| 64 | 3 | 1.28x | 1.08x | 1.15x |
| 64 | 3 | 1.28x | 1.08x | 1.15x |
| 64 | 4 | 1.28x | 1.08x | 1.35x |
| 64 | 4 | 1.28x | 1.08x | 1.31x |
| 64 | 5 | 1.28x | 1.08x | 1.35x |
| 64 | 5 | 1.28x | 1.08x | 1.35x |
| 64 | 6 | 1.28x | 1.08x | 1.31x |
| 64 | 6 | 1.28x | 1.08x | 1.31x |
| 64 | 7 | 1.28x | 1.08x | 1.35x |
| 64 | 7 | 1.28x | 1.08x | 1.35x |
| 0 | 0 | 1.32x | 1.68x | 1.52x |
| 0 | 0 | 1.32x | 1.63x | 1.53x |
| 1 | 0 | 1.32x | 1.69x | 1.52x |
| 1 | 0 | 1.32x | 1.68x | 1.52x |
| 2 | 0 | 1.32x | 1.69x | 1.51x |
| 2 | 0 | 1.32x | 1.69x | 1.52x |
| 3 | 0 | 1.32x | 1.67x | 1.51x |
| 3 | 0 | 1.32x | 1.69x | 1.52x |
| 4 | 0 | 1.32x | 1.67x | 1.52x |
| 4 | 0 | 1.32x | 1.69x | 1.56x |
| 5 | 0 | 1.32x | 1.69x | 1.52x |
| 5 | 0 | 1.32x | 1.69x | 1.52x |
| 6 | 0 | 1.32x | 1.69x | 1.51x |
| 6 | 0 | 1.32x | 1.69x | 1.52x |
| 7 | 0 | 1.32x | 1.63x | 1.52x |
| 7 | 0 | 1.32x | 1.63x | 1.53x |
| 8 | 0 | 1.32x | 1.65x | 1.52x |
| 8 | 0 | 1.32x | 1.63x | 1.52x |
| 9 | 0 | 1.32x | 1.63x | 1.51x |
| 9 | 0 | 1.32x | 1.64x | 1.52x |
| 10 | 0 | 1.32x | 1.63x | 1.52x |
| 10 | 0 | 1.32x | 1.65x | 1.52x |
| 11 | 0 | 1.32x | 1.63x | 1.52x |
| 11 | 0 | 1.32x | 1.63x | 1.51x |
| 12 | 0 | 1.32x | 1.63x | 1.53x |
| 12 | 0 | 1.32x | 1.63x | 1.51x |
| 13 | 0 | 1.32x | 1.63x | 1.52x |
| 13 | 0 | 1.32x | 1.65x | 1.52x |
| 14 | 0 | 1.32x | 1.66x | 1.53x |
| 14 | 0 | 1.32x | 1.64x | 1.26x |
| 15 | 0 | 1.32x | 1.68x | 1.26x |
| 15 | 0 | 1.32x | 1.69x | 1.26x |
| 16 | 0 | 1.08x | 1.00x | 1.03x |
| 16 | 0 | 1.08x | 1.00x | 1.05x |
| 17 | 0 | 1.08x | 1.00x | 1.08x |
| 17 | 0 | 1.09x | 1.00x | 1.03x |
| 18 | 0 | 1.09x | 1.00x | 1.08x |
| 18 | 0 | 1.08x | 1.00x | 1.08x |
| 19 | 0 | 1.08x | 1.00x | 1.08x |
| 19 | 0 | 1.08x | 1.00x | 1.09x |
| 20 | 0 | 1.09x | 1.00x | 1.08x |
| 20 | 0 | 1.08x | 1.00x | 1.08x |
| 21 | 0 | 1.08x | 1.00x | 1.09x |
| 21 | 0 | 1.08x | 1.00x | 1.08x |
| 22 | 0 | 1.09x | 1.00x | 1.08x |
| 22 | 0 | 1.08x | 1.00x | 1.09x |
| 23 | 0 | 1.08x | 1.00x | 1.08x |
| 23 | 0 | 1.08x | 1.00x | 1.08x |
| 24 | 0 | 1.08x | 1.00x | 1.08x |
| 24 | 0 | 1.08x | 1.00x | 1.08x |
| 25 | 0 | 1.08x | 1.00x | 1.08x |
| 25 | 0 | 1.08x | 1.00x | 1.09x |
| 26 | 0 | 1.08x | 1.00x | 1.08x |
| 26 | 0 | 1.08x | 1.00x | 1.09x |
| 27 | 0 | 1.09x | 1.00x | 1.08x |
| 27 | 0 | 1.08x | 1.00x | 1.08x |
| 28 | 0 | 1.08x | 1.00x | 1.08x |
| 28 | 0 | 1.09x | 1.00x | 1.03x |
| 29 | 0 | 1.08x | 1.00x | 1.03x |
| 29 | 0 | 1.08x | 1.00x | 1.03x |
| 30 | 0 | 1.08x | 1.00x | 1.08x |
| 30 | 0 | 1.08x | 1.00x | 1.08x |
| 31 | 0 | 1.09x | 1.00x | 1.08x |
| 31 | 0 | 1.08x | 1.00x | 1.08x |
The patch passes the glibc tests.
Regards
Andrea
8< --- 8< --- 8<
Introduce an Arm MTE compatible strchr implementation.
Benchmarks on Cortex-A72, Cortex-A53 and Neoverse N1 show no
performance regressions.
Co-authored-by: Wilco Dijkstra <wilco.dijkstra@arm.com>
[-- Attachment #2: strchr-mte.patch --]
[-- Type: text/x-diff, Size: 6241 bytes --]
diff --git a/sysdeps/aarch64/strchr.S b/sysdeps/aarch64/strchr.S
index 4a75e73945..fd1b941666 100644
--- a/sysdeps/aarch64/strchr.S
+++ b/sysdeps/aarch64/strchr.S
@@ -22,118 +22,98 @@
/* Assumptions:
*
- * ARMv8-a, AArch64
+ * ARMv8-a, AArch64, Advanced SIMD.
+ * MTE compatible.
*/
-/* Arguments and results. */
#define srcin x0
#define chrin w1
-
#define result x0
#define src x2
-#define tmp1 x3
-#define wtmp2 w4
-#define tmp3 x5
+#define tmp1 x1
+#define wtmp2 w3
+#define tmp3 x3
#define vrepchr v0
-#define vdata1 v1
-#define vdata2 v2
-#define vhas_nul1 v3
-#define vhas_nul2 v4
-#define vhas_chr1 v5
-#define vhas_chr2 v6
-#define vrepmask_0 v7
-#define vrepmask_c v16
-#define vend1 v17
-#define vend2 v18
-
- /* Core algorithm.
- For each 32-byte hunk we calculate a 64-bit syndrome value, with
- two bits per byte (LSB is always in bits 0 and 1, for both big
- and little-endian systems). Bit 0 is set iff the relevant byte
- matched the requested character. Bit 1 is set iff the
- relevant byte matched the NUL end of string (we trigger off bit0
- for the special case of looking for NUL). Since the bits
- in the syndrome reflect exactly the order in which things occur
- in the original string a count_trailing_zeros() operation will
- identify exactly which byte is causing the termination, and why. */
-
-/* Locals and temporaries. */
+#define vdata v1
+#define qdata q1
+#define vhas_nul v2
+#define vhas_chr v3
+#define vrepmask v4
+#define vrepmask2 v5
+#define vend v6
+#define dend d6
+
+/* Core algorithm.
+
+ For each 16-byte chunk we calculate a 64-bit syndrome value with four bits
+ per byte. For even bytes, bits 0-1 are set if the relevant byte matched the
+ requested character, bits 2-3 are set if the byte is NUL (or matched), and
+ bits 4-7 are not used and must be zero if none of bits 0-3 are set. Odd
+ bytes set bits 4-7 so that adjacent bytes can be merged. Since the bits
+ in the syndrome reflect the order in which things occur in the original
+ string, counting trailing zeros identifies exactly which byte matched. */
ENTRY (strchr)
DELOUSE (0)
- mov wtmp2, #0x0401
- movk wtmp2, #0x4010, lsl #16
+ bic src, srcin, 15
dup vrepchr.16b, chrin
- bic src, srcin, #31
- dup vrepmask_c.4s, wtmp2
- ands tmp1, srcin, #31
- add vrepmask_0.4s, vrepmask_c.4s, vrepmask_c.4s // lsl #1
- b.eq L(loop)
-
- /* Input string is not 32-byte aligned. Rather than forcing
- the padding bytes to a safe value, we calculate the syndrome
- for all the bytes, but then mask off those bits of the
- syndrome that are related to the padding. */
- ld1 {vdata1.16b, vdata2.16b}, [src], #32
- neg tmp1, tmp1
- cmeq vhas_nul1.16b, vdata1.16b, #0
- cmeq vhas_chr1.16b, vdata1.16b, vrepchr.16b
- cmeq vhas_nul2.16b, vdata2.16b, #0
- cmeq vhas_chr2.16b, vdata2.16b, vrepchr.16b
- and vhas_nul1.16b, vhas_nul1.16b, vrepmask_0.16b
- and vhas_nul2.16b, vhas_nul2.16b, vrepmask_0.16b
- and vhas_chr1.16b, vhas_chr1.16b, vrepmask_c.16b
- and vhas_chr2.16b, vhas_chr2.16b, vrepmask_c.16b
- orr vend1.16b, vhas_nul1.16b, vhas_chr1.16b
- orr vend2.16b, vhas_nul2.16b, vhas_chr2.16b
- lsl tmp1, tmp1, #1
- addp vend1.16b, vend1.16b, vend2.16b // 256->128
- mov tmp3, #~0
- addp vend1.16b, vend1.16b, vend2.16b // 128->64
- lsr tmp1, tmp3, tmp1
-
- mov tmp3, vend1.2d[0]
- bic tmp1, tmp3, tmp1 // Mask padding bits.
- cbnz tmp1, L(tail)
+ ld1 {vdata.16b}, [src]
+ mov wtmp2, 0x3003
+ dup vrepmask.8h, wtmp2
+ cmeq vhas_nul.16b, vdata.16b, 0
+ cmeq vhas_chr.16b, vdata.16b, vrepchr.16b
+ mov wtmp2, 0xf00f
+ dup vrepmask2.8h, wtmp2
+
+ bit vhas_nul.16b, vhas_chr.16b, vrepmask.16b
+ and vhas_nul.16b, vhas_nul.16b, vrepmask2.16b
+ lsl tmp3, srcin, 2
+ addp vend.16b, vhas_nul.16b, vhas_nul.16b /* 128->64 */
+
+ fmov tmp1, dend
+ lsr tmp1, tmp1, tmp3
+ cbz tmp1, L(loop)
+
+ rbit tmp1, tmp1
+ clz tmp1, tmp1
+ /* Tmp1 is a multiple of 4 if the target character was
+ found first. Otherwise we've found the end of string. */
+ tst tmp1, 2
+ add result, srcin, tmp1, lsr 2
+ csel result, result, xzr, eq
+ ret
+ .p2align 4
L(loop):
- ld1 {vdata1.16b, vdata2.16b}, [src], #32
- cmeq vhas_nul1.16b, vdata1.16b, #0
- cmeq vhas_chr1.16b, vdata1.16b, vrepchr.16b
- cmeq vhas_nul2.16b, vdata2.16b, #0
- cmeq vhas_chr2.16b, vdata2.16b, vrepchr.16b
- /* Use a fast check for the termination condition. */
- orr vend1.16b, vhas_nul1.16b, vhas_chr1.16b
- orr vend2.16b, vhas_nul2.16b, vhas_chr2.16b
- orr vend1.16b, vend1.16b, vend2.16b
- addp vend1.2d, vend1.2d, vend1.2d
- mov tmp1, vend1.2d[0]
+ ldr qdata, [src, 16]!
+ cmeq vhas_chr.16b, vdata.16b, vrepchr.16b
+ cmhs vhas_nul.16b, vhas_chr.16b, vdata.16b
+ umaxp vend.16b, vhas_nul.16b, vhas_nul.16b
+ fmov tmp1, dend
cbz tmp1, L(loop)
- /* Termination condition found. Now need to establish exactly why
- we terminated. */
- and vhas_nul1.16b, vhas_nul1.16b, vrepmask_0.16b
- and vhas_nul2.16b, vhas_nul2.16b, vrepmask_0.16b
- and vhas_chr1.16b, vhas_chr1.16b, vrepmask_c.16b
- and vhas_chr2.16b, vhas_chr2.16b, vrepmask_c.16b
- orr vend1.16b, vhas_nul1.16b, vhas_chr1.16b
- orr vend2.16b, vhas_nul2.16b, vhas_chr2.16b
- addp vend1.16b, vend1.16b, vend2.16b // 256->128
- addp vend1.16b, vend1.16b, vend2.16b // 128->64
-
- mov tmp1, vend1.2d[0]
-L(tail):
- sub src, src, #32
+#ifdef __AARCH64EB__
+ bif vhas_nul.16b, vhas_chr.16b, vrepmask.16b
+ and vhas_nul.16b, vhas_nul.16b, vrepmask2.16b
+ addp vend.16b, vhas_nul.16b, vhas_nul.16b /* 128->64 */
+ fmov tmp1, dend
+#else
+ bit vhas_nul.16b, vhas_chr.16b, vrepmask.16b
+ and vhas_nul.16b, vhas_nul.16b, vrepmask2.16b
+ addp vend.16b, vhas_nul.16b, vhas_nul.16b /* 128->64 */
+ fmov tmp1, dend
rbit tmp1, tmp1
+#endif
clz tmp1, tmp1
- /* Tmp1 is even if the target charager was found first. Otherwise
- we've found the end of string and we weren't looking for NUL. */
- tst tmp1, #1
- add result, src, tmp1, lsr #1
+ /* Tmp1 is a multiple of 4 if the target character was
+ found first. Otherwise we've found the end of string. */
+ tst tmp1, 2
+ add result, src, tmp1, lsr 2
csel result, result, xzr, eq
ret
+
END (strchr)
libc_hidden_builtin_def (strchr)
weak_alias (strchr, index)
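A note on the main loop's termination test: it uses a single cmhs on the
result of cmeq instead of a separate NUL compare. The per-byte identity it
relies on can be checked exhaustively in C. This is a minimal sketch;
cmeq_byte, cmhs_byte and count_mismatches are hypothetical helper names that
model one byte lane of the corresponding instructions.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of cmeq: all ones when the bytes are equal, zero otherwise.  */
static uint8_t
cmeq_byte (uint8_t a, uint8_t b)
{
  return a == b ? 0xff : 0x00;
}

/* One lane of cmhs: all ones when a >= b (unsigned), zero otherwise.  */
static uint8_t
cmhs_byte (uint8_t a, uint8_t b)
{
  return a >= b ? 0xff : 0x00;
}

/* Count the (data, chr) pairs for which cmhs (cmeq (data, chr), data)
   disagrees with "data is NUL or data equals chr".  */
static int
count_mismatches (void)
{
  int bad = 0;
  for (int d = 0; d < 256; d++)
    for (int c = 0; c < 256; c++)
      {
        uint8_t has_chr = cmeq_byte ((uint8_t) d, (uint8_t) c);
        uint8_t stop = cmhs_byte (has_chr, (uint8_t) d);
        int expected = (d == 0 || d == c);
        if ((stop != 0) != expected)
          bad++;
      }
  return bad;
}
```

The check passes because 0xff is greater than or equal to every byte value,
while 0x00 is greater than or equal to a byte only when that byte is itself
zero.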