From: "Lucas A. M. Magalhaes" <lamm@linux.ibm.com>
To: Matheus Castanho <msc@linux.ibm.com>, libc-alpha@sourceware.org
Subject: Re: [PATCH v2] powerpc: Add optimized strlen for POWER10
Date: Thu, 22 Apr 2021 15:20:06 -0300 [thread overview]
Message-ID: <161911560634.43295.12328092311719242757@localhost.localdomain> (raw)
In-Reply-To: <20210422122911.27758-1-msc@linux.ibm.com>
Hi Matheus,

LGTM. I reviewed the patch and all tests pass here.

Thanks for working on this.
Quoting Matheus Castanho via Libc-alpha (2021-04-22 09:29:11)
> Improvements compared to POWER9 version:
>
> 1. Take the initial 16B comparison into account for aligned strings
>
> The previous version compares the first 16B and increments r4 by the number
> of bytes until the address is 16B-aligned, then starts doing aligned loads at
> that address. For aligned strings, this causes the first 16B to be compared
> twice, because the increment is 0. Here we calculate the next 16B-aligned
> address differently, which avoids that issue.
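(Editorial aside for archive readers: the new address computation is the standard round-up-to-next-multiple idiom. A C sketch of what `addi r5,r3,16` + `clrrdi r5,r5,4` compute; the actual implementation of course stays in assembly.)

```c
#include <stdint.h>

/* Next 16B block to check after the first (possibly unaligned) 16B
   compare.  Adding 16 before clearing the low 4 bits guarantees the
   result is strictly past the starting address, so an already
   16B-aligned string never has its first block compared twice.  */
static uintptr_t next_16b_block (uintptr_t addr)
{
  return (addr + 16) & ~(uintptr_t) 15;
}
```

For an aligned start such as 0x1000 this yields 0x1010, skipping the block the first compare already covered; the previous version computed an alignment increment that was 0 in that case.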
>
> 2. Use simple comparisons for the first ~192 bytes
>
> The main loop is good for long strings, but comparing 16B at a time is better
> for shorter ones. So after aligning the address to 16 bytes, we check
> another 176B in 16B chunks. There may be some overlap with the main loop for
> unaligned strings, but we avoid using the more aggressive strategy too soon,
> and also allow the loop to start at a 64B-aligned address. This greatly
> benefits smaller strings and avoids overlapping checks if the string is
> already aligned at a 64B boundary.
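The 64B-aligned loop start mentioned above comes from the same rounding trick (the asm's `addi r4,r3,192` + `clrrdi r4,r4,6`); a hedged C sketch of just that arithmetic:

```c
#include <stdint.h>

/* 64B-aligned address where the main loop takes over.  16B (first
   compare) + 176B (CHECK16 chunks) = 192B, so for a 64B-aligned
   string the loop starts exactly where the head checks stopped;
   for unaligned strings it may re-check a few bytes, the harmless
   overlap the commit message mentions.  */
static uintptr_t loop_start (uintptr_t s)
{
  return (s + 192) & ~(uintptr_t) 63;
}
```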
>
> 3. Reduce dependencies between load blocks caused by address calculation in the loop
>
> Precise timing traces of the code showed that many loads in the loop were
> stalled waiting for updates to r4 from previous code blocks. This
> implementation avoids that as much as possible by using 2 registers (r4 and
> r5) to hold addresses to be used by different parts of the code.
>
> Also, the previous code aligned the address to 16B, then to 64B by doing a
> few 48B loops (if needed) until the address was aligned. The main loop could
> not start until that 48B loop had finished and r4 was updated with the
> current address. Here we calculate the address used by the loop very early,
> so it can start sooner.
>
> The main loop now uses 2 pointers 128B apart to make pointer updates less
> frequent, and also unrolls 1 iteration to guarantee there is enough time
> between iterations to update the pointers, reducing stalled cycles.
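To make the two-pointer, unrolled structure concrete, here is a toy C model of the loop shape. `find_nul_from` and the use of `memchr` as a stand-in for the lxvp/vminub/vcmpequb. 64B check are my own illustration (and it assumes the buffer is known to contain a NUL), but the cursor layout mirrors the asm: two cursors 128B apart, each advanced by 256B once per iteration, so every check's address was computed well before it is used.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of the unrolled main loop.  Together p and q cover the
   buffer contiguously: p handles [0,128) of each 256B stride, q the
   following [128,256).  */
static size_t find_nul_from (const char *s)
{
  const char *p = s, *q = s + 128;
  for (;;)
    {
      const char *hit;
      if ((hit = memchr (p, 0, 64)))      return hit - s;
      if ((hit = memchr (p + 64, 0, 64))) return hit - s;
      p += 256;
      if ((hit = memchr (q, 0, 64)))      return hit - s;
      if ((hit = memchr (q + 64, 0, 64))) return hit - s;
      q += 256;
    }
}
```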
>
> 4. Use new P10 instructions
>
> lxvp is used to load 32B with a single instruction, reducing contention in
> the load queue.
>
> vextractbm allows simplifying the tail code for the loop, replacing
> vbpermq and avoiding having to generate a permute control vector.
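For readers unfamiliar with vextractbm: it collects one bit per byte of a vector register. A rough scalar model of what the VEXTRACTBM macro produces on a vcmpequb result; the bit ordering shown (byte i to mask bit i, matching the little-endian trailing-zero count later in the code) is my reading, not something the patch spells out.

```c
#include <stdint.h>

/* Scalar model of VEXTRACTBM applied to a vcmpequb result: byte i
   of the 16B compare result contributes its most significant bit
   as bit i of the mask.  vcmpequb sets a byte to 0xff where the
   input matched NUL, so the mask has a 1 at each NUL position.  */
static unsigned int extract_msb_mask (const uint8_t cmp[16])
{
  unsigned int mask = 0;
  for (int i = 0; i < 16; i++)
    mask |= (unsigned int) (cmp[i] >> 7) << i;
  return mask;
}
```

With vbpermq, getting the same mask required loading a permute control vector first; vextractbm needs no extra operand, which is what simplifies the tail code.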
>
> Output of bench-strlen from 'make USE_CLOCK_GETTIME=1 BENCHSET="string-benchset"',
> using a slightly different set of inputs than the default:
>
> $ ./compare_strings.py --functions __strlen_power9,__strlen_power10
> -a length,alignment -s benchout_strings.schema.json
> -i bench-strlen.out
>
> Function: strlen
> Variant:
> __strlen_power10 __strlen_power9
> ================================================================================
> length=1, alignment=0: 2.50 2.50 ( 0.00%)
> length=1, alignment=1: 2.50 2.50 ( 0.00%)
> length=2, alignment=0: 2.50 2.50 ( 0.00%)
> length=2, alignment=2: 2.50 2.50 ( 0.00%)
> length=3, alignment=0: 2.50 2.50 ( 0.00%)
> length=3, alignment=3: 2.50 2.50 ( 0.00%)
> length=4, alignment=0: 2.50 2.50 ( 0.00%)
> length=4, alignment=4: 2.50 2.50 ( 0.00%)
> length=5, alignment=0: 2.50 2.50 ( 0.00%)
> length=5, alignment=5: 2.50 2.50 ( 0.00%)
> length=6, alignment=0: 2.50 2.50 ( 0.00%)
> length=6, alignment=6: 2.50 2.50 ( 0.00%)
> length=7, alignment=0: 2.50 2.50 ( 0.00%)
> length=7, alignment=7: 2.50 2.50 ( 0.00%)
> length=8, alignment=0: 2.50 2.50 ( 0.00%)
> length=8, alignment=8: 3.12 3.12 ( 0.00%)
> length=9, alignment=0: 2.50 2.50 ( 0.00%)
> length=9, alignment=9: 3.12 3.12 ( 0.00%)
> length=10, alignment=10: 3.12 3.12 ( 0.00%)
> length=16, alignment=0: 3.12 3.40 ( -9.09%)
> length=16, alignment=4: 3.12 3.12 ( 0.00%)
> length=16, alignment=7: 3.12 3.12 ( 0.00%)
> length=21, alignment=0: 3.12 3.40 ( -9.09%)
> length=21, alignment=5: 3.12 3.12 ( 0.00%)
> length=32, alignment=0: 3.12 3.40 ( -9.09%)
> length=32, alignment=7: 3.12 3.40 ( -9.09%)
> length=42, alignment=0: 3.12 3.40 ( -9.09%)
> length=42, alignment=7: 3.42 3.40 ( 0.51%)
> length=48, alignment=0: 3.43 3.74 ( -9.13%)
> length=48, alignment=7: 3.40 3.40 ( 0.17%)
> length=64, alignment=0: 3.40 5.21 (-53.34%)
> length=64, alignment=7: 3.40 3.74 (-10.00%)
> length=80, alignment=0: 3.74 5.21 (-39.43%)
> length=80, alignment=7: 3.74 4.01 ( -7.14%)
> length=85, alignment=0: 3.74 5.21 (-39.42%)
> length=85, alignment=7: 3.74 4.01 ( -7.14%)
> length=96, alignment=0: 3.74 5.21 (-39.40%)
> length=96, alignment=7: 3.74 4.01 ( -7.14%)
> length=112, alignment=0: 3.74 5.21 (-39.39%)
> length=112, alignment=7: 3.74 4.88 (-30.43%)
> length=128, alignment=0: 4.01 5.91 (-47.59%)
> length=128, alignment=7: 4.01 6.15 (-53.59%)
> length=128, alignment=16: 4.01 6.16 (-53.78%)
> length=128, alignment=23: 4.01 5.17 (-29.08%)
> length=160, alignment=0: 4.01 5.92 (-47.75%)
> length=160, alignment=7: 4.01 6.16 (-53.72%)
> length=160, alignment=16: 4.01 6.14 (-53.29%)
> length=160, alignment=23: 4.01 6.05 (-50.98%)
> length=192, alignment=0: 5.93 6.84 (-15.44%)
> length=192, alignment=7: 5.93 6.90 (-16.35%)
> length=256, alignment=0: 6.61 7.73 (-17.02%)
> length=256, alignment=7: 6.61 7.85 (-18.79%)
> length=320, alignment=0: 7.26 8.65 (-19.12%)
> length=320, alignment=7: 7.26 8.76 (-20.70%)
> length=384, alignment=0: 7.95 9.62 (-20.98%)
> length=384, alignment=7: 7.95 9.49 (-19.37%)
> length=448, alignment=0: 8.73 10.39 (-19.06%)
> length=448, alignment=7: 8.73 10.51 (-20.40%)
> length=512, alignment=0: 9.44 11.13 (-17.87%)
> length=512, alignment=7: 9.45 11.32 (-19.85%)
> length=576, alignment=0: 10.10 11.93 (-18.05%)
> length=576, alignment=7: 10.10 12.02 (-18.97%)
> length=640, alignment=0: 10.71 12.73 (-18.86%)
> length=640, alignment=7: 10.67 12.89 (-20.76%)
> length=704, alignment=0: 11.59 13.39 (-15.61%)
> length=704, alignment=7: 11.59 13.61 (-17.45%)
> length=768, alignment=0: 12.27 14.22 (-15.90%)
> length=768, alignment=7: 12.27 14.44 (-17.72%)
> length=896, alignment=0: 13.48 15.70 (-16.47%)
> length=896, alignment=7: 13.47 15.97 (-18.56%)
> length=960, alignment=0: 14.22 16.63 (-16.92%)
> length=960, alignment=7: 14.19 16.70 (-17.66%)
> length=1024, alignment=0: 14.85 17.46 (-17.54%)
> length=1024, alignment=7: 14.87 17.68 (-18.94%)
> length=1280, alignment=0: 17.58 20.91 (-18.94%)
> length=1280, alignment=7: 17.62 21.35 (-21.13%)
> length=1536, alignment=0: 20.61 24.54 (-19.07%)
> length=1536, alignment=7: 20.61 24.21 (-17.48%)
> length=1792, alignment=0: 23.02 27.94 (-21.39%)
> length=1792, alignment=7: 23.02 27.83 (-20.90%)
> length=2048, alignment=0: 25.98 30.71 (-18.23%)
> length=2048, alignment=7: 25.96 31.26 (-20.45%)
> length=2560, alignment=0: 31.37 37.82 (-20.57%)
> length=2560, alignment=7: 31.34 37.69 (-20.26%)
> length=3008, alignment=0: 35.61 43.29 (-21.56%)
> length=3008, alignment=7: 35.55 43.84 (-23.31%)
> length=3520, alignment=0: 41.08 50.48 (-22.90%)
> length=3520, alignment=7: 41.12 50.63 (-23.13%)
> length=4096, alignment=0: 47.80 57.96 (-21.25%)
> length=4096, alignment=7: 47.79 57.66 (-20.66%)
>
> Reviewed-by: Paul E Murphy <murphyp@linux.ibm.com>
>
> ---
> Changes from v1:
> - Added comment about minimum binutils version needed to remove the instruction macros
> - s/reg/vreg/ on CHECK16 for clarity
>
> ---
> sysdeps/powerpc/powerpc64/le/power10/strlen.S | 221 ++++++++++++++++++
> sysdeps/powerpc/powerpc64/multiarch/Makefile | 3 +-
> .../powerpc64/multiarch/ifunc-impl-list.c | 2 +
> .../powerpc64/multiarch/strlen-power10.S | 2 +
> sysdeps/powerpc/powerpc64/multiarch/strlen.c | 3 +
> 5 files changed, 230 insertions(+), 1 deletion(-)
> create mode 100644 sysdeps/powerpc/powerpc64/le/power10/strlen.S
> create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
>
> diff --git a/sysdeps/powerpc/powerpc64/le/power10/strlen.S b/sysdeps/powerpc/powerpc64/le/power10/strlen.S
> new file mode 100644
> index 0000000000..7eb37a8f54
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power10/strlen.S
> @@ -0,0 +1,221 @@
> +/* Optimized strlen implementation for POWER10 LE.
> + Copyright (C) 2021 Free Software Foundation, Inc.
> + This file is part of the GNU C Library.
> +
> + The GNU C Library is free software; you can redistribute it and/or
> + modify it under the terms of the GNU Lesser General Public
> + License as published by the Free Software Foundation; either
> + version 2.1 of the License, or (at your option) any later version.
> +
> + The GNU C Library is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + Lesser General Public License for more details.
> +
> + You should have received a copy of the GNU Lesser General Public
> + License along with the GNU C Library; if not, see
> + <https://www.gnu.org/licenses/>. */
> +
> +#include <sysdep.h>
> +
> +#ifndef STRLEN
> +# define STRLEN __strlen
> +# define DEFINE_STRLEN_HIDDEN_DEF 1
> +#endif
> +
> +/* TODO: Replace the macros with the actual instructions once the minimum
> +   binutils version is >= 2.35.  The macros keep compatibility with older
> +   versions.  */
> +#define VEXTRACTBM(rt,vrb) \
> + .long(((4)<<(32-6)) \
> + | ((rt)<<(32-11)) \
> + | ((8)<<(32-16)) \
> + | ((vrb)<<(32-21)) \
> + | 1602)
> +
> +#define LXVP(xtp,dq,ra) \
> + .long(((6)<<(32-6)) \
> + | ((((xtp)-32)>>1)<<(32-10)) \
> + | ((1)<<(32-11)) \
> + | ((ra)<<(32-16)) \
> + | dq)
> +
> +#define CHECK16(vreg,offset,addr,label) \
> + lxv vreg+32,offset(addr); \
> + vcmpequb. vreg,vreg,v18; \
> + bne cr6,L(label);
> +
> +/* Load 4 quadwords, merge into one VR for speed and check for NULLs. r6 has #
> + of bytes already checked. */
> +#define CHECK64(offset,addr,label) \
> + li r6,offset; \
> + LXVP(v4+32,offset,addr); \
> + LXVP(v6+32,offset+32,addr); \
> + vminub v14,v4,v5; \
> + vminub v15,v6,v7; \
> + vminub v16,v14,v15; \
> + vcmpequb. v0,v16,v18; \
> + bne cr6,L(label)
> +
> +#define TAIL(vreg,increment) \
> + vctzlsbb r4,vreg; \
> + subf r3,r3,r5; \
> + addi r4,r4,increment; \
> + add r3,r3,r4; \
> + blr
> +
> +/* Implements the function
> +
> + int [r3] strlen (const void *s [r3])
> +
> + The implementation can load bytes past a matching byte, but only
> + up to the next 64B boundary, so it never crosses a page. */
> +
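(Aside on the safety comment above: the "never crosses a page" property follows from the loads being naturally 64B-aligned and 64 dividing the page size. A tiny check, assuming 4KiB pages purely for illustration:)

```c
#include <stdint.h>

/* A 64B access starting at ADDR stays within one 4KiB page exactly
   when its first and last byte share a page; every 64B-aligned
   address satisfies this, so only misaligned 64B accesses could
   straddle a page boundary.  */
static int crosses_4k_page (uintptr_t addr)
{
  return (addr / 4096) != ((addr + 63) / 4096);
}
```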
> +.machine power9
> +
> +ENTRY_TOCLESS (STRLEN, 4)
> + CALL_MCOUNT 1
> +
> + vspltisb v18,0
> + vspltisb v19,-1
> +
> + /* Next 16B-aligned address. Prepare address for L(aligned). */
> + addi r5,r3,16
> + clrrdi r5,r5,4
> +
> + /* Align data and fill bytes not loaded with non matching char. */
> + lvx v0,0,r3
> + lvsr v1,0,r3
> + vperm v0,v19,v0,v1
> +
> + vcmpequb. v6,v0,v18
> + beq cr6,L(aligned)
> +
> + vctzlsbb r3,v6
> + blr
> +
> + /* Test the next 176B, 16B at a time.  The main loop is optimized for
> + longer strings, so checking the first bytes in 16B chunks greatly
> + benefits small strings.  */
> + .p2align 5
> +L(aligned):
> + /* Prepare address for the loop. */
> + addi r4,r3,192
> + clrrdi r4,r4,6
> +
> + CHECK16(v0,0,r5,tail1)
> + CHECK16(v1,16,r5,tail2)
> + CHECK16(v2,32,r5,tail3)
> + CHECK16(v3,48,r5,tail4)
> + CHECK16(v4,64,r5,tail5)
> + CHECK16(v5,80,r5,tail6)
> + CHECK16(v6,96,r5,tail7)
> + CHECK16(v7,112,r5,tail8)
> + CHECK16(v8,128,r5,tail9)
> + CHECK16(v9,144,r5,tail10)
> + CHECK16(v10,160,r5,tail11)
> +
> + addi r5,r4,128
> +
> + /* Switch to a more aggressive approach checking 64B each time. Use 2
> + pointers 128B apart and unroll the loop once to make the pointer
> + updates and usages separated enough to avoid stalls waiting for
> + address calculation. */
> + .p2align 5
> +L(loop):
> + CHECK64(0,r4,pre_tail_64b)
> + CHECK64(64,r4,pre_tail_64b)
> + addi r4,r4,256
> +
> + CHECK64(0,r5,tail_64b)
> + CHECK64(64,r5,tail_64b)
> + addi r5,r5,256
> +
> + b L(loop)
> +
> + .p2align 5
> +L(pre_tail_64b):
> + mr r5,r4
> +L(tail_64b):
> + /* OK, we found a null byte. Let's look for it in the current 64-byte
> + block and mark it in its corresponding VR. lxvp vx,0(ry) puts the
> + low 16B bytes into vx+1, and the high into vx, so the order here is
> + v5, v4, v7, v6. */
> + vcmpequb v1,v5,v18
> + vcmpequb v2,v4,v18
> + vcmpequb v3,v7,v18
> + vcmpequb v4,v6,v18
> +
> + /* Take into account the other 64B blocks we had already checked. */
> + add r5,r5,r6
> +
> + /* Extract first bit of each byte. */
> + VEXTRACTBM(r7,v1)
> + VEXTRACTBM(r8,v2)
> + VEXTRACTBM(r9,v3)
> + VEXTRACTBM(r10,v4)
> +
> + /* Shift each value into their corresponding position. */
> + sldi r8,r8,16
> + sldi r9,r9,32
> + sldi r10,r10,48
> +
> + /* Merge the results. */
> + or r7,r7,r8
> + or r8,r9,r10
> + or r10,r8,r7
> +
> + cnttzd r0,r10 /* Count trailing zeros before the match. */
> + subf r5,r3,r5
> + add r3,r5,r0 /* Compute final length. */
> + blr
> +
> + .p2align 5
> +L(tail1):
> + TAIL(v0,0)
> +
> + .p2align 5
> +L(tail2):
> + TAIL(v1,16)
> +
> + .p2align 5
> +L(tail3):
> + TAIL(v2,32)
> +
> + .p2align 5
> +L(tail4):
> + TAIL(v3,48)
> +
> + .p2align 5
> +L(tail5):
> + TAIL(v4,64)
> +
> + .p2align 5
> +L(tail6):
> + TAIL(v5,80)
> +
> + .p2align 5
> +L(tail7):
> + TAIL(v6,96)
> +
> + .p2align 5
> +L(tail8):
> + TAIL(v7,112)
> +
> + .p2align 5
> +L(tail9):
> + TAIL(v8,128)
> +
> + .p2align 5
> +L(tail10):
> + TAIL(v9,144)
> +
> + .p2align 5
> +L(tail11):
> + TAIL(v10,160)
> +
> +END (STRLEN)
> +
> +#ifdef DEFINE_STRLEN_HIDDEN_DEF
> +weak_alias (__strlen, strlen)
> +libc_hidden_builtin_def (strlen)
> +#endif
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> index f46bf50732..8aa46a3702 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
> +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> @@ -33,7 +33,8 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
>
> ifneq (,$(filter %le,$(config-machine)))
> sysdep_routines += strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \
> - rawmemchr-power9 strlen-power9 strncpy-power9 stpncpy-power9
> + rawmemchr-power9 strlen-power9 strncpy-power9 stpncpy-power9 \
> + strlen-power10
> endif
> CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops
> CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> index 72f7f83e7e..1a6993616f 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> @@ -112,6 +112,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> /* Support sysdeps/powerpc/powerpc64/multiarch/strlen.c. */
> IFUNC_IMPL (i, name, strlen,
> #ifdef __LITTLE_ENDIAN__
> + IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_3_1,
> + __strlen_power10)
> IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_3_00,
> __strlen_power9)
> #endif
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S b/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
> new file mode 100644
> index 0000000000..6a774fad58
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
> @@ -0,0 +1,2 @@
> +#define STRLEN __strlen_power10
> +#include <sysdeps/powerpc/powerpc64/le/power10/strlen.S>
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen.c b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
> index c3bbc78df8..109c8a90bd 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/strlen.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
> @@ -31,9 +31,12 @@ extern __typeof (__redirect_strlen) __strlen_ppc attribute_hidden;
> extern __typeof (__redirect_strlen) __strlen_power7 attribute_hidden;
> extern __typeof (__redirect_strlen) __strlen_power8 attribute_hidden;
> extern __typeof (__redirect_strlen) __strlen_power9 attribute_hidden;
> +extern __typeof (__redirect_strlen) __strlen_power10 attribute_hidden;
>
> libc_ifunc (__libc_strlen,
> # ifdef __LITTLE_ENDIAN__
> + (hwcap2 & PPC_FEATURE2_ARCH_3_1)
> + ? __strlen_power10 :
> (hwcap2 & PPC_FEATURE2_ARCH_3_00)
> ? __strlen_power9 :
> # endif
> --
> 2.30.2
>
Thread overview: 4+ messages
2021-04-22 12:29 Matheus Castanho
2021-04-22 15:22 ` Raphael M Zinsly
2021-04-22 18:20 ` Lucas A. M. Magalhaes [this message]
2021-04-22 19:50 ` Matheus Castanho