public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
From: "Lucas A. M. Magalhaes" <lamm@linux.ibm.com>
To: Matheus Castanho <msc@linux.ibm.com>, libc-alpha@sourceware.org
Subject: Re: [PATCH v2] powerpc: Add optimized strlen for POWER10
Date: Thu, 22 Apr 2021 15:20:06 -0300	[thread overview]
Message-ID: <161911560634.43295.12328092311719242757@localhost.localdomain> (raw)
In-Reply-To: <20210422122911.27758-1-msc@linux.ibm.com>

Hi Matheus, LGTM. Reviewed and all tests pass.

Thanks for working on this.

Quoting Matheus Castanho via Libc-alpha (2021-04-22 09:29:11)
> Improvements compared to POWER9 version:
> 
> 1. Take into account first 16B comparison for aligned strings
> 
>    The previous version compares the first 16B and increments r4 by the number
>    of bytes until the address is 16B-aligned, then starts doing aligned loads at
>    that address. For aligned strings, this causes the first 16B to be compared
>    twice, because the increment is 0. Here we calculate the next 16B-aligned
>    address differently, which avoids that issue.
> 
> 2. Use simple comparisons for the first ~192 bytes
> 
>    The main loop is good for big strings, but comparing 16B each time is better
>    for smaller strings.  So after aligning the address to 16 Bytes, we check
>    more 176B in 16B chunks.  There may be some overlaps with the main loop for
>    unaligned strings, but we avoid using the more aggressive strategy too soon,
>    and also allow the loop to start at a 64B-aligned address.  This greatly
>    benefits smaller strings and avoids overlapping checks if the string is
>    already aligned at a 64B boundary.
> 
> 3. Reduce dependencies between load blocks caused by address calculation on loop
> 
>    Doing a precise time tracing on the code showed many loads in the loop were
>    stalled waiting for updates to r4 from previous code blocks.  This
>    implementation avoids that as much as possible by using 2 registers (r4 and
>    r5) to hold addresses to be used by different parts of the code.
> 
>    Also, the previous code aligned the address to 16B, then to 64B by doing a
>    few 48B loops (if needed) until the address was aligned. The main loop could
>    not start until that 48B loop had finished and r4 was updated with the
>    current address. Here we calculate the address used by the loop very early,
>    so it can start sooner.
> 
>    The main loop now uses 2 pointers 128B apart to make pointer updates less
>    frequent, and also unrolls 1 iteration to guarantee there is enough time
>    between iterations to update the pointers, reducing stalled cycles.
> 
> 4. Use new P10 instructions
> 
>    lxvp is used to load 32B with a single instruction, reducing contention in
>    the load queue.
> 
>    vextractbm allows simplifying the tail code for the loop, replacing
>    vbpermq and avoiding having to generate a permute control vector.
> 
> Output of bench-strlen from 'make USE_CLOCK_GETTIME=1 BENCHSET="string-benchset"
> using slightly different set of inputs than the default:
> 
> $ ./compare_strings.py --functions __strlen_power9,__strlen_power10
>                        -a length,alignment -s benchout_strings.schema.json
>                        -i bench-strlen.out
> 
> Function: strlen
> Variant:
>                                     __strlen_power10    __strlen_power9
> ================================================================================
>                length=1, alignment=0:         2.50              2.50 (  0.00%)
>                length=1, alignment=1:         2.50              2.50 (  0.00%)
>                length=2, alignment=0:         2.50              2.50 (  0.00%)
>                length=2, alignment=2:         2.50              2.50 (  0.00%)
>                length=3, alignment=0:         2.50              2.50 (  0.00%)
>                length=3, alignment=3:         2.50              2.50 (  0.00%)
>                length=4, alignment=0:         2.50              2.50 (  0.00%)
>                length=4, alignment=4:         2.50              2.50 (  0.00%)
>                length=5, alignment=0:         2.50              2.50 (  0.00%)
>                length=5, alignment=5:         2.50              2.50 (  0.00%)
>                length=6, alignment=0:         2.50              2.50 (  0.00%)
>                length=6, alignment=6:         2.50              2.50 (  0.00%)
>                length=7, alignment=0:         2.50              2.50 (  0.00%)
>                length=7, alignment=7:         2.50              2.50 (  0.00%)
>                length=8, alignment=0:         2.50              2.50 (  0.00%)
>                length=8, alignment=8:         3.12              3.12 (  0.00%)
>                length=9, alignment=0:         2.50              2.50 (  0.00%)
>                length=9, alignment=9:         3.12              3.12 (  0.00%)
>              length=10, alignment=10:         3.12              3.12 (  0.00%)
>               length=16, alignment=0:         3.12              3.40 ( -9.09%)
>               length=16, alignment=4:         3.12              3.12 (  0.00%)
>               length=16, alignment=7:         3.12              3.12 (  0.00%)
>               length=21, alignment=0:         3.12              3.40 ( -9.09%)
>               length=21, alignment=5:         3.12              3.12 (  0.00%)
>               length=32, alignment=0:         3.12              3.40 ( -9.09%)
>               length=32, alignment=7:         3.12              3.40 ( -9.09%)
>               length=42, alignment=0:         3.12              3.40 ( -9.09%)
>               length=42, alignment=7:         3.42              3.40 (  0.51%)
>               length=48, alignment=0:         3.43              3.74 ( -9.13%)
>               length=48, alignment=7:         3.40              3.40 (  0.17%)
>               length=64, alignment=0:         3.40              5.21 (-53.34%)
>               length=64, alignment=7:         3.40              3.74 (-10.00%)
>               length=80, alignment=0:         3.74              5.21 (-39.43%)
>               length=80, alignment=7:         3.74              4.01 ( -7.14%)
>               length=85, alignment=0:         3.74              5.21 (-39.42%)
>               length=85, alignment=7:         3.74              4.01 ( -7.14%)
>               length=96, alignment=0:         3.74              5.21 (-39.40%)
>               length=96, alignment=7:         3.74              4.01 ( -7.14%)
>              length=112, alignment=0:         3.74              5.21 (-39.39%)
>              length=112, alignment=7:         3.74              4.88 (-30.43%)
>              length=128, alignment=0:         4.01              5.91 (-47.59%)
>              length=128, alignment=7:         4.01              6.15 (-53.59%)
>             length=128, alignment=16:         4.01              6.16 (-53.78%)
>             length=128, alignment=23:         4.01              5.17 (-29.08%)
>              length=160, alignment=0:         4.01              5.92 (-47.75%)
>              length=160, alignment=7:         4.01              6.16 (-53.72%)
>             length=160, alignment=16:         4.01              6.14 (-53.29%)
>             length=160, alignment=23:         4.01              6.05 (-50.98%)
>              length=192, alignment=0:         5.93              6.84 (-15.44%)
>              length=192, alignment=7:         5.93              6.90 (-16.35%)
>              length=256, alignment=0:         6.61              7.73 (-17.02%)
>              length=256, alignment=7:         6.61              7.85 (-18.79%)
>              length=320, alignment=0:         7.26              8.65 (-19.12%)
>              length=320, alignment=7:         7.26              8.76 (-20.70%)
>              length=384, alignment=0:         7.95              9.62 (-20.98%)
>              length=384, alignment=7:         7.95              9.49 (-19.37%)
>              length=448, alignment=0:         8.73             10.39 (-19.06%)
>              length=448, alignment=7:         8.73             10.51 (-20.40%)
>              length=512, alignment=0:         9.44             11.13 (-17.87%)
>              length=512, alignment=7:         9.45             11.32 (-19.85%)
>              length=576, alignment=0:        10.10             11.93 (-18.05%)
>              length=576, alignment=7:        10.10             12.02 (-18.97%)
>              length=640, alignment=0:        10.71             12.73 (-18.86%)
>              length=640, alignment=7:        10.67             12.89 (-20.76%)
>              length=704, alignment=0:        11.59             13.39 (-15.61%)
>              length=704, alignment=7:        11.59             13.61 (-17.45%)
>              length=768, alignment=0:        12.27             14.22 (-15.90%)
>              length=768, alignment=7:        12.27             14.44 (-17.72%)
>              length=896, alignment=0:        13.48             15.70 (-16.47%)
>              length=896, alignment=7:        13.47             15.97 (-18.56%)
>              length=960, alignment=0:        14.22             16.63 (-16.92%)
>              length=960, alignment=7:        14.19             16.70 (-17.66%)
>             length=1024, alignment=0:        14.85             17.46 (-17.54%)
>             length=1024, alignment=7:        14.87             17.68 (-18.94%)
>             length=1280, alignment=0:        17.58             20.91 (-18.94%)
>             length=1280, alignment=7:        17.62             21.35 (-21.13%)
>             length=1536, alignment=0:        20.61             24.54 (-19.07%)
>             length=1536, alignment=7:        20.61             24.21 (-17.48%)
>             length=1792, alignment=0:        23.02             27.94 (-21.39%)
>             length=1792, alignment=7:        23.02             27.83 (-20.90%)
>             length=2048, alignment=0:        25.98             30.71 (-18.23%)
>             length=2048, alignment=7:        25.96             31.26 (-20.45%)
>             length=2560, alignment=0:        31.37             37.82 (-20.57%)
>             length=2560, alignment=7:        31.34             37.69 (-20.26%)
>             length=3008, alignment=0:        35.61             43.29 (-21.56%)
>             length=3008, alignment=7:        35.55             43.84 (-23.31%)
>             length=3520, alignment=0:        41.08             50.48 (-22.90%)
>             length=3520, alignment=7:        41.12             50.63 (-23.13%)
>             length=4096, alignment=0:        47.80             57.96 (-21.25%)
>             length=4096, alignment=7:        47.79             57.66 (-20.66%)
> 
> Reviewed-by: Paul E Murphy <murphyp@linux.ibm.com>
> 
> ---
> Changes from v1:
>   - Added comment about minimum binutils version needed to remove the instruction macros
>   - s/reg/vreg/ on CHECK16 for clarity
>   
> ---
>  sysdeps/powerpc/powerpc64/le/power10/strlen.S | 221 ++++++++++++++++++
>  sysdeps/powerpc/powerpc64/multiarch/Makefile  |   3 +-
>  .../powerpc64/multiarch/ifunc-impl-list.c     |   2 +
>  .../powerpc64/multiarch/strlen-power10.S      |   2 +
>  sysdeps/powerpc/powerpc64/multiarch/strlen.c  |   3 +
>  5 files changed, 230 insertions(+), 1 deletion(-)
>  create mode 100644 sysdeps/powerpc/powerpc64/le/power10/strlen.S
>  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
> 
> diff --git a/sysdeps/powerpc/powerpc64/le/power10/strlen.S b/sysdeps/powerpc/powerpc64/le/power10/strlen.S
> new file mode 100644
> index 0000000000..7eb37a8f54
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power10/strlen.S
> @@ -0,0 +1,221 @@
> +/* Optimized strlen implementation for POWER10 LE.
> +   Copyright (C) 2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.         See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +
> +#ifndef STRLEN
> +# define STRLEN __strlen
> +# define DEFINE_STRLEN_HIDDEN_DEF 1
> +#endif
> +
> +/* TODO: Replace macros by the actual instructions when minimum binutils becomes
> +   >= 2.35.  This is used to keep compatibility with older versions.  */
> +#define VEXTRACTBM(rt,vrb)      \
> +       .long(((4)<<(32-6))      \
> +             | ((rt)<<(32-11))  \
> +             | ((8)<<(32-16))   \
> +             | ((vrb)<<(32-21)) \
> +             | 1602)
> +
> +#define LXVP(xtp,dq,ra)                   \
> +       .long(((6)<<(32-6))                \
> +             | ((((xtp)-32)>>1)<<(32-10)) \
> +             | ((1)<<(32-11))             \
> +             | ((ra)<<(32-16))            \
> +             | dq)
> +
> +#define CHECK16(vreg,offset,addr,label) \
> +       lxv       vreg+32,offset(addr); \
> +       vcmpequb. vreg,vreg,v18;        \
> +       bne       cr6,L(label);
> +
> +/* Load 4 quadwords, merge into one VR for speed and check for NULLs.  r6 has #
> +   of bytes already checked.  */
> +#define CHECK64(offset,addr,label)         \
> +       li        r6,offset;                \
> +       LXVP(v4+32,offset,addr);            \
> +       LXVP(v6+32,offset+32,addr);         \
> +       vminub    v14,v4,v5;                \
> +       vminub    v15,v6,v7;                \
> +       vminub    v16,v14,v15;              \
> +       vcmpequb. v0,v16,v18;               \
> +       bne       cr6,L(label)
> +
> +# define TAIL(vreg,increment)     \
> +       vctzlsbb  r4,vreg;         \
> +       subf      r3,r3,r5;        \
> +       addi      r4,r4,increment; \
> +       add       r3,r3,r4;        \
> +       blr
> +
> +/* Implements the function
> +
> +   int [r3] strlen (const void *s [r3])
> +
> +   The implementation can load bytes past a matching byte, but only
> +   up to the next 64B boundary, so it never crosses a page.  */
> +
> +.machine power9
> +
> +ENTRY_TOCLESS (STRLEN, 4)
> +       CALL_MCOUNT 1
> +
> +       vspltisb  v18,0
> +       vspltisb  v19,-1
> +
> +       /* Next 16B-aligned address. Prepare address for L(aligned).  */
> +       addi      r5,r3,16
> +       clrrdi    r5,r5,4
> +
> +       /* Align data and fill bytes not loaded with non matching char.  */
> +       lvx       v0,0,r3
> +       lvsr      v1,0,r3
> +       vperm     v0,v19,v0,v1
> +
> +       vcmpequb. v6,v0,v18
> +       beq       cr6,L(aligned)
> +
> +       vctzlsbb  r3,v6
> +       blr
> +
> +       /* Test more 112B, 16B at a time.  The main loop is optimized for longer
> +          strings, so checking the first bytes in 16B chunks benefits a lot
> +          small strings.  */
> +       .p2align 5
> +L(aligned):
> +       /* Prepare address for the loop.  */
> +       addi      r4,r3,192
> +       clrrdi    r4,r4,6
> +
> +       CHECK16(v0,0,r5,tail1)
> +       CHECK16(v1,16,r5,tail2)
> +       CHECK16(v2,32,r5,tail3)
> +       CHECK16(v3,48,r5,tail4)
> +       CHECK16(v4,64,r5,tail5)
> +       CHECK16(v5,80,r5,tail6)
> +       CHECK16(v6,96,r5,tail7)
> +       CHECK16(v7,112,r5,tail8)
> +       CHECK16(v8,128,r5,tail9)
> +       CHECK16(v9,144,r5,tail10)
> +       CHECK16(v10,160,r5,tail11)
> +
> +       addi      r5,r4,128
> +
> +       /* Switch to a more aggressive approach checking 64B each time.  Use 2
> +          pointers 128B apart and unroll the loop once to make the pointer
> +          updates and usages separated enough to avoid stalls waiting for
> +          address calculation.  */
> +       .p2align 5
> +L(loop):
> +       CHECK64(0,r4,pre_tail_64b)
> +       CHECK64(64,r4,pre_tail_64b)
> +       addi      r4,r4,256
> +
> +       CHECK64(0,r5,tail_64b)
> +       CHECK64(64,r5,tail_64b)
> +       addi      r5,r5,256
> +
> +       b         L(loop)
> +
> +       .p2align  5
> +L(pre_tail_64b):
> +       mr      r5,r4
> +L(tail_64b):
> +       /* OK, we found a null byte.  Let's look for it in the current 64-byte
> +          block and mark it in its corresponding VR.  lxvp vx,0(ry) puts the
> +          low 16B bytes into vx+1, and the high into vx, so the order here is
> +          v5, v4, v7, v6.  */
> +       vcmpequb  v1,v5,v18
> +       vcmpequb  v2,v4,v18
> +       vcmpequb  v3,v7,v18
> +       vcmpequb  v4,v6,v18
> +
> +       /* Take into account the other 64B blocks we had already checked.  */
> +       add     r5,r5,r6
> +
> +       /* Extract first bit of each byte.  */
> +       VEXTRACTBM(r7,v1)
> +       VEXTRACTBM(r8,v2)
> +       VEXTRACTBM(r9,v3)
> +       VEXTRACTBM(r10,v4)
> +
> +       /* Shift each value into their corresponding position.  */
> +       sldi      r8,r8,16
> +       sldi      r9,r9,32
> +       sldi      r10,r10,48
> +
> +       /* Merge the results.  */
> +       or        r7,r7,r8
> +       or        r8,r9,r10
> +       or        r10,r8,r7
> +
> +       cnttzd    r0,r10          /* Count trailing zeros before the match.  */
> +       subf      r5,r3,r5
> +       add       r3,r5,r0        /* Compute final length.  */
> +       blr
> +
> +       .p2align  5
> +L(tail1):
> +       TAIL(v0,0)
> +
> +       .p2align  5
> +L(tail2):
> +       TAIL(v1,16)
> +
> +       .p2align  5
> +L(tail3):
> +       TAIL(v2,32)
> +
> +       .p2align  5
> +L(tail4):
> +       TAIL(v3,48)
> +
> +       .p2align  5
> +L(tail5):
> +       TAIL(v4,64)
> +
> +       .p2align  5
> +L(tail6):
> +       TAIL(v5,80)
> +
> +       .p2align  5
> +L(tail7):
> +       TAIL(v6,96)
> +
> +       .p2align  5
> +L(tail8):
> +       TAIL(v7,112)
> +
> +       .p2align  5
> +L(tail9):
> +       TAIL(v8,128)
> +
> +       .p2align  5
> +L(tail10):
> +       TAIL(v9,144)
> +
> +       .p2align  5
> +L(tail11):
> +       TAIL(v10,160)
> +
> +END (STRLEN)
> +
> +#ifdef DEFINE_STRLEN_HIDDEN_DEF
> +weak_alias (__strlen, strlen)
> +libc_hidden_builtin_def (strlen)
> +#endif
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> index f46bf50732..8aa46a3702 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
> +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> @@ -33,7 +33,8 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
>  
>  ifneq (,$(filter %le,$(config-machine)))
>  sysdep_routines += strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \
> -                  rawmemchr-power9 strlen-power9 strncpy-power9 stpncpy-power9
> +                  rawmemchr-power9 strlen-power9 strncpy-power9 stpncpy-power9 \
> +                  strlen-power10
>  endif
>  CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops
>  CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> index 72f7f83e7e..1a6993616f 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> @@ -112,6 +112,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>    /* Support sysdeps/powerpc/powerpc64/multiarch/strlen.c.  */
>    IFUNC_IMPL (i, name, strlen,
>  #ifdef __LITTLE_ENDIAN__
> +             IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_3_1,
> +                             __strlen_power10)
>               IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_3_00,
>                               __strlen_power9)
>  #endif
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S b/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
> new file mode 100644
> index 0000000000..6a774fad58
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
> @@ -0,0 +1,2 @@
> +#define STRLEN __strlen_power10
> +#include <sysdeps/powerpc/powerpc64/le/power10/strlen.S>
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen.c b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
> index c3bbc78df8..109c8a90bd 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/strlen.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
> @@ -31,9 +31,12 @@ extern __typeof (__redirect_strlen) __strlen_ppc attribute_hidden;
>  extern __typeof (__redirect_strlen) __strlen_power7 attribute_hidden;
>  extern __typeof (__redirect_strlen) __strlen_power8 attribute_hidden;
>  extern __typeof (__redirect_strlen) __strlen_power9 attribute_hidden;
> +extern __typeof (__redirect_strlen) __strlen_power10 attribute_hidden;
>  
>  libc_ifunc (__libc_strlen,
>  # ifdef __LITTLE_ENDIAN__
> +       (hwcap2 & PPC_FEATURE2_ARCH_3_1)
> +       ? __strlen_power10 :
>           (hwcap2 & PPC_FEATURE2_ARCH_3_00)
>           ? __strlen_power9 :
>  # endif
> -- 
> 2.30.2
>

  parent reply	other threads:[~2021-04-22 18:20 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-22 12:29 Matheus Castanho
2021-04-22 15:22 ` Raphael M Zinsly
2021-04-22 18:20 ` Lucas A. M. Magalhaes [this message]
2021-04-22 19:50 ` Matheus Castanho

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=161911560634.43295.12328092311719242757@localhost.localdomain \
    --to=lamm@linux.ibm.com \
    --cc=libc-alpha@sourceware.org \
    --cc=msc@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).