From mboxrd@z Thu Jan  1 00:00:00 1970
Content-Type: text/plain; charset="utf-8"
In-Reply-To: <20210422122911.27758-1-msc@linux.ibm.com>
References: <20210422122911.27758-1-msc@linux.ibm.com>
Subject: Re: [PATCH v2] powerpc: Add optimized strlen for POWER10
From: "Lucas A. M. Magalhaes" <lamm@linux.ibm.com>
To: Matheus Castanho <msc@linux.ibm.com>, libc-alpha@sourceware.org
Date: Thu, 22 Apr 2021 15:20:06 -0300
Message-ID: <161911560634.43295.12328092311719242757@localhost.localdomain>
User-Agent: alot/0.9.1
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Thu, 22 Apr 2021 18:20:14 -0000

Hi Matheus, LGTM. Reviewed and all tests pass.

Thanks for working on this.

Quoting Matheus Castanho via Libc-alpha (2021-04-22 09:29:11)
> Improvements compared to POWER9 version:
>
> 1. Take into account first 16B comparison for aligned strings
>
>    The previous version compares the first 16B and increments r4 by the number
>    of bytes until the address is 16B-aligned, then starts doing aligned loads at
>    that address. For aligned strings, this causes the first 16B to be compared
>    twice, because the increment is 0. Here we calculate the next 16B-aligned
>    address differently, which avoids that issue.
>
> 2. Use simple comparisons for the first ~192 bytes
>
>    The main loop is good for big strings, but comparing 16B each time is better
>    for smaller strings.  So after aligning the address to 16 bytes, we check
>    176B more in 16B chunks.  There may be some overlap with the main loop for
>    unaligned strings, but we avoid using the more aggressive strategy too soon,
>    and also allow the loop to start at a 64B-aligned address.  This greatly
>    benefits smaller strings and avoids overlapping checks if the string is
>    already aligned at a 64B boundary.
>
> 3. Reduce dependencies between load blocks caused by address calculation in
>    the loop
>
>    Precise time tracing of the code showed many loads in the loop were
>    stalled waiting for updates to r4 from previous code blocks.  This
>    implementation avoids that as much as possible by using 2 registers (r4 and
>    r5) to hold addresses to be used by different parts of the code.
>
>    Also, the previous code aligned the address to 16B, then to 64B by doing a
>    few 48B loops (if needed) until the address was aligned. The main loop could
>    not start until that 48B loop had finished and r4 was updated with the
>    current address. Here we calculate the address used by the loop very early,
>    so it can start sooner.
>
>    The main loop now uses 2 pointers 128B apart to make pointer updates less
>    frequent, and also unrolls 1 iteration to guarantee there is enough time
>    between iterations to update the pointers, reducing stalled cycles.
>
> 4. Use new P10 instructions
>
>    lxvp is used to load 32B with a single instruction, reducing contention in
>    the load queue.
>
>    vextractbm allows simplifying the tail code for the loop, replacing
>    vbpermq and avoiding having to generate a permute control vector.
>
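
For anyone following items 1 and 3, here is a rough C model of the two address
computations (r5 for the 16B checks, r4 for the 64B loop). It is only a sketch
based on the addi/clrrdi pairs in the quoted patch; the function names are mine
and are not part of glibc.

```c
#include <assert.h>
#include <stdint.h>

/* Next 16B-aligned address strictly after s.
   Models: addi r5,r3,16 ; clrrdi r5,r5,4.  */
static uintptr_t next_aligned16(uintptr_t s)
{
    return (s + 16) & ~(uintptr_t)15;
}

/* 64B-aligned address where the main loop starts.
   Models: addi r4,r3,192 ; clrrdi r4,r4,6.  */
static uintptr_t loop_start64(uintptr_t s)
{
    return (s + 192) & ~(uintptr_t)63;
}

/* Bytes that both the 16B checks and the 64B loop would look at.
   The eleven CHECK16 steps cover [r5, r5 + 176).  */
static uintptr_t overlap_bytes(uintptr_t s)
{
    uintptr_t end16 = next_aligned16(s) + 176;
    uintptr_t start64 = loop_start64(s);

    assert(end16 >= start64);   /* no gap between the two phases */
    return end16 - start64;
}
```

In this model the overlap is 0 for a start address in the first 16 bytes of a
64B block, and at most 48 bytes otherwise.
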
> Output of bench-strlen from 'make USE_CLOCK_GETTIME=1 BENCHSET="string-benchset"'
> using a slightly different set of inputs than the default:
>
> $ ./compare_strings.py --functions __strlen_power9,__strlen_power10 \
>                        -a length,alignment -s benchout_strings.schema.json \
>                        -i bench-strlen.out
>
> Function: strlen
> Variant:
>                                     __strlen_power10    __strlen_power9
> ================================================================================
>                length=1, alignment=0:         2.50              2.50 (  0.00%)
>                length=1, alignment=1:         2.50              2.50 (  0.00%)
>                length=2, alignment=0:         2.50              2.50 (  0.00%)
>                length=2, alignment=2:         2.50              2.50 (  0.00%)
>                length=3, alignment=0:         2.50              2.50 (  0.00%)
>                length=3, alignment=3:         2.50              2.50 (  0.00%)
>                length=4, alignment=0:         2.50              2.50 (  0.00%)
>                length=4, alignment=4:         2.50              2.50 (  0.00%)
>                length=5, alignment=0:         2.50              2.50 (  0.00%)
>                length=5, alignment=5:         2.50              2.50 (  0.00%)
>                length=6, alignment=0:         2.50              2.50 (  0.00%)
>                length=6, alignment=6:         2.50              2.50 (  0.00%)
>                length=7, alignment=0:         2.50              2.50 (  0.00%)
>                length=7, alignment=7:         2.50              2.50 (  0.00%)
>                length=8, alignment=0:         2.50              2.50 (  0.00%)
>                length=8, alignment=8:         3.12              3.12 (  0.00%)
>                length=9, alignment=0:         2.50              2.50 (  0.00%)
>                length=9, alignment=9:         3.12              3.12 (  0.00%)
>              length=10, alignment=10:         3.12              3.12 (  0.00%)
>               length=16, alignment=0:         3.12              3.40 ( -9.09%)
>               length=16, alignment=4:         3.12              3.12 (  0.00%)
>               length=16, alignment=7:         3.12              3.12 (  0.00%)
>               length=21, alignment=0:         3.12              3.40 ( -9.09%)
>               length=21, alignment=5:         3.12              3.12 (  0.00%)
>               length=32, alignment=0:         3.12              3.40 ( -9.09%)
>               length=32, alignment=7:         3.12              3.40 ( -9.09%)
>               length=42, alignment=0:         3.12              3.40 ( -9.09%)
>               length=42, alignment=7:         3.42              3.40 (  0.51%)
>               length=48, alignment=0:         3.43              3.74 ( -9.13%)
>               length=48, alignment=7:         3.40              3.40 (  0.17%)
>               length=64, alignment=0:         3.40              5.21 (-53.34%)
>               length=64, alignment=7:         3.40              3.74 (-10.00%)
>               length=80, alignment=0:         3.74              5.21 (-39.43%)
>               length=80, alignment=7:         3.74              4.01 ( -7.14%)
>               length=85, alignment=0:         3.74              5.21 (-39.42%)
>               length=85, alignment=7:         3.74              4.01 ( -7.14%)
>               length=96, alignment=0:         3.74              5.21 (-39.40%)
>               length=96, alignment=7:         3.74              4.01 ( -7.14%)
>              length=112, alignment=0:         3.74              5.21 (-39.39%)
>              length=112, alignment=7:         3.74              4.88 (-30.43%)
>              length=128, alignment=0:         4.01              5.91 (-47.59%)
>              length=128, alignment=7:         4.01              6.15 (-53.59%)
>             length=128, alignment=16:         4.01              6.16 (-53.78%)
>             length=128, alignment=23:         4.01              5.17 (-29.08%)
>              length=160, alignment=0:         4.01              5.92 (-47.75%)
>              length=160, alignment=7:         4.01              6.16 (-53.72%)
>             length=160, alignment=16:         4.01              6.14 (-53.29%)
>             length=160, alignment=23:         4.01              6.05 (-50.98%)
>              length=192, alignment=0:         5.93              6.84 (-15.44%)
>              length=192, alignment=7:         5.93              6.90 (-16.35%)
>              length=256, alignment=0:         6.61              7.73 (-17.02%)
>              length=256, alignment=7:         6.61              7.85 (-18.79%)
>              length=320, alignment=0:         7.26              8.65 (-19.12%)
>              length=320, alignment=7:         7.26              8.76 (-20.70%)
>              length=384, alignment=0:         7.95              9.62 (-20.98%)
>              length=384, alignment=7:         7.95              9.49 (-19.37%)
>              length=448, alignment=0:         8.73             10.39 (-19.06%)
>              length=448, alignment=7:         8.73             10.51 (-20.40%)
>              length=512, alignment=0:         9.44             11.13 (-17.87%)
>              length=512, alignment=7:         9.45             11.32 (-19.85%)
>              length=576, alignment=0:        10.10             11.93 (-18.05%)
>              length=576, alignment=7:        10.10             12.02 (-18.97%)
>              length=640, alignment=0:        10.71             12.73 (-18.86%)
>              length=640, alignment=7:        10.67             12.89 (-20.76%)
>              length=704, alignment=0:        11.59             13.39 (-15.61%)
>              length=704, alignment=7:        11.59             13.61 (-17.45%)
>              length=768, alignment=0:        12.27             14.22 (-15.90%)
>              length=768, alignment=7:        12.27             14.44 (-17.72%)
>              length=896, alignment=0:        13.48             15.70 (-16.47%)
>              length=896, alignment=7:        13.47             15.97 (-18.56%)
>              length=960, alignment=0:        14.22             16.63 (-16.92%)
>              length=960, alignment=7:        14.19             16.70 (-17.66%)
>             length=1024, alignment=0:        14.85             17.46 (-17.54%)
>             length=1024, alignment=7:        14.87             17.68 (-18.94%)
>             length=1280, alignment=0:        17.58             20.91 (-18.94%)
>             length=1280, alignment=7:        17.62             21.35 (-21.13%)
>             length=1536, alignment=0:        20.61             24.54 (-19.07%)
>             length=1536, alignment=7:        20.61             24.21 (-17.48%)
>             length=1792, alignment=0:        23.02             27.94 (-21.39%)
>             length=1792, alignment=7:        23.02             27.83 (-20.90%)
>             length=2048, alignment=0:        25.98             30.71 (-18.23%)
>             length=2048, alignment=7:        25.96             31.26 (-20.45%)
>             length=2560, alignment=0:        31.37             37.82 (-20.57%)
>             length=2560, alignment=7:        31.34             37.69 (-20.26%)
>             length=3008, alignment=0:        35.61             43.29 (-21.56%)
>             length=3008, alignment=7:        35.55             43.84 (-23.31%)
>             length=3520, alignment=0:        41.08             50.48 (-22.90%)
>             length=3520, alignment=7:        41.12             50.63 (-23.13%)
>             length=4096, alignment=0:        47.80             57.96 (-21.25%)
>             length=4096, alignment=7:        47.79             57.66 (-20.66%)
>
> Reviewed-by: Paul E Murphy <murphyp@linux.ibm.com>
>
> ---
> Changes from v1:
>   - Added comment about minimum binutils version needed to remove the instruction macros
>   - s/reg/vreg/ on CHECK16 for clarity
>
> ---
>  sysdeps/powerpc/powerpc64/le/power10/strlen.S | 221 ++++++++++++++++++
>  sysdeps/powerpc/powerpc64/multiarch/Makefile  |   3 +-
>  .../powerpc64/multiarch/ifunc-impl-list.c     |   2 +
>  .../powerpc64/multiarch/strlen-power10.S      |   2 +
>  sysdeps/powerpc/powerpc64/multiarch/strlen.c  |   3 +
>  5 files changed, 230 insertions(+), 1 deletion(-)
>  create mode 100644 sysdeps/powerpc/powerpc64/le/power10/strlen.S
>  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
>
> diff --git a/sysdeps/powerpc/powerpc64/le/power10/strlen.S b/sysdeps/powerpc/powerpc64/le/power10/strlen.S
> new file mode 100644
> index 0000000000..7eb37a8f54
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power10/strlen.S
> @@ -0,0 +1,221 @@
> +/* Optimized strlen implementation for POWER10 LE.
> +   Copyright (C) 2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +
> +#ifndef STRLEN
> +# define STRLEN __strlen
> +# define DEFINE_STRLEN_HIDDEN_DEF 1
> +#endif
> +
> +/* TODO: Replace macros by the actual instructions when minimum binutils becomes
> +   >= 2.35.  This is used to keep compatibility with older versions.  */
> +#define VEXTRACTBM(rt,vrb)      \
> +       .long(((4)<<(32-6))      \
> +             | ((rt)<<(32-11))  \
> +             | ((8)<<(32-16))   \
> +             | ((vrb)<<(32-21)) \
> +             | 1602)
> +
> +#define LXVP(xtp,dq,ra)                   \
> +       .long(((6)<<(32-6))                \
> +             | ((((xtp)-32)>>1)<<(32-10)) \
> +             | ((1)<<(32-11))             \
> +             | ((ra)<<(32-16))            \
> +             | dq)
> +
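
The two .long macros above can be sanity-checked by computing the instruction
words in C. This is a sketch only: the helper names and the range checks are my
assumptions, derived from how the macros are used in this patch (register
numbers as plain integers, so v4+32 is 36 and r4 is 4).

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors VEXTRACTBM(rt,vrb): primary opcode 4, extended opcode 1602.  */
static uint32_t encode_vextractbm(uint32_t rt, uint32_t vrb)
{
    assert(rt < 32 && vrb < 32);
    return (4u << (32 - 6))
           | (rt << (32 - 11))
           | (8u << (32 - 16))
           | (vrb << (32 - 21))
           | 1602u;
}

/* Mirrors LXVP(xtp,dq,ra) as used here: xtp is an even VSR number in
   32..62 and dq is a non-negative multiple of 16.  */
static uint32_t encode_lxvp(uint32_t xtp, uint32_t dq, uint32_t ra)
{
    assert(xtp >= 32 && xtp < 64 && (xtp & 1) == 0);
    assert(ra < 32 && dq < 65536 && (dq & 15) == 0);
    return (6u << (32 - 6))
           | (((xtp - 32) >> 1) << (32 - 10))
           | (1u << (32 - 11))
           | (ra << (32 - 16))
           | dq;
}
```

For example, VEXTRACTBM(r7,v1) comes out as 0x10E80E42 and
LXVP(v4+32,0,r4) as 0x18A40000 with these helpers.
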
> +#define CHECK16(vreg,offset,addr,label) \
> +       lxv       vreg+32,offset(addr); \
> +       vcmpequb. vreg,vreg,v18;        \
> +       bne       cr6,L(label);
> +
> +/* Load 4 quadwords, merge into one VR for speed and check for null bytes.
> +   r6 has the number of bytes already checked.  */
> +   of bytes already checked.  */
> +#define CHECK64(offset,addr,label)         \
> +       li        r6,offset;                \
> +       LXVP(v4+32,offset,addr);            \
> +       LXVP(v6+32,offset+32,addr);         \
> +       vminub    v14,v4,v5;                \
> +       vminub    v15,v6,v7;                \
> +       vminub    v16,v14,v15;              \
> +       vcmpequb. v0,v16,v18;               \
> +       bne       cr6,L(label)
> +
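
The CHECK64 trick (merge four quadwords with a byte-wise minimum, then do a
single compare against zero) can be modelled in scalar C. This is a sketch of
the idea only, not of the vector code; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if any of the 64 bytes at p is zero.  For each of the 16
   lanes, take the minimum across the four 16B quadwords (the role of
   vminub); the minimum is zero exactly when one of the inputs is zero,
   so one comparison per lane is enough (the role of vcmpequb.).  */
static int block64_has_zero(const unsigned char *p)
{
    assert(p != NULL);
    for (size_t lane = 0; lane < 16; lane++) {
        unsigned char min = p[lane];
        for (size_t q = 1; q < 4; q++) {
            unsigned char b = p[16 * q + lane];
            if (b < min)
                min = b;
        }
        if (min == 0)
            return 1;
    }
    return 0;
}
```

The merge only says that a null byte exists somewhere in the 64B block; finding
its position is left to the tail code.
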
> +# define TAIL(vreg,increment)     \
> +       vctzlsbb  r4,vreg;         \
> +       subf      r3,r3,r5;        \
> +       addi      r4,r4,increment; \
> +       add       r3,r3,r4;        \
> +       blr
> +
> +/* Implements the function
> +
> +   size_t [r3] strlen (const char *s [r3])
> +
> +   The implementation can load bytes past a matching byte, but only
> +   up to the next 64B boundary, so it never crosses a page.  */
> +
> +.machine power9
> +
> +ENTRY_TOCLESS (STRLEN, 4)
> +       CALL_MCOUNT 1
> +
> +       vspltisb  v18,0
> +       vspltisb  v19,-1
> +
> +       /* Next 16B-aligned address. Prepare address for L(aligned).  */
> +       addi      r5,r3,16
> +       clrrdi    r5,r5,4
> +
> +       /* Align data and fill bytes not loaded with non matching char.  */
> +       lvx       v0,0,r3
> +       lvsr      v1,0,r3
> +       vperm     v0,v19,v0,v1
> +
> +       vcmpequb. v6,v0,v18
> +       beq       cr6,L(aligned)
> +
> +       vctzlsbb  r3,v6
> +       blr
> +
> +       /* Test next 176B, 16B at a time.  The main loop is optimized for longer
> +          strings, so checking the first bytes in 16B chunks greatly benefits
> +          small strings.  */
> +       .p2align 5
> +L(aligned):
> +       /* Prepare address for the loop.  */
> +       addi      r4,r3,192
> +       clrrdi    r4,r4,6
> +
> +       CHECK16(v0,0,r5,tail1)
> +       CHECK16(v1,16,r5,tail2)
> +       CHECK16(v2,32,r5,tail3)
> +       CHECK16(v3,48,r5,tail4)
> +       CHECK16(v4,64,r5,tail5)
> +       CHECK16(v5,80,r5,tail6)
> +       CHECK16(v6,96,r5,tail7)
> +       CHECK16(v7,112,r5,tail8)
> +       CHECK16(v8,128,r5,tail9)
> +       CHECK16(v9,144,r5,tail10)
> +       CHECK16(v10,160,r5,tail11)
> +
> +       addi      r5,r4,128
> +
> +       /* Switch to a more aggressive approach checking 64B each time.  Use 2
> +          pointers 128B apart and unroll the loop once to make the pointer
> +          updates and usages separated enough to avoid stalls waiting for
> +          address calculation.  */
> +       .p2align 5
> +L(loop):
> +       CHECK64(0,r4,pre_tail_64b)
> +       CHECK64(64,r4,pre_tail_64b)
> +       addi      r4,r4,256
> +
> +       CHECK64(0,r5,tail_64b)
> +       CHECK64(64,r5,tail_64b)
> +       addi      r5,r5,256
> +
> +       b         L(loop)
> +
> +       .p2align  5
> +L(pre_tail_64b):
> +       mr      r5,r4
> +L(tail_64b):
> +       /* OK, we found a null byte.  Let's look for it in the current 64-byte
> +          block and mark it in its corresponding VR.  lxvp vx,0(ry) puts the
> +          low 16 bytes into vx+1, and the high into vx, so the order here is
> +          v5, v4, v7, v6.  */
> +       vcmpequb  v1,v5,v18
> +       vcmpequb  v2,v4,v18
> +       vcmpequb  v3,v7,v18
> +       vcmpequb  v4,v6,v18
> +
> +       /* Take into account the other 64B blocks we had already checked.  */
> +       add     r5,r5,r6
> +
> +       /* Extract first bit of each byte.  */
> +       VEXTRACTBM(r7,v1)
> +       VEXTRACTBM(r8,v2)
> +       VEXTRACTBM(r9,v3)
> +       VEXTRACTBM(r10,v4)
> +
> +       /* Shift each value into its corresponding position.  */
> +       sldi      r8,r8,16
> +       sldi      r9,r9,32
> +       sldi      r10,r10,48
> +
> +       /* Merge the results.  */
> +       or        r7,r7,r8
> +       or        r8,r9,r10
> +       or        r10,r8,r7
> +
> +       cnttzd    r0,r10          /* Count trailing zeros before the match.  */
> +       subf      r5,r3,r5
> +       add       r3,r5,r0        /* Compute final length.  */
> +       blr
> +
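
The tail logic above (one 16-bit mask per quadword, shifted into place, merged,
then a trailing-zero count) can be sketched in portable C as follows. I am
assuming bit i of each mask stands for the byte at offset i, which is the
effect the code relies on; the real instructions' bit numbering is not
modelled, and the function names are mine.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per byte of a 16B quadword, set where the byte is zero.
   Stands in for vcmpequb followed by vextractbm.  */
static uint16_t zero_mask16(const unsigned char *p)
{
    uint16_t mask = 0;
    for (unsigned i = 0; i < 16; i++)
        if (p[i] == 0)
            mask |= (uint16_t)(1u << i);
    return mask;
}

/* Offset of the first zero byte in a 64B block that contains one.  */
static unsigned first_zero_in_block64(const unsigned char *p)
{
    /* Shift each mask into its position and merge (sldi / or).  */
    uint64_t mask = (uint64_t)zero_mask16(p)
                    | ((uint64_t)zero_mask16(p + 16) << 16)
                    | ((uint64_t)zero_mask16(p + 32) << 32)
                    | ((uint64_t)zero_mask16(p + 48) << 48);

    assert(mask != 0);  /* caller guarantees a null byte is present */

    /* Portable stand-in for cnttzd.  */
    unsigned n = 0;
    while ((mask & 1) == 0) {
        mask >>= 1;
        n++;
    }
    return n;
}
```

The final length is then the block's offset from the start of the string plus
this count, which is what the subf/add pair before blr computes.
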
> +       .p2align  5
> +L(tail1):
> +       TAIL(v0,0)
> +
> +       .p2align  5
> +L(tail2):
> +       TAIL(v1,16)
> +
> +       .p2align  5
> +L(tail3):
> +       TAIL(v2,32)
> +
> +       .p2align  5
> +L(tail4):
> +       TAIL(v3,48)
> +
> +       .p2align  5
> +L(tail5):
> +       TAIL(v4,64)
> +
> +       .p2align  5
> +L(tail6):
> +       TAIL(v5,80)
> +
> +       .p2align  5
> +L(tail7):
> +       TAIL(v6,96)
> +
> +       .p2align  5
> +L(tail8):
> +       TAIL(v7,112)
> +
> +       .p2align  5
> +L(tail9):
> +       TAIL(v8,128)
> +
> +       .p2align  5
> +L(tail10):
> +       TAIL(v9,144)
> +
> +       .p2align  5
> +L(tail11):
> +       TAIL(v10,160)
> +
> +END (STRLEN)
> +
> +#ifdef DEFINE_STRLEN_HIDDEN_DEF
> +weak_alias (__strlen, strlen)
> +libc_hidden_builtin_def (strlen)
> +#endif
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> index f46bf50732..8aa46a3702 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
> +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> @@ -33,7 +33,8 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
>
>  ifneq (,$(filter %le,$(config-machine)))
>  sysdep_routines += strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \
> -                  rawmemchr-power9 strlen-power9 strncpy-power9 stpncpy-power9
> +                  rawmemchr-power9 strlen-power9 strncpy-power9 stpncpy-power9 \
> +                  strlen-power10
>  endif
>  CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops
>  CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> index 72f7f83e7e..1a6993616f 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> @@ -112,6 +112,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>    /* Support sysdeps/powerpc/powerpc64/multiarch/strlen.c.  */
>    IFUNC_IMPL (i, name, strlen,
>  #ifdef __LITTLE_ENDIAN__
> +             IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_3_1,
> +                             __strlen_power10)
>               IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_3_00,
>                               __strlen_power9)
>  #endif
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S b/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
> new file mode 100644
> index 0000000000..6a774fad58
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
> @@ -0,0 +1,2 @@
> +#define STRLEN __strlen_power10
> +#include <sysdeps/powerpc/powerpc64/le/power10/strlen.S>
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen.c b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
> index c3bbc78df8..109c8a90bd 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/strlen.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
> @@ -31,9 +31,12 @@ extern __typeof (__redirect_strlen) __strlen_ppc attribute_hidden;
>  extern __typeof (__redirect_strlen) __strlen_power7 attribute_hidden;
>  extern __typeof (__redirect_strlen) __strlen_power8 attribute_hidden;
>  extern __typeof (__redirect_strlen) __strlen_power9 attribute_hidden;
> +extern __typeof (__redirect_strlen) __strlen_power10 attribute_hidden;
>
>  libc_ifunc (__libc_strlen,
>  # ifdef __LITTLE_ENDIAN__
> +       (hwcap2 & PPC_FEATURE2_ARCH_3_1)
> +       ? __strlen_power10 :
>           (hwcap2 & PPC_FEATURE2_ARCH_3_00)
>           ? __strlen_power9 :
>  # endif
> --
> 2.30.2
>
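
One note on the ifunc ordering in strlen.c: the POWER10 test has to come before
the POWER9 one, since a newer CPU would normally report the older ISA level as
well. A minimal C sketch of that selection order follows; the DEMO_* bits and
the enum are hypothetical stand-ins, not the real PPC_FEATURE2_* values or
glibc symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the hwcap2 feature bits.  */
#define DEMO_ARCH_3_00 (1u << 0)   /* plays the role of PPC_FEATURE2_ARCH_3_00 */
#define DEMO_ARCH_3_1  (1u << 1)   /* plays the role of PPC_FEATURE2_ARCH_3_1 */

typedef enum { IMPL_GENERIC, IMPL_POWER9, IMPL_POWER10 } strlen_impl;

/* Pick the most specific implementation first, as the patched
   libc_ifunc chain does.  */
static strlen_impl select_strlen(uint32_t hwcap2)
{
    /* The two demo bits must be distinct for the ordering to matter.  */
    assert((DEMO_ARCH_3_00 & DEMO_ARCH_3_1) == 0);

    if (hwcap2 & DEMO_ARCH_3_1)
        return IMPL_POWER10;
    if (hwcap2 & DEMO_ARCH_3_00)
        return IMPL_POWER9;
    return IMPL_GENERIC;
}
```

With both bits set the sketch picks the POWER10 variant; swapping the two tests
would make it unreachable on such a machine.
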