From: "Lucas A. M. Magalhaes"
To: Matheus Castanho, libc-alpha@sourceware.org
Subject: Re: [PATCH v2] powerpc: Add optimized strlen for POWER10
Date: Thu, 22 Apr 2021 15:20:06 -0300
Message-ID: <161911560634.43295.12328092311719242757@localhost.localdomain>
In-Reply-To: <20210422122911.27758-1-msc@linux.ibm.com>
References: <20210422122911.27758-1-msc@linux.ibm.com>

Hi Matheus,

LGTM. Reviewed and all tests pass.
Thanks for working on this.

Quoting Matheus Castanho via Libc-alpha (2021-04-22 09:29:11)
> Improvements compared to POWER9 version:
>
> 1. Take into account first 16B comparison for aligned strings
>
>    The previous version compares the first 16B and increments r4 by the number
>    of bytes until the address is 16B-aligned, then starts doing aligned loads at
>    that address. For aligned strings, this causes the first 16B to be compared
>    twice, because the increment is 0. Here we calculate the next 16B-aligned
>    address differently, which avoids that issue.
>
> 2. Use simple comparisons for the first ~192 bytes
>
>    The main loop is good for big strings, but comparing 16B each time is better
>    for smaller strings.  So after aligning the address to 16 Bytes, we check
>    more 176B in 16B chunks.  There may be some overlaps with the main loop for
>    unaligned strings, but we avoid using the more aggressive strategy too soon,
>    and also allow the loop to start at a 64B-aligned address.  This greatly
>    benefits smaller strings and avoids overlapping checks if the string is
>    already aligned at a 64B boundary.
>
> 3. Reduce dependencies between load blocks caused by address calculation on loop
>
>    Doing a precise time tracing on the code showed many loads in the loop were
>    stalled waiting for updates to r4 from previous code blocks.  This
>    implementation avoids that as much as possible by using 2 registers (r4 and
>    r5) to hold addresses to be used by different parts of the code.
>
>    Also, the previous code aligned the address to 16B, then to 64B by doing a
>    few 48B loops (if needed) until the address was aligned. The main loop could
>    not start until that 48B loop had finished and r4 was updated with the
>    current address. Here we calculate the address used by the loop very early,
>    so it can start sooner.
>
>    The main loop now uses 2 pointers 128B apart to make pointer updates less
>    frequent, and also unrolls 1 iteration to guarantee there is enough time
>    between iterations to update the pointers, reducing stalled cycles.
>
> 4. Use new P10 instructions
>
>    lxvp is used to load 32B with a single instruction, reducing contention in
>    the load queue.
>
>    vextractbm allows simplifying the tail code for the loop, replacing
>    vbpermq and avoiding having to generate a permute control vector.
>
> Output of bench-strlen from 'make USE_CLOCK_GETTIME=1 BENCHSET="string-benchset"
> using slightly different set of inputs than the default:
>
> $ ./compare_strings.py --functions __strlen_power9,__strlen_power10
>                        -a length,alignment -s benchout_strings.schema.json
>                        -i bench-strlen.out
>
> Function: strlen
> Variant:
>                   __strlen_power10  __strlen_power9
> =======================================================
> length=1, alignment=0:      2.50        2.50  (  0.00%)
> length=1, alignment=1:      2.50        2.50  (  0.00%)
> length=2, alignment=0:      2.50        2.50  (  0.00%)
> length=2, alignment=2:      2.50        2.50  (  0.00%)
> length=3, alignment=0:      2.50        2.50  (  0.00%)
> length=3, alignment=3:      2.50        2.50  (  0.00%)
> length=4, alignment=0:      2.50        2.50  (  0.00%)
> length=4, alignment=4:      2.50        2.50  (  0.00%)
> length=5, alignment=0:      2.50        2.50  (  0.00%)
> length=5, alignment=5:      2.50        2.50  (  0.00%)
> length=6, alignment=0:      2.50        2.50  (  0.00%)
> length=6, alignment=6:      2.50        2.50  (  0.00%)
> length=7, alignment=0:      2.50        2.50  (  0.00%)
> length=7, alignment=7:      2.50        2.50  (  0.00%)
> length=8, alignment=0:      2.50        2.50  (  0.00%)
> length=8, alignment=8:      3.12        3.12  (  0.00%)
> length=9, alignment=0:      2.50        2.50  (  0.00%)
> length=9, alignment=9:      3.12        3.12  (  0.00%)
> length=10, alignment=10:    3.12        3.12  (  0.00%)
> length=16, alignment=0:     3.12        3.40  ( -9.09%)
> length=16, alignment=4:     3.12        3.12  (  0.00%)
> length=16, alignment=7:     3.12        3.12  (  0.00%)
> length=21, alignment=0:     3.12        3.40  ( -9.09%)
> length=21, alignment=5:     3.12        3.12  (  0.00%)
> length=32, alignment=0:     3.12        3.40  ( -9.09%)
> length=32, alignment=7:     3.12        3.40  ( -9.09%)
> length=42, alignment=0:     3.12        3.40  ( -9.09%)
> length=42, alignment=7:     3.42        3.40  (  0.51%)
> length=48, alignment=0:     3.43        3.74  ( -9.13%)
> length=48, alignment=7:     3.40        3.40  (  0.17%)
> length=64, alignment=0:     3.40        5.21  (-53.34%)
> length=64, alignment=7:     3.40        3.74  (-10.00%)
> length=80, alignment=0:     3.74        5.21  (-39.43%)
> length=80, alignment=7:     3.74        4.01  ( -7.14%)
> length=85, alignment=0:     3.74        5.21  (-39.42%)
> length=85, alignment=7:     3.74        4.01  ( -7.14%)
> length=96, alignment=0:     3.74        5.21  (-39.40%)
> length=96, alignment=7:     3.74        4.01  ( -7.14%)
> length=112, alignment=0:    3.74        5.21  (-39.39%)
> length=112, alignment=7:    3.74        4.88  (-30.43%)
> length=128, alignment=0:    4.01        5.91  (-47.59%)
> length=128, alignment=7:    4.01        6.15  (-53.59%)
> length=128, alignment=16:   4.01        6.16  (-53.78%)
> length=128, alignment=23:   4.01        5.17  (-29.08%)
> length=160, alignment=0:    4.01        5.92  (-47.75%)
> length=160, alignment=7:    4.01        6.16  (-53.72%)
> length=160, alignment=16:   4.01        6.14  (-53.29%)
> length=160, alignment=23:   4.01        6.05  (-50.98%)
> length=192, alignment=0:    5.93        6.84  (-15.44%)
> length=192, alignment=7:    5.93        6.90  (-16.35%)
> length=256, alignment=0:    6.61        7.73  (-17.02%)
> length=256, alignment=7:    6.61        7.85  (-18.79%)
> length=320, alignment=0:    7.26        8.65  (-19.12%)
> length=320, alignment=7:    7.26        8.76  (-20.70%)
> length=384, alignment=0:    7.95        9.62  (-20.98%)
> length=384, alignment=7:    7.95        9.49  (-19.37%)
> length=448, alignment=0:    8.73       10.39  (-19.06%)
> length=448, alignment=7:    8.73       10.51  (-20.40%)
> length=512, alignment=0:    9.44       11.13  (-17.87%)
> length=512, alignment=7:    9.45       11.32  (-19.85%)
> length=576, alignment=0:   10.10       11.93  (-18.05%)
> length=576, alignment=7:   10.10       12.02  (-18.97%)
> length=640, alignment=0:   10.71       12.73  (-18.86%)
> length=640, alignment=7:   10.67       12.89  (-20.76%)
> length=704, alignment=0:   11.59       13.39  (-15.61%)
> length=704, alignment=7:   11.59       13.61  (-17.45%)
> length=768, alignment=0:   12.27       14.22  (-15.90%)
> length=768, alignment=7:   12.27       14.44  (-17.72%)
> length=896, alignment=0:   13.48       15.70  (-16.47%)
> length=896, alignment=7:   13.47       15.97  (-18.56%)
> length=960, alignment=0:   14.22       16.63  (-16.92%)
> length=960, alignment=7:   14.19       16.70  (-17.66%)
> length=1024, alignment=0:  14.85       17.46  (-17.54%)
> length=1024, alignment=7:  14.87       17.68  (-18.94%)
> length=1280, alignment=0:  17.58       20.91  (-18.94%)
> length=1280, alignment=7:  17.62       21.35  (-21.13%)
> length=1536, alignment=0:  20.61       24.54  (-19.07%)
> length=1536, alignment=7:  20.61       24.21  (-17.48%)
> length=1792, alignment=0:  23.02       27.94  (-21.39%)
> length=1792, alignment=7:  23.02       27.83  (-20.90%)
> length=2048, alignment=0:  25.98       30.71  (-18.23%)
> length=2048, alignment=7:  25.96       31.26  (-20.45%)
> length=2560, alignment=0:  31.37       37.82  (-20.57%)
> length=2560, alignment=7:  31.34       37.69  (-20.26%)
> length=3008, alignment=0:  35.61       43.29  (-21.56%)
> length=3008, alignment=7:  35.55       43.84  (-23.31%)
> length=3520, alignment=0:  41.08       50.48  (-22.90%)
> length=3520, alignment=7:  41.12       50.63  (-23.13%)
> length=4096, alignment=0:  47.80       57.96  (-21.25%)
> length=4096, alignment=7:  47.79       57.66  (-20.66%)
>
> Reviewed-by: Paul E Murphy
>
> ---
> Changes from v1:
>  - Added comment about minimum binutils version needed to remove the instruction macros
>  - s/reg/vreg/ on CHECK16 for clarity
>
> ---
>  sysdeps/powerpc/powerpc64/le/power10/strlen.S | 221 ++++++++++++++++++
>  sysdeps/powerpc/powerpc64/multiarch/Makefile  |   3 +-
>  .../powerpc64/multiarch/ifunc-impl-list.c     |   2 +
>  .../powerpc64/multiarch/strlen-power10.S      |   2 +
>  sysdeps/powerpc/powerpc64/multiarch/strlen.c  |   3 +
>  5 files changed, 230 insertions(+), 1 deletion(-)
>  create mode 100644 sysdeps/powerpc/powerpc64/le/power10/strlen.S
>  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
>
> diff --git a/sysdeps/powerpc/powerpc64/le/power10/strlen.S b/sysdeps/powerpc/powerpc64/le/power10/strlen.S
> new file mode 100644
> index 0000000000..7eb37a8f54
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power10/strlen.S
> @@ -0,0 +1,221 @@
> +/* Optimized strlen implementation for POWER10 LE.
> +   Copyright (C) 2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   .  */
> +
> +#include
> +
> +#ifndef STRLEN
> +# define STRLEN __strlen
> +# define DEFINE_STRLEN_HIDDEN_DEF 1
> +#endif
> +
> +/* TODO: Replace macros by the actual instructions when minimum binutils becomes
> +   >= 2.35.  This is used to keep compatibility with older versions.  */
> +#define VEXTRACTBM(rt,vrb)       \
> +        .long(((4)<<(32-6))      \
> +              | ((rt)<<(32-11))  \
> +              | ((8)<<(32-16))   \
> +              | ((vrb)<<(32-21)) \
> +              | 1602)
> +
> +#define LXVP(xtp,dq,ra)                      \
> +        .long(((6)<<(32-6))                  \
> +              | ((((xtp)-32)>>1)<<(32-10))   \
> +              | ((1)<<(32-11))               \
> +              | ((ra)<<(32-16))              \
> +              | dq)
> +
> +#define CHECK16(vreg,offset,addr,label) \
> +        lxv       vreg+32,offset(addr); \
> +        vcmpequb. vreg,vreg,v18;        \
> +        bne       cr6,L(label);
> +
> +/* Load 4 quadwords, merge into one VR for speed and check for NULLs.  r6 has #
> +   of bytes already checked.  */
> +#define CHECK64(offset,addr,label)          \
> +        li        r6,offset;                \
> +        LXVP(v4+32,offset,addr);            \
> +        LXVP(v6+32,offset+32,addr);         \
> +        vminub    v14,v4,v5;                \
> +        vminub    v15,v6,v7;                \
> +        vminub    v16,v14,v15;              \
> +        vcmpequb. v0,v16,v18;               \
> +        bne       cr6,L(label)
> +
> +# define TAIL(vreg,increment)     \
> +        vctzlsbb  r4,vreg;        \
> +        subf      r3,r3,r5;       \
> +        addi      r4,r4,increment; \
> +        add       r3,r3,r4;       \
> +        blr
> +
> +/* Implements the function
> +
> +   int [r3] strlen (const void *s [r3])
> +
> +   The implementation can load bytes past a matching byte, but only
> +   up to the next 64B boundary, so it never crosses a page.  */
> +
> +.machine power9
> +
> +ENTRY_TOCLESS (STRLEN, 4)
> +        CALL_MCOUNT 1
> +
> +        vspltisb  v18,0
> +        vspltisb  v19,-1
> +
> +        /* Next 16B-aligned address. Prepare address for L(aligned).  */
> +        addi      r5,r3,16
> +        clrrdi    r5,r5,4
> +
> +        /* Align data and fill bytes not loaded with non matching char.  */
> +        lvx       v0,0,r3
> +        lvsr      v1,0,r3
> +        vperm     v0,v19,v0,v1
> +
> +        vcmpequb. v6,v0,v18
> +        beq       cr6,L(aligned)
> +
> +        vctzlsbb  r3,v6
> +        blr
> +
> +        /* Test more 112B, 16B at a time.  The main loop is optimized for longer
> +           strings, so checking the first bytes in 16B chunks benefits a lot
> +           small strings.  */
> +        .p2align 5
> +L(aligned):
> +        /* Prepare address for the loop.  */
> +        addi      r4,r3,192
> +        clrrdi    r4,r4,6
> +
> +        CHECK16(v0,0,r5,tail1)
> +        CHECK16(v1,16,r5,tail2)
> +        CHECK16(v2,32,r5,tail3)
> +        CHECK16(v3,48,r5,tail4)
> +        CHECK16(v4,64,r5,tail5)
> +        CHECK16(v5,80,r5,tail6)
> +        CHECK16(v6,96,r5,tail7)
> +        CHECK16(v7,112,r5,tail8)
> +        CHECK16(v8,128,r5,tail9)
> +        CHECK16(v9,144,r5,tail10)
> +        CHECK16(v10,160,r5,tail11)
> +
> +        addi      r5,r4,128
> +
> +        /* Switch to a more aggressive approach checking 64B each time.  Use 2
> +           pointers 128B apart and unroll the loop once to make the pointer
> +           updates and usages separated enough to avoid stalls waiting for
> +           address calculation.  */
> +        .p2align 5
> +L(loop):
> +        CHECK64(0,r4,pre_tail_64b)
> +        CHECK64(64,r4,pre_tail_64b)
> +        addi      r4,r4,256
> +
> +        CHECK64(0,r5,tail_64b)
> +        CHECK64(64,r5,tail_64b)
> +        addi      r5,r5,256
> +
> +        b         L(loop)
> +
> +        .p2align 5
> +L(pre_tail_64b):
> +        mr        r5,r4
> +L(tail_64b):
> +        /* OK, we found a null byte.  Let's look for it in the current 64-byte
> +           block and mark it in its corresponding VR.  lxvp vx,0(ry) puts the
> +           low 16B bytes into vx+1, and the high into vx, so the order here is
> +           v5, v4, v7, v6.  */
> +        vcmpequb  v1,v5,v18
> +        vcmpequb  v2,v4,v18
> +        vcmpequb  v3,v7,v18
> +        vcmpequb  v4,v6,v18
> +
> +        /* Take into account the other 64B blocks we had already checked.  */
> +        add       r5,r5,r6
> +
> +        /* Extract first bit of each byte.  */
> +        VEXTRACTBM(r7,v1)
> +        VEXTRACTBM(r8,v2)
> +        VEXTRACTBM(r9,v3)
> +        VEXTRACTBM(r10,v4)
> +
> +        /* Shift each value into their corresponding position.  */
> +        sldi      r8,r8,16
> +        sldi      r9,r9,32
> +        sldi      r10,r10,48
> +
> +        /* Merge the results.  */
> +        or        r7,r7,r8
> +        or        r8,r9,r10
> +        or        r10,r8,r7
> +
> +        cnttzd    r0,r10          /* Count trailing zeros before the match.  */
> +        subf      r5,r3,r5
> +        add       r3,r5,r0        /* Compute final length.  */
> +        blr
> +
> +        .p2align 5
> +L(tail1):
> +        TAIL(v0,0)
> +
> +        .p2align 5
> +L(tail2):
> +        TAIL(v1,16)
> +
> +        .p2align 5
> +L(tail3):
> +        TAIL(v2,32)
> +
> +        .p2align 5
> +L(tail4):
> +        TAIL(v3,48)
> +
> +        .p2align 5
> +L(tail5):
> +        TAIL(v4,64)
> +
> +        .p2align 5
> +L(tail6):
> +        TAIL(v5,80)
> +
> +        .p2align 5
> +L(tail7):
> +        TAIL(v6,96)
> +
> +        .p2align 5
> +L(tail8):
> +        TAIL(v7,112)
> +
> +        .p2align 5
> +L(tail9):
> +        TAIL(v8,128)
> +
> +        .p2align 5
> +L(tail10):
> +        TAIL(v9,144)
> +
> +        .p2align 5
> +L(tail11):
> +        TAIL(v10,160)
> +
> +END (STRLEN)
> +
> +#ifdef DEFINE_STRLEN_HIDDEN_DEF
> +weak_alias (__strlen, strlen)
> +libc_hidden_builtin_def (strlen)
> +#endif
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> index f46bf50732..8aa46a3702 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
> +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> @@ -33,7 +33,8 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
>
>  ifneq (,$(filter %le,$(config-machine)))
>  sysdep_routines += strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \
> -                   rawmemchr-power9 strlen-power9 strncpy-power9 stpncpy-power9
> +                   rawmemchr-power9 strlen-power9 strncpy-power9 stpncpy-power9 \
> +                   strlen-power10
>  endif
>  CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops
>  CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> index 72f7f83e7e..1a6993616f 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> @@ -112,6 +112,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>    /* Support sysdeps/powerpc/powerpc64/multiarch/strlen.c.  */
>    IFUNC_IMPL (i, name, strlen,
>  #ifdef __LITTLE_ENDIAN__
> +              IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_3_1,
> +                              __strlen_power10)
>                IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_3_00,
>                                __strlen_power9)
>  #endif
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S b/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
> new file mode 100644
> index 0000000000..6a774fad58
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/multiarch/strlen-power10.S
> @@ -0,0 +1,2 @@
> +#define STRLEN __strlen_power10
> +#include
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen.c b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
> index c3bbc78df8..109c8a90bd 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/strlen.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
> @@ -31,9 +31,12 @@ extern __typeof (__redirect_strlen) __strlen_ppc attribute_hidden;
>  extern __typeof (__redirect_strlen) __strlen_power7 attribute_hidden;
>  extern __typeof (__redirect_strlen) __strlen_power8 attribute_hidden;
>  extern __typeof (__redirect_strlen) __strlen_power9 attribute_hidden;
> +extern __typeof (__redirect_strlen) __strlen_power10 attribute_hidden;
>
>  libc_ifunc (__libc_strlen,
>  # ifdef __LITTLE_ENDIAN__
> +            (hwcap2 & PPC_FEATURE2_ARCH_3_1)
> +            ? __strlen_power10 :
>              (hwcap2 & PPC_FEATURE2_ARCH_3_00)
>              ? __strlen_power9 :
>  # endif
> --
> 2.30.2
>
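
For anyone following the L(tail_64b) logic in the quoted patch, here is a rough scalar sketch in C of what that tail appears to compute: one 16-bit null mask per 16B chunk, shifted and merged into a 64-bit value, then a trailing-zero count to find the first null byte. This is only an illustration under my own assumptions, not the patch's code. The helper names (null_mask16, count_trailing_zeros64, first_null_in_block64) are hypothetical, and the bit ordering (lowest address in the least significant bit) is assumed from the use of cnttzd on little-endian.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for vcmpequb followed by VEXTRACTBM on one
   16-byte chunk: bit i of the result is set when chunk[i] is a null
   byte.  Assumes the lowest address maps to the least significant bit.  */
static uint64_t
null_mask16 (const unsigned char *chunk)
{
  uint64_t mask = 0;
  for (int i = 0; i < 16; i++)
    if (chunk[i] == 0)
      mask |= (uint64_t) 1 << i;
  return mask;
}

/* Portable trailing-zero count, playing the role of cnttzd.
   Returns 64 when x is 0.  */
static unsigned
count_trailing_zeros64 (uint64_t x)
{
  unsigned n = 0;
  if (x == 0)
    return 64;
  while ((x & 1) == 0)
    {
      x >>= 1;
      n++;
    }
  return n;
}

/* Offset of the first null byte in a 64-byte block, or 64 if the
   block contains none.  Mirrors the sldi/or merge in the tail.  */
static size_t
first_null_in_block64 (const unsigned char *block)
{
  uint64_t merged = null_mask16 (block)
                    | (null_mask16 (block + 16) << 16)
                    | (null_mask16 (block + 32) << 32)
                    | (null_mask16 (block + 48) << 48);
  return count_trailing_zeros64 (merged);
}
```

In the patch the final length is then the block's distance from the start of the string plus this offset; the sketch leaves that addition to the caller.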