References: <20210428174642.1437818-1-lamm@linux.ibm.com> <87wnsm2rbz.fsf@linux.ibm.com> <161971180792.1464995.9920465636880275787@fedora.local>
From: Matheus Castanho
To: "Lucas A. M. Magalhaes"
Cc: tulio@linux.ibm.com, libc-alpha@sourceware.org
Subject: Re: [PATCH] powerpc: Optimized memmove for POWER10
In-reply-to: <161971180792.1464995.9920465636880275787@fedora.local>
Date: Thu, 29 Apr 2021 16:18:38 -0300
Message-ID: <87pmycyk4x.fsf@linux.ibm.com>

Lucas A. M. Magalhaes writes:

> Quoting Matheus Castanho (2021-04-29 11:18:30)
>>
>> No new test failures after this patch.
>>
>> Some comments and questions below.
>>
>> Lucas A. M. Magalhaes via Libc-alpha writes:
>>
>> > This patch was initially based on the __memmove_power7 with some ideas
>> > from strncpy implementation for Power 9.
>> >
>> > Improvements from __memmove_power7:
>> >
>> > 1. Use lxvl/stxvl for alignment code.
>> >
>> > The code for Power 7 uses branches when the input is not naturally
>> > aligned to the width of a vector. The new implementation uses
>> > lxvl/stxvl instead which reduces pressure on GPRs. It also allows
>> > the removal of branch instructions, implicitly removing branch stalls
>> > and mispredictions.
>> >
>> > 2. Use of lxv/stxv and lxvl/stxvl pair is safe to use on Cache Inhibited
>> > memory.
>> >
>> > On Power 10 vector load and stores are safe to use on CI memory for
>> > addresses unaligned to 16B. This code takes advantage of this to
>> > do unaligned loads.
>> >
>> > The unaligned loads don't have a significant performance impact by
>> > themselves. However doing so decreases register pressure on GPRs
>> > and interdependence stalls on load/store pairs. This also improved
>> > readability as there are now less code paths for different alignments.
>> > Finally this reduces the overall code size.
>> >
>> > 3. Improved performance.
>> >
>> > This version runs on average about 30% better than memmove_power7
>> > for lengths larger than 8KB. For input lengths shorter than 8KB
>> > the improvement is smaller, it has on average about 17% better
>> > performance.
>> >
>> > This version has a degradation of about 50% for input lengths
>> > in the 0 to 31 bytes range when dest is unaligned src.
>>
>> What do you mean? When dest is unaligned or when src is unaligned? Or both?
>>
> Fixed.
> >> > --- >> > .../powerpc/powerpc64/le/power10/memmove.S | 313 ++++++++++++++++++ >> > sysdeps/powerpc/powerpc64/multiarch/Makefile | 3 +- >> > .../powerpc64/multiarch/ifunc-impl-list.c | 7 + >> > .../powerpc64/multiarch/memmove-power10.S | 24 ++ >> > sysdeps/powerpc/powerpc64/multiarch/memmove.c | 16 +- >> > 5 files changed, 358 insertions(+), 5 deletions(-) >> > create mode 100644 sysdeps/powerpc/powerpc64/le/power10/memmove.S >> > create mode 100644 sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S >> > >> > diff --git a/sysdeps/powerpc/powerpc64/le/power10/memmove.S b/sysdeps/powerpc/powerpc64/le/power10/memmove.S >> > new file mode 100644 >> > index 0000000000..7cff5ef2ac >> > --- /dev/null >> > +++ b/sysdeps/powerpc/powerpc64/le/power10/memmove.S >> > @@ -0,0 +1,313 @@ >> > +/* Optimized memmove implementation for POWER10. >> > + Copyright (C) 2021 Free Software Foundation, Inc. >> > + This file is part of the GNU C Library. >> > + >> > + The GNU C Library is free software; you can redistribute it and/or >> > + modify it under the terms of the GNU Lesser General Public >> > + License as published by the Free Software Foundation; either >> > + version 2.1 of the License, or (at your option) any later version. >> > + >> > + The GNU C Library is distributed in the hope that it will be useful, >> > + but WITHOUT ANY WARRANTY; without even the implied warranty of >> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU >> > + Lesser General Public License for more details. >> > + >> > + You should have received a copy of the GNU Lesser General Public >> > + License along with the GNU C Library; if not, see >> > + . */ >> > + >> > +#include >> > + >> > + >> > +/* void* [r3] memmove (void *dest [r3], const void *src [r4], size_t len [r5]) >> > + >> > + This optimization checks if 'src' and 'dst' overlap. If they do not >> >> Two spaces after '.' >> > Fixed. > >> > + or 'src' is ahead of 'dest' then it copies forward. >> > + Otherwise, an optimized backward copy is used. */ >> > + >> > +#ifndef MEMMOVE >> > +# define MEMMOVE memmove >> > +#endif >> > + .machine power9 >> > +ENTRY_TOCLESS (MEMMOVE, 5) >> > + CALL_MCOUNT 3 >> > + >> > + .p2align 5 >> > + /* Check if there is overlap, if so it will branch to backward copy. */ >> > + subf r9,r4,r3 >> > + cmpld cr7,r9,r5 >> > + blt cr7,L(memmove_bwd) >> > + >> > + /* Fast path for length shorter than 16 bytes. */ >> > + sldi r7,r5,56 >> > + lxvl 32+v2,r4,r7 >> > + stxvl 32+v2,r3,r7 >> > + subic. r8,r5,16 >> > + blelr >> >> Ok. >> >> > + >> > + cmpldi cr6,r5,256 >> > + bge cr6,L(ge_256) >> > + /* Account for the first 16-byte copy. For shorter lengths the alignment >> > + either slows down or is irrelevant. I'm making use of this comparison >> > + to skip the alignment. */ >> > + addi r4,r4,16 >> > + addi r11,r3,16 /* use r11 to keep dest address on r3. */ >> > + subi r5,r5,16 >> > + b L(loop_head) >> >> Two spaces after '.' (x2) >> > Fixed. > >> Also, I'd suggest slightly rephrasing the big comment and splitting it >> in two: >> >> /* For shorter lengths aligning the address to 16B either >> decreases performance or is irrelevant. So make use of this >> comparison to skip the alignment code in such cases. */ >> cmpldi cr6,r5,256 >> bge cr6,L(ge_256) >> >> /* Account for the first 16-byte copy. */ >> addi r4,r4,16 >> addi r11,r3,16 /* use r11 to keep dest address on r3. */ >> subi r5,r5,16 >> b L(loop_head) >> > Fixed. > >> > + >> > + .p2align 5 >> > +L(ge_256): >> > + /* Account for the first copy <= 16 bytes. 
This is necessary for >> > + memmove because at this point the src address can be in front of the > t> > + dest address. */ >> > + clrldi r9,r5,56 >> > + li r8,16 >> > + cmpldi r9,16 >> > + iselgt r9,r8,r9 >> > + add r4,r4,r9 >> > + add r11,r3,r9 /* use r11 to keep dest address on r3. */ >> > + sub r5,r5,r9 >> > + >> > + /* Align dest to 16 bytes. */ >> > + neg r7,r3 >> > + clrldi. r9,r7,60 >> > + beq L(loop_head) >> > + >> > + .p2align 5 >> > + sldi r6,r9,56 >> > + lxvl 32+v0,r4,r6 >> > + stxvl 32+v0,r11,r6 >> > + sub r5,r5,r9 >> > + add r4,r4,r9 >> > + add r11,r11,r9 >> >> Ok. Only align for to 16B for strings >256. >> >> > + >> > +L(loop_head): >> > + cmpldi r5,63 >> > + ble L(final_64) >> > + >> > + srdi. r7,r5,7 >> > + beq L(loop_tail) >> > + >> > + mtctr r7 >> > + >> > +/* Main loop that copies 128 bytes each iteration. */ >> > + .p2align 5 >> > +L(loop): >> > + addi r9,r4,64 >> > + addi r10,r11,64 >> > + >> > + lxv 32+v0,0(r4) >> > + lxv 32+v1,16(r4) >> > + lxv 32+v2,32(r4) >> > + lxv 32+v3,48(r4) >> > + >> > + stxv 32+v0,0(r11) >> > + stxv 32+v1,16(r11) >> > + stxv 32+v2,32(r11) >> > + stxv 32+v3,48(r11) >> > + >> > + addi r4,r4,128 >> > + addi r11,r11,128 >> > + >> > + lxv 32+v4,0(r9) >> > + lxv 32+v5,16(r9) >> > + lxv 32+v6,32(r9) >> > + lxv 32+v7,48(r9) >> > + >> > + stxv 32+v4,0(r10) >> > + stxv 32+v5,16(r10) >> > + stxv 32+v6,32(r10) >> > + stxv 32+v7,48(r10) >> > + >> > + bdnz L(loop) >> > + clrldi. r5,r5,57 >> > + beqlr >> > + >> >> Ok. Copy 128 bytes per iteration using 2 pairs of pointers. >> >> > +/* Copy 64 bytes. */ >> > + .p2align 5 >> > +L(loop_tail): >> > + cmpldi cr5,r5,63 >> > + ble cr5,L(final_64) >> > + >> > + lxv 32+v0,0(r4) >> > + lxv 32+v1,16(r4) >> > + lxv 32+v2,32(r4) >> > + lxv 32+v3,48(r4) >> > + >> > + stxv 32+v0,0(r11) >> > + stxv 32+v1,16(r11) >> > + stxv 32+v2,32(r11) >> > + stxv 32+v3,48(r11) >> > + >> > + addi r4,r4,64 >> > + addi r11,r11,64 >> > + subi r5,r5,64 >> > + >> >> Ok. Copy an extra 64B block if needed. >> >> > +/* Copies the last 1-63 bytes. */ >> > + .p2align 5 >> > +L(final_64): >> > + /* r8 hold the number of bytes that will be copied with lxv/stxv. */ >> >> r8 hold -> r8 holds >> > Fixed. > >> > + clrrdi. r8,r5,4 >> > + beq L(tail1) >> > + >> > + cmpldi cr5,r5,32 >> > + lxv 32+v0,0(r4) >> > + blt cr5,L(tail2) >> > + >> > + cmpldi cr6,r5,48 >> > + lxv 32+v1,16(r4) >> > + blt cr6,L(tail3) >> > + >> > + lxv 32+v2,32(r4) >> > + >> > + .p2align 5 >> > +L(tail4): >> > + stxv 32+v2,32(r11) >> > +L(tail3): >> > + stxv 32+v1,16(r11) >> > +L(tail2): >> > + stxv 32+v0,0(r11) >> > + sub r5,r5,r8 >> > + add r4,r4,r8 >> > + add r11,r11,r8 >> > + .p2align 5 >> > +L(tail1): >> > + sldi r6,r5,56 >> > + lxvl v4,r4,r6 >> > + stxvl v4,r11,r6 >> > + blr >> > + >> >> Ok. >> >> > +/* If dest and src overlap, we should copy backwards. */ >> > +L(memmove_bwd): >> > + add r11,r3,r5 >> > + add r4,r4,r5 >> > + >> > + /* Optimization for length smaller than 16 bytes. */ >> > + cmpldi cr5,r5,15 >> > + ble cr5,L(tail1_bwd) >> > + >> > + /* For shorter lengths the alignment either slows down or is irrelevant. >> > + The forward copy uses a already need 256 comparison for that. Here >> > + it's using 128 as it will reduce code and improve readability. */ >> >> What about something like this instead: >> >> /* For shorter lengths the alignment either decreases >> performance or is irrelevant. The forward copy applies the >> alignment only for len >= 256. Here it's using 128 as it >> will reduce code and improve readability. 
*/ >> >> But, I'm curious about why you chose a different threshold than the >> forward case. I agree it reduces code and improves readability, but then >> shouldn't the same be done for the forward case? If not, because of >> performance reasons, why the same 256 threshold was not used here? >> >> The two codepaths are fairly similar (basically replacing sub by add and >> vice-versa and using some different offsets), so I'd expect an >> optimization like that to apply to both cases. >> > > To do a first copy here would need a code like the bellow. > > /* Calculate how many bytes will be copied to adjust r4 and r3. Saturate > it to 0 if subi results in a negative number. */ > subi. r10,r5,16 > isellt r10,0,r10 > > /* Ajust the pointers. */ > sub r4,r4,r10 > sub r11,r11,r10 > sldi r6,r10,56 > lxvl 32+v0,r4,r6 > stxvl 32+v0,r11,r6 > blelr /* Using the subi. result. */ > > /* Account for the size copied here. Different than the forward path > I need to calculate the size before the copy. If we get here I will > always have copied 16 bytes. */ > subi r5, r5, 16 > > After this I will have to decide if I will skip the alignment for some > size. I would decide here for 128 bytes anyway, for two reasons. First > empirically the alignment overhead will not decrease performance for > sizes bigger than 100 bytes. Second because the code already is made > for blocks of 128, so I will use the already code to copy the last 127. > > For sizes less then 16 bytes the code today is as shown bellow. I'm > doing the branch so we can see the whole code path. > > cmpldi cr5,r5,15 > ble cr5,L(tail1_bwd) > /* Branches to this.*/ > L(tail1_bwd): > sub r11,r11,r5 > sldi r6,r5,56 > lxvl v4,r4,r6 > stxvl v4,r11,r6 > blr > > IMHO both code will have similar performance although the second version > is smaller and simpler. With that said I don't think it's worth to add > that optimization here. > > Maybe add a comparison before adjusting the pointers would be a better > option. Something like: > > L(memmove_bwd): > cmpldi cr5,r5,16 > bgt cr5,L(bwd_gt_16) > sldi r6,r5,56 > lxvl v4,r4,r6 > stxvl v4,r11,r6 > blr > > L(bwd_gt_16): > /* Continue the code. */ > > Or maybe I could save a lot of code by doing this comparison for the > forward version as well. I would like to hear your thoughts about it. > I think I get it now. You used 256 in the fwd case because that's the length limit for lxvl/stvxl. I'm OK with that. No changes needed. But thanks for the detailed explanation anyways =) >> > + cmpldi cr7,r5,128 >> > + blt cr7,L(bwd_loop_tail) >> > + >> > + .p2align 5 >> > + clrldi. r9,r11,60 >> > + beq L(bwd_loop_head) >> > + sub r4,r4,r9 >> > + sub r11,r11,r9 >> > + lxv 32+v0,0(r4) >> > + sldi r6,r9,56 >> > + stxvl 32+v0,r11,r6 >> > + sub r5,r5,r9 >> > + >> >> This block would benefit from a one-line comment about what it does. >> > Fixed. > >> > +L(bwd_loop_head): >> > + srdi. r7,r5,7 >> > + beq L(bwd_loop_tail) >> > + >> > + mtctr r7 >> > + >> > +/* Main loop that copies 128 bytes every iteration. 
*/ >> > + .p2align 5 >> > +L(bwd_loop): >> > + addi r9,r4,-64 >> > + addi r10,r11,-64 >> > + >> > + lxv 32+v0,-16(r4) >> > + lxv 32+v1,-32(r4) >> > + lxv 32+v2,-48(r4) >> > + lxv 32+v3,-64(r4) >> > + >> > + stxv 32+v0,-16(r11) >> > + stxv 32+v1,-32(r11) >> > + stxv 32+v2,-48(r11) >> > + stxv 32+v3,-64(r11) >> > + >> > + addi r4,r4,-128 >> > + addi r11,r11,-128 >> > + >> > + lxv 32+v0,-16(r9) >> > + lxv 32+v1,-32(r9) >> > + lxv 32+v2,-48(r9) >> > + lxv 32+v3,-64(r9) >> > + >> > + stxv 32+v0,-16(r10) >> > + stxv 32+v1,-32(r10) >> > + stxv 32+v2,-48(r10) >> > + stxv 32+v3,-64(r10) >> > + >> > + bdnz L(bwd_loop) >> > + clrldi. r5,r5,57 >> > + beqlr >> > + >> >> Ok. >> >> > +/* Copy 64 bytes. */ >> > + .p2align 5 >> > +L(bwd_loop_tail): >> > + cmpldi cr5,r5,63 >> > + ble cr5,L(bwd_final_64) >> > + >> > + addi r4,r4,-64 >> > + addi r11,r11,-64 >> > + >> > + lxv 32+v0,0(r4) >> > + lxv 32+v1,16(r4) >> > + lxv 32+v2,32(r4) >> > + lxv 32+v3,48(r4) >> > + >> > + stxv 32+v0,0(r11) >> > + stxv 32+v1,16(r11) >> > + stxv 32+v2,32(r11) >> > + stxv 32+v3,48(r11) >> > + >> > + subi r5,r5,64 >> > + >> >> Ok. >> >> > +/* Copies the last 1-63 bytes. */ >> > + .p2align 5 >> > +L(bwd_final_64): >> > + /* r8 hold the number of bytes that will be copied with lxv/stxv. */ >> >> r8 hold -> r8 holds >> > Fixed. > >> > + clrrdi. r8,r5,4 >> > + beq L(tail1_bwd) >> > + >> > + cmpldi cr5,r5,32 >> > + lxv 32+v2,-16(r4) >> > + blt cr5,L(tail2_bwd) >> > + >> > + cmpldi cr6,r5,48 >> > + lxv 32+v1,-32(r4) >> > + blt cr6,L(tail3_bwd) >> > + >> > + lxv 32+v0,-48(r4) >> > + >> > + .p2align 5 >> > +L(tail4_bwd): >> > + stxv 32+v0,-48(r11) >> > +L(tail3_bwd): >> > + stxv 32+v1,-32(r11) >> > +L(tail2_bwd): >> > + stxv 32+v2,-16(r11) >> > + sub r4,r4,r5 >> > + sub r11,r11,r5 >> > + sub r5,r5,r8 >> > + sldi r6,r5,56 >> > + lxvl v4,r4,r6 >> > + stxvl v4,r11,r6 >> > + blr >> > + >> > +/* Copy last 16 bytes. */ >> > + .p2align 5 >> > +L(tail1_bwd): >> > + sub r4,r4,r5 >> > + sub r11,r11,r5 >> > + sldi r6,r5,56 >> > + lxvl v4,r4,r6 >> > + stxvl v4,r11,r6 >> > + blr >> >> The last two blocks are almost identical, I imagine you separated both >> to save one sub for the cases that go straight to tail1_bwd because you >> already know r8 will be 0 and thus the sub is not needed. So probably >> saves 1 cycle. >> > Thats it. It will actually save more cycles the sub instructions will > depend on the update of r5. > >> Ok. >> >> > + >> > + >> > +END (MEMMOVE) >> > + >> > +#ifdef DEFINE_STRLEN_HIDDEN_DEF >> > +weak_alias (__memmove, memmove) >> > +libc_hidden_builtin_def (memmove) >> > +#endif >> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile >> > index 8aa46a3702..16ad1ab439 100644 >> > --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile >> > +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile >> > @@ -24,7 +24,8 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \ >> > stpncpy-power8 stpncpy-power7 stpncpy-ppc64 \ >> > strcmp-power8 strcmp-power7 strcmp-ppc64 \ >> > strcat-power8 strcat-power7 strcat-ppc64 \ >> > - memmove-power7 memmove-ppc64 wordcopy-ppc64 bcopy-ppc64 \ >> > + memmove-power10 memmove-power7 memmove-ppc64 \ >> > + wordcopy-ppc64 bcopy-ppc64 \ >> > strncpy-power8 strstr-power7 strstr-ppc64 \ >> > strspn-power8 strspn-ppc64 strcspn-power8 strcspn-ppc64 \ >> > strlen-power8 strcasestr-power8 strcasestr-ppc64 \ >> >> I agree with the issue Raoni pointed. This should be in the list of >> LE-only ifuncs. >> > Fixed. 
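
(Another aside on the lxvl/stxvl tails quoted above, for readers less familiar with
ISA 3.0: those instructions take the byte count from the top byte of the length
register, which is why every tail does sldi rX,r5,56 first.  Roughly the same thing
can be written in C with GCC's length-controlled vector built-ins; this assumes
<altivec.h> and -mcpu=power9 or newer, and copy_tail is just an illustrative name:

#include <altivec.h>
#include <stddef.h>

void
copy_tail (unsigned char *dest, const unsigned char *src, size_t len)
{
  /* len is expected to be at most 15 here, as in the L(tail1) paths.
     vec_xl_len/vec_xst_len compile down to lxvl/stxvl, which read the
     byte count from bits 0:7 of the length register.  */
  vector unsigned char v = vec_xl_len ((unsigned char *) src, len);
  vec_xst_len (v, dest, len);
}
)
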
> >> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c >> > index 1a6993616f..d1c159f2f7 100644 >> > --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c >> > +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c >> > @@ -67,6 +67,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, >> > >> > /* Support sysdeps/powerpc/powerpc64/multiarch/memmove.c. */ >> > IFUNC_IMPL (i, name, memmove, >> > +#ifdef __LITTLE_ENDIAN__ >> > + IFUNC_IMPL_ADD (array, i, memmove, >> > + hwcap2 & (PPC_FEATURE2_ARCH_3_1 | >> > + PPC_FEATURE2_HAS_ISEL) >> > + && (hwcap & PPC_FEATURE_HAS_VSX), >> > + __memmove_power10) >> > +#endif >> > IFUNC_IMPL_ADD (array, i, memmove, hwcap & PPC_FEATURE_HAS_VSX, >> > __memmove_power7) >> > IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_ppc)) >> >> Ok. >> > Fixed. > >> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S b/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S >> > new file mode 100644 >> > index 0000000000..d6d8ea83ec >> > --- /dev/null >> > +++ b/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S >> > @@ -0,0 +1,24 @@ >> > +/* Optimized memmove implementation for POWER10. >> > + Copyright (C) 2021 Free Software Foundation, Inc. >> > + This file is part of the GNU C Library. >> > + >> > + The GNU C Library is free software; you can redistribute it and/or >> > + modify it under the terms of the GNU Lesser General Public >> > + License as published by the Free Software Foundation; either >> > + version 2.1 of the License, or (at your option) any later version. >> > + >> > + The GNU C Library is distributed in the hope that it will be useful, >> > + but WITHOUT ANY WARRANTY; without even the implied warranty of >> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU >> > + Lesser General Public License for more details. >> > + >> > + You should have received a copy of the GNU Lesser General Public >> > + License along with the GNU C Library; if not, see >> > + . */ >> > + >> > +#define MEMMOVE __memmove_power10 >> > + >> > +#undef libc_hidden_builtin_def >> > +#define libc_hidden_builtin_def(name) >> > + >> > +#include >> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove.c b/sysdeps/powerpc/powerpc64/multiarch/memmove.c >> > index 9bec61a321..4704636f5d 100644 >> > --- a/sysdeps/powerpc/powerpc64/multiarch/memmove.c >> > +++ b/sysdeps/powerpc/powerpc64/multiarch/memmove.c >> > @@ -28,14 +28,22 @@ >> > # include "init-arch.h" >> > >> > extern __typeof (__redirect_memmove) __libc_memmove; >> > - >> > extern __typeof (__redirect_memmove) __memmove_ppc attribute_hidden; >> > extern __typeof (__redirect_memmove) __memmove_power7 attribute_hidden; >> > +#ifdef __LITTLE_ENDIAN__ >> > +extern __typeof (__redirect_memmove) __memmove_power10 attribute_hidden; >> > +#endif >> > >> > libc_ifunc (__libc_memmove, >> > - (hwcap & PPC_FEATURE_HAS_VSX) >> > - ? __memmove_power7 >> > - : __memmove_ppc); >> > +#ifdef __LITTLE_ENDIAN__ >> > + hwcap2 & (PPC_FEATURE2_ARCH_3_1 | >> > + PPC_FEATURE2_HAS_ISEL) >> > + && (hwcap & PPC_FEATURE_HAS_VSX) >> > + ? __memmove_power10 : >> > +#endif >> >> This should be indented 1 level down. >> > Fixed. > >> > + (hwcap & PPC_FEATURE_HAS_VSX) >> > + ? __memmove_power7 >> > + : __memmove_ppc); >> > >> > #undef memmove >> > strong_alias (__libc_memmove, memmove); >> >> >> -- >> Matheus Castanho -- Matheus Castanho
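
P.S. For completeness, the hwcap/hwcap2 condition used by the new ifunc entry can be
probed from user space with getauxval.  The small program below just mirrors the
expression from the patch (it is not part of it) and assumes a powerpc glibc recent
enough to define PPC_FEATURE2_ARCH_3_1 in <bits/hwcap.h>:

#include <stdio.h>
#include <sys/auxv.h>

int
main (void)
{
  unsigned long hwcap = getauxval (AT_HWCAP);
  unsigned long hwcap2 = getauxval (AT_HWCAP2);

  /* Same condition as in memmove.c / ifunc-impl-list.c from the patch.  */
  if ((hwcap2 & (PPC_FEATURE2_ARCH_3_1 | PPC_FEATURE2_HAS_ISEL))
      && (hwcap & PPC_FEATURE_HAS_VSX))
    puts ("__memmove_power10 would be selected");
  else if (hwcap & PPC_FEATURE_HAS_VSX)
    puts ("__memmove_power7 would be selected");
  else
    puts ("__memmove_ppc would be selected");
  return 0;
}
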