From mboxrd@z Thu Jan  1 00:00:00 1970
Content-Type: text/plain; charset="utf-8"
References: <20210428174642.1437818-1-lamm@linux.ibm.com>
Subject: Re: [PATCH] powerpc: Optimized memmove for POWER10
From: "Lucas A. M. Magalhaes"
To: Raphael M Zinsly, libc-alpha@sourceware.org
Date: Thu, 29 Apr 2021 11:46:22 -0300
Message-ID: <161970758268.1464995.414608869969618394@fedora.local>
User-Agent: alot/0.9.1
MIME-Version: 1.0
List-Id: Libc-alpha mailing list

Quoting Raphael M Zinsly (2021-04-29 11:22:01)
>
> Hi Lucas, my review follows:
>
> On 28/04/2021 14:46, Lucas A. M. Magalhaes via Libc-alpha wrote:
> > This patch was initially based on __memmove_power7, with some ideas
> > from the strncpy implementation for POWER9.
> >
> > Improvements from __memmove_power7:
> >
> > 1. Use lxvl/stxvl for the alignment code.
> >
> >    The POWER7 code uses branches when the input is not naturally
> >    aligned to the width of a vector.  The new implementation uses
> >    lxvl/stxvl instead, which reduces pressure on GPRs.  It also
> >    allows the removal of branch instructions, implicitly removing
> >    branch stalls and mispredictions.
> >
> > 2. The lxv/stxv and lxvl/stxvl pairs are safe to use on Cache
> >    Inhibited memory.
> >
> >    On POWER10, vector loads and stores are safe to use on CI memory
> >    at addresses unaligned to 16B.  This code takes advantage of that
> >    to do unaligned loads.
> >
> >    The unaligned loads don't have a significant performance impact
> >    by themselves.  However, they decrease register pressure on GPRs
> >    and interdependence stalls on load/store pairs.  This also
> >    improves readability, as there are now fewer code paths for
> >    different alignments.  Finally, it reduces the overall code size.
> >
> > 3. Improved performance.
> >
> >    This version runs on average about 30% faster than
> >    __memmove_power7 for lengths larger than 8KB.  For input lengths
> >    shorter than 8KB the improvement is smaller, on average about
> >    17%.
> >
> >    This version has a degradation of about 50% for input lengths
> >    in the 0 to 31 bytes range when dest is unaligned src.
>
> Should it be just "dest is unaligned."?  If not, what do you mean by
> "dest is unaligned src"?
>
Yes. Sorry about that.
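
To make item 1 of the commit message concrete: lxvl/stxvl take the
transfer length from the top byte of a GPR (hence the sldi reg,r5,56
before each pair) and copy up to 16 bytes from any alignment.  A rough
C model of what one lxvl/stxvl pair achieves (a sketch only;
copy_upto_16 is a made-up name, not part of the patch):

#include <string.h>

/* The hardware performs the partial copy as a single vector
   load/store with no alignment branches; the min() below models the
   16-byte cap of lxvl/stxvl.  */
static void
copy_upto_16 (unsigned char *dst, const unsigned char *src, size_t n)
{
  size_t m = n < 16 ? n : 16;	/* lxvl/stxvl transfer at most 16B.  */
  memcpy (dst, src, m);
}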
> > ---
> >  .../powerpc/powerpc64/le/power10/memmove.S    | 313 ++++++++++++++++++
> >  sysdeps/powerpc/powerpc64/multiarch/Makefile  |   3 +-
> >  .../powerpc64/multiarch/ifunc-impl-list.c     |   7 +
> >  .../powerpc64/multiarch/memmove-power10.S     |  24 ++
> >  sysdeps/powerpc/powerpc64/multiarch/memmove.c |  16 +-
> >  5 files changed, 358 insertions(+), 5 deletions(-)
> >  create mode 100644 sysdeps/powerpc/powerpc64/le/power10/memmove.S
> >  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S
> >
> > diff --git a/sysdeps/powerpc/powerpc64/le/power10/memmove.S b/sysdeps/powerpc/powerpc64/le/power10/memmove.S
> > new file mode 100644
> > index 0000000000..7cff5ef2ac
> > --- /dev/null
> > +++ b/sysdeps/powerpc/powerpc64/le/power10/memmove.S
> > @@ -0,0 +1,313 @@
> > +/* Optimized memmove implementation for POWER10.
> > +   Copyright (C) 2021 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +#include <sysdep.h>
> > +
> > +
> > +/* void* [r3] memmove (void *dest [r3], const void *src [r4], size_t len [r5])
> > +
> > +   This optimization checks if 'src' and 'dst' overlap.  If they do not
> > +   or 'src' is ahead of 'dest' then it copies forward.
>
> Looking at the check at the beginning, it seems to me that if there
> is an overlap it will go backwards even if src is ahead of dest.
>
The comparison is a cmpld, which compares them as unsigned integers.
It would behave that way if it were a cmpd.  Also notice this is the
right behavior here: if src is ahead of dest and we copy backwards,
the result will be corrupted.  Maybe I need a comment about this; you
are not the first one asking about it.
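
To be concrete, the whole forward/backward decision is that single
unsigned compare; in C it would read roughly like this (a sketch only;
use_backward_copy is a made-up name for illustration):

#include <stdint.h>
#include <stddef.h>

/* Models the subf/cmpld/blt sequence at the entry point.  */
static int
use_backward_copy (const void *dest, const void *src, size_t len)
{
  /* When src is ahead of dest, dest - src wraps around to a huge
     unsigned value, so the test is false and the forward path is
     taken -- the only safe direction in that case.  */
  return (uintptr_t) dest - (uintptr_t) src < len;
}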
> > +   Otherwise, an optimized backward copy is used.  */
> > +
> > +#ifndef MEMMOVE
> > +# define MEMMOVE memmove
> > +#endif
> > +	.machine power9
> > +ENTRY_TOCLESS (MEMMOVE, 5)
> > +	CALL_MCOUNT 3
> > +
> > +	.p2align 5
> > +	/* Check if there is overlap, if so it will branch to backward copy.  */
> > +	subf	r9,r4,r3
> > +	cmpld	cr7,r9,r5
> > +	blt	cr7,L(memmove_bwd)
> > +
> > +	/* Fast path for length shorter than 16 bytes.  */
> > +	sldi	r7,r5,56
> > +	lxvl	32+v2,r4,r7
> > +	stxvl	32+v2,r3,r7
> > +	subic.	r8,r5,16
> > +	blelr
> > +
> > +	cmpldi	cr6,r5,256
> > +	bge	cr6,L(ge_256)
> > +	/* Account for the first 16-byte copy.  For shorter lengths the
> > +	   alignment either slows down or is irrelevant.  I'm making use
> > +	   of this comparison to skip the alignment.  */
> > +	addi	r4,r4,16
> > +	addi	r11,r3,16 /* use r11 to keep dest address on r3.  */
> > +	subi	r5,r5,16
> > +	b	L(loop_head)
> > +
> > +	.p2align 5
> > +L(ge_256):
> > +	/* Account for the first copy <= 16 bytes.  This is necessary for
> > +	   memmove because at this point the src address can be in front
> > +	   of the dest address.  */
> > +	clrldi	r9,r5,56
> > +	li	r8,16
> > +	cmpldi	r9,16
> > +	iselgt	r9,r8,r9
> > +	add	r4,r4,r9
> > +	add	r11,r3,r9 /* use r11 to keep dest address on r3.  */
> > +	sub	r5,r5,r9
> > +
> > +	/* Align dest to 16 bytes.  */
> > +	neg	r7,r3
> > +	clrldi.	r9,r7,60
> > +	beq	L(loop_head)
> > +
> > +	.p2align 5
> > +	sldi	r6,r9,56
> > +	lxvl	32+v0,r4,r6
> > +	stxvl	32+v0,r11,r6
> > +	sub	r5,r5,r9
> > +	add	r4,r4,r9
> > +	add	r11,r11,r9
> > +
> > +L(loop_head):
> > +	cmpldi	r5,63
> > +	ble	L(final_64)
> > +
> > +	srdi.	r7,r5,7
> > +	beq	L(loop_tail)
> > +
> > +	mtctr	r7
> > +
> > +/* Main loop that copies 128 bytes each iteration.  */
> > +	.p2align 5
> > +L(loop):
> > +	addi	r9,r4,64
> > +	addi	r10,r11,64
> > +
> > +	lxv	32+v0,0(r4)
> > +	lxv	32+v1,16(r4)
> > +	lxv	32+v2,32(r4)
> > +	lxv	32+v3,48(r4)
> > +
> > +	stxv	32+v0,0(r11)
> > +	stxv	32+v1,16(r11)
> > +	stxv	32+v2,32(r11)
> > +	stxv	32+v3,48(r11)
> > +
> > +	addi	r4,r4,128
> > +	addi	r11,r11,128
> > +
> > +	lxv	32+v4,0(r9)
> > +	lxv	32+v5,16(r9)
> > +	lxv	32+v6,32(r9)
> > +	lxv	32+v7,48(r9)
> > +
> > +	stxv	32+v4,0(r10)
> > +	stxv	32+v5,16(r10)
> > +	stxv	32+v6,32(r10)
> > +	stxv	32+v7,48(r10)
> > +
> > +	bdnz	L(loop)
> > +	clrldi.	r5,r5,57
> > +	beqlr
> > +
> > +/* Copy 64 bytes.  */
> > +	.p2align 5
> > +L(loop_tail):
> > +	cmpldi	cr5,r5,63
> > +	ble	cr5,L(final_64)
> > +
> > +	lxv	32+v0,0(r4)
> > +	lxv	32+v1,16(r4)
> > +	lxv	32+v2,32(r4)
> > +	lxv	32+v3,48(r4)
> > +
> > +	stxv	32+v0,0(r11)
> > +	stxv	32+v1,16(r11)
> > +	stxv	32+v2,32(r11)
> > +	stxv	32+v3,48(r11)
> > +
> > +	addi	r4,r4,64
> > +	addi	r11,r11,64
> > +	subi	r5,r5,64
> > +
> > +/* Copies the last 1-63 bytes.  */
> > +	.p2align 5
> > +L(final_64):
> > +	/* r8 holds the number of bytes that will be copied with lxv/stxv.  */
> > +	clrrdi.	r8,r5,4
> > +	beq	L(tail1)
> > +
> > +	cmpldi	cr5,r5,32
> > +	lxv	32+v0,0(r4)
> > +	blt	cr5,L(tail2)
> > +
> > +	cmpldi	cr6,r5,48
> > +	lxv	32+v1,16(r4)
> > +	blt	cr6,L(tail3)
> > +
> > +	lxv	32+v2,32(r4)
> > +
> > +	.p2align 5
> > +L(tail4):
>
> This label is not used.
>
Ok.

> > +	stxv	32+v2,32(r11)
> > +L(tail3):
> > +	stxv	32+v1,16(r11)
> > +L(tail2):
> > +	stxv	32+v0,0(r11)
> > +	sub	r5,r5,r8
> > +	add	r4,r4,r8
> > +	add	r11,r11,r8
> > +	.p2align 5
> > +L(tail1):
> > +	sldi	r6,r5,56
> > +	lxvl	v4,r4,r6
> > +	stxvl	v4,r11,r6
> > +	blr
> > +
> > +/* If dest and src overlap, we should copy backwards.  */
> > +L(memmove_bwd):
> > +	add	r11,r3,r5
> > +	add	r4,r4,r5
> > +
> > +	/* Optimization for length smaller than 16 bytes.  */
> > +	cmpldi	cr5,r5,15
> > +	ble	cr5,L(tail1_bwd)
> > +
> > +	/* For shorter lengths the alignment either slows down or is
> > +	   irrelevant.  The forward copy already needs a comparison with
> > +	   256 for that.  Here 128 is used, as it reduces code and
> > +	   improves readability.  */
> > +	cmpldi	cr7,r5,128
> > +	blt	cr7,L(bwd_loop_tail)
> > +
> > +	.p2align 5
> > +	clrldi.	r9,r11,60
> > +	beq	L(bwd_loop_head)
> > +	sub	r4,r4,r9
> > +	sub	r11,r11,r9
> > +	lxv	32+v0,0(r4)
> > +	sldi	r6,r9,56
> > +	stxvl	32+v0,r11,r6
> > +	sub	r5,r5,r9
> > +
> > +L(bwd_loop_head):
> > +	srdi.	r7,r5,7
> > +	beq	L(bwd_loop_tail)
> > +
> > +	mtctr	r7
> > +
> > +/* Main loop that copies 128 bytes every iteration.  */
> > +	.p2align 5
> > +L(bwd_loop):
> > +	addi	r9,r4,-64
> > +	addi	r10,r11,-64
> > +
> > +	lxv	32+v0,-16(r4)
> > +	lxv	32+v1,-32(r4)
> > +	lxv	32+v2,-48(r4)
> > +	lxv	32+v3,-64(r4)
> > +
> > +	stxv	32+v0,-16(r11)
> > +	stxv	32+v1,-32(r11)
> > +	stxv	32+v2,-48(r11)
> > +	stxv	32+v3,-64(r11)
> > +
> > +	addi	r4,r4,-128
> > +	addi	r11,r11,-128
> > +
> > +	lxv	32+v0,-16(r9)
> > +	lxv	32+v1,-32(r9)
> > +	lxv	32+v2,-48(r9)
> > +	lxv	32+v3,-64(r9)
> > +
> > +	stxv	32+v0,-16(r10)
> > +	stxv	32+v1,-32(r10)
> > +	stxv	32+v2,-48(r10)
> > +	stxv	32+v3,-64(r10)
> > +
> > +	bdnz	L(bwd_loop)
> > +	clrldi.	r5,r5,57
> > +	beqlr
> > +
> > +/* Copy 64 bytes.  */
> > +	.p2align 5
> > +L(bwd_loop_tail):
> > +	cmpldi	cr5,r5,63
> > +	ble	cr5,L(bwd_final_64)
> > +
> > +	addi	r4,r4,-64
> > +	addi	r11,r11,-64
> > +
> > +	lxv	32+v0,0(r4)
> > +	lxv	32+v1,16(r4)
> > +	lxv	32+v2,32(r4)
> > +	lxv	32+v3,48(r4)
> > +
> > +	stxv	32+v0,0(r11)
> > +	stxv	32+v1,16(r11)
> > +	stxv	32+v2,32(r11)
> > +	stxv	32+v3,48(r11)
> > +
> > +	subi	r5,r5,64
> > +
> > +/* Copies the last 1-63 bytes.  */
> > +	.p2align 5
> > +L(bwd_final_64):
> > +	/* r8 holds the number of bytes that will be copied with lxv/stxv.  */
> > +	clrrdi.	r8,r5,4
> > +	beq	L(tail1_bwd)
> > +
> > +	cmpldi	cr5,r5,32
> > +	lxv	32+v2,-16(r4)
> > +	blt	cr5,L(tail2_bwd)
> > +
> > +	cmpldi	cr6,r5,48
> > +	lxv	32+v1,-32(r4)
> > +	blt	cr6,L(tail3_bwd)
> > +
> > +	lxv	32+v0,-48(r4)
> > +
> > +	.p2align 5
> > +L(tail4_bwd):
>
> This label is not used.
>
Ok.

> > +	stxv	32+v0,-48(r11)
> > +L(tail3_bwd):
> > +	stxv	32+v1,-32(r11)
> > +L(tail2_bwd):
> > +	stxv	32+v2,-16(r11)
> > +	sub	r4,r4,r5
> > +	sub	r11,r11,r5
> > +	sub	r5,r5,r8
> > +	sldi	r6,r5,56
> > +	lxvl	v4,r4,r6
> > +	stxvl	v4,r11,r6
> > +	blr
> > +
> > +/* Copy last 16 bytes.  */
> > +	.p2align 5
> > +L(tail1_bwd):
> > +	sub	r4,r4,r5
> > +	sub	r11,r11,r5
> > +	sldi	r6,r5,56
> > +	lxvl	v4,r4,r6
> > +	stxvl	v4,r11,r6
> > +	blr
> > +
> > +
> > +END (MEMMOVE)
> > +
> > +#ifdef DEFINE_STRLEN_HIDDEN_DEF
> > +weak_alias (__memmove, memmove)
> > +libc_hidden_builtin_def (memmove)
> > +#endif
>
> IMO you should copy bcopy here to keep taking advantage of skipping a
> function call, like is done for memmove_power7.
>
Ok.
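
For context, bcopy is just memmove with the first two arguments
swapped, so the assembly entry can swap the registers and fall
through into the memmove body instead of making a call.  In C terms
(a sketch only; __bcopy_sketch is a made-up name, not the actual
power7 code):

#include <stddef.h>

extern void *__memmove_power10 (void *dest, const void *src, size_t n);

/* bcopy (src, dest, n) == memmove (dest, src, n).  In assembly the
   argument swap is two register moves plus a branch into the memmove
   body, so the extra function call is avoided.  */
static void
__bcopy_sketch (const void *src, void *dest, size_t n)
{
  __memmove_power10 (dest, src, n);
}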

> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> > index 8aa46a3702..16ad1ab439 100644
> > --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
> > +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> > @@ -24,7 +24,8 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
> >  		   stpncpy-power8 stpncpy-power7 stpncpy-ppc64 \
> >  		   strcmp-power8 strcmp-power7 strcmp-ppc64 \
> >  		   strcat-power8 strcat-power7 strcat-ppc64 \
> > -		   memmove-power7 memmove-ppc64 wordcopy-ppc64 bcopy-ppc64 \
> > +		   memmove-power10 memmove-power7 memmove-ppc64 \
> > +		   wordcopy-ppc64 bcopy-ppc64 \
> >  		   strncpy-power8 strstr-power7 strstr-ppc64 \
> >  		   strspn-power8 strspn-ppc64 strcspn-power8 strcspn-ppc64 \
> >  		   strlen-power8 strcasestr-power8 strcasestr-ppc64 \
> >
> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> > index 1a6993616f..d1c159f2f7 100644
> > --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> > +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> > @@ -67,6 +67,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> >
> >    /* Support sysdeps/powerpc/powerpc64/multiarch/memmove.c.  */
> >    IFUNC_IMPL (i, name, memmove,
> > +#ifdef __LITTLE_ENDIAN__
> > +	      IFUNC_IMPL_ADD (array, i, memmove,
> > +			      hwcap2 & (PPC_FEATURE2_ARCH_3_1 |
> > +					PPC_FEATURE2_HAS_ISEL)
> > +			      && (hwcap & PPC_FEATURE_HAS_VSX),
> > +			      __memmove_power10)
> > +#endif
> >  	      IFUNC_IMPL_ADD (array, i, memmove, hwcap & PPC_FEATURE_HAS_VSX,
> >  			      __memmove_power7)
> >  	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_ppc))
> >
> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S b/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S
> > new file mode 100644
> > index 0000000000..d6d8ea83ec
> > --- /dev/null
> > +++ b/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S
> > @@ -0,0 +1,24 @@
> > +/* Optimized memmove implementation for POWER10.
> > +   Copyright (C) 2021 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +#define MEMMOVE __memmove_power10
> > +
> > +#undef libc_hidden_builtin_def
> > +#define libc_hidden_builtin_def(name)
> > +
> > +#include <sysdeps/powerpc/powerpc64/le/power10/memmove.S>
> >
> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove.c b/sysdeps/powerpc/powerpc64/multiarch/memmove.c
> > index 9bec61a321..4704636f5d 100644
> > --- a/sysdeps/powerpc/powerpc64/multiarch/memmove.c
> > +++ b/sysdeps/powerpc/powerpc64/multiarch/memmove.c
> > @@ -28,14 +28,22 @@
> >  # include "init-arch.h"
> >
> >  extern __typeof (__redirect_memmove) __libc_memmove;
> > -
> >  extern __typeof (__redirect_memmove) __memmove_ppc attribute_hidden;
> >  extern __typeof (__redirect_memmove) __memmove_power7 attribute_hidden;
> > +#ifdef __LITTLE_ENDIAN__
> > +extern __typeof (__redirect_memmove) __memmove_power10 attribute_hidden;
> > +#endif
> >
> >  libc_ifunc (__libc_memmove,
> > -	    (hwcap & PPC_FEATURE_HAS_VSX)
> > -	    ? __memmove_power7
> > -	    : __memmove_ppc);
> > +#ifdef __LITTLE_ENDIAN__
> > +	    hwcap2 & (PPC_FEATURE2_ARCH_3_1 |
> > +		      PPC_FEATURE2_HAS_ISEL)
> > +	    && (hwcap & PPC_FEATURE_HAS_VSX)
> > +	    ? __memmove_power10 :
> > +#endif
> > +	    (hwcap & PPC_FEATURE_HAS_VSX)
> > +	    ? __memmove_power7
> > +	    : __memmove_ppc);
> >
> >  #undef memmove
> >  strong_alias (__libc_memmove, memmove);
> >
>
> Best Regards,
> -- 
> Raphael Moreira Zinsly