From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <msc@linux.ibm.com>
Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com
 [148.163.158.5])
 by sourceware.org (Postfix) with ESMTPS id B4BEF3857036
 for <libc-alpha@sourceware.org>; Thu, 29 Apr 2021 19:18:44 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org B4BEF3857036
Received: from pps.filterd (m0098421.ppops.net [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id
 13TJ44n3074097
 for <libc-alpha@sourceware.org>; Thu, 29 Apr 2021 15:18:44 -0400
Received: from ppma05wdc.us.ibm.com (1b.90.2fa9.ip4.static.sl-reverse.com
 [169.47.144.27])
 by mx0a-001b2d01.pphosted.com with ESMTP id 38807qchmk-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT)
 for <libc-alpha@sourceware.org>; Thu, 29 Apr 2021 15:18:44 -0400
Received: from pps.filterd (ppma05wdc.us.ibm.com [127.0.0.1])
 by ppma05wdc.us.ibm.com (8.16.0.43/8.16.0.43) with SMTP id 13TJHP2f000735
 for <libc-alpha@sourceware.org>; Thu, 29 Apr 2021 19:18:43 GMT
Received: from b01cxnp22035.gho.pok.ibm.com (b01cxnp22035.gho.pok.ibm.com
 [9.57.198.25]) by ppma05wdc.us.ibm.com with ESMTP id 384ay9xh94-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT)
 for <libc-alpha@sourceware.org>; Thu, 29 Apr 2021 19:18:43 +0000
Received: from b01ledav004.gho.pok.ibm.com (b01ledav004.gho.pok.ibm.com
 [9.57.199.109])
 by b01cxnp22035.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 13TJIgUm38601208
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Thu, 29 Apr 2021 19:18:42 GMT
Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 1C13A112064;
 Thu, 29 Apr 2021 19:18:42 +0000 (GMT)
Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id DCC60112061;
 Thu, 29 Apr 2021 19:18:40 +0000 (GMT)
Received: from TP480.linux.ibm.com (unknown [9.160.146.242])
 by b01ledav004.gho.pok.ibm.com (Postfix) with ESMTPS;
 Thu, 29 Apr 2021 19:18:40 +0000 (GMT)
References: <20210428174642.1437818-1-lamm@linux.ibm.com>
 <87wnsm2rbz.fsf@linux.ibm.com>
 <161971180792.1464995.9920465636880275787@fedora.local>
User-agent: mu4e 1.4.10; emacs 27.2
From: Matheus Castanho <msc@linux.ibm.com>
To: "Lucas A. M. Magalhaes" <lamm@linux.ibm.com>
Cc: tulio@linux.ibm.com, libc-alpha@sourceware.org
Subject: Re: [PATCH] powerpc: Optimized memmove for POWER10
In-reply-to: <161971180792.1464995.9920465636880275787@fedora.local>
Date: Thu, 29 Apr 2021 16:18:38 -0300
Message-ID: <87pmycyk4x.fsf@linux.ibm.com>
Content-Type: text/plain
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: jYkCowcKEgMcz36ZjDBHsxgMUFYIM_S2
X-Proofpoint-ORIG-GUID: jYkCowcKEgMcz36ZjDBHsxgMUFYIM_S2
X-Proofpoint-UnRewURL: 0 URL was un-rewritten
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.761
 definitions=2021-04-29_10:2021-04-28,
 2021-04-29 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 priorityscore=1501
 suspectscore=0 adultscore=0 lowpriorityscore=0 mlxlogscore=999 spamscore=0
 clxscore=1015 impostorscore=0 phishscore=0 malwarescore=0 mlxscore=0
 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2104060000 definitions=main-2104290121
X-Spam-Status: No, score=-12.3 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, KAM_NUMSUBJECT, KAM_SHORT,
 RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Thu, 29 Apr 2021 19:18:47 -0000


Lucas A. M. Magalhaes <lamm@linux.ibm.com> writes:

> Quoting Matheus Castanho (2021-04-29 11:18:30)
>>
>> No new test failures after this patch.
>>
>> Some comments and questions below.
>>
>> Lucas A. M. Magalhaes via Libc-alpha <libc-alpha@sourceware.org> writes:
>>
>> > This patch was initially based on the __memmove_power7 with some ideas
>> > from strncpy implementation for Power 9.
>> >
>> > Improvements from __memmove_power7:
>> >
>> > 1. Use lxvl/stxvl for alignment code.
>> >
>> >    The code for Power 7 uses branches when the input is not naturally
>> >    aligned to the width of a vector. The new implementation uses
>> >    lxvl/stxvl instead which reduces pressure on GPRs. It also allows
>> >    the removal of branch instructions, implicitly removing branch stalls
>> >    and mispredictions.
>> >
>> > 2. Use of lxv/stxv and lxvl/stxvl pair is safe to use on Cache Inhibited
>> >    memory.
>> >
>> >    On Power 10 vector load and stores are safe to use on CI memory for
>> >    addresses unaligned to 16B. This code takes advantage of this to
>> >    do unaligned loads.
>> >
>> >    The unaligned loads don't have a significant performance impact by
>> >    themselves. However doing so decreases register pressure on GPRs
>> >    and interdependence stalls on load/store pairs. This also improved
>> >    readability as there are now less code paths for different alignments.
>> >    Finally this reduces the overall code size.
>> >
>> > 3. Improved performance.
>> >
>> >    This version runs on average about 30% better than memmove_power7
>> >    for lengths  larger than 8KB. For input lengths shorter than 8KB
>> >    the improvement is smaller, it has on average about 17% better
>> >    performance.
>> >
>> >    This version has a degradation of about 50% for input lengths
>> >    in the 0 to 31 bytes range when dest is unaligned src.
>>
>> What do you mean? When dest is unaligned or when src is unaligned? Or both?
>>
> Fixed.
>
>> > ---
>> >  .../powerpc/powerpc64/le/power10/memmove.S    | 313 ++++++++++++++++++
>> >  sysdeps/powerpc/powerpc64/multiarch/Makefile  |   3 +-
>> >  .../powerpc64/multiarch/ifunc-impl-list.c     |   7 +
>> >  .../powerpc64/multiarch/memmove-power10.S     |  24 ++
>> >  sysdeps/powerpc/powerpc64/multiarch/memmove.c |  16 +-
>> >  5 files changed, 358 insertions(+), 5 deletions(-)
>> >  create mode 100644 sysdeps/powerpc/powerpc64/le/power10/memmove.S
>> >  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S
>> >
>> > diff --git a/sysdeps/powerpc/powerpc64/le/power10/memmove.S b/sysdeps/powerpc/powerpc64/le/power10/memmove.S
>> > new file mode 100644
>> > index 0000000000..7cff5ef2ac
>> > --- /dev/null
>> > +++ b/sysdeps/powerpc/powerpc64/le/power10/memmove.S
>> > @@ -0,0 +1,313 @@
>> > +/* Optimized memmove implementation for POWER10.
>> > +   Copyright (C) 2021 Free Software Foundation, Inc.
>> > +   This file is part of the GNU C Library.
>> > +
>> > +   The GNU C Library is free software; you can redistribute it and/or
>> > +   modify it under the terms of the GNU Lesser General Public
>> > +   License as published by the Free Software Foundation; either
>> > +   version 2.1 of the License, or (at your option) any later version.
>> > +
>> > +   The GNU C Library is distributed in the hope that it will be useful,
>> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> > +   Lesser General Public License for more details.
>> > +
>> > +   You should have received a copy of the GNU Lesser General Public
>> > +   License along with the GNU C Library; if not, see
>> > +   <https://www.gnu.org/licenses/>.  */
>> > +
>> > +#include <sysdep.h>
>> > +
>> > +
>> > +/* void* [r3] memmove (void *dest [r3], const void *src [r4], size_t len [r5])
>> > +
>> > +   This optimization checks if 'src' and 'dst' overlap. If they do not
>>
>> Two spaces after '.'
>>
> Fixed.
>
>> > +   or 'src' is ahead of 'dest' then it copies forward.
>> > +   Otherwise, an optimized backward copy is used.  */
>> > +
>> > +#ifndef MEMMOVE
>> > +# define MEMMOVE memmove
>> > +#endif
>> > +     .machine power9
>> > +ENTRY_TOCLESS (MEMMOVE, 5)
>> > +     CALL_MCOUNT 3
>> > +
>> > +     .p2align 5
>> > +     /* Check if there is overlap, if so it will branch to backward copy.  */
>> > +     subf    r9,r4,r3
>> > +     cmpld   cr7,r9,r5
>> > +     blt     cr7,L(memmove_bwd)
>> > +
>> > +     /* Fast path for length shorter than 16 bytes.  */
>> > +     sldi    r7,r5,56
>> > +     lxvl    32+v2,r4,r7
>> > +     stxvl   32+v2,r3,r7
>> > +     subic.  r8,r5,16
>> > +     blelr
>>
>> Ok.
>>
>> > +
>> > +     cmpldi  cr6,r5,256
>> > +     bge     cr6,L(ge_256)
>> > +     /* Account for the first 16-byte copy. For shorter lengths the alignment
>> > +        either slows down or is irrelevant. I'm making use of this comparison
>> > +        to skip the alignment.  */
>> > +     addi    r4,r4,16
>> > +     addi    r11,r3,16       /* use r11 to keep dest address on r3.  */
>> > +     subi    r5,r5,16
>> > +     b       L(loop_head)
>>
>> Two spaces after '.'   (x2)
>>
> Fixed.
>
>> Also, I'd suggest slightly rephrasing the big comment and splitting it
>> in two:
>>
>>         /* For shorter lengths aligning the address to 16B either
>>            decreases performance or is irrelevant.  So make use of this
>>            comparison to skip the alignment code in such cases.  */
>>         cmpldi  cr6,r5,256
>>         bge     cr6,L(ge_256)
>>
>>         /* Account for the first 16-byte copy.  */
>>         addi    r4,r4,16
>>         addi    r11,r3,16       /* use r11 to keep dest address on r3.  */
>>         subi    r5,r5,16
>>         b       L(loop_head)
>>
> Fixed.
>
>> > +
>> > +     .p2align 5
>> > +L(ge_256):
>> > +     /* Account for the first copy <= 16 bytes. This is necessary for
>> > +        memmove because at this point the src address can be in front of the
> t> > +        dest address.  */
>> > +     clrldi  r9,r5,56
>> > +     li      r8,16
>> > +     cmpldi  r9,16
>> > +     iselgt  r9,r8,r9
>> > +     add     r4,r4,r9
>> > +     add     r11,r3,r9       /* use r11 to keep dest address on r3.  */
>> > +     sub     r5,r5,r9
>> > +
>> > +     /* Align dest to 16 bytes.  */
>> > +     neg     r7,r3
>> > +     clrldi. r9,r7,60
>> > +     beq     L(loop_head)
>> > +
>> > +     .p2align 5
>> > +     sldi    r6,r9,56
>> > +     lxvl    32+v0,r4,r6
>> > +     stxvl   32+v0,r11,r6
>> > +     sub     r5,r5,r9
>> > +     add     r4,r4,r9
>> > +     add     r11,r11,r9
>>
>> Ok. Only align for to 16B for strings >256.
>>
>> > +
>> > +L(loop_head):
>> > +     cmpldi  r5,63
>> > +     ble     L(final_64)
>> > +
>> > +     srdi.   r7,r5,7
>> > +     beq     L(loop_tail)
>> > +
>> > +     mtctr   r7
>> > +
>> > +/* Main loop that copies 128 bytes each iteration.  */
>> > +     .p2align 5
>> > +L(loop):
>> > +     addi    r9,r4,64
>> > +     addi    r10,r11,64
>> > +
>> > +     lxv     32+v0,0(r4)
>> > +     lxv     32+v1,16(r4)
>> > +     lxv     32+v2,32(r4)
>> > +     lxv     32+v3,48(r4)
>> > +
>> > +     stxv    32+v0,0(r11)
>> > +     stxv    32+v1,16(r11)
>> > +     stxv    32+v2,32(r11)
>> > +     stxv    32+v3,48(r11)
>> > +
>> > +     addi    r4,r4,128
>> > +     addi    r11,r11,128
>> > +
>> > +     lxv     32+v4,0(r9)
>> > +     lxv     32+v5,16(r9)
>> > +     lxv     32+v6,32(r9)
>> > +     lxv     32+v7,48(r9)
>> > +
>> > +     stxv    32+v4,0(r10)
>> > +     stxv    32+v5,16(r10)
>> > +     stxv    32+v6,32(r10)
>> > +     stxv    32+v7,48(r10)
>> > +
>> > +     bdnz    L(loop)
>> > +     clrldi. r5,r5,57
>> > +     beqlr
>> > +
>>
>> Ok. Copy 128 bytes per iteration using 2 pairs of pointers.
>>
>> > +/* Copy 64 bytes.  */
>> > +     .p2align 5
>> > +L(loop_tail):
>> > +     cmpldi  cr5,r5,63
>> > +     ble     cr5,L(final_64)
>> > +
>> > +     lxv     32+v0,0(r4)
>> > +     lxv     32+v1,16(r4)
>> > +     lxv     32+v2,32(r4)
>> > +     lxv     32+v3,48(r4)
>> > +
>> > +     stxv    32+v0,0(r11)
>> > +     stxv    32+v1,16(r11)
>> > +     stxv    32+v2,32(r11)
>> > +     stxv    32+v3,48(r11)
>> > +
>> > +     addi    r4,r4,64
>> > +     addi    r11,r11,64
>> > +     subi    r5,r5,64
>> > +
>>
>> Ok. Copy an extra 64B block if needed.
>>
>> > +/* Copies the last 1-63 bytes.  */
>> > +     .p2align 5
>> > +L(final_64):
>> > +     /* r8 hold the number of bytes that will be copied with lxv/stxv.  */
>>
>> r8 hold -> r8 holds
>>
> Fixed.
>
>> > +     clrrdi. r8,r5,4
>> > +     beq     L(tail1)
>> > +
>> > +     cmpldi  cr5,r5,32
>> > +     lxv     32+v0,0(r4)
>> > +     blt     cr5,L(tail2)
>> > +
>> > +     cmpldi  cr6,r5,48
>> > +     lxv     32+v1,16(r4)
>> > +     blt     cr6,L(tail3)
>> > +
>> > +     lxv     32+v2,32(r4)
>> > +
>> > +     .p2align 5
>> > +L(tail4):
>> > +     stxv    32+v2,32(r11)
>> > +L(tail3):
>> > +     stxv    32+v1,16(r11)
>> > +L(tail2):
>> > +     stxv    32+v0,0(r11)
>> > +     sub     r5,r5,r8
>> > +     add     r4,r4,r8
>> > +     add     r11,r11,r8
>> > +     .p2align 5
>> > +L(tail1):
>> > +     sldi    r6,r5,56
>> > +     lxvl    v4,r4,r6
>> > +     stxvl   v4,r11,r6
>> > +     blr
>> > +
>>
>> Ok.
>>
>> > +/* If dest and src overlap, we should copy backwards.  */
>> > +L(memmove_bwd):
>> > +     add     r11,r3,r5
>> > +     add     r4,r4,r5
>> > +
>> > +     /* Optimization for length smaller than 16 bytes.  */
>> > +     cmpldi  cr5,r5,15
>> > +     ble     cr5,L(tail1_bwd)
>> > +
>> > +     /* For shorter lengths the alignment either slows down or is irrelevant.
>> > +        The forward copy uses a already need 256 comparison for that. Here
>> > +        it's using 128 as it will reduce code and improve readability.  */
>>
>> What about something like this instead:
>>
>>         /* For shorter lengths the alignment either decreases
>>            performance or is irrelevant.  The forward copy applies the
>>            alignment only for len >= 256.  Here it's using 128 as it
>>            will reduce code and improve readability.  */
>>
>> But, I'm curious about why you chose a different threshold than the
>> forward case. I agree it reduces code and improves readability, but then
>> shouldn't the same be done for the forward case? If not, because of
>> performance reasons, why the same 256 threshold was not used here?
>>
>> The two codepaths are fairly similar (basically replacing sub by add and
>> vice-versa and using some different offsets), so I'd expect an
>> optimization like that to apply to both cases.
>>
>
> To do a first copy here would need a code like the bellow.
>
> 	/* Calculate how many bytes will be copied to adjust r4 and r3.  Saturate
> 	   it to 0 if subi results in a negative number.  */
> 	subi.	r10,r5,16
> 	isellt	r10,0,r10
>
> 	/* Ajust the pointers.  */
> 	sub	r4,r4,r10
> 	sub	r11,r11,r10
> 	sldi	r6,r10,56
> 	lxvl	32+v0,r4,r6
> 	stxvl	32+v0,r11,r6
> 	blelr	/* Using the subi. result.  */
>
> 	/* Account for the size copied here.  Different than the forward path
> 	   I need to calculate the size before the copy. If we get here I will
> 	   always have copied 16 bytes.  */
> 	subi	r5, r5, 16
>
> After this I will have to decide if I will skip the alignment for some
> size. I would decide here for 128 bytes anyway, for two reasons.  First
> empirically the alignment overhead will not decrease performance for
> sizes bigger than 100 bytes.  Second because the code already is made
> for blocks of 128, so I will use the already code to copy the last 127.
>
> For sizes less then 16 bytes the code today is as shown bellow.  I'm
> doing the branch so we can see the whole code path.
>
> 	cmpldi	cr5,r5,15
> 	ble	cr5,L(tail1_bwd)
> 	/* Branches to this.*/
> L(tail1_bwd):
> 	sub	r11,r11,r5
> 	sldi	r6,r5,56
> 	lxvl	v4,r4,r6
> 	stxvl	v4,r11,r6
> 	blr
>
> IMHO both code will have similar performance although the second version
> is smaller and simpler.  With that said I don't think it's worth to add
> that optimization here.
>
> Maybe add a comparison before adjusting the pointers would be a better
> option. Something like:
>
> L(memmove_bwd):
> 	cmpldi	cr5,r5,16
> 	bgt	cr5,L(bwd_gt_16)
> 	sldi	r6,r5,56
> 	lxvl	v4,r4,r6
> 	stxvl	v4,r11,r6
> 	blr
>
> L(bwd_gt_16):
> 	/* Continue the code. */
>
> Or maybe I could save a lot of code by doing this comparison for the
> forward version as well. I would like to hear your thoughts about it.
>

I think I get it now. You used 256 in the fwd case because that's the
length limit for lxvl/stvxl.  I'm OK with that. No changes needed.

But thanks for the detailed explanation anyways =)

>> > +     cmpldi  cr7,r5,128
>> > +     blt     cr7,L(bwd_loop_tail)
>> > +
>> > +     .p2align 5
>> > +     clrldi. r9,r11,60
>> > +     beq     L(bwd_loop_head)
>> > +     sub     r4,r4,r9
>> > +     sub     r11,r11,r9
>> > +     lxv     32+v0,0(r4)
>> > +     sldi    r6,r9,56
>> > +     stxvl   32+v0,r11,r6
>> > +     sub     r5,r5,r9
>> > +
>>
>> This block would benefit from a one-line comment about what it does.
>>
> Fixed.
>
>> > +L(bwd_loop_head):
>> > +     srdi.   r7,r5,7
>> > +     beq     L(bwd_loop_tail)
>> > +
>> > +     mtctr   r7
>> > +
>> > +/* Main loop that copies 128 bytes every iteration.  */
>> > +     .p2align 5
>> > +L(bwd_loop):
>> > +     addi    r9,r4,-64
>> > +     addi    r10,r11,-64
>> > +
>> > +     lxv     32+v0,-16(r4)
>> > +     lxv     32+v1,-32(r4)
>> > +     lxv     32+v2,-48(r4)
>> > +     lxv     32+v3,-64(r4)
>> > +
>> > +     stxv    32+v0,-16(r11)
>> > +     stxv    32+v1,-32(r11)
>> > +     stxv    32+v2,-48(r11)
>> > +     stxv    32+v3,-64(r11)
>> > +
>> > +     addi    r4,r4,-128
>> > +     addi    r11,r11,-128
>> > +
>> > +     lxv     32+v0,-16(r9)
>> > +     lxv     32+v1,-32(r9)
>> > +     lxv     32+v2,-48(r9)
>> > +     lxv     32+v3,-64(r9)
>> > +
>> > +     stxv    32+v0,-16(r10)
>> > +     stxv    32+v1,-32(r10)
>> > +     stxv    32+v2,-48(r10)
>> > +     stxv    32+v3,-64(r10)
>> > +
>> > +     bdnz    L(bwd_loop)
>> > +     clrldi. r5,r5,57
>> > +     beqlr
>> > +
>>
>> Ok.
>>
>> > +/* Copy 64 bytes.  */
>> > +     .p2align 5
>> > +L(bwd_loop_tail):
>> > +     cmpldi  cr5,r5,63
>> > +     ble     cr5,L(bwd_final_64)
>> > +
>> > +     addi    r4,r4,-64
>> > +     addi    r11,r11,-64
>> > +
>> > +     lxv     32+v0,0(r4)
>> > +     lxv     32+v1,16(r4)
>> > +     lxv     32+v2,32(r4)
>> > +     lxv     32+v3,48(r4)
>> > +
>> > +     stxv    32+v0,0(r11)
>> > +     stxv    32+v1,16(r11)
>> > +     stxv    32+v2,32(r11)
>> > +     stxv    32+v3,48(r11)
>> > +
>> > +     subi    r5,r5,64
>> > +
>>
>> Ok.
>>
>> > +/* Copies the last 1-63 bytes.  */
>> > +     .p2align 5
>> > +L(bwd_final_64):
>> > +     /* r8 hold the number of bytes that will be copied with lxv/stxv.  */
>>
>> r8 hold -> r8 holds
>>
> Fixed.
>
>> > +     clrrdi. r8,r5,4
>> > +     beq     L(tail1_bwd)
>> > +
>> > +     cmpldi  cr5,r5,32
>> > +     lxv     32+v2,-16(r4)
>> > +     blt     cr5,L(tail2_bwd)
>> > +
>> > +     cmpldi  cr6,r5,48
>> > +     lxv     32+v1,-32(r4)
>> > +     blt     cr6,L(tail3_bwd)
>> > +
>> > +     lxv     32+v0,-48(r4)
>> > +
>> > +     .p2align 5
>> > +L(tail4_bwd):
>> > +     stxv    32+v0,-48(r11)
>> > +L(tail3_bwd):
>> > +     stxv    32+v1,-32(r11)
>> > +L(tail2_bwd):
>> > +     stxv    32+v2,-16(r11)
>> > +     sub     r4,r4,r5
>> > +     sub     r11,r11,r5
>> > +     sub     r5,r5,r8
>> > +     sldi    r6,r5,56
>> > +     lxvl    v4,r4,r6
>> > +     stxvl   v4,r11,r6
>> > +     blr
>> > +
>> > +/* Copy last 16 bytes.  */
>> > +     .p2align 5
>> > +L(tail1_bwd):
>> > +     sub     r4,r4,r5
>> > +     sub     r11,r11,r5
>> > +     sldi    r6,r5,56
>> > +     lxvl    v4,r4,r6
>> > +     stxvl   v4,r11,r6
>> > +     blr
>>
>> The last two blocks are almost identical, I imagine you separated both
>> to save one sub for the cases that go straight to tail1_bwd because you
>> already know r8 will be 0 and thus the sub is not needed. So probably
>> saves 1 cycle.
>>
> Thats it. It will actually save more cycles the sub instructions will
> depend on the update of r5.
>
>> Ok.
>>
>> > +
>> > +
>> > +END (MEMMOVE)
>> > +
>> > +#ifdef DEFINE_STRLEN_HIDDEN_DEF
>> > +weak_alias (__memmove, memmove)
>> > +libc_hidden_builtin_def (memmove)
>> > +#endif
>> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
>> > index 8aa46a3702..16ad1ab439 100644
>> > --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
>> > +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
>> > @@ -24,7 +24,8 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
>> >                  stpncpy-power8 stpncpy-power7 stpncpy-ppc64 \
>> >                  strcmp-power8 strcmp-power7 strcmp-ppc64 \
>> >                  strcat-power8 strcat-power7 strcat-ppc64 \
>> > -                memmove-power7 memmove-ppc64 wordcopy-ppc64 bcopy-ppc64 \
>> > +                memmove-power10 memmove-power7 memmove-ppc64 \
>> > +                wordcopy-ppc64 bcopy-ppc64 \
>> >                  strncpy-power8 strstr-power7 strstr-ppc64 \
>> >                  strspn-power8 strspn-ppc64 strcspn-power8 strcspn-ppc64 \
>> >                  strlen-power8 strcasestr-power8 strcasestr-ppc64 \
>>
>> I agree with the issue Raoni pointed. This should be in the list of
>> LE-only ifuncs.
>>
> Fixed.
>
>> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
>> > index 1a6993616f..d1c159f2f7 100644
>> > --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
>> > +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
>> > @@ -67,6 +67,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>> >
>> >    /* Support sysdeps/powerpc/powerpc64/multiarch/memmove.c.  */
>> >    IFUNC_IMPL (i, name, memmove,
>> > +#ifdef __LITTLE_ENDIAN__
>> > +           IFUNC_IMPL_ADD (array, i, memmove,
>> > +                           hwcap2 & (PPC_FEATURE2_ARCH_3_1 |
>> > +                                     PPC_FEATURE2_HAS_ISEL)
>> > +                           && (hwcap & PPC_FEATURE_HAS_VSX),
>> > +                           __memmove_power10)
>> > +#endif
>> >             IFUNC_IMPL_ADD (array, i, memmove, hwcap & PPC_FEATURE_HAS_VSX,
>> >                             __memmove_power7)
>> >             IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_ppc))
>>
>> Ok.
>>
> Fixed.
>
>> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S b/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S
>> > new file mode 100644
>> > index 0000000000..d6d8ea83ec
>> > --- /dev/null
>> > +++ b/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S
>> > @@ -0,0 +1,24 @@
>> > +/* Optimized memmove implementation for POWER10.
>> > +   Copyright (C) 2021 Free Software Foundation, Inc.
>> > +   This file is part of the GNU C Library.
>> > +
>> > +   The GNU C Library is free software; you can redistribute it and/or
>> > +   modify it under the terms of the GNU Lesser General Public
>> > +   License as published by the Free Software Foundation; either
>> > +   version 2.1 of the License, or (at your option) any later version.
>> > +
>> > +   The GNU C Library is distributed in the hope that it will be useful,
>> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> > +   Lesser General Public License for more details.
>> > +
>> > +   You should have received a copy of the GNU Lesser General Public
>> > +   License along with the GNU C Library; if not, see
>> > +   <https://www.gnu.org/licenses/>.  */
>> > +
>> > +#define MEMMOVE __memmove_power10
>> > +
>> > +#undef libc_hidden_builtin_def
>> > +#define libc_hidden_builtin_def(name)
>> > +
>> > +#include <sysdeps/powerpc/powerpc64/le/power10/memmove.S>
>> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove.c b/sysdeps/powerpc/powerpc64/multiarch/memmove.c
>> > index 9bec61a321..4704636f5d 100644
>> > --- a/sysdeps/powerpc/powerpc64/multiarch/memmove.c
>> > +++ b/sysdeps/powerpc/powerpc64/multiarch/memmove.c
>> > @@ -28,14 +28,22 @@
>> >  # include "init-arch.h"
>> >
>> >  extern __typeof (__redirect_memmove) __libc_memmove;
>> > -
>> >  extern __typeof (__redirect_memmove) __memmove_ppc attribute_hidden;
>> >  extern __typeof (__redirect_memmove) __memmove_power7 attribute_hidden;
>> > +#ifdef __LITTLE_ENDIAN__
>> > +extern __typeof (__redirect_memmove) __memmove_power10 attribute_hidden;
>> > +#endif
>> >
>> >  libc_ifunc (__libc_memmove,
>> > -            (hwcap & PPC_FEATURE_HAS_VSX)
>> > -            ? __memmove_power7
>> > -            : __memmove_ppc);
>> > +#ifdef __LITTLE_ENDIAN__
>> > +                  hwcap2 & (PPC_FEATURE2_ARCH_3_1 |
>> > +                            PPC_FEATURE2_HAS_ISEL)
>> > +                  && (hwcap & PPC_FEATURE_HAS_VSX)
>> > +                  ? __memmove_power10 :
>> > +#endif
>>
>> This should be indented 1 level down.
>>
> Fixed.
>
>> > +                  (hwcap & PPC_FEATURE_HAS_VSX)
>> > +                  ? __memmove_power7
>> > +                  : __memmove_ppc);
>> >
>> >  #undef memmove
>> >  strong_alias (__libc_memmove, memmove);
>>
>>
>> --
>> Matheus Castanho


--
Matheus Castanho