From mboxrd@z Thu Jan  1 00:00:00 1970
Content-Type: text/plain; charset="utf-8"
References: <20210428174642.1437818-1-lamm@linux.ibm.com>
Subject: Re: [PATCH] powerpc: Optimized memmove for POWER10
From: "Lucas A. M. Magalhaes"
To: Raphael M Zinsly, libc-alpha@sourceware.org
Date: Thu, 29 Apr 2021 11:46:22 -0300
Message-ID: <161970758268.1464995.414608869969618394@fedora.local>
User-Agent: alot/0.9.1
MIME-Version: 1.0
List-Id: Libc-alpha mailing list

Quoting Raphael M Zinsly (2021-04-29 11:22:01)
>
> Hi Lucas, my review follows:
>
> On 28/04/2021 14:46, Lucas A. M. Magalhaes via Libc-alpha wrote:
> > This patch was initially based on __memmove_power7, with some ideas
> > from the strncpy implementation for POWER9.
> >
> > Improvements from __memmove_power7:
> >
> > 1. Use lxvl/stxvl for the alignment code.
> >
> >    The POWER7 code uses branches when the input is not naturally
> >    aligned to the width of a vector.  The new implementation uses
> >    lxvl/stxvl instead, which reduces pressure on GPRs.  It also
> >    allows the removal of branch instructions, implicitly removing
> >    branch stalls and mispredictions.
> >
> > 2. The lxv/stxv and lxvl/stxvl pairs are safe to use on Cache
> >    Inhibited memory.
> >
> >    On POWER10, vector loads and stores are safe to use on CI memory
> >    at addresses unaligned to 16B.  This code takes advantage of that
> >    to do unaligned loads.
> >
> >    The unaligned loads don't have a significant performance impact
> >    by themselves.  However, they decrease register pressure on GPRs
> >    and interdependence stalls on load/store pairs.  This also
> >    improves readability, as there are now fewer code paths for
> >    different alignments.  Finally, it reduces the overall code size.
> >
> > 3. Improved performance.
> >
> >    This version runs on average about 30% faster than
> >    __memmove_power7 for lengths larger than 8KB.  For input lengths
> >    shorter than 8KB the improvement is smaller, on average about
> >    17%.
> >
> >    This version has a degradation of about 50% for input lengths
> >    in the 0 to 31 bytes range when dest is unaligned src.
>
> Should it be just "dest is unaligned."?  If not, what do you mean by
> "dest is unaligned src"?
>
Yes. Sorry about that.
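
To make item 1 of the commit message concrete: lxvl/stxvl take the
transfer length from the top byte of a GPR (hence the sldi reg,r5,56
before each pair) and copy up to 16 bytes from any alignment.  A rough
C model of what one lxvl/stxvl pair achieves (a sketch only;
copy_upto_16 is a made-up name, not part of the patch):

#include <string.h>

/* The hardware performs the partial copy as a single vector
   load/store with no alignment branches; the min() below models the
   16-byte cap of lxvl/stxvl.  */
static void
copy_upto_16 (unsigned char *dst, const unsigned char *src, size_t n)
{
  size_t m = n < 16 ? n : 16;	/* lxvl/stxvl transfer at most 16B.  */
  memcpy (dst, src, m);
}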
> > ---
> >  .../powerpc/powerpc64/le/power10/memmove.S    | 313 ++++++++++++++++++
> >  sysdeps/powerpc/powerpc64/multiarch/Makefile  |   3 +-
> >  .../powerpc64/multiarch/ifunc-impl-list.c     |   7 +
> >  .../powerpc64/multiarch/memmove-power10.S     |  24 ++
> >  sysdeps/powerpc/powerpc64/multiarch/memmove.c |  16 +-
> >  5 files changed, 358 insertions(+), 5 deletions(-)
> >  create mode 100644 sysdeps/powerpc/powerpc64/le/power10/memmove.S
> >  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S
> >
> > diff --git a/sysdeps/powerpc/powerpc64/le/power10/memmove.S b/sysdeps/powerpc/powerpc64/le/power10/memmove.S
> > new file mode 100644
> > index 0000000000..7cff5ef2ac
> > --- /dev/null
> > +++ b/sysdeps/powerpc/powerpc64/le/power10/memmove.S
> > @@ -0,0 +1,313 @@
> > +/* Optimized memmove implementation for POWER10.
> > +   Copyright (C) 2021 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +#include <sysdep.h>
> > +
> > +
> > +/* void* [r3] memmove (void *dest [r3], const void *src [r4], size_t len [r5])
> > +
> > +   This optimization checks if 'src' and 'dst' overlap.  If they do not
> > +   or 'src' is ahead of 'dest' then it copies forward.
>
> Looking at the check at the beginning, it seems to me that if there
> is an overlap it will go backwards even if src is ahead of dest.
>
The comparison is a cmpld, which compares them as unsigned integers.
It would behave that way if it were a cmpd.  Also notice this is the
right behavior here: if src is ahead of dest and we copy backwards,
the result will be corrupted.  Maybe I need a comment about this; you
are not the first one asking about it.
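
To be concrete, the whole forward/backward decision is that single
unsigned compare; in C it would read roughly like this (a sketch only;
use_backward_copy is a made-up name for illustration):

#include <stdint.h>
#include <stddef.h>

/* Models the subf/cmpld/blt sequence at the entry point.  */
static int
use_backward_copy (const void *dest, const void *src, size_t len)
{
  /* When src is ahead of dest, dest - src wraps around to a huge
     unsigned value, so the test is false and the forward path is
     taken -- the only safe direction in that case.  */
  return (uintptr_t) dest - (uintptr_t) src < len;
}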
> > +   Otherwise, an optimized backward copy is used.  */
> > +
> > +#ifndef MEMMOVE
> > +# define MEMMOVE memmove
> > +#endif
> > +	.machine power9
> > +ENTRY_TOCLESS (MEMMOVE, 5)
> > +	CALL_MCOUNT 3
> > +
> > +	.p2align 5
> > +	/* Check if there is overlap, if so it will branch to backward copy.  */
> > +	subf	r9,r4,r3
> > +	cmpld	cr7,r9,r5
> > +	blt	cr7,L(memmove_bwd)
> > +
> > +	/* Fast path for length shorter than 16 bytes.  */
> > +	sldi	r7,r5,56
> > +	lxvl	32+v2,r4,r7
> > +	stxvl	32+v2,r3,r7
> > +	subic.	r8,r5,16
> > +	blelr
> > +
> > +	cmpldi	cr6,r5,256
> > +	bge	cr6,L(ge_256)
> > +	/* Account for the first 16-byte copy.  For shorter lengths the
> > +	   alignment either slows down or is irrelevant.  I'm making use
> > +	   of this comparison to skip the alignment.  */
> > +	addi	r4,r4,16
> > +	addi	r11,r3,16 /* use r11 to keep dest address on r3.  */
> > +	subi	r5,r5,16
> > +	b	L(loop_head)
> > +
> > +	.p2align 5
> > +L(ge_256):
> > +	/* Account for the first copy <= 16 bytes.  This is necessary for
> > +	   memmove because at this point the src address can be in front
> > +	   of the dest address.  */
> > +	clrldi	r9,r5,56
> > +	li	r8,16
> > +	cmpldi	r9,16
> > +	iselgt	r9,r8,r9
> > +	add	r4,r4,r9
> > +	add	r11,r3,r9 /* use r11 to keep dest address on r3.  */
> > +	sub	r5,r5,r9
> > +
> > +	/* Align dest to 16 bytes.  */
> > +	neg	r7,r3
> > +	clrldi.	r9,r7,60
> > +	beq	L(loop_head)
> > +
> > +	.p2align 5
> > +	sldi	r6,r9,56
> > +	lxvl	32+v0,r4,r6
> > +	stxvl	32+v0,r11,r6
> > +	sub	r5,r5,r9
> > +	add	r4,r4,r9
> > +	add	r11,r11,r9
> > +
> > +L(loop_head):
> > +	cmpldi	r5,63
> > +	ble	L(final_64)
> > +
> > +	srdi.	r7,r5,7
> > +	beq	L(loop_tail)
> > +
> > +	mtctr	r7
> > +
> > +/* Main loop that copies 128 bytes each iteration.  */
> > +	.p2align 5
> > +L(loop):
> > +	addi	r9,r4,64
> > +	addi	r10,r11,64
> > +
> > +	lxv	32+v0,0(r4)
> > +	lxv	32+v1,16(r4)
> > +	lxv	32+v2,32(r4)
> > +	lxv	32+v3,48(r4)
> > +
> > +	stxv	32+v0,0(r11)
> > +	stxv	32+v1,16(r11)
> > +	stxv	32+v2,32(r11)
> > +	stxv	32+v3,48(r11)
> > +
> > +	addi	r4,r4,128
> > +	addi	r11,r11,128
> > +
> > +	lxv	32+v4,0(r9)
> > +	lxv	32+v5,16(r9)
> > +	lxv	32+v6,32(r9)
> > +	lxv	32+v7,48(r9)
> > +
> > +	stxv	32+v4,0(r10)
> > +	stxv	32+v5,16(r10)
> > +	stxv	32+v6,32(r10)
> > +	stxv	32+v7,48(r10)
> > +
> > +	bdnz	L(loop)
> > +	clrldi.	r5,r5,57
> > +	beqlr
> > +
> > +/* Copy 64 bytes.  */
> > +	.p2align 5
> > +L(loop_tail):
> > +	cmpldi	cr5,r5,63
> > +	ble	cr5,L(final_64)
> > +
> > +	lxv	32+v0,0(r4)
> > +	lxv	32+v1,16(r4)
> > +	lxv	32+v2,32(r4)
> > +	lxv	32+v3,48(r4)
> > +
> > +	stxv	32+v0,0(r11)
> > +	stxv	32+v1,16(r11)
> > +	stxv	32+v2,32(r11)
> > +	stxv	32+v3,48(r11)
> > +
> > +	addi	r4,r4,64
> > +	addi	r11,r11,64
> > +	subi	r5,r5,64
> > +
> > +/* Copies the last 1-63 bytes.  */
> > +	.p2align 5
> > +L(final_64):
> > +	/* r8 holds the number of bytes that will be copied with lxv/stxv.  */
> > +	clrrdi.	r8,r5,4
> > +	beq	L(tail1)
> > +
> > +	cmpldi	cr5,r5,32
> > +	lxv	32+v0,0(r4)
> > +	blt	cr5,L(tail2)
> > +
> > +	cmpldi	cr6,r5,48
> > +	lxv	32+v1,16(r4)
> > +	blt	cr6,L(tail3)
> > +
> > +	lxv	32+v2,32(r4)
> > +
> > +	.p2align 5
> > +L(tail4):
>
> This label is not used.
>
Ok.

> > +	stxv	32+v2,32(r11)
> > +L(tail3):
> > +	stxv	32+v1,16(r11)
> > +L(tail2):
> > +	stxv	32+v0,0(r11)
> > +	sub	r5,r5,r8
> > +	add	r4,r4,r8
> > +	add	r11,r11,r8
> > +	.p2align 5
> > +L(tail1):
> > +	sldi	r6,r5,56
> > +	lxvl	v4,r4,r6
> > +	stxvl	v4,r11,r6
> > +	blr
> > +
> > +/* If dest and src overlap, we should copy backwards.  */
> > +L(memmove_bwd):
> > +	add	r11,r3,r5
> > +	add	r4,r4,r5
> > +
> > +	/* Optimization for length smaller than 16 bytes.  */
> > +	cmpldi	cr5,r5,15
> > +	ble	cr5,L(tail1_bwd)
> > +
> > +	/* For shorter lengths the alignment either slows down or is
> > +	   irrelevant.  The forward copy already needs a comparison with
> > +	   256 for that.  Here 128 is used, as it reduces code and
> > +	   improves readability.  */
> > +	cmpldi	cr7,r5,128
> > +	blt	cr7,L(bwd_loop_tail)
> > +
> > +	.p2align 5
> > +	clrldi.	r9,r11,60
> > +	beq	L(bwd_loop_head)
> > +	sub	r4,r4,r9
> > +	sub	r11,r11,r9
> > +	lxv	32+v0,0(r4)
> > +	sldi	r6,r9,56
> > +	stxvl	32+v0,r11,r6
> > +	sub	r5,r5,r9
> > +
> > +L(bwd_loop_head):
> > +	srdi.	r7,r5,7
> > +	beq	L(bwd_loop_tail)
> > +
> > +	mtctr	r7
> > +
> > +/* Main loop that copies 128 bytes every iteration.  */
> > +	.p2align 5
> > +L(bwd_loop):
> > +	addi	r9,r4,-64
> > +	addi	r10,r11,-64
> > +
> > +	lxv	32+v0,-16(r4)
> > +	lxv	32+v1,-32(r4)
> > +	lxv	32+v2,-48(r4)
> > +	lxv	32+v3,-64(r4)
> > +
> > +	stxv	32+v0,-16(r11)
> > +	stxv	32+v1,-32(r11)
> > +	stxv	32+v2,-48(r11)
> > +	stxv	32+v3,-64(r11)
> > +
> > +	addi	r4,r4,-128
> > +	addi	r11,r11,-128
> > +
> > +	lxv	32+v0,-16(r9)
> > +	lxv	32+v1,-32(r9)
> > +	lxv	32+v2,-48(r9)
> > +	lxv	32+v3,-64(r9)
> > +
> > +	stxv	32+v0,-16(r10)
> > +	stxv	32+v1,-32(r10)
> > +	stxv	32+v2,-48(r10)
> > +	stxv	32+v3,-64(r10)
> > +
> > +	bdnz	L(bwd_loop)
> > +	clrldi.	r5,r5,57
> > +	beqlr
> > +
> > +/* Copy 64 bytes.  */
> > +	.p2align 5
> > +L(bwd_loop_tail):
> > +	cmpldi	cr5,r5,63
> > +	ble	cr5,L(bwd_final_64)
> > +
> > +	addi	r4,r4,-64
> > +	addi	r11,r11,-64
> > +
> > +	lxv	32+v0,0(r4)
> > +	lxv	32+v1,16(r4)
> > +	lxv	32+v2,32(r4)
> > +	lxv	32+v3,48(r4)
> > +
> > +	stxv	32+v0,0(r11)
> > +	stxv	32+v1,16(r11)
> > +	stxv	32+v2,32(r11)
> > +	stxv	32+v3,48(r11)
> > +
> > +	subi	r5,r5,64
> > +
> > +/* Copies the last 1-63 bytes.  */
> > +	.p2align 5
> > +L(bwd_final_64):
> > +	/* r8 holds the number of bytes that will be copied with lxv/stxv.  */
> > +	clrrdi.	r8,r5,4
> > +	beq	L(tail1_bwd)
> > +
> > +	cmpldi	cr5,r5,32
> > +	lxv	32+v2,-16(r4)
> > +	blt	cr5,L(tail2_bwd)
> > +
> > +	cmpldi	cr6,r5,48
> > +	lxv	32+v1,-32(r4)
> > +	blt	cr6,L(tail3_bwd)
> > +
> > +	lxv	32+v0,-48(r4)
> > +
> > +	.p2align 5
> > +L(tail4_bwd):
>
> This label is not used.
>
Ok.

> > +	stxv	32+v0,-48(r11)
> > +L(tail3_bwd):
> > +	stxv	32+v1,-32(r11)
> > +L(tail2_bwd):
> > +	stxv	32+v2,-16(r11)
> > +	sub	r4,r4,r5
> > +	sub	r11,r11,r5
> > +	sub	r5,r5,r8
> > +	sldi	r6,r5,56
> > +	lxvl	v4,r4,r6
> > +	stxvl	v4,r11,r6
> > +	blr
> > +
> > +/* Copy last 16 bytes.  */
> > +	.p2align 5
> > +L(tail1_bwd):
> > +	sub	r4,r4,r5
> > +	sub	r11,r11,r5
> > +	sldi	r6,r5,56
> > +	lxvl	v4,r4,r6
> > +	stxvl	v4,r11,r6
> > +	blr
> > +
> > +
> > +END (MEMMOVE)
> > +
> > +#ifdef DEFINE_STRLEN_HIDDEN_DEF
> > +weak_alias (__memmove, memmove)
> > +libc_hidden_builtin_def (memmove)
> > +#endif
>
> IMO you should copy bcopy here to keep taking advantage of skipping a
> function call, like is done for memmove_power7.
>
Ok.
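
For context, bcopy is just memmove with the first two arguments
swapped, so the assembly entry can swap the registers and fall
through into the memmove body instead of making a call.  In C terms
(a sketch only; __bcopy_sketch is a made-up name, not the actual
power7 code):

#include <stddef.h>

extern void *__memmove_power10 (void *dest, const void *src, size_t n);

/* bcopy (src, dest, n) == memmove (dest, src, n).  In assembly the
   argument swap is two register moves plus a branch into the memmove
   body, so the extra function call is avoided.  */
static void
__bcopy_sketch (const void *src, void *dest, size_t n)
{
  __memmove_power10 (dest, src, n);
}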

> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> > index 8aa46a3702..16ad1ab439 100644
> > --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
> > +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> > @@ -24,7 +24,8 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
> >  		   stpncpy-power8 stpncpy-power7 stpncpy-ppc64 \
> >  		   strcmp-power8 strcmp-power7 strcmp-ppc64 \
> >  		   strcat-power8 strcat-power7 strcat-ppc64 \
> > -		   memmove-power7 memmove-ppc64 wordcopy-ppc64 bcopy-ppc64 \
> > +		   memmove-power10 memmove-power7 memmove-ppc64 \
> > +		   wordcopy-ppc64 bcopy-ppc64 \
> >  		   strncpy-power8 strstr-power7 strstr-ppc64 \
> >  		   strspn-power8 strspn-ppc64 strcspn-power8 strcspn-ppc64 \
> >  		   strlen-power8 strcasestr-power8 strcasestr-ppc64 \
> >
> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> > index 1a6993616f..d1c159f2f7 100644
> > --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> > +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> > @@ -67,6 +67,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> >
> >    /* Support sysdeps/powerpc/powerpc64/multiarch/memmove.c.  */
> >    IFUNC_IMPL (i, name, memmove,
> > +#ifdef __LITTLE_ENDIAN__
> > +	      IFUNC_IMPL_ADD (array, i, memmove,
> > +			      hwcap2 & (PPC_FEATURE2_ARCH_3_1 |
> > +					PPC_FEATURE2_HAS_ISEL)
> > +			      && (hwcap & PPC_FEATURE_HAS_VSX),
> > +			      __memmove_power10)
> > +#endif
> >  	      IFUNC_IMPL_ADD (array, i, memmove, hwcap & PPC_FEATURE_HAS_VSX,
> >  			      __memmove_power7)
> >  	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_ppc))
> >
> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S b/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S
> > new file mode 100644
> > index 0000000000..d6d8ea83ec
> > --- /dev/null
> > +++ b/sysdeps/powerpc/powerpc64/multiarch/memmove-power10.S
> > @@ -0,0 +1,24 @@
> > +/* Optimized memmove implementation for POWER10.
> > +   Copyright (C) 2021 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +#define MEMMOVE __memmove_power10
> > +
> > +#undef libc_hidden_builtin_def
> > +#define libc_hidden_builtin_def(name)
> > +
> > +#include <sysdeps/powerpc/powerpc64/le/power10/memmove.S>
> >
> > diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove.c b/sysdeps/powerpc/powerpc64/multiarch/memmove.c
> > index 9bec61a321..4704636f5d 100644
> > --- a/sysdeps/powerpc/powerpc64/multiarch/memmove.c
> > +++ b/sysdeps/powerpc/powerpc64/multiarch/memmove.c
> > @@ -28,14 +28,22 @@
> >  # include "init-arch.h"
> >
> >  extern __typeof (__redirect_memmove) __libc_memmove;
> > -
> >  extern __typeof (__redirect_memmove) __memmove_ppc attribute_hidden;
> >  extern __typeof (__redirect_memmove) __memmove_power7 attribute_hidden;
> > +#ifdef __LITTLE_ENDIAN__
> > +extern __typeof (__redirect_memmove) __memmove_power10 attribute_hidden;
> > +#endif
> >
> >  libc_ifunc (__libc_memmove,
> > -	    (hwcap & PPC_FEATURE_HAS_VSX)
> > -	    ? __memmove_power7
> > -	    : __memmove_ppc);
> > +#ifdef __LITTLE_ENDIAN__
> > +	    hwcap2 & (PPC_FEATURE2_ARCH_3_1 |
> > +		      PPC_FEATURE2_HAS_ISEL)
> > +	    && (hwcap & PPC_FEATURE_HAS_VSX)
> > +	    ? __memmove_power10 :
> > +#endif
> > +	    (hwcap & PPC_FEATURE_HAS_VSX)
> > +	    ? __memmove_power7
> > +	    : __memmove_ppc);
> >
> >  #undef memmove
> >  strong_alias (__libc_memmove, memmove);
> >
>
> Best Regards,
> -- 
> Raphael Moreira Zinsly