Subject: Re: [PATCH] powerpc64le: Optimize memcpy for POWER10
From: Raphael M Zinsly
To: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>, libc-alpha@sourceware.org
Date: Wed, 28 Apr 2021 10:31:40 -0300
Message-ID: <14144f9f-3bbd-7360-ca71-4301a2e7c47f@linux.ibm.com>
In-Reply-To: <20210428023138.795316-1-tuliom@linux.ibm.com>
This patch LGTM, thanks!

On 27/04/2021 23:31, Tulio Magno Quites Machado Filho via Libc-alpha wrote:
> This implementation is based on __memcpy_power8_cached and integrates
> suggestions from Anton Blanchard.
> It benefits from loads and stores with length for short lengths and for
> tail code, simplifying the code.
>
> All unaligned memory accesses use instructions that do not generate
> alignment interrupts on POWER10, making it safe to use on
> caching-inhibited memory.
>
> The main loop has also been modified in order to increase instruction
> throughput by reducing the dependency on updates from previous
> iterations.
>
> On average, this implementation provides around 30% improvement when
> compared to __memcpy_power7 and 10% improvement in comparison to
> __memcpy_power8_cached.
> ---
>  sysdeps/powerpc/powerpc64/le/power10/memcpy.S | 198 ++++++++++++++++++
>  sysdeps/powerpc/powerpc64/multiarch/Makefile  |   3 +-
>  .../powerpc64/multiarch/ifunc-impl-list.c     |   6 +
>  .../powerpc64/multiarch/memcpy-power10.S      |  26 +++
>  sysdeps/powerpc/powerpc64/multiarch/memcpy.c  |   7 +
>  5 files changed, 239 insertions(+), 1 deletion(-)
>  create mode 100644 sysdeps/powerpc/powerpc64/le/power10/memcpy.S
>  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/memcpy-power10.S
>
> diff --git a/sysdeps/powerpc/powerpc64/le/power10/memcpy.S b/sysdeps/powerpc/powerpc64/le/power10/memcpy.S
> new file mode 100644
> index 0000000000..f84acabec5
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power10/memcpy.S
> @@ -0,0 +1,198 @@
> +/* Optimized memcpy implementation for POWER10.
> +   Copyright (C) 2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +
> +
> +#ifndef MEMCPY
> +# define MEMCPY memcpy
> +#endif
> +
> +/* __ptr_t [r3] memcpy (__ptr_t dst [r3], __ptr_t src [r4], size_t len [r5]);
> +   Returns 'dst'.  */
> +
> +	.machine power9
> +ENTRY_TOCLESS (MEMCPY, 5)
> +	CALL_MCOUNT 3
> +
> +	/* Copy up to 16 bytes.  */
> +	sldi	r6,r5,56	/* Prepare [l|st]xvl counter.  */
> +	lxvl	v10,r4,r6
> +	stxvl	v10,r3,r6
> +	subic.	r6,r5,16	/* Return if len <= 16.  */
> +	blelr
> +
> +	/* If len >= 256, assume nothing got copied before and copy
> +	   again.  This might cause issues with overlapped memory, but memcpy
> +	   is not expected to treat overlapped memory.  */
> +	cmpdi	r5,256
> +	bge	L(copy_ge_256)
> +	/* 16 < len < 256 and the first 16 bytes have already been copied.  */
> +	addi	r10,r3,16	/* Keep r3 intact as return value.  */
> +	addi	r4,r4,16
> +	subic	r5,r5,16
> +	b	L(copy_lt_256)	/* Avoid the main loop if len < 256.  */
> +
> +	.p2align 5
> +L(copy_ge_256):
> +	mr	r10,r3	/* Keep r3 intact as return value.  */
> +	/* Align dst to 16 bytes.  */
> +	andi.	r9,r10,0xf
> +	beq	L(dst_is_align_16)
> +	lxv	v10,0(r4)
> +	subfic	r12,r9,16
> +	subf	r5,r12,r5
> +	add	r4,r4,r12
> +	stxv	v10,0(r10)
> +	add	r10,r10,r12
> +
> +L(dst_is_align_16):
> +	srdi	r9,r5,7	/* Divide by 128.  */
> +	mtctr	r9
> +	addi	r6,r4,64
> +	addi	r7,r10,64
> +
> +
> +	/* Main loop, copy 128 bytes per iteration.
> +	   Use r6=src+64 and r7=dest+64 in order to reduce the dependency on
> +	   r4 and r10.  */
> +	.p2align 5
> +L(copy_128):
> +
> +	lxv	v10, 0(r4)
> +	lxv	v11, 16(r4)
> +	lxv	v12, 32(r4)
> +	lxv	v13, 48(r4)
> +
> +	addi	r4,r4,128
> +
> +	stxv	v10, 0(r10)
> +	stxv	v11, 16(r10)
> +	stxv	v12, 32(r10)
> +	stxv	v13, 48(r10)
> +
> +	addi	r10,r10,128
> +
> +	lxv	v10, 0(r6)
> +	lxv	v11, 16(r6)
> +	lxv	v12, 32(r6)
> +	lxv	v13, 48(r6)
> +
> +	addi	r6,r6,128
> +
> +	stxv	v10, 0(r7)
> +	stxv	v11, 16(r7)
> +	stxv	v12, 32(r7)
> +	stxv	v13, 48(r7)
> +
> +	addi	r7,r7,128
> +
> +	bdnz	L(copy_128)
> +
> +	clrldi.	r5,r5,64-7	/* Have we copied everything?  */
> +	beqlr
> +
> +	.p2align 5
> +L(copy_lt_256):
> +	cmpdi	r5,16
> +	ble	L(copy_le_16)
> +	srdi.	r9,r5,5	/* Divide by 32.  */
> +	beq	L(copy_lt_32)
> +	mtctr	r9
> +	/* Use r6=src+32, r7=dest+32, r8=src+64, r9=dest+64 in order to reduce
> +	   the dependency on r4 and r10.  */
> +	addi	r6,r4,32
> +	addi	r7,r10,32
> +	addi	r8,r4,64
> +	addi	r9,r10,64
> +
> +	.p2align 5
> +	/* Copy 32 bytes at a time, unaligned.
> +	   The loop is unrolled 3 times in order to reduce the dependency on
> +	   r4 and r10, copying up-to 96 bytes per iteration.  */
> +L(copy_32):
> +	lxv	v10, 0(r4)
> +	lxv	v11, 16(r4)
> +	stxv	v10, 0(r10)
> +	stxv	v11, 16(r10)
> +	bdz	L(end_copy_32a)
> +	addi	r4,r4,96
> +	addi	r10,r10,96
> +
> +	lxv	v10, 0(r6)
> +	lxv	v11, 16(r6)
> +	addi	r6,r6,96
> +	stxv	v10, 0(r7)
> +	stxv	v11, 16(r7)
> +	bdz	L(end_copy_32b)
> +	addi	r7,r7,96
> +
> +	lxv	v12, 0(r8)
> +	lxv	v13, 16(r8)
> +	addi	r8,r8,96
> +	stxv	v12, 0(r9)
> +	stxv	v13, 16(r9)
> +	addi	r9,r9,96
> +	bdnz	L(copy_32)
> +
> +	clrldi.	r5,r5,64-5	/* Have we copied everything?  */
> +	beqlr
> +	cmpdi	r5,16
> +	ble	L(copy_le_16)
> +	b	L(copy_lt_32)
> +
> +	.p2align 5
> +L(end_copy_32a):
> +	clrldi.	r5,r5,64-5	/* Have we copied everything?  */
> +	beqlr
> +	/* 32 bytes have been copied since the last update of r4 and r10.  */
> +	addi	r4,r4,32
> +	addi	r10,r10,32
> +	cmpdi	r5,16
> +	ble	L(copy_le_16)
> +	b	L(copy_lt_32)
> +
> +	.p2align 5
> +L(end_copy_32b):
> +	clrldi.	r5,r5,64-5	/* Have we copied everything?  */
> +	beqlr
> +	/* The last iteration of the loop copied 64 bytes.  Update r4 and r10
> +	   accordingly.  */
> +	addi	r4,r4,-32
> +	addi	r10,r10,-32
> +	cmpdi	r5,16
> +	ble	L(copy_le_16)
> +
> +	.p2align 5
> +L(copy_lt_32):
> +	lxv	v10, 0(r4)
> +	stxv	v10, 0(r10)
> +	addi	r4,r4,16
> +	addi	r10,r10,16
> +	subic	r5,r5,16
> +
> +	.p2align 5
> +L(copy_le_16):
> +	sldi	r6,r5,56
> +	lxvl	v10,r4,r6
> +	stxvl	v10,r10,r6
> +	blr
> +
> +
> +END_GEN_TB (MEMCPY,TB_TOCLESS)
> +libc_hidden_builtin_def (memcpy)
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> index 8aa46a3702..fdaa5ddb24 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
> +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> @@ -1,5 +1,6 @@
>  ifeq ($(subdir),string)
> -sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
> +sysdep_routines += memcpy-power10 \
> +		   memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
>  		   memcpy-cell memcpy-power4 memcpy-ppc64 \
>  		   memcmp-power8 memcmp-power7 memcmp-power4 memcmp-ppc64 \
>  		   memset-power7 memset-power6 memset-power4 \
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> index 1a6993616f..7bb3028676 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> @@ -51,6 +51,12 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>  #ifdef SHARED
>    /* Support sysdeps/powerpc/powerpc64/multiarch/memcpy.c.  */
>    IFUNC_IMPL (i, name, memcpy,
> +#ifdef __LITTLE_ENDIAN__
> +	      IFUNC_IMPL_ADD (array, i, memcpy,
> +			      hwcap2 & PPC_FEATURE2_ARCH_3_1
> +			      && hwcap & PPC_FEATURE_HAS_VSX,
> +			      __memcpy_power10)
> +#endif
>  	      IFUNC_IMPL_ADD (array, i, memcpy, hwcap2 & PPC_FEATURE2_ARCH_2_07,
>  			      __memcpy_power8_cached)
>  	      IFUNC_IMPL_ADD (array, i, memcpy, hwcap & PPC_FEATURE_HAS_VSX,
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memcpy-power10.S b/sysdeps/powerpc/powerpc64/multiarch/memcpy-power10.S
> new file mode 100644
> index 0000000000..70e0fc3ed6
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/multiarch/memcpy-power10.S
> @@ -0,0 +1,26 @@
> +/* Optimized memcpy implementation for POWER10.
> +   Copyright (C) 2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#if defined __LITTLE_ENDIAN__ && IS_IN (libc)
> +#define MEMCPY __memcpy_power10
> +
> +#undef libc_hidden_builtin_def
> +#define libc_hidden_builtin_def(name)
> +
> +#include <sysdeps/powerpc/powerpc64/le/power10/memcpy.S>
> +#endif
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memcpy.c b/sysdeps/powerpc/powerpc64/multiarch/memcpy.c
> index 5733192932..53ab32ef26 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/memcpy.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/memcpy.c
> @@ -36,8 +36,15 @@ extern __typeof (__redirect_memcpy) __memcpy_power6 attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_a2 attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_power7 attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_power8_cached attribute_hidden;
> +# if defined __LITTLE_ENDIAN__
> +extern __typeof (__redirect_memcpy) __memcpy_power10 attribute_hidden;
> +# endif
>
>  libc_ifunc (__libc_memcpy,
> +# if defined __LITTLE_ENDIAN__
> +	    (hwcap2 & PPC_FEATURE2_ARCH_3_1 && hwcap & PPC_FEATURE_HAS_VSX)
> +	    ? __memcpy_power10 :
> +# endif
>  	    ((hwcap2 & PPC_FEATURE2_ARCH_2_07) && use_cached_memopt)
>  	    ? __memcpy_power8_cached :
>  	    (hwcap & PPC_FEATURE_HAS_VSX)
> -- 

-- 
Raphael Moreira Zinsly
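
As a side note for readers who have not used the ISA 3.0 load/store-with-length
instructions the patch relies on: the short-length and tail handling can be
approximated in C with the vec_xl_len/vec_xst_len built-ins from <altivec.h>
(GCC or Clang, -mcpu=power9 or newer).  The sketch below only illustrates the
technique and is not the glibc code — the function name is made up, and the
plain 16-byte middle loop stands in for the patch's aligned, unrolled 128- and
32-byte loops:

#include <altivec.h>
#include <stddef.h>

/* Illustrative sketch of the lxvl/stxvl technique; not the glibc code.  */
void *
memcpy_sketch (void *dst, const void *src, size_t len)
{
  unsigned char *d = dst;
  unsigned char *s = (unsigned char *) src;

  if (len <= 16)
    {
      /* One length-controlled load/store pair (lxvl/stxvl) copies any
	 0..16 byte block, so no byte/halfword cleanup branches are needed.  */
      vec_xst_len (vec_xl_len (s, len), d, len);
      return dst;
    }

  /* Bulk copy in 16-byte vector chunks until at most 16 bytes remain.
     (The real assembly aligns dst first and unrolls much further.)  */
  size_t off = 0;
  while (len - off > 16)
    {
      vec_xst (vec_xl (0, s + off), 0, d + off);
      off += 16;
    }

  /* Length-controlled copy of the remaining 1..16-byte tail.  */
  vec_xst_len (vec_xl_len (s + off, len - off), d + off, len - off);
  return dst;
}

This also shows why the assembly needs its separate len >= 256 path: the
lxvl/stxvl byte count is taken from the top byte of the register prepared by
"sldi r6,r5,56", i.e. effectively len & 0xFF, with counts of 16 or more
copying exactly 16 bytes.  For 16 < len < 256 the initial pair therefore
copies exactly 16 bytes and execution continues from offset 16, but for
len >= 256 the count wraps, which is why the code "assume[s] nothing got
copied before" and starts over, as the comment ahead of L(copy_ge_256) says.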