Subject: Re: [PATCHv2] powerpc: POWER8 memcpy optimization for cached memory
From: Rajalakshmi Srinivasaraghavan
To: libc-alpha@sourceware.org
Date: Sun, 10 Dec 2017 07:11:00 -0000
In-Reply-To: <20171208194020.5005-1-tuliom@linux.vnet.ibm.com>
References: <87vaik8uxy.fsf@linux.vnet.ibm.com> <20171208194020.5005-1-tuliom@linux.vnet.ibm.com>

On 12/09/2017 01:10 AM, Tulio Magno Quites Machado Filho wrote:
> From: Adhemerval Zanella
>
> I made the changes I requested, updated copyright entries, added a
> manual entry and fixed a build issue on powerpc64.
>
> --- 8< ---
>
> On POWER8, unaligned memory accesses to cached memory has little impact
> on performance as opposed to its ancestors.
>
> It is disabled by default and will only be available when the tunable
> glibc.tune.cached_memopt is set to 1.
>
>                   __memcpy_power8_cached      __memcpy_power7
> ============================================================
> max-size=4096:       33325.70 ( 12.65%)        38153.00
> max-size=8192:       32878.20 ( 11.17%)        37012.30
> max-size=16384:      33782.20 ( 11.61%)        38219.20
> max-size=32768:      33296.20 ( 11.30%)        37538.30
> max-size=65536:      33765.60 ( 10.53%)        37738.40
>
> 2017-12-08  Adhemerval Zanella
>             Tulio Magno Quites Machado Filho
>
>     * manual/tunables.texi (Hardware Capability Tunables): Document
>     glibc.tune.cached_memopt.
>     * sysdeps/powerpc/cpu-features.c: New file.
>     * sysdeps/powerpc/cpu-features.h: New file.
>     * sysdeps/powerpc/dl-procinfo.c [!IS_IN(ldconfig)]: Add
>     _dl_powerpc_cpu_features.
>     * sysdeps/powerpc/dl-tunables.list: New file.
>     * sysdeps/powerpc/ldsodefs.h: Include cpu-features.h.
>     * sysdeps/powerpc/powerpc32/power4/multiarch/init-arch.h: .

Comment missing.

>     * sysdeps/powerpc/powerpc64/dl-machine.h (INIT_ARCH): Initialize
>     use_aligned_memopt.

Should this be moved to init-arch.h? (also use_cached_memopt)

>     * sysdeps/powerpc/powerpc64/multiarch/Makefile (sysdep_routines):
>     Add memcpy-power8-cached.
>     * sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Add
>     __memcpy_power8_cached.
>     * sysdeps/powerpc/powerpc64/multiarch/memcpy.c: Likewise.
>     * sysdeps/powerpc/powerpc64/multiarch/memcpy-power8-cached.S:
>     New file.
> ---
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memcpy-power8-cached.S b/sysdeps/powerpc/powerpc64/multiarch/memcpy-power8-cached.S
> new file mode 100644
> index 0000000..e5b6f25
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/multiarch/memcpy-power8-cached.S
> @@ -0,0 +1,179 @@
> +    stxvd2x v0,r0,r3
> +L(dst_is_align_16):
> +    cmpldi  cr7,r5,127
> +    ble     cr7,L(tail_copy)
> +    addi    r8,r5,-128
> +    mr      r9,r12
> +    rldicr  r8,r8,0,56
> +    li      r11,16
> +    srdi    r10,r8,7
> +    addi    r0,r8,128
> +    addi    r10,r10,1

Can we directly do

    rldicr  r0,r5,0,56
    srdi    r10,r5,7

instead of this sequence?

    addi    r8,r5,-128
    rldicr  r8,r8,0,56
    srdi    r10,r8,7
    addi    r0,r8,128
    addi    r10,r10,1

> +    li      r6,32
> +    mtctr   r10
> +    li      r7,48
> +
> +    /* Main loop, copy 128 bytes each time.  */

LGTM.

Reviewed-by: Rajalakshmi Srinivasaraghavan

--
Thanks
Rajalakshmi S
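
For reference, the two-instruction suggestion above can be sanity-checked with a small C model. This is only a sketch under stated assumptions: the registers are modelled as 64-bit unsigned integers, rldicr rX,rY,0,56 is taken to clear the low 7 bits, and the helper names (setup_patch, setup_short, sequences_agree) are hypothetical, not part of the patch. It compares only the values left in r0 and r10; whether r8 is still needed later cannot be seen from the quoted hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Values the loop setup leaves behind: r0 = bytes covered by the
   128-byte main loop, r10 = iteration count loaded into CTR.  */
struct loop_setup { uint64_t r0; uint64_t r10; };

/* Model of the five-instruction sequence in the patch.  */
static struct loop_setup setup_patch(uint64_t r5)
{
    uint64_t r8 = r5 - 128;             /* addi   r8,r5,-128 */
    r8 &= ~UINT64_C(127);               /* rldicr r8,r8,0,56 */
    uint64_t r10 = r8 >> 7;             /* srdi   r10,r8,7   */
    uint64_t r0 = r8 + 128;             /* addi   r0,r8,128  */
    r10 += 1;                           /* addi   r10,r10,1  */
    return (struct loop_setup){ r0, r10 };
}

/* Model of the two-instruction sequence proposed in the review.  */
static struct loop_setup setup_short(uint64_t r5)
{
    uint64_t r0 = r5 & ~UINT64_C(127);  /* rldicr r0,r5,0,56 */
    uint64_t r10 = r5 >> 7;             /* srdi   r10,r5,7   */
    return (struct loop_setup){ r0, r10 };
}

/* Returns 1 if both models agree on r0 and r10 for every length
   in [lo, hi], 0 otherwise.  Requires lo <= hi.  */
static int sequences_agree(uint64_t lo, uint64_t hi)
{
    assert(lo <= hi);
    for (uint64_t n = lo; ; n++) {
        struct loop_setup a = setup_patch(n);
        struct loop_setup b = setup_short(n);
        if (a.r0 != b.r0 || a.r10 != b.r10)
            return 0;
        if (n == hi)                    /* avoids wrap-around when hi is UINT64_MAX */
            break;
    }
    return 1;
}
```

In this model the two sequences agree for every length of 128 or more, which is the only case that reaches this code because of the cmpldi/ble guard on 127. For shorter lengths the iteration counts differ, so the shorter form relies on that guard staying in place.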