From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 32355 invoked by alias); 3 Apr 2013 09:19:12 -0000
Mailing-List: contact libc-ports-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Post:
List-Help: ,
Sender: libc-ports-owner@sourceware.org
Received: (qmail 32345 invoked by uid 89); 3 Apr 2013 09:19:12 -0000
X-Spam-SWARE-Status: No, score=-0.8 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,SPF_NEUTRAL,TW_CP autolearn=no version=3.3.1
Received: from popelka.ms.mff.cuni.cz (HELO popelka.ms.mff.cuni.cz) (195.113.20.131) by sourceware.org (qpsmtpd/0.84/v0.84-167-ge50287c) with ESMTP; Wed, 03 Apr 2013 09:19:08 +0000
Received: from domone.kolej.mff.cuni.cz (popelka.ms.mff.cuni.cz [195.113.20.131]) by popelka.ms.mff.cuni.cz (Postfix) with ESMTPS id 6FE62433EB; Wed, 3 Apr 2013 11:19:01 +0200 (CEST)
Received: by domone.kolej.mff.cuni.cz (Postfix, from userid 1000) id 0DB786046C; Wed, 3 Apr 2013 11:18:56 +0200 (CEST)
Date: Wed, 03 Apr 2013 09:19:00 -0000
From: =?utf-8?B?T25kxZllaiBCw61sa2E=?=
To: Will Newton
Cc: "Shih-Yuan Lee (FourDollars)" , patches@eglibc.org, libc-ports@sourceware.org, rex.tsai@canonical.com, jesse.sung@canonical.com, yc.cheng@canonical.com, Shih-Yuan Lee
Subject: Re: [PATCH] ARM: NEON detected memcpy.
Message-ID: <20130403091855.GA3467@domone.kolej.mff.cuni.cz>
References:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="X1bOJ3K7DJ5YkBrT"
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.20 (2009-06-14)
X-Virus-Found: No
X-SW-Source: 2013-04/txt/msg00005.txt.bz2

--X1bOJ3K7DJ5YkBrT
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-length: 1426

On Wed, Apr 03, 2013 at 09:15:46AM +0100, Will Newton wrote:
> On 3 April 2013 08:58, Shih-Yuan Lee (FourDollars) wrote:
> > Hi,
> >
> > I am working on the NEON detected memcpy.
> > This is based on what Siarhei Siamashka did at 2009 [1].
> >
> > The idea is to use HWCAP and check NEON bit.
> > If there is a NEON bit, using NEON optimized memcpy.
> > If not, using the original memcpy instead.
> >
> > If using NEON optimized memcpy, the performance of memcpy will be
> > raised up by about 50% [2].
> >
> > How do you think about this idea? Any comment is welcome.
>
> Hi,
>
> I am working on a similar project within Linaro, which is to add the
> NEON/VFP capable memcpy from cortex-strings[1] to glibc. However I am
> looking at enabling it at runtime via indirect functions which makes
> it slightly more complex than just importing the cortex strings code,
> so I don't have any patches to show you just yet.
>
> [1] https://launchpad.net/cortex-strings

Hi,

You need to optimize the header, because a typical call copies fewer
than 128 bytes. My measurements of how many 16-byte blocks are copied
per call are here:

http://kam.mff.cuni.cz/~ondra/benchmark_string/profile/result.html

If I had code to read the cycle count from a perf counter, I could
provide a tool to measure memcpy performance in an arbitrary binary.

On x64 I used overlapping loads and stores to minimize branches.
Try how the attached memcpy works on small inputs.

--X1bOJ3K7DJ5YkBrT
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="memcpy_generic.c"
Content-length: 2048

#include <stddef.h>
#include <stdint.h>

/* Align VALUE down by ALIGN bytes.  */
#define ALIGN_DOWN(value, align) \
       ALIGN_DOWN_M1(value, align - 1)
/* Align VALUE down by ALIGN_M1 + 1 bytes.
   Useful if you have precomputed ALIGN - 1.  */
#define ALIGN_DOWN_M1(value, align_m1) \
       (void *)((uintptr_t)(value) \
                & ~(uintptr_t)(align_m1))
/* Align VALUE up by ALIGN bytes.  */
#define ALIGN_UP(value, align) \
       ALIGN_UP_M1(value, align - 1)
/* Align VALUE up by ALIGN_M1 + 1 bytes.
   Useful if you have precomputed ALIGN - 1.
  */
#define ALIGN_UP_M1(value, align_m1) \
       (void *)(((uintptr_t)(value) + (uintptr_t)(align_m1)) \
                & ~(uintptr_t)(align_m1))

/* Copy 16 bytes from Y to X as two 64-bit words.  Expands to two
   statements, so use it only inside braces.  */
#define STOREU(x,y) STORE(x,y)
#define STORE(x,y) ((uint64_t*)(x))[0]=((uint64_t*)(y))[0]; ((uint64_t*)(x))[1]=((uint64_t*)(y))[1];
#define LOAD(x) x
#define LOADU(x) x

static char *memcpy_small (char *dest, char *src, size_t no, char *ret);

void *memcpy_new_u(char *dest, char *src, size_t n)
{
  char *from, *to;
  char *ret = dest;
  if (n < 16)
    {
      return memcpy_small(dest, src, n, dest);
    }
  else
    {
      /* Copy the first and the last 16 bytes.  They may overlap each
         other and the aligned blocks copied by the loop below.  */
      STOREU(dest, LOADU(src));
      STOREU(dest + n - 16, LOADU(src + n - 16));
      /* FROM and TO bound the 16-byte aligned blocks of the source.  */
      to = ALIGN_DOWN(src + n, 16);
      from = ALIGN_DOWN(src + 16, 16);
      /* Advance DEST by the same distance as the source.  */
      dest += from - src;
      while (from != to)
        {
          STOREU(dest, LOAD(from));
          dest += 16;
          from += 16;
        }
    }
  return ret;
}

/* Copy NO < 16 bytes with overlapping head and tail accesses.  */
static char *memcpy_small (char *dest, char *src, size_t no, char *ret)
{
  if (no & (8 + 16))
    {
      /* 8..15 bytes: two 8-byte copies.  */
      ((uint64_t *) dest)[0] = ((uint64_t *) src)[0];
      ((uint64_t *)(dest + no - 8))[0] = ((uint64_t *)(src + no - 8))[0];
      return ret;
    }
  if (no & 4)
    {
      /* 4..7 bytes: two 4-byte copies.  */
      ((uint32_t *) dest)[0] = ((uint32_t *) src)[0];
      ((uint32_t *)(dest + no - 4))[0] = ((uint32_t *)(src + no - 4))[0];
      return ret;
    }
  /* Nothing to copy; do not touch DEST.  */
  if (no == 0)
    return ret;
  dest[0] = src[0];
  if (no & 2)
    {
      /* 2..3 bytes: first byte plus the last two bytes.  */
      ((uint16_t *)(dest + no - 2))[0] = ((uint16_t *)(src + no - 2))[0];
      return ret;
    }
  return ret;
}

--X1bOJ3K7DJ5YkBrT--
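
The overlapping load/store idea from the message can be sketched in
portable C as follows. This is only an illustration under stated
assumptions, not the attached code: copy_upto16 is a hypothetical name,
and the fixed-size memcpy calls stand in for unaligned loads and stores
(compilers usually turn them into single move instructions), which
avoids the alignment and aliasing assumptions of the pointer casts.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n <= 16 bytes.  Each size class does one
   head access and one tail access that may overlap, so no per-byte
   loop and no branch on the exact length is needed.  */
static void *copy_upto16(void *dest, const void *src, size_t n)
{
  unsigned char *d = dest;
  const unsigned char *s = src;

  if (n >= 8) {
    /* 8..16 bytes: bytes [0,8) and [n-8,n) together cover [0,n).  */
    uint64_t head, tail;
    memcpy(&head, s, 8);
    memcpy(&tail, s + n - 8, 8);
    memcpy(d, &head, 8);
    memcpy(d + n - 8, &tail, 8);
  } else if (n >= 4) {
    /* 4..7 bytes: two 4-byte accesses.  */
    uint32_t head, tail;
    memcpy(&head, s, 4);
    memcpy(&tail, s + n - 4, 4);
    memcpy(d, &head, 4);
    memcpy(d + n - 4, &tail, 4);
  } else if (n >= 2) {
    /* 2..3 bytes: two 2-byte accesses.  */
    uint16_t head, tail;
    memcpy(&head, s, 2);
    memcpy(&tail, s + n - 2, 2);
    memcpy(d, &head, 2);
    memcpy(d + n - 2, &tail, 2);
  } else if (n == 1) {
    d[0] = s[0];
  }
  /* n == 0 copies nothing.  */
  return dest;
}
```

Both loads are done before either store, so the sketch reads the source
only before writing. Whether this beats a byte loop on a given ARM core
would have to be measured.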