On Wed, Apr 03, 2013 at 09:15:46AM +0100, Will Newton wrote:
> On 3 April 2013 08:58, Shih-Yuan Lee (FourDollars) wrote:
> > Hi,
> >
> > I am working on the NEON detected memcpy.
> > This is based on what Siarhei Siamashka did at 2009 [1].
> >
> > The idea is to use HWCAP and check NEON bit.
> > If there is a NEON bit, using NEON optimized memcpy.
> > If not, using the original memcpy instead.
> >
> > If using NEON optimized memcpy, the performance of memcpy will be
> > raised up by about 50% [2].
> >
> > How do you think about this idea? Any comment is welcome.
>
> Hi,
>
> I am working on a similar project within Linaro, which is to add the
> NEON/VFP capable memcpy from cortex-strings[1] to glibc. However I am
> looking at enabling it at runtime via indirect functions which makes
> it slightly more complex than just importing the cortex strings code,
> so I don't have any patches to show you just yet.
>
> [1] https://launchpad.net/cortex-strings

Hi,

You need to optimize the header, because a typical call copies less
than 128 bytes. My measurements of how many 16-byte blocks are used
are here:

http://kam.mff.cuni.cz/~ondra/benchmark_string/profile/result.html

If I had code to read the cycle count from a perf counter, I could
provide a tool to see memcpy performance in an arbitrary binary.

On x64 I used overlapping loads/stores to minimize branches. Try how
the attached memcpy works on small inputs.
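
For reference, the overlapping load/store idea can be sketched in
portable C roughly as below. This is only an illustration, not the
attached memcpy: copy_small is a hypothetical name, the 32-byte limit
is an arbitrary choice for the example, and it assumes the compiler
turns each fixed-size memcpy into a single unaligned load or store.
Each size class copies the first k and the last k bytes; the two
ranges overlap whenever the length is not exactly 2*k, so no
byte-by-byte tail loop is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, 0 <= n <= 32, without a tail loop.
   Both loads are done before either store, so the result does not
   depend on the two stored ranges overlapping each other. */
static void copy_small(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n <= 32);                 /* limit chosen for this sketch only */

    if (n >= 16) {                   /* 16..32 bytes: two 16-byte moves */
        unsigned char head[16], tail[16];
        memcpy(head, src, 16);
        memcpy(tail, src + n - 16, 16);
        memcpy(dst, head, 16);
        memcpy(dst + n - 16, tail, 16);
    } else if (n >= 8) {             /* 8..15 bytes: two 8-byte moves */
        uint64_t head, tail;
        memcpy(&head, src, 8);
        memcpy(&tail, src + n - 8, 8);
        memcpy(dst, &head, 8);
        memcpy(dst + n - 8, &tail, 8);
    } else if (n >= 4) {             /* 4..7 bytes: two 4-byte moves */
        uint32_t head, tail;
        memcpy(&head, src, 4);
        memcpy(&tail, src + n - 4, 4);
        memcpy(dst, &head, 4);
        memcpy(dst + n - 4, &tail, 4);
    } else if (n >= 2) {             /* 2..3 bytes: two 2-byte moves */
        uint16_t head, tail;
        memcpy(&head, src, 2);
        memcpy(&tail, src + n - 2, 2);
        memcpy(dst, &head, 2);
        memcpy(dst + n - 2, &tail, 2);
    } else if (n == 1) {
        dst[0] = src[0];
    }
}
```

With this layout a copy of up to 32 bytes takes at most four
compares, and the number of loads and stores is the same for every
length within a size class.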