From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-ports-return-3964-listarch-libc-ports=sources.redhat.com@sourceware.org>
Received: (qmail 32355 invoked by alias); 3 Apr 2013 09:19:12 -0000
Mailing-List: contact libc-ports-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-ports.sourceware.org>
List-Subscribe: <mailto:libc-ports-subscribe@sourceware.org>
List-Post: <mailto:libc-ports@sourceware.org>
List-Help: <mailto:libc-ports-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-ports-owner@sourceware.org
Received: (qmail 32345 invoked by uid 89); 3 Apr 2013 09:19:12 -0000
X-Spam-SWARE-Status: No, score=-0.8 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,SPF_NEUTRAL,TW_CP autolearn=no version=3.3.1
Received: from popelka.ms.mff.cuni.cz (HELO popelka.ms.mff.cuni.cz) (195.113.20.131)    by sourceware.org (qpsmtpd/0.84/v0.84-167-ge50287c) with ESMTP; Wed, 03 Apr 2013 09:19:08 +0000
Received: from domone.kolej.mff.cuni.cz (popelka.ms.mff.cuni.cz [195.113.20.131])	by popelka.ms.mff.cuni.cz (Postfix) with ESMTPS id 6FE62433EB;	Wed,  3 Apr 2013 11:19:01 +0200 (CEST)
Received: by domone.kolej.mff.cuni.cz (Postfix, from userid 1000)	id 0DB786046C; Wed,  3 Apr 2013 11:18:56 +0200 (CEST)
Date: Wed, 03 Apr 2013 09:19:00 -0000
From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= <neleai@seznam.cz>
To: Will Newton <will.newton@linaro.org>
Cc: "Shih-Yuan Lee (FourDollars)" <sylee@canonical.com>, patches@eglibc.org,	libc-ports@sourceware.org, rex.tsai@canonical.com,	jesse.sung@canonical.com, yc.cheng@canonical.com,	Shih-Yuan Lee <fourdollars@gmail.com>
Subject: Re: [PATCH] ARM: NEON detected memcpy.
Message-ID: <20130403091855.GA3467@domone.kolej.mff.cuni.cz>
References: <CAAT15mNnqeb6tuVdV6b4uJf-qFDH1acxevyW6f-gH+SkguENmg@mail.gmail.com> <CANu=DmjOZBWu2D=+0BZxyeGSRbH-heZ+e1ofS8bOWM7yG1hPsw@mail.gmail.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="X1bOJ3K7DJ5YkBrT"
Content-Disposition: inline
In-Reply-To: <CANu=DmjOZBWu2D=+0BZxyeGSRbH-heZ+e1ofS8bOWM7yG1hPsw@mail.gmail.com>
User-Agent: Mutt/1.5.20 (2009-06-14)
X-Virus-Found: No
X-SW-Source: 2013-04/txt/msg00005.txt.bz2


--X1bOJ3K7DJ5YkBrT
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-length: 1426

On Wed, Apr 03, 2013 at 09:15:46AM +0100, Will Newton wrote:
> On 3 April 2013 08:58, Shih-Yuan Lee (FourDollars) <sylee@canonical.com> wrote:
> > Hi,
> >
> > I am working on the NEON detected memcpy.
> > This is based on what Siarhei Siamashka did at 2009 [1].
> >
> > The idea is to use HWCAP and check NEON bit.
> > If there is a NEON bit, using NEON optimized memcpy.
> > If not, using the original memcpy instead.
> >
> > If using NEON optimized memcpy, the performance of memcpy will be
> > raised up by about 50% [2].
> >
> > How do you think about this idea? Any comment is welcome.
> 
> Hi,
> 
> I am working on a similar project within Linaro, which is to add the
> NEON/VFP capable memcpy from cortex-strings[1] to glibc. However I am
> looking at enabling it at runtime via indirect functions which makes
> it slightly more complex than just importing the cortex strings code,
> so I don't have any patches to show you just yet.
> 
> [1] https://launchpad.net/cortex-strings

Hi,

You need to optimize the header because a typical call copies less than 128 bytes.

My measurements of how many 16-byte blocks are used are here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/profile/result.html

If I had code to read the cycle count from a perf counter, I could provide
a tool that shows memcpy performance in an arbitrary binary.

On x64 I used overlapping loads and stores to minimize branches. See how the
attached memcpy performs on small inputs.
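
To illustrate the overlapping load/store idea in isolation, here is a minimal
sketch for sizes from 16 to 32 bytes. The names copy16 and copy_16_to_32 are
made up for this example and are not part of the attachment. It uses
fixed-size memcpy calls as a portable stand-in for unaligned 16-byte loads and
stores, so a real memcpy implementation would have to replace them with
direct loads and stores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: move one 16-byte block through a temporary.
   A fixed-size memcpy is typically compiled to plain unaligned
   loads and stores.  */
static void copy16 (char *dst, const char *src)
{
  char tmp[16];
  memcpy (tmp, src, 16);
  memcpy (dst, tmp, 16);
}

/* Copy N bytes, 16 <= N <= 32, with no branch on the exact size.
   The second block ends at the last byte and overlaps the first
   whenever N < 32.  Source and destination must not overlap.  */
static void copy_16_to_32 (char *dst, const char *src, size_t n)
{
  assert (n >= 16 && n <= 32);
  copy16 (dst, src);
  copy16 (dst + n - 16, src + n - 16);
}
```

For n = 20, the two blocks cover bytes 0-15 and 4-19. Bytes 4-15 are written
twice with the same values, which costs less than branching on the exact
length.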

--X1bOJ3K7DJ5YkBrT
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="memcpy_generic.c"
Content-length: 2048

#include <stdint.h>
#include <stdlib.h>

/* Align VALUE down by ALIGN bytes.  */
#define ALIGN_DOWN(value, align) \
       ALIGN_DOWN_M1(value, align - 1)
/* Align VALUE down by ALIGN_M1 + 1 bytes.
   Useful if you have precomputed ALIGN - 1.  */
#define ALIGN_DOWN_M1(value, align_m1) \
       (void *)((uintptr_t)(value) \
                & ~(uintptr_t)(align_m1))

/* Align VALUE up by ALIGN bytes.  */
#define ALIGN_UP(value, align) \
       ALIGN_UP_M1(value, align - 1)
/* Align VALUE up by ALIGN_M1 + 1 bytes.
   Useful if you have precomputed ALIGN - 1.  */
#define ALIGN_UP_M1(value, align_m1) \
       (void *)(((uintptr_t)(value) + (uintptr_t)(align_m1)) \
                & ~(uintptr_t)(align_m1))

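/* Copy 16 bytes from Y to X as two 64-bit words.  STOREU is the
   unaligned variant; in this generic version both are the same.  */
#define STOREU(x,y) STORE(x,y)
#define STORE(x,y) \
  do { \
    ((uint64_t*)(x))[0] = ((uint64_t*)(y))[0]; \
    ((uint64_t*)(x))[1] = ((uint64_t*)(y))[1]; \
  } while (0)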

#define LOAD(x) x
#define LOADU(x) x

static char *memcpy_small (char *dest, char *src, size_t no, char *ret);
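void *memcpy_new_u(char *dest, char *src, size_t n)
{
  char *from, *to, *end;
  if (n < 16)
    {
      return memcpy_small(dest, src, n, dest);
    }
  else
    {
      /* Copy the first and the last 16 bytes.  They may overlap each
         other and the blocks copied by the loop below.  */
      STOREU(dest, LOADU(src));
      STOREU(dest + n - 16, LOADU(src + n - 16));
      /* Advance to the first 16-byte aligned source address past SRC
         and move the destination by the same distance.  */
      from = ALIGN_DOWN(src + 16, 16);
      to   = dest + (from - src);
      /* Stop once at most 16 bytes remain; the tail store above
         already covered them.  */
      end  = dest + n - 16;
      while (to < end)
        {
          STOREU(to, LOAD(from));
          to   += 16;
          from += 16;
        }
    }
  return dest;
}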

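/* Copy NO < 16 bytes with at most two, possibly overlapping,
   fixed-width accesses and return RET.  */
static char *memcpy_small (char *dest, char *src, size_t no, char *ret)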
{
  if (no & (8 + 16))
    {
      ((uint64_t *) dest)[0] = ((uint64_t *) src)[0];
      ((uint64_t *)(dest + no - 8))[0] = ((uint64_t *)(src + no - 8))[0];
      return ret;
    }
  if (no & 4)
    {
      ((uint32_t *) dest)[0] = ((uint32_t *) src)[0];
      ((uint32_t *)(dest + no - 4))[0] = ((uint32_t *)(src + no - 4))[0];
      return ret;
    }
  /* A zero-length copy must not touch DEST or SRC.  */
  if (no == 0)
    return ret;
  dest[0] = src[0];
  if (no & 2)
    {
      ((uint16_t *)(dest + no - 2))[0] = ((uint16_t *)(src + no - 2))[0];
      return ret;
    }
  return ret;
}

--X1bOJ3K7DJ5YkBrT--