From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-ports-return-3973-listarch-libc-ports=sources.redhat.com@sourceware.org>
Received: (qmail 2106 invoked by alias); 4 Apr 2013 04:15:44 -0000
Mailing-List: contact libc-ports-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-ports.sourceware.org>
List-Subscribe: <mailto:libc-ports-subscribe@sourceware.org>
List-Post: <mailto:libc-ports@sourceware.org>
List-Help: <mailto:libc-ports-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-ports-owner@sourceware.org
Received: (qmail 2096 invoked by uid 89); 4 Apr 2013 04:15:43 -0000
X-Spam-SWARE-Status: No, score=-3.8 required=5.0 tests=AWL,BAYES_00,KHOP_RCVD_UNTRUST,KHOP_THREADED,RCVD_IN_DNSWL_LOW,RP_MATCHES_RCVD,TW_CP autolearn=ham version=3.3.1
Received: from youngberry.canonical.com (HELO youngberry.canonical.com) (91.189.89.112)    by sourceware.org (qpsmtpd/0.84/v0.84-167-ge50287c) with ESMTP; Thu, 04 Apr 2013 04:15:40 +0000
Received: from mail-wg0-f51.google.com ([74.125.82.51])	by youngberry.canonical.com with esmtpsa (TLS1.0:RSA_ARCFOUR_SHA1:16)	(Exim 4.71)	(envelope-from <sylee@canonical.com>)	id 1UNbaM-00048x-2k	for libc-ports@sourceware.org; Thu, 04 Apr 2013 04:15:38 +0000
Received: by mail-wg0-f51.google.com with SMTP id b12so2323839wgh.30        for <libc-ports@sourceware.org>; Wed, 03 Apr 2013 21:15:37 -0700 (PDT)
X-Received: by 10.194.87.229 with SMTP id bb5mr6775439wjb.32.1365048937948; Wed, 03 Apr 2013 21:15:37 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.194.119.168 with HTTP; Wed, 3 Apr 2013 21:15:17 -0700 (PDT)
In-Reply-To: <20130403161949.GA6759@domone.kolej.mff.cuni.cz>
References: <CAAT15mNnqeb6tuVdV6b4uJf-qFDH1acxevyW6f-gH+SkguENmg@mail.gmail.com> <Pine.LNX.4.64.1304031505020.580@digraph.polyomino.org.uk> <CAAT15mMJSHiO5rZ6EAbss79f_t4Qiaryi-qjmw3TwGg4vrg2=A@mail.gmail.com> <20130403161949.GA6759@domone.kolej.mff.cuni.cz>
From: "Shih-Yuan Lee (FourDollars)" <sylee@canonical.com>
Date: Thu, 04 Apr 2013 04:15:00 -0000
Message-ID: <CAAT15mMZgtfcUr3rgz3BiY-v14-DW9u1LHP+5jp2rD3uxA+=sw@mail.gmail.com>
Subject: Re: [Patches] [PATCH] ARM: NEON detected memcpy.
To: Ondřej Bílka <neleai@seznam.cz>
Cc: "Joseph S. Myers" <joseph@codesourcery.com>, libc-ports@sourceware.org, 	Jesse Sung <jesse.sung@canonical.com>, patches@eglibc.org, 	YC Cheng <yc.cheng@canonical.com>, rex.tsai@canonical.com
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2013-04/txt/msg00014.txt.bz2

Hi Ondrej,

I do have some benchmark data.

--- Running benchmarks (average case/perfect alignment case) ---

very small data test:
memcpy_arm     :  (3 bytes copy) =   86.2 MB/s /   88.3 MB/s
memcpy_neon    :  (3 bytes copy) =   53.4 MB/s /   54.5 MB/s
memcpy_arm     :  (4 bytes copy) =   79.8 MB/s /   62.9 MB/s
memcpy_neon    :  (4 bytes copy) =   72.5 MB/s /   73.9 MB/s
memcpy_arm     :  (5 bytes copy) =   91.0 MB/s /   78.7 MB/s
memcpy_neon    :  (5 bytes copy) =   90.2 MB/s /   91.0 MB/s
memcpy_arm     :  (7 bytes copy) =  109.5 MB/s /  104.7 MB/s
memcpy_neon    :  (7 bytes copy) =  122.1 MB/s /  126.6 MB/s
memcpy_arm     :  (8 bytes copy) =  122.4 MB/s /  122.4 MB/s
memcpy_neon    :  (8 bytes copy) =  142.0 MB/s /  148.2 MB/s
memcpy_arm     :  (11 bytes copy) =  157.8 MB/s /  161.3 MB/s
memcpy_neon    :  (11 bytes copy) =  193.8 MB/s /  196.2 MB/s
memcpy_arm     :  (12 bytes copy) =  170.1 MB/s /  172.7 MB/s
memcpy_neon    :  (12 bytes copy) =  206.8 MB/s /  212.5 MB/s
memcpy_arm     :  (15 bytes copy) =  204.0 MB/s /  209.6 MB/s
memcpy_neon    :  (15 bytes copy) =  247.5 MB/s /  270.3 MB/s
memcpy_arm     :  (16 bytes copy) =  212.2 MB/s /  225.6 MB/s
memcpy_neon    :  (16 bytes copy) =  175.3 MB/s /  252.2 MB/s
memcpy_arm     :  (24 bytes copy) =  274.6 MB/s /  326.5 MB/s
memcpy_neon    :  (24 bytes copy) =  244.7 MB/s /  367.8 MB/s
memcpy_arm     :  (31 bytes copy) =  333.3 MB/s /  399.2 MB/s
memcpy_neon    :  (31 bytes copy) =  304.3 MB/s /  463.5 MB/s

L1 cached data:
memcpy_arm     :  (4096 bytes copy) = 1295.5 MB/s / 2691.8 MB/s
memcpy_neon    :  (4096 bytes copy) = 1826.3 MB/s / 2021.8 MB/s
memcpy_arm     :  (6144 bytes copy) = 1306.5 MB/s / 2724.1 MB/s
memcpy_neon    :  (6144 bytes copy) = 1857.8 MB/s / 2053.2 MB/s

L2 cached data:
memcpy_arm     :  (65536 bytes copy) = 1291.5 MB/s / 2304.8 MB/s
memcpy_neon    :  (65536 bytes copy) = 1866.5 MB/s / 2441.7 MB/s
memcpy_arm     :  (98304 bytes copy) = 1285.6 MB/s / 2283.8 MB/s
memcpy_neon    :  (98304 bytes copy) = 1860.7 MB/s / 2454.7 MB/s

SDRAM:
memcpy_arm     :  (2097152 bytes copy) =  466.7 MB/s /  736.5 MB/s
memcpy_neon    :  (2097152 bytes copy) =  727.5 MB/s /  868.8 MB/s
memcpy_arm     :  (3145728 bytes copy) =  507.9 MB/s /  854.7 MB/s
memcpy_neon    :  (3145728 bytes copy) =  852.9 MB/s / 1038.0 MB/s

(*) 1 MB = 1000000 bytes
(*) 'memcpy_arm' - an implementation for older ARM cores from glibc-ports
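
For reference, a minimal sketch of how MB/s figures of this kind can be
derived, using the decimal megabyte from the footnote above. This is not
the benchmark program that produced the numbers; the helper names
(throughput_mb_per_s, bench_memcpy) are hypothetical, the C library memcpy
stands in for the routine under test, and clock() is assumed to be precise
enough for the chosen iteration count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Convert a byte count and an elapsed time into MB/s,
 * with 1 MB = 1000000 bytes as in the footnote. */
static double throughput_mb_per_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / 1000000.0 / seconds;
}

/* Hypothetical helper: time `iterations` copies of `size` bytes and
 * return the throughput in MB/s. */
static double bench_memcpy(void *dst, const void *src, size_t size,
                           unsigned long iterations)
{
    /* Calling through a volatile function pointer keeps the compiler
     * from eliminating or merging the repeated copies. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;

    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++)
        copy(dst, src, size);
    clock_t end = clock();

    double seconds = (double)(end - start) / (double)CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        seconds = 1e-9; /* clamp when the run is below clock() resolution */

    return throughput_mb_per_s((double)size * (double)iterations, seconds);
}
```

Calling bench_memcpy(dst, src, 4096, n) with a large enough n would give a
figure comparable in kind to the "L1 cached data" rows, although the
absolute values depend on the core, the clock source and the alignment of
the buffers.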

A similar benchmark is at
http://sourceware.org/ml/libc-ports/2009-07/msg00000.html .

Regards,
$4

On Thu, Apr 4, 2013 at 12:19 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Wed, Apr 03, 2013 at 11:47:36PM +0800, Shih-Yuan Lee (FourDollars) wrote:
>> Hi Joseph,
>>
> ...
>> > I was previously told by people at ARM that NEON memcpy wasn't a good idea
>> > in practice because of raised power consumption, context switch costs etc.
>> > from using NEON in processes that otherwise didn't use it, even if it
>> > appeared superficially beneficial in benchmarks.
>> >
>> About raised power consumption and context switch costs, I may be able
>> to add some option in configure for the users to decide if they want
>> to use this feature or not.
>> What do you think?
>>
> A configure option is a bit overkill.
>
> You need to compare the speed of the NEON and the other implementation.
> Then determine the size above which NEON is faster once energy cost and
> context switches are included.
> My first estimate is to use NEON for copies larger than 4096 bytes.
>
> However, to determine the context switch cost of NEON you must account
> for the network effect.
>
> If you use NEON in one function that is called sufficiently often (to
> always save registers) then adding NEON implementations for additional
> functions does not increase the cost.
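
The size-threshold idea in the quoted reply could be sketched roughly as
below. This is only an illustration under stated assumptions: the 4096-byte
cutoff is the quoted first estimate, not a measured value, and every name
here (NEON_THRESHOLD, pick_copy_impl, copy_arm, copy_neon, copy_dispatch)
is hypothetical. Both stand-in routines simply call memcpy so that the
sketch compiles on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Cutoff taken from the quoted first estimate; a real value would have to
 * be measured per core, including energy and context switch costs. */
#define NEON_THRESHOLD 4096

enum copy_impl { COPY_ARM, COPY_NEON };

/* Hypothetical selector: NEON only for copies larger than the threshold. */
static enum copy_impl pick_copy_impl(size_t size)
{
    return size > NEON_THRESHOLD ? COPY_NEON : COPY_ARM;
}

/* Stand-ins for the two routines; both just call memcpy in this sketch. */
static void *copy_arm(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

static void *copy_neon(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Route each call by size, so small copies never touch NEON registers. */
static void *copy_dispatch(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    if (pick_copy_impl(n) == COPY_NEON)
        return copy_neon(dst, src, n);
    return copy_arm(dst, src, n);
}
```

A per-call size check like this costs one compare and branch on every
copy; whether that is acceptable for the very small sizes in the table
above would need to be benchmarked as well.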