From: "Shih-Yuan Lee (FourDollars)"
Date: Thu, 04 Apr 2013 04:15:00 -0000
Subject: Re: [Patches] [PATCH] ARM: NEON detected memcpy.
To: Ondřej Bílka
Cc: "Joseph S. Myers", libc-ports@sourceware.org, Jesse Sung, patches@eglibc.org, YC Cheng, rex.tsai@canonical.com
In-Reply-To: <20130403161949.GA6759@domone.kolej.mff.cuni.cz>

Hi Ondrej,

I do have some benchmark data.

--- Running benchmarks (average case/perfect alignment case) ---

very small data test:
memcpy_arm  : (3 bytes copy) = 86.2 MB/s / 88.3 MB/s
memcpy_neon : (3 bytes copy) = 53.4 MB/s / 54.5 MB/s
memcpy_arm  : (4 bytes copy) = 79.8 MB/s / 62.9 MB/s
memcpy_neon : (4 bytes copy) = 72.5 MB/s / 73.9 MB/s
memcpy_arm  : (5 bytes copy) = 91.0 MB/s / 78.7 MB/s
memcpy_neon : (5 bytes copy) = 90.2 MB/s / 91.0 MB/s
memcpy_arm  : (7 bytes copy) = 109.5 MB/s / 104.7 MB/s
memcpy_neon : (7 bytes copy) = 122.1 MB/s / 126.6 MB/s
memcpy_arm  : (8 bytes copy) = 122.4 MB/s / 122.4 MB/s
memcpy_neon : (8 bytes copy) = 142.0 MB/s / 148.2 MB/s
memcpy_arm  : (11 bytes copy) = 157.8 MB/s / 161.3 MB/s
memcpy_neon : (11 bytes copy) = 193.8 MB/s / 196.2 MB/s
memcpy_arm  : (12 bytes copy) = 170.1 MB/s / 172.7 MB/s
memcpy_neon : (12 bytes copy) = 206.8 MB/s / 212.5 MB/s
memcpy_arm  : (15 bytes copy) = 204.0 MB/s / 209.6 MB/s
memcpy_neon : (15 bytes copy) = 247.5 MB/s / 270.3 MB/s
memcpy_arm  : (16 bytes copy) = 212.2 MB/s / 225.6 MB/s
memcpy_neon : (16 bytes copy) = 175.3 MB/s / 252.2 MB/s
memcpy_arm  : (24 bytes copy) = 274.6 MB/s / 326.5 MB/s
memcpy_neon : (24 bytes copy) = 244.7 MB/s / 367.8 MB/s
memcpy_arm  : (31 bytes copy) = 333.3 MB/s / 399.2 MB/s
memcpy_neon : (31 bytes copy) = 304.3 MB/s / 463.5 MB/s

L1 cached data:
memcpy_arm  : (4096 bytes copy) = 1295.5 MB/s / 2691.8 MB/s
memcpy_neon : (4096 bytes copy) = 1826.3 MB/s / 2021.8 MB/s
memcpy_arm  : (6144 bytes copy) = 1306.5 MB/s / 2724.1 MB/s
memcpy_neon : (6144 bytes copy) = 1857.8 MB/s / 2053.2 MB/s

L2 cached data:
memcpy_arm  : (65536 bytes copy) = 1291.5 MB/s / 2304.8 MB/s
memcpy_neon : (65536 bytes copy) = 1866.5 MB/s / 2441.7 MB/s
memcpy_arm  : (98304 bytes copy) = 1285.6 MB/s / 2283.8 MB/s
memcpy_neon : (98304 bytes copy) = 1860.7 MB/s / 2454.7 MB/s

SDRAM:
memcpy_arm  : (2097152 bytes copy) = 466.7 MB/s / 736.5 MB/s
memcpy_neon : (2097152 bytes copy) = 727.5 MB/s / 868.8 MB/s
memcpy_arm  : (3145728 bytes copy) = 507.9 MB/s / 854.7 MB/s
memcpy_neon : (3145728 bytes copy) = 852.9 MB/s / 1038.0 MB/s

(*) 1 MB = 1000000 bytes
(*) 'memcpy_arm' - an implementation for older ARM cores from glibc-ports

A similar benchmark is at
http://sourceware.org/ml/libc-ports/2009-07/msg00000.html .

Regards,
$4

On Thu, Apr 4, 2013 at 12:19 AM, Ondřej Bílka wrote:
> On Wed, Apr 03, 2013 at 11:47:36PM +0800, Shih-Yuan Lee (FourDollars) wrote:
>> Hi Joseph,
>>
> ...
>> > I was previously told by people at ARM that NEON memcpy wasn't a good idea
>> > in practice because of raised power consumption, context switch costs etc.
>> > from using NEON in processes that otherwise didn't use it, even if it
>> > appeared superficially beneficial in benchmarks.
>> >
>> About raised power consumption and context switch costs, I may be able
>> to add some option in configure for the users to decide if they want
>> to use this feature or not.
>> How do you think?
>>
> Configure option is bit overkill.
>
> You need to compare neon/other implementation speed. Then determine
> size where neon is faster if we include energy cost and context switch.
> My first estimate is use neon when larger than 4096 bytes.
>
> However to determine context switch cost of neon you must account network effect.
>
> If you use neon in one function that is called sufficiently often (to
> always save registers) then adding neon implementation for additional functions
> does not increase cost.
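
The quoted suggestion to "determine size where neon is faster" can be sketched against the average-case figures in this message. This is only a rough illustration: the names AVERAGE_CASE and neon_threshold are made up for the example, it looks at raw throughput alone, and it does not model the energy or context switch costs raised in the thread.

```python
# Average-case throughput in MB/s, copied from the benchmark in this message.
# Each row is (size in bytes, memcpy_arm, memcpy_neon).
# AVERAGE_CASE and neon_threshold are hypothetical names for this sketch.
AVERAGE_CASE = [
    (3, 86.2, 53.4),
    (4, 79.8, 72.5),
    (5, 91.0, 90.2),
    (7, 109.5, 122.1),
    (8, 122.4, 142.0),
    (11, 157.8, 193.8),
    (12, 170.1, 206.8),
    (15, 204.0, 247.5),
    (16, 212.2, 175.3),
    (24, 274.6, 244.7),
    (31, 333.3, 304.3),
    (4096, 1295.5, 1826.3),
    (6144, 1306.5, 1857.8),
    (65536, 1291.5, 1866.5),
    (98304, 1285.6, 1860.7),
    (2097152, 466.7, 727.5),
    (3145728, 507.9, 852.9),
]


def neon_threshold(rows):
    """Return the smallest measured size from which NEON is faster at
    every larger measured size, or None if there is no such size."""
    threshold = None
    for size, arm, neon in sorted(rows):
        if neon > arm:
            # Start (or continue) a run of sizes where NEON wins.
            if threshold is None:
                threshold = size
        else:
            # ARM wins or ties here, so any earlier run does not count.
            threshold = None
    return threshold


if __name__ == "__main__":
    print(neon_threshold(AVERAGE_CASE))
```

There are no measurements between 31 and 4096 bytes, so the real crossover for the average case may lie anywhere in that gap. The perfect-alignment column behaves differently (memcpy_arm is ahead for the L1 cached sizes), so it would need its own pass.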