From: Richard Earnshaw
Date: Tue, 09 Apr 2013 15:00:00 -0000
To: Carlos O'Donell
CC: "Joseph S. Myers", "Shih-Yuan Lee (FourDollars)", patches@eglibc.org,
 libc-ports@sourceware.org, rex.tsai@canonical.com,
 jesse.sung@canonical.com, yc.cheng@canonical.com, Shih-Yuan Lee
Subject: Re: [PATCH] ARM: NEON detected memcpy.

On 09/04/13 13:58, Carlos O'Donell wrote:
> On 04/09/2013 05:04 AM, Richard Earnshaw wrote:
>> On 03/04/13 16:08, Joseph S. Myers wrote:
>>> I was previously told by people at ARM that NEON memcpy wasn't a good
>>> idea in practice because of raised power consumption, context switch
>>> costs etc. from using NEON in processes that otherwise didn't use it,
>>> even if it appeared superficially beneficial in benchmarks.
>>
>> What really matters is system power increase vs performance gain and
>> what you might be able to save if you finish sooner. If a 10%
>> improvement to memcpy performance comes at a 12% increase in CPU
>> power, then that might seem like a net loss. But if the CPU is only
>> 50% of the system power, then the increase in system power is just
>> half of that (i.e. 6%), while the performance improvement will still
>> be 10%. Note that these figures are just examples to make the
>> arithmetic easier here; I've no idea what the real numbers are, and
>> they will be highly dependent on the other components in the system:
>> a back-lit display, in particular, will use a significant amount of
>> power.
>>
>> It's also necessary to think about how the Neon unit in the processor
>> is managed. Is it power gated or simply clock gated? Power gated
>> regions are likely to have long power-up times (relative to normal
>> CPU operations), but clock-gated regions are typically
>> instantaneously available.
>>
>> Finally, you need to consider whether the unit is likely to be
>> already in use. With the increasing trend towards the hard-float
>> ABI, VFP (and Neon) are generally much more widely used in code now
>> than they were, so the other potential cost of using Neon (lazy
>> context switching) is also likely to be much less of an issue than if
>> the unit were almost never touched.
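To make the arithmetic in the quoted example concrete, here is a minimal
sketch using the illustrative figures from the text (a 10% speedup, a 12%
CPU power increase, a CPU that accounts for half of system power; none of
these are measured values). It also folds in the point about finishing
sooner: energy for a fixed amount of work is power multiplied by time.

#include <stdio.h>

int main(void)
{
  /* Illustrative figures from the quoted example, not measurements. */
  double speedup = 1.10;         /* memcpy finishes 10% sooner */
  double cpu_power_delta = 0.12; /* CPU draws 12% more power   */
  double cpu_share = 0.50;       /* CPU is half of system power */

  /* Only the CPU's share of the increase shows up at system level. */
  double sys_power_delta = cpu_share * cpu_power_delta;    /* 6% */

  /* Energy for a fixed amount of work is power * time, so finishing
     sooner can outweigh the extra draw while the copy runs.  */
  double energy_ratio = (1.0 + sys_power_delta) / speedup;

  printf ("system power: +%.1f%%\n", 100 * sys_power_delta);
  printf ("energy for the same work: %.1f%% of baseline\n",
          100 * energy_ratio);
  return 0;
}

Run as written it prints a 6% system power increase and roughly 96% of
the baseline energy, i.e. the apparent net loss at the CPU level becomes
a net win at the system level under these assumed figures.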
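For readers skimming the thread, the kind of inner loop at issue looks
roughly like the following intrinsics sketch. It is purely illustrative
and is not the code from the patch under review; a real implementation
would also deal with alignment, overlap and larger unrolled steps.

#include <arm_neon.h>
#include <stddef.h>

/* Copy n bytes using 128-bit NEON loads and stores; the tail is
   handled byte-wise to keep the sketch short.  */
static void
neon_copy (unsigned char *dst, const unsigned char *src, size_t n)
{
  while (n >= 16)
    {
      vst1q_u8 (dst, vld1q_u8 (src)); /* one 16-byte vector per step */
      dst += 16;
      src += 16;
      n -= 16;
    }
  while (n--)
    *dst++ = *src++;
}

Built with, e.g., gcc -O2 -mfpu=neon on an ARMv7 hard-float toolchain,
this is the sort of code that keeps the Neon unit active during copies.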
> My expectation here is that downstream integrators run the
> glibc microbenchmarks, or their own benchmarks, measure power,
> and engage the community to discuss alternate runtime tunings
> for their systems.
>
> The project lacks any generalized whole-system benchmarking,
> but my opinion is that microbenchmarks are the best "first step"
> towards achieving measurable performance goals (since whole-system
> benchmarking is much more complicated).
>
> At present the only policy we have as a community is that faster
> is always better.

You still have to be careful how you measure 'faster'. Repeatedly
running the same fragment of code under the same boundary conditions
will only ever give you the 'warm caches' number (instruction, data and
branch target), but if the code is called cold (or with different
boundary conditions, in the case of the branch target cache) most of
the time in real life, that number is unlikely to be very meaningful.

R.
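To illustrate the warm- versus cold-cache distinction, here is a minimal
sketch of the two measurement regimes. It is not the glibc
microbenchmark; the buffer sizes and iteration counts are arbitrary, and
walking a large scratch buffer only approximates a truly cold call (it
says nothing about branch-target state).

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define COPY_SIZE  4096
#define EVICT_SIZE (16 * 1024 * 1024) /* assumed larger than the LLC */
#define ITERS      100

static volatile char sink; /* defeats dead-store elimination */

static double
now_ns (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int
main (void)
{
  char *src = malloc (COPY_SIZE);
  char *dst = malloc (COPY_SIZE);
  volatile char *evict = malloc (EVICT_SIZE);
  double t, warm, cold = 0.0;
  size_t j;
  int i;

  memset (src, 1, COPY_SIZE);

  /* Warm: back-to-back calls; caches and branch predictors primed.  */
  t = now_ns ();
  for (i = 0; i < ITERS; i++)
    {
      memcpy (dst, src, COPY_SIZE);
      sink = dst[i % COPY_SIZE];
    }
  warm = (now_ns () - t) / ITERS;

  /* "Cold": walk a large buffer between calls so each copy starts
     with the caches evicted (an approximation of a cold call).  */
  for (i = 0; i < ITERS; i++)
    {
      for (j = 0; j < EVICT_SIZE; j += 64)
        evict[j]++;
      t = now_ns ();
      memcpy (dst, src, COPY_SIZE);
      cold += now_ns () - t;
      sink = dst[i % COPY_SIZE];
    }
  cold /= ITERS;

  printf ("warm: %.0f ns/copy  cold: %.0f ns/copy\n", warm, cold);
  free (src);
  free (dst);
  free ((void *) evict);
  return 0;
}

The gap between the two numbers is exactly the effect the paragraph
above warns about: quoting only the warm figure flatters any
implementation whose real-life callers arrive with cold caches.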