From: Richard Earnshaw
Date: Tue, 09 Apr 2013 15:00:00 -0000
To: Carlos O'Donell
CC: "Joseph S. Myers", "Shih-Yuan Lee (FourDollars)", patches@eglibc.org,
 libc-ports@sourceware.org, rex.tsai@canonical.com,
 jesse.sung@canonical.com, yc.cheng@canonical.com, Shih-Yuan Lee
Subject: Re: [PATCH] ARM: NEON detected memcpy.

On 09/04/13 13:58, Carlos O'Donell wrote:
> On 04/09/2013 05:04 AM, Richard Earnshaw wrote:
>> On 03/04/13 16:08, Joseph S. Myers wrote:
>>> I was previously told by people at ARM that NEON memcpy wasn't a good
>>> idea in practice because of raised power consumption, context switch
>>> costs etc. from using NEON in processes that otherwise didn't use it,
>>> even if it appeared superficially beneficial in benchmarks.
>>
>> What really matters is system power increase vs performance gain and
>> what you might be able to save if you finish sooner. If a 10%
>> improvement to memcpy performance comes at a 12% increase in CPU
>> power, then that might seem like a net loss. But if the CPU is only
>> 50% of the system power, then the increase in system power is just
>> half of that (i.e. 6%), while the performance improvement will still
>> be 10%. Note that these figures are just examples to make the
>> arithmetic easier here; I've no idea what the real numbers are, and
>> they will be highly dependent on the other components in the system:
>> a back-lit display, in particular, will use a significant amount of
>> power.
>>
>> It's also necessary to think about how the Neon unit in the processor
>> is managed. Is it power gated or simply clock gated? Power gated
>> regions are likely to have long power-up times (relative to normal
>> CPU operations), but clock-gated regions are typically
>> instantaneously available.
>>
>> Finally, you need to consider whether the unit is likely to be
>> already in use. With the increasing trend towards the hard-float
>> ABI, VFP (and Neon) are generally much more widely used in code now
>> than they were, so the other potential cost of using Neon (lazy
>> context switching) is also likely to be much less of an issue than if
>> the unit were almost never touched.
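To make the arithmetic in the quoted example concrete, here is a minimal
sketch using the illustrative figures from the text (a 10% speedup, a 12%
CPU power increase, a CPU that accounts for half of system power; none of
these are measured values). It also folds in the point about finishing
sooner: energy for a fixed amount of work is power multiplied by time.

#include <stdio.h>

int main(void)
{
  /* Illustrative figures from the quoted example, not measurements. */
  double speedup = 1.10;         /* memcpy finishes 10% sooner */
  double cpu_power_delta = 0.12; /* CPU draws 12% more power   */
  double cpu_share = 0.50;       /* CPU is half of system power */

  /* Only the CPU's share of the increase shows up at system level. */
  double sys_power_delta = cpu_share * cpu_power_delta;    /* 6% */

  /* Energy for a fixed amount of work is power * time, so finishing
     sooner can outweigh the extra draw while the copy runs.  */
  double energy_ratio = (1.0 + sys_power_delta) / speedup;

  printf ("system power: +%.1f%%\n", 100 * sys_power_delta);
  printf ("energy for the same work: %.1f%% of baseline\n",
          100 * energy_ratio);
  return 0;
}

Run as written it prints a 6% system power increase and roughly 96% of
the baseline energy, i.e. the apparent net loss at the CPU level becomes
a net win at the system level under these assumed figures.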
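For readers skimming the thread, the kind of inner loop at issue looks
roughly like the following intrinsics sketch. It is purely illustrative
and is not the code from the patch under review; a real implementation
would also deal with alignment, overlap and larger unrolled steps.

#include <arm_neon.h>
#include <stddef.h>

/* Copy n bytes using 128-bit NEON loads and stores; the tail is
   handled byte-wise to keep the sketch short.  */
static void
neon_copy (unsigned char *dst, const unsigned char *src, size_t n)
{
  while (n >= 16)
    {
      vst1q_u8 (dst, vld1q_u8 (src)); /* one 16-byte vector per step */
      dst += 16;
      src += 16;
      n -= 16;
    }
  while (n--)
    *dst++ = *src++;
}

Built with, e.g., gcc -O2 -mfpu=neon on an ARMv7 hard-float toolchain,
this is the sort of code that keeps the Neon unit active during copies.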
> My expectation here is that downstream integrators run the
> glibc microbenchmarks, or their own benchmarks, measure power,
> and engage the community to discuss alternate runtime tunings
> for their systems.
>
> The project lacks any generalized whole-system benchmarking,
> but my opinion is that microbenchmarks are the best "first step"
> towards achieving measurable performance goals (since whole-system
> benchmarking is much more complicated).
>
> At present the only policy we have as a community is that faster
> is always better.

You still have to be careful how you measure 'faster'. Repeatedly
running the same fragment of code under the same boundary conditions
will only ever give you the 'warm caches' number (instruction, data and
branch target), but if the code is called cold (or with different
boundary conditions, in the case of the branch target cache) most of
the time in real life, that number is unlikely to be very meaningful.

R.
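To illustrate the warm- versus cold-cache distinction, here is a minimal
sketch of the two measurement regimes. It is not the glibc
microbenchmark; the buffer sizes and iteration counts are arbitrary, and
walking a large scratch buffer only approximates a truly cold call (it
says nothing about branch-target state).

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define COPY_SIZE  4096
#define EVICT_SIZE (16 * 1024 * 1024) /* assumed larger than the LLC */
#define ITERS      100

static volatile char sink; /* defeats dead-store elimination */

static double
now_ns (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int
main (void)
{
  char *src = malloc (COPY_SIZE);
  char *dst = malloc (COPY_SIZE);
  volatile char *evict = malloc (EVICT_SIZE);
  double t, warm, cold = 0.0;
  size_t j;
  int i;

  memset (src, 1, COPY_SIZE);

  /* Warm: back-to-back calls; caches and branch predictors primed.  */
  t = now_ns ();
  for (i = 0; i < ITERS; i++)
    {
      memcpy (dst, src, COPY_SIZE);
      sink = dst[i % COPY_SIZE];
    }
  warm = (now_ns () - t) / ITERS;

  /* "Cold": walk a large buffer between calls so each copy starts
     with the caches evicted (an approximation of a cold call).  */
  for (i = 0; i < ITERS; i++)
    {
      for (j = 0; j < EVICT_SIZE; j += 64)
        evict[j]++;
      t = now_ns ();
      memcpy (dst, src, COPY_SIZE);
      cold += now_ns () - t;
      sink = dst[i % COPY_SIZE];
    }
  cold /= ITERS;

  printf ("warm: %.0f ns/copy  cold: %.0f ns/copy\n", warm, cold);
  free (src);
  free (dst);
  free ((void *) evict);
  return 0;
}

The gap between the two numbers is exactly the effect the paragraph
above warns about: quoting only the warm figure flatters any
implementation whose real-life callers arrive with cold caches.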