From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 09 Apr 2013 15:54:00 -0000
From: Ondřej Bílka
To: Richard Earnshaw
Cc: Carlos O'Donell, "Joseph S. Myers", "Shih-Yuan Lee (FourDollars)",
 patches@eglibc.org, libc-ports@sourceware.org, rex.tsai@canonical.com,
 jesse.sung@canonical.com, yc.cheng@canonical.com, Shih-Yuan Lee
Subject: Re: [PATCH] ARM: NEON detected memcpy.
Message-ID: <20130409155344.GA8760@domone.kolej.mff.cuni.cz>
References: <5163D9B8.7030008@arm.com> <51641077.4000102@redhat.com> <51642CF3.2040506@arm.com>
In-Reply-To: <51642CF3.2040506@arm.com>
X-SW-Source: 2013-04/txt/msg00035.txt.bz2

On Tue, Apr 09, 2013 at 04:00:03PM +0100, Richard Earnshaw wrote:
> On 09/04/13 13:58, Carlos O'Donell wrote:
> >On 04/09/2013 05:04 AM, Richard Earnshaw wrote:
> >>On 03/04/13 16:08, Joseph S.
> >>Myers wrote:
> >>>I was previously told by people at ARM that NEON memcpy wasn't a good idea
> >>>in practice because of raised power consumption, context switch costs, etc.,
> >>>from using NEON in processes that otherwise didn't use it, even if it
> >>>appeared superficially beneficial in benchmarks.
> >>
> >>What really matters is system power increase vs. performance gain, and
> >>what you might be able to save if you finish sooner. If a 10%
> >>improvement to memcpy performance comes at a 12% increase in CPU
> >>power, then that might seem like a net loss. But if the CPU is only
> >>50% of the system power, then the increase in system power is just
> >>half of that (i.e. 6%), while the performance improvement will still
> >>be 10%. Note that these figures are just examples to keep the
> >>arithmetic easy; I've no idea what the real numbers are, and they
> >>will be highly dependent on the other components in the system: a
> >>back-lit display, in particular, will use a significant amount of
> >>power.
> >>
> >>It's also necessary to think about how the Neon unit in the processor
> >>is managed. Is it power gated or simply clock gated? Power gated
> >>regions are likely to have long power-up times (relative to normal
> >>CPU operations), but clock-gated regions are typically
> >>instantaneously available.
> >>
> >>Finally, you need to consider whether the unit is likely to be
> >>already in use. With the increasing trend towards the hard-float
> >>ABI, VFP (and Neon) are generally much more widely used in code now
> >>than they were, so the other potential cost of using Neon (lazy
> >>context switching) is less likely to be an issue than if the unit
> >>were almost never touched.
> >
> >My expectation here is that downstream integrators run the
> >glibc microbenchmarks, or their own benchmarks, measure power,
> >and engage the community to discuss alternate runtime tunings
> >for their systems.
> >
> >The project lacks any generalized whole-system benchmarking,
> >but my opinion is that microbenchmarks are the best "first step"
> >towards achieving measurable performance goals (since whole-system
> >benchmarking is much more complicated).
> >
> >At present the only policy we have as a community is that faster
> >is always better.
>
I am rewriting my whole-system benchmarks to be more generic. Still,
measuring performance is time consuming: a benchmark needs at least an
hour to gather enough data. Nor can I replicate the exact conditions of
a measurement; they depend on what I am doing with the computer, which
varies. There is also a problem of representativeness: I know what the
conditions look like for popular programs (gcc, firefox), and most
other programs show very similar characteristics, but I do not know
anything about the tail. To get more direct feedback I also run a
record/replay benchmark; see my previous mail.

> You still have to be careful how you measure 'faster'. Repeatedly
> running the same fragment of code under the same boundary conditions
> will only ever give you the 'warm caches' number (I, D and branch
> target), but if the code is called cold (or with different boundary
> conditions in the case of the branch target cache) most of the time
> in real life, that's unlikely to be very meaningful.
>
> R.
>