Mailing-List: contact libc-ports-help@sourceware.org; run by ezmlm
Date: Wed, 04 Sep 2013 11:03:00 -0000
From: Ondřej Bílka
To: Siddhesh Poyarekar
Cc: Carlos O'Donell, Will Newton, "libc-ports@sourceware.org", Patch Tracking
Subject: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
Message-ID: <20130904110333.GA6216@domone.kolej.mff.cuni.cz>
In-Reply-To: <20130904073008.GA4306@spoyarek.pnq.redhat.com>
References: <5220D30B.9080306@redhat.com> <5220F1F0.80501@redhat.com> <52260BD0.6090805@redhat.com> <20130903173710.GA2028@domone.kolej.mff.cuni.cz> <522621E2.6020903@redhat.com> <20130903185721.GA3876@domone.kolej.mff.cuni.cz> <5226354D.8000006@redhat.com> <20130904073008.GA4306@spoyarek.pnq.redhat.com>

On Wed, Sep 04, 2013 at 01:00:09PM +0530, Siddhesh Poyarekar wrote:
> On Tue, Sep 03, 2013 at 03:15:25PM -0400, Carlos O'Donell wrote:
> > I agree. The eventual goal of the project is to have some kind of
> > whole system benchmarking that allows users to feed in their profiles
> > and allow us as developers to see what users are doing with our
> > library.
> >
> > Just like CPU designers feed in a whole distribution of applications
> > and look at the probability of instruction selection and tweak
> > instruction to microcode mappings.
> >
> > I am willing to accept a certain error in the process as long as I
> > know we are headed in the right direction. If we all disagree about
> > the direction we are going in then we should talk about it.
> >
> > I see:
> >
> > microbenchmarks -> whole system benchmarks -> profile driven optimizations
>
> I've mentioned this before - microbenchmarks are not a way to whole
> system benchmarks in that they don't replace system benchmarks. We
> need to work on both in parallel because both have different goals.
>
> A microbenchmark would have parameters such as alignment, size and
> cache pressure to determine how an implementation scales. These are
> generic numbers (i.e. they're not tied to specific high level
> workloads) that a developer can use to design their programs.
>
> Whole system benchmarks however work at a different level. They would
> give an average case number that describes how a specific recipe
> impacts performance of a set of programs. An administrator would use
> these to tweak the system for the workload.
>
> > I would be happy to accept a patch that does:
> > * Shows the benchmark numbers.
> > * Explains relevant factors not caught by the benchmark that affect
> >   performance, what they are, and why the patch should go in.
> >
> > My goal is to increase the quality of the written rationales for
> > performance related submissions.
>
> Agreed. In fact, this should go in as a large comment in the
> implementation itself. Someone had mentioned in the past (was it
> Torvald?) that every assembly implementation we write should be as
> verbose in comments as it can possibly be, so that there is no
> ambiguity about the rationale for selecting specific instruction
> sequences over others.
>
> > >> If we have N tests and they produce N numbers, for a given target,
> > >> for a given device, for a given workload, there is a set of
> > >> importance weights on N that should give you some kind of relevance.
> > >>
> > > You are jumping to the case where we already have these weights.
> > > The problematic part is getting them.
> >
> > I agree.
> >
> > It's hard to know the weights without having an intuitive
> > understanding of the applications you're running on your system and
> > what's relevant for their performance.
>
> 1. Assume aligned input. Nothing should take (any noticeable)
>    performance away from aligned copies/moves.

Not very useful, as this is extremely dependent on the function being
measured. For functions like strcmp and strlen the alignments are mostly
random, so the aligned case does not say much. At the opposite end of the
spectrum is memset, which is almost always 8-byte aligned, so measuring
its unaligned performance does not make much sense.

> 2. Scale with size.

Not very important, for several reasons.
One is that big sizes are cold: just look at oprofile output and you will
see that the loops run less often than the function headers. The second
reason is that, if we look at the caller, large sizes are unlikely to be
the bottleneck. One type of usage is finding a delimiter, like:

  i=strlen(n); for (i=0;i

> 3. Provide acceptable performance for unaligned sizes without
>    penalizing the aligned case.

This is quite an important case. It needs to be measured correctly: what
matters is that the alignment varies. An implementation can be slower
when alignment varies in reality than it appears when you benchmark with
one fixed alignment.

> 4. Measure the effect of dcache pressure on function performance.
> 5. Measure the effect of icache pressure on function performance.

Here you really need to base the weights on function usage patterns. A
bigger code size is acceptable for functions that are called more often,
and you need to see how the calls are clustered in time to get the full
picture. strcmp is the least sensitive to icache concerns: when it is
called, it is mostly called 100 times over in a tight loop, so size is
not a big issue. If the same number of calls were spread uniformly
through the program, we would need stricter criteria.

> Depending on the actual cost of cache misses on different processors,
> the icache/dcache miss cost would either have higher or lower weight,
> but for 1-3 I'd go in that order of priorities, with little concern
> for the unaligned cases.
>
> Siddhesh