Message-ID: <5226354D.8000006@redhat.com>
Date: Tue, 03 Sep 2013 19:15:00 -0000
From: "Carlos O'Donell"
To: Ondřej Bílka
CC: Will Newton, "libc-ports@sourceware.org", Patch Tracking, Siddhesh Poyarekar
Mailing-List: contact libc-ports-help@sourceware.org; run by ezmlm
Subject: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
References: <520894D5.7060207@linaro.org> <5220D30B.9080306@redhat.com>
 <5220F1F0.80501@redhat.com> <52260BD0.6090805@redhat.com>
 <20130903173710.GA2028@domone.kolej.mff.cuni.cz>
 <522621E2.6020903@redhat.com>
 <20130903185721.GA3876@domone.kolej.mff.cuni.cz>
In-Reply-To: <20130903185721.GA3876@domone.kolej.mff.cuni.cz>

On 09/03/2013 02:57 PM, Ondřej Bílka wrote:
> On Tue, Sep 03, 2013 at 01:52:34PM -0400, Carlos O'Donell wrote:
>> On 09/03/2013 01:37 PM, Ondřej Bílka wrote:
>>>> We have one, it's the glibc microbenchmark, and we want to expand it,
>>>> otherwise when ACME comes with their patch for ARM and breaks performance
>>>> for targets that Linaro cares about I have no way to reject the patch
>>>> objectively :-)
>>>>
>>> Carlos, you are asking for impossible. When you publish benchmark people
>>> will try to maximize benchmark number. After certain point this becomes
>>> possible only by employing shady accounting: Move part of time to place
>>> where it will not be measured by benchmark (for example by having
>>> function that is 4kb large, on benchmarks it will fit into instruction
>>> cache but that does not happen in reality).
>>
>> What is it that I'm asking that is impossible?
>>
> Having static set of benchmarks that can say if implementation is
> improvement.

I agree that a static set of benchmarks will not provide us with an
accurate answer for "Is this patch objectively better for performance?"

I don't see that that makes it impossible. The static set of benchmarks
at a given point in time (since we should be adding new benchmarks) may
have some error with respect to "fastest" for a given ISA/device/workload
(which I will shorten as `workload' from now on). Therefore I would not
say it's impossible. It's probably impossible for the error between the
estimator and reality to be close to zero, though.
That's just expected.

> We are shooting to moving target, architectures change and as what we write
> will code that will come to users with considerable delay and factors we
> used for decision will change in meantime.

What's wrong with a moving target?

I have spoken with CPU designers and I've been told that they do whole
system profiling in order to assist in making instruction to microcode
decoding decisions. Therefore the sequences we select as optimal are
also fed forward into *new* architecture designs.

> Once implementation reach certain quality question what is better
> becomes dependent on program used. Until we could decide from profile
> feedback we will lose some percents by having to use single
> implementation.

I agree.

The eventual goal of the project is to have some kind of whole system
benchmarking that allows users to feed in their profiles and allows us
as developers to see what users are doing with our library.

This is just like CPU designers, who feed in a whole distribution of
applications, look at the probability of instruction selection, and
tweak instruction to microcode mappings.

I am willing to accept a certain error in the process as long as I know
we are headed in the right direction. If we all disagree about the
direction we are going in then we should talk about it.

I see:

microbenchmarks -> whole system benchmarks -> profile driven optimizations

With the last of these driving the set of tunables we expose in the
library, and possibly even the functions selected by the IFUNC resolvers
at program startup.

>>> Taking care of common factors that can cause that is about ten times
>>> more complex than whole system benchmarking, analysis will be quite
>>> difficult as you will get twenty numbers and you will need to decide
>>> which ones could made real impact and which wont.
>>
>> Sorry, could you clarify this a bit more, exactly what is ten times
>> more complex?
>>
> Having benchmark suite that will catch all relevant factors that can
> affect performance. Some are hard to qualify for them we need to know
> how average program stresses resources.

I agree.

I would be happy to accept a patch that:
* Shows the benchmark numbers.
* Explains relevant factors not caught by the benchmark that affect
  performance, what they are, and why the patch should go in.

My goal is to increase the quality of the written rationales for
performance related submissions.

> Take instruction cache usage, a function will occupy cache lines and we
> can accurately measure probability and cost of cache misses inside
> function. What is hard to estimate is how this will affect rest of
> program. For this we would need to know average probability that cache
> line will be referenced in future.

Good example.

>> If we have N tests and they produce N numbers, for a given target,
>> for a given device, for a given workload, there is a set of importance
>> weights on N that should give you some kind of relevance.
>>
> You are jumping to case when we will have these weights. Problematic
> part is getting those.

I agree.

It's hard to know the weights without having an intuitive understanding
of the applications you're running on your system and what's relevant
for their performance.

Perhaps my example of a weighted average is too abstract to use today.

>> We should be able to come up with some kind of framework from which
>> we can clearly say "this patch is better than this other patch", even
>> if not automated, it should be possible to reason from the results,
>> and that reasoning recorded as a discussion on this list.
>>
> What is possible is to say that some patch is significantly worse based
> on some criteria. There is lot of gray area where decision is unclear.

:-)

Cheers,
Carlos.
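
For illustration only, the "N numbers with importance weights" idea from
this thread could be sketched as below. This is not part of the glibc
microbenchmark; the benchmark names, timings, weights, and the 2% noise
margin are all made up, and the weighted geometric mean of per-test
ratios is just one reasonable way to combine the numbers.

```python
import math

def weighted_ratio(baseline, patched, weights):
    """Weighted geometric mean of patched/baseline timings.

    A result below 1.0 means the patch is faster for this weighting.
    All three arguments map a benchmark name to a positive number.
    """
    if not (baseline.keys() == patched.keys() == weights.keys()):
        raise ValueError("all inputs must cover the same benchmarks")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    log_sum = sum(w * math.log(patched[name] / baseline[name])
                  for name, w in weights.items())
    return math.exp(log_sum / total)

def verdict(baseline, patched, weights, noise=0.02):
    """Return 'better', 'worse', or 'unclear' (the gray area).

    noise is a hypothetical margin: changes smaller than this are
    not treated as a real difference.
    """
    ratio = weighted_ratio(baseline, patched, weights)
    if ratio < 1.0 - noise:
        return "better"
    if ratio > 1.0 + noise:
        return "worse"
    return "unclear"

# Hypothetical timings in nanoseconds per call; not measured data.
baseline = {"memcpy_small": 10.0, "memcpy_large": 1000.0}
patched = {"memcpy_small": 8.0, "memcpy_large": 1100.0}

# Two hypothetical workloads that weight the same numbers differently.
small_heavy = {"memcpy_small": 0.9, "memcpy_large": 0.1}
large_heavy = {"memcpy_small": 0.1, "memcpy_large": 0.9}

for label, weights in (("small-heavy", small_heavy),
                       ("large-heavy", large_heavy)):
    print(label, verdict(baseline, patched, weights))
```

The same pair of results gives opposite verdicts under the two
weightings, which is the point made above: the arithmetic is easy, and
obtaining weights that reflect a real workload is the hard part.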