Date: Mon, 02 Sep 2013 19:57:00 -0000
From: Ondřej Bílka
To: Will Newton
Cc: Carlos O'Donell, "libc-ports@sourceware.org", Patch Tracking, Siddhesh Poyarekar
Subject: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
Message-ID: <20130902195746.GA13759@domone.kolej.mff.cuni.cz>
References: <520894D5.7060207@linaro.org> <5220D30B.9080306@redhat.com> <5220F1F0.80501@redhat.com>

On Mon, Sep 02, 2013 at 02:58:23PM +0100, Will Newton wrote:
> On 30 August 2013 20:26, Carlos O'Donell wrote:
> > On 08/30/2013 02:48 PM, Will Newton wrote:
> >> On 30 August 2013 18:14, Carlos O'Donell wrote:
> >>>> Ping?
> >>>
> >>> How did you test the performance?
> >>>
> >>> glibc has a performance microbenchmark, did you use that?
> >>
> >> No, I used the cortex-strings package developed by Linaro for
> >> benchmarking various string functions against one another[1].
> >>
> >> I haven't checked the glibc benchmarks but I'll look into that. It's
> >> quite a specific case that shows the problem so it may not be obvious
> >> which one is better however.
> >
> > If it's not obvious how is someone supposed to review this patch? :-)
>
> With difficulty. ;-)
>
> Joseph has raised some good points about the comments and I'll go back
> through the code and make sure everything is correct in that regard.
> The change was actually made to the copy of the code in cortex-strings
> some time ago but I delayed pushing the patch due to the 2.18 release
> so I have to refresh my memory somewhat.
>
> Ideally we would have an agreed upon benchmark with which everyone
> could analyse the performance of the code on their systems, however
> that does not seem to exist as far as I can tell.
>
Well, for measuring performance, about the only methodology that everybody
will agree with is to compile the implementations as old.so and new.so and
then run

  LD_PRELOAD=old.so time cmd
  LD_PRELOAD=new.so time cmd

in a loop until you can show a statistically significant difference
(provided that the commands you use are representative enough). For any
other methodology somebody will argue the opposite conclusion, because you
forgot to take some factor into account.
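If it helps, here is roughly what I mean as a small driver (just a sketch:
old.so, new.so and ./bench_workload are placeholders, and RUNS and the
three-sigma cut-off are arbitrary choices):

/* Sketch only: compare two preloaded implementations over a
   representative workload and report whether the difference in
   wall-clock time is outside the noise.  */
#define _POSIX_C_SOURCE 200809L
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define RUNS 30

static double
run_once (const char *cmd)
{
  struct timespec a, b;
  clock_gettime (CLOCK_MONOTONIC, &a);
  if (system (cmd) != 0)
    {
      fprintf (stderr, "command failed: %s\n", cmd);
      exit (1);
    }
  clock_gettime (CLOCK_MONOTONIC, &b);
  return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

static void
measure (const char *cmd, double *mean, double *var)
{
  double s = 0.0, s2 = 0.0;
  int i;
  for (i = 0; i < RUNS; i++)
    {
      double t = run_once (cmd);
      s += t;
      s2 += t * t;
    }
  *mean = s / RUNS;
  *var = (s2 - s * s / RUNS) / (RUNS - 1);
}

int
main (void)
{
  double m_old, v_old, m_new, v_new, diff, err;

  measure ("LD_PRELOAD=./old.so ./bench_workload", &m_old, &v_old);
  measure ("LD_PRELOAD=./new.so ./bench_workload", &m_new, &v_new);

  diff = m_old - m_new;
  err = sqrt (v_old / RUNS + v_new / RUNS);
  printf ("old %.4fs  new %.4fs  diff %.4fs +- %.4fs\n",
          m_old, m_new, diff, err);

  /* Only claim a winner when the difference is well outside the noise.  */
  if (fabs (diff) > 3 * err)
    printf ("%s looks faster\n", diff > 0 ? "new.so" : "old.so");
  else
    printf ("no statistically significant difference yet\n");
  return 0;
}

Interleaving the old/new runs and alternating their order would also
control for cpufreq and thermal drift, but the idea is the same.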
Even if you change the LD_PRELOAD=old.so implementation so that it
accurately measures the time spent in the function, that need not be
enough. You could have an implementation that is 5 cycles faster on the
benchmark but slower in reality, because:

1) Its code is 1000 bytes bigger than the alternative. The gains in the
function itself will be eaten by instruction cache misses outside the
function. Or

2) The function aggressively prefetches data (say, a loop that prefetches
the line 512 bytes past the current buffer position). This makes the
benchmark numbers better, but the cache gets littered with data from past
the end of the buffer, and real performance suffers. Or

3) For malloc, keeping metadata on the same cache line as the start of the
allocated block can make a benchmark look bad due to cache misses, but it
improves real performance: the user will write there anyway, so the
metadata write serves as a prefetch.

or ...

> > e.g.
> > http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/view/head:/benchmarks/multi/harness.c
> >
> > I would not call `multi' exhaustive, and while neither are the glibc
> > performance benchmark tests, the glibc tests have received review from
> > the glibc community and are our preferred way of demonstrating
> > performance gains when posting performance patches.
>
> The key advantage of the cortex-strings framework is that it allows
> graphing the results of benchmarks. Often changes to string function
> performance can only really be analysed graphically as otherwise you
> end up with a huge soup of numbers, some going up, some going down and
> it is very hard to separate the signal from the noise.
>
Like the following?

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_loop/results_rand/result.html

On real workloads of memcpy it is still a bit hard to see what is going on:

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_loop/results_gcc/result.html

> The glibc benchmarks also have some other weaknesses that should
> really be addressed, hopefully I'll have some time to write patches
> for some of this work.
>
How will you fix the measurement being a tight loop with the same
arguments, run only 32 times?
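What I would expect is a loop that varies the arguments. As a rough sketch
(not a patch against benchtests; the pool size, iteration count and size
distribution are arbitrary), pick length, position and relative alignment
pseudo-randomly over a pool bigger than the last-level cache:

/* Sketch of a memcpy benchmark loop with varied arguments.
   POOL_SIZE, ITERS and MAX_COPY are arbitrary choices.  */
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define POOL_SIZE (64 * 1024 * 1024)  /* bigger than last-level cache */
#define ITERS 1000000
#define MAX_COPY 4096

int
main (void)
{
  char *pool = malloc (POOL_SIZE);
  struct timespec a, b;
  uint32_t state = 12345;             /* fixed seed, runs stay comparable */
  double secs;
  int i;

  if (pool == NULL)
    return 1;
  memset (pool, 1, POOL_SIZE);

  clock_gettime (CLOCK_MONOTONIC, &a);
  for (i = 0; i < ITERS; i++)
    {
      size_t len, src, dst;

      /* Cheap xorshift PRNG so generating arguments stays out of the
         measurement.  */
      state ^= state << 13;
      state ^= state >> 17;
      state ^= state << 5;

      len = (state >> 16) % MAX_COPY;
      src = state % (POOL_SIZE - 3 * MAX_COPY);
      dst = src + MAX_COPY + (state % 64);   /* varied relative alignment */
      memcpy (pool + dst, pool + src, len);
    }
  clock_gettime (CLOCK_MONOTONIC, &b);

  secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
  printf ("%d copies, %.3f s, %.1f ns/copy\n", ITERS, secs,
          secs * 1e9 / ITERS);
  free (pool);
  return 0;
}

Varying the inputs like this at least stops the branch predictor and L1
from seeing the exact same call 32 times in a row.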