From: Ondřej Bílka
To: "Ryan S. Arnold"
Cc: Siddhesh Poyarekar, Carlos O'Donell, Will Newton,
    "libc-ports@sourceware.org", Patch Tracking
Date: Thu, 05 Sep 2013 11:07:00 -0000
Subject: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
Message-ID: <20130905110657.GB5401@domone.kolej.mff.cuni.cz>
References: <5220F1F0.80501@redhat.com> <52260BD0.6090805@redhat.com>
    <20130903173710.GA2028@domone.kolej.mff.cuni.cz> <522621E2.6020903@redhat.com>
    <20130903185721.GA3876@domone.kolej.mff.cuni.cz> <5226354D.8000006@redhat.com>
    <20130904073008.GA4306@spoyarek.pnq.redhat.com>
Mailing-List: contact libc-ports-help@sourceware.org; run by ezmlm

On Wed, Sep 04, 2013 at 12:35:46PM -0500, Ryan S.
Arnold wrote:
> On Wed, Sep 4, 2013 at 2:30 AM, Siddhesh Poyarekar wrote:
> > 3. Provide acceptable performance for unaligned sizes without
> > penalizing the aligned case
>
> There are cases where the user can't control the alignment of the data
> being fed into string functions, and we shouldn't penalize them for
> these situations if possible, but in reality if a string routine shows
> up hot in a profile this is a likely culprit and there's not much that
> can be done once the unaligned case is made as stream-lined as
> possible.
>
> Simply testing for alignment (not presuming aligned data) itself slows
> down the processing of aligned-data, but that's an unavoidable
> reality.

How expensive are unaligned loads on powerpc? On x64 the penalty for
using them is smaller than the cost of the alternatives (increased
branch misprediction...).

> I've chatted with some compiler folks about the possibility
> of branching directly to aligned case labels in string routines if the
> compiler is able to detect aligned data.. but was informed that this
> suggestion might get me burned at the stake.
>
You would need to improve gcc's detection of alignment first. Right now
gcc misses most of the opportunities; even in the following code gcc
issues redundant alignment checks:

#include <stdint.h>

char *foo (long *x)
{
  if (((uintptr_t) x) % 16)
    return (char *) (x + 4);
  else
    {
      /* x is known to be 16-byte aligned on this path.  */
      __builtin_memset (x, 0, 512);
      return (char *) x;
    }
}

If the gcc guys fix that, then we do not have to ask them for anything.
We could just change the headers to recognize the aligned case, like

/* The test must be an expression so that the macro yields a value,
   and the remainder must be parenthesized before negating it.  */
#define strchr(x,c) ({ char *__x = (x); \
  (__builtin_constant_p (((uintptr_t) __x) % 16) \
   && !(((uintptr_t) __x) % 16)) \
    ? strchr_aligned (__x, c) \
    : strchr (__x, c); \
})

> > 4. Measure the effect of dcache pressure on function performance
> > 5. Measure effect of icache pressure on function performance.
> >
> > Depending on the actual cost of cache misses on different processors,
> > the icache/dcache miss cost would either have higher or lower weight
> > but for 1-3, I'd go in that order of priorities with little concern
> > for unaligned cases.
>
> I know that icache and dcache miss penalty/costs are known for most
> architectures but not whether they're "published". I suppose we can,
> at least, encourage developers for the CPU manufacturers to indicate
> in the documentation of preconditions which is more expensive,
> relative to the other if they're unable to indicate the exact costs of
> these misses.
>
These costs are relatively difficult to describe; take strlen on main
memory as an example.

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strlen_profile/results_rand_nocache/result.html

Here we see the hardware prefetcher in action. Time grows linearly with
size until 512 bytes and stays constant until 4096 bytes (switch to
block view), where it starts increasing again at a slower rate.

For core2 the shape is similar, except that the plateau starts at 256
bytes and ends at 1024 bytes.

http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/strlen_profile/results_rand_nocache/result.html

AMD processors are different: phenomII performance is linear, and for
fx10 there is even a region where time decreases with size.

http://kam.mff.cuni.cz/~ondra/benchmark_string/phenomII/strlen_profile/results_rand_nocache/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/fx10/strlen_profile/results_rand_nocache/result.html
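
For anyone who wants to reproduce the general shape of such a curve, here
is a minimal sketch of a size sweep. It is not the benchmark behind the
graphs linked above. The names measure_strlen and struct strlen_sample are
hypothetical, and the idea of approximating the "nocache" condition by
visiting strings in pseudo-random order inside a pool larger than the
last-level cache is an assumption of this sketch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical result type: not part of glibc or of the linked benchmark.  */
struct strlen_sample
{
  double ns_per_call;   /* average wall-clock nanoseconds per strlen call */
  size_t total_len;     /* sum of returned lengths; keeps the calls alive */
};

/* Time CALLS strlen calls on strings of length LEN placed back to back in
   a pool of about POOL_BYTES bytes.  Strings are visited in pseudo-random
   order, so with a pool larger than the cache most calls should start on
   uncached data (an assumption; nothing here flushes the cache).  */
struct strlen_sample
measure_strlen (size_t len, size_t pool_bytes, size_t calls)
{
  struct strlen_sample s = { 0.0, 0 };
  size_t slot = len + 1;                 /* string plus terminator */
  size_t nslots = pool_bytes / slot;
  if (nslots == 0 || calls == 0)
    return s;

  char *pool = malloc (nslots * slot);
  if (pool == NULL)
    return s;
  memset (pool, 'a', nslots * slot);
  for (size_t i = 0; i < nslots; i++)
    pool[i * slot + len] = '\0';

  uint64_t state = 88172645463325252ULL; /* xorshift64 seed */
  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (size_t i = 0; i < calls; i++)
    {
      state ^= state << 13;
      state ^= state >> 7;
      state ^= state << 17;
      s.total_len += strlen (pool + (state % nslots) * slot);
    }
  clock_gettime (CLOCK_MONOTONIC, &t1);

  /* Sanity check: every call must have returned LEN.  */
  assert (s.total_len == len * calls);

  double ns = (double) (t1.tv_sec - t0.tv_sec) * 1e9
              + (double) (t1.tv_nsec - t0.tv_nsec);
  s.ns_per_call = ns / (double) calls;
  free (pool);
  return s;
}
```

Calling measure_strlen for len = 16, 32, ..., 8192 with a pool of a few
hundred megabytes and printing ns_per_call against len gives a curve that
can be compared with the graphs above. The timing includes the cost of
the random number step, and the absolute numbers will depend on the
machine.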