From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact libc-ports-help@sourceware.org; run by ezmlm
Sender: libc-ports-owner@sourceware.org
In-Reply-To: <20130904073008.GA4306@spoyarek.pnq.redhat.com>
References: <5220D30B.9080306@redhat.com> <5220F1F0.80501@redhat.com> <52260BD0.6090805@redhat.com> <20130903173710.GA2028@domone.kolej.mff.cuni.cz> <522621E2.6020903@redhat.com> <20130903185721.GA3876@domone.kolej.mff.cuni.cz> <5226354D.8000006@redhat.com> <20130904073008.GA4306@spoyarek.pnq.redhat.com>
Date: Wed, 04 Sep 2013 17:35:00 -0000
Subject: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
From: "Ryan S. Arnold"
To: Siddhesh Poyarekar
Cc: "Carlos O'Donell", Ondřej Bílka, Will Newton, "libc-ports@sourceware.org", Patch Tracking
X-SW-Source: 2013-09/txt/msg00037.txt.bz2

On Wed, Sep 4, 2013 at 2:30 AM, Siddhesh Poyarekar wrote:

> 1. Assume aligned input.
> Nothing should take (any noticeable) performance away from aligned
> copies/moves
>
> 2. Scale with size

In my experience, scaling with data size isn't really possible beyond a
certain point.  We pick a target range of sizes to optimize for based on
customer feedback, and we try to use prefetching in that range as
efficiently as possible.  But I take your point: we don't want any
particular size to be severely penalized.

Each architecture and specific platform needs to know/decide what the
optimal range is and document it.  Even for Power we have different
expectations on server hardware like POWER7 than on embedded hardware
like the PPC 476.

> 3. Provide acceptable performance for unaligned sizes without
> penalizing the aligned case

There are cases where the user can't control the alignment of the data
being fed into string functions, and we shouldn't penalize them for those
situations if possible.  In reality, though, if a string routine shows up
hot in a profile this is a likely culprit, and there's not much that can
be done once the unaligned case is made as streamlined as possible.

Simply testing for alignment (rather than presuming aligned data) itself
slows down the processing of aligned data, but that's an unavoidable
reality.  I've chatted with some compiler folks about the possibility of
branching directly to aligned-case labels in string routines when the
compiler can prove the data is aligned, but was informed that this
suggestion might get me burned at the stake.  As previously discussed, we
might be able to use tunables in the future to mitigate this.  But of
course, that would be 'use at your own risk'.

> 4. Measure the effect of dcache pressure on function performance
> 5. Measure effect of icache pressure on function performance.
>
> Depending on the actual cost of cache misses on different processors,
> the icache/dcache miss cost would either have higher or lower weight
> but for 1-3, I'd go in that order of priorities with little concern
> for unaligned cases.
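To make the alignment-test cost concrete, here is a minimal, hypothetical C
sketch (not glibc's actual implementation, and the function name is made up):
the dispatch at the top of a copy routine that every caller pays for, even
when both pointers turn out to be aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: branch on alignment before choosing a fast or
   slow copy path.  The OR-and-mask test costs a couple of instructions
   on every call, aligned or not. */
void *copy_dispatch(void *dst, const void *src, size_t n)
{
    uintptr_t misalign = ((uintptr_t)dst | (uintptr_t)src)
                         & (sizeof(long) - 1);

    if (misalign == 0) {
        /* Aligned fast path: copy a word at a time. */
        long *d = dst;
        const long *s = src;
        size_t words = n / sizeof(long);
        for (size_t i = 0; i < words; i++)
            d[i] = s[i];
        /* Copy any tail bytes. */
        memcpy((char *)dst + words * sizeof(long),
               (const char *)src + words * sizeof(long),
               n % sizeof(long));
    } else {
        /* Unaligned slow path: byte-by-byte copy. */
        char *d = dst;
        const char *s = src;
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dst;
}
```

If the compiler could prove alignment at the call site, it could in principle
branch straight past the test to the fast path; that is the idea mentioned
above.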
I know that icache and dcache miss penalties/costs are known for most
architectures, but not whether they're "published".  I suppose we can at
least encourage developers at the CPU manufacturers to indicate in the
documentation of preconditions which miss is more expensive relative to
the other, if they're unable to indicate the exact costs.

Some further thoughts (just to get this stuff documented): some
performance regressions I'm familiar with (on Power), which CAN be
measured with a baseline micro-benchmark regardless of use-case:

1. Hazards/penalties - I'm thinking of things like a load-hit-store in
the tail of a loop, e.g.: load a value from memory, do work, store back
to the same location, branch to the top of the loop, and take a stall
when the value at the top of the loop isn't ready to load.

2. Dispatch grouping - Some instructions need to be first-in-group, etc.
Grouping is also based on instruction alignment; at least on Power I
believe some instructions benefit from specific alignment.

3. Instruction grouping - Depending on the topology of the pipeline,
specific groupings of instructions might incur pipeline stalls due to
unavailability of the load/store unit (for instance).

4. Facility usage costs - Sometimes using certain facilities for certain
sizes of data is more costly than not using the facility at all.  For
instance, I believe that using the DFPU on Power requires the
floating-point pipeline to be flushed, so BFP and DFP really shouldn't
be used together.  I believe there is a powerpc32 string function which
uses FPRs because they're 64 bits wide even on ppc32, but we measured
the cost/benefit ratio of using this vs. not.

On Power, micro-benchmarks are run in-house with these (and many other)
factors in mind.

Ryan
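P.S. For anyone unfamiliar with the load-hit-store pattern in item 1, a
minimal C sketch (hypothetical, not taken from any glibc routine; both
function names are made up).  The first loop reloads at the top of each
iteration a value it stored at the bottom of the previous one, which on
pipelines with slow store-to-load forwarding can stall; the second keeps
the accumulator in a register and stores once:

```c
#include <stddef.h>

/* Hypothetical illustration of a load-hit-store hazard: the load at the
   top of the loop targets the same memory the previous iteration just
   stored to, so it may stall waiting for the store to drain. */
long accumulate_with_hazard(long *slot, const long *data, size_t n)
{
    *slot = 0;
    for (size_t i = 0; i < n; i++) {
        long v = *slot;   /* load: may hit the store below */
        v += data[i];     /* do work */
        *slot = v;        /* store to the same location */
    }
    return *slot;
}

/* Hazard-free version: accumulate in a register, store once at the end. */
long accumulate_no_hazard(long *slot, const long *data, size_t n)
{
    long v = 0;
    for (size_t i = 0; i < n; i++)
        v += data[i];
    *slot = v;
    return v;
}
```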