From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-ports-return-4442-listarch-libc-ports=sources.redhat.com@sourceware.org>
Received: (qmail 4412 invoked by alias); 5 Sep 2013 11:07:05 -0000
Mailing-List: contact libc-ports-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-ports.sourceware.org>
List-Subscribe: <mailto:libc-ports-subscribe@sourceware.org>
List-Post: <mailto:libc-ports@sourceware.org>
List-Help: <mailto:libc-ports-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-ports-owner@sourceware.org
Received: (qmail 4400 invoked by uid 89); 5 Sep 2013 11:07:04 -0000
Received: from popelka.ms.mff.cuni.cz (HELO popelka.ms.mff.cuni.cz) (195.113.20.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 05 Sep 2013 11:07:04 +0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-0.7 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,SPF_NEUTRAL autolearn=no version=3.3.2
X-HELO: popelka.ms.mff.cuni.cz
Received: from domone.kolej.mff.cuni.cz (popelka.ms.mff.cuni.cz [195.113.20.131])	by popelka.ms.mff.cuni.cz (Postfix) with ESMTPS id 38A4A5036E;	Thu,  5 Sep 2013 13:06:58 +0200 (CEST)
Received: by domone.kolej.mff.cuni.cz (Postfix, from userid 1000)	id 128135F822; Thu,  5 Sep 2013 13:06:58 +0200 (CEST)
Date: Thu, 05 Sep 2013 11:07:00 -0000
From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= <neleai@seznam.cz>
To: "Ryan S. Arnold" <ryan.arnold@gmail.com>
Cc: Siddhesh Poyarekar <siddhesh@redhat.com>,	Carlos O'Donell <carlos@redhat.com>,	Will Newton <will.newton@linaro.org>,	"libc-ports@sourceware.org" <libc-ports@sourceware.org>,	Patch Tracking <patches@linaro.org>
Subject: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
Message-ID: <20130905110657.GB5401@domone.kolej.mff.cuni.cz>
References: <CANu=DmiXLL9v1Z1KS0sBOs-pL8csEUGc9YE829_-tidKd-GruQ@mail.gmail.com> <5220F1F0.80501@redhat.com> <CANu=DmhA9QvSe6RS72Db2P=yyjC72fsE8d4QZKHEcNiwqxNMvw@mail.gmail.com> <52260BD0.6090805@redhat.com> <20130903173710.GA2028@domone.kolej.mff.cuni.cz> <522621E2.6020903@redhat.com> <20130903185721.GA3876@domone.kolej.mff.cuni.cz> <5226354D.8000006@redhat.com> <20130904073008.GA4306@spoyarek.pnq.redhat.com> <CAAKybw87cyx67bpX=qjedrfjKxDmtgOfi_zCiaCfHGgx328Bsw@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAAKybw87cyx67bpX=qjedrfjKxDmtgOfi_zCiaCfHGgx328Bsw@mail.gmail.com>
User-Agent: Mutt/1.5.20 (2009-06-14)
X-IsSubscribed: yes
X-SW-Source: 2013-09/txt/msg00042.txt.bz2

On Wed, Sep 04, 2013 at 12:35:46PM -0500, Ryan S. Arnold wrote:
> On Wed, Sep 4, 2013 at 2:30 AM, Siddhesh Poyarekar <siddhesh@redhat.com> wrote:
> > 3. Provide acceptable performance for unaligned sizes without
> >    penalizing the aligned case
> 
> There are cases where the user can't control the alignment of the data
> being fed into string functions, and we shouldn't penalize them for
> these situations if possible, but in reality if a string routine shows
> up hot in a profile this is a likely culprit and there's not much that
> can be done once the unaligned case is made as stream-lined as
> possible.
> 
> Simply testing for alignment (not presuming aligned data) itself slows
> down the processing of aligned-data, but that's an unavoidable
> reality.

How expensive are unaligned loads on powerpc?  On x64 the penalty for
using them is smaller than that of the alternatives (increased branch
misprediction...).
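
To make the comparison concrete, here is a minimal sketch of the
"just use an unaligned load" approach, as opposed to branching on the
alignment of the pointer. The helper names load_u64 and has_zero_byte
are made up for this mail and are not from glibc; whether the memcpy
becomes a single load instruction depends on the compiler and target.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 8 bytes from any address, aligned or not.
   Compilers commonly turn this fixed-size memcpy into one load on
   targets that allow unaligned access (an assumption, not a guarantee). */
static inline uint64_t load_u64(const void *p)
{
  uint64_t v;
  memcpy(&v, p, sizeof v);
  return v;
}

/* Hypothetical helper: nonzero if any of the 8 bytes at p is zero.
   There is no branch on the alignment of p. */
static inline int has_zero_byte(const void *p)
{
  uint64_t v = load_u64(p);
  return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}
```

The caller must still make sure all 8 bytes are readable; the sketch
only removes the alignment test, not the bounds question.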

>  I've chatted with some compiler folks about the possibility
> of branching directly to aligned case labels in string routines if the
> compiler is able to detect aligned data.. but was informed that this
> suggestion might get me burned at the stake.
> 
You would need to improve gcc's detection of alignment first. Right now
gcc misses most of the opportunities; even in the following code gcc emits
redundant alignment checks:

#include <stdint.h>
char *foo(long *x){
 if (((uintptr_t)x)%16)
  return (char *)(x+4);
 else {
  /* x is known to be 16-byte aligned on this path */
  __builtin_memset(x,0,512);
  return (char *)x;
 }
}

If the gcc guys fix that then we do not have to ask them for anything. We
could just change the headers to recognize the aligned case, like

#define strchr(x,c) ({ char *__x = (x);\
  (__builtin_constant_p(((uintptr_t)__x)%16) && ((uintptr_t)__x)%16 == 0)\
    ? strchr_aligned(__x,c)\
    : strchr(__x,c);\
})
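
Separately from the header approach, a caller who already knows the
alignment can state it explicitly with gcc's __builtin_assume_aligned.
A minimal sketch follows; clear_block is a name made up for this mail,
and whether gcc actually drops the alignment checks in the expanded
memset depends on the gcc version and target.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical caller: promises the compiler that buf is 16-byte
   aligned.  Passing a pointer that is not so aligned is undefined. */
static inline void clear_block(void *buf)
{
  void *p = __builtin_assume_aligned(buf, 16);
  memset(p, 0, 512);
}
```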

> > 4. Measure the effect of dcache pressure on function performance
> > 5. Measure effect of icache pressure on function performance.
> >
> > Depending on the actual cost of cache misses on different processors,
> > the icache/dcache miss cost would either have higher or lower weight
> > but for 1-3, I'd go in that order of priorities with little concern
> > for unaligned cases.
> 
> I know that icache and dcache miss penalty/costs are known for most
> architectures but not whether they're "published".  I suppose we can,
> at least, encourage developers for the CPU manufacturers to indicate
> in the documentation of preconditions which is more expensive,
> relative to the other if they're unable to indicate the exact costs of
> these misses.
>
These costs are relatively difficult to describe; take strlen on main
memory as an example.
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strlen_profile/results_rand_nocache/result.html

Here we see the hardware prefetcher in action. Time grows linearly with
size up to 512 bytes, then stays constant up to 4096 bytes (switch to
block view), after which it starts increasing at a slower rate.
 
For core2 the shape is similar, except that the plateau starts at 256 bytes
and ends at 1024 bytes.
http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/strlen_profile/results_rand_nocache/result.html

AMD processors are different: phenomII performance is linear, and for fx10
there is even an area where time decreases with size.
http://kam.mff.cuni.cz/~ondra/benchmark_string/phenomII/strlen_profile/results_rand_nocache/result.html 
http://kam.mff.cuni.cz/~ondra/benchmark_string/fx10/strlen_profile/results_rand_nocache/result.html
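
For anyone who wants to look at the time-versus-size shape on their
own machine, a rough sketch of a single measurement is below. It is
not the harness behind the linked results: time_strlen_ns is a made-up
helper, it times one call on a buffer that was just written (so the
data is likely still in cache), and a single sample is noisy.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#endif
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: time one strlen call on a string of `size`
   bytes.  Stores the measured length in *len_out and returns the
   elapsed nanoseconds, or -1 if allocation fails. */
static long time_strlen_ns(size_t size, size_t *len_out)
{
  char *s = malloc(size + 1);
  if (s == NULL)
    return -1;
  memset(s, 'a', size);
  s[size] = '\0';

  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  volatile size_t len = strlen(s);   /* volatile keeps the call alive */
  clock_gettime(CLOCK_MONOTONIC, &t1);

  *len_out = len;
  free(s);
  return (long)(t1.tv_sec - t0.tv_sec) * 1000000000L
         + (t1.tv_nsec - t0.tv_nsec);
}
```

Calling it for sizes such as 256, 512, 1024 and 4096 and repeating each
size many times would give a warm-cache curve; measuring uncached data
needs a much larger working set than this sketch uses.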