From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact libc-ports-help@sourceware.org; run by ezmlm
Sender: libc-ports-owner@sourceware.org
In-Reply-To: <20130904073008.GA4306@spoyarek.pnq.redhat.com>
References: <5220D30B.9080306@redhat.com> <5220F1F0.80501@redhat.com> <52260BD0.6090805@redhat.com> <20130903173710.GA2028@domone.kolej.mff.cuni.cz> <522621E2.6020903@redhat.com> <20130903185721.GA3876@domone.kolej.mff.cuni.cz> <5226354D.8000006@redhat.com> <20130904073008.GA4306@spoyarek.pnq.redhat.com>
Date: Wed, 04 Sep 2013 17:35:00 -0000
Subject: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
From: "Ryan S. Arnold"
To: Siddhesh Poyarekar
Cc: "Carlos O'Donell", Ondřej Bílka, Will Newton, "libc-ports@sourceware.org", Patch Tracking
X-SW-Source: 2013-09/txt/msg00037.txt.bz2

On Wed, Sep 4, 2013 at 2:30 AM, Siddhesh Poyarekar wrote:

> 1. Assume aligned input.
> Nothing should take (any noticeable) performance away from aligned
> copies/moves
>
> 2. Scale with size

In my experience, scaling with data size isn't really possible beyond a
certain point.  We pick a target range of sizes to optimize for based on
customer feedback, and we try to use prefetching in that range as
efficiently as possible.  But I take your point: we don't want any
particular size to be severely penalized.

Each architecture and specific platform needs to know/decide what the
optimal range is and document it.  Even for Power we have different
expectations on server hardware like POWER7 than on embedded hardware
like the PPC 476.

> 3. Provide acceptable performance for unaligned sizes without
> penalizing the aligned case

There are cases where the user can't control the alignment of the data
being fed into string functions, and we shouldn't penalize them for those
situations if possible.  In reality, though, if a string routine shows up
hot in a profile this is a likely culprit, and there's not much that can
be done once the unaligned case is made as streamlined as possible.

Simply testing for alignment (rather than presuming aligned data) itself
slows down the processing of aligned data, but that's an unavoidable
reality.  I've chatted with some compiler folks about the possibility of
branching directly to aligned-case labels in string routines when the
compiler can prove the data is aligned, but was informed that this
suggestion might get me burned at the stake.  As previously discussed, we
might be able to use tunables in the future to mitigate this.  But of
course, that would be 'use at your own risk'.

> 4. Measure the effect of dcache pressure on function performance
> 5. Measure effect of icache pressure on function performance.
>
> Depending on the actual cost of cache misses on different processors,
> the icache/dcache miss cost would either have higher or lower weight
> but for 1-3, I'd go in that order of priorities with little concern
> for unaligned cases.
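To make the alignment-test cost concrete, here is a minimal, hypothetical C
sketch (not glibc's actual implementation, and the function name is made up):
the dispatch at the top of a copy routine that every caller pays for, even
when both pointers turn out to be aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: branch on alignment before choosing a fast or
   slow copy path.  The OR-and-mask test costs a couple of instructions
   on every call, aligned or not. */
void *copy_dispatch(void *dst, const void *src, size_t n)
{
    uintptr_t misalign = ((uintptr_t)dst | (uintptr_t)src)
                         & (sizeof(long) - 1);

    if (misalign == 0) {
        /* Aligned fast path: copy a word at a time. */
        long *d = dst;
        const long *s = src;
        size_t words = n / sizeof(long);
        for (size_t i = 0; i < words; i++)
            d[i] = s[i];
        /* Copy any tail bytes. */
        memcpy((char *)dst + words * sizeof(long),
               (const char *)src + words * sizeof(long),
               n % sizeof(long));
    } else {
        /* Unaligned slow path: byte-by-byte copy. */
        char *d = dst;
        const char *s = src;
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dst;
}
```

If the compiler could prove alignment at the call site, it could in principle
branch straight past the test to the fast path; that is the idea mentioned
above.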
I know that icache and dcache miss penalties/costs are known for most
architectures, but not whether they're "published".  I suppose we can at
least encourage developers at the CPU manufacturers to indicate in the
documentation of preconditions which miss is more expensive relative to
the other, if they're unable to indicate the exact costs.

Some further thoughts (just to get this stuff documented): some
performance regressions I'm familiar with (on Power), which CAN be
measured with a baseline micro-benchmark regardless of use-case:

1. Hazards/penalties - I'm thinking of things like a load-hit-store in
the tail of a loop, e.g.: load a value from memory, do work, store back
to the same location, branch to the top of the loop, and take a stall
when the value at the top of the loop isn't ready to load.

2. Dispatch grouping - Some instructions need to be first-in-group, etc.
Grouping is also based on instruction alignment; at least on Power I
believe some instructions benefit from specific alignment.

3. Instruction grouping - Depending on the topology of the pipeline,
specific groupings of instructions might incur pipeline stalls due to
unavailability of the load/store unit (for instance).

4. Facility usage costs - Sometimes using certain facilities for certain
sizes of data is more costly than not using the facility at all.  For
instance, I believe that using the DFPU on Power requires the
floating-point pipeline to be flushed, so BFP and DFP really shouldn't
be used together.  I believe there is a powerpc32 string function which
uses FPRs because they're 64 bits wide even on ppc32, but we measured
the cost/benefit ratio of using this vs. not.

On Power, micro-benchmarks are run in-house with these (and many other)
factors in mind.

Ryan
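P.S. For anyone unfamiliar with the load-hit-store pattern in item 1, a
minimal C sketch (hypothetical, not taken from any glibc routine; both
function names are made up).  The first loop reloads at the top of each
iteration a value it stored at the bottom of the previous one, which on
pipelines with slow store-to-load forwarding can stall; the second keeps
the accumulator in a register and stores once:

```c
#include <stddef.h>

/* Hypothetical illustration of a load-hit-store hazard: the load at the
   top of the loop targets the same memory the previous iteration just
   stored to, so it may stall waiting for the store to drain. */
long accumulate_with_hazard(long *slot, const long *data, size_t n)
{
    *slot = 0;
    for (size_t i = 0; i < n; i++) {
        long v = *slot;   /* load: may hit the store below */
        v += data[i];     /* do work */
        *slot = v;        /* store to the same location */
    }
    return *slot;
}

/* Hazard-free version: accumulate in a register, store once at the end. */
long accumulate_no_hazard(long *slot, const long *data, size_t n)
{
    long v = 0;
    for (size_t i = 0; i < n; i++)
        v += data[i];
    *slot = v;
    return v;
}
```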