From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-ports-return-4429-listarch-libc-ports=sources.redhat.com@sourceware.org>
Received: (qmail 25397 invoked by alias); 3 Sep 2013 20:56:34 -0000
Mailing-List: contact libc-ports-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-ports.sourceware.org>
List-Subscribe: <mailto:libc-ports-subscribe@sourceware.org>
List-Post: <mailto:libc-ports@sourceware.org>
List-Help: <mailto:libc-ports-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-ports-owner@sourceware.org
Received: (qmail 25384 invoked by uid 89); 3 Sep 2013 20:56:33 -0000
Received: from mail-we0-f174.google.com (HELO mail-we0-f174.google.com) (74.125.82.174) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-SHA encrypted) ESMTPS; Tue, 03 Sep 2013 20:56:33 +0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.3 required=5.0 tests=AWL,BAYES_05,FREEMAIL_FROM,KHOP_THREADED,NO_RELAYS autolearn=ham version=3.3.2
X-HELO: mail-we0-f174.google.com
Received: by mail-we0-f174.google.com with SMTP id q54so5247675wes.19        for <libc-ports@sourceware.org>; Tue, 03 Sep 2013 13:56:29 -0700 (PDT)
MIME-Version: 1.0
X-Received: by 10.194.174.36 with SMTP id bp4mr28790620wjc.7.1378241789739; Tue, 03 Sep 2013 13:56:29 -0700 (PDT)
Received: by 10.216.179.5 with HTTP; Tue, 3 Sep 2013 13:56:29 -0700 (PDT)
In-Reply-To: <52263E63.2080301@redhat.com>
References: <520894D5.7060207@linaro.org>	<CANu=DmiBHoymFKTvaW_VsdhWZEYwkfViz1tTeRgj7H80f0FntA@mail.gmail.com>	<5220D30B.9080306@redhat.com>	<CANu=DmiXLL9v1Z1KS0sBOs-pL8csEUGc9YE829_-tidKd-GruQ@mail.gmail.com>	<5220F1F0.80501@redhat.com>	<CANu=DmhA9QvSe6RS72Db2P=yyjC72fsE8d4QZKHEcNiwqxNMvw@mail.gmail.com>	<52260BD0.6090805@redhat.com>	<CAAKybw99YcSoyU58w2iqHGRTQpajAtKX6JZp=r57bT37fjvQ2Q@mail.gmail.com>	<52263E63.2080301@redhat.com>
Date: Tue, 03 Sep 2013 20:56:00 -0000
Message-ID: <CAAKybw_7VE3zYM1Vb4sfE-HRMMdCx2E9Obf45_11=bGjVZXeJQ@mail.gmail.com>
Subject: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
From: "Ryan S. Arnold" <ryan.arnold@gmail.com>
To: "Carlos O'Donell" <carlos@redhat.com>
Cc: Will Newton <will.newton@linaro.org>, 	"libc-ports@sourceware.org" <libc-ports@sourceware.org>, Patch Tracking <patches@linaro.org>, 	=?UTF-8?B?T25kxZllaiBCw61sa2E=?= <neleai@seznam.cz>, 	Siddhesh Poyarekar <siddhesh@redhat.com>
Content-Type: text/plain; charset=UTF-8
X-IsSubscribed: yes
X-SW-Source: 2013-09/txt/msg00029.txt.bz2

On Tue, Sep 3, 2013 at 2:54 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> The current set of performance preconditions are baked into the experience
> of the core developers reviewing patches. I want the experts out of the
> loop.
>

This is the crux.

Developers working for the CPU manufacturers are privy to a lot of
unpublished timing, penalty/hazard information, as well as proprietary
pipeline analysis tools.

Will "J. Random Hacker" working for MegaCorp tell you that the reason
he's chosen a particular instruction sequence is because the system
he's working on has a tiny branch cache (the size of which might be
unpublished)?

>> PowerPC has had the luxury of not having their performance
>> pre-conditions contested.  PowerPC string performance is optimized
>> based upon customer data-set analysis.  So PowerPC's preconditions are
>> pretty concrete...  Optimize for aligned data in excess of 128-bytes
>> (I believe).
>
> We should be documenting this somewhere, preferably in a Power-specific
> test that looks at just this kind of issue.

I might be mistaken, but I think you'll find these preconditions
explicitly defined in the string function implementation source files
for PowerPC.

> Documenting this statically is the first, in my opinion, stepping stone
> to having something like dynamic feedback.

Absolutely!

>> Unless technology evolves that you can statistically analyze data in
>> real time and adjust the implementation based on what you find (an
>> implementation with a different set of preconditions) to account for
>> this you're going to end up with a lot of in-fighting over
>> performance.
>
> Why do you assume we'll have a lot of in-fighting over performance?

I'm projecting here.  If someone proposed to adjust the PowerPC
optimized string functions to their own preconditions and it
drastically changed the performance for existing or future
customers, you'd see me panic.

> At present we've split the performance intensive (or so we believe)
> routines on a per-machine basis. The arguments are then going to be
> had only on a per-machine basis, and even then for each hardware
> variant can have an IFUNC resolver select the right routine at
> runtime.

Right, selecting the right variant with IFUNC has certainly helped
platforms that didn't use optimized libraries.  This is the low
hanging fruit.  So now our concern is the proliferation of micro-tuned
variants and a lack of qualified eyes to objectively review the
patches.

> Then we come upon the tunables that should allow some dynamic adjustment
> of an algorithm based on realtime data.

Yes, you can do this with tunables if the developer knows something
about the data (more about that later).

>> I've run into situations where I recommended that a customer code
>> their own string function implementation because they continually
>> encountered unaligned-data when copying-by-value in C++ functions and
>> PowerPC's string function implementations penalized unaligned copies
>> in preference for aligned copies.
>
> Provide both in glibc and expose a tunable?

So do we (the glibc community) no longer consider the proliferation of
tunables to be a mortal sin?  Or was that only with regard to
configuration options?  Regardless, it still burdens the Linux
distributions and developers who have to provide QA.

If tunables are available, then trial-and-error would help where a
user doesn't know the particulars of his application's data usage.
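
To make the trial-and-error idea concrete, here is a minimal sketch of
a tunable that picks between an aligned-preferring and an
unaligned-tolerant copy routine.  Everything named here is an
assumption for illustration only: the environment variable
HYPOTHETICAL_COPY_VARIANT, the function names, and the use of getenv()
as the tunable mechanism are not existing glibc interfaces, and both
"variants" simply defer to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical signature shared by both copy variants. */
typedef void *(*tunable_copy_fn)(void *dst, const void *src, size_t n);

/* Stand-ins for two differently tuned routines.  Real variants would
   differ in how they treat misaligned source/destination pointers. */
static void *copy_prefer_aligned(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

static void *copy_prefer_unaligned(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Map a tunable string to a variant.  NULL or an unrecognized value
   falls back to the aligned-preferring default. */
static tunable_copy_fn copy_from_tunable(const char *value)
{
    if (value != NULL && strcmp(value, "unaligned") == 0)
        return copy_prefer_unaligned;
    return copy_prefer_aligned;
}

/* The routine callers go through; set once, then never re-tested. */
static tunable_copy_fn tuned_copy = copy_prefer_aligned;

/* Intended to run once at startup, so the tunable is not parsed on
   every call.  HYPOTHETICAL_COPY_VARIANT is a made-up name. */
static void init_copy_tunable(void)
{
    tuned_copy = copy_from_tunable(getenv("HYPOTHETICAL_COPY_VARIANT"));
    assert(tuned_copy != NULL);
}
```

A user who doesn't know his data's alignment profile could then run
the application once with the variable unset and once with it set to
"unaligned" and compare timings.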

Using tunables is potentially problematic as well.  Often testing a
condition in highly optimized code is enough to obviate the
performance benefit you're attempting to provide. Checking for feature
availability might consume enough cycles to make it senseless to use
the facility itself.  I believe this is what happened in the early
days trying to use VMX in string routines.

Additionally, while dynamically linked applications won't suffer from
using IFUNC resolved functions (because of mandatory PLT usage), glibc
internal usage of IFUNC resolved functions very likely will if/when
forced to go through the PLT, especially on systems like PowerPC where
indirect branching is more expensive than direct branching.  When
Adhemerval's PowerPC IFUNC patches go in I'll probably argue for
keeping a 'generic' optimized version for internal libc usage.  We'll
see how it all works together.

So using tunables alone isn't necessarily a win unless it's coupled
with IFUNC.  But using IFUNC also isn't a guaranteed win in all cases.

For external usage, using IFUNC in combination with a tunable should
be beneficial.  For instance, on systems that don't have a concrete
cacheline size (e.g., the A2 processor), at process initialization we
query the system cacheline size, populate a static with the size, and
then the string routines will query that size at runtime.  It'd be
nice to do that query at initialization and then pre-select an
implementation based on cacheline size so we don't have to test for
the cacheline size each time through the string function.
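
A rough sketch of that pre-selection, using a plain function pointer
in place of a real IFUNC resolver so it stays portable.  The names
(init_line_copy, copy_line64, and so on) are hypothetical, the
variants all defer to memcpy(), and the cacheline size is passed in
as a parameter rather than queried, since how it is obtained is
platform-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature shared by all cacheline-specific variants. */
typedef void *(*line_copy_fn)(void *dst, const void *src, size_t n);

/* Stand-ins for routines tuned to one cacheline size each.  Real
   variants would hard-code the stride instead of loading it. */
static void *copy_line64(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

static void *copy_line128(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Fallback for sizes without a dedicated variant. */
static void *copy_line_generic(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Map the reported cacheline size to a variant. */
static line_copy_fn select_line_copy(long cacheline_size)
{
    switch (cacheline_size) {
    case 64:  return copy_line64;
    case 128: return copy_line128;
    default:  return copy_line_generic;
    }
}

static line_copy_fn selected_line_copy = copy_line_generic;

/* Run once at process initialization with the system's cacheline
   size; after this the hot path never tests the size again. */
static void init_line_copy(long cacheline_size)
{
    selected_line_copy = select_line_copy(cacheline_size);
    assert(selected_line_copy != NULL);
}

/* Entry point for callers: one indirect call, no size test. */
static void *line_copy(void *dst, const void *src, size_t n)
{
    return selected_line_copy(dst, src, n);
}
```

The indirect call here is exactly the cost an IFUNC-resolved symbol
pays through the PLT, so this only wins if it is cheaper than the
per-call size test it replaces.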

This of course increases the cost of maintaining the string routines
by creating a myriad of combinations.

These are all the trade-offs we weigh.

Ryan