From: "Ondřej Bílka" <firstname.lastname@example.org>
To: "Ryan S. Arnold" <email@example.com>
Cc: Carlos O'Donell <firstname.lastname@example.org>,
Will Newton <email@example.com>,
Patch Tracking <firstname.lastname@example.org>,
Siddhesh Poyarekar <email@example.com>
Subject: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
Date: Tue, 03 Sep 2013 23:29:00 -0000
Message-ID: <20130903232918.GB7148@domone.kolej.mff.cuni.cz>
On Tue, Sep 03, 2013 at 03:56:29PM -0500, Ryan S. Arnold wrote:
> On Tue, Sep 3, 2013 at 2:54 PM, Carlos O'Donell <firstname.lastname@example.org> wrote:
> > The current set of performance preconditions are baked into the experience
> > of the core developers reviewing patches. I want the experts out of the
> > loop.
> This is the clutch.
> Developers working for the CPU manufacturers are privy to a lot of
> unpublished timing, penalty/hazard information, as well as proprietary
> pipeline analysis tools.
One problem with that information on x64 is how complete it is.
At least based on reading the AMD optimization manuals, around half of the
advice is inaccurate, so I need to check whether it holds anyway.
> > At present we've split the performance intensive (or so we believe)
> > routines on a per-machine basis. The arguments are then going to be
> > had only on a per-machine basis, and even then for each hardware
> > variant can have an IFUNC resolver select the right routine at
> > runtime.
> Right, selecting the right variant with IFUNC has certainly helped
> platforms that didn't use optimized libraries. This is the low
> hanging fruit. So now our concern is the proliferation of micro-tuned
> variants and a lack of qualified eyes to objectively review the
We are not there yet, at least for AMD, where ifunc often selects the
slowest implementation. When you look at the x64 benchmarks, the relation
between the asymptotically best implementation and the selected one is
quite random.
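To make the selection step concrete, here is a minimal sketch of what a resolver does, written as a portable function-pointer dispatch rather than glibc's actual IFUNC machinery. The feature bits, the variant names and `dispatch_init` are hypothetical and exist only for illustration; the variants simply forward to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical feature bits; a real resolver would read CPUID or hwcap. */
enum { FEAT_FAST_UNALIGNED = 1 << 0, FEAT_WIDE_VECTORS = 1 << 1 };

typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Stand-ins for tuned variants; each just forwards to memcpy here. */
static void *copy_generic(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
static void *copy_unaligned(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
static void *copy_vector(void *d, const void *s, size_t n) { return memcpy(d, s, n); }

/* Maps feature bits to one variant, the way a resolver would. */
static memcpy_fn resolve_copy(unsigned features)
{
    if (features & FEAT_WIDE_VECTORS)
        return copy_vector;
    if (features & FEAT_FAST_UNALIGNED)
        return copy_unaligned;
    return copy_generic;
}

static memcpy_fn selected; /* filled in once at startup */

static void dispatch_init(unsigned features)
{
    selected = resolve_copy(features);
}

/* Every later call pays one indirect branch and no feature test. */
static void *dispatch_copy(void *d, const void *s, size_t n)
{
    assert(selected != NULL); /* dispatch_init must have run */
    return selected(d, s, n);
}
```

The priority order inside `resolve_copy` is exactly the part that goes wrong when the selected variant is not the one the benchmarks favour, so that mapping is what needs checking against measurements.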
> > Then we come upon the tunables that should allow some dynamic adjustment
> > of an algorithm based on realtime data.
> Yes, you can do this with tunables if the developer knows something
> about the data (more about that later).
> >> I've run into situations where I recommended that a customer code
> >> their own string function implementation because they continually
> >> encountered unaligned-data when copying-by-value in C++ functions and
> >> PowerPC's string function implementations penalized unaligned copies
> >> in preference for aligned copies.
> > Provide both in glibc and expose a tunable?
> So do we (the glibc community) no longer consider the proliferation of
> tunables to be a mortal sin? Or was that only with regard to
> configuration options? Regardless, it still burdens the Linux
> distributions and developers who have to provide QA.
> If tunables are available, then trial-and-error would help where a
> user doesn't know the particulars of his application's data usage.
Here auto-tuning, as described in the other mail, would give a better
result, unless their data usage does not fit any of our variants, in which
case a custom routine would be needed.
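This is not the scheme from the other mail, only the simplest form of the idea: time each candidate on input that looks like the application's data and keep the fastest. The candidate functions and `pick_fastest` below are hypothetical names, and clock() is a coarse timer, so treat it as a sketch under those assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Hypothetical candidate 1: forward to the library memcpy. */
static void *copy_libc(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);
}

/* Hypothetical candidate 2: a plain byte loop. */
static void *copy_bytewise(void *d, const void *s, size_t n)
{
    unsigned char *dp = d;
    const unsigned char *sp = s;
    while (n--)
        *dp++ = *sp++;
    return d;
}

/* Time `reps` calls of one candidate on the sample input. */
static clock_t time_variant(memcpy_fn f, void *dst, const void *src,
                            size_t n, int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        f(dst, src, n);
    return clock() - start;
}

/* Return the index of the fastest candidate for this input shape. */
static size_t pick_fastest(const memcpy_fn *cands, size_t ncands,
                           void *dst, const void *src, size_t n, int reps)
{
    assert(ncands > 0);
    size_t best = 0;
    clock_t best_time = time_variant(cands[0], dst, src, n, reps);
    for (size_t i = 1; i < ncands; i++) {
        clock_t t = time_variant(cands[i], dst, src, n, reps);
        if (t < best_time) {
            best_time = t;
            best = i;
        }
    }
    return best;
}
```

A real tuner would have to sample sizes and alignments from the running application instead of a single fixed buffer, otherwise it only reproduces a microbenchmark.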
> Using tunables is potentially problematic as well. Often testing a
> condition in highly optimized code is enough to obviate the
> performance benefit you're attempting to provide. Checking for feature
> availability might consume enough cycles to make it senseless to use
> the facility itself. I believe this is what happened in the early
> days trying to use VMX in string routines.
> Additionally, while dynamically linked applications won't suffer from
> using IFUNC resolved functions (because of mandatory PLT usage), glibc
> internal usage of IFUNC resolved functions very likely will if/when
> forced to go through the PLT, especially on systems like PowerPC where
> indirect branching is more expensive than direct branching. When
> Adhemerval's PowerPC IFUNC patches go in I'll probably argue for
> keeping a 'generic' optimized version for internal libc usage. We'll
> see how it all works together.
Those who are concerned about getting every bit of performance would
recompile everything with -march=target anyway. It would be nice to
select the internal version based on -march.
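A rough sketch of what that could look like, assuming the compiler's predefined target macros (`__AVX2__` and `__ARM_NEON` are set by GCC when the chosen -march enables those extensions). The names `internal_copy`, `internal_copy_impl` and `INTERNAL_COPY_NAME` are made up for this example, and the bodies are memcpy stand-ins rather than real tuned code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Exactly one stand-in implementation is compiled in, chosen by the
   target macros that -march implies. */
#if defined(__AVX2__)
# define INTERNAL_COPY_NAME "avx2"
static void *internal_copy_impl(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n); /* stand-in for an AVX2-tuned routine */
}
#elif defined(__ARM_NEON)
# define INTERNAL_COPY_NAME "neon"
static void *internal_copy_impl(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n); /* stand-in for a NEON-tuned routine */
}
#else
# define INTERNAL_COPY_NAME "generic"
static void *internal_copy_impl(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n); /* baseline routine */
}
#endif

/* Internal callers get a direct call: no PLT and no resolver. */
static void *internal_copy(void *d, const void *s, size_t n)
{
    assert(d != NULL && s != NULL);
    return internal_copy_impl(d, s, n);
}
```

Since the choice is fixed at build time, the binary is only correct on machines that really have the selected extension, which is the usual contract of -march.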
> So using tunables alone isn't necessarily a win unless it's coupled
> with IFUNC. But using IFUNC also isn't a guaranteed win in all cases.
> For external usage, using IFUNC in combination with a tunable should
> be beneficial. For instance, on systems that don't have a concrete
> cacheline size (e.g., the A2 processor), at process initialization we
> query the system cacheline size, populate a static with the size, and
> then the string routines will query that size at runtime. It'd be
> nice to do that query at initialization and then pre-select an
> implementation based on cacheline size so we don't have to test for
> the cacheline size each time through the string function.
> This of course increases the cost of maintaining the string routines
> by having myriad of combinations.
> These are all the trade-offs we weigh.
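The cacheline case can be sketched as follows, assuming the line size is handed to an init hook once at process start. The `zero_*` names are hypothetical, the three variants are memset stand-ins, and how the size is obtained from the system is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*zero_fn)(void *, size_t);

/* Stand-ins for routines tuned to one line size (hypothetical names). */
static void zero_line64(void *p, size_t n)  { memset(p, 0, n); }
static void zero_line128(void *p, size_t n) { memset(p, 0, n); }
static void zero_generic(void *p, size_t n) { memset(p, 0, n); }

static zero_fn zero_impl; /* chosen once, then only called */

/* Called once at process start with the size the system reported. */
static void zero_init(size_t cacheline)
{
    switch (cacheline) {
    case 64:
        zero_impl = zero_line64;
        break;
    case 128:
        zero_impl = zero_line128;
        break;
    default:
        zero_impl = zero_generic; /* unknown size: safe fallback */
        break;
    }
}

/* No line-size test on the hot path, only an indirect call. */
static void zero_block(void *p, size_t n)
{
    assert(zero_impl != NULL); /* zero_init must have run */
    zero_impl(p, n);
}
```

Each additional line size adds one more variant to maintain, which is the combinatorial cost mentioned in the quoted text.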