Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.

public inbox for libc-ports@sourceware.org

From: "Ryan S. Arnold" <ryan.arnold@gmail.com>
To: Siddhesh Poyarekar <siddhesh@redhat.com>
Cc: "Carlos O'Donell" <carlos@redhat.com>,
	"Ondřej Bílka" <neleai@seznam.cz>,
	"Will Newton" <will.newton@linaro.org>,
	"libc-ports@sourceware.org" <libc-ports@sourceware.org>,
	"Patch Tracking" <patches@linaro.org>
Subject: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
Date: Wed, 04 Sep 2013 17:35:00 -0000
Message-ID: <CAAKybw87cyx67bpX=qjedrfjKxDmtgOfi_zCiaCfHGgx328Bsw@mail.gmail.com>
In-Reply-To: <20130904073008.GA4306@spoyarek.pnq.redhat.com>

On Wed, Sep 4, 2013 at 2:30 AM, Siddhesh Poyarekar <siddhesh@redhat.com> wrote:
> 1. Assume aligned input.  Nothing should take (any noticeable)
>    performance away from aligned copies/moves
> 2. Scale with size

In my experience scaling with data-size isn't really possible beyond a
certain point.  We pick a target range of sizes to optimize for based
upon customer feedback and we try to use pre-fetching in that range as
efficiently as possible.  But I get your point.  We don't want any
particular size to be severely penalized.

Each architecture and specific platform needs to know/decide what the
optimal range is and document it.  Even for Power we have different
expectations on server hardware like POWER7, vs. embedded hardware
like ppc 476.

> 3. Provide acceptable performance for unaligned sizes without
>    penalizing the aligned case

There are cases where the user can't control the alignment of the data
being fed into string functions, and we shouldn't penalize them for
these situations if possible.  In reality, though, if a string routine
shows up hot in a profile, unaligned data is a likely culprit, and
there's not much that can be done once the unaligned case is made as
streamlined as possible.

Simply testing for alignment (rather than presuming aligned data)
itself slows down the processing of aligned data, but that's an
unavoidable reality.  I've chatted with some compiler folks about the
possibility of branching directly to aligned-case labels in string
routines if the compiler is able to detect aligned data, but was
informed that this suggestion might get me burned at the stake.
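
To make the cost of that test concrete, here is a minimal C sketch of
a copy routine that checks alignment up front and then picks a path.
The names both_aligned8 and sketch_copy are hypothetical and this is
not how glibc's memcpy is written; it only shows that the aligned
path pays for the check before any data moves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero when both pointers are 8-byte aligned. */
static int both_aligned8(const void *dst, const void *src)
{
    return (((uintptr_t)dst | (uintptr_t)src) & 7u) == 0;
}

/* Hypothetical copy routine for illustration only.  Buffers must not
   overlap, as with memcpy. */
static void *sketch_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    /* This test runs on every call, including calls with aligned data. */
    if (both_aligned8(d, s)) {
        /* Aligned path: move 8 bytes at a time.  The fixed-size memcpy
           calls avoid aliasing problems; compilers normally lower them
           to a single load and store. */
        while (n >= 8) {
            uint64_t word;
            memcpy(&word, s, sizeof word);
            memcpy(d, &word, sizeof word);
            d += 8;
            s += 8;
            n -= 8;
        }
    }

    /* Unaligned data, or the tail of an aligned copy: byte at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

A compiler that could prove alignment at the call site would, in this
sketch, want to jump straight to the word loop and skip the test.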

As previously discussed, we might be able to use tunables in the
future to mitigate this.  But of course, this would be 'use at your
own risk'.

> 4. Measure the effect of dcache pressure on function performance
> 5. Measure effect of icache pressure on function performance.
>
> Depending on the actual cost of cache misses on different processors,
> the icache/dcache miss cost would either have higher or lower weight
> but for 1-3, I'd go in that order of priorities with little concern
> for unaligned cases.

I know that icache and dcache miss penalties are known for most
architectures, but I don't know whether they're "published".  I
suppose we can, at least, encourage developers working for the CPU
manufacturers to indicate in the documentation of preconditions which
miss is more expensive relative to the other, if they're unable to
indicate the exact costs of these misses.
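
Where the miss costs aren't published, the relative dcache cost can
at least be estimated by timing a routine with warm and with cold
data.  Below is a minimal C sketch of the eviction step.  The name
evict_dcache and the 64-byte line size are assumptions, and the
caller has to pick a buffer larger than the last-level cache of the
machine under test.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache line size in bytes; this varies by processor. */
enum { ASSUMED_LINE = 64 };

/* Hypothetical helper: write one byte in every assumed cache line of
   buf so that previously cached data is pushed out.  The volatile
   qualifier keeps the compiler from dropping the accesses.  Returns
   the number of lines touched. */
static size_t evict_dcache(volatile unsigned char *buf, size_t len)
{
    size_t touched = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i += ASSUMED_LINE) {
        buf[i] = (unsigned char)(buf[i] + 1);
        touched++;
    }
    return touched;
}
```

Calling this between timed runs of a string function gives a
cold-dcache number to compare against a back-to-back (warm) run.  It
says nothing about icache pressure, which needs a separate approach.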

Some further thoughts (just to get this stuff documented):

Some performance regressions I'm familiar with (on Power), which CAN
be measured with a baseline micro-benchmark regardless of use-case:

1. Hazards/Penalties - I'm thinking of things like load-hit-store in
the tail of a loop, e.g., label: load a value from an address, do
work, store to the same address, branch to label.  The loop takes a
stall when the value at the top of the loop isn't ready to load.

2. Dispatch grouping - Some instructions need to be first-in-group,
etc.  Grouping is also based on instruction alignment.  At least on
Power I believe some instructions benefit from specific alignment.

3. Instruction Grouping - Depending on the topology of the pipeline,
specific groupings of instructions might incur pipeline stalls due
to unavailability of the load/store unit (for instance).

4. Facility usage costs - Sometimes using certain facilities for
certain sizes of data is more costly than not using the facility.
For instance, I believe that using the DFPU on Power requires that the
floating-point pipeline be flushed, so BFP and DFP really shouldn't be
used together.  I believe there is a powerpc32 string function which
uses FPRs because they're 64 bits wide even on ppc32.  But we measured
the cost/benefit ratio of using this vs. not.

On Power, micro benchmarks are run in-house with these (and many
other) factors in mind.
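
As a rough illustration of a baseline micro-benchmark, here is a
minimal C sketch that times a copy function for one size and one pair
of source/destination misalignments.  The names copy_fn, bench_sink
and time_copy are hypothetical; this is not the in-house Power
harness or glibc's benchtests, and clock() is used only because it is
standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by memcpy and any candidate replacement. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Reading the result into a volatile keeps the copies observable. */
static volatile unsigned char bench_sink;

/* Hypothetical harness: average CPU seconds per call of fn for an
   n-byte copy, with dst and src shifted by the given byte offsets
   from malloc's alignment.  Returns -1.0 if allocation fails. */
static double time_copy(copy_fn fn, size_t n, size_t dst_off,
                        size_t src_off, unsigned long iters)
{
    assert(iters > 0);

    unsigned char *src = malloc(n + src_off + 1);
    unsigned char *dst = malloc(n + dst_off + 1);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0x5a, n + src_off + 1);
    memset(dst, 0, n + dst_off + 1);

    /* Calling through a volatile pointer stops the compiler from
       inlining or eliding the call under test. */
    copy_fn volatile call = fn;

    clock_t start = clock();
    for (unsigned long i = 0; i < iters; i++) {
        call(dst + dst_off, src + src_off, n);
        bench_sink = dst[dst_off];
    }
    clock_t end = clock();

    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC / (double)iters;
}
```

Sweeping n across the target size range and both offsets across 0..15
gives a baseline table per function, e.g. time_copy(memcpy, 256, 0, 3,
1000000).  Buffers stay warm in this loop, so the numbers reflect the
hazards listed above rather than cache-miss costs.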

Ryan
