public inbox for libc-ports@sourceware.org
From: Will Newton <will.newton@linaro.org>
To: "Måns Rullgård" <mans@mansr.com>
Cc: libc-ports@sourceware.org, patches@linaro.org
Subject: Re: [PATCH] ARM: Add Cortex-A15 optimized NEON and VFP memcpy routines, with IFUNC.
Date: Mon, 15 Apr 2013 10:38:00 -0000
Message-ID: <CANu=DmjJUZ319+7_M8cyxMga_rYxbGb_QSs87Q29JBdkKX_97g@mail.gmail.com>
In-Reply-To: <yw1x8v4k6rcc.fsf@unicorn.mansr.com>

[-- Attachment #1: Type: text/plain, Size: 517 bytes --]

On 15 April 2013 11:06, Måns Rullgård <mans@mansr.com> wrote:

Hi Måns,

>> Add a high performance memcpy routine optimized for Cortex-A15 with
>> variants for use in the presence of NEON and VFP hardware, selected
>> at runtime using indirect function support.
>
> How does this perform on Cortex-A9?

The code is also faster on the A9, although the gains are not quite as
pronounced. A set of numbers is attached, since they line-wrap pretty
horribly inline.


--
Will Newton
Toolchain Working Group, Linaro

[-- Attachment #2: memcpy_benchmark_a9.txt --]
[-- Type: text/plain, Size: 8450 bytes --]

before:8:100000000:1:3.847382: took 3.847382 s for 100000000 calls to memcpy of 8 bytes.  ~258.630 MB/s corrected.
after:8:100000000:1:3.171783: took 3.171783 s for 100000000 calls to memcpy of 8 bytes.  ~335.458 MB/s corrected.
before:8:100000000:2:3.763550: took 3.763550 s for 100000000 calls to memcpy of 8 bytes.  ~266.195 MB/s corrected.
after:8:100000000:2:2.360168: took 2.360168 s for 100000000 calls to memcpy of 8 bytes.  ~521.594 MB/s corrected.
before:8:100000000:4:3.183990: took 3.183990 s for 100000000 calls to memcpy of 8 bytes.  ~333.667 MB/s corrected.
after:8:100000000:4:2.357422: took 2.357422 s for 100000000 calls to memcpy of 8 bytes.  ~522.575 MB/s corrected.
before:8:100000000:8:3.105652: took 3.105652 s for 100000000 calls to memcpy of 8 bytes.  ~345.504 MB/s corrected.
after:8:100000000:8:2.339081: took 2.339081 s for 100000000 calls to memcpy of 8 bytes.  ~529.223 MB/s corrected.
before:16:100000000:1:3.887695: took 3.887695 s for 100000000 calls to memcpy of 16 bytes.  ~510.287 MB/s corrected.
after:16:100000000:1:2.506378: took 2.506378 s for 100000000 calls to memcpy of 16 bytes.  ~948.388 MB/s corrected.
before:16:100000000:2:4.114410: took 4.114410 s for 100000000 calls to memcpy of 16 bytes.  ~474.325 MB/s corrected.
after:16:100000000:2:2.506226: took 2.506226 s for 100000000 calls to memcpy of 16 bytes.  ~948.478 MB/s corrected.
before:16:100000000:4:3.460236: took 3.460236 s for 100000000 calls to memcpy of 16 bytes.  ~595.401 MB/s corrected.
after:16:100000000:4:2.509155: took 2.509155 s for 100000000 calls to memcpy of 16 bytes.  ~946.754 MB/s corrected.
before:16:100000000:8:3.344055: took 3.344055 s for 100000000 calls to memcpy of 16 bytes.  ~623.674 MB/s corrected.
after:16:100000000:8:2.339264: took 2.339264 s for 100000000 calls to memcpy of 16 bytes.  ~1058.312 MB/s corrected.
before:20:100000000:1:4.080444: took 4.080444 s for 100000000 calls to memcpy of 20 bytes.  ~599.233 MB/s corrected.
after:20:100000000:1:3.094452: took 3.094452 s for 100000000 calls to memcpy of 20 bytes.  ~868.164 MB/s corrected.
before:20:100000000:2:4.399658: took 4.399658 s for 100000000 calls to memcpy of 20 bytes.  ~544.615 MB/s corrected.
after:20:100000000:2:3.091522: took 3.091522 s for 100000000 calls to memcpy of 20 bytes.  ~869.323 MB/s corrected.
before:20:100000000:4:3.512451: took 3.512451 s for 100000000 calls to memcpy of 20 bytes.  ~729.390 MB/s corrected.
after:20:100000000:4:3.094696: took 3.094696 s for 100000000 calls to memcpy of 20 bytes.  ~868.067 MB/s corrected.
before:20:100000000:8:3.579956: took 3.579956 s for 100000000 calls to memcpy of 20 bytes.  ~711.035 MB/s corrected.
after:20:100000000:8:2.339600: took 2.339600 s for 100000000 calls to memcpy of 20 bytes.  ~1322.583 MB/s corrected.
before:31:100000000:1:4.722931: took 4.722931 s for 100000000 calls to memcpy of 31 bytes.  ~772.817 MB/s corrected.
after:31:100000000:1:3.512634: took 3.512634 s for 100000000 calls to memcpy of 31 bytes.  ~1130.475 MB/s corrected.
before:31:100000000:2:4.926422: took 4.926422 s for 100000000 calls to memcpy of 31 bytes.  ~733.785 MB/s corrected.
after:31:100000000:2:3.700684: took 3.700684 s for 100000000 calls to memcpy of 31 bytes.  ~1054.640 MB/s corrected.
before:31:100000000:4:3.725647: took 3.725647 s for 100000000 calls to memcpy of 31 bytes.  ~1045.331 MB/s corrected.
after:31:100000000:4:3.430481: took 3.430481 s for 100000000 calls to memcpy of 31 bytes.  ~1167.140 MB/s corrected.
before:31:100000000:8:3.706085: took 3.706085 s for 100000000 calls to memcpy of 31 bytes.  ~1052.611 MB/s corrected.
after:31:100000000:8:2.669373: took 2.669373 s for 100000000 calls to memcpy of 31 bytes.  ~1668.474 MB/s corrected.
before:32:100000000:1:4.521362: took 4.521362 s for 100000000 calls to memcpy of 32 bytes.  ~842.119 MB/s corrected.
after:32:100000000:1:3.682373: took 3.682373 s for 100000000 calls to memcpy of 32 bytes.  ~1095.818 MB/s corrected.
before:32:100000000:2:4.879456: took 4.879456 s for 100000000 calls to memcpy of 32 bytes.  ~766.389 MB/s corrected.
after:32:100000000:2:3.680542: took 3.680542 s for 100000000 calls to memcpy of 32 bytes.  ~1096.539 MB/s corrected.
before:32:100000000:4:3.563934: took 3.563934 s for 100000000 calls to memcpy of 32 bytes.  ~1144.492 MB/s corrected.
after:32:100000000:4:3.679932: took 3.679932 s for 100000000 calls to memcpy of 32 bytes.  ~1096.779 MB/s corrected.
before:32:100000000:8:3.602142: took 3.602142 s for 100000000 calls to memcpy of 32 bytes.  ~1128.324 MB/s corrected.
after:32:100000000:8:2.703949: took 2.703949 s for 100000000 calls to memcpy of 32 bytes.  ~1689.331 MB/s corrected.
before:63:100000000:1:5.548370: took 5.548370 s for 100000000 calls to memcpy of 63 bytes.  ~1291.822 MB/s corrected.
after:63:100000000:1:5.854523: took 5.854523 s for 100000000 calls to memcpy of 63 bytes.  ~1212.038 MB/s corrected.
before:63:100000000:2:5.685883: took 5.685883 s for 100000000 calls to memcpy of 63 bytes.  ~1254.724 MB/s corrected.
after:63:100000000:2:6.084839: took 6.084839 s for 100000000 calls to memcpy of 63 bytes.  ~1158.224 MB/s corrected.
before:63:100000000:4:4.683136: took 4.683136 s for 100000000 calls to memcpy of 63 bytes.  ~1587.074 MB/s corrected.
after:63:100000000:4:5.771179: took 5.771179 s for 100000000 calls to memcpy of 63 bytes.  ~1232.765 MB/s corrected.
before:63:100000000:8:4.640594: took 4.640594 s for 100000000 calls to memcpy of 63 bytes.  ~1605.112 MB/s corrected.
after:63:100000000:8:4.098389: took 4.098389 s for 100000000 calls to memcpy of 63 bytes.  ~1877.002 MB/s corrected.
before:64:100000000:1:5.395660: took 5.395660 s for 100000000 calls to memcpy of 64 bytes.  ~1356.879 MB/s corrected.
after:64:100000000:1:4.349274: took 4.349274 s for 100000000 calls to memcpy of 64 bytes.  ~1768.205 MB/s corrected.
before:64:100000000:2:5.692108: took 5.692108 s for 100000000 calls to memcpy of 64 bytes.  ~1272.985 MB/s corrected.
after:64:100000000:2:4.457306: took 4.457306 s for 100000000 calls to memcpy of 64 bytes.  ~1714.545 MB/s corrected.
before:64:100000000:4:4.468567: took 4.468567 s for 100000000 calls to memcpy of 64 bytes.  ~1709.138 MB/s corrected.
after:64:100000000:4:4.772614: took 4.772614 s for 100000000 calls to memcpy of 64 bytes.  ~1575.038 MB/s corrected.
before:64:100000000:8:4.309143: took 4.309143 s for 100000000 calls to memcpy of 64 bytes.  ~1789.004 MB/s corrected.
after:64:100000000:8:3.262054: took 3.262054 s for 100000000 calls to memcpy of 64 bytes.  ~2581.210 MB/s corrected.
before:100:100000000:1:7.877625: took 7.877625 s for 100000000 calls to memcpy of 100 bytes.  ~1366.263 MB/s corrected.
after:100:100000000:1:4.935211: took 4.935211 s for 100000000 calls to memcpy of 100 bytes.  ~2361.895 MB/s corrected.
before:100:100000000:2:8.309174: took 8.309174 s for 100000000 calls to memcpy of 100 bytes.  ~1286.712 MB/s corrected.
after:100:100000000:2:4.851624: took 4.851624 s for 100000000 calls to memcpy of 100 bytes.  ~2411.823 MB/s corrected.
before:100:100000000:4:5.450745: took 5.450745 s for 100000000 calls to memcpy of 100 bytes.  ~2094.476 MB/s corrected.
after:100:100000000:4:5.515472: took 5.515472 s for 100000000 calls to memcpy of 100 bytes.  ~2065.119 MB/s corrected.
before:100:100000000:8:5.214142: took 5.214142 s for 100000000 calls to memcpy of 100 bytes.  ~2209.276 MB/s corrected.
after:100:100000000:8:4.516113: took 4.516113 s for 100000000 calls to memcpy of 100 bytes.  ~2635.440 MB/s corrected.
before:200:100000000:1:8.623077: took 8.623077 s for 100000000 calls to memcpy of 200 bytes.  ~2468.862 MB/s corrected.
after:200:100000000:1:7.694977: took 7.694977 s for 100000000 calls to memcpy of 200 bytes.  ~2805.949 MB/s corrected.
before:200:100000000:2:9.148895: took 9.148895 s for 100000000 calls to memcpy of 200 bytes.  ~2311.536 MB/s corrected.
after:200:100000000:2:7.444061: took 7.444061 s for 100000000 calls to memcpy of 200 bytes.  ~2913.494 MB/s corrected.
before:200:100000000:4:8.382385: took 8.382385 s for 100000000 calls to memcpy of 200 bytes.  ~2548.253 MB/s corrected.
after:200:100000000:4:7.862091: took 7.862091 s for 100000000 calls to memcpy of 200 bytes.  ~2738.621 MB/s corrected.
before:200:100000000:8:8.110168: took 8.110168 s for 100000000 calls to memcpy of 200 bytes.  ~2644.428 MB/s corrected.
after:200:100000000:8:6.816742: took 6.816742 s for 100000000 calls to memcpy of 200 bytes.  ~3222.264 MB/s corrected.
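
A rough way to compare the before/after pairs in the log above is to
divide the two wall-clock times for each configuration. The sketch below
is only an illustration: the helper names (parse_line, speedups) are
hypothetical, and it assumes the colon-separated prefix is
variant:size:calls:alignment:seconds. The message does not say what the
fourth field (1, 2, 4, 8) means, so treating it as an alignment is a guess.

```python
# Sample lines copied verbatim from the attachment above.
SAMPLE = """\
before:8:100000000:1:3.847382: took 3.847382 s for 100000000 calls to memcpy of 8 bytes.  ~258.630 MB/s corrected.
after:8:100000000:1:3.171783: took 3.171783 s for 100000000 calls to memcpy of 8 bytes.  ~335.458 MB/s corrected.
before:63:100000000:4:4.683136: took 4.683136 s for 100000000 calls to memcpy of 63 bytes.  ~1587.074 MB/s corrected.
after:63:100000000:4:5.771179: took 5.771179 s for 100000000 calls to memcpy of 63 bytes.  ~1232.765 MB/s corrected.
"""


def parse_line(line):
    """Split one log line into (variant, size, fourth_field, seconds).

    Assumption: the prefix is variant:size:calls:alignment:seconds.
    """
    variant, size, _calls, align, seconds, _rest = line.split(":", 5)
    return variant, int(size), int(align), float(seconds)


def speedups(text):
    """Return {(size, align): before_seconds / after_seconds}.

    A ratio above 1.0 means the "after" run took less time.
    """
    times = {}
    for line in text.splitlines():
        if line.startswith(("before:", "after:")):
            variant, size, align, seconds = parse_line(line)
            times[(variant, size, align)] = seconds

    result = {}
    for (variant, size, align), before in times.items():
        if variant == "before" and ("after", size, align) in times:
            result[(size, align)] = before / times[("after", size, align)]
    return result


if __name__ == "__main__":
    for (size, align), ratio in sorted(speedups(SAMPLE).items()):
        print(f"size={size:4d} field4={align}: {ratio:.3f}x")
```

Feeding the whole attachment to speedups() instead of SAMPLE gives one
ratio per size and fourth-field pair, which makes it easier to spot the
configurations where the new routine takes longer than the old one.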
