From: Will Newton <will.newton@linaro.org>
To: "Måns Rullgård" <mans@mansr.com>
Cc: libc-ports@sourceware.org, patches@linaro.org
Subject: Re: [PATCH] ARM: Add Cortex-A15 optimized NEON and VFP memcpy routines, with IFUNC.
Date: Mon, 15 Apr 2013 10:38:00 -0000
Message-ID: <CANu=DmjJUZ319+7_M8cyxMga_rYxbGb_QSs87Q29JBdkKX_97g@mail.gmail.com>
In-Reply-To: <yw1x8v4k6rcc.fsf@unicorn.mansr.com>
[-- Attachment #1: Type: text/plain, Size: 517 bytes --]
On 15 April 2013 11:06, Måns Rullgård <mans@mansr.com> wrote:
Hi Måns,
>> Add a high performance memcpy routine optimized for Cortex-A15 with
>> variants for use in the presence of NEON and VFP hardware, selected
>> at runtime using indirect function support.
>
> How does this perform on Cortex-A9?
The code is also faster on A9 in most of the cases measured, although
the gains are not quite as pronounced. A set of numbers is attached
(they line-wrap pretty horribly inline).
--
Will Newton
Toolchain Working Group, Linaro
[-- Attachment #2: memcpy_benchmark_a9.txt --]
[-- Type: text/plain, Size: 8450 bytes --]
before:8:100000000:1:3.847382: took 3.847382 s for 100000000 calls to memcpy of 8 bytes. ~258.630 MB/s corrected.
after:8:100000000:1:3.171783: took 3.171783 s for 100000000 calls to memcpy of 8 bytes. ~335.458 MB/s corrected.
before:8:100000000:2:3.763550: took 3.763550 s for 100000000 calls to memcpy of 8 bytes. ~266.195 MB/s corrected.
after:8:100000000:2:2.360168: took 2.360168 s for 100000000 calls to memcpy of 8 bytes. ~521.594 MB/s corrected.
before:8:100000000:4:3.183990: took 3.183990 s for 100000000 calls to memcpy of 8 bytes. ~333.667 MB/s corrected.
after:8:100000000:4:2.357422: took 2.357422 s for 100000000 calls to memcpy of 8 bytes. ~522.575 MB/s corrected.
before:8:100000000:8:3.105652: took 3.105652 s for 100000000 calls to memcpy of 8 bytes. ~345.504 MB/s corrected.
after:8:100000000:8:2.339081: took 2.339081 s for 100000000 calls to memcpy of 8 bytes. ~529.223 MB/s corrected.
before:16:100000000:1:3.887695: took 3.887695 s for 100000000 calls to memcpy of 16 bytes. ~510.287 MB/s corrected.
after:16:100000000:1:2.506378: took 2.506378 s for 100000000 calls to memcpy of 16 bytes. ~948.388 MB/s corrected.
before:16:100000000:2:4.114410: took 4.114410 s for 100000000 calls to memcpy of 16 bytes. ~474.325 MB/s corrected.
after:16:100000000:2:2.506226: took 2.506226 s for 100000000 calls to memcpy of 16 bytes. ~948.478 MB/s corrected.
before:16:100000000:4:3.460236: took 3.460236 s for 100000000 calls to memcpy of 16 bytes. ~595.401 MB/s corrected.
after:16:100000000:4:2.509155: took 2.509155 s for 100000000 calls to memcpy of 16 bytes. ~946.754 MB/s corrected.
before:16:100000000:8:3.344055: took 3.344055 s for 100000000 calls to memcpy of 16 bytes. ~623.674 MB/s corrected.
after:16:100000000:8:2.339264: took 2.339264 s for 100000000 calls to memcpy of 16 bytes. ~1058.312 MB/s corrected.
before:20:100000000:1:4.080444: took 4.080444 s for 100000000 calls to memcpy of 20 bytes. ~599.233 MB/s corrected.
after:20:100000000:1:3.094452: took 3.094452 s for 100000000 calls to memcpy of 20 bytes. ~868.164 MB/s corrected.
before:20:100000000:2:4.399658: took 4.399658 s for 100000000 calls to memcpy of 20 bytes. ~544.615 MB/s corrected.
after:20:100000000:2:3.091522: took 3.091522 s for 100000000 calls to memcpy of 20 bytes. ~869.323 MB/s corrected.
before:20:100000000:4:3.512451: took 3.512451 s for 100000000 calls to memcpy of 20 bytes. ~729.390 MB/s corrected.
after:20:100000000:4:3.094696: took 3.094696 s for 100000000 calls to memcpy of 20 bytes. ~868.067 MB/s corrected.
before:20:100000000:8:3.579956: took 3.579956 s for 100000000 calls to memcpy of 20 bytes. ~711.035 MB/s corrected.
after:20:100000000:8:2.339600: took 2.339600 s for 100000000 calls to memcpy of 20 bytes. ~1322.583 MB/s corrected.
before:31:100000000:1:4.722931: took 4.722931 s for 100000000 calls to memcpy of 31 bytes. ~772.817 MB/s corrected.
after:31:100000000:1:3.512634: took 3.512634 s for 100000000 calls to memcpy of 31 bytes. ~1130.475 MB/s corrected.
before:31:100000000:2:4.926422: took 4.926422 s for 100000000 calls to memcpy of 31 bytes. ~733.785 MB/s corrected.
after:31:100000000:2:3.700684: took 3.700684 s for 100000000 calls to memcpy of 31 bytes. ~1054.640 MB/s corrected.
before:31:100000000:4:3.725647: took 3.725647 s for 100000000 calls to memcpy of 31 bytes. ~1045.331 MB/s corrected.
after:31:100000000:4:3.430481: took 3.430481 s for 100000000 calls to memcpy of 31 bytes. ~1167.140 MB/s corrected.
before:31:100000000:8:3.706085: took 3.706085 s for 100000000 calls to memcpy of 31 bytes. ~1052.611 MB/s corrected.
after:31:100000000:8:2.669373: took 2.669373 s for 100000000 calls to memcpy of 31 bytes. ~1668.474 MB/s corrected.
before:32:100000000:1:4.521362: took 4.521362 s for 100000000 calls to memcpy of 32 bytes. ~842.119 MB/s corrected.
after:32:100000000:1:3.682373: took 3.682373 s for 100000000 calls to memcpy of 32 bytes. ~1095.818 MB/s corrected.
before:32:100000000:2:4.879456: took 4.879456 s for 100000000 calls to memcpy of 32 bytes. ~766.389 MB/s corrected.
after:32:100000000:2:3.680542: took 3.680542 s for 100000000 calls to memcpy of 32 bytes. ~1096.539 MB/s corrected.
before:32:100000000:4:3.563934: took 3.563934 s for 100000000 calls to memcpy of 32 bytes. ~1144.492 MB/s corrected.
after:32:100000000:4:3.679932: took 3.679932 s for 100000000 calls to memcpy of 32 bytes. ~1096.779 MB/s corrected.
before:32:100000000:8:3.602142: took 3.602142 s for 100000000 calls to memcpy of 32 bytes. ~1128.324 MB/s corrected.
after:32:100000000:8:2.703949: took 2.703949 s for 100000000 calls to memcpy of 32 bytes. ~1689.331 MB/s corrected.
before:63:100000000:1:5.548370: took 5.548370 s for 100000000 calls to memcpy of 63 bytes. ~1291.822 MB/s corrected.
after:63:100000000:1:5.854523: took 5.854523 s for 100000000 calls to memcpy of 63 bytes. ~1212.038 MB/s corrected.
before:63:100000000:2:5.685883: took 5.685883 s for 100000000 calls to memcpy of 63 bytes. ~1254.724 MB/s corrected.
after:63:100000000:2:6.084839: took 6.084839 s for 100000000 calls to memcpy of 63 bytes. ~1158.224 MB/s corrected.
before:63:100000000:4:4.683136: took 4.683136 s for 100000000 calls to memcpy of 63 bytes. ~1587.074 MB/s corrected.
after:63:100000000:4:5.771179: took 5.771179 s for 100000000 calls to memcpy of 63 bytes. ~1232.765 MB/s corrected.
before:63:100000000:8:4.640594: took 4.640594 s for 100000000 calls to memcpy of 63 bytes. ~1605.112 MB/s corrected.
after:63:100000000:8:4.098389: took 4.098389 s for 100000000 calls to memcpy of 63 bytes. ~1877.002 MB/s corrected.
before:64:100000000:1:5.395660: took 5.395660 s for 100000000 calls to memcpy of 64 bytes. ~1356.879 MB/s corrected.
after:64:100000000:1:4.349274: took 4.349274 s for 100000000 calls to memcpy of 64 bytes. ~1768.205 MB/s corrected.
before:64:100000000:2:5.692108: took 5.692108 s for 100000000 calls to memcpy of 64 bytes. ~1272.985 MB/s corrected.
after:64:100000000:2:4.457306: took 4.457306 s for 100000000 calls to memcpy of 64 bytes. ~1714.545 MB/s corrected.
before:64:100000000:4:4.468567: took 4.468567 s for 100000000 calls to memcpy of 64 bytes. ~1709.138 MB/s corrected.
after:64:100000000:4:4.772614: took 4.772614 s for 100000000 calls to memcpy of 64 bytes. ~1575.038 MB/s corrected.
before:64:100000000:8:4.309143: took 4.309143 s for 100000000 calls to memcpy of 64 bytes. ~1789.004 MB/s corrected.
after:64:100000000:8:3.262054: took 3.262054 s for 100000000 calls to memcpy of 64 bytes. ~2581.210 MB/s corrected.
before:100:100000000:1:7.877625: took 7.877625 s for 100000000 calls to memcpy of 100 bytes. ~1366.263 MB/s corrected.
after:100:100000000:1:4.935211: took 4.935211 s for 100000000 calls to memcpy of 100 bytes. ~2361.895 MB/s corrected.
before:100:100000000:2:8.309174: took 8.309174 s for 100000000 calls to memcpy of 100 bytes. ~1286.712 MB/s corrected.
after:100:100000000:2:4.851624: took 4.851624 s for 100000000 calls to memcpy of 100 bytes. ~2411.823 MB/s corrected.
before:100:100000000:4:5.450745: took 5.450745 s for 100000000 calls to memcpy of 100 bytes. ~2094.476 MB/s corrected.
after:100:100000000:4:5.515472: took 5.515472 s for 100000000 calls to memcpy of 100 bytes. ~2065.119 MB/s corrected.
before:100:100000000:8:5.214142: took 5.214142 s for 100000000 calls to memcpy of 100 bytes. ~2209.276 MB/s corrected.
after:100:100000000:8:4.516113: took 4.516113 s for 100000000 calls to memcpy of 100 bytes. ~2635.440 MB/s corrected.
before:200:100000000:1:8.623077: took 8.623077 s for 100000000 calls to memcpy of 200 bytes. ~2468.862 MB/s corrected.
after:200:100000000:1:7.694977: took 7.694977 s for 100000000 calls to memcpy of 200 bytes. ~2805.949 MB/s corrected.
before:200:100000000:2:9.148895: took 9.148895 s for 100000000 calls to memcpy of 200 bytes. ~2311.536 MB/s corrected.
after:200:100000000:2:7.444061: took 7.444061 s for 100000000 calls to memcpy of 200 bytes. ~2913.494 MB/s corrected.
before:200:100000000:4:8.382385: took 8.382385 s for 100000000 calls to memcpy of 200 bytes. ~2548.253 MB/s corrected.
after:200:100000000:4:7.862091: took 7.862091 s for 100000000 calls to memcpy of 200 bytes. ~2738.621 MB/s corrected.
before:200:100000000:8:8.110168: took 8.110168 s for 100000000 calls to memcpy of 200 bytes. ~2644.428 MB/s corrected.
after:200:100000000:8:6.816742: took 6.816742 s for 100000000 calls to memcpy of 200 bytes. ~3222.264 MB/s corrected.
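
For anyone who wants to compare the before/after records above without
reading them pair by pair, here is a minimal sketch in Python. It is not
part of the patch or the benchmark harness. It assumes each record is
"label:size:calls:N:seconds: ..." and that the fourth field (1, 2, 4
or 8) is an alignment setting; the attachment does not say so. The names
parse_line, speedups and SAMPLE are made up for illustration.

```python
# Sketch only: field meanings are inferred from the attachment, and the
# fourth field is assumed (not confirmed) to be an alignment setting.

SAMPLE = [
    "before:32:100000000:8:3.602142: took 3.602142 s for 100000000 calls to memcpy of 32 bytes. ~1128.324 MB/s corrected.",
    "after:32:100000000:8:2.703949: took 2.703949 s for 100000000 calls to memcpy of 32 bytes. ~1689.331 MB/s corrected.",
    "before:63:100000000:4:4.683136: took 4.683136 s for 100000000 calls to memcpy of 63 bytes. ~1587.074 MB/s corrected.",
    "after:63:100000000:4:5.771179: took 5.771179 s for 100000000 calls to memcpy of 63 bytes. ~1232.765 MB/s corrected.",
]


def parse_line(line):
    """Return (label, size, calls, align, seconds) from one record."""
    label, size, calls, align, seconds = line.split(":", 5)[:5]
    return label, int(size), int(calls), int(align), float(seconds)


def speedups(lines):
    """Pair 'before' and 'after' records by (size, align).

    Returns {(size, align): before_seconds / after_seconds}; a ratio
    above 1.0 means the 'after' run took less time.
    """
    times = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        label, size, _calls, align, seconds = parse_line(line)
        times[(label, size, align)] = seconds

    result = {}
    for (label, size, align), before in times.items():
        if label != "before":
            continue
        after = times.get(("after", size, align))
        if after is not None:
            result[(size, align)] = before / after
    return result


if __name__ == "__main__":
    for (size, align), ratio in sorted(speedups(SAMPLE).items()):
        print(f"size={size:4d} field4={align}: before/after = {ratio:.3f}")
```

Feeding the whole attachment to speedups() instead of SAMPLE gives one
ratio per size/fourth-field pair, which makes the handful of cases where
the new code is slower easy to spot.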