public inbox for libc-alpha@sourceware.org
From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
To: "naohirot@fujitsu.com" <naohirot@fujitsu.com>
Cc: 'GNU C Library' <libc-alpha@sourceware.org>
Subject: Re: [PATCH] AArch64: Improve A64FX memcpy
Date: Fri, 9 Jul 2021 12:50:36 +0000	[thread overview]
Message-ID: <VE1PR08MB559914B246CEB671E7618A6783189@VE1PR08MB5599.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <TYAPR01MB60256D721B80AAF3EFFC0B37DF1B9@TYAPR01MB6025.jpnprd01.prod.outlook.com>

Hi Naohiro,

> The following Google Sheets graph [1] shows the most representative
> (least noisy) data among the performance runs I measured several times.

I do see a lot of noise (easily 10-20% from run to run), but it reduces a lot
if you execute a few iterations before the benchmark timing starts.

> - memset-default shows that the updated patch's performance is better
>   from 64 bytes to 1024 bytes, but the master performance is better
>   from 1024 bytes to 8192 bytes.
>
> I think we need to do a lot of trial and error to keep the same
> performance while reducing the code size.
> Do you have any idea how to further update the patch?
> I believe that memset 1024 byte to 8192 byte performance has something
> to do with removing unroll32.

I don't see unroll32 making any difference. Once you unroll about 4 times,
the loop overhead is fully hidden by the unrolled code, so further unrolling
cannot help. I do sometimes see a small difference between 4x and 8x
unrolling - for smaller sizes up to about 4KB the 4x unroll is faster,
but for 32-64KB the 8x unroll wins by 5-10%. I've left it at 8x for now.
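To make the "loop overhead is hidden" point concrete, here is a hedged C
sketch of a 4x-unrolled copy loop (the actual A64FX code is hand-written SVE
assembly with much wider vectors; the function name and 8-byte chunk size
are purely illustrative):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: four independent 8-byte loads and stores per
   iteration mean the single compare-and-branch of the loop is amortized
   over 32 bytes of work, so it no longer dominates the copy cost. */
static void copy_unroll4(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;

    /* Main loop: 4 x 8 bytes = 32 bytes per iteration. */
    for (; i + 32 <= n; i += 32) {
        uint64_t a, b, c, d;
        memcpy(&a, src + i,      8);
        memcpy(&b, src + i + 8,  8);
        memcpy(&c, src + i + 16, 8);
        memcpy(&d, src + i + 24, 8);
        memcpy(dst + i,      &a, 8);
        memcpy(dst + i + 8,  &b, 8);
        memcpy(dst + i + 16, &c, 8);
        memcpy(dst + i + 24, &d, 8);
    }

    /* Byte tail; the real code uses wider overlapping stores instead. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

Unrolling further just replicates the load/store group; once the branch is
already hidden, the extra copies of it buy nothing for small sizes.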

The difference is due to handling of the last 512 bytes - for copies around 1KB,
always copying an extra 512 bytes from the end is a measurable overhead.
So I've added some branches to handle the tail in smaller chunks, but this
also adds extra branch mispredictions if the size varies from call to call
(which these microbenchmarks don't simulate).
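The branchy tail handling can be pictured with this hedged C sketch (again
not the real SVE code; the chunk sizes and the helper name are invented for
illustration, and it assumes the total size is at least 64 bytes):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative tail handling: after the unrolled main loop has copied the
   first `done` bytes, the remaining `rem` (< 512) bytes are copied in
   halving chunks rather than always doing a full 512-byte block. The final
   chunk is anchored at the end of the buffer and may overlap bytes already
   written, which is harmless. Each `if` is one of the extra branches
   mentioned above: cheap when call sizes repeat, mispredicted when they
   vary from call to call. Assumes n >= 64. */
static void copy_tail(uint8_t *dst, const uint8_t *src, size_t n, size_t done)
{
    size_t rem = n - done;                     /* rem < 512 assumed */

    if (rem >= 256) { memcpy(dst + done, src + done, 256); done += 256; rem -= 256; }
    if (rem >= 128) { memcpy(dst + done, src + done, 128); done += 128; rem -= 128; }
    if (rem >= 64)  { memcpy(dst + done, src + done, 64);  done += 64;  rem -= 64;  }
    /* Overlapping final copy of the last 64 bytes covers any remainder. */
    if (rem > 0)
        memcpy(dst + n - 64, src + n - 64, 64);
}
```

For a 1KB copy this replaces an unconditional extra 512-byte block with at
most three sized copies plus one overlapping 64-byte copy.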

I also fixed the zero-memset case - it's almost twice as fast to zero
memory with DC ZVA, so that is used when possible for large sizes.
For non-zero memsets it turns out to be a bad idea to use DC ZVA plus a
second store: the unroll8 loop is about 10% faster, so the new version
just uses that.
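For readers unfamiliar with DC ZVA: it is an AArch64 instruction that zeroes
an entire cache-writeback-granule-sized block per instruction, which is why a
dedicated zeroing path can be so much faster than storing an arbitrary byte
pattern. A hedged C sketch of the resulting dispatch (the function names and
the 256-byte block size are assumptions for illustration; the real code is
assembly and issues `dc zva` directly):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ZVA_BLOCK 256   /* assumed DC ZVA block size for illustration */

/* Stand-in for the DC ZVA fast path: zero one aligned block. On AArch64
   this would be a single `dc zva` instruction; here it is plain stores. */
static void zero_block(uint8_t *p)
{
    memset(p, 0, ZVA_BLOCK);
}

/* Sketch of the dispatch described above: large zero-memsets take the
   block-zeroing path; non-zero or small memsets take the ordinary
   unrolled store loop (shown here as a simple byte loop). */
static void memset_sketch(uint8_t *dst, int c, size_t n)
{
    if (c == 0 && n >= 2 * ZVA_BLOCK) {
        uint8_t *p = dst;
        uint8_t *end = dst + n;
        /* Head: byte stores up to the first aligned block boundary. */
        while ((uintptr_t)p % ZVA_BLOCK)
            *p++ = 0;
        /* Middle: one "DC ZVA" per aligned block. */
        while ((size_t)(end - p) >= ZVA_BLOCK) {
            zero_block(p);
            p += ZVA_BLOCK;
        }
        /* Tail: remaining bytes. */
        while (p < end)
            *p++ = 0;
    } else {
        for (size_t i = 0; i < n; i++)
            dst[i] = (uint8_t)c;
    }
}
```

Note the fast path only exists for the zero case: a non-zero memset via
DC ZVA would need a second pass of real stores over the same memory, which
is exactly the "DC ZVA plus a second store" combination that lost to the
plain unrolled loop.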

Cheers,
Wilco

Thread overview: 8+ messages
2021-06-30 15:38 Wilco Dijkstra
2021-07-01  8:26 ` naohirot
2021-07-06 12:35   ` naohirot
2021-07-09 12:50     ` Wilco Dijkstra [this message]
2021-07-13  8:33       ` naohirot
2021-07-14 17:49         ` Wilco Dijkstra
2021-07-15  8:16           ` naohirot
2021-07-22 17:06             ` Wilco Dijkstra
