From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
To: "naohirot@fujitsu.com" <naohirot@fujitsu.com>
Cc: 'GNU C Library' <libc-alpha@sourceware.org>,
	Szabolcs Nagy <Szabolcs.Nagy@arm.com>
Subject: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Mon, 12 Apr 2021 12:52:05 +0000
Message-ID: <VE1PR08MB559991EE24FFB21C1258B71783709@VE1PR08MB5599.eurprd08.prod.outlook.com>

Hi,

I have a few comments about memcpy design (the principles apply equally to memset):

1. Overall the code is too large due to enormous unroll factors

Our current memcpy is about 300 bytes (and that includes memmove); this memcpy is ~12 times larger!
That hurts performance because the code no longer fits in the I-cache for common copy sizes.
On a modern out-of-order core you need very little unrolling, since ALU operations and branches
become essentially free while the CPU executes the loads and stores. So rather than unrolling
32-64 times, try 4 times - you just need enough to hide the taken-branch latency.
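
Something along these lines is usually enough (a rough C sketch using the ACLE SVE intrinsics
rather than the patch's assembly - the function name and structure are only for illustration,
and alignment handling is left out):

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: main copy loop unrolled 4x, predicated tail.  */
static void
copy_bulk (uint8_t *dst, const uint8_t *src, size_t n)
{
  svbool_t all = svptrue_b8 ();
  size_t vl = svcntb ();        /* bytes per SVE vector */
  size_t i = 0;

  /* 4x unroll: enough independent loads/stores in flight to hide the
     taken-branch latency on an out-of-order core.  */
  for (; i + 4 * vl <= n; i += 4 * vl)
    {
      svuint8_t v0 = svld1_u8 (all, src + i);
      svuint8_t v1 = svld1_u8 (all, src + i + vl);
      svuint8_t v2 = svld1_u8 (all, src + i + 2 * vl);
      svuint8_t v3 = svld1_u8 (all, src + i + 3 * vl);
      svst1_u8 (all, dst + i, v0);
      svst1_u8 (all, dst + i + vl, v1);
      svst1_u8 (all, dst + i + 2 * vl, v2);
      svst1_u8 (all, dst + i + 3 * vl, v3);
    }

  /* Remaining bytes: predicated copy, one vector at a time.  */
  for (; i < n; i += vl)
    {
      svbool_t pg = svwhilelt_b8_u64 (i, n);
      svst1_u8 (pg, dst + i, svld1_u8 (pg, src + i));
    }
}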

2. I don't see any special handling for small copies

Even if you want to hyper-optimize gigabyte-sized copies, small copies are still extremely common,
so you always want to handle those as quickly (and with as little code) as possible. Special-casing
small copies does not slow down the huge copies - if anything the reverse, since the main loop no
longer needs to handle the small cases.
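
For illustration only (this is the generic trick, not code from the patch): sizes up to 16 bytes
can be covered with two possibly-overlapping fixed-size accesses, with no loop and no branch on the
exact length. In C it looks roughly like this (the fixed-size memcpy calls compile down to single
unaligned load/store instructions):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative small-copy fast path; the caller dispatches here for n <= 16.  */
static inline void
copy_le16 (uint8_t *dst, const uint8_t *src, size_t n)
{
  if (n >= 8)
    {
      uint64_t a, b;
      memcpy (&a, src, 8);             /* first 8 bytes                */
      memcpy (&b, src + n - 8, 8);     /* last 8 bytes (may overlap)   */
      memcpy (dst, &a, 8);
      memcpy (dst + n - 8, &b, 8);
    }
  else if (n >= 4)
    {
      uint32_t a, b;
      memcpy (&a, src, 4);
      memcpy (&b, src + n - 4, 4);
      memcpy (dst, &a, 4);
      memcpy (dst + n - 4, &b, 4);
    }
  else if (n > 0)
    {
      dst[0] = src[0];
      dst[n / 2] = src[n / 2];         /* covers n == 1, 2 and 3       */
      dst[n - 1] = src[n - 1];
    }
}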

3. Check whether using SVE helps small/medium copies

Run the memcpy-random benchmark to see whether it is faster to use SVE for the small cases or just
the Advanced SIMD copy on your uarch.
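
The attraction of SVE here is that a predicated loop needs no separate tail code at all - a minimal
sketch, with a hypothetical function name:

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative fully predicated SVE copy: whilelt handles the tail, so sizes
   that are not a multiple of the vector length need no extra code.  */
static void
sve_copy (uint8_t *dst, const uint8_t *src, size_t n)
{
  for (uint64_t i = 0; i < n; i += svcntb ())
    {
      svbool_t pg = svwhilelt_b8_u64 (i, n);
      svst1_u8 (pg, dst + i, svld1_u8 (pg, src + i));
    }
}

Whether that actually beats an Advanced SIMD small-copy path on A64FX is exactly what
memcpy-random should tell you.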

4. Avoid making the code too general or too specialized

I see both appearing in the code: it tries to deal with different cacheline sizes and different vector
lengths, and it also splits these out into separate cases. If you depend on a particular cacheline size,
specialize the code for that size and check it in the ifunc selector (as various memsets already do).
If you want to handle multiple vector lengths, just use a register for the increment rather than
repeating the same code several times for each vector length.
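
As a rough sketch of the ifunc side (using the GCC ifunc attribute and a direct DCZID_EL0 read
here - glibc's real selectors go through the cpu-features data instead, and the function names
below are made up):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical variants, assumed to be defined elsewhere.  */
extern void *memset_zva64 (void *, int, size_t);    /* hard-codes a 64-byte DC ZVA block */
extern void *memset_generic (void *, int, size_t);  /* makes no DC ZVA assumptions       */

typedef void *(*memset_fn) (void *, int, size_t);

static memset_fn
select_memset (void)
{
  uint64_t dczid;
  __asm__ ("mrs %0, dczid_el0" : "=r" (dczid));
  /* Bit 4: DC ZVA prohibited; bits 3:0: log2 of the block size in 4-byte words.  */
  if (!(dczid & 16) && (4UL << (dczid & 15)) == 64)
    return memset_zva64;
  return memset_generic;
}

/* The resolver runs once at relocation time; the hot path never tests the block size.  */
void *memset_selected (void *, int, size_t)
  __attribute__ ((ifunc ("select_memset")));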

5. Odd prefetches

I have a hard time believing that first prefetching the data to be written, then clearing it using
DC ZVA (???), then prefetching the same data a 2nd time, before finally writing the loaded data,
helps performance... Generally hardware prefetchers do exactly the right thing here, since memcpy
is trivial to prefetch. So what is the performance gain of each prefetch/clear step? And what is the
difference between memcpy and memmove performance (given that memmove doesn't do any of this)?

Cheers,
Wilco

