From: "naohirot@fujitsu.com" <naohirot@fujitsu.com>
To: 'Wilco Dijkstra' <Wilco.Dijkstra@arm.com>
Cc: 'GNU C Library' <libc-alpha@sourceware.org>,
Szabolcs Nagy <Szabolcs.Nagy@arm.com>
Subject: RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Tue, 13 Apr 2021 12:07:37 +0000 [thread overview]
Message-ID: <TYAPR01MB6025FC72FC0F3C3B1D7D2A6ADF4F9@TYAPR01MB6025.jpnprd01.prod.outlook.com> (raw)
In-Reply-To: <VE1PR08MB559991EE24FFB21C1258B71783709@VE1PR08MB5599.eurprd08.prod.outlook.com>
Hi Wilco-san,
Thanks for the comments.
I have been continuously updating the first patch since I posted it on Mar. 17, 2021,
and have fixed some bugs.
Here is my local repository's commit history:
https://github.com/NaohiroTamura/glibc/commits/patch-20210317
Let me answer your comments referring to the latest source code above and the
benchtests graphs uploaded to Google Drive.
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
>
> 1. Overall the code is too large due to enormous unroll factors
>
> Our current memcpy is about 300 bytes (that includes memmove), this memcpy is
> ~12 times larger!
> This hurts performance due to the code not fitting in the I-cache for common
> copies.
OK, I'll try to remove unnecessary code that doesn't contribute to the performance
gain, based on the benchtests performance data.
> On a modern OoO core you need very little unrolling since ALU operations and
> branches become essentially free while the CPU executes loads and stores. So
> rather than unrolling by 32-64 times, try 4 times - you just need enough to hide the
> taken branch latency.
>
In terms of loop unrolling, I tested several cases in my local environment.
Here are the results.
The source code is based on the latest commit of the branch patch-20210317 in my GitHub repository.
[1] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S
[2] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S
Memcpy/memmove uses unroll factors of 8, 4, and 2, and memset uses 32, 8, 4, and 2.
This unroll configuration recorded the highest performance:
Memcpy 35 Gbps [3]
Memmove 70 Gbps [4]
Memset 70 Gbps [5]
[3] https://drive.google.com/file/d/1Xz04kV-S1E4tKOKLJRl8KgO8ZdCQqv1O/view
[4] https://drive.google.com/file/d/1QDmt7LMscXIJSpaq2sPOiCKl3nxcLxwk/view
[5] https://drive.google.com/file/d/1rpy7rkIskRs6czTARNIh4yCeh8d-L-cP/view
In the case that memcpy/memmove and memset all use an unroll factor of 4,
the performance degraded by 5 to 15 Gbps at the peak:
Memcpy 30 Gbps [6]
Memmove 65 Gbps [7]
Memset 45 Gbps [8]
[6] https://drive.google.com/file/d/1P-QJGeuHPlfj3ax8GlxRShV0_HVMJWGc/view
[7] https://drive.google.com/file/d/1R2IK5eWr8NEduNnvqkdPZyoNE0oImRcp/view
[8] https://drive.google.com/file/d/1WMZFjzF5WgmfpXSOnAd9YMjLqv1mcsEm/view
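To make the trade-off concrete, here is a minimal C sketch of a 4x-unrolled copy loop (the real code is SVE assembly [1]; `VEC`, `copy_unroll4`, and the use of fixed-size memcpy as a stand-in for vector load/store are all illustrative, not the actual implementation):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC 64  /* stand-in for the SVE vector length in bytes */

/* With 4 blocks moved per iteration, the loop branch is taken once per
   4 loads/stores -- usually enough to hide taken-branch latency on an
   OoO core without bloating the I-cache footprint the way a 32x-64x
   unroll does.  */
static void copy_unroll4(unsigned char *dst, const unsigned char *src,
                         size_t n)
{
    while (n >= 4 * VEC) {
        memcpy(dst,           src,           VEC);
        memcpy(dst + VEC,     src + VEC,     VEC);
        memcpy(dst + 2 * VEC, src + 2 * VEC, VEC);
        memcpy(dst + 3 * VEC, src + 3 * VEC, VEC);
        dst += 4 * VEC;
        src += 4 * VEC;
        n   -= 4 * VEC;
    }
    if (n)  /* the tail is handled by the small-copy path in the real code */
        memcpy(dst, src, n);
}
```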
> 2. I don't see any special handling for small copies
>
> Even if you want to hyper optimize gigabyte sized copies, small copies are still
> extremely common, so you always want to handle those as quickly (and with as
> little code) as possible. Special casing small copies does not slow down the huge
> copies - the reverse is more likely since you no longer need to handle small cases.
>
Yes, I implemented special handling for the case of 1 byte to 512 bytes [9][10].
The SVE code seems faster than ASIMD in the small/medium range too [11][12][13].
[9] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L176-L267
[10] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S#L68-L78
[11] https://drive.google.com/file/d/1VgkFTrWgjFMQ35btWjqHJbEGMgb3ZE-h/view
[12] https://drive.google.com/file/d/1SJ-WMUEEX73SioT9F7tVEIc4iRa8SfjU/view
[13] https://drive.google.com/file/d/1DPPgh2r6t16Ppe0Cpo5XzkVqWA_AVRUc/view
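The shape of that small-copy special casing can be sketched in C as follows (the real code [9] uses SVE whole-vector and predicated loads/stores; the size thresholds and the overlapping-copy trick shown here are an illustrative scalar analogue, not the patch itself):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Small sizes branch to short fixed-size copies before any of the
   large-copy setup runs, so the common case stays cheap.  Two
   overlapping fixed-size copies cover a whole size range without a
   byte loop.  */
static void *small_copy(void *d, const void *s, size_t n)
{
    unsigned char *dp = d;
    const unsigned char *sp = s;

    if (n <= 16) {
        if (n >= 8) {
            memcpy(dp, sp, 8);
            memcpy(dp + n - 8, sp + n - 8, 8);  /* overlap covers 8..16 */
        } else if (n >= 4) {
            memcpy(dp, sp, 4);
            memcpy(dp + n - 4, sp + n - 4, 4);  /* overlap covers 4..7 */
        } else {
            for (size_t i = 0; i < n; i++)      /* 0..3 bytes */
                dp[i] = sp[i];
        }
        return d;
    }
    /* 17..512 bytes: a few full vectors plus one overlapping tail in
       the real code; plain memcpy stands in for that vector path.  */
    memcpy(dp, sp, n);
    return d;
}
```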
> 3. Check whether using SVE helps small/medium copies
>
> Run memcpy-random benchmark to see whether it is faster to use SVE for small
> cases or just the SIMD copy on your uarch.
>
Thanks for the memcpy-random benchmark info.
For small/medium copies, I needed to remove the BTI macro from ASM ENTRY in order
to see a distinct performance difference between ASIMD and SVE.
I'll post that patch [14] together with the second A64FX patch.
Also, somehow on A64FX as well as on a ThunderX2 machine, memcpy-random
didn't start due to an mprotect error, so I needed to fix memcpy-random [15].
If this fix is not wrong, I'll post the patch [15] with the second A64FX patch.
[14] https://github.com/NaohiroTamura/glibc/commit/07ea389846c7c63622b6c0b3aaead3f93e21f356
[15] https://github.com/NaohiroTamura/glibc/commit/ec0b55a855529f75bd6f280e59dc2b1c25640490
> 4. Avoid making the code too general or too specialistic
>
> I see both appearing in the code - trying to deal with different cacheline sizes and
> different vector lengths, and also splitting these out into separate cases. If you
> depend on a particular cacheline size, specialize the code for that and check the
> size in the ifunc selector (as various memsets do already). If you want to handle
> multiple vector sizes, just use a register for the increment rather than repeating
> the same code several times for each vector length.
>
In terms of the cache line size, it is not configurable on A64FX; it is fixed to 256 bytes.
I've already removed the code that queried it [16][17].
[16] https://github.com/NaohiroTamura/glibc/commit/4bcc6d83c970f7a7283abfec753ecf6b697cf6f7
[17] https://github.com/NaohiroTamura/glibc/commit/f2b2c1ca03b50d414e03411ed65e4b131615e865
In terms of the vector length, I'll remove the code specialized for the 256-bit and
128-bit VL cases, because the vector-length-agnostic code can cover both.
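The vector-length-agnostic approach is essentially "read the VL once at run time and use it as the loop increment" instead of duplicating the loop body per VL. A hedged C analogue (`get_vl` stands in for the SVE CNTB instruction, and the value 64 is just the A64FX 512-bit case; memcpy stands in for ld1b/st1b):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static size_t get_vl(void) { return 64; }  /* 512-bit VL on A64FX, illustrative */

/* One code path works for any vector length: the increment is a
   run-time value rather than a compile-time constant, so there is no
   need for separate VL128/VL256/VL512 copies of the loop.  */
static void copy_vl_agnostic(unsigned char *dst, const unsigned char *src,
                             size_t n)
{
    size_t vl = get_vl();
    size_t i = 0;
    for (; i + vl <= n; i += vl)
        memcpy(dst + i, src + i, vl);      /* whole-vector ld1b/st1b */
    if (i < n)
        memcpy(dst + i, src + i, n - i);   /* predicated tail in SVE */
}
```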
> 5. Odd prefetches
>
> I have a hard time believing first prefetching the data to be written, then clearing it
> using DC ZVA (???), then prefetching the same data a 2nd time, before finally
> write the loaded data is helping performance...
> Generally hardware prefetchers are able to do exactly the right thing since
> memcpy is trivial to prefetch.
> So what is the performance gain of each prefetch/clear step? What is the
> difference between memcpy and memmove performance (given memmove
> doesn't do any of this)?
Sorry, the memcpy prefetch code was not right; I noticed this bug and fixed it
soon after posting the first patch [18].
Basically, "prfm pstl1keep, [dest_ptr, tmp1]" should be "prfm pldl2keep, [src_ptr, tmp1]".
[18] https://github.com/NaohiroTamura/glibc/commit/f5bf15708830f91fb886b15928158db2e875ac88
Without DC ZVA and the L2 prefetch, memcpy and memset performance degraded beyond 4MB.
Please compare [19] with [22], and [21] with [24] for memset.
Without DC ZVA and the L2 prefetch, memmove didn't degrade beyond 4MB.
Please compare [20] with [23].
The reason I didn't implement DC ZVA and the L2 prefetch in memmove is that memmove
calls memcpy in most cases, and the memmove code only handles backward copies.
Maybe most of the memmove-large benchtest cases are backward copies; I need to check.
DC ZVA and the L2 prefetch have to be used as a pair; DC ZVA alone or the L2 prefetch
alone doesn't get any improvement.
With DC_VZA and L2 prefetch:
[19] https://drive.google.com/file/d/1mmYaLwzEoytBJZ913jaWmucL0j564Ta7/view
[20] https://drive.google.com/file/d/1Bc_DVGBcDRpvDjxCB_2yOk3MOy5BEiOs/view
[21] https://drive.google.com/file/d/19cHvU2lxF28DW9_Z5_5O6gOOdUmVz_ps/view
Without DC_VZA and L2 prefetch:
[22] https://drive.google.com/file/d/1My6idNuQsrsPVODl0VrqiRbMR9yKGsGS/view
[23] https://drive.google.com/file/d/1q8KhvIqDf27fJ8HGWgjX0nBhgPgGBg_T/view
[24] https://drive.google.com/file/d/1l6pDhuPWDLy5egQ6BhRIYRvshvDeIrGl/view
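The fixed prefetch direction, in a hedged C analogue: prefetch the *source* for reading ahead of the loads, not the destination for writing. `__builtin_prefetch` stands in for "prfm pldl2keep", and the distance and block size are illustrative, not the tuned A64FX values:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DIST  1024  /* prefetch distance, illustrative */
#define BLOCK 64    /* bytes moved per iteration, illustrative */

static void copy_with_prefetch(unsigned char *dst,
                               const unsigned char *src, size_t n)
{
    size_t i = 0;
    for (; i + BLOCK <= n; i += BLOCK) {
        /* Read prefetch (second arg 0), low temporal locality (third
           arg 1) -- the analogue of pldl2keep on the source stream.
           The old buggy code prefetched dst for store instead.  */
        __builtin_prefetch(src + i + DIST, 0, 1);
        memcpy(dst + i, src + i, BLOCK);
    }
    if (i < n)
        memcpy(dst + i, src + i, n - i);
}
```

Prefetching past the end of `src` is harmless here: `__builtin_prefetch` (like prfm) never faults.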
Thanks.
Naohiro