From: "naohirot@fujitsu.com" <naohirot@fujitsu.com>
To: 'Wilco Dijkstra' <Wilco.Dijkstra@arm.com>
Cc: 'GNU C Library' <libc-alpha@sourceware.org>,
Szabolcs Nagy <Szabolcs.Nagy@arm.com>
Subject: RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Tue, 13 Apr 2021 12:07:37 +0000 [thread overview]
Message-ID: <TYAPR01MB6025FC72FC0F3C3B1D7D2A6ADF4F9@TYAPR01MB6025.jpnprd01.prod.outlook.com> (raw)
In-Reply-To: <VE1PR08MB559991EE24FFB21C1258B71783709@VE1PR08MB5599.eurprd08.prod.outlook.com>
Hi Wilco-san,
Thanks for the comments.
I have been continuously updating the first patch since I posted it on Mar. 17, 2021,
and have fixed some bugs.
Here is my local repository's commit history:
https://github.com/NaohiroTamura/glibc/commits/patch-20210317
Let me answer your comments referring to the latest source code above and the
benchtests graphs uploaded to Google Drive.
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
>
> 1. Overall the code is too large due to enormous unroll factors
>
> Our current memcpy is about 300 bytes (that includes memmove), this memcpy is
> ~12 times larger!
> This hurts performance due to the code not fitting in the I-cache for common
> copies.
OK, I'll try to remove unnecessary code that doesn't contribute to the performance
gain, based on the benchtests performance data.
> On a modern OoO core you need very little unrolling since ALU operations and
> branches become essentially free while the CPU executes loads and stores. So
> rather than unrolling by 32-64 times, try 4 times - you just need enough to hide the
> taken branch latency.
>
In terms of loop unrolling, I tested several cases in my local environment.
Here are the results.
The source code is based on the latest commit of the branch patch-20210317 in my GitHub repository.
[1] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S
[2] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S
Memcpy/memmove uses unroll factors of 8, 4, and 2, and memset uses 32, 8, 4, and 2.
This unroll configuration recorded the highest performance:
Memcpy 35 Gbps [3]
Memmove 70 Gbps [4]
Memset 70 Gbps [5]
[3] https://drive.google.com/file/d/1Xz04kV-S1E4tKOKLJRl8KgO8ZdCQqv1O/view
[4] https://drive.google.com/file/d/1QDmt7LMscXIJSpaq2sPOiCKl3nxcLxwk/view
[5] https://drive.google.com/file/d/1rpy7rkIskRs6czTARNIh4yCeh8d-L-cP/view
In the case that memcpy/memmove and memset all use an unroll factor of 4,
the performance degraded by 5 to 15 Gbps at the peak:
Memcpy 30 Gbps [6]
Memmove 65 Gbps [7]
Memset 45 Gbps [8]
[6] https://drive.google.com/file/d/1P-QJGeuHPlfj3ax8GlxRShV0_HVMJWGc/view
[7] https://drive.google.com/file/d/1R2IK5eWr8NEduNnvqkdPZyoNE0oImRcp/view
[8] https://drive.google.com/file/d/1WMZFjzF5WgmfpXSOnAd9YMjLqv1mcsEm/view
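To make the trade-off concrete, here is a minimal C sketch of a 4x-unrolled copy loop (the real code is SVE assembly [1]; `VEC`, `copy_unroll4`, and the use of fixed-size memcpy as a stand-in for vector load/store are all illustrative, not the actual implementation):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VEC 64  /* stand-in for the SVE vector length in bytes */

/* With 4 blocks moved per iteration, the loop branch is taken once per
   4 loads/stores -- usually enough to hide taken-branch latency on an
   OoO core without bloating the I-cache footprint the way a 32x-64x
   unroll does.  */
static void copy_unroll4(unsigned char *dst, const unsigned char *src,
                         size_t n)
{
    while (n >= 4 * VEC) {
        memcpy(dst,           src,           VEC);
        memcpy(dst + VEC,     src + VEC,     VEC);
        memcpy(dst + 2 * VEC, src + 2 * VEC, VEC);
        memcpy(dst + 3 * VEC, src + 3 * VEC, VEC);
        dst += 4 * VEC;
        src += 4 * VEC;
        n   -= 4 * VEC;
    }
    if (n)  /* the tail is handled by the small-copy path in the real code */
        memcpy(dst, src, n);
}
```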
> 2. I don't see any special handling for small copies
>
> Even if you want to hyper optimize gigabyte sized copies, small copies are still
> extremely common, so you always want to handle those as quickly (and with as
> little code) as possible. Special casing small copies does not slow down the huge
> copies - the reverse is more likely since you no longer need to handle small cases.
>
Yes, I implemented special handling for the case of 1 byte to 512 bytes [9][10].
The SVE code seems faster than ASIMD in the small/medium range too [11][12][13].
[9] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L176-L267
[10] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S#L68-L78
[11] https://drive.google.com/file/d/1VgkFTrWgjFMQ35btWjqHJbEGMgb3ZE-h/view
[12] https://drive.google.com/file/d/1SJ-WMUEEX73SioT9F7tVEIc4iRa8SfjU/view
[13] https://drive.google.com/file/d/1DPPgh2r6t16Ppe0Cpo5XzkVqWA_AVRUc/view
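The shape of that small-copy special casing can be sketched in C as follows (the real code [9] uses SVE whole-vector and predicated loads/stores; the size thresholds and the overlapping-copy trick shown here are an illustrative scalar analogue, not the patch itself):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Small sizes branch to short fixed-size copies before any of the
   large-copy setup runs, so the common case stays cheap.  Two
   overlapping fixed-size copies cover a whole size range without a
   byte loop.  */
static void *small_copy(void *d, const void *s, size_t n)
{
    unsigned char *dp = d;
    const unsigned char *sp = s;

    if (n <= 16) {
        if (n >= 8) {
            memcpy(dp, sp, 8);
            memcpy(dp + n - 8, sp + n - 8, 8);  /* overlap covers 8..16 */
        } else if (n >= 4) {
            memcpy(dp, sp, 4);
            memcpy(dp + n - 4, sp + n - 4, 4);  /* overlap covers 4..7 */
        } else {
            for (size_t i = 0; i < n; i++)      /* 0..3 bytes */
                dp[i] = sp[i];
        }
        return d;
    }
    /* 17..512 bytes: a few full vectors plus one overlapping tail in
       the real code; plain memcpy stands in for that vector path.  */
    memcpy(dp, sp, n);
    return d;
}
```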
> 3. Check whether using SVE helps small/medium copies
>
> Run memcpy-random benchmark to see whether it is faster to use SVE for small
> cases or just the SIMD copy on your uarch.
>
Thanks for the memcpy-random benchmark info.
For small/medium copies, I needed to remove the BTI macro from ASM ENTRY in order
to see a distinct performance difference between ASIMD and SVE.
I'll post that patch [14] together with the second A64FX patch.
Also, somehow on A64FX as well as on a ThunderX2 machine, memcpy-random
didn't start due to an mprotect error, so I needed to fix memcpy-random [15].
If this fix is not wrong, I'll post the patch [15] with the second A64FX patch.
[14] https://github.com/NaohiroTamura/glibc/commit/07ea389846c7c63622b6c0b3aaead3f93e21f356
[15] https://github.com/NaohiroTamura/glibc/commit/ec0b55a855529f75bd6f280e59dc2b1c25640490
> 4. Avoid making the code too general or too specialistic
>
> I see both appearing in the code - trying to deal with different cacheline sizes and
> different vector lengths, and also splitting these out into separate cases. If you
> depend on a particular cacheline size, specialize the code for that and check the
> size in the ifunc selector (as various memsets do already). If you want to handle
> multiple vector sizes, just use a register for the increment rather than repeating
> the same code several times for each vector length.
>
In terms of the cache line size, it is not configurable on A64FX; it is fixed to 256 bytes.
I've already removed the code that queried it [16][17].
[16] https://github.com/NaohiroTamura/glibc/commit/4bcc6d83c970f7a7283abfec753ecf6b697cf6f7
[17] https://github.com/NaohiroTamura/glibc/commit/f2b2c1ca03b50d414e03411ed65e4b131615e865
In terms of the vector length, I'll remove the code specialized for the 256-bit and
128-bit VL cases, because the vector-length-agnostic code can cover both.
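The vector-length-agnostic approach is essentially "read the VL once at run time and use it as the loop increment" instead of duplicating the loop body per VL. A hedged C analogue (`get_vl` stands in for the SVE CNTB instruction, and the value 64 is just the A64FX 512-bit case; memcpy stands in for ld1b/st1b):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static size_t get_vl(void) { return 64; }  /* 512-bit VL on A64FX, illustrative */

/* One code path works for any vector length: the increment is a
   run-time value rather than a compile-time constant, so there is no
   need for separate VL128/VL256/VL512 copies of the loop.  */
static void copy_vl_agnostic(unsigned char *dst, const unsigned char *src,
                             size_t n)
{
    size_t vl = get_vl();
    size_t i = 0;
    for (; i + vl <= n; i += vl)
        memcpy(dst + i, src + i, vl);      /* whole-vector ld1b/st1b */
    if (i < n)
        memcpy(dst + i, src + i, n - i);   /* predicated tail in SVE */
}
```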
> 5. Odd prefetches
>
> I have a hard time believing first prefetching the data to be written, then clearing it
> using DC ZVA (???), then prefetching the same data a 2nd time, before finally
> write the loaded data is helping performance...
> Generally hardware prefetchers are able to do exactly the right thing since
> memcpy is trivial to prefetch.
> So what is the performance gain of each prefetch/clear step? What is the
> difference between memcpy and memmove performance (given memmove
> doesn't do any of this)?
Sorry, the memcpy prefetch code was not right; I noticed this bug and fixed it
soon after posting the first patch [18].
Basically, "prfm pstl1keep, [dest_ptr, tmp1]" should be "prfm pldl2keep, [src_ptr, tmp1]".
[18] https://github.com/NaohiroTamura/glibc/commit/f5bf15708830f91fb886b15928158db2e875ac88
Without DC ZVA and the L2 prefetch, memcpy and memset performance degraded beyond 4MB.
Please compare [19] with [22], and [21] with [24] for memset.
Without DC ZVA and the L2 prefetch, memmove didn't degrade beyond 4MB.
Please compare [20] with [23].
The reason I didn't implement DC ZVA and the L2 prefetch in memmove is that memmove
calls memcpy in most cases, and the memmove code only handles backward copies.
Maybe most of the memmove-large benchtest cases are backward copies; I need to check.
DC ZVA and the L2 prefetch have to be used as a pair; DC ZVA alone or the L2 prefetch
alone doesn't get any improvement.
With DC_VZA and L2 prefetch:
[19] https://drive.google.com/file/d/1mmYaLwzEoytBJZ913jaWmucL0j564Ta7/view
[20] https://drive.google.com/file/d/1Bc_DVGBcDRpvDjxCB_2yOk3MOy5BEiOs/view
[21] https://drive.google.com/file/d/19cHvU2lxF28DW9_Z5_5O6gOOdUmVz_ps/view
Without DC_VZA and L2 prefetch:
[22] https://drive.google.com/file/d/1My6idNuQsrsPVODl0VrqiRbMR9yKGsGS/view
[23] https://drive.google.com/file/d/1q8KhvIqDf27fJ8HGWgjX0nBhgPgGBg_T/view
[24] https://drive.google.com/file/d/1l6pDhuPWDLy5egQ6BhRIYRvshvDeIrGl/view
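The fixed prefetch direction, in a hedged C analogue: prefetch the *source* for reading ahead of the loads, not the destination for writing. `__builtin_prefetch` stands in for "prfm pldl2keep", and the distance and block size are illustrative, not the tuned A64FX values:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DIST  1024  /* prefetch distance, illustrative */
#define BLOCK 64    /* bytes moved per iteration, illustrative */

static void copy_with_prefetch(unsigned char *dst,
                               const unsigned char *src, size_t n)
{
    size_t i = 0;
    for (; i + BLOCK <= n; i += BLOCK) {
        /* Read prefetch (second arg 0), low temporal locality (third
           arg 1) -- the analogue of pldl2keep on the source stream.
           The old buggy code prefetched dst for store instead.  */
        __builtin_prefetch(src + i + DIST, 0, 1);
        memcpy(dst + i, src + i, BLOCK);
    }
    if (i < n)
        memcpy(dst + i, src + i, n - i);
}
```

Prefetching past the end of `src` is harmless here: `__builtin_prefetch` (like prfm) never faults.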
Thanks.
Naohiro